§01 What an Embedding Actually Is
An embedding is a dense numerical vector — an array of floating-point numbers — that represents the meaning of a piece of content. Text, images, audio, code: all of it can be embedded. The critical property is that semantically similar content produces geometrically close vectors. "Motivated seller wants fast close" and "owner needs to sell quickly" — different words, similar meaning — will cluster near each other in vector space. "The cat sat on the mat" will be nowhere nearby.
This is the capability that makes semantic search possible. Traditional keyword search matches exact terms. Embedding-based search matches meaning. The query "what did the buyer say about the inspection?" finds the relevant conversation even if the word "inspection" never appears in the stored text, because the embedding model has learned that "inspection," "walkthrough," "property condition," and "disclosed defect" share semantic space.
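The behavior is easy to see directly. A minimal sketch, assuming the open-source sentence-transformers library and all-MiniLM-L6-v2 (a small 384-dimension model chosen only for illustration; any embedding model behaves the same way):

```python
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimension example model

sentences = [
    "Motivated seller wants fast close",
    "owner needs to sell quickly",
    "The cat sat on the mat",
]
vectors = model.encode(sentences)  # shape: (3, 384)

# Semantically similar sentences land close together; unrelated ones do not.
print(util.cos_sim(vectors[0], vectors[1]))  # high similarity: same meaning, different words
print(util.cos_sim(vectors[0], vectors[2]))  # low similarity: unrelated content
```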
§02 The Dimensionality Decision
Every embedding has a dimensionality — the length of the vector. A 384-dimension embedding represents content as 384 numbers. A 3,072-dimension embedding uses 3,072. Higher dimensions can carry more semantic nuance. But the relationship between dimensions and quality is not linear — and the infrastructure cost definitely is.
**~384 dimensions.** E5-small, BGE-small. ~118M params. CPU-viable. Best for real-time, high-volume retrieval where the latency budget is tight. Often outperforms models with 10× more parameters on production tasks; E5-small achieved best-in-tier results despite being the smallest model tested.
**1024 dimensions.** Voyage-3-large, BGE-large, Cohere Embed v3. The production sweet spot: captures rich semantic structure without the storage overhead of 3K-dim vectors. Voyage-3-large outperforms OpenAI's 3072-dim model by 9.74% while requiring 3× less storage. 1024 is the right default for most RAG systems.
**3072+ dimensions.** OpenAI text-embedding-3-large, Google Gemini Embedding. Highest semantic fidelity on benchmark tasks. Also the highest storage cost: 1B vectors at 3072 dims require ~12TB. Most production systems don't need this. Use it when you're running enterprise-scale semantic search and the additional fidelity demonstrably affects retrieval quality on your specific corpus.
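The storage figures are straightforward arithmetic: float32 vectors cost 4 bytes per dimension, so raw storage is count × dims × 4, before index overhead. A quick sanity check:

```python
def vector_store_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (384, 1024, 3072):
    tb = vector_store_bytes(1_000_000_000, dims) / 1e12
    print(f"1B vectors at {dims:>4} dims = {tb:.1f} TB of raw float32 storage")
# 3072 dims comes to roughly 12 TB, matching the figure above; 1024 dims is about a third of that.
```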
§03 The Current Landscape
| Model | Dims | Context | Cost / 1M tokens | Relative Accuracy | Best For |
|---|---|---|---|---|---|
| Voyage-3-large | 1024 | 32K | $0.06 | +9.74% over OpenAI | Production RAG, long documents, best accuracy/cost ratio |
| OpenAI text-emb-3-large | 3072 | 8K | $0.13 | Baseline benchmark | Battle-tested, managed infra, most integrations |
| E5-small (open source) | 384 | 512 | Self-hosted | Best-in-tier Top-5 accuracy | High-throughput, CPU-viable, cost-sensitive |
| NVIDIA NV-Embed-Nemotron | 4096 | 8K | Self-hosted | 62% Top-1 accuracy (highest tested) | Enterprise RAG, GPU infra available, max precision |
| Cohere Embed v3 | 1024 | 512 | $0.10 | Solid, multilingual | Multilingual deployments, Cohere ecosystem |
| pgvector + any model | Any | Any | Storage only | — | Already running Postgres (Supabase)? Start here. |
§04 Infrastructure Optimization
At scale, embedding infrastructure is not an afterthought — it's a cost center. A billion-item corpus on a single GPU takes over five days to embed. At $0.13/M tokens, 100M monthly queries (at roughly 1,000 tokens each) comes to $13,000/month in API costs before you factor in retrieval. The optimizations that matter in production:
**Binary quantization** (32× compression · ~5% quality loss). Reduce 32-bit float vectors to 1-bit binary representations: 32× compression with only ~5% quality loss. A 4TB vector store becomes 128GB. Trade a small amount of precision for an enormous infrastructure cost reduction. For most production RAG systems, the 5% quality loss is invisible to end users.
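A minimal sketch of the idea with NumPy, keeping only the sign bit of each dimension and comparing with Hamming distance; in practice you would use the quantization built into your vector database rather than hand-rolling it:

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Quantize float32 vectors to 1 bit per dimension (sign), packed into uint8."""
    bits = (vectors > 0).astype(np.uint8)   # 1 where positive, else 0
    return np.packbits(bits, axis=-1)       # 32x smaller than float32

def hamming_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Bit-level distance between packed binary vectors (lower = more similar)."""
    return np.unpackbits(np.bitwise_xor(a, b), axis=-1).sum(axis=-1)

corpus = np.random.randn(10_000, 1024).astype(np.float32)  # placeholder embeddings
query = np.random.randn(1024).astype(np.float32)

packed_corpus = binarize(corpus)            # 128 bytes per vector instead of 4 KB
packed_query = binarize(query[None, :])

# First-pass retrieval on the compressed vectors; rescore the top hits with full floats if needed.
top10 = np.argsort(hamming_distance(packed_corpus, packed_query))[:10]
```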
**Semantic caching** (80–95% compute savings · sub-ms cache retrieval). Cache embedding results and check semantic similarity before re-embedding. If an incoming query is >0.85 cosine similarity to a cached query, return the cached result. Production systems achieve 80–95% cache hit rates on stable domains — which means 80–95% of your embedding compute budget disappears. The threshold (0.85–0.95) controls the precision/savings tradeoff.
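A minimal sketch of a semantic cache, assuming an embed() function (not shown) that returns L2-normalized vectors; the 0.85 threshold is the knob described above:

```python
import numpy as np

class SemanticCache:
    """Return a cached result when a new query is close enough to a previous one."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.keys: list[np.ndarray] = []   # normalized query embeddings
        self.values: list[object] = []     # whatever the expensive pipeline produced

    def get(self, query_vec: np.ndarray):
        if not self.keys:
            return None
        sims = np.stack(self.keys) @ query_vec   # cosine similarity (vectors pre-normalized)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query_vec: np.ndarray, value) -> None:
        self.keys.append(query_vec)
        self.values.append(value)

# Usage sketch (embed and run_full_pipeline are placeholders for your own code):
# cache = SemanticCache(threshold=0.85)
# vec = embed("what did the buyer say about the inspection?")
# result = cache.get(vec)
# if result is None:
#     result = run_full_pipeline(...)
#     cache.put(vec, result)
```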
**Matryoshka embeddings** (flexible dimension reduction · no retraining required). Embeddings trained with the Matryoshka Representation Learning objective can be truncated at inference time without retraining. A 1536-dimension embedding can be shortened to 256 dimensions and still be useful — just less precise. Use full dimensions for high-stakes retrieval, truncated for fast first-pass filtering. OpenAI's text-embedding-3 family supports this natively.
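A sketch of client-side truncation, assuming the model was trained with the Matryoshka objective (truncating an ordinary embedding this way degrades it badly); with OpenAI's text-embedding-3 models the same effect is available server-side via the dimensions parameter:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` values of a Matryoshka-trained embedding and re-normalize."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1536).astype(np.float32)   # placeholder for a 1536-dim embedding
full /= np.linalg.norm(full)

fast_filter_vec = truncate_embedding(full, 256)   # cheap first-pass filtering
precise_vec = full                                # full precision for final scoring
```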
**Length-sorted batching** (20–40% GPU compute savings · zero quality cost). Sort your embedding batch by token length before processing. This minimizes padding — short documents padded to match long ones waste GPU cycles. Sorting by length so that each batch contains similarly-sized inputs delivers 20–40% compute savings on embedding generation jobs with no quality impact. A near-free optimization that most teams never implement.
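A minimal sketch of length-sorted batching; the tokenizer and model here are placeholders for whatever you embed with, and the only point is that each batch contains similarly sized inputs so padding is minimal:

```python
def length_sorted_batches(docs: list[str], tokenizer, batch_size: int = 64):
    """Yield batches of documents grouped by similar token length to minimize padding."""
    # Sort once by token count, then slice into batches of near-uniform length.
    docs_sorted = sorted(docs, key=lambda d: len(tokenizer.encode(d)))
    for i in range(0, len(docs_sorted), batch_size):
        yield docs_sorted[i:i + batch_size]

# Usage sketch (tokenizer and model are whatever your embedding stack provides):
# for batch in length_sorted_batches(corpus, tokenizer):
#     vectors = model.encode(batch)   # short docs are no longer padded to match long ones
```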
**Cross-encoder reranking** (10–30% precision improvement · +50–100ms latency). Vector similarity search retrieves a first-pass candidate set. A cross-encoder reranking model then scores each candidate against the original query — more expensive but far more precise. Reranking adds 50–100ms latency but delivers 10–30% precision improvement over vector-only retrieval. For high-stakes retrieval (legal, medical, financial), this is non-negotiable.
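A minimal sketch using a small open-source cross-encoder from sentence-transformers; the candidate list is hard-coded here but would normally come from your first-pass vector search:

```python
# Assumes: pip install sentence-transformers
from sentence_transformers import CrossEncoder

# A small open-source reranker; swap in whatever cross-encoder fits your latency budget.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what did the buyer say about the inspection?"
candidates = [  # in practice: the top-k results from vector similarity search
    "Buyer flagged the roof during the walkthrough.",
    "The cat sat on the mat.",
    "Seller disclosed a defect in the foundation report.",
]

# Score every (query, candidate) pair with the cross-encoder, then keep the best few.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])
```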
Every RAG response, every memory retrieval, every semantic search result flows through the embedding layer first. The quality of the embedding model determines the ceiling of what the generative model can say. You can upgrade the LLM. If the retrieval substrate is wrong, nothing above it improves.