Embedding Models · Hidden Infrastructure Layer

The invisible substrate that everything else runs on — and why nobody talks about it.

RAG, agent memory, semantic search, recommendations — they all depend on one thing: the quality of your embeddings. Embeddings are the wiring inside the wall. You don't see them. You notice them when they fail. And most teams spend almost no time thinking about them until something goes wrong.

4TB
Storage for 1B vectors at 1024 dimensions before indexing overhead
5.8d
Time to embed 1B documents on a single L4 GPU at ~2,000 docs/sec
32×
Compression from binary quantization with only ~5% quality loss
80%
Compute savings from semantic caching at 0.85+ similarity threshold
// Vector Space Diagram · Semantic Proximity · Text → Embedding → Retrieval

§01 What an Embedding Actually Is

An embedding is a dense numerical vector — an array of floating-point numbers — that represents the meaning of a piece of content. Text, images, audio, code: all of it can be embedded. The critical property is that semantically similar content produces geometrically close vectors. "Motivated seller wants fast close" and "owner needs to sell quickly" — different words, similar meaning — will cluster near each other in vector space. "The cat sat on the mat" will be nowhere nearby.

This is the capability that makes semantic search possible. Traditional keyword search matches exact terms. Embedding-based search matches meaning. The query "what did the buyer say about the inspection?" finds the relevant conversation even if the word "inspection" never appears in the stored text, because the embedding model has learned that "inspection," "walkthrough," "property condition," and "disclosed defect" share semantic space.
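To make the geometry concrete, here is a minimal sketch of the closeness measure itself, cosine similarity, using toy 4-dimensional vectors in place of real model outputs (a production embedding is 384 to 3,072 numbers returned by the model; the values below are invented for illustration):

```typescript
// Cosine similarity: the standard closeness measure in embedding space.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy vectors: the two "seller" sentences point in a similar
// direction; the "cat" sentence does not.
const motivatedSeller = [0.82, 0.11, 0.05, 0.31]; // "Motivated seller wants fast close"
const ownerNeedsSell  = [0.79, 0.15, 0.02, 0.35]; // "owner needs to sell quickly"
const catOnMat        = [0.03, 0.91, 0.40, 0.02]; // "The cat sat on the mat"

console.log(cosineSimilarity(motivatedSeller, ownerNeedsSell).toFixed(2)); // ~0.99, near neighbors
console.log(cosineSimilarity(motivatedSeller, catOnMat).toFixed(2));       // ~0.17, far apart
```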

Why this matters for agents: Every memory architecture, every RAG pipeline, every semantic retrieval system that powers modern AI agents sits on top of an embedding model. The model's quality determines whether the agent retrieves the right context. Swap in a better embedding model and your agent gets smarter without changing a line of agent code. Embeddings are leverage.

§02 The Dimensionality Decision

Every embedding has a dimensionality — the length of the vector. A 384-dimension embedding represents content as 384 numbers. A 3,072-dimension embedding uses 3,072. Higher dimensions can carry more semantic nuance. But the relationship between dimensions and quality is not linear — and the infrastructure cost definitely is.

384
dims
Compact / High-Throughput

E5-small, BGE-small. ~118M params. CPU-viable. Best for real-time, high-volume retrieval where latency budget is tight. Often outperforms models with 10× more parameters on production tasks. The e5-small achieved best-in-tier results despite being the smallest model tested.

1024
dims
Balanced / Production Standard

Voyage-3-large, BGE-large, Cohere Embed v3. The production sweet spot: captures rich semantic structure without the storage overhead of 3K-dim vectors. Voyage-3-large outperforms OpenAI's 3072-dim model by 9.74% while requiring a third of the storage. 1024 is the right default for most RAG systems.

3072
dims
Maximum Fidelity / High-Cost

OpenAI text-embedding-3-large, Google Gemini Embedding. Highest semantic fidelity on benchmark tasks. Also the highest storage cost: 1B vectors at 3072 dims require ~12TB. Most production systems don't need this. Use it when you're running enterprise-scale semantic search where the additional fidelity demonstrably affects retrieval quality on your specific corpus.
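The storage figures above fall straight out of dimensions × bytes × vector count. A quick sketch (float32 vectors, before index overhead or any quantization):

```typescript
// Raw vector storage: dimensions × 4 bytes (float32) × vector count.
// HNSW / IVF index structures add overhead on top of this.
function rawStorageTB(dims: number, vectorCount: number): number {
  return (dims * 4 * vectorCount) / 1e12; // decimal terabytes
}

const oneBillion = 1_000_000_000;
console.log(rawStorageTB(384, oneBillion).toFixed(2));  // ~1.54 TB
console.log(rawStorageTB(1024, oneBillion).toFixed(2)); // ~4.10 TB (the 4TB figure above)
console.log(rawStorageTB(3072, oneBillion).toFixed(2)); // ~12.29 TB (the ~12TB figure above)
```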

§03 The Current Landscape

// Embedding Model Comparison · Production Characteristics · December 2025
Model · Dims · Context · Cost / 1M tokens · Relative Accuracy · Best For
Voyage-3-large · 1024 · 32K · $0.06 · +9.74% over OpenAI · Production RAG, long documents, best accuracy/cost ratio
OpenAI text-emb-3-large · 3072 · 8K · $0.13 · Baseline benchmark · Battle-tested, managed infra, most integrations
E5-small (open source) · 384 · 512 · Self-hosted · Best-in-tier Top-5 accuracy · High-throughput, CPU-viable, cost-sensitive
NVIDIA NV-Embed-Nemotron · 4096 · 8K · Self-hosted · 62% Top-1 accuracy (highest tested) · Enterprise RAG, GPU infra available, max precision
Cohere Embed v3 · 1024 · 512 · $0.10 · Solid, multilingual · Multilingual deployments, Cohere ecosystem
pgvector + any model · Any · Any · Storage only · Already running Postgres (Supabase)? Start here.
The non-obvious finding: The e5-small model — 118M parameters, 384 dimensions — achieved best-in-tier Top-5 retrieval accuracy in independent benchmarking, outperforming models with 70× more parameters. Bigger is not better in embeddings. What matters is whether the model's training distribution matches your retrieval task. Always benchmark on your actual data.
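"Benchmark on your actual data" does not require a formal eval framework: a handful of hand-labeled query-to-relevant-document pairs and a Recall@k loop is enough to compare candidate models. A minimal sketch, where `embed` is whatever function wraps the model under test; the corpus and labels are your own data, and every name here is a placeholder shape rather than a specific library's API:

```typescript
// Minimal Recall@k harness: for each labeled query, check whether the
// known-relevant document lands in the top-k nearest neighbors.
type EmbedFn = (text: string) => Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function recallAtK(
  embed: EmbedFn,                                    // wraps the model under test
  corpus: { id: string; text: string }[],            // your own documents
  labeled: { query: string; relevantId: string }[],  // hand-labeled query → doc pairs
  k = 5,
): Promise<number> {
  const docVectors = await Promise.all(
    corpus.map(async (d) => ({ id: d.id, vec: await embed(d.text) })),
  );
  let hits = 0;
  for (const { query, relevantId } of labeled) {
    const qVec = await embed(query);
    const topK = docVectors
      .map((d) => ({ id: d.id, score: cosine(qVec, d.vec) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
    if (topK.some((d) => d.id === relevantId)) hits++;
  }
  return hits / labeled.length; // fraction of queries with the right doc in the top k
}
```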

§04 Infrastructure Optimization

At scale, embedding infrastructure is not an afterthought — it's a cost center. A billion-item corpus on a single GPU takes over five days to embed. At $0.13/M tokens, 100M monthly queries come to roughly $13,000/month (assuming ~1,000 tokens embedded per query) before you factor in retrieval. The optimizations that matter in production:

BQ
Binary Quantization
Compression · Storage reduction

Reduce 32-bit float vectors to 1-bit binary representations. 32× compression with only ~5% quality loss. A 4TB vector store becomes 128GB. Trade a small amount of precision for an enormous infrastructure cost reduction. For most production RAG systems, the 5% quality loss is invisible to end users.

32× compression · ~5% quality loss
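A minimal sketch of where the 32× comes from: each float32 becomes a single bit (sign-thresholded here), vectors pack into bit arrays, and search runs on Hamming distance. Production vector stores implement this internally; the sketch below only shows the core transform.

```typescript
// Binary quantization: each float32 (32 bits) becomes 1 bit → 32× compression.
// Simplest scheme: sign thresholding (positive → 1, otherwise → 0).
function binaryQuantize(vec: number[]): Uint8Array {
  const packed = new Uint8Array(Math.ceil(vec.length / 8));
  for (let i = 0; i < vec.length; i++) {
    if (vec[i] > 0) packed[i >> 3] |= 1 << (i & 7);
  }
  return packed;
}

// Hamming distance: number of differing bits between two packed vectors.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { dist += x & 1; x >>= 1; }
  }
  return dist;
}

// A 1024-dim float32 vector occupies 4,096 bytes; its binary form is 128 bytes.
const vecA = Array.from({ length: 1024 }, () => Math.random() - 0.5);
const vecB = Array.from({ length: 1024 }, () => Math.random() - 0.5);
console.log(binaryQuantize(vecA).length);                              // 128
console.log(hammingDistance(binaryQuantize(vecA), binaryQuantize(vecB))); // ~512 for unrelated vectors
```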
SC
Semantic Caching
Compute savings · Latency reduction

Cache embedding results and check semantic similarity before re-embedding. If an incoming query is >0.85 cosine similarity to a cached query, return the cached result. Production systems achieve 80–95% cache hit rates on stable domains — which means 80–95% of your embedding compute budget disappears. The threshold (0.85–0.95) controls the precision/savings tradeoff.

80–95% compute savings · sub-ms cache retrieval
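A minimal sketch of the cache check, with an in-memory array standing in for whatever store holds the cached query vectors in production; the names and the 0.85 threshold mirror the description above, and nothing here is a specific library's API.

```typescript
// Semantic cache: reuse a stored result when a new query's embedding is
// close enough (cosine ≥ threshold) to a previously answered query.
interface CacheEntry { queryVec: number[]; result: string; }

const SIMILARITY_THRESHOLD = 0.85; // raise toward 0.95 for stricter matching
const cache: CacheEntry[] = [];    // in-memory stand-in for a vector store / KV layer

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns a cached result if any stored query is similar enough, else null.
function checkSemanticCache(queryVec: number[]): string | null {
  let best: { entry: CacheEntry; score: number } | null = null;
  for (const entry of cache) {
    const score = cosine(queryVec, entry.queryVec);
    if (!best || score > best.score) best = { entry, score };
  }
  return best !== null && best.score >= SIMILARITY_THRESHOLD ? best.entry.result : null;
}

function storeInCache(queryVec: number[], result: string): void {
  cache.push({ queryVec, result }); // evict by LRU / TTL in a real system
}
```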
MR
Matryoshka Embeddings
Flexible dimensionality · OpenAI text-emb-3

Embeddings trained with the Matryoshka Representation Learning objective can be truncated at inference time without retraining. A 1536-dimension embedding can be shortened to 256 dimensions and still be useful — just less precise. Use full dimensions for high-stakes retrieval, truncated for fast first-pass filtering. OpenAI's text-embedding-3 family supports this natively.

Flexible dim reduction · no retraining required
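Client-side truncation is a slice plus a re-normalization, as in the sketch below. (OpenAI's text-embedding-3 models also accept a `dimensions` parameter so the API returns the shortened vector directly; the generic version works for any Matryoshka-trained model.)

```typescript
// Matryoshka truncation: keep the leading k dimensions, then L2-normalize
// so cosine similarity stays meaningful on the shortened vectors.
function truncateEmbedding(vec: number[], dims: number): number[] {
  const head = vec.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  return head.map((x) => x / norm);
}

// Typical pattern: coarse first-pass filtering on short vectors,
// final scoring on the full-dimension originals.
// const coarse = truncateEmbedding(fullVector, 256); // fast filter
// const fine   = fullVector;                         // precise scoring
```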
BS
Batch Sort by Length
GPU efficiency · 20–40% compute savings

Sort your embedding batch by token length before processing. This minimizes padding — short documents padded to match long ones waste GPU cycles. Grouping similarly sized inputs into each batch delivers 20–40% compute savings on embedding generation jobs with no quality impact. A near-free optimization that most teams never implement.

20–40% GPU compute savings · zero quality cost
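A minimal sketch, using a crude character-count proxy for token length (a real pipeline would use the embedding model's own tokenizer): sort, slice into batches, and each batch then pads only to its own longest member.

```typescript
// Length-sorted batching: grouping similar-length documents means each
// batch pads only to its own longest member, not the corpus maximum.
function batchByLength(docs: string[], batchSize: number): string[][] {
  // Crude ~4 chars/token proxy; use the model's tokenizer for real counts.
  const estimateTokens = (text: string) => Math.ceil(text.length / 4);

  const sorted = [...docs].sort((a, b) => estimateTokens(a) - estimateTokens(b));
  const batches: string[][] = [];
  for (let i = 0; i < sorted.length; i += batchSize) {
    batches.push(sorted.slice(i, i + batchSize));
  }
  return batches;
}

// Without sorting, a 10-token note batched with a 2,000-token contract gets
// padded to 2,000 tokens of mostly wasted compute; sorted, short docs travel together.
```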
RE
Reranking Pass
Precision improvement · 10–30% over vector-only retrieval

Vector similarity search retrieves a first-pass candidate set. A cross-encoder reranking model then scores each candidate against the original query — more expensive but far more precise. Reranking adds 50–100ms latency but delivers 10–30% precision improvement over vector-only retrieval. For high-stakes retrieval (legal, medical, financial), this is non-negotiable.

10–30% precision improvement · +50–100ms latency
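A minimal sketch of the two-stage shape. Both stages are passed in as functions because the cross-encoder call is model-specific (a hosted rerank endpoint or a local cross-encoder); `vectorSearch` and `crossEncoderScore` are placeholder names, not a particular library's API.

```typescript
// Two-stage retrieval: cheap vector search for candidate recall, then an
// expensive cross-encoder pass for precision on the small candidate set.
interface Candidate { id: string; text: string; vectorScore: number; }

async function retrieveWithRerank(
  query: string,
  vectorSearch: (q: string, k: number) => Promise<Candidate[]>,   // stage 1: ANN / pgvector query
  crossEncoderScore: (q: string, doc: string) => Promise<number>, // stage 2: reranker call
  finalK = 5,
): Promise<Candidate[]> {
  // Over-fetch candidates from the vector index: fast but approximate.
  const candidates = await vectorSearch(query, 50);

  // Score each (query, document) pair jointly with the cross-encoder:
  // slower (+50–100ms), but far more precise than vector similarity alone.
  const scored = await Promise.all(
    candidates.map(async (c) => ({ ...c, rerankScore: await crossEncoderScore(query, c.text) })),
  );
  return scored
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, finalK);
}
```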
The model that talks to your users is visible. The model that decides what it knows is not.

Every RAG response, every memory retrieval, every semantic search result flows through the embedding layer first. The quality of the embedding model determines the ceiling of what the generative model can say. You can upgrade the LLM. If the retrieval substrate is wrong, nothing above it improves.

Justin Erickson · PropTechUSA.ai
pgvector · Supabase · proptechusa-memory worker · 87 Cloudflare Workers · March 2026