Embedding Models · Hidden Infrastructure Layer

The invisible substrate that everything else runs on — and why nobody talks about it.

RAG, agent memory, semantic search, recommendations — they all depend on one thing: the quality of your embeddings. Embeddings are the wiring inside the wall. You don't see them. You notice them when they fail. And most teams spend almost no time thinking about them until something goes wrong.

4TB
Storage for 1B vectors at 1024 dimensions before indexing overhead
5.8d
Time to embed 1B documents on a single L4 GPU at ~2,000 docs/sec
32×
Compression from binary quantization with only ~5% quality loss
80%
Compute savings from semantic caching at 0.85+ similarity threshold
// Vector Space Diagram · Semantic Proximity · Text → Embedding → Retrieval

§01 What an Embedding Actually Is

An embedding is a dense numerical vector — an array of floating-point numbers — that represents the meaning of a piece of content. Text, images, audio, code: all of it can be embedded. The critical property is that semantically similar content produces geometrically close vectors. "Motivated seller wants fast close" and "owner needs to sell quickly" — different words, similar meaning — will cluster near each other in vector space. "The cat sat on the mat" will be nowhere nearby.

This is the capability that makes semantic search possible. Traditional keyword search matches exact terms. Embedding-based search matches meaning. The query "what did the buyer say about the inspection?" finds the relevant conversation even if the word "inspection" never appears in the stored text, because the embedding model has learned that "inspection," "walkthrough," "property condition," and "disclosed defect" share semantic space.
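To make the geometry concrete, here is a minimal sketch of the closeness measure itself, cosine similarity, using toy 4-dimensional vectors in place of real model outputs (a production embedding is 384 to 3,072 numbers returned by the model; the values below are invented for illustration):

```typescript
// Cosine similarity: the standard closeness measure in embedding space.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy vectors: the two "seller" sentences point in a similar
// direction; the "cat" sentence does not.
const motivatedSeller = [0.82, 0.11, 0.05, 0.31]; // "Motivated seller wants fast close"
const ownerNeedsSell  = [0.79, 0.15, 0.02, 0.35]; // "owner needs to sell quickly"
const catOnMat        = [0.03, 0.91, 0.40, 0.02]; // "The cat sat on the mat"

console.log(cosineSimilarity(motivatedSeller, ownerNeedsSell).toFixed(2)); // ~0.99, near neighbors
console.log(cosineSimilarity(motivatedSeller, catOnMat).toFixed(2));       // ~0.17, far apart
```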

Why this matters for agents: Every memory architecture, every RAG pipeline, every semantic retrieval system that powers modern AI agents sits on top of an embedding model. The model's quality determines whether the agent retrieves the right context. Swap in a better embedding model and your agent gets smarter without changing a line of agent code. Embeddings are leverage.

§02 The Dimensionality Decision

Every embedding has a dimensionality — the length of the vector. A 384-dimension embedding represents content as 384 numbers. A 3,072-dimension embedding uses 3,072. Higher dimensions can carry more semantic nuance. But the relationship between dimensions and quality is not linear — and the infrastructure cost definitely is.

384
dims
Compact / High-Throughput

E5-small, BGE-small. ~118M params. CPU-viable. Best for real-time, high-volume retrieval where latency budget is tight. Often outperforms models with 10× more parameters on production tasks. The e5-small achieved best-in-tier results despite being the smallest model tested.

1024
dims
Balanced / Production Standard

Voyage-3-large, BGE-large, Cohere Embed v3. The production sweet spot: captures rich semantic structure without the storage overhead of 3K-dim vectors. Voyage-3-large outperforms OpenAI's 3072-dim model by 9.74% while requiring a third of the storage. 1024 is the right default for most RAG systems.

3072
dims
Maximum Fidelity / High-Cost

OpenAI text-embedding-3-large, Google Gemini Embedding. Highest semantic fidelity on benchmark tasks. Also the highest storage cost: 1B vectors at 3072 dims require ~12TB. Most production systems don't need this. Use it when you're running enterprise-scale semantic search where the additional fidelity demonstrably affects retrieval quality on your specific corpus.
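The storage figures above fall straight out of dimensions × bytes × vector count. A quick sketch (float32 vectors, before index overhead or any quantization):

```typescript
// Raw vector storage: dimensions × 4 bytes (float32) × vector count.
// HNSW / IVF index structures add overhead on top of this.
function rawStorageTB(dims: number, vectorCount: number): number {
  return (dims * 4 * vectorCount) / 1e12; // decimal terabytes
}

const oneBillion = 1_000_000_000;
console.log(rawStorageTB(384, oneBillion).toFixed(2));  // ~1.54 TB
console.log(rawStorageTB(1024, oneBillion).toFixed(2)); // ~4.10 TB (the 4TB figure above)
console.log(rawStorageTB(3072, oneBillion).toFixed(2)); // ~12.29 TB (the ~12TB figure above)
```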

§03 The Current Landscape

// Embedding Model Comparison · Production Characteristics · December 2025
Model · Dims · Context · Cost / 1M tokens · Relative Accuracy · Best For
Voyage-3-large · 1024 · 32K · $0.06 · +9.74% over OpenAI · Production RAG, long documents, best accuracy/cost ratio
OpenAI text-emb-3-large · 3072 · 8K · $0.13 · Baseline benchmark · Battle-tested, managed infra, most integrations
E5-small (open source) · 384 · 512 · Self-hosted · Best-in-tier Top-5 accuracy · High-throughput, CPU-viable, cost-sensitive
NVIDIA NV-Embed-Nemotron · 4096 · 8K · Self-hosted · 62% Top-1 accuracy (highest tested) · Enterprise RAG, GPU infra available, max precision
Cohere Embed v3 · 1024 · 512 · $0.10 · Solid, multilingual · Multilingual deployments, Cohere ecosystem
pgvector + any model · Any · Any · Storage only · Already running Postgres (Supabase)? Start here.
The non-obvious finding: The e5-small model — 118M parameters, 384 dimensions — achieved best-in-tier Top-5 retrieval accuracy in independent benchmarking, outperforming models with 70× more parameters. Bigger is not better in embeddings. What matters is whether the model's training distribution matches your retrieval task. Always benchmark on your actual data.
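"Benchmark on your actual data" does not require a formal eval framework: a handful of hand-labeled query-to-relevant-document pairs and a Recall@k loop is enough to compare candidate models. A minimal sketch, where `embed` is whatever function wraps the model under test; the corpus and labels are your own data, and every name here is a placeholder shape rather than a specific library's API:

```typescript
// Minimal Recall@k harness: for each labeled query, check whether the
// known-relevant document lands in the top-k nearest neighbors.
type EmbedFn = (text: string) => Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function recallAtK(
  embed: EmbedFn,                                    // wraps the model under test
  corpus: { id: string; text: string }[],            // your own documents
  labeled: { query: string; relevantId: string }[],  // hand-labeled query → doc pairs
  k = 5,
): Promise<number> {
  const docVectors = await Promise.all(
    corpus.map(async (d) => ({ id: d.id, vec: await embed(d.text) })),
  );
  let hits = 0;
  for (const { query, relevantId } of labeled) {
    const qVec = await embed(query);
    const topK = docVectors
      .map((d) => ({ id: d.id, score: cosine(qVec, d.vec) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
    if (topK.some((d) => d.id === relevantId)) hits++;
  }
  return hits / labeled.length; // fraction of queries with the right doc in the top k
}
```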

§04 Infrastructure Optimization

At scale, embedding infrastructure is not an afterthought — it's a cost center. A billion-item corpus on a single GPU takes over five days to embed. At $0.13/M tokens, 100M monthly queries come to roughly $13,000/month (assuming ~1,000 tokens embedded per query) before you factor in retrieval. The optimizations that matter in production:

BQ
Binary Quantization
Compression · Storage reduction

Reduce 32-bit float vectors to 1-bit binary representations. 32× compression with only ~5% quality loss. A 4TB vector store becomes 128GB. Trade a small amount of precision for an enormous infrastructure cost reduction. For most production RAG systems, the 5% quality loss is invisible to end users.

32× compression · ~5% quality loss
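A minimal sketch of where the 32× comes from: each float32 becomes a single bit (sign-thresholded here), vectors pack into bit arrays, and search runs on Hamming distance. Production vector stores implement this internally; the sketch below only shows the core transform.

```typescript
// Binary quantization: each float32 (32 bits) becomes 1 bit → 32× compression.
// Simplest scheme: sign thresholding (positive → 1, otherwise → 0).
function binaryQuantize(vec: number[]): Uint8Array {
  const packed = new Uint8Array(Math.ceil(vec.length / 8));
  for (let i = 0; i < vec.length; i++) {
    if (vec[i] > 0) packed[i >> 3] |= 1 << (i & 7);
  }
  return packed;
}

// Hamming distance: number of differing bits between two packed vectors.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) { dist += x & 1; x >>= 1; }
  }
  return dist;
}

// A 1024-dim float32 vector occupies 4,096 bytes; its binary form is 128 bytes.
const vecA = Array.from({ length: 1024 }, () => Math.random() - 0.5);
const vecB = Array.from({ length: 1024 }, () => Math.random() - 0.5);
console.log(binaryQuantize(vecA).length);                              // 128
console.log(hammingDistance(binaryQuantize(vecA), binaryQuantize(vecB))); // ~512 for unrelated vectors
```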
SC
Semantic Caching
Compute savings · Latency reduction

Cache embedding results and check semantic similarity before re-embedding. If an incoming query is >0.85 cosine similarity to a cached query, return the cached result. Production systems achieve 80–95% cache hit rates on stable domains — which means 80–95% of your embedding compute budget disappears. The threshold (0.85–0.95) controls the precision/savings tradeoff.

80–95% compute savings · sub-ms cache retrieval
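A minimal sketch of the cache check, with an in-memory array standing in for whatever store holds the cached query vectors in production; the names and the 0.85 threshold mirror the description above, and nothing here is a specific library's API.

```typescript
// Semantic cache: reuse a stored result when a new query's embedding is
// close enough (cosine ≥ threshold) to a previously answered query.
interface CacheEntry { queryVec: number[]; result: string; }

const SIMILARITY_THRESHOLD = 0.85; // raise toward 0.95 for stricter matching
const cache: CacheEntry[] = [];    // in-memory stand-in for a vector store / KV layer

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns a cached result if any stored query is similar enough, else null.
function checkSemanticCache(queryVec: number[]): string | null {
  let best: { entry: CacheEntry; score: number } | null = null;
  for (const entry of cache) {
    const score = cosine(queryVec, entry.queryVec);
    if (!best || score > best.score) best = { entry, score };
  }
  return best !== null && best.score >= SIMILARITY_THRESHOLD ? best.entry.result : null;
}

function storeInCache(queryVec: number[], result: string): void {
  cache.push({ queryVec, result }); // evict by LRU / TTL in a real system
}
```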
MR
Matryoshka Embeddings
Flexible dimensionality · OpenAI text-emb-3

Embeddings trained with the Matryoshka Representation Learning objective can be truncated at inference time without retraining. A 1536-dimension embedding can be shortened to 256 dimensions and still be useful — just less precise. Use full dimensions for high-stakes retrieval, truncated for fast first-pass filtering. OpenAI's text-embedding-3 family supports this natively.

Flexible dim reduction · no retraining required
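Client-side truncation is a slice plus a re-normalization, as in the sketch below. (OpenAI's text-embedding-3 models also accept a `dimensions` parameter so the API returns the shortened vector directly; the generic version works for any Matryoshka-trained model.)

```typescript
// Matryoshka truncation: keep the leading k dimensions, then L2-normalize
// so cosine similarity stays meaningful on the shortened vectors.
function truncateEmbedding(vec: number[], dims: number): number[] {
  const head = vec.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  return head.map((x) => x / norm);
}

// Typical pattern: coarse first-pass filtering on short vectors,
// final scoring on the full-dimension originals.
// const coarse = truncateEmbedding(fullVector, 256); // fast filter
// const fine   = fullVector;                         // precise scoring
```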
BS
Batch Sort by Length
GPU efficiency · 20–40% compute savings

Sort your embedding batch by token length before processing. This minimizes padding — short documents padded to match long ones waste GPU cycles. Grouping similarly sized inputs into each batch delivers 20–40% compute savings on embedding generation jobs with no quality impact. A near-free optimization that most teams never implement.

20–40% GPU compute savings · zero quality cost
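A minimal sketch, using a crude character-count proxy for token length (a real pipeline would use the embedding model's own tokenizer): sort, slice into batches, and each batch then pads only to its own longest member.

```typescript
// Length-sorted batching: grouping similar-length documents means each
// batch pads only to its own longest member, not the corpus maximum.
function batchByLength(docs: string[], batchSize: number): string[][] {
  // Crude ~4 chars/token proxy; use the model's tokenizer for real counts.
  const estimateTokens = (text: string) => Math.ceil(text.length / 4);

  const sorted = [...docs].sort((a, b) => estimateTokens(a) - estimateTokens(b));
  const batches: string[][] = [];
  for (let i = 0; i < sorted.length; i += batchSize) {
    batches.push(sorted.slice(i, i + batchSize));
  }
  return batches;
}

// Without sorting, a 10-token note batched with a 2,000-token contract gets
// padded to 2,000 tokens of mostly wasted compute; sorted, short docs travel together.
```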
RE
Reranking Pass
Precision improvement · 10–30% over vector-only retrieval

Vector similarity search retrieves a first-pass candidate set. A cross-encoder reranking model then scores each candidate against the original query — more expensive but far more precise. Reranking adds 50–100ms latency but delivers 10–30% precision improvement over vector-only retrieval. For high-stakes retrieval (legal, medical, financial), this is non-negotiable.

10–30% precision improvement · +50–100ms latency
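A minimal sketch of the two-stage shape. Both stages are passed in as functions because the cross-encoder call is model-specific (a hosted rerank endpoint or a local cross-encoder); `vectorSearch` and `crossEncoderScore` are placeholder names, not a particular library's API.

```typescript
// Two-stage retrieval: cheap vector search for candidate recall, then an
// expensive cross-encoder pass for precision on the small candidate set.
interface Candidate { id: string; text: string; vectorScore: number; }

async function retrieveWithRerank(
  query: string,
  vectorSearch: (q: string, k: number) => Promise<Candidate[]>,   // stage 1: ANN / pgvector query
  crossEncoderScore: (q: string, doc: string) => Promise<number>, // stage 2: reranker call
  finalK = 5,
): Promise<Candidate[]> {
  // Over-fetch candidates from the vector index: fast but approximate.
  const candidates = await vectorSearch(query, 50);

  // Score each (query, document) pair jointly with the cross-encoder:
  // slower (+50–100ms), but far more precise than vector similarity alone.
  const scored = await Promise.all(
    candidates.map(async (c) => ({ ...c, rerankScore: await crossEncoderScore(query, c.text) })),
  );
  return scored
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, finalK);
}
```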
The model that talks to your users is visible. The model that decides what it knows is not.

Every RAG response, every memory retrieval, every semantic search result flows through the embedding layer first. The quality of the embedding model determines the ceiling of what the generative model can say. You can upgrade the LLM. If the retrieval substrate is wrong, nothing above it improves.

Justin Erickson · PropTechUSA.ai
pgvector · Supabase · proptechusa-memory worker · 87 Cloudflare Workers · March 2026