"We're spending $8K a month on Claude." I'm running 10 production AI domain experts, an 11th orchestrator worker, real-time streaming, and multi-model fallback logic for a fraction of what most teams with a single chatbot spend. The gap isn't model choice or compute tier. It's one concept most developers skim past in the docs.
It's called prompt caching. And this post is the technical breakdown I wish existed when I started building at scale.
The Problem No One Talks About
Every AI API call costs tokens — input and output, billed separately. Most developers obsess over output tokens because that's the part that changes. What they ignore is the input side, specifically the part that doesn't change.
Your system prompt. The one with the persona, the behavioral rules, the relationship maps, the anti-patterns, the domain knowledge. That system prompt goes into every single call, gets processed from scratch every single time, and you pay full price for it every single time.
2,000 tokens × 500 calls/day = 1,000,000 tokens of identical, unchanging instructions processed every 24 hours. That's not engineering. That's waste with an API key attached.
What Prompt Caching Actually Does
Anthropic's prompt caching lets you mark specific input blocks with a cache_control parameter. When Claude processes that block for the first time, it stores the computed KV (key-value) cache state. On subsequent calls within the cache window, instead of reprocessing those tokens, it reads from cache at a fraction of the cost.
Anthropic vs. OpenAI: Two Approaches to Caching
OpenAI also offers prompt caching, but it's automatic with zero config: prompt prefixes past a ~1,024-token threshold are cached opportunistically, the platform decides what to cache, and you see the impact only after the fact in the usage object. Simpler to start with, less useful as an architectural lever.
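If you're on OpenAI, after-the-fact visibility is all you get. A minimal sketch of reading that report; this assumes the Chat Completions `usage.prompt_tokens_details.cached_tokens` field, so the shape is illustrative and worth checking against your SDK's types:

```typescript
// Sketch: OpenAI reports caching after the fact in the usage object.
// You can't mark blocks; you can only observe what the platform cached.
type OpenAIUsage = {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
};

function cachedFraction(usage: OpenAIUsage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

// Example: a 2,048-token cached prefix inside a 2,293-token prompt.
console.log(cachedFraction({
  prompt_tokens: 2293,
  completion_tokens: 512,
  prompt_tokens_details: { cached_tokens: 2048 },
})); // ≈0.89: most of the prompt was served from cache
```

Tracking this fraction over time is the closest you can get to the explicit hit-rate accounting Anthropic's API gives you.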
The Implementation
```ts
// System prompt lives at MODULE LEVEL — outside the fetch handler
// Static. Never rebuilt. Same reference every call.
const SYSTEM_PROMPT = `You are Dr. Marcus Webb, Chief Economist...
[2,200 tokens: epistemic fingerprint, anti-patterns, relationship map, fault lines, output style]`;

export default {
  async fetch(req: Request, env: Env) {
    const { userMessage, history } = await req.json();

    const res = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': env.ANTHROPIC_KEY,
        'anthropic-version': '2023-06-01',
        'anthropic-beta': 'prompt-caching-2024-07-31', // enable cache
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 2048,
        system: [{
          type: 'text',
          text: SYSTEM_PROMPT,
          cache_control: { type: 'ephemeral' } // ← entire game right here
        }],
        messages: [...history, { role: 'user', content: userMessage }]
      })
    });

    const data = await res.json();
    // cache_creation_input_tokens > 0 → wrote cache (first call, slight premium)
    // cache_read_input_tokens > 0    → hit cache (90% cheaper)
    return new Response(JSON.stringify(data));
  }
};
```
The Real Math — My System
| Scenario | Daily Calls | Sys Prompt Tokens | Daily Input Tokens | Est. Daily Cost |
|---|---|---|---|---|
| Without caching | 400 | 2,000 | 800,000 | $2.40 |
| With caching (reads) | 400 | 2,000 | 80,000 effective | $0.24 |
| Monthly savings | — | — | — | ~$64/agent/mo |
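The table's figures can be reproduced from first principles. A sketch assuming a Sonnet-class input rate of ~$3 per million tokens and the ~90% cache-read discount (substitute your model's actual pricing):

```typescript
const RATE_PER_MTOK = 3.0;       // assumption: standard input rate, USD per million tokens
const CACHE_READ_DISCOUNT = 0.9; // cache reads bill at roughly 10% of standard

const calls = 400;
const sysPromptTokens = 2_000;
const dailyTokens = calls * sysPromptTokens;                    // 800,000

const withoutCaching = (dailyTokens / 1e6) * RATE_PER_MTOK;     // $2.40/day
const withCaching = withoutCaching * (1 - CACHE_READ_DISCOUNT); // $0.24/day (80k "effective" tokens)
const monthlySavings = (withoutCaching - withCaching) * 30;     // ≈$64.80 per agent per month

console.log({ withoutCaching, withCaching, monthlySavings });
```

Multiply by 11 workers and the line item stops being a rounding error.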
The Architecture That Makes It Sing
Combining With Auto-Fallback
```ts
async function callWithFallback(payload) {
  const models = [
    'claude-sonnet-4-20250514',  // primary — cache reads are model-specific
    'claude-haiku-4-5-20251001', // fallback — separate cache, slightly cheaper
  ];
  for (const model of models) {
    try {
      const res = await callClaude({ ...payload, model });
      if (res.ok) return res;
    } catch (e) {
      if (e.status === 429 || e.status >= 500) continue;
      throw e; // non-retryable — surface it
    }
  }
  throw new Error('All models exhausted');
}
```
What The Usage Object Tells You
```jsonc
// First call — writes cache. Slightly above standard rate.
{
  "input_tokens": 245,
  "cache_creation_input_tokens": 2048, // wrote to cache — one-time premium
  "cache_read_input_tokens": 0,
  "output_tokens": 512
}

// Every call within 5 minutes — reads cache. 90% cheaper.
{
  "input_tokens": 245,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 2048, // ← this is money saved on every call
  "output_tokens": 489
}

// Hit rate formula — track this religiously:
// cache_read / (cache_read + cache_creation) = hit rate
// Production target: > 85% | Below 60%: something's broken
```
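That hit-rate formula is worth wiring into your logging rather than eyeballing. A minimal sketch over accumulated usage objects (field names as above):

```typescript
type Usage = {
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
};

// Accumulate across calls, then compute the hit rate on the cacheable portion.
function hitRate(usages: Usage[]): number {
  let reads = 0;
  let writes = 0;
  for (const u of usages) {
    reads += u.cache_read_input_tokens ?? 0;
    writes += u.cache_creation_input_tokens ?? 0;
  }
  const total = reads + writes;
  return total > 0 ? reads / total : 0;
}

// One write followed by nine reads on a 2,048-token prefix → 90% hit rate.
const samples: Usage[] = [
  { cache_creation_input_tokens: 2048, cache_read_input_tokens: 0 },
  ...Array(9).fill({ cache_creation_input_tokens: 0, cache_read_input_tokens: 2048 }),
];
console.log(hitRate(samples)); // 0.9
```

Feed it a rolling window per agent and alert below your threshold; a dropping hit rate is usually the first visible symptom of the prefix mistakes covered next.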
Common Mistakes That Kill Your Cache Hit Rate
1. Dynamic Content Before The Cache Breakpoint
Cache is keyed on prefix. Everything before cache_control must be byte-for-byte identical between calls. A timestamp, a session ID, any variable injection before the system prompt — the API won't error, it'll just silently bill you at full rate on every call. Static content first. Always.
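A sketch of the failure mode next to the fix. `SYSTEM_PROMPT` here stands in for the full static prompt, and the per-request counter stands in for any dynamic value (timestamp, session ID, user name):

```typescript
const SYSTEM_PROMPT = 'You are Dr. Marcus Webb, Chief Economist...'; // stand-in for the full 2,200-token prompt

let requestCount = 0; // stand-in for any per-request value

// BROKEN: dynamic value inside the cached block. The prefix differs
// byte-for-byte on every call, so every call is a full-rate cache write.
function brokenSystem() {
  return [{
    type: 'text',
    text: `Request #${++requestCount}\n${SYSTEM_PROMPT}`, // cache key changes each call
    cache_control: { type: 'ephemeral' },
  }];
}

// CORRECT: the cached block is byte-for-byte identical every call.
// Dynamic values ride in the messages array, after the cache breakpoint.
function workingSystem() {
  return [{
    type: 'text',
    text: SYSTEM_PROMPT,
    cache_control: { type: 'ephemeral' },
  }];
}

// The cache is keyed on the exact bytes of the prefix:
console.log(brokenSystem()[0].text === brokenSystem()[0].text);   // false — cache miss every call
console.log(workingSystem()[0].text === workingSystem()[0].text); // true — cacheable
```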
2. Under the Minimum Token Threshold
Sonnet requires 1,024 tokens minimum. Haiku requires 2,048. Below threshold, caching doesn't activate. No error — just standard billing. Check cache_creation_input_tokens on your first call. If it's 0 and the beta header is on, you're under the floor.
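A rough pre-flight check can catch this before deploy. The ~4-characters-per-token heuristic below is an approximation and the floors are the documented per-model minimums; the authoritative signal is still `cache_creation_input_tokens` on a live call:

```typescript
// Documented minimum cacheable prefix sizes, per model family.
const CACHE_FLOOR: Record<string, number> = {
  sonnet: 1024,
  haiku: 2048,
};

// Crude estimate — English prose averages roughly 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function likelyCacheable(systemPrompt: string, family: 'sonnet' | 'haiku'): boolean {
  return estimateTokens(systemPrompt) >= CACHE_FLOOR[family];
}

const shortPrompt = 'Be helpful. '.repeat(50); // ~150 tokens — silently billed at standard rate
console.log(likelyCacheable(shortPrompt, 'sonnet')); // false
```

If you're under the floor, padding with filler is the wrong fix; fold in the domain knowledge you were going to retrieve at runtime anyway.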
3. Low-Traffic Systems With No Keepalive
Ephemeral cache entries expire 5 minutes after they were last used. With irregular call patterns you pay the write premium on every call and get zero reads to offset it. The fix is a lightweight background keepalive:
```ts
// Fires every 4 minutes via Cloudflare Cron Triggers
// Resets TTL before expiry. Cost: fractions of a cent.
async function keepCacheWarm(env: Env) {
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': env.ANTHROPIC_KEY,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'prompt-caching-2024-07-31',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1, // minimal output — we only want the cache write/read
      system: [{
        type: 'text',
        text: SYSTEM_PROMPT,
        cache_control: { type: 'ephemeral' }
      }],
      messages: [{ role: 'user', content: 'ping' }]
    })
  });

  const { usage } = await res.json();
  // cache_creation > 0: cache was cold, re-warmed — small write cost
  // cache_read > 0:     cache was warm, ping was nearly free
  return usage;
}

// wrangler.toml:
// [triggers]
// crons = ["*/4 * * * *"]
```
A max_tokens: 1 ping with a warm cache costs only standard rate on ~50 dynamic tokens — fractions of a cent. One cache miss on a 2,200-token system prompt costs orders of magnitude more. The keepalive pays for itself on the first avoided miss.
Cache write calls cost ~25% above standard input rates. The math only works when reads substantially outnumber writes. For genuinely low-frequency batch systems, evaluate your actual call pattern before assuming caching helps.
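The break-even arithmetic is worth making explicit. A sketch normalizing costs to the standard input rate, using the ~25% write premium and ~90% read discount cited above:

```typescript
// Costs per cached token, relative to the standard input rate (1.0 = standard).
const WRITE_COST = 1.25; // cache write: ~25% premium over standard
const READ_COST = 0.10;  // cache read: ~90% discount

// Net cost ratio for N calls where the first writes and the rest read,
// versus the same N calls at standard rate. Below 1.0 means caching wins.
function netRatio(calls: number): number {
  const cached = WRITE_COST + (calls - 1) * READ_COST;
  return cached / calls;
}

console.log(netRatio(1));   // 1.25 — a lone call is pure write premium
console.log(netRatio(2));   // < 1 — break-even is passed by the second read
console.log(netRatio(400)); // ≈0.10 — at volume, the discount dominates
```

So the decision isn't subtle: two or more calls inside the cache window and you're already ahead; one call per window and you're strictly worse off than no caching at all.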
The Bigger Picture
Prompt caching is one part of a broader principle: understand what you're paying for at the infrastructure level. Most teams treat AI APIs as a black box — input money, receive intelligence. That works until scale. Then every unoptimized call compounds.
Caching is the most immediate lever. Temperature tuning per agent is next. Model routing by task complexity after that. SSE streaming for perceived latency. Fallback hierarchies for reliability without overpaying.
I built PropTechUSA.ai's entire stack self-taught. No CS degree. Learned through logs, iteration, and reading the documentation past the quickstart. That discipline — understanding the machine you're running — is what turns a $10K/month API budget into a $1K/month one with better output.
The game is optimization. Start with caching. Don't stop there.