AI Infrastructure · Deep Dive · Real Numbers

Prompt Caching: The API Cost Kill Shot Most Builders Are Sleeping On

If you're running AI at scale and not using prompt caching, you're paying full price for zero marginal value on every single call. Here's the full technical breakdown — architecture, real implementation, and the math behind an 11-worker system that costs less than your Netflix.

Without Caching · Per Day
3,960,000
input tokens processed (10 agents + orchestrator)
With Caching · Per Day
396,000
effective billed tokens
saving 90%

"We're spending $8K a month on Claude." I'm running 10 production AI domain experts, an 11th orchestrator worker, real-time streaming, and multi-model fallback logic for a fraction of what most teams with a single chatbot spend. The gap isn't model choice or compute tier. It's one concept most developers skim past in the docs.

It's called prompt caching. And this post is the technical breakdown I wish existed when I started building at scale.

The Problem No One Talks About

Every AI API call costs tokens — input and output, billed separately. Most developers obsess over output tokens because that's the part that changes. What they ignore is the input side, specifically the part that doesn't change.

Your system prompt. The one with the persona, the behavioral rules, the relationship maps, the anti-patterns, the domain knowledge. That system prompt goes into every single call, gets processed from scratch every single time, and you pay full price for it every single time.

The Core Insight

2,000 tokens × 500 calls/day = 1,000,000 tokens of identical, unchanging instructions processed every 24 hours. That's not engineering. That's waste with an API key attached.
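Priced out, the waste looks like this — a quick sketch assuming Sonnet's $3/MTok standard input rate (verify against the current rate card):

const PROMPT_TOKENS = 2_000;
const CALLS_PER_DAY = 500;
const RATE_PER_MTOK = 3.0; // USD, standard input — assumed Sonnet pricing

const dailyTokens = PROMPT_TOKENS * CALLS_PER_DAY;           // 1,000,000
const dailyCost = (dailyTokens / 1_000_000) * RATE_PER_MTOK; // $3.00
const monthlyCost = dailyCost * 30;                          // $90 — per agent, for identical bytes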

What Prompt Caching Actually Does

Anthropic's prompt caching lets you mark specific input blocks with a cache_control parameter. When Claude processes that block for the first time, it stores the computed KV (key-value) cache state. On subsequent calls within the cache window, instead of reprocessing those tokens, it reads from cache at a fraction of the cost.

5 min — ephemeral cache TTL
90% — cost reduction on cache reads
1,024 — minimum cacheable tokens (Sonnet)
4 — maximum cache breakpoints per request
[Cache Flow — live simulation: a 2,200-token system prompt, ~$0.0066 in input tokens per uncached call]

Anthropic vs. OpenAI: Two Approaches to Caching

OpenAI also offers prompt caching — automatic, zero config. The platform decides what to cache and you see the impact after the fact. Simpler to start with, less useful at the architecture level.

OpenAI — Automatic Caching
Zero configuration · No code changes · Platform decides what caches · No explicit boundaries · Less predictable at scale

Anthropic — Explicit Caching (preferred for production)
You define exact cache boundaries · Full observability via the usage object · Architecture designed around it · Predictable cost at scale · Two extra lines to enable
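For contrast, the OpenAI side is observe-only — a sketch; the usage field names here are from OpenAI's chat completions API as I understand it, and caching activates automatically on long, stable prefixes:

import OpenAI from 'openai';

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'ping' }],
});

// Caching already happened (or didn't) — all you can do is read the result
const cached = completion.usage?.prompt_tokens_details?.cached_tokens ?? 0;
console.log(`cached prompt tokens: ${cached}`); // no way to force or scope it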

The Implementation

economist-worker/index.ts · Production
// System prompt lives at MODULE LEVEL — outside the fetch handler
// Static. Never rebuilt. Same reference every call.
const SYSTEM_PROMPT = `You are Dr. Marcus Webb, Chief Economist...
[2,200 tokens: epistemic fingerprint, anti-patterns,
relationship map, fault lines, output style]`;

export default {
  async fetch(req: Request, env: Env) {
    const { userMessage, history } = await req.json();

    const res = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': env.ANTHROPIC_KEY,
        'anthropic-version': '2023-06-01',
        'anthropic-beta': 'prompt-caching-2024-07-31', // enable cache
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 2048,
        system: [{
          type: 'text',
          text: SYSTEM_PROMPT,
          cache_control: { type: 'ephemeral' } // ← entire game right here
        }],
        messages: [...history, { role: 'user', content: userMessage }]
      })
    });

    const data = await res.json();
    // cache_creation_input_tokens > 0 → wrote cache (first call, slight premium)
    // cache_read_input_tokens > 0 → hit cache (90% cheaper)
    return new Response(JSON.stringify(data));
  }
};

The Real Math — My System

Daily Token Cost — 10 Agents · 1,800 Calls · 2,200-Token System Prompt
Without caching: 3,960,000 input tokens · ~$11.88/day
With caching: 396,000 effective tokens · ~$1.19/day

Simplified per-agent scenario (2,000-token prompt, Sonnet input at $3/MTok):

Scenario                 Daily Calls   Sys Prompt Tokens   Daily Input Tokens   Est. Daily Cost
Without caching          400           2,000               800,000              $2.40
With caching (reads)     400           2,000               80,000 effective     $0.24

Monthly savings: ~$64 per agent

The Architecture That Makes It Sing

LAYER 1 — STATIC (cache everything)
├── Agent persona + epistemic fingerprint
├── Domain knowledge base
├── Anti-patterns + behavioral guardrails
└── Relationship map to other agents
    ↑ cache_control: ephemeral on this entire block
    ↑ processed once, read 100x per 5min window

LAYER 2 — SESSION CONTEXT (cache if long)
├── Conversation history (last N turns)
└── Session-specific injections
    ↑ cache_control on this block if history > 1,024 tokens
    ↑ resets per session but cached within it

LAYER 3 — DYNAMIC (never cache)
└── Current user message
    ↑ always fresh, always billed at standard rate
    ↑ typically 50–200 tokens — small and fast
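Here's one way those layers map onto an actual request body — a minimal sketch reusing the module-level SYSTEM_PROMPT from the worker above. The Block and Msg types are illustrative stand-ins, not the Anthropic SDK's.

type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: 'user' | 'assistant'; content: Block[] };

function buildPayload(history: Msg[], userMessage: string) {
  // LAYER 2 — mark the final block of the final history turn as a cache
  // breakpoint (only worth doing once history clears the 1,024-token floor)
  const cachedHistory = history.map((m, i) =>
    i < history.length - 1
      ? m
      : {
          ...m,
          content: m.content.map((b, j) =>
            j < m.content.length - 1
              ? b
              : { ...b, cache_control: { type: 'ephemeral' as const } }),
        });

  return {
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    // LAYER 1 — static persona block, identical bytes on every call
    system: [{ type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } }],
    // LAYER 3 — the current user message sits outside every breakpoint
    messages: [
      ...cachedHistory,
      { role: 'user' as const, content: [{ type: 'text' as const, text: userMessage }] },
    ],
  };
}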

Combining With Auto-Fallback

shared/claude-client.ts · Shared Infrastructure
// callClaude is the shared fetch wrapper from the worker above — same
// headers, same cache_control block; only the model is swapped per attempt.
async function callWithFallback(payload: Record<string, unknown>) {
  const models = [
    'claude-sonnet-4-20250514',   // primary — cache reads are model-specific
    'claude-haiku-4-5-20251001',  // fallback — separate cache, slightly cheaper
  ];

  for (const model of models) {
    try {
      const res = await callClaude({ ...payload, model });
      if (res.ok) return res;
    } catch (e: any) {
      if (e.status === 429 || e.status >= 500) continue; // retryable — try next model
      throw e; // non-retryable — surface it
    }
  }
  throw new Error('All models exhausted');
}
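Usage looks like this — an illustrative call; the payload shape matches the worker above, with the model injected per attempt. Since caches are model-specific, a fallback hop to Haiku writes a fresh cache under that model — and because Haiku's cache floor is 2,048 tokens, a 2,200-token prompt still qualifies.

const payload = {
  max_tokens: 2048,
  system: [{ type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } }],
  messages: [{ role: 'user', content: 'What moved rates today?' }],
};
const res = await callWithFallback(payload); // model is injected per attempt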

What The Usage Object Tells You

API Response · usage object · Log Everything
// First call — writes cache. Slightly above standard rate.
{
  "input_tokens": 245,
  "cache_creation_input_tokens": 2048, // wrote to cache — one-time premium
  "cache_read_input_tokens": 0,
  "output_tokens": 512
}

// Every call within 5 minutes — reads cache. 90% cheaper.
{
  "input_tokens": 245,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 2048,   // ← this is money saved on every call
  "output_tokens": 489
}

// Hit rate formula — track this religiously:
// cache_read / (cache_read + cache_creation) = hit rate
// Production target: > 85%  |  Below 60%: something's broken
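A minimal tracker for that formula — the Usage type mirrors the responses above; keeping totals in module memory is an assumption (in production you'd likely aggregate in your logging pipeline instead):

type Usage = {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  output_tokens: number;
};

const totals = { read: 0, creation: 0 };

function recordUsage(u: Usage): void {
  totals.read += u.cache_read_input_tokens;
  totals.creation += u.cache_creation_input_tokens;
}

function cacheHitRate(): number {
  const denom = totals.read + totals.creation;
  return denom === 0 ? 0 : totals.read / denom; // target > 0.85
}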

Common Mistakes That Kill Your Cache Hit Rate

1. Dynamic Content Before The Cache Breakpoint

Cache is keyed on prefix. Everything before cache_control must be byte-for-byte identical between calls. A timestamp, a session ID, any variable injection before the system prompt — the API won't error, it'll just silently bill you at full rate on every call. Static content first. Always.
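The failure mode in two blocks — illustrative names, same system-block shape as the worker above:

// ✗ BAD — the timestamp changes the prefix bytes, so every call misses
const badSystem = [{
  type: 'text',
  text: `Current time: ${new Date().toISOString()}\n${SYSTEM_PROMPT}`,
  cache_control: { type: 'ephemeral' },
}];

// ✓ GOOD — the cached block is byte-for-byte identical on every call;
// anything dynamic rides in the user message instead
const goodSystem = [{
  type: 'text',
  text: SYSTEM_PROMPT,
  cache_control: { type: 'ephemeral' },
}];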

2. Under the Minimum Token Threshold

Sonnet requires 1,024 tokens minimum. Haiku requires 2,048. Below threshold, caching doesn't activate. No error — just standard billing. Check cache_creation_input_tokens on your first call. If it's 0 and the beta header is on, you're under the floor.
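A crude pre-flight check you might add — the chars÷4 heuristic is a rough approximation of token count, not the real tokenizer:

const MIN_CACHEABLE_TOKENS = 1024; // Sonnet's floor; Haiku's is 2048
const approxTokens = Math.ceil(SYSTEM_PROMPT.length / 4); // rough estimate

if (approxTokens < MIN_CACHEABLE_TOKENS) {
  console.warn(
    `System prompt ≈${approxTokens} tokens — under the cache floor; ` +
    `cache_control will be silently ignored and you'll bill at full rate.`
  );
}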

3. Low-Traffic Systems With No Keepalive

Ephemeral cache expires in 5 minutes. Irregular call patterns mean you pay the write premium on every call with zero reads to offset it. The fix is a lightweight background keepalive:

keepalive/cron-trigger.ts · Cloudflare Cron · Every 4min
// Fires every 4 minutes via Cloudflare Cron Triggers
// Resets TTL before expiry. Cost: fractions of a cent.

async function keepCacheWarm(env: Env) {
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': env.ANTHROPIC_KEY,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'prompt-caching-2024-07-31',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1, // minimal output — we only want the cache write/read
      system: [{ type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } }],
      messages: [{ role: 'user', content: 'ping' }]
    })
  });
  const { usage } = await res.json();
  // cache_creation > 0: cache was cold, re-warmed — small write cost
  // cache_read > 0: cache was warm, ping was nearly free
  return usage;
}

// Wire it up as the worker's scheduled handler (module syntax):
export default {
  async scheduled(_controller: ScheduledController, env: Env, _ctx: ExecutionContext) {
    await keepCacheWarm(env);
  }
};

// wrangler.toml:
// [triggers]
// crons = ["*/4 * * * *"]
Cost of the Keepalive

A max_tokens: 1 ping against a warm cache bills the cache-read rate on the prompt plus standard rate on a handful of dynamic tokens — fractions of a cent. A cold miss on a 2,200-token system prompt costs roughly an order of magnitude more, so the keepalive pays for itself on the first avoided miss.
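The arithmetic, assuming Sonnet's list rates at the time of writing ($3/MTok standard input, 0.10× for cache reads, 1.25× for cache writes — verify against the current rate card):

const RATE = 3.0 / 1_000_000;          // USD per standard input token
const warmPing = 2_200 * RATE * 0.10   // prompt served from cache
               +    10 * RATE;         // tiny dynamic portion  ≈ $0.0007
const coldMiss = 2_200 * RATE * 1.25;  // full write premium    ≈ $0.0083
// one avoided miss covers roughly a dozen pings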

Reality Check

Cache write calls cost ~25% above standard input rates. The math only works when reads substantially outnumber writes. For genuinely low-frequency batch systems, evaluate your actual call pattern before assuming caching helps.
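That call-pattern judgment reduces to one inequality — a sketch using this post's multipliers (1.25× writes, 0.10× reads), in units of the standard input rate times prompt tokens:

// W = cache writes, R = cache reads over some window
//   cached:   1.25·W + 0.10·R
//   uncached: 1.00·W + 1.00·R
// Caching wins when 1.25W + 0.1R < W + R  ⇒  R/W > 0.25/0.9 ≈ 0.28 —
// roughly one cache read per ~3.5 writes already breaks even.
function cachingWins(writes: number, reads: number): boolean {
  return 1.25 * writes + 0.10 * reads < writes + reads;
}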

The Bigger Picture

Prompt caching is one part of a broader principle: understand what you're paying for at the infrastructure level. Most teams treat AI APIs as a black box — input money, receive intelligence. That works until scale. Then every unoptimized call compounds.

Caching is the most immediate lever. Temperature tuning per agent is next. Model routing by task complexity after that. SSE streaming for perceived latency. Fallback hierarchies for reliability without overpaying.
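One of those levers, sketched — an entirely illustrative heuristic, not the production router; the thresholds and task shape are assumptions:

// Route cheap, fast turns to Haiku; reserve Sonnet for heavy reasoning
function pickModel(task: { promptTokens: number; needsDeepReasoning: boolean }): string {
  return task.needsDeepReasoning || task.promptTokens > 4_000
    ? 'claude-sonnet-4-20250514'   // pricier, stronger
    : 'claude-haiku-4-5-20251001'; // cheaper for simple turns
}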

I built PropTechUSA.ai's entire stack self-taught. No CS degree. Learned through logs, iteration, and reading the documentation past the quickstart. That discipline — understanding the machine you're running — is what turns a $10K/month API budget into a $1K/month one with better output.

The game is optimization. Start with caching. Don't stop there.
