Most multi-agent AI builders are flying blind. They know what their monthly API bill is. They don't know which agent is responsible for 40% of it, which one has a cache hit rate of 12%, or what their Round 2 trigger rate is actually telling them about agent quality. Observability isn't a nice-to-have at this layer. It's how you find the bugs that don't throw errors.
The Five Metrics That Matter
Most AI logging starts and ends with total tokens and total cost. That's like monitoring a server farm by checking the total electricity bill. You know you're paying, but you don't know why, and you can't find the problem. For a multi-agent system, five metrics compound into actual insight:
Cache hit rate per agent is the most valuable single metric in the system. Below 70% for any agent means you're paying full input price on most calls — the 90% discount that makes the system viable is gone. It must be tracked per agent, not in aggregate. A single agent with a broken cache is invisible in aggregate numbers if the other nine are healthy.
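The per-call computation is tiny. A minimal sketch, assuming the Anthropic usage field names used throughout this piece (`cacheHitRate` itself is a hypothetical helper, not part of the system):

```typescript
// Subset of the Anthropic usage block relevant to cache metrics.
interface Usage {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Fraction of input that was served from cache:
// cache_read / (cache_read + regular input).
function cacheHitRate(u: Usage): number {
  const reads = u.cache_read_input_tokens ?? 0;
  const denom = reads + u.input_tokens;
  return denom === 0 ? 0 : reads / denom;
}
```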
Round 2 trigger rate is a quality signal, not just a cost signal. Healthy rate: 20–40% of queries. Below 20% means agents are converging too easily — their epistemic fingerprints aren't differentiated enough. Above 50% means agents are generating noise disagreements rather than substantive ones — the orchestrator is misreading the tension map.
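Those bands can be encoded as a simple health check. A sketch under the thresholds above; `classifyRound2Rate` and the label names are hypothetical, and the 0.40 to 0.50 range is treated as still healthy here since the text only flags rates above 0.50 as noise:

```typescript
type Round2Health = 'too_agreeable' | 'healthy' | 'noisy';

// Healthy band: 0.20-0.40. Below 0.20: fingerprints not differentiated
// enough. Above 0.50: noise disagreements. 0.40-0.50 is a gray zone,
// classified as healthy in this sketch.
function classifyRound2Rate(rate: number): Round2Health {
  if (rate < 0.20) return 'too_agreeable';
  if (rate > 0.50) return 'noisy';
  return 'healthy';
}
```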
The Analytics Engine Schema
Cloudflare Analytics Engine is a time-series store built into the Workers platform. No external service, no egress cost, queryable via SQL. For a system already running on Workers, it's the natural logging target — write a data point from inside the Worker with zero added latency, query it from anywhere.
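For reference, the Analytics Engine dataset is declared as a binding in wrangler.toml. The binding name below matches the `env.AI_ANALYTICS` used in the logging code; the dataset name matches the `ai_analytics` table queried later, though your names may differ:

```toml
# Analytics Engine binding — binding must match env.AI_ANALYTICS
# in the Worker; dataset is the table name used in SQL queries.
[[analytics_engine_datasets]]
binding = "AI_ANALYTICS"
dataset = "ai_analytics"
```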
```typescript
// Called after every agent response, before streaming closes.
// One data point per agent call — 11 per full Consilium query.

// Subset of the Anthropic usage block consumed below.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

export interface AgentMetric {
  // BLOBS — string dimensions for filtering/grouping
  blobs: [
    agentId: string,   // 'vasquez' | 'webb' | 'chen' | ... — per-agent filtering
    queryId: string,   // UUID — correlate all 11 calls for one user query
    model: string,     // which model actually responded (fallback detection)
    errorType: string, // '' | '429' | '500' | 'timeout' | 'parse_error'
    round: string,     // '1' | '2' — was this a round 2 re-query?
  ];
  // DOUBLES — numeric metrics for aggregation
  doubles: [
    inputTokens: number,         // usage.input_tokens
    outputTokens: number,        // usage.output_tokens
    cacheReadTokens: number,     // usage.cache_read_input_tokens ?? 0
    cacheCreationTokens: number, // usage.cache_creation_input_tokens ?? 0
    ttftMs: number,              // ms from request start to first token chunk
    totalLatencyMs: number,      // ms from request start to stream close
    estimatedCostUsd: number,    // computed: see cost formula below
    round2Triggered: number,     // 1 | 0 — orchestrator field only
  ];
}

export function logAgentCall(data: AgentMetric, env: Env) {
  // Non-blocking — fire and forget, don't await
  env.AI_ANALYTICS.writeDataPoint(data);
}

// Cost formula for Sonnet 4 with caching:
//   Regular input: $3.00 / 1M tokens
//   Cache write:   $3.75 / 1M tokens (1.25× premium)
//   Cache read:    $0.30 / 1M tokens (90% discount)
//   Output:        $15.00 / 1M tokens
function estimateCost(u: Usage): number {
  return (
    (u.input_tokens / 1_000_000) * 3.00 +
    ((u.cache_creation_input_tokens ?? 0) / 1_000_000) * 3.75 +
    ((u.cache_read_input_tokens ?? 0) / 1_000_000) * 0.30 +
    (u.output_tokens / 1_000_000) * 15.00
  );
}
```
Each full Consilium query generates 11 data points — one per worker. Without a shared queryId UUID, those 11 calls are unrelated rows in your analytics store. With it, you can GROUP BY queryId to see total cost and latency for a single user query, correlate which agents triggered Round 2, and see the full cost breakdown of any specific session. Generate the UUID in the client, pass it to the orchestrator, orchestrator fans it out to all agents.
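The fan-out invariant can be sketched in a few lines. `buildAgentPayloads` is a hypothetical helper and the agent IDs are illustrative; the point is the one stated above: every payload carries the same `queryId`, generated once with `crypto.randomUUID()`:

```typescript
interface AgentPayload {
  agentId: string;
  message: string;
  queryId: string;
}

// Build one request payload per agent, all sharing a single queryId
// so the 11 resulting data points can be correlated with GROUP BY queryId.
function buildAgentPayloads(
  message: string,
  queryId: string,
  agentIds: string[],
): AgentPayload[] {
  return agentIds.map(agentId => ({ agentId, message, queryId }));
}

// Usage (client side): const queryId = crypto.randomUUID();
```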
The SQL Queries That Actually Tell You Something
```sql
-- 1. Cache hit rate per agent (last 24h) — the most important query
--    Alert if any agent drops below 0.70
SELECT
  blob1 AS agent_id,
  SUM(double3) AS cache_reads,
  SUM(double1) AS regular_inputs,
  SUM(double3) / (SUM(double3) + SUM(double1)) AS cache_hit_rate,
  COUNT(*) AS total_calls
FROM ai_analytics
WHERE timestamp > NOW() - INTERVAL '24' HOUR
GROUP BY agent_id
ORDER BY cache_hit_rate ASC;

-- 2. TTFT percentiles per agent (last 1h)
--    P95 outlier = system prompt too large OR model under load
SELECT
  blob1 AS agent_id,
  QUANTILE(double5, 0.50) AS p50_ttft_ms,
  QUANTILE(double5, 0.95) AS p95_ttft_ms,
  QUANTILE(double5, 0.99) AS p99_ttft_ms
FROM ai_analytics
WHERE timestamp > NOW() - INTERVAL '1' HOUR
  AND blob4 = '' -- exclude errors
GROUP BY agent_id;

-- 3. Round 2 trigger rate (rolling 7 days)
--    Healthy: 0.20-0.40. Below: agents agree too much. Above: noise.
SELECT
  SUM(double8) AS round2_triggers,
  COUNT(*) AS total_queries,
  SUM(double8) / COUNT(*) AS trigger_rate
FROM ai_analytics
WHERE blob1 = 'orchestrator'
  AND timestamp > NOW() - INTERVAL '7' DAY;

-- 4. Cost breakdown by agent (MTD)
SELECT
  blob1 AS agent_id,
  SUM(double7) AS total_cost_usd,
  AVG(double7) AS avg_cost_per_call,
  COUNT(*) AS calls
FROM ai_analytics
WHERE timestamp > DATE_TRUNC('month', NOW())
GROUP BY agent_id
ORDER BY total_cost_usd DESC;

-- 5. Error breakdown by type (last 6h)
SELECT
  blob4 AS error_type,
  blob1 AS agent_id,
  COUNT(*) AS count
FROM ai_analytics
WHERE blob4 != ''
  AND timestamp > NOW() - INTERVAL '6' HOUR
GROUP BY error_type, agent_id
ORDER BY count DESC;
```
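These queries can be run from anywhere over HTTP against Cloudflare's Analytics Engine SQL endpoint. A sketch with the account ID and API token as placeholders you supply; `buildAnalyticsRequest` is split out so the request shape is testable without a network call:

```typescript
// Build the HTTP request for the Analytics Engine SQL API.
// The raw SQL string goes in the request body.
function buildAnalyticsRequest(sql: string, accountId: string, apiToken: string) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    init: {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiToken}` },
      body: sql,
    },
  };
}

// Execute a query and return the parsed JSON result.
async function queryAnalytics(sql: string, accountId: string, apiToken: string) {
  const { url, init } = buildAnalyticsRequest(sql, accountId, apiToken);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Analytics query failed: ${res.status}`);
  return res.json();
}
```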
The Live Dashboard
Agent Health At a Glance
| Agent | Cache Hit Rate | P50 TTFT | P95 TTFT | Avg Cost/Call | Status |
|---|---|---|---|---|---|
Where To Instrument
```typescript
export default {
  async fetch(req: Request, env: Env) {
    if (!verifyInternalAuth(req, env)) {
      return new Response('Unauthorized', { status: 401 });
    }
    const { message, queryId } = await req.json();
    const requestStart = Date.now(); // ← T0: request received
    let ttftMs = 0;
    let firstChunk = true;

    const { readable, writable } = new TransformStream();
    const writer = writable.getWriter();

    // Upstream fetch — don't await, pipe through transform
    callAnthropicStream(message, env).then(upstream => {
      const reader = upstream.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      let usage: Usage | null = null;

      async function pump() {
        const { done, value } = await reader.read();
        if (done) {
          // T3: stream closed — log everything
          logAgentCall({
            blobs: ['vasquez', queryId, 'claude-sonnet-4-20250514', '', '1'],
            doubles: [
              usage?.input_tokens ?? 0,
              usage?.output_tokens ?? 0,
              usage?.cache_read_input_tokens ?? 0,
              usage?.cache_creation_input_tokens ?? 0,
              ttftMs,                    // T1 - T0
              Date.now() - requestStart, // T3 - T0
              usage ? estimateCost(usage) : 0, // guard: usage may never arrive
              0, // round2Triggered — orchestrator field only
            ],
          }, env);
          writer.close();
          return;
        }

        buffer += decoder.decode(value, { stream: true });
        const events = buffer.split('\n\n');
        buffer = events.pop() ?? '';

        for (const event of events) {
          const dataLine = event.split('\n').find(l => l.startsWith('data: '));
          if (!dataLine) continue;
          try {
            const evt = JSON.parse(dataLine.slice(6));
            if (firstChunk && evt.type === 'content_block_delta') {
              ttftMs = Date.now() - requestStart; // ← T1: first token
              firstChunk = false;
            }
            if (evt.type === 'message_delta' && evt.usage) usage = evt.usage;
          } catch {}
          writer.write(new TextEncoder().encode(event + '\n\n'));
        }
        pump();
      }
      pump();
    });

    return new Response(readable, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
        'X-Accel-Buffering': 'no',
      },
    });
  },
};
```
The Anthropic `usage` object (`input_tokens`, `cache_read_input_tokens`, etc.) arrives in the `message_delta` event near the end of the stream, not at the beginning. This means you can't log token counts until the stream closes. The pattern: accumulate the `usage` object as it arrives during streaming, then log everything in the `done` branch. TTFT is measured independently, at the first `content_block_delta` event.
Three weeks after adding observability, the per-agent cache hit rate query flagged something: Dr. Vivienne Cross had a 12% cache hit rate. Every other agent was above 85%.
The agent was responding correctly. No errors, no timeout spikes, no cost anomaly visible in the aggregate numbers — Cross is one of ten agents, so her cost was 10% of the total, and the total looked normal. Without per-agent metrics, this would have run for months at 8× the correct input cost.
The investigation: pulled the Cross system prompt and ran it through Anthropic's token counter with cache breakpoint analysis. The culprit was immediately visible. Six weeks earlier, during a persona refinement, a dynamic field had been inserted near the top of the system prompt — a timestamp placeholder left over from a debugging session: `Current date: {{TODAY}}`. It was interpolated at request time, which meant it changed on every call. The cache breakpoint was set after this field, so the cached prefix was never byte-identical between calls — the entire prompt was treated as new and re-written to cache on every single request.
The fix: remove the field entirely (the agent didn't need it); the alternative would have been moving it below the cache breakpoint. Cache hit rate went from 12% to 91% within the next hour of traffic, and monthly cost for that one agent dropped by 88%.
The lesson: anything placed before your cache_control: ephemeral breakpoint that changes between calls destroys the cache. Timestamps, session IDs, user names, request-time variables — all of them have to live below the breakpoint or in the messages array, never in the static system prompt. You will not catch this without per-agent cache metrics.
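A sketch of the corrected request shape, following the Anthropic Messages API's system-block format: the static persona text carries the `cache_control` breakpoint, and everything per-request lives in the messages array. `buildCachedRequest` is a hypothetical helper.

```typescript
// Static persona text carries the breakpoint; dynamic values go in messages.
function buildCachedRequest(staticPersonaPrompt: string, userMessage: string) {
  return {
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: staticPersonaPrompt, // must be byte-identical on every call
        cache_control: { type: 'ephemeral' }, // breakpoint: prefix above is cached
      },
    ],
    messages: [
      // Request-time values (dates, session IDs, user names) live below
      // the breakpoint, so they never invalidate the cached prefix.
      {
        role: 'user',
        content: `Current date: ${new Date().toISOString().slice(0, 10)}\n\n${userMessage}`,
      },
    ],
  };
}
```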
The Alert Thresholds
```typescript
const ALERT_THRESHOLDS = {
  // Cache: below this per agent → 10× cost multiplier on input tokens
  cacheHitRate: { warn: 0.80, critical: 0.70 },
  // TTFT: above this = prompt size issue or model degradation
  ttftP95Ms: { warn: 1200, critical: 2500 },
  // Round 2: outside this band = fingerprint quality problem
  round2TriggerRate: { low: 0.15, high: 0.50 },
  // Error rate: consecutive failures = fallback not working
  errorRatePerAgent: { warn: 0.03, critical: 0.10 },
  // Cost: daily spend exceeds budget signal
  dailyCostUsd: { warn: 15, critical: 25 },
};

async function runAlertChecks(env: Env) {
  const results = await queryCacheHitRates(env);
  for (const agent of results) {
    if (agent.cache_hit_rate < ALERT_THRESHOLDS.cacheHitRate.critical) {
      await sendSlackAlert({
        level: 'CRITICAL',
        message: `Agent ${agent.agent_id} cache hit rate: ${(agent.cache_hit_rate * 100).toFixed(1)}%`,
        action: 'Check system prompt for dynamic fields before cache breakpoint',
      }, env);
    }
  }
}

// Run via Cloudflare Cron Trigger every 15 minutes
// wrangler.toml: [triggers] crons = ["*/15 * * * *"]
```
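The cron wiring itself is a few lines. A sketch with `runAlertChecks` stubbed so the fragment stands alone; `alertLevel` is a hypothetical pure helper applying the same warn/critical cache thresholds:

```typescript
// Map a cache hit rate to an alert level using the thresholds above.
function alertLevel(rate: number, warn = 0.80, critical = 0.70): 'ok' | 'WARN' | 'CRITICAL' {
  if (rate < critical) return 'CRITICAL';
  if (rate < warn) return 'WARN';
  return 'ok';
}

// Stub — in the real Worker this is the runAlertChecks defined above.
const runAlertChecks = async (_env: unknown): Promise<void> => {};

// Module Worker entry point: scheduled() fires on every cron tick
// configured in wrangler.toml ("*/15 * * * *").
export default {
  async scheduled(
    _event: unknown,
    env: unknown,
    ctx: { waitUntil(p: Promise<unknown>): void },
  ) {
    ctx.waitUntil(runAlertChecks(env)); // keep the Worker alive until checks finish
  },
};
```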
Track `cache_read_input_tokens` and `cache_creation_input_tokens` from the Anthropic usage object on every response. Cache hit rate = cache_read / (cache_read + input_tokens). Track it per agent, not in aggregate — a single agent with a broken cache is invisible in aggregate if the other nine are healthy. Alert when any agent drops below 70% over a 1-hour window; the most common cause is a dynamic field placed before the cache breakpoint in the system prompt.

Every Worker imports `logAgentCall()` from shared and calls it after every response. The shared function writes to Analytics Engine with the agent ID as a blob field, enabling per-agent filtering in every query. One change to the logging schema propagates to all 11 Workers on the next deploy.

The Consilium runs all 11 Workers with full observability — cache rates, TTFT, cost curves, and tension map quality signals logged on every call.