Most multi-agent AI builders are flying blind. They know what their monthly API bill is. They don't know which agent is responsible for 40% of it, which one has a cache hit rate of 12%, or what their Round 2 trigger rate is actually telling them about agent quality. Observability isn't a nice-to-have at this layer. It's how you find the bugs that don't throw errors.
The Five Metrics That Matter
Most AI logging starts and ends with total tokens and total cost. That's like monitoring a server farm by checking the total electricity bill. You know you're paying, but you don't know why, and you can't find the problem. For a multi-agent system, five metrics compound into actual insight:
Cache hit rate per agent is the most valuable single metric in the system. Below 70% for any agent means you're paying full input price on most calls — the 90% discount that makes the system viable is gone. It must be tracked per agent, not in aggregate. A single agent with a broken cache is invisible in aggregate numbers if the other nine are healthy.
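The per-call computation is tiny. A minimal sketch, assuming the Anthropic usage field names used throughout this piece (`cacheHitRate` itself is a hypothetical helper, not part of the system):

```typescript
// Subset of the Anthropic usage block relevant to cache metrics.
interface Usage {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Fraction of input that was served from cache:
// cache_read / (cache_read + regular input).
function cacheHitRate(u: Usage): number {
  const reads = u.cache_read_input_tokens ?? 0;
  const denom = reads + u.input_tokens;
  return denom === 0 ? 0 : reads / denom;
}
```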
Round 2 trigger rate is a quality signal, not just a cost signal. Healthy rate: 20–40% of queries. Below 20% means agents are converging too easily — their epistemic fingerprints aren't differentiated enough. Above 50% means agents are generating noise disagreements rather than substantive ones — the orchestrator is misreading the tension map.
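Those bands can be encoded as a simple health check. A sketch under the thresholds above; `classifyRound2Rate` and the label names are hypothetical, and the 0.40 to 0.50 range is treated as still healthy here since the text only flags rates above 0.50 as noise:

```typescript
type Round2Health = 'too_agreeable' | 'healthy' | 'noisy';

// Healthy band: 0.20-0.40. Below 0.20: fingerprints not differentiated
// enough. Above 0.50: noise disagreements. 0.40-0.50 is a gray zone,
// classified as healthy in this sketch.
function classifyRound2Rate(rate: number): Round2Health {
  if (rate < 0.20) return 'too_agreeable';
  if (rate > 0.50) return 'noisy';
  return 'healthy';
}
```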
The Analytics Engine Schema
Cloudflare Analytics Engine is a time-series store built into the Workers platform. No external service, no egress cost, queryable via SQL. For a system already running on Workers, it's the natural logging target — write a data point from inside the Worker with zero added latency, query it from anywhere.
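For reference, the Analytics Engine dataset is declared as a binding in wrangler.toml. The binding name below matches the `env.AI_ANALYTICS` used in the logging code; the dataset name matches the `ai_analytics` table queried later, though your names may differ:

```toml
# Analytics Engine binding — binding must match env.AI_ANALYTICS
# in the Worker; dataset is the table name used in SQL queries.
[[analytics_engine_datasets]]
binding = "AI_ANALYTICS"
dataset = "ai_analytics"
```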
```typescript
// Called after every agent response, before streaming closes.
// One data point per agent call — 11 per full Consilium query.

// Subset of the Anthropic usage block consumed below.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

export interface AgentMetric {
  // BLOBS — string dimensions for filtering/grouping
  blobs: [
    agentId: string,   // 'vasquez' | 'webb' | 'chen' | ... — per-agent filtering
    queryId: string,   // UUID — correlate all 11 calls for one user query
    model: string,     // which model actually responded (fallback detection)
    errorType: string, // '' | '429' | '500' | 'timeout' | 'parse_error'
    round: string,     // '1' | '2' — was this a round 2 re-query?
  ];
  // DOUBLES — numeric metrics for aggregation
  doubles: [
    inputTokens: number,         // usage.input_tokens
    outputTokens: number,        // usage.output_tokens
    cacheReadTokens: number,     // usage.cache_read_input_tokens ?? 0
    cacheCreationTokens: number, // usage.cache_creation_input_tokens ?? 0
    ttftMs: number,              // ms from request start to first token chunk
    totalLatencyMs: number,      // ms from request start to stream close
    estimatedCostUsd: number,    // computed: see cost formula below
    round2Triggered: number,     // 1 | 0 — orchestrator field only
  ];
}

export function logAgentCall(data: AgentMetric, env: Env) {
  // Non-blocking — fire and forget, don't await
  env.AI_ANALYTICS.writeDataPoint(data);
}

// Cost formula for Sonnet 4 with caching:
//   Regular input: $3.00 / 1M tokens
//   Cache write:   $3.75 / 1M tokens (1.25× premium)
//   Cache read:    $0.30 / 1M tokens (90% discount)
//   Output:        $15.00 / 1M tokens
function estimateCost(u: Usage): number {
  return (
    (u.input_tokens / 1_000_000) * 3.00 +
    ((u.cache_creation_input_tokens ?? 0) / 1_000_000) * 3.75 +
    ((u.cache_read_input_tokens ?? 0) / 1_000_000) * 0.30 +
    (u.output_tokens / 1_000_000) * 15.00
  );
}
```
Each full Consilium query generates 11 data points — one per worker. Without a shared queryId UUID, those 11 calls are unrelated rows in your analytics store. With it, you can GROUP BY queryId to see total cost and latency for a single user query, correlate which agents triggered Round 2, and see the full cost breakdown of any specific session. Generate the UUID in the client, pass it to the orchestrator, orchestrator fans it out to all agents.
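The fan-out invariant can be sketched in a few lines. `buildAgentPayloads` is a hypothetical helper and the agent IDs are illustrative; the point is the one stated above: every payload carries the same `queryId`, generated once with `crypto.randomUUID()`:

```typescript
interface AgentPayload {
  agentId: string;
  message: string;
  queryId: string;
}

// Build one request payload per agent, all sharing a single queryId
// so the 11 resulting data points can be correlated with GROUP BY queryId.
function buildAgentPayloads(
  message: string,
  queryId: string,
  agentIds: string[],
): AgentPayload[] {
  return agentIds.map(agentId => ({ agentId, message, queryId }));
}

// Usage (client side): const queryId = crypto.randomUUID();
```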
The SQL Queries That Actually Tell You Something
```sql
-- 1. Cache hit rate per agent (last 24h) — the most important query
--    Alert if any agent drops below 0.70
SELECT
  blob1 AS agent_id,
  SUM(double3) AS cache_reads,
  SUM(double1) AS regular_inputs,
  SUM(double3) / (SUM(double3) + SUM(double1)) AS cache_hit_rate,
  COUNT(*) AS total_calls
FROM ai_analytics
WHERE timestamp > NOW() - INTERVAL '24' HOUR
GROUP BY agent_id
ORDER BY cache_hit_rate ASC;

-- 2. TTFT percentiles per agent (last 1h)
--    P95 outlier = system prompt too large OR model under load
SELECT
  blob1 AS agent_id,
  QUANTILE(double5, 0.50) AS p50_ttft_ms,
  QUANTILE(double5, 0.95) AS p95_ttft_ms,
  QUANTILE(double5, 0.99) AS p99_ttft_ms
FROM ai_analytics
WHERE timestamp > NOW() - INTERVAL '1' HOUR
  AND blob4 = '' -- exclude errors
GROUP BY agent_id;

-- 3. Round 2 trigger rate (rolling 7 days)
--    Healthy: 0.20-0.40. Below: agents agree too much. Above: noise.
SELECT
  SUM(double8) AS round2_triggers,
  COUNT(*) AS total_queries,
  SUM(double8) / COUNT(*) AS trigger_rate
FROM ai_analytics
WHERE blob1 = 'orchestrator'
  AND timestamp > NOW() - INTERVAL '7' DAY;

-- 4. Cost breakdown by agent (MTD)
SELECT
  blob1 AS agent_id,
  SUM(double7) AS total_cost_usd,
  AVG(double7) AS avg_cost_per_call,
  COUNT(*) AS calls
FROM ai_analytics
WHERE timestamp > DATE_TRUNC('month', NOW())
GROUP BY agent_id
ORDER BY total_cost_usd DESC;

-- 5. Error breakdown by type (last 6h)
SELECT
  blob4 AS error_type,
  blob1 AS agent_id,
  COUNT(*) AS count
FROM ai_analytics
WHERE blob4 != ''
  AND timestamp > NOW() - INTERVAL '6' HOUR
GROUP BY error_type, agent_id
ORDER BY count DESC;
```
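These queries can be run from anywhere over HTTP against Cloudflare's Analytics Engine SQL endpoint. A sketch with the account ID and API token as placeholders you supply; `buildAnalyticsRequest` is split out so the request shape is testable without a network call:

```typescript
// Build the HTTP request for the Analytics Engine SQL API.
// The raw SQL string goes in the request body.
function buildAnalyticsRequest(sql: string, accountId: string, apiToken: string) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    init: {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiToken}` },
      body: sql,
    },
  };
}

// Execute a query and return the parsed JSON result.
async function queryAnalytics(sql: string, accountId: string, apiToken: string) {
  const { url, init } = buildAnalyticsRequest(sql, accountId, apiToken);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Analytics query failed: ${res.status}`);
  return res.json();
}
```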
The Live Dashboard
Agent Health At a Glance
| Agent | Cache Hit Rate | P50 TTFT | P95 TTFT | Avg Cost/Call | Status |
|---|---|---|---|---|---|
Where To Instrument
```typescript
export default {
  async fetch(req: Request, env: Env) {
    if (!verifyInternalAuth(req, env)) {
      return new Response('Unauthorized', { status: 401 });
    }
    const { message, queryId } = await req.json();
    const requestStart = Date.now(); // ← T0: request received
    let ttftMs = 0;
    let firstChunk = true;

    const { readable, writable } = new TransformStream();
    const writer = writable.getWriter();

    // Upstream fetch — don't await, pipe through transform
    callAnthropicStream(message, env).then(upstream => {
      const reader = upstream.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      let usage: Usage | null = null;

      async function pump() {
        const { done, value } = await reader.read();
        if (done) {
          // T3: stream closed — log everything
          logAgentCall({
            blobs: ['vasquez', queryId, 'claude-sonnet-4-20250514', '', '1'],
            doubles: [
              usage?.input_tokens ?? 0,
              usage?.output_tokens ?? 0,
              usage?.cache_read_input_tokens ?? 0,
              usage?.cache_creation_input_tokens ?? 0,
              ttftMs,                    // T1 - T0
              Date.now() - requestStart, // T3 - T0
              usage ? estimateCost(usage) : 0, // guard: usage may never arrive
              0, // round2Triggered — orchestrator field only
            ],
          }, env);
          writer.close();
          return;
        }

        buffer += decoder.decode(value, { stream: true });
        const events = buffer.split('\n\n');
        buffer = events.pop() ?? '';

        for (const event of events) {
          const dataLine = event.split('\n').find(l => l.startsWith('data: '));
          if (!dataLine) continue;
          try {
            const evt = JSON.parse(dataLine.slice(6));
            if (firstChunk && evt.type === 'content_block_delta') {
              ttftMs = Date.now() - requestStart; // ← T1: first token
              firstChunk = false;
            }
            if (evt.type === 'message_delta' && evt.usage) usage = evt.usage;
          } catch {}
          writer.write(new TextEncoder().encode(event + '\n\n'));
        }
        pump();
      }
      pump();
    });

    return new Response(readable, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
        'X-Accel-Buffering': 'no',
      },
    });
  },
};
```
The Anthropic `usage` object (`input_tokens`, `cache_read_input_tokens`, etc.) arrives in the `message_delta` event near the end of the stream, not at the beginning. This means you can't log token counts until the stream closes. The pattern: accumulate the `usage` object as it arrives during streaming, then log everything in the `done` branch. TTFT is measured independently, at the first `content_block_delta` event.
Three weeks after adding observability, the per-agent cache hit rate query flagged something: Dr. Vivienne Cross had a 12% cache hit rate. Every other agent was above 85%.
The agent was responding correctly. No errors, no timeout spikes, no cost anomaly visible in the aggregate numbers — Cross is one of ten agents, so her cost was 10% of the total, and the total looked normal. Without per-agent metrics, this would have run for months at 8× the correct input cost.
The investigation: pulled the Cross system prompt and ran it through Anthropic's token counter with cache breakpoint analysis. The culprit was immediately visible. Six weeks earlier, during a persona refinement, a dynamic field had been inserted near the top of the system prompt — a timestamp placeholder left over from a debugging session: `Current date: {{TODAY}}`. It was interpolated at request time, which meant it changed on every call. The cache breakpoint was set after this field, so the cached prefix was never byte-identical between calls — the entire prompt was treated as new and re-written to cache on every single request.
The fix: remove the field entirely (the agent didn't need it); the alternative would have been moving it below the cache breakpoint. Cache hit rate went from 12% to 91% within the next hour of traffic, and monthly cost for that one agent dropped by 88%.
The lesson: anything placed before your cache_control: ephemeral breakpoint that changes between calls destroys the cache. Timestamps, session IDs, user names, request-time variables — all of them have to live below the breakpoint or in the messages array, never in the static system prompt. You will not catch this without per-agent cache metrics.
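A sketch of the corrected request shape, following the Anthropic Messages API's system-block format: the static persona text carries the `cache_control` breakpoint, and everything per-request lives in the messages array. `buildCachedRequest` is a hypothetical helper.

```typescript
// Static persona text carries the breakpoint; dynamic values go in messages.
function buildCachedRequest(staticPersonaPrompt: string, userMessage: string) {
  return {
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: staticPersonaPrompt, // must be byte-identical on every call
        cache_control: { type: 'ephemeral' }, // breakpoint: prefix above is cached
      },
    ],
    messages: [
      // Request-time values (dates, session IDs, user names) live below
      // the breakpoint, so they never invalidate the cached prefix.
      {
        role: 'user',
        content: `Current date: ${new Date().toISOString().slice(0, 10)}\n\n${userMessage}`,
      },
    ],
  };
}
```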
The Alert Thresholds
```typescript
const ALERT_THRESHOLDS = {
  // Cache: below this per agent → 10× cost multiplier on input tokens
  cacheHitRate: { warn: 0.80, critical: 0.70 },
  // TTFT: above this = prompt size issue or model degradation
  ttftP95Ms: { warn: 1200, critical: 2500 },
  // Round 2: outside this band = fingerprint quality problem
  round2TriggerRate: { low: 0.15, high: 0.50 },
  // Error rate: consecutive failures = fallback not working
  errorRatePerAgent: { warn: 0.03, critical: 0.10 },
  // Cost: daily spend exceeds budget signal
  dailyCostUsd: { warn: 15, critical: 25 },
};

async function runAlertChecks(env: Env) {
  const results = await queryCacheHitRates(env);
  for (const agent of results) {
    if (agent.cache_hit_rate < ALERT_THRESHOLDS.cacheHitRate.critical) {
      await sendSlackAlert({
        level: 'CRITICAL',
        message: `Agent ${agent.agent_id} cache hit rate: ${(agent.cache_hit_rate * 100).toFixed(1)}%`,
        action: 'Check system prompt for dynamic fields before cache breakpoint',
      }, env);
    }
  }
}

// Run via Cloudflare Cron Trigger every 15 minutes
// wrangler.toml: [triggers] crons = ["*/15 * * * *"]
```
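The cron wiring itself is a few lines. A sketch with `runAlertChecks` stubbed so the fragment stands alone; `alertLevel` is a hypothetical pure helper applying the same warn/critical cache thresholds:

```typescript
// Map a cache hit rate to an alert level using the thresholds above.
function alertLevel(rate: number, warn = 0.80, critical = 0.70): 'ok' | 'WARN' | 'CRITICAL' {
  if (rate < critical) return 'CRITICAL';
  if (rate < warn) return 'WARN';
  return 'ok';
}

// Stub — in the real Worker this is the runAlertChecks defined above.
const runAlertChecks = async (_env: unknown): Promise<void> => {};

// Module Worker entry point: scheduled() fires on every cron tick
// configured in wrangler.toml ("*/15 * * * *").
export default {
  async scheduled(
    _event: unknown,
    env: unknown,
    ctx: { waitUntil(p: Promise<unknown>): void },
  ) {
    ctx.waitUntil(runAlertChecks(env)); // keep the Worker alive until checks finish
  },
};
```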
Track `cache_read_input_tokens` and `cache_creation_input_tokens` from the Anthropic usage object on every response. Cache hit rate = cache_read / (cache_read + input_tokens). Track it per agent, not in aggregate — a single agent with a broken cache is invisible in aggregate if the other nine are healthy. Alert when any agent drops below 70% over a 1-hour window; the most common cause is a dynamic field placed before the cache breakpoint in the system prompt.

Every Worker imports `logAgentCall()` from shared and calls it after every response. The shared function writes to Analytics Engine with the agent ID as a blob field, enabling per-agent filtering in every query. One change to the logging schema propagates to all 11 Workers on the next deploy.

The Consilium runs all 11 Workers with full observability — cache rates, TTFT, cost curves, and tension map quality signals logged on every call.