WORKERS: 11/11 OPERATIONAL/// ANTHROPIC API: NOMINAL/// CIRCUIT BREAKERS: ALL CLOSED/// CACHE HIT RATE: 88.4%/// LAST INCIDENT: 47 DAYS AGO/// FALLBACK ACTIVATIONS (7D): 3/// P95 LATENCY: 11.8s
Resilience Architecture · Post #09 · Production Hardening

99.97% Uptime on an 11-Worker AI System.

Fallback chains. Circuit breakers. Graceful degradation. Zero-downtime deploys. And the 429 storm that hit during peak traffic — and why nobody noticed.


The happy path is easy to build. The Consilium's happy path — 11 workers, all healthy, Anthropic's API nominal, all caches warm — is documented in posts 1 through 8. This post is for the other path. When one agent fails. When the model rate-limits. When you're deploying at 2pm on a Tuesday and something goes wrong. Here's exactly what happens and how to make sure it doesn't matter.

Circuit Breakers: The Pattern That Saved Us

Without a circuit breaker, a failing API dependency kills your latency. Every request attempts the primary, waits for the timeout (typically 10–30 seconds), then falls back. At concurrent load, those timeouts stack. Your P95 goes from 11 seconds to 40 seconds. Your queue builds. Users leave.

With a circuit breaker, the first few failures trip the circuit to OPEN state. Subsequent requests skip the primary entirely and route directly to the fallback. No timeout wait. The system degrades in quality (Haiku instead of Sonnet) but not in responsiveness. When the cool-down period expires, the circuit moves to HALF-OPEN — one test request through. If it succeeds, the circuit closes. If it fails, it re-opens.

// Circuit Breaker State Machine — Live Simulation
shared/circuit-breaker.ts — full implementation
Production Code
type CBState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface CircuitBreakerConfig {
  failureThreshold: number;   // consecutive failures to trip — 3
  cooldownMs: number;         // wait before half-open test — 60_000
  successThreshold: number;   // successes to close from half-open — 2
  model: string;             // which model this breaker guards
}

class CircuitBreaker {
  private state: CBState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(private readonly config: CircuitBreakerConfig) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.config.cooldownMs) {
        throw new Error('CIRCUIT_OPEN'); // immediate — no wait
      }
      this.state = 'HALF_OPEN';
      this.successes = 0; // fresh count for the half-open test window
    }

    try {
      const result = await fn();
      // Success path — increment or close circuit
      if (this.state === 'HALF_OPEN') {
        this.successes++;
        if (this.successes >= this.config.successThreshold) {
          this.close();
        }
      } else {
        this.failures = 0; // reset on success
      }
      return result;
    } catch (err) {
      // Failure path — count and potentially open;
      // a failed half-open test re-opens immediately
      this.failures++;
      if (this.state === 'HALF_OPEN' || this.failures >= this.config.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }

  private close() {
    this.state = 'CLOSED'; this.failures = 0; this.successes = 0;
  }
}

The Fallback Chain

// Multi-Model Fallback — Every Worker
1 · PRIMARY · claude-sonnet-4-20250514
WHEN: Circuit CLOSED · normal operation
Latency: P50 ~380ms TTFT · Full epistemic fingerprint quality
2 · FALLBACK · claude-haiku-4-5-20251001
WHEN: Primary returns 429/5xx/timeout OR circuit OPEN
Latency: P50 ~180ms TTFT · +200ms for failover handoff · Reduced nuance
3 · LAST RESORT · Cached degraded response
WHEN: Both models fail · All Anthropic endpoints unreachable
Returns last successful response for this agent + staleness timestamp · logs DEGRADED_MODE
shared/stream-with-fallback.ts — the full fallback function
Production
const MODELS = [
  'claude-sonnet-4-20250514',
  'claude-haiku-4-5-20251001',
] as const;

const breakers = new Map(MODELS.map(m => [m, new CircuitBreaker({
  failureThreshold: 3,
  cooldownMs: 60_000,
  successThreshold: 2,
  model: m,
})]));

export async function streamWithFallback(payload: Payload, env: Env, agentId: string) {
  for (const model of MODELS) {
    const breaker = breakers.get(model)!;
    try {
      return await breaker.call(() => callAnthropic({ ...payload, model }, env));
    } catch (err: any) {
      const isFinal = model === MODELS[MODELS.length - 1];
      if (isFinal) break; // exhausted all models — fall through to cache
      // Log the failover — visible in observability dashboard
      logFailover({ agentId, from: model, reason: err.message }, env);
    }
  }

  // Last resort: return cached response with staleness flag
  const cached = await env.KV.get(`stale:${agentId}`);
  if (cached) {
    const { response, cachedAt } = JSON.parse(cached);
    return new Response(
      JSON.stringify({ ...response, _degraded: true, _cachedAt: cachedAt }),
      { headers: { 'X-Degraded-Mode': 'true' } }
    );
  }

  // Truly nothing left — fail explicitly with structured error
  throw new Error('ALL_MODELS_FAILED');
}
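
The stale cache only works if something writes to it. Below is a minimal sketch of the write side, assuming a hypothetical cacheStaleResponse() helper invoked after each successful agent response; the key shape matches the read above, and the 24-hour TTL is an illustrative value rather than the production setting.

async function cacheStaleResponse(
  agentId: string,
  response: Record<string, unknown>,
  env: Env
): Promise<void> {
  // Key shape matches the read in streamWithFallback: `stale:${agentId}`.
  // expirationTtl (24h) is an assumed value: keep the last-known-good
  // response around long enough to cover a prolonged outage.
  await env.KV.put(
    `stale:${agentId}`,
    JSON.stringify({ response, cachedAt: new Date().toISOString() }),
    { expirationTtl: 60 * 60 * 24 }
  );
}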

Four Degradation Tiers

Graceful degradation means every failure mode has a designed response — not an accident of how Promise.allSettled works. Four tiers, from minor to critical:

T1 · 1 Agent Down
T2 · 3+ Agents Down
T3 · No Orchestrator
T4 · All Models Fail
orchestrator/degradation.ts — tier detection and response
Tiers
function getDegradationTier(results: SettledResult[]): DegradationTier {
  const failed = results.filter(r => r.status === 'rejected').length;

  if (failed === 0) return 'T0_NOMINAL'; // all 10 responded

  if (failed === 1) return 'T1_MINOR';
  // → 9-agent synthesis, failedAgents: ['agentId'] in tension map
  // → synthesis prompt: "Do not draw conclusions from absent domains"

  if (failed <= 3) return 'T2_DEGRADED';
  // → synthesis with explicit coverage gaps in output
  // → client shows: "Analysis missing: [domain names]"
  // → Round 2 disabled for degraded queries

  if (failed <= 6) return 'T3_PARTIAL';
  // → orchestrator skips tension map entirely
  // → returns parallel agent summaries only, no synthesis
  // → client renders "Partial analysis — full synthesis unavailable"

  return 'T4_CRITICAL';
  // → Queued response with ETA from KV
  // → Slack alert fires
  // → PagerDuty webhook (if configured)
}

// T4 Slack alert — fires in under 500ms of detection
async function alertT4(error: string, env: Env) {
  await fetch(env.SLACK_WEBHOOK, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `🔴 T4 CRITICAL: ${error} — Consilium degraded` })
  });
}
Post-Incident Report #003 · Duration: 11 minutes
The 429 Storm That Nobody Noticed
14:32:04
Concurrent user spike begins. 14 simultaneous queries hit the orchestrator — normal peak handling.
14:32:11
First 429s appear. Vasquez, Chen, and Webb all return 429 simultaneously — Anthropic's rate limit hit by the fan-out.
14:32:12
Circuit breakers trip to OPEN. All three agents' Sonnet breakers open. Haiku fallback activates instantly — no timeout wait.
14:32:13
7 remaining agents proceed on Sonnet. Fallback agents proceed on Haiku. Queries complete normally. Users see no error. Latency impact: +340ms average.
14:33:00
Rate limit pressure eases. Sonnet rate limit window resets.
14:33:12
Cool-down expires. Circuits move to HALF-OPEN. Test requests on Sonnet — all succeed.
14:43:18
All circuits CLOSED. Full Sonnet operation restored. Zero user-facing errors. Incident invisible to users.
Why allSettled + Circuit Breakers = No Downtime

Two patterns working together. Promise.allSettled() means a failed agent can't block the others — the orchestrator processes whatever completed. Circuit breakers mean a failing model doesn't stall every request with a timeout — the circuit trips, subsequent requests route to fallback immediately. Without either pattern, 14 concurrent queries × 30-second timeout × 3 failing agents = 1,260 cumulative seconds of stalled requests. With both patterns, impact is +340ms average latency for 11 minutes.
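
Here is a minimal sketch of how the two patterns meet in the orchestrator's fan-out. The callAgent() helper and AgentResponse type are assumptions for illustration; each agent Worker applies its own breaker and fallback internally via streamWithFallback().

async function fanOut(query: string, agentIds: string[], env: Env) {
  // allSettled: a rejected agent promise never blocks the others.
  const results = await Promise.allSettled(
    agentIds.map(id => callAgent(id, query, env)) // hypothetical per-agent fetch wrapper
  );

  // Tier detection from the settled results (see degradation.ts above).
  const tier = getDegradationTier(results);

  const responses = results
    .filter((r): r is PromiseFulfilledResult<AgentResponse> => r.status === 'fulfilled')
    .map(r => r.value);
  const failedAgents = agentIds.filter((_, i) => results[i].status === 'rejected');

  return { tier, responses, failedAgents };
}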

Zero-Downtime Deploys

Deploying to 11 Workers simultaneously creates a window where the orchestrator expects an API contract that some agents haven't yet implemented. The safe deploy order: agents first, orchestrator last. Every agent Worker is deployed and verified before the orchestrator deploy begins.

// Deploy Sequence — 11 Workers, 0 Downtime
scripts/deploy-all.sh — production deploy script
Deploy Script
#!/bin/bash
# deploy-all.sh — deploys 11 Workers in safe order
# Run: ./scripts/deploy-all.sh [--env production]

AGENTS=("vasquez" "webb" "chen" "okafor" "mitchell"
        "nakamura" "diallo" "harlow" "deleon" "cross")

# Phase 1: Deploy all 10 agent workers
# Orchestrator NOT deployed yet — old orchestrator still calls old agents fine
for agent in "${AGENTS[@]}"; do
  echo "Deploying $agent..."
  wrangler deploy --config workers/$agent/wrangler.toml
  if [ $? -ne 0 ]; then
    echo "DEPLOY FAILED: $agent — aborting. Orchestrator not touched."
    exit 1
  fi
  sleep 2 # brief settle window per agent
done

# Verify all agents responding before orchestrator deploy
echo "Running agent health checks..."
./scripts/health-check-agents.sh
if [ $? -ne 0 ]; then
  echo "HEALTH CHECK FAILED — orchestrator deploy blocked"
  exit 1
fi

# Phase 2: Deploy orchestrator last
# All agents already on new version — orchestrator deploy is safe
echo "Deploying orchestrator..."
wrangler deploy --config workers/orchestrator/wrangler.toml

echo "Deploy complete. 11/11 workers updated. Zero downtime."
The Health Check That Blocks Bad Deploys

The health check script sends a minimal probe request to each agent Worker — not a full query, just a GET /health that returns the Worker's version hash and circuit state. If any agent returns 500 or times out after its deploy, the orchestrator deploy is blocked. This catches the ~5% of deploys where a Worker builds successfully but fails at runtime — a missing env var, a broken import, a KV binding that wasn't updated. Catching it before the orchestrator deploy means the old orchestrator keeps running against the old working agents.
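
For reference, a sketch of what the probe hits on each agent Worker. The handleHealth() name, the VERSION constant, and the getState() accessor on the breaker are illustrative assumptions; the real handler returns the version hash and circuit state described above.

export async function handleHealth(request: Request): Promise<Response> {
  if (new URL(request.url).pathname !== '/health') {
    return new Response('Not found', { status: 404 });
  }
  return Response.json({
    version: VERSION, // build-time version hash (assumed to be injected at build)
    circuits: Object.fromEntries(
      // assumes a small getState() accessor on CircuitBreaker
      [...breakers].map(([model, breaker]) => [model, breaker.getState()])
    ),
    checkedAt: new Date().toISOString(),
  });
}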

Frequently Asked
What is a circuit breaker pattern in AI systems?
A circuit breaker monitors failure rate for a dependency and trips to "open" state when failures exceed a threshold, short-circuiting calls to that dependency for a cool-down period rather than letting each call fail with a timeout. For AI Workers, a circuit breaker tracks consecutive errors per model. When the circuit trips, it immediately returns CIRCUIT_OPEN — the caller routes to the fallback without waiting. After the cool-down, the circuit moves to HALF-OPEN: one test request through. Two successes close it. One failure re-opens it.
How does multi-model fallback work in a Cloudflare Worker?
Each Worker maintains a priority-ordered model list: Sonnet (primary), Haiku (fallback), cached stale response (last resort). On each call it tries the primary. If the primary returns 429, 5xx, or times out, it retries on the fallback. The circuit breaker sits above this — if the primary has tripped its circuit, the Worker skips straight to the fallback without attempting the primary. Failover adds ~200ms latency (a new Anthropic request) but keeps the system running at slightly reduced quality.
How do you deploy 11 Workers without downtime?
Deploy agents first, orchestrator last. Cloudflare deploys are atomic per Worker — traffic switches instantly to the new version. Deploying the orchestrator before agents creates a window where the orchestrator expects API changes the agents haven't shipped yet. Deploying agents first, verifying they're healthy via health checks, then deploying the orchestrator means there's never a moment where the orchestrator is calling a contract that doesn't exist. If any agent health check fails, the orchestrator deploy is blocked.
What happens when one agent Worker fails mid-query?
The orchestrator uses Promise.allSettled(), not Promise.all(). A failed agent returns a rejected promise — the orchestrator catches it, marks it in the tension map's failedAgents field, and proceeds with the other nine responses. The synthesis prompt explicitly instructs the orchestrator not to draw conclusions from absent domains. One agent failure = T1 tier, 9-agent synthesis. Two to three failures = T2 degraded. Four to six failures = T3, synthesis skipped, parallel summaries only. More than six = T4 critical.
How did the 429 storm happen and what stopped it?
14 concurrent queries pushed the fan-out into Anthropic's rate limit: three agent Workers (Vasquez, Chen, and Webb) returned 429 simultaneously when their calls landed in the same rate window. Without circuit breakers, each request would wait 30 seconds for the timeout before falling back — 14 × 30s × 3 failing agents = 1,260 seconds of stalled requests. With circuit breakers, the first three failures tripped the circuits to OPEN. Subsequent requests routed to Haiku instantly. User-visible impact: +340ms average latency for 11 minutes. Zero errors. Zero queue buildup.
// Circuit breakers armed. Fallbacks ready. 11/11 operational.
The System Stays Up.

Ask the Consilium something hard. The resilience architecture you just read about is running behind every response.

Open The Consilium