Most multi-agent systems aggregate. They collect responses from multiple models and merge them into a summary. That's not orchestration — that's averaging. Real orchestration means understanding where agents agree, where they're in genuine tension, why the tension exists, and what should be done about it before presenting anything to the human.
Posts 1–5 reference the orchestrator constantly. This one opens it up completely. The tension map schema, the clash scoring algorithm, the exact condition that fires Round 2, the synthesis prompt, the streaming architecture, and the failure mode that makes the whole system worse if you get it wrong.
Aggregation vs. Orchestrated Disagreement
Here's the difference in concrete terms. You ask ten domain experts whether a real estate deal is sound. Aggregation returns: "Most agents found merit in the deal. Some concerns were raised around financing." Orchestrated disagreement returns: "The economist and the risk officer have a material irresolvable clash on whether the cap rate assumption is realistic. The legal analyst flags a title issue none of the other agents addressed. Eight of ten agents agree on the exit timeline. Round 2 has been triggered for the financing disagreement."
The first output sounds like a conclusion. The second output is a decision map. For high-stakes decisions, the clash between two agents on a specific causal claim is more valuable than ten agreeing paragraphs. The orchestrator's job is to surface that clash, not paper over it.
The Tension Map Schema
The tension map is a required output field — not optional, not conditional. If the orchestrator returns a response without it, the client treats the response as invalid and retries. This structural constraint prevents the most common failure mode: the orchestrator summarizing without mapping.
```typescript
interface TensionMap {          // Required — validated on every orchestrator response
  version: string;              // schema version for backwards compat
  queryId: string;              // correlates with agent call logs
  generatedAt: number;          // unix timestamp
  round: 1 | 2;                 // which synthesis pass produced this

  // Consensus zones — where agents meaningfully agree
  consensus: {
    claim: string;              // the agreed-upon assertion
    supportingAgents: string[]; // which agents hold this view
    confidence: number;         // 0–1 — orchestrator's confidence in the consensus
    loadBearing: boolean;       // does this affect the conclusion?
  }[];

  // Tension entries — the core value of the system
  tensions: {
    id: string;                 // unique — used in Round 2 targeting
    agentA: string;             // first agent in conflict
    agentB: string;             // second agent in conflict
    claimA: string;             // agentA's specific position
    claimB: string;             // agentB's specific position
    type: 'factual' | 'interpretive' | 'emphasis'; // clash type
    severity: number;           // 1–10 — 6+ triggers Round 2
    loadBearing: boolean;       // affects conclusion?
    resolvable: boolean;        // can more info resolve it?
    recommendation: string;     // how should the human weigh this?
  }[];

  // Synthesis — the orchestrator's reading of the full landscape
  synthesis: {
    headline: string;           // one-sentence summary of state of play
    majorFindings: string[];    // top 3-5 substantive conclusions
    openQuestions: string[];    // unresolved after Round 2 (if applicable)
    confidenceProfile: {        // not a single score — per-domain confidence
      [agentId: string]: number; // 0–1
    };
  };

  // Round 2 targeting — populated before R2 call, nulled after
  round2Target?: {
    tensionId: string;          // which tension to resolve
    agents: [string, string];   // only these two agents are re-queried
    prompt: string;             // the specific clash framed as a question
  };
}
```
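The "invalid without a tension map" rule is enforceable with a small structural gate on the client. Here is a minimal sketch of such a check; the specific fields tested are taken from the schema above, but the production validation is presumably stricter:

```typescript
// Minimal structural validator for a TensionMap response: a sketch,
// not the production implementation. It checks only the fields the
// client depends on before rendering.
interface TensionMapLike {
  tensions?: unknown;
  consensus?: unknown;
  synthesis?: { confidenceProfile?: unknown };
}

function validateTensionMap(map: unknown): boolean {
  if (typeof map !== "object" || map === null) return false;
  const m = map as TensionMapLike;
  // The tensions array is required. Its absence is exactly the
  // failure mode the schema exists to prevent.
  if (!Array.isArray(m.tensions)) return false;
  if (!Array.isArray(m.consensus)) return false;
  // Confidence must be per-domain, each value in [0, 1].
  const profile = m.synthesis?.confidenceProfile;
  if (typeof profile !== "object" || profile === null) return false;
  return Object.values(profile).every(
    (c) => typeof c === "number" && c >= 0 && c <= 1
  );
}
```

A response failing this check is retried rather than rendered, which is what makes the tension map structurally required instead of merely encouraged.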
A single confidence score on multi-agent output is meaningless — it collapses ten different epistemic contexts into one number. The orchestrator might be highly confident in the legal analysis (clear statute, unambiguous application) and deeply uncertain in the economic forecast (contested empirical assumptions). Separate confidence scores per agent expose the actual distribution of certainty. The human can then decide which domains to weight more heavily for this specific decision.
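Per-domain confidence also makes downstream weighting trivial. A sketch of how a client might surface the least-certain domains for the human to scrutinize first (the 0.6 cutoff is an assumption, not a value from the production system):

```typescript
// Given a per-agent confidence profile, return the domains the human
// should examine first: those below a review threshold, lowest first.
function uncertainDomains(
  profile: Record<string, number>,
  threshold = 0.6 // assumed cutoff for illustration
): string[] {
  return Object.entries(profile)
    .filter(([, confidence]) => confidence < threshold)
    .sort(([, a], [, b]) => a - b)
    .map(([agentId]) => agentId);
}
```

With the example from the text, a profile like `{ legal: 0.92, economist: 0.41 }` flags the economic forecast and leaves the legal analysis alone, which is exactly the distinction a single collapsed score would erase.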
The Clash Detection Algorithm
Clash detection is the hardest part of the orchestrator to get right. Ten agents answering the same complex question will produce hundreds of surface-level differences across their pairwise comparisons — different word choices, different emphasis, different framings. The algorithm has to distinguish those from genuine material disagreements.
```typescript
interface ClashScore {
  agentPair: [string, string];
  claimA: string;            // agentA's conflicting claim
  claimB: string;            // agentB's conflicting claim
  severity: number;          // 1-10
  type: 'factual' | 'interpretive' | 'emphasis';
  loadBearing: boolean;
}

async function detectClashes(
  responses: AgentResponse[],
  env: Env
): Promise<ClashScore[]> {
  // Phase 1: Extract all claims from each response
  // The orchestrator reads each response and identifies
  // load-bearing assertions (claims that affect the conclusion)
  const claims = await extractClaims(responses, env);

  // Phase 2: Compare claims across agent pairs
  // Only compare load-bearing claims — peripheral diffs are noise
  const clashes: ClashScore[] = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      const pairClash = await scorePairClash({
        agentA: responses[i].agentId,
        agentB: responses[j].agentId,
        claimsA: claims[i].loadBearing,
        claimsB: claims[j].loadBearing,
      }, env);
      // Severity scoring by type:
      //   Factual contradiction:   8-10 (direct truth claim conflict)
      //   Interpretive divergence: 4-7  (same facts, different meaning)
      //   Emphasis difference:     1-3  (same view, different priority)
      if (pairClash.severity > 0) clashes.push(pairClash);
    }
  }
  return clashes.sort((a, b) => b.severity - a.severity);
}

function shouldTriggerRound2(clashes: ClashScore[]): boolean {
  // Round 2 condition: 2+ load-bearing clashes scoring ≥ 6/10
  const highSeverity = clashes.filter(c =>
    c.severity >= 6 &&
    c.loadBearing &&
    c.type !== 'emphasis'
  );
  return highSeverity.length >= 2;
}

// If Round 2 triggers, only the two conflicting agents are re-queried
// Not all ten — targeted, not expensive
function buildRound2Prompt(clash: ClashScore): string {
  const [agentA, agentB] = clash.agentPair;
  return `
${agentA} argued: "${clash.claimA}"
${agentB} argued: "${clash.claimB}"

These claims are in direct conflict on a load-bearing point.
Address the opposing argument specifically. Do not restate
your original position without engaging the challenge.
`;
}
```
Two agents can both agree that a risk exists but disagree on how prominently to flag it. That's an emphasis difference — score 1–3, and it never triggers Round 2. It belongs in the tension map for human visibility, but it's not a factual or interpretive disagreement. The algorithm must classify before scoring. Collapsing emphasis differences into factual contradictions produces a Round 2 trigger rate that's too high and burns unnecessary API cost on noise.
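The classify-before-score rule can also be enforced mechanically: each clash type maps to a severity band, and a model-assigned score outside its type's band gets clamped into it. A small sketch; the band boundaries come from the scoring rubric above, while the clamping behavior itself is an assumption about how one might harden the pipeline:

```typescript
type ClashType = "factual" | "interpretive" | "emphasis";

// Severity bands per clash type, matching the rubric:
// factual 8-10, interpretive 4-7, emphasis 1-3.
const SEVERITY_BANDS: Record<ClashType, [number, number]> = {
  factual: [8, 10],
  interpretive: [4, 7],
  emphasis: [1, 3],
};

// Clamp a model-assigned severity into its type's band, so an
// emphasis difference can never masquerade as a Round 2 trigger.
function clampSeverity(type: ClashType, severity: number): number {
  const [lo, hi] = SEVERITY_BANDS[type];
  return Math.min(hi, Math.max(lo, severity));
}
```

With this in place, even a miscalibrated score of 9 on an emphasis difference clamps to 3, safely below the Round 2 threshold of 6.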
The Synthesis Prompt
The synthesis prompt is the most carefully engineered part of the orchestrator. It has to produce structured JSON output, preserve unresolved conflicts, avoid false consensus, and render a useful decision map — all in one pass. Here's the exact production prompt:
```typescript
const ORCHESTRATOR_SYSTEM_PROMPT = `
You are the Consilium Orchestrator. You receive responses from 10 domain
expert AI agents and produce a structured tension map.

YOUR CARDINAL RULES:

1. PRESERVE DISAGREEMENT. Do not synthesize away genuine conflict.
   If two agents disagree on a load-bearing claim, that conflict must
   appear in the tensions array regardless of how uncomfortable it is.

2. CLASSIFY BEFORE SCORING. Every disagreement is one of:
   - factual: directly contradictory truth claims (score 8-10)
   - interpretive: same facts, different meaning (score 4-7)
   - emphasis: same view, different priority (score 1-3)

3. STRUCTURED OUTPUT REQUIRED. Your entire response must be valid JSON
   matching the TensionMap schema. No prose, no preamble, no markdown.
   A response without a tensions array will be treated as invalid.

4. CONFIDENCE IS PER-DOMAIN. Do not produce a single confidence score.
   Rate each agent's domain contribution independently.

5. ROUND 2 ONLY FOR LOAD-BEARING FACTUAL CLASHES. Emphasis differences
   do not trigger Round 2. Cost is real.

WHAT A GOOD SYNTHESIS LOOKS LIKE:
- tensions array has 3-8 entries for a complex query
- At least one consensus entry per major topic area
- openQuestions lists what Round 2 did NOT resolve (honesty)
- headline is one sentence, no hedging, no "it depends"

WHAT A BAD SYNTHESIS LOOKS LIKE:
- Empty or single-item tensions array on a complex topic
- Headline that begins with "It depends" or "Both perspectives..."
- Confidence scores all above 0.85 on contested empirical claims
- openQuestions is empty after a contested Round 2
`;
```
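Wiring the prompt into the synthesis step looks roughly like this. The post doesn't show the real `synthesize` implementation, so this is a sketch: the model transport is injected as a parameter (the production code would call the provider's streaming chat API there), and the fence-stripping regex is a defensive assumption, since models occasionally wrap JSON in markdown despite rule 3:

```typescript
// Sketch of the synthesis step: feed agent responses to the model
// under the orchestrator system prompt, forward chunks to the caller
// as they arrive, then parse the accumulated output as JSON.
type StreamModel = (
  system: string,
  user: string,
  onChunk: (chunk: string) => void
) => Promise<void>;

async function synthesizeWith(
  model: StreamModel,
  system: string,
  agentResponses: { agentId: string; text: string }[],
  onChunk: (chunk: string) => void
): Promise<unknown> {
  const user = agentResponses
    .map((r) => `### ${r.agentId}\n${r.text}`)
    .join("\n\n");
  let accumulated = "";
  await model(system, user, (chunk) => {
    accumulated += chunk; // keep the full text for the final JSON parse
    onChunk(chunk);       // forward to the SSE writer as it arrives
  });
  // Rule 3 demands raw JSON, but strip a stray markdown fence
  // defensively before parsing.
  const cleaned = accumulated.replace(/^```(?:json)?\s*|\s*```$/g, "");
  return JSON.parse(cleaned);
}
```

Injecting the transport keeps the accumulate-then-parse logic testable without a network call, and it's the same shape the streaming architecture below needs: chunks out immediately, a validated object at the end.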
The Streaming Architecture
The orchestrator can't start streaming until it has all 10 agent responses. That's a hard dependency — you can't detect clashes without all the inputs. But you also can't make the user wait 8–12 seconds staring at a blank screen. The two-phase streaming approach solves this without changing the underlying computation:
```typescript
async function streamOrchestratedResponse(message: string, env: Env) {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  const enc = new TextEncoder();
  const emit = (event: string, data: unknown) =>
    writer.write(enc.encode(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`));

  // Run the pipeline without awaiting it, so the Response (and the
  // readable half of the stream) reaches the client immediately.
  (async () => {
    // PHASE 1: Fan out to all 10 agents, stream summaries as they arrive
    // User sees content within ~300ms — not a blank screen for 10 seconds
    const agentPromises = AGENTS.map(id =>
      callAgent(id, message, env).then(response => {
        emit('agent_complete', { agentId: id, summary: response.summary });
        return response;
      })
    );

    // Wait for all 10 — allSettled so one failure doesn't block synthesis
    const settled = await Promise.allSettled(agentPromises);
    const responses = settled
      .filter((r): r is PromiseFulfilledResult<AgentResponse> =>
        r.status === 'fulfilled')
      .map(r => r.value);

    // PHASE 2: Clash detection + synthesis — begins after all 10 complete
    await emit('orchestrating', {
      message: 'Mapping tensions...',
      agentCount: responses.length,
    });
    const clashes = await detectClashes(responses, env);

    let finalResponses = responses;
    if (shouldTriggerRound2(clashes)) {
      await emit('round2_triggered', {
        clashes: clashes.filter(c => c.severity >= 6),
      });
      finalResponses = await runRound2(responses, clashes, env);
    }

    // Stream the tension map as it generates (SSE from Claude)
    const tensionMap = await synthesize(finalResponses, env, (chunk: string) => {
      writer.write(enc.encode(`event: synthesis_chunk\ndata: ${chunk}\n\n`));
    });

    // Final emit: validated tension map JSON
    if (!validateTensionMap(tensionMap)) {
      await emit('error', { code: 'INVALID_TENSION_MAP', retry: true });
    } else {
      await emit('tension_map', tensionMap);
    }
    await writer.close();
  })();

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}
```
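On the client side, those frames have to be reassembled from arbitrary network chunk boundaries. A sketch of a minimal SSE frame parser; the event names match the server code above, but the parser itself is an illustration, not code from the production client:

```typescript
// Sketch of a client-side parser for the orchestrator's SSE frames.
// Feed it raw text chunks from a streaming fetch body; it invokes the
// handler once per complete frame. Partial frames are buffered until
// the blank-line terminator (\n\n) arrives.
type SseHandler = (event: string, data: string) => void;

function createSseParser(onEvent: SseHandler) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    let boundary: number;
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const frame = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      let event = "message";
      let data = "";
      for (const line of frame.split("\n")) {
        if (line.startsWith("event: ")) event = line.slice(7);
        else if (line.startsWith("data: ")) data += line.slice(6);
      }
      onEvent(event, data);
    }
  };
}
```

The handler then switches on `agent_complete`, `orchestrating`, `round2_triggered`, `synthesis_chunk`, and `tension_map` to drive the UI: per-agent summaries first, the live synthesis text second, the validated map last.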
The Failure Mode
Three weeks after launching the orchestrator, I noticed the Round 2 trigger rate had dropped from 28% to under 5% over a four-day period. The system was still running. Agents were still responding. No errors in the logs.
The investigation: pulled random tension maps from that period and read them. The synthesis was clean. Almost too clean. Eight-agent queries with no major unresolved tensions. Confidence scores all above 0.85. The headline on one response: "All domain experts agree this represents a sound investment opportunity." Ten agents. Zero tension entries.
The cause was subtle. A prompt update to the synthesis system prompt had added a line intended to improve readability: "Prioritize producing a clear, actionable synthesis the user can act on immediately." That single instruction shifted the orchestrator's optimization target from "accurately represent the state of disagreement" to "produce something the user can act on." The model correctly inferred that a clean synthesis is more actionable than a messy tension map. So it produced clean syntheses. By suppressing the disagreements.
The fix: removed the readability instruction entirely, added the explicit anti-pattern rules now in the production prompt ("A bad synthesis looks like: empty tensions array on a complex topic"). Required minimum tension entries for queries above a complexity threshold. Added an automated check: if a query contains more than 800 tokens of agent responses and produces zero tension entries, flag for manual review.
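The automated check fits in a one-screen function. A sketch: the 800-token threshold comes from the text above, while the character-based token estimate and the review-flag shape are assumptions standing in for the production mechanism:

```typescript
// Flag suspiciously clean syntheses for manual review: a complex query
// (lots of agent material) producing zero tension entries is more
// likely a suppression failure than genuine unanimity.
const COMPLEXITY_TOKEN_THRESHOLD = 800;

// Rough token estimate (~4 characters per token). An assumption;
// production code would use the tokenizer's own count.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function needsManualReview(
  agentResponses: { text: string }[],
  tensionCount: number
): boolean {
  const totalTokens = agentResponses.reduce(
    (sum, r) => sum + estimateTokens(r.text),
    0
  );
  return totalTokens > COMPLEXITY_TOKEN_THRESHOLD && tensionCount === 0;
}
```

The check is deliberately crude. It doesn't judge whether the tensions are good, only whether their total absence is plausible given how much agent material went in.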
The lesson: the orchestrator's synthesis incentive must be truth-first, not clarity-first. If you optimize for readable output, you get readable lies. The tension map exists precisely because the world is complicated. Making it look simple is the failure mode.
Watch It Do Its Job
Ask the Consilium something genuinely hard. The tension map is visible in the output — you'll see exactly which agents are in conflict and why.
Open The Consilium