Most multi-agent systems aggregate. They collect responses from multiple models and merge them into a summary. That's not orchestration — that's averaging. Real orchestration means understanding where agents agree, where they're in genuine tension, why the tension exists, and what should be done about it before presenting anything to the human.
Posts 1–5 reference the orchestrator constantly. This one opens it up completely. The tension map schema, the clash scoring algorithm, the exact condition that fires Round 2, the synthesis prompt, the streaming architecture, and the failure mode that makes the whole system worse if you get it wrong.
Aggregation vs. Orchestrated Disagreement
Here's the difference in concrete terms. You ask ten domain experts whether a real estate deal is sound. Aggregation returns: "Most agents found merit in the deal. Some concerns were raised around financing." Orchestrated disagreement returns: "The economist and the risk officer have a material irresolvable clash on whether the cap rate assumption is realistic. The legal analyst flags a title issue none of the other agents addressed. Eight of ten agents agree on the exit timeline. Round 2 has been triggered for the financing disagreement."
The first output sounds like a conclusion. The second output is a decision map. For high-stakes decisions, the clash between two agents on a specific causal claim is more valuable than ten agreeing paragraphs. The orchestrator's job is to surface that clash, not paper over it.
The Tension Map Schema
The tension map is a required output field — not optional, not conditional. If the orchestrator returns a response without it, the client treats the response as invalid and retries. This structural constraint prevents the most common failure mode: the orchestrator summarizing without mapping.
```typescript
interface TensionMap {          // Required — validated on every orchestrator response
  version: string;              // schema version for backwards compat
  queryId: string;              // correlates with agent call logs
  generatedAt: number;          // unix timestamp
  round: 1 | 2;                 // which synthesis pass produced this

  // Consensus zones — where agents meaningfully agree
  consensus: {
    claim: string;              // the agreed-upon assertion
    supportingAgents: string[]; // which agents hold this view
    confidence: number;         // 0–1 — orchestrator's confidence in the consensus
    loadBearing: boolean;       // does this affect the conclusion?
  }[];

  // Tension entries — the core value of the system
  tensions: {
    id: string;                 // unique — used in Round 2 targeting
    agentA: string;             // first agent in conflict
    agentB: string;             // second agent in conflict
    claimA: string;             // agentA's specific position
    claimB: string;             // agentB's specific position
    type: 'factual' | 'interpretive' | 'emphasis'; // clash type
    severity: number;           // 1–10 — 6+ triggers Round 2
    loadBearing: boolean;       // affects conclusion?
    resolvable: boolean;        // can more info resolve it?
    recommendation: string;     // how should the human weigh this?
  }[];

  // Synthesis — the orchestrator's reading of the full landscape
  synthesis: {
    headline: string;           // one-sentence summary of state of play
    majorFindings: string[];    // top 3-5 substantive conclusions
    openQuestions: string[];    // unresolved after Round 2 (if applicable)
    confidenceProfile: {        // not a single score — per-domain confidence
      [agentId: string]: number; // 0–1
    };
  };

  // Round 2 targeting — populated before R2 call, nulled after
  round2Target?: {
    tensionId: string;          // which tension to resolve
    agents: [string, string];   // only these two agents are re-queried
    prompt: string;             // the specific clash framed as a question
  };
}
```
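The "invalid without a tension map" rule is enforceable with a small structural gate on the client. Here is a minimal sketch of such a check; the specific fields tested are taken from the schema above, but the production validation is presumably stricter:

```typescript
// Minimal structural validator for a TensionMap response: a sketch,
// not the production implementation. It checks only the fields the
// client depends on before rendering.
interface TensionMapLike {
  tensions?: unknown;
  consensus?: unknown;
  synthesis?: { confidenceProfile?: unknown };
}

function validateTensionMap(map: unknown): boolean {
  if (typeof map !== "object" || map === null) return false;
  const m = map as TensionMapLike;
  // The tensions array is required. Its absence is exactly the
  // failure mode the schema exists to prevent.
  if (!Array.isArray(m.tensions)) return false;
  if (!Array.isArray(m.consensus)) return false;
  // Confidence must be per-domain, each value in [0, 1].
  const profile = m.synthesis?.confidenceProfile;
  if (typeof profile !== "object" || profile === null) return false;
  return Object.values(profile).every(
    (c) => typeof c === "number" && c >= 0 && c <= 1
  );
}
```

A response failing this check is retried rather than rendered, which is what makes the tension map structurally required instead of merely encouraged.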
A single confidence score on multi-agent output is meaningless — it collapses ten different epistemic contexts into one number. The orchestrator might be highly confident in the legal analysis (clear statute, unambiguous application) and deeply uncertain in the economic forecast (contested empirical assumptions). Separate confidence scores per agent expose the actual distribution of certainty. The human can then decide which domains to weight more heavily for this specific decision.
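Per-domain confidence also makes downstream weighting trivial. A sketch of how a client might surface the least-certain domains for the human to scrutinize first (the 0.6 cutoff is an assumption, not a value from the production system):

```typescript
// Given a per-agent confidence profile, return the domains the human
// should examine first: those below a review threshold, lowest first.
function uncertainDomains(
  profile: Record<string, number>,
  threshold = 0.6 // assumed cutoff for illustration
): string[] {
  return Object.entries(profile)
    .filter(([, confidence]) => confidence < threshold)
    .sort(([, a], [, b]) => a - b)
    .map(([agentId]) => agentId);
}
```

With the example from the text, a profile like `{ legal: 0.92, economist: 0.41 }` flags the economic forecast and leaves the legal analysis alone, which is exactly the distinction a single collapsed score would erase.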
The Clash Detection Algorithm
Clash detection is the hardest part of the orchestrator to get right. Ten agents answering the same complex question will produce hundreds of surface-level differences across their pairwise comparisons — different word choices, different emphasis, different framings. The algorithm has to distinguish those from genuine material disagreements.
```typescript
interface ClashScore {
  agentPair: [string, string];
  claimA: string;            // agentA's conflicting claim
  claimB: string;            // agentB's conflicting claim
  severity: number;          // 1-10
  type: 'factual' | 'interpretive' | 'emphasis';
  loadBearing: boolean;
}

async function detectClashes(
  responses: AgentResponse[],
  env: Env
): Promise<ClashScore[]> {
  // Phase 1: Extract all claims from each response
  // The orchestrator reads each response and identifies
  // load-bearing assertions (claims that affect the conclusion)
  const claims = await extractClaims(responses, env);

  // Phase 2: Compare claims across agent pairs
  // Only compare load-bearing claims — peripheral diffs are noise
  const clashes: ClashScore[] = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      const pairClash = await scorePairClash({
        agentA: responses[i].agentId,
        agentB: responses[j].agentId,
        claimsA: claims[i].loadBearing,
        claimsB: claims[j].loadBearing,
      }, env);
      // Severity scoring by type:
      //   Factual contradiction:   8-10 (direct truth claim conflict)
      //   Interpretive divergence: 4-7  (same facts, different meaning)
      //   Emphasis difference:     1-3  (same view, different priority)
      if (pairClash.severity > 0) clashes.push(pairClash);
    }
  }
  return clashes.sort((a, b) => b.severity - a.severity);
}

function shouldTriggerRound2(clashes: ClashScore[]): boolean {
  // Round 2 condition: 2+ load-bearing clashes scoring ≥ 6/10
  const highSeverity = clashes.filter(c =>
    c.severity >= 6 &&
    c.loadBearing &&
    c.type !== 'emphasis'
  );
  return highSeverity.length >= 2;
}

// If Round 2 triggers, only the two conflicting agents are re-queried
// Not all ten — targeted, not expensive
function buildRound2Prompt(clash: ClashScore): string {
  const [agentA, agentB] = clash.agentPair;
  return `
${agentA} argued: "${clash.claimA}"
${agentB} argued: "${clash.claimB}"

These claims are in direct conflict on a load-bearing point.
Address the opposing argument specifically. Do not restate
your original position without engaging the challenge.
`;
}
```
Two agents can both agree that a risk exists but disagree on how prominently to flag it. That's an emphasis difference — score 1–3, and it never triggers Round 2. It belongs in the tension map for human visibility, but it's not a factual or interpretive disagreement. The algorithm must classify before scoring. Collapsing emphasis differences into factual contradictions produces a Round 2 trigger rate that's too high and burns unnecessary API cost on noise.
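The classify-before-score rule can also be enforced mechanically: each clash type maps to a severity band, and a model-assigned score outside its type's band gets clamped into it. A small sketch; the band boundaries come from the scoring rubric above, while the clamping behavior itself is an assumption about how one might harden the pipeline:

```typescript
type ClashType = "factual" | "interpretive" | "emphasis";

// Severity bands per clash type, matching the rubric:
// factual 8-10, interpretive 4-7, emphasis 1-3.
const SEVERITY_BANDS: Record<ClashType, [number, number]> = {
  factual: [8, 10],
  interpretive: [4, 7],
  emphasis: [1, 3],
};

// Clamp a model-assigned severity into its type's band, so an
// emphasis difference can never masquerade as a Round 2 trigger.
function clampSeverity(type: ClashType, severity: number): number {
  const [lo, hi] = SEVERITY_BANDS[type];
  return Math.min(hi, Math.max(lo, severity));
}
```

With this in place, even a miscalibrated score of 9 on an emphasis difference clamps to 3, safely below the Round 2 threshold of 6.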
The Synthesis Prompt
The synthesis prompt is the most carefully engineered part of the orchestrator. It has to produce structured JSON output, preserve unresolved conflicts, avoid false consensus, and render a useful decision map — all in one pass. Here's the exact production prompt:
```typescript
const ORCHESTRATOR_SYSTEM_PROMPT = `
You are the Consilium Orchestrator. You receive responses from 10 domain
expert AI agents and produce a structured tension map.

YOUR CARDINAL RULES:

1. PRESERVE DISAGREEMENT. Do not synthesize away genuine conflict.
   If two agents disagree on a load-bearing claim, that conflict must
   appear in the tensions array regardless of how uncomfortable it is.

2. CLASSIFY BEFORE SCORING. Every disagreement is one of:
   - factual: directly contradictory truth claims (score 8-10)
   - interpretive: same facts, different meaning (score 4-7)
   - emphasis: same view, different priority (score 1-3)

3. STRUCTURED OUTPUT REQUIRED. Your entire response must be valid JSON
   matching the TensionMap schema. No prose, no preamble, no markdown.
   A response without a tensions array will be treated as invalid.

4. CONFIDENCE IS PER-DOMAIN. Do not produce a single confidence score.
   Rate each agent's domain contribution independently.

5. ROUND 2 ONLY FOR LOAD-BEARING FACTUAL CLASHES. Emphasis differences
   do not trigger Round 2. Cost is real.

WHAT A GOOD SYNTHESIS LOOKS LIKE:
- tensions array has 3-8 entries for a complex query
- At least one consensus entry per major topic area
- openQuestions lists what Round 2 did NOT resolve (honesty)
- headline is one sentence, no hedging, no "it depends"

WHAT A BAD SYNTHESIS LOOKS LIKE:
- Empty or single-item tensions array on a complex topic
- Headline that begins with "It depends" or "Both perspectives..."
- Confidence scores all above 0.85 on contested empirical claims
- openQuestions is empty after a contested Round 2
`;
```
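Wiring the prompt into the synthesis step looks roughly like this. The post doesn't show the real `synthesize` implementation, so this is a sketch: the model transport is injected as a parameter (the production code would call the provider's streaming chat API there), and the fence-stripping regex is a defensive assumption, since models occasionally wrap JSON in markdown despite rule 3:

```typescript
// Sketch of the synthesis step: feed agent responses to the model
// under the orchestrator system prompt, forward chunks to the caller
// as they arrive, then parse the accumulated output as JSON.
type StreamModel = (
  system: string,
  user: string,
  onChunk: (chunk: string) => void
) => Promise<void>;

async function synthesizeWith(
  model: StreamModel,
  system: string,
  agentResponses: { agentId: string; text: string }[],
  onChunk: (chunk: string) => void
): Promise<unknown> {
  const user = agentResponses
    .map((r) => `### ${r.agentId}\n${r.text}`)
    .join("\n\n");
  let accumulated = "";
  await model(system, user, (chunk) => {
    accumulated += chunk; // keep the full text for the final JSON parse
    onChunk(chunk);       // forward to the SSE writer as it arrives
  });
  // Rule 3 demands raw JSON, but strip a stray markdown fence
  // defensively before parsing.
  const cleaned = accumulated.replace(/^```(?:json)?\s*|\s*```$/g, "");
  return JSON.parse(cleaned);
}
```

Injecting the transport keeps the accumulate-then-parse logic testable without a network call, and it's the same shape the streaming architecture below needs: chunks out immediately, a validated object at the end.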
The Streaming Architecture
The orchestrator can't start streaming until it has all 10 agent responses. That's a hard dependency — you can't detect clashes without all the inputs. But you also can't make the user wait 8–12 seconds staring at a blank screen. The two-phase streaming approach solves this without changing the underlying computation:
```typescript
async function streamOrchestratedResponse(message: string, env: Env) {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  const enc = new TextEncoder();
  const emit = (event: string, data: unknown) =>
    writer.write(enc.encode(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`));

  // Run the pipeline without awaiting it, so the Response (and the
  // readable half of the stream) reaches the client immediately.
  (async () => {
    // PHASE 1: Fan out to all 10 agents, stream summaries as they arrive
    // User sees content within ~300ms — not a blank screen for 10 seconds
    const agentPromises = AGENTS.map(id =>
      callAgent(id, message, env).then(response => {
        emit('agent_complete', { agentId: id, summary: response.summary });
        return response;
      })
    );

    // Wait for all 10 — allSettled so one failure doesn't block synthesis
    const settled = await Promise.allSettled(agentPromises);
    const responses = settled
      .filter((r): r is PromiseFulfilledResult<AgentResponse> =>
        r.status === 'fulfilled')
      .map(r => r.value);

    // PHASE 2: Clash detection + synthesis — begins after all 10 complete
    await emit('orchestrating', {
      message: 'Mapping tensions...',
      agentCount: responses.length,
    });
    const clashes = await detectClashes(responses, env);

    let finalResponses = responses;
    if (shouldTriggerRound2(clashes)) {
      await emit('round2_triggered', {
        clashes: clashes.filter(c => c.severity >= 6),
      });
      finalResponses = await runRound2(responses, clashes, env);
    }

    // Stream the tension map as it generates (SSE from Claude)
    const tensionMap = await synthesize(finalResponses, env, (chunk: string) => {
      writer.write(enc.encode(`event: synthesis_chunk\ndata: ${chunk}\n\n`));
    });

    // Final emit: validated tension map JSON
    if (!validateTensionMap(tensionMap)) {
      await emit('error', { code: 'INVALID_TENSION_MAP', retry: true });
    } else {
      await emit('tension_map', tensionMap);
    }
    await writer.close();
  })();

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}
```
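On the client side, those frames have to be reassembled from arbitrary network chunk boundaries. A sketch of a minimal SSE frame parser; the event names match the server code above, but the parser itself is an illustration, not code from the production client:

```typescript
// Sketch of a client-side parser for the orchestrator's SSE frames.
// Feed it raw text chunks from a streaming fetch body; it invokes the
// handler once per complete frame. Partial frames are buffered until
// the blank-line terminator (\n\n) arrives.
type SseHandler = (event: string, data: string) => void;

function createSseParser(onEvent: SseHandler) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    let boundary: number;
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const frame = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      let event = "message";
      let data = "";
      for (const line of frame.split("\n")) {
        if (line.startsWith("event: ")) event = line.slice(7);
        else if (line.startsWith("data: ")) data += line.slice(6);
      }
      onEvent(event, data);
    }
  };
}
```

The handler then switches on `agent_complete`, `orchestrating`, `round2_triggered`, `synthesis_chunk`, and `tension_map` to drive the UI: per-agent summaries first, the live synthesis text second, the validated map last.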
The Failure Mode
Three weeks after launching the orchestrator, I noticed the Round 2 trigger rate had dropped from 28% to under 5% over a four-day period. The system was still running. Agents were still responding. No errors in the logs.
The investigation: pulled random tension maps from that period and read them. The synthesis was clean. Almost too clean. Eight-agent queries with no major unresolved tensions. Confidence scores all above 0.85. The headline on one response: "All domain experts agree this represents a sound investment opportunity." Ten agents. Zero tension entries.
The cause was subtle. A prompt update to the synthesis system prompt had added a line intended to improve readability: "Prioritize producing a clear, actionable synthesis the user can act on immediately." That single instruction shifted the orchestrator's optimization target from "accurately represent the state of disagreement" to "produce something the user can act on." The model correctly inferred that a clean synthesis is more actionable than a messy tension map. So it produced clean syntheses. By suppressing the disagreements.
The fix: removed the readability instruction entirely, added the explicit anti-pattern rules now in the production prompt ("A bad synthesis looks like: empty tensions array on a complex topic"). Required minimum tension entries for queries above a complexity threshold. Added an automated check: if a query contains more than 800 tokens of agent responses and produces zero tension entries, flag for manual review.
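The automated check fits in a one-screen function. A sketch: the 800-token threshold comes from the text above, while the character-based token estimate and the review-flag shape are assumptions standing in for the production mechanism:

```typescript
// Flag suspiciously clean syntheses for manual review: a complex query
// (lots of agent material) producing zero tension entries is more
// likely a suppression failure than genuine unanimity.
const COMPLEXITY_TOKEN_THRESHOLD = 800;

// Rough token estimate (~4 characters per token). An assumption;
// production code would use the tokenizer's own count.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function needsManualReview(
  agentResponses: { text: string }[],
  tensionCount: number
): boolean {
  const totalTokens = agentResponses.reduce(
    (sum, r) => sum + estimateTokens(r.text),
    0
  );
  return totalTokens > COMPLEXITY_TOKEN_THRESHOLD && tensionCount === 0;
}
```

The check is deliberately crude. It doesn't judge whether the tensions are good, only whether their total absence is plausible given how much agent material went in.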
The lesson: the orchestrator's synthesis incentive must be truth-first, not clarity-first. If you optimize for readable output, you get readable lies. The tension map exists precisely because the world is complicated. Making it look simple is the failure mode.
Watch It Do Its Job
Ask the Consilium something genuinely hard. The tension map is visible in the output — you'll see exactly which agents are in conflict and why.
Open The Consilium