The ten Consilium agents are distinguished by their system prompts. The system prompt is the agent. Change the prompt and you change the agent's behavior, its costs, its cache hit rate, its tendency to agree or disagree in Round 2. Getting it right is the most leveraged work in the entire stack — and the least glamorous.
The Six Sections. The Order Matters.
Every agent prompt has exactly six sections, in exactly this order. The order isn't stylistic — it's load-bearing. Everything that comes before the cache_control breakpoint gets cached. Anything dynamic that appears before that line destroys the cache for the entire prefix.
§1 — Identity & Role
You are Dr. Emilio Vasquez. You are the Consilium's behavioral economist and decision scientist. You do not provide investment advice. You analyze the decision-making frameworks, behavioral biases, and incentive structures at play in any situation presented to you.
Establishes the domain boundary. "Do not provide investment advice" lives here, not in section 4, because it's identity-level, not behavior-level.
§2 — Epistemic Fingerprint
Your intellectual signature: You approach every empirical claim by identifying its falsifiability conditions before evaluating its truth value. You have a strong prior that loss aversion is systematically underweighted in real estate investment analysis. You treat stated confidence without explicit uncertainty ranges as incomplete. You are deeply skeptical of availability heuristic arguments — claims that feel true because recent examples are salient. You name your analytical frameworks explicitly: prospect theory, hyperbolic discounting, status quo bias, endowment effects. You flag when a decision problem is being framed in a loss frame versus a gain frame, because the framing is often the most important variable.
This is the section that makes Vasquez sound like Vasquez. Generic → specific. "I analyze biases" → "I have a standing skepticism toward availability heuristic arguments."
§3 — Knowledge Domain
You know deeply: behavioral finance, prospect theory, decision theory under uncertainty, cognitive bias taxonomies, real estate investment psychology, market sentiment dynamics, anchoring effects in appraisal and valuation. You do not opine on: legal matters, tax strategy, physical property condition, local market comparables, macroeconomic forecasting. When a question falls outside your domain, you explicitly state the limit and name which agent domain covers it.
Domain limits are as important as domain knowledge. An agent who answers outside their domain contaminates the tension map — their out-of-domain opinion looks like expertise it isn't.
§4 — Output Format
Structure every response: (1) The decision framing — how is this situation being presented and what does the framing reveal. (2) The behavioral risk — what cognitive or incentive dynamics are most likely to distort good judgment here. (3) Your load-bearing claim — the one assertion that most affects the outcome if it's wrong. (4) Your uncertainty statement — where you are genuinely uncertain and why. You never open with agreement. You never summarize what other agents said before giving your own view.
"You never open with agreement" is doing specific work — it prevents sycophantic opening lines that inflate the Round 2 trigger threshold artificially.
§5 — Conflict Protocol
When your analysis conflicts with positions you know other agents commonly hold: state the conflict explicitly. Name which analytical framework produces your conclusion and which produces theirs. Do not soften genuine disagreements with diplomatic language. A severity-8 disagreement stated at severity-3 is a lie. If you are wrong, you want to be corrected — not agreed with preemptively.
This section exists because models default to conflict avoidance. Without it, Vasquez hedges toward consensus. With it, his Round 2 rate stays above 18%.
⟶ cache_control: ephemeral breakpoint — everything above is read from cache at $0.30/1M tokens ⟵
§6 — Dynamic Context (below breakpoint — NOT cached)
Current query context: {{message}}
Consilium session ID: {{sessionId}}
Round: {{round}}
Everything dynamic lives here. The breakpoint separates the 2,200-token static investment from the small dynamic variable payload.
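Wired into an API call, the split looks something like this. A minimal sketch assuming the Anthropic TypeScript SDK; the section placeholders and the askVasquez helper are illustrative, not the production code:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// §1–§5, concatenated in order. The assembled string must be
// byte-identical on every call, or the cached prefix is invalidated.
const STATIC_PREFIX = [
  "You are Dr. Emilio Vasquez. ...",    // §1 Identity & Role
  "Your intellectual signature: ...",   // §2 Epistemic Fingerprint
  "You know deeply: ...",               // §3 Knowledge Domain
  "Structure every response: ...",      // §4 Output Format
  "When your analysis conflicts: ...",  // §5 Conflict Protocol
].join("\n\n");

async function askVasquez(message: string, sessionId: string, round: number) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: STATIC_PREFIX,
        // The breakpoint: everything in this block gets cached.
        cache_control: { type: "ephemeral" },
      },
    ],
    // §6: every dynamic field lives below the breakpoint.
    messages: [
      {
        role: "user",
        content: `Session ${sessionId}, round ${round}:\n${message}`,
      },
    ],
  });
}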
Why Section Order Is Load-Bearing
Token cost by section: cached vs. uncached
The cache breakpoint splits the prompt into two cost zones. Everything above: cached at $0.30/1M tokens on reads after a one-time cache write. Everything below: charged at $3.00/1M tokens on every call. A 2,240-token static section costs about $0.0067 per call uncached, but only $0.00067 per cached read. Over 1,000 calls, that's $0.67 in reading costs instead of $6.72. The math compounds fast at scale.
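The arithmetic as a runnable sanity check, using the rates and token count quoted above:

// Cost of reading the static prefix over N calls, cached vs. uncached.
const PREFIX_TOKENS = 2_240;
const UNCACHED_RATE = 3.0; // $ per 1M input tokens
const CACHED_RATE = 0.3;   // $ per 1M cached-read tokens

function prefixReadCost(calls: number, ratePerMTok: number): number {
  return (PREFIX_TOKENS / 1_000_000) * ratePerMTok * calls;
}

console.log(prefixReadCost(1_000, CACHED_RATE).toFixed(2));   // "0.67"
console.log(prefixReadCost(1_000, UNCACHED_RATE).toFixed(2)); // "6.72"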
The Mistake That Costs Everything
Any dynamic field placed before the breakpoint — a timestamp, a user name, a session ID, a "current date" placeholder left in from debugging — invalidates the entire cached prefix and forces a full re-write on every call. The model has to re-read all 2,240 tokens at full input cost. This was the bug that gave Dr. Cross a 12% cache hit rate in the Post 5 war story. One {{TODAY}} field. Six weeks at roughly 8× the correct cost. You will not see it without per-agent cache metrics.
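This class of bug is cheap to lint for. A hypothetical guard, not from the Consilium codebase, that fails fast if any {{...}} placeholder survives above the breakpoint:

// Fail the build if anything that looks like a template variable is
// still in the static prefix. Run at startup and in CI.
function assertFullyStatic(prefix: string): void {
  const leaks = prefix.match(/\{\{\s*\w+\s*\}\}/g);
  if (leaks) {
    throw new Error(`Dynamic field(s) above the breakpoint: ${leaks.join(", ")}`);
  }
}

assertFullyStatic(STATIC_PREFIX); // would have caught {{TODAY}} on day one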
Good vs. Bad Epistemic Fingerprint
The fingerprint section is where most prompt writing fails. The mistake is describing a role instead of describing a reasoning process. A role tells the model what it is. A reasoning process tells the model how it thinks.
Bad (role description):
You are an expert behavioral economist with deep knowledge of real estate markets.
You analyze cognitive biases and decision-making patterns.
You provide balanced, thoughtful analysis of behavioral factors.
Good (reasoning process):
You approach every empirical claim by identifying its falsifiability conditions before evaluating its truth value.
You have a strong prior that loss aversion is systematically underweighted in real estate investment analysis.
You treat stated confidence without explicit uncertainty ranges as incomplete.
You are deeply skeptical of availability heuristic arguments — claims that feel true because recent examples are salient.
You name your analytical frameworks explicitly: prospect theory, hyperbolic discounting, status quo bias, endowment effects.
You flag when a decision problem is being framed in a loss frame versus a gain frame.
Net change: +3 sentences, +180 tokens. Round 2 trigger rate for Vasquez: 4% → 24%.
A role description tells the model what it is. A reasoning process tells the model how it thinks. Only one of them produces consistently differentiated output when ten agents receive the same question.
— Justin Erickson
The Version History of a Real Prompt
Vasquez's prompt went through seven versions over four months. Token count peaked at v4 and came back down: v7 runs 180 tokens leaner than that peak. Here's what each version changed and why:
v1
Launch — 1,680 tokens
Generic role description. No named frameworks. No conflict protocol. Vasquez was indistinguishable from the other agents on 70% of questions. Round 2 rate: 4%
v2
Added named frameworks — +240 tokens
Prospect theory, hyperbolic discounting, endowment effect added to fingerprint section. Immediate improvement — Vasquez started citing frameworks explicitly. Round 2 rate: 11%
v3
Added conflict protocol — +180 tokens
Explicit instruction not to soften genuine disagreements. Removed diplomatic hedging from the output format section. Round 2 rate: 18%
v4
Overcorrection — +320 tokens
Added too many specific behavioral biases. Vasquez started flagging bias everywhere regardless of relevance. Output became mechanical: a bias taxonomy rather than an analysis. Round 2 rate: 38% — too high, mostly noise
v5
Pruning — -480 tokens
Removed half the specific examples. Replaced with the falsifiability conditions heuristic (one sentence, does more work than six specific examples). Removed redundant hedging instructions. Round 2 rate: 24% — healthy
v6
Cache fix — structural reorder
Moved domain limits (§3) before output format (§4). Discovered a {{TODAY}} field in §2 that had been there since v2 — fixed, moved below breakpoint. Cache hit rate: 14% → 89%.
v7
Current — 2,240 tokens (-180 from the v4 peak)
Tightened load-bearing claim instruction. Added explicit "which agent domain covers this" instruction for out-of-scope questions. Removed the word "balanced" from the entire prompt — it was causing false moderation. Final state: stable at v7 for 3 months
How We Test Prompt Changes
testing/prompt-battery.ts — 10-question test battery
Test Protocol
// Run before and after EVERY prompt change.
// A prompt change that doesn't move any of these metrics didn't do anything.
const TEST_BATTERY = [
// 1-2: Disagreement with mainstream consensus
// Agent should push back, not validate
"Most real estate investors I talk to believe location is the #1 factor. Do you agree?",
"Rising interest rates always reduce real estate demand. Walk me through why.",
// 3-4: Genuine uncertainty expression
// Agent should say "I don't know" not perform confidence
"What will cap rates look like in 18 months?",
"Is this a buyer's market or seller's market right now in secondary cities?",
// 5-6: Peripheral concern flagging
// Agent should find something no other agent would flag
"The seller accepted our offer below ask. Is that a good sign?",
"We're thinking of listing the property quickly after closing to lock in gains.",
// 7-8: Named framework citation
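// Agent should cite a named framework, not gesture at generic "psychology"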
"Why do people overpay for real estate?",
"The neighborhood feels like it's turning around. Should I invest now?",
// 9-10: Adversarial convergence pressure
// Agent should resist being pushed toward agreement
"I've talked to three other advisors and they all say this deal is sound. Doesn't that validate it?",
"The economist agent just said this is a strong opportunity. Do you agree?",
];
// What to measure:
const METRICS = {
disagreementRate: '% of Q1-2 responses that push back meaningfully',
uncertaintyRate: '% of Q3-4 that express genuine uncertainty vs performing confidence',
frameworkCitations: 'avg named frameworks cited per response',
convergenceResistance: '% of Q9-10 that resist agreement pressure',
r2TriggerDelta: 'change in Round 2 trigger rate over 48h post-change',
};
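Collecting the responses is mechanical; the metrics above are scored against them afterward. A sketch of a collection harness, assuming the same SDK client as the earlier example (runBattery is illustrative, not the real prompt-battery.ts):

// Run the battery against one version of a system prompt and return
// raw responses for scoring against METRICS.
async function runBattery(systemPrompt: string): Promise<string[]> {
  const responses: string[] = [];
  for (const question of TEST_BATTERY) {
    const res = await anthropic.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      system: [
        { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
      ],
      messages: [{ role: "user", content: question }],
    });
    responses.push(
      res.content.map((b) => (b.type === "text" ? b.text : "")).join("")
    );
  }
  return responses;
}

// Same battery, both versions: diff the metrics, not the vibes.
// const before = await runBattery(PROMPT_V6);
// const after  = await runBattery(PROMPT_V7);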
The Metric That Catches Silent Regressions
Watch Round 2 trigger rate for 48 hours after any prompt change. A sustained drop of more than 6 percentage points means the fingerprint degraded — the agent started converging with others. A spike above 45% means the agent is generating noise disagreements. Neither is visible in any individual response — you need the aggregate signal. This is why observability (Post 5) is prerequisite to responsible prompt engineering.
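In code terms, the check is a simple windowed rate comparison. A hypothetical monitor, using the thresholds from the paragraph above:

interface RoundRecord {
  timestamp: number;       // ms epoch
  triggeredRound2: boolean;
}

function round2Rate(records: RoundRecord[]): number {
  return records.length === 0
    ? 0
    : records.filter((r) => r.triggeredRound2).length / records.length;
}

// Compare the 48h after a prompt change against the baseline before it.
function checkPromptRegression(records: RoundRecord[], deployedAt: number): string {
  const WINDOW_MS = 48 * 60 * 60 * 1000;
  const before = records.filter((r) => r.timestamp < deployedAt);
  const after = records.filter(
    (r) => r.timestamp >= deployedAt && r.timestamp < deployedAt + WINDOW_MS
  );
  const delta = round2Rate(after) - round2Rate(before);
  if (delta <= -0.06) return "regression: fingerprint degraded, agent is converging";
  if (round2Rate(after) > 0.45) return "regression: noise disagreements";
  return "healthy";
}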
The One Rule That Governs Everything
After seven versions of ten prompts, one rule has proven more reliable than any other: if you can't explain why every sentence is there, one of those sentences is the bug.
v4 of the Vasquez prompt had six sentences about specific cognitive biases. They all seemed relevant. But they were producing a mechanical output — Vasquez was running through the list rather than reasoning. The fix was v5: replaced all six with one meta-instruction about falsifiability conditions. One sentence that teaches the reasoning process rather than six sentences listing the outputs.
The same principle applies to the conflict protocol, the output format, the domain limits. Every time behavior goes wrong, the instinct is to add an instruction. The better diagnosis is usually: find the instruction that's contradicting the instruction you added, and rewrite both. The best prompts get shorter over time, not longer.
From the v4 peak to v7, the result was fewer tokens and better behavior. Every sentence you can't explain is a sentence that's probably hurting you.
— Justin Erickson
Frequently Asked
What should a production AI agent system prompt contain?
Six sections in this order: (1) Identity and role — who this agent is and what domain they own. (2) Epistemic fingerprint — specific reasoning style, intellectual biases, cognitive patterns. (3) Knowledge domain — what the agent knows deeply and what falls outside scope. (4) Output format — how the agent structures responses. (5) Conflict protocol — how to handle agreement pressure, how to signal uncertainty. (6) Dynamic context — the query, session, and round variables, placed below the cache_control breakpoint. Everything before the breakpoint must be fully static — no dynamic fields, no timestamps, no variables.
Why does section order matter?
Prompt caching treats everything before the cache_control breakpoint as static and caches it — subsequent reads cost 10% of normal. Any dynamic content before the breakpoint invalidates the cache on every call and forces a full re-write. Section order determines what gets cached. Identity, fingerprint, domain, format, conflict protocol are all static — they go first. Dynamic context (the actual query, session ID, round number) goes after the breakpoint or in the messages array. One misplaced dynamic field can run you at 10× the correct cost for months without a visible error.
What is the difference between a good and bad epistemic fingerprint?
A bad fingerprint describes a role: "You are an economist who analyzes real estate." A good fingerprint describes a reasoning process: "You approach every empirical claim by identifying its falsifiability conditions. You have a standing prior that loss aversion is underweighted in real estate analysis. You name your frameworks explicitly." Role descriptions produce generic outputs. Reasoning process descriptions produce consistently differentiated outputs even when ten agents receive identical questions. Vasquez's Round 2 trigger rate went from 4% to 24% when v1 role description was replaced with v5 reasoning process description.
How long should an agent system prompt be?
1,800–2,400 tokens for Consilium-style domain experts. Under 1,500 and the fingerprint isn't differentiated enough — agents converge. Over 3,000 and the prompt contradicts itself; the model cherry-picks instructions and behaviors become unpredictable. The Vasquez prompt started at 1,680 tokens (v1) and ended at 2,240 (v7) with net better behavior. The most common mistake is adding instructions when behavior is wrong rather than finding and rewriting the contradiction that caused the wrong behavior.
How do you test whether a prompt change worked?
Run a 10-question test battery before and after every change: two questions where the agent should disagree with mainstream consensus, two where genuine uncertainty should appear, two testing peripheral concern detection, two testing named framework citation, two adversarial convergence questions. Measure disagreement rate, uncertainty expression rate, framework citation frequency, and convergence resistance. Then watch Round 2 trigger rate for 48 hours — a sustained drop signals fingerprint degradation, a sustained spike signals noise generation.
What's the single most common prompt engineering mistake?
Adding instructions when behavior is wrong. The reflex is: agent did something bad → add a rule prohibiting it. The problem: the rule you add is probably contradicting an existing rule, and the contradiction produces unpredictable behavior on edge cases. The better reflex: find the instruction that caused the wrong behavior, identify what it's contradicting, rewrite both. Almost every well-engineered prompt gets shorter with each pruning pass. v4 of the Vasquez prompt was the longest version; every version since has been shorter, and better.