Engineering · Failure Mode Documentation · March 2026

Prompt Decay

Your prompts stopped working and nobody told you. A documented failure mode: the prompt that shipped fine last month, running on the same codebase, quietly producing worse outputs today.

Phenomenon: Prompt Decay / Model Drift
Visibility: Silent by default
Trigger: Model update · Prompt complexity · Time
Industry rate: 91% of ML models degrade over time

// Observation log: PropTechUSA.ai production systems, 2025–2026
// Subject: Prompt performance degradation across model version boundaries
// Status: Active, unresolved at industry level

// Simulated Production Observation Log — Prompt: lead-analysis-v3 · Status: DEGRADING
2025-08-14 09:12 · 94.1% · Baseline. Prompt deployed on model v2.1. Outputs clean, structured, consistent.
2025-09-02 14:37 · 93.8% · Stable. No changes to prompt or system. Model unchanged.
2025-10-18 08:55 · 87.3% · Unexplained drop. Provider silently updated model weights. No changelog published.
2025-11-03 11:20 · 82.1% · Formatting instructions partially ignored. Multi-task prompt showing task interference.
2025-12-01 16:44 · 71.4% · Major version upgrade to v3.0. Output structure changed. Zero warning from provider.
2026-01-15 09:08 · 68.9% · Drift confirmed. Prompt rewrite required. 3 days of engineering time to remediate.

That log is not pulled from a paper. It is a composite of what actually happens when you run prompts in production over time without a monitoring layer. The prompt didn't change. Your code didn't change. The model did — or more precisely, the relationship between your prompt and the model changed — and nothing in the pipeline surfaced it until the outputs were bad enough to notice.

// Figure: Prompt Performance Over Time — schematic of the silent degradation pattern with model-update cliff events

§01 What Prompt Decay Actually Is

Prompt decay is the degradation of a prompt's effectiveness over time, without changes to the prompt itself. It has three distinct mechanisms, and they require different responses. Conflating them is the source of most bad remediation decisions.

I · Model Version Drift
A model provider silently updates weights, alignment tuning, or RLHF calibration. The same prompt now runs on a different underlying system. Major version bumps are announced; minor weight updates are often not. gpt-4-0613 and gpt-4-1106-preview have meaningfully different behavior on the same prompts — behavior that is not documented in release notes. The effective mitigation is pinning to dated model snapshots wherever the API supports it.
II · Multi-Task Complexity Collapse
As you add tasks to a prompt — formatting + extraction + classification + summarization — performance degrades. A November 2025 study across six LLM families showed this degradation is architecture-dependent, not universal: some models showed positive transfer (adding tasks helped), others collapsed. The finding that breaks most assumptions: model size does not predict multitask robustness. A smaller model with the right architecture can outperform a larger one on complex prompts.
III · Inherent Stochastic Instability
Even on frozen prompts and frozen models, LLM outputs are stochastic. Research consistently shows correctness and consistency are only weakly correlated: models answer correctly but inconsistently, or confidently but incorrectly. On identical prompts across repeated calls, a model can alternate between correct and incorrect responses — not due to any external change, but as a structural property of probabilistic inference. Refusal behavior has been shown to flip on identical prompts across random seeds. This is not a bug. It is the system.
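Mechanism III is measurable before any drift question arises. A minimal sketch of a consistency probe: run the identical prompt N times on a pinned model and track the agreement rate. The callAPI helper is the one used in the Layer 1 examples below; normalize() is a hypothetical stand-in for whatever canonicalization your output format needs.

// Consistency probe: same prompt, same pinned model, N repeated calls.
// normalize() is a hypothetical app-specific canonicalizer, e.g. parse
// JSON and re-serialize with sorted keys. Adjust res.text to your
// callAPI response shape.
async function consistencyProbe(prompt, n = 10) {
  const outputs = [];
  for (let i = 0; i < n; i++) {
    const res = await callAPI({ model: "claude-sonnet-4-20250514", prompt });
    outputs.push(normalize(res.text));
  }
  // Agreement rate: fraction of calls matching the most common output
  const counts = new Map();
  for (const o of outputs) counts.set(o, (counts.get(o) || 0) + 1);
  return Math.max(...counts.values()) / n; // 1.0 = fully consistent
}

An agreement rate of 0.7 on a frozen model is not a drift event. It is your floor. Alert on movement relative to that floor, not on inconsistency itself.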

§02 Why It's Silent

Prompt decay is dangerous not because it's inevitable — it is — but because it's invisible by default. There is no error thrown when a model update changes how your prompt performs. No alert fires when output quality drops 15% over three weeks. The pipeline returns a 200, the token count looks right, the response is in the correct format. It's just subtly, consistently worse.

⚠️ The most dangerous period is the 2–6 weeks after a silent model update. Outputs are degraded enough to matter but not degraded enough to trigger obvious failures. Downstream systems accumulate bad data. Evaluations that ran fine last month now run on a different model with no flag. You won't know until something downstream breaks hard enough to trace back.

The research is unambiguous on the scope: 91% of machine learning models degrade over time. Not as a risk — as an outcome. The question is not whether your prompts will decay. It's whether you'll catch it before the outputs are meaningfully wrong and the damage has propagated.

§03 Detection

You cannot manage what you don't measure. Prompt decay detection requires three layers, and none of them are optional once you're running prompts in production at scale.

Layer 1 — Pin Your Models

Bad practice vs. good practice
// ❌ Floating model reference — will silently shift behavior on provider update
const response = await callAPI({
  model: "claude-sonnet-latest",
  // "latest" = unknown behavior surface, no reproducibility
});

// ✅ Pinned model snapshot — deterministic, comparable, auditable
const response = await callAPI({
  model: "claude-sonnet-4-20250514",
  // Specific version = reproducible eval baseline
  // Upgrade is a deliberate decision, not a silent event
});

// In your config — never hardcode the model string in worker code
const MODEL_VERSIONS = {
  primary: "claude-sonnet-4-20250514",
  fallback: "claude-haiku-4-5-20251001",
  // When you upgrade: conscious decision, eval suite run first, changelog entry
};
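Pinning also buys a cheap runtime assertion. Most provider APIs echo the resolved model identifier on the response object (OpenAI-style and Anthropic-style responses both carry a model field). Assuming callAPI passes that field through, you can fail loudly the moment what ran is not what you pinned:

// Runtime guard: assumes the response carries a `model` field echoing
// the resolved model identifier, as major provider APIs do
const response = await callAPI({ model: MODEL_VERSIONS.primary });
if (response.model && response.model !== MODEL_VERSIONS.primary) {
  // Mismatch: an alias resolved differently or a provider-side swap happened
  console.error(`Pinned ${MODEL_VERSIONS.primary}, got ${response.model}`);
  throw new Error("Model mismatch: output cannot be trusted against eval baselines");
}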

Layer 2 — Golden Set Evals

Build a set of reference prompt/output pairs that represent correct behavior. Run them on every deploy and on a weekly schedule against your production model config. Track the scores over time. A drop in golden set performance is your earliest decay signal — often weeks before the degradation is visible in production outputs.

eval-runner/index.js — minimal golden set pattern
const GOLDEN_SET = [
  {
    input: "Lead: Sarah M., motivated seller, house in foreclosure, 3BR",
    expected_keys: ['motivation_score', 'urgency', 'recommended_approach'],
    expected_format: 'json',
    motivation_score_range: [70, 100],
  },
  // ... 20-50 representative examples covering your key use cases
];

// runPrompt, scoreOutput, and logEvalRun are app-specific helpers
// (scoreOutput is sketched after this block)
async function runEvals(modelVersion) {
  const results = [];
  for (const test of GOLDEN_SET) {
    const output = await runPrompt(modelVersion, test.input);
    const score = scoreOutput(output, test);
    results.push({ test, output, score });
  }
  const avg = results.reduce((s, r) => s + r.score, 0) / results.length;
  // Log to Supabase, alert if avg drops > 5% from baseline
  await logEvalRun({ modelVersion, avg, results, ts: Date.now() });
  return avg;
}
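scoreOutput is app-specific. One plausible shape, scoring against the golden set fields defined above (format validity, key presence, score range) with each check weighted equally into a 0–1 result; treat the weighting as a placeholder for your own rubric:

// Hypothetical scorer for the GOLDEN_SET shape above; each check
// contributes equally to a 0-1 score
function scoreOutput(output, test) {
  const checks = [];
  let parsed = null;
  if (test.expected_format === 'json') {
    try { parsed = JSON.parse(output); checks.push(1); }
    catch { checks.push(0); } // unparseable output fails the format check
  }
  if (parsed && test.expected_keys) {
    const present = test.expected_keys.filter((k) => k in parsed).length;
    checks.push(present / test.expected_keys.length);
  }
  if (parsed && test.motivation_score_range) {
    const [lo, hi] = test.motivation_score_range;
    const v = parsed.motivation_score;
    checks.push(typeof v === 'number' && v >= lo && v <= hi ? 1 : 0);
  }
  return checks.length ? checks.reduce((s, c) => s + c, 0) / checks.length : 0;
}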

Layer 3 — Output Distribution Monitoring

Track the statistical properties of your production outputs over time — not just whether they look right, but whether their distribution is stable. If your lead scoring prompt typically returns motivation scores between 60–85 and suddenly the distribution shifts to 40–70 with no corresponding change in lead quality, that's a decay signal. Track embedding drift using cosine similarity against a baseline if your use case warrants it. For most production systems, simpler statistical checks (output length distribution, key presence rate, score range) catch 80% of decay events before they matter.
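A sketch of those simpler checks. The BASELINE values are hypothetical placeholders; measure your own during a known-good window and store them alongside the model version they were measured on:

// Distribution drift check over a rolling window of production outputs.
// BASELINE values are hypothetical: calibrate on a known-good period.
const BASELINE = { meanScore: 72.5, keyRate: 0.98, meanLength: 410 };

// window: array of { parsed, rawLength } from recent production calls
function distributionCheck(window) {
  if (window.length === 0) return { drift: false };
  const scores = window
    .map((w) => w.parsed?.motivation_score)
    .filter((v) => typeof v === 'number');
  const meanScore = scores.reduce((s, v) => s + v, 0) / (scores.length || 1);
  const keyRate =
    window.filter((w) => w.parsed && 'motivation_score' in w.parsed).length / window.length;
  const meanLength = window.reduce((s, w) => s + w.rawLength, 0) / window.length;

  const drift =
    Math.abs(meanScore - BASELINE.meanScore) > 8 ||   // score distribution shifted
    keyRate < BASELINE.keyRate - 0.05 ||              // output structure degrading
    Math.abs(meanLength - BASELINE.meanLength) > 150; // verbosity changed

  return { meanScore, keyRate, meanLength, drift };
}

Run it on a schedule against the last few hundred calls. The thresholds are crude by design: this is an early tripwire, not a statistical proof.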

Prompt decay is not a risk to manage. It is an outcome to measure.

The question is never whether your prompts will decay. It's whether your eval layer catches it before your users do.

§04 How To Build For Decay

The architectural response to prompt decay is to treat prompts as production code — versioned, tested, and monitored with the same rigor as any other component that can break silently.

📌 Pin model versions explicitly
Never use floating references like latest or stable in production. Pin to dated model snapshots. Model upgrades are deliberate engineering decisions with eval runs, not automatic config updates.
🧪 Build a golden eval set before you need it
20–50 representative prompt/output pairs per critical prompt. Run them weekly. The first week you build this it feels like overhead. The first time it catches a decay event before production, it pays back tenfold.
📊 Track output distribution, not just output correctness
Correctness is binary and expensive to evaluate at scale. Distribution drift is statistical and cheap. Track output length, score ranges, key presence rates, format compliance. Drift in these signals precedes correctness failures.
🔀 Separate multi-task prompts before you hit the complexity cliff
If your prompt is doing extraction + classification + formatting + scoring, split it (see the sketch after this list). Architecture-dependent complexity collapse is not predictable without testing. Simpler prompts have more stable decay curves and cheaper remediation when they do decay.
📝 Version control your prompts like code
Prompts live in version control, not in database fields or hardcoded strings. A prompt change is a commit with a description. A model upgrade is a PR with eval results attached. This sounds like process overhead until a prompt change breaks production and you need to roll back.
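The split-prompt sketch referenced in the list above. Stage names and the runStage and formatReport helpers are hypothetical; the point is that each stage gets its own golden set and its own decay curve, so remediation touches one small prompt instead of one monolith:

// Hypothetical decomposition of a monolithic lead-analysis prompt
// into chained single-task stages
async function analyzeLead(leadText) {
  const fields = await runStage("extract-lead-fields-v1", leadText);
  const motivation = await runStage("classify-motivation-v1", fields);
  const urgency = await runStage("score-urgency-v1", fields);
  return formatReport(fields, motivation, urgency); // plain code, no LLM call
}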
Justin Erickson — PropTechUSA.ai
GED (juvenile detention) · Self-taught · Running 87 workers that all have this problem · March 2026