// Observation log: PropTechUSA.ai production systems, 2025–2026
// Subject: Prompt performance degradation across model version boundaries
// Status: Active, unresolved at industry level
That log is not pulled from a paper. It is a composite of what actually happens when you run prompts in production over time without a monitoring layer. The prompt didn't change. Your code didn't change. The model did — or more precisely, the relationship between your prompt and the model changed — and nothing in the pipeline surfaced it until the outputs were bad enough to notice.
§01 What Prompt Decay Actually Is
Prompt decay is the degradation of a prompt's effectiveness over time, without changes to the prompt itself. It has three distinct mechanisms, and they require different responses. Conflating them is the source of most bad remediation decisions.
One mechanism is silent model updates: gpt-4-0613 and gpt-4-1106-preview have meaningfully different behavior on the same prompts — behavior that is not documented in release notes. The effective mitigation is pinning to dated model snapshots wherever the API supports it.
§02 Why It's Silent
Prompt decay is dangerous not because it's inevitable — it is — but because it's invisible by default. There is no error thrown when a model update changes how your prompt performs. No alert fires when output quality drops 15% over three weeks. The pipeline returns a 200, the token count looks right, the response is in the correct format. It's just subtly, consistently worse.
The research is unambiguous on the scope: 91% of machine learning models degrade over time. Not as a risk — as an outcome. The question is not whether your prompts will decay. It's whether you'll catch it before the outputs are meaningfully wrong and the damage has propagated.
§03 Detection
You cannot manage what you don't measure. Prompt decay detection requires three layers, and none of them are optional once you're running prompts in production at scale.
Layer 1 — Pin Your Models
// ❌ Floating model reference — will silently shift behavior on provider update
const response = await callAPI({
  model: "claude-sonnet-latest", // "latest" = unknown behavior surface, no reproducibility
});

// ✅ Pinned model snapshot — deterministic, comparable, auditable
const response = await callAPI({
  model: "claude-sonnet-4-20250514", // Specific version = reproducible eval baseline
  // Upgrade is a deliberate decision, not a silent event
});

// In your config — never hardcode the model string in worker code
const MODEL_VERSIONS = {
  primary: "claude-sonnet-4-20250514",
  fallback: "claude-haiku-4-5-20251001",
  // When you upgrade: conscious decision, eval suite run first, changelog entry
};
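Worker code then resolves the model identifier from that config rather than naming it inline. A minimal sketch of the consuming side, assuming the same callAPI wrapper as above; the scoreLead caller and its input field are illustrative assumptions, not an established API:

// Hypothetical worker: the model string comes from MODEL_VERSIONS, never a literal in this file
async function scoreLead(leadText) {
  return callAPI({
    model: MODEL_VERSIONS.primary, // pinned snapshot; changing it means editing config, not worker code
    input: leadText,               // assumed request field on the callAPI wrapper
  });
}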
Layer 2 — Golden Set Evals
Build a set of reference prompt/output pairs that represent correct behavior. Run them on every deploy and on a weekly schedule against your production model config. Track the scores over time. A drop in golden set performance is your earliest decay signal — often weeks before the degradation is visible in production outputs.
const GOLDEN_SET = [
  {
    input: "Lead: Sarah M., motivated seller, house in foreclosure, 3BR",
    expected_keys: ['motivation_score', 'urgency', 'recommended_approach'],
    expected_format: 'json',
    motivation_score_range: [70, 100],
  },
  // ... 20-50 representative examples covering your key use cases
];

async function runEvals(modelVersion) {
  const results = [];
  for (const test of GOLDEN_SET) {
    const output = await runPrompt(modelVersion, test.input);
    const score = scoreOutput(output, test);
    results.push({ test, output, score });
  }
  const avg = results.reduce((s, r) => s + r.score, 0) / results.length;
  // Log to Supabase, alert if avg drops > 5% from baseline
  await logEvalRun({ modelVersion, avg, results, ts: Date.now() });
  return avg;
}
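runPrompt, scoreOutput, and logEvalRun are assumed helpers. A minimal sketch of what scoreOutput might check against the golden-set fields above — format validity, key presence, and score range — with weights that are purely illustrative:

// Illustrative scorer, not a prescribed one: returns 0..1 from the checks encoded in each entry
function scoreOutput(output, test) {
  let parsed;
  try {
    parsed = test.expected_format === 'json' ? JSON.parse(output) : output;
  } catch {
    return 0; // Unparseable output fails outright
  }
  const keysPresent = test.expected_keys.filter((k) => parsed[k] !== undefined).length;
  const keyScore = keysPresent / test.expected_keys.length;
  const [low, high] = test.motivation_score_range;
  const inRange = parsed.motivation_score >= low && parsed.motivation_score <= high ? 1 : 0;
  return 0.6 * keyScore + 0.4 * inRange;
}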
Layer 3 — Output Distribution Monitoring
Track the statistical properties of your production outputs over time — not just whether they look right, but whether their distribution is stable. If your lead scoring prompt typically returns motivation scores between 60–85 and suddenly the distribution shifts to 40–70 with no corresponding change in lead quality, that's a decay signal. Track embedding drift using cosine similarity against a baseline if your use case warrants it. For most production systems, simpler statistical checks (output length distribution, key presence rate, score range) catch 80% of decay events before they matter.
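A minimal sketch of those simpler checks, assuming production outputs are already logged somewhere queryable, each record carrying the raw response string plus a parsed JSON body, and a baseline captured from a known-good window. fetchRecentOutputs, BASELINE, and the thresholds are illustrative assumptions:

// Compare the last N production outputs against a stored baseline window.
// Flags the three cheap signals named above: length shift, key presence rate, score range drift.
const BASELINE = { meanLength: 412, keyPresenceRate: 0.98, scoreMean: 72 }; // from a known-good week

async function checkOutputDrift() {
  const recent = await fetchRecentOutputs(500); // assumed: pulls logged outputs from your store
  const meanLength = recent.reduce((s, o) => s + o.raw.length, 0) / recent.length;
  const keyPresenceRate =
    recent.filter((o) => o.parsed && o.parsed.motivation_score !== undefined).length / recent.length;
  const scoreMean =
    recent.reduce((s, o) => s + (o.parsed?.motivation_score ?? 0), 0) / recent.length;

  const alerts = [];
  if (Math.abs(meanLength - BASELINE.meanLength) / BASELINE.meanLength > 0.25) alerts.push('length_shift');
  if (keyPresenceRate < BASELINE.keyPresenceRate - 0.05) alerts.push('key_presence_drop');
  if (Math.abs(scoreMean - BASELINE.scoreMean) > 10) alerts.push('score_distribution_shift');

  if (alerts.length) await logEvalRun({ kind: 'drift_alert', alerts, ts: Date.now() });
  return alerts;
}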
Prompt decay is not a risk to manage.
It is an outcome to measure.
The question is never whether your prompts will decay. It's whether your eval layer catches it before your users do.
§04 How To Build For Decay
The architectural response to prompt decay is to treat prompts as production code — versioned, tested, and monitored with the same rigor as any other component that can break silently.
Never run latest or stable in production. Pin to dated model snapshots. Model upgrades are deliberate engineering decisions with eval runs, not automatic config updates. A minimal version of that workflow is sketched below.
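The sketch reuses runEvals from Layer 2 and MODEL_VERSIONS from Layer 1. The prompt registry shape and the 5% regression threshold are illustrative assumptions, not a prescribed standard:

// Prompts versioned alongside the pinned model they were last evaluated against.
const PROMPT_REGISTRY = {
  lead_scoring: {
    version: '2025-11-03',
    model: MODEL_VERSIONS.primary, // pinned snapshot from Layer 1
    template: 'You are a lead-qualification analyst. Score the following lead...',
  },
};

// Deploy gate: a prompt or model change ships only if the golden set holds its baseline.
async function gateDeploy(candidateModel, baselineAvg) {
  const avg = await runEvals(candidateModel); // golden-set run from Layer 2
  const regression = (baselineAvg - avg) / baselineAvg;
  if (regression > 0.05) {
    throw new Error(`Eval regression ${(regression * 100).toFixed(1)}%: blocking deploy`);
  }
  return avg; // becomes the new baseline, recorded in the changelog with the version bump
}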