§01 Why The Framing Is Already Wrong
The question most engineering teams bring to this decision is: "Fine-tuning or prompting — which performs better?" That framing assumes you're choosing a single tool for all time, and its answer is always "it depends," which is useless. The right question is: what kind of intelligence does my task require, and where should that intelligence live?
Two kinds of intelligence matter here. Volatile knowledge — facts, documents, current state of the world — changes. It needs to be updatable without retraining. It belongs in retrieval. Stable behavior — output format, tone, domain vocabulary, task-specific reasoning patterns — is consistent. It can be baked into weights. The mistake most teams make is trying to use fine-tuning to teach facts and prompting to enforce behavior. Both are backwards.
§02 What Each Fighter Actually Does
Prompting
Shape model behavior entirely at inference time — no weight changes, no training data, no GPU. Instructions, examples, constraints, personas: all delivered in the prompt. Iterate in minutes. Swap approaches without redeployment. The model remains fully general — it can handle diverse tasks in the same call. Inference cost scales with prompt length. For knowledge bases under roughly 200,000 tokens, full-context prompting with prefix caching can be faster and cheaper than building an entire retrieval pipeline — Anthropic has noted this explicitly as an architecture simplifier for internal copilots.
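A minimal sketch of the full-context-plus-caching pattern, assuming the shape of the Anthropic Messages API's prompt caching (the model name, document text, and token figures are placeholders, and no API call is made here):

```python
# Sketch: full-context prompting with a cached prefix instead of a RAG pipeline.
# The whole knowledge base rides in the system prompt once; "cache_control"
# marks it as a cacheable prefix so repeated calls don't re-pay for those tokens.

KNOWLEDGE_BASE = "placeholder for the entire docs corpus, well under ~200k tokens"

def build_request(question: str) -> dict:
    return {
        "model": "claude-model-placeholder",   # placeholder, not a real model ID
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "Answer strictly from the documents provided."},
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # cacheable prefix boundary
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }

req = build_request("What is the refund window?")
```

Every call re-sends the same prefix, so only the short trailing question varies — which is exactly what makes the cache effective.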
✓ iterate fast · ✓ no training data · ✓ one model, many tasks

Fine-Tuning
Update model weights using task-specific examples. The target behavior becomes internalized — the model doesn't need to be told how to respond on every call. The payoff: lower inference cost at scale (shorter prompts), more consistent output structure, a brand voice that can't drift. PEFT methods like LoRA have made this broadly accessible — you no longer need to update every parameter. The catch: you need a minimum of 100–1,000 quality examples, a hold-out eval set, and iteration cycles measured in days, not minutes. Fine-tuning a chatbot on customer support tickets may make it worse at adjacent tasks. Task scope must be narrow and stable.
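The low-rank trick behind LoRA can be sketched in a few lines of NumPy: freeze the pretrained weight and train only two small matrices whose product is the update (toy dimensions here; real adapters wrap attention projections inside the transformer):

```python
import numpy as np

d, r = 512, 8                      # model dim vs. adapter rank (toy values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))    # frozen pretrained weight: never updated
A = rng.standard_normal((r, d))    # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init so W' = W at start

def forward(x):
    # Effective weight is W + B @ A, but the full d x d update is never materialized.
    return x @ W.T + (x @ A.T) @ B.T

full_params = d * d                # what full fine-tuning would touch
lora_params = 2 * d * r            # what the adapter actually trains
```

With these toy numbers the adapter trains about 3% of the parameters full fine-tuning would, which is what makes single-GPU fine-tuning practical.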
✓ consistent behavior · ✓ lower latency at scale · ✓ internalized style

§03 Round-by-Round Scorecard
§04 The Decision Framework
Before reaching for fine-tuning, run through these gates in order. Each is a reason not to fine-tune yet. If you pass all of them — then fine-tuning is the right fight:
Gate 1: Have you exhausted prompting?
Few-shot examples, chain-of-thought, explicit format constraints, persona framing. High-capability modern models respond remarkably well to prompt engineering. If you haven't seriously invested in prompt iteration, you don't have a fine-tuning problem — you have a prompt problem. Fine-tuning before exhausting prompting is skipping the cheap option for the expensive one.
→ If NO: stop here. Iterate on prompts first.

Gate 2: Is the knowledge volatile or the behavior stable?
If what you need the model to "know" changes — pricing, docs, recent events, customer data — fine-tuning is the wrong tool. That's retrieval. Fine-tuning is for stable patterns: output format, reasoning approach, domain-specific vocabulary, consistent tone. If the answer to "will this change in three months?" is yes, it doesn't belong in weights.
→ If VOLATILE: use RAG. → If STABLE: continue to gate 3.

Gate 3: Do you have the data and eval infrastructure?
Fine-tuning without a curated, labeled dataset produces a model that's confidently inconsistent. The training set must closely mirror actual production inputs. You need a hold-out eval set separate from training, and you need to be able to measure whether the fine-tuned version actually beats the prompted baseline. Without that infrastructure, fine-tuning produces unmeasured change — which is worse than no change.
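A toy illustration of that infrastructure, with stand-in lambdas in place of real inference calls (the example data and both "models" are fabricated for the sketch):

```python
import random

def evaluate(model, holdout):
    # Accuracy on examples neither model configuration was trained on.
    correct = sum(model(x) == y for x, y in holdout)
    return correct / len(holdout)

# Fabricated labeled set: 500 (input, label) pairs with three label classes.
examples = [(f"ticket-{i}", f"label-{i % 3}") for i in range(500)]
random.Random(42).shuffle(examples)
train, holdout = examples[:400], examples[400:]   # hold-out never used for training

prompted_model  = lambda x: "label-0"             # stand-in prompted baseline
finetuned_model = lambda x: f"label-{int(x.split('-')[1]) % 3}"  # stand-in fine-tune

baseline_acc = evaluate(prompted_model, holdout)
tuned_acc    = evaluate(finetuned_model, holdout)
ship_finetune = tuned_acc > baseline_acc          # gate 3's actual decision
```

The point is the shape, not the numbers: the comparison against the prompted baseline on held-out data is the artifact you must have before training starts.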
→ If NO: build the dataset first. Don't skip this.

Gate 4: Is the task narrow?
Fine-tuning a model on a narrow task reduces its capability on adjacent tasks. A model fine-tuned on customer support ticket classification may become measurably worse at writing emails or handling edge-case questions. If your application needs the model to handle diverse inputs, fine-tuning collapses the very generality you're depending on. Fine-tuning is a sharpening tool — it works best on a single blade, not a Swiss Army knife.
→ If MULTI-TASK: use a prompted general model. → If NARROW: continue.

Gate 5: Does your volume justify the cost?
Fine-tuning's cost advantage materializes at scale: when you're running millions of inferences with long system prompts, baking that behavior into weights cuts per-call cost significantly. At early scale — thousands of calls per day — the inference savings rarely offset the training and maintenance overhead. Run the numbers: training cost plus ongoing maintenance versus inference savings at your actual volume. The crossover point is higher than most teams expect.
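A back-of-the-envelope version of that calculation; every price and token count below is an illustrative assumption, not vendor pricing:

```python
# Fine-tuning pays off only when per-call savings from a shorter prompt
# outweigh the one-off training cost plus ongoing maintenance.

PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed $ per 1k prompt tokens
PROMPT_TOKENS_BASELINE    = 3000    # long system prompt + few-shot examples
PROMPT_TOKENS_TUNED       = 300     # behavior baked into weights
TRAINING_COST             = 500.0   # assumed one-off fine-tuning run, $
MONTHLY_MAINTENANCE       = 200.0   # assumed re-tuning, evals, dataset upkeep, $

saving_per_call = ((PROMPT_TOKENS_BASELINE - PROMPT_TOKENS_TUNED) / 1000
                   * PRICE_PER_1K_INPUT_TOKENS)

def net_over_year(calls_per_month: int, months: int = 12) -> float:
    # Positive means fine-tuning saved money over the horizon.
    total_saving = saving_per_call * calls_per_month * months
    total_cost = TRAINING_COST + MONTHLY_MAINTENANCE * months
    return total_saving - total_cost

# Monthly call volume at which savings exactly cover amortized costs.
breakeven_calls = (TRAINING_COST / 12 + MONTHLY_MAINTENANCE) / saving_per_call
```

Under these toy assumptions the crossover sits near 30,000 calls per month — roughly a thousand a day before fine-tuning even breaks even, which is the "higher than most teams expect" point in miniature.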
→ If scale justifies it AND you passed gates 1–4: fine-tune.

Neither is the better fighter in the abstract. The LaRA benchmark (ICML 2025) found no silver bullet — the better choice depends on task type, model behavior, context length, and retrieval setup. What you're choosing is not a tool; it's an architecture decision about where your intelligence lives. Get that question right and the tool choice is obvious. Get it wrong and no amount of fine-tuning saves you.