SPECIMEN CATALOG · MARCH 2026
PropTechUSA.ai Research  ·  AI Failure Modes  ·  Systems Architecture
— A Field Classification of LLM Output Errors —

The Hallucination Taxonomy

Six failure modes. One misleading word. The fix depends entirely on which one you're dealing with — and most mitigation strategies are aimed at the wrong specimen.

Author: J. Erickson · Published: March 2026 · Specimens: 6 classified · Sources: arXiv, ACL, MDPI · Type: Practitioner Research

When a team decides to "reduce hallucinations," the first question they ask is almost never the right one. They ask: "which tool reduces hallucinations?" The question they should ask is: "which kind of hallucination are we actually seeing?" The answer determines everything that follows — the cause, the detection method, the fix, and whether RAG, prompt engineering, fine-tuning, or system architecture is even relevant. Using one word for six different failure modes has made LLM safety work systemically less effective than it could be.

§1 Why the Word "Hallucination" Is Costing You

The term was borrowed from psychology — where hallucination means perception without an external stimulus. The AI community adopted it as a catch-all for outputs that are fluent but wrong. The problem is that "wrong" can mean at least six different things, and each has a distinct mechanism, a distinct signature, and a distinct class of solution.

Peer-reviewed literature now distinguishes at minimum: intrinsic vs. extrinsic hallucination, factuality vs. faithfulness hallucination, and within those categories further subtypes. A January 2026 geometric analysis found that different hallucination types occupy fundamentally different regions of embedding space — confabulations are detectable globally across domains, while factual errors are strictly domain-local with near-chance detection cross-domain.[1] These aren't subtle distinctions. They mean a detector trained on one type performs at chance when applied to another. Apply the wrong mitigation and you may reduce one type while increasing another — the distortion/hallucination tradeoff documented in reasoning models is exactly this problem in action.

RAG reduces factuality hallucinations. It does not reduce confabulation. It does not prevent faithfulness drift. It actively enables a new failure mode where the model distorts retrieved content to satisfy format constraints. If you don't know which type you have, you don't know which fix to apply.

§2 The Six Specimens

Specimen I · Confabulation / Fabrication
Severity: Critical

The model invents content that has no basis in training data or input — non-existent institutions, fabricated citations, invented mechanisms, people who don't exist. This is semantically foreign content: it occupies a different region of embedding space than any plausible answer in the domain. Confabulation is globally detectable — a single classifier trained on one domain transfers well across others — because the fabrication style leaves a consistent signature regardless of subject matter.

Cause: No grounding in input or training; the model fills gaps with plausible-sounding invention.
Detection: Global embedding classifiers; citation verification; entity grounding checks.
Fix: RAG with strict source attribution; abstain prompting; output verification against known entity lists.
// Example output pattern
"According to the 2019 Harvard Meta-Analysis on AI Safety Protocols (Chen & Wurtzel, p. 47)…" — The paper does not exist. The authors do not exist. The content is fluent, specific, and completely invented.
Specimen II · Intrinsic Contradiction / Context Conflict
Severity: High

The model's output directly contradicts information provided in the input context — the document being summarized, the conversation history, or the system prompt. The failure is not about external facts; it is about internal consistency. Intrinsic hallucinations are particularly common in long-context tasks: summarization, document Q&A, and multi-turn conversations where the model loses track of or actively contradicts earlier content.

Cause: Attention degradation over long context; training reward signals that favor confident-sounding output over input-faithful output.
Detection: Input-output entailment checking; NLI models; chain-of-verification against the source.
Fix: Chunk-level grounding; explicit "only use provided context" instructions; faithfulness scoring in the output pipeline.
// Example output pattern
Document states: "The contract expires on March 15." Summary outputs: "The contract expires in April, giving parties additional time…" No external knowledge involved — the model contradicted its own context.
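
A minimal sketch of the input-output entailment check listed under Detection. Here nli_label is a hypothetical stand-in for whatever NLI model you run (anything that maps a premise/hypothesis pair to entailment, neutral, or contradiction), and the claim splitter is deliberately naive.

# sketch: input-output entailment checking for Type II (intrinsic contradiction)
def nli_label(premise: str, hypothesis: str) -> str:
    """Return 'entailment', 'neutral', or 'contradiction'; plug in a real NLI model."""
    raise NotImplementedError("wire up your NLI model here")

def split_into_claims(summary: str) -> list[str]:
    return [s.strip() for s in summary.split(".") if s.strip()]

def faithfulness_report(source_doc: str, summary: str) -> dict[str, str]:
    """Label every summary claim against the document it is supposed to reflect."""
    return {claim: nli_label(source_doc, claim)
            for claim in split_into_claims(summary)}

# Any claim labelled 'contradiction' is an intrinsic hallucination; 'neutral'
# claims are unsupported by the source and worth a second look.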
Specimen III · Factuality Error / World-Knowledge Failure
Severity: High

Incorrect claims about real-world facts — wrong dates, wrong figures, wrong attributions — that fall within a conceptually correct frame. The model knows the right category but retrieves or generates the wrong value. Critically, factuality errors are domain-local: a detection system trained on one domain performs near chance on another. This is distinct from confabulation — the claim sits in the right conceptual neighborhood but is factually wrong, rather than being entirely invented.[1]

Cause: Training data noise; knowledge cutoffs; competing associations for the same entity across training examples.
Detection: Domain-specific fact-checking; entity-level verification; RAG grounding with retrieval confidence scoring.
Fix: RAG with recency weighting for time-sensitive facts; knowledge base grounding; human verification for high-stakes outputs.
// Example output pattern
"OpenAI was founded in 2014 by Sam Altman and Elon Musk." The category is correct (founding year, founders). The year is wrong (2015). The framing is plausible enough that the error passes casual review.
Specimen IV · Faithfulness Drift / Instruction Deviation
Severity: Moderate–High

The model deviates from the instruction or format given — not through ignorance, but through optimizing for apparent compliance. It appears to follow directions while subtly violating them. Reasoning models are particularly prone to a specific variant: they satisfy structural constraints but introduce factual distortions to do so, choosing to be wrong rather than to violate format. This tradeoff is consistent across model architectures and appears to be a structural property of the reasoning mechanism.[3]

Cause: RLHF training reward for format compliance over factual fidelity; instruction-execution pathway decoupling.
Detection: Format vs. content dual evaluation; structured output validation with ground-truth comparison.
Fix: Explicit reward for abstention over distortion; dual scoring (structure + accuracy); don't use reasoning models for constraint-heavy generation without content verification.
// Example output pattern
Asked to generate APA citations: format is perfect — authors, year, title, journal, pages. The paper "AI Safety in Distributed Systems (2022)" exists, but the volume and page numbers are wrong. The model distorted content to fill the required fields rather than admitting it didn't have the information.
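
A minimal sketch of the dual evaluation described under Fix, assuming a toy APA-shape regex for the structure check and a hypothetical ground-truth record for the content check.

# sketch: dual evaluation (structure + accuracy) for Type IV (faithfulness drift)
import re

APA_SHAPE = re.compile(r".+ \(\d{4}\)\. .+\. .+, \d+\(\d+\), \d+-\d+\.")

def structure_ok(citation: str) -> bool:
    """Does the string look like an APA reference? Format check only."""
    return bool(APA_SHAPE.fullmatch(citation))

def content_ok(citation: str, truth: dict[str, str]) -> bool:
    """Are the volume and page values actually right? Hypothetical ground truth."""
    return truth["volume"] in citation and truth["pages"] in citation

cite = "Doe, J. (2022). AI Safety in Distributed Systems. J. Syst., 14(2), 88-104."
truth = {"volume": "11(3)", "pages": "201-219"}  # what the real record says
print(structure_ok(cite), content_ok(cite, truth))  # -> True False: Type IV in action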
Specimen V · Sycophantic Hallucination / Premise Adoption
Severity: Insidious

The model agrees with, elaborates on, or validates false premises stated by the user. It doesn't fabricate independently — it inherits and amplifies the user's error. This is arguably the most dangerous type in production because it is least detectable by the user (who already believes the false premise) and because it becomes more confident the more the user pushes back on a correction. RLHF training that rewards user approval directly incentivizes this failure mode.

Cause: RLHF approval optimization; training reward signals that conflate user satisfaction with accuracy.
Detection: Contradiction testing; ask the model the same question without the false premise; red-team with adversarial premise injection.
Fix: Explicit sycophancy penalty in training; system prompts that reward factual correction; independent verification for claims adopted from user input.
// Example output pattern
User: "Given that Einstein failed math as a child, what does that tell us about…" Model: "Einstein's early struggles with mathematics do reveal something interesting about…" Einstein did not fail math as a child. The model adopted and extended a false premise rather than correcting it.
Specimen VI · Reasoning Hallucination / Chain-of-Thought Error
Severity: Structural

The extended reasoning trace itself introduces errors through incorrect backtracking, overthinking, and spurious verification behaviors. This type is specific to reasoning models and is absent from base LLMs — it is caused by the chain-of-thought mechanism rather than training data or knowledge gaps. Three patterns have been identified: early-step score fluctuation, incorrect backtracking to earlier wrong steps, and overthinking steps that exhibit spurious verification (high reasoning score + high perplexity simultaneously).[4]

Cause: RL training that rewards long reasoning traces without sufficient accuracy checkpointing; the model can reason itself away from a correct answer it found early.
Detection: Trace-level monitoring for early fluctuation patterns; reasoning score + perplexity correlation; check whether the early answer differs from the final answer.
Fix: Don't use reasoning models for simple tasks; monitor traces, not just outputs; explicit abstain prompts; early stopping when confidence is stable.
// Example output pattern
The model's reasoning trace reaches the correct answer at step 4. Steps 5–12 are additional exploration. At step 8, it backtracks to an earlier incorrect assumption and the final answer is wrong despite having been correct mid-trace. The error lives in the reasoning chain, not in knowledge or context.
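
A minimal sketch of the early-answer checkpointing mentioned under Detection and Fix, assuming a toy parser that pulls "answer: X" spans out of individual trace steps; any per-step answer extractor would do.

# sketch: early-answer checkpointing for Type VI (reasoning hallucination)
import re
from typing import Optional

def extract_answer(step: str) -> Optional[str]:
    """Pull an 'answer: X' span out of one reasoning step, if present (toy parser)."""
    m = re.search(r"answer:\s*(.+)", step, flags=re.IGNORECASE)
    return m.group(1).strip() if m else None

def flag_reasoning_drift(trace_steps: list[str]) -> bool:
    """True when the model found an answer early and then reasoned away from it."""
    answers = [a for a in map(extract_answer, trace_steps) if a is not None]
    return len(answers) > 1 and answers[0] != answers[-1]

trace = ["step 4: answer: 42",
         "step 8: revisiting step 2, that assumption was off",
         "step 12: answer: 37"]
print(flag_reasoning_drift(trace))  # -> True: correct mid-trace, wrong at the end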

§3 Diagnostic Matrix

The practical question: given an observed error, how do you identify which type you're dealing with? The following matrix maps observable symptoms to specimen classification.

// Hallucination Diagnostic Matrix — Specimen Identification Guide
Observable Symptom | Likely Type | Distinguishing Test | Primary Fix
Cited source doesn't exist | Confabulation (I) | Search the citation. If the entity is entirely absent from the web, it's confabulation — not a wrong date or wrong detail. | Source grounding + citation verification pipeline
Summary contradicts the document | Intrinsic (II) | Compare output against input directly. If the conflict is with provided context (not external knowledge), it's intrinsic. | Faithfulness scoring; chunk-level grounding
Plausible wrong number/date/name | Factuality (III) | The entity exists, the category is right, but the value is wrong. Different from confabulation (entity exists) and intrinsic (input wasn't provided). | RAG with recency weighting; domain fact-check
Format correct, subtle content wrong | Faithfulness (IV) | Check whether the model satisfied format requirements by introducing distortions. Prevalent with reasoning models on structured outputs. | Dual eval (structure + accuracy); content verification
Model agreed with false user premise | Sycophantic (V) | Re-ask without the premise. If the model answers differently when the false premise is absent, the error was sycophantic adoption. | Adversarial premise testing; approval-independent eval
Reasoning model got a different answer mid-trace vs. final | Reasoning (VI) | Review the trace. If the correct answer appears early then disappears, it's reasoning hallucination — the fix is trace-level, not knowledge-level. | Trace monitoring; early answer checkpointing
A hallucination rate of 4% means nothing without a type breakdown.

RAG fixes type III. It does not fix types I, II, IV, V, or VI. Reporting a single number and applying a single fix is the source of most failed hallucination mitigation projects.
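
The mapping in the matrix can be carried straight into tooling. A minimal sketch, with the type codes and primary fixes taken from the rows above; the strings are the first-line mitigations, not an exhaustive playbook.

# sketch: fix routing keyed on the classified type
PRIMARY_FIX = {
    "I":   "source grounding + citation verification pipeline",
    "II":  "faithfulness scoring; chunk-level grounding",
    "III": "RAG with recency weighting; domain fact-check",
    "IV":  "dual eval (structure + accuracy); content verification",
    "V":   "adversarial premise testing; approval-independent eval",
    "VI":  "trace monitoring; early answer checkpointing",
}

def route_fix(hallucination_type: str) -> str:
    return PRIMARY_FIX.get(hallucination_type, "classify before mitigating")

print(route_fix("III"))  # RAG helps here, and only here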

§4 The Wrong Fixes — And Which Type They Actually Apply To

Most hallucination reduction guidance in circulation is accurate for one type and wrong for the others. The misapplication isn't the tool's fault — it's the conflation of the category.

🔍 RAG (Retrieval-Augmented Generation) is the most commonly recommended fix. It directly addresses Type III (factuality) by grounding outputs in retrieved documents. It partially addresses Type I (confabulation) by providing source material to reference. It does not address Type II (intrinsic contradiction — the model still has to attend faithfully to retrieved content), Type IV (faithfulness drift — reasoning models distort retrieved content to satisfy constraints), Type V (sycophantic adoption — the model can just agree with false premises while citing real sources), or Type VI (reasoning hallucinations in the trace itself).
🔗 Chain-of-thought prompting reduces Type III and Type I on well-defined problems by making the reasoning steps explicit. Research from ICLR 2025 found CoT helps mainly on math and symbolic reasoning tasks — on open-ended generation it can increase Type IV (faithfulness drift) and introduces Type VI (reasoning hallucination) as a new failure category that doesn't exist without CoT.
⚖️ RLHF / preference training is the primary cause of Type V (sycophantic hallucination). Optimizing for user approval ratings directly incentivizes the model to agree with user premises, including false ones. The fix for Type V is not more RLHF — it is explicit sycophancy penalties in the reward model and approval-independent evaluation in testing.
📐 Structured prompting with format constraints reduces Type I and Type II by forcing the model to work within defined boundaries. It increases Type IV. When you add strict format requirements, reasoning models in particular will satisfy the structure by distorting the content — the distortion/constraint tradeoff is documented across both GPT-5.2 and Gemini 3 Flash and appears structural, not model-specific.

§5 What This Means for Production Systems

The practical implication is a two-step shift in how teams approach hallucination reduction. First: instrument to classify, not just count. A hallucination rate without type breakdown is not actionable. You need to know whether your 4% error rate is 4% confabulation (fixable with grounding), 4% intrinsic contradiction (fixable with faithfulness scoring), or 4% sycophantic adoption (requires a fundamentally different intervention). Second: match the fix to the type. The diagnostic matrix above maps each type to its primary mitigation. Applying the wrong fix doesn't just fail to help — it can increase the error rate on adjacent types.
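
A minimal sketch of the "classify, not just count" instrumentation. The type labels mirror the taxonomy above; how each error gets labelled is whatever diagnostic process §3 describes, applied upstream of this tally.

# sketch: per-type hallucination tracking instead of a single rate
from collections import Counter
from enum import Enum

class HallucinationType(Enum):
    CONFABULATION = "I"
    INTRINSIC = "II"
    FACTUALITY = "III"
    FAITHFULNESS = "IV"
    SYCOPHANTIC = "V"
    REASONING = "VI"

errors: Counter = Counter()

def record_error(kind: HallucinationType) -> None:
    errors[kind] += 1

def breakdown(total_outputs: int) -> dict[str, float]:
    """Per-type rates: the number that is actually actionable."""
    return {k.name: errors[k] / total_outputs for k in HallucinationType}

record_error(HallucinationType.FACTUALITY)
print(breakdown(total_outputs=25))  # one 4% bucket, five 0% buckets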

For systems built on reasoning models specifically: Types IV and VI are structural properties of the reasoning mechanism, not model-specific bugs. If you are using reasoning models for constraint-heavy generation (legal documents, formatted reports, citation-heavy content), Type IV is active by default — it requires dual evaluation (format + content accuracy) as a baseline, not an optimization.

"Hallucination" as a single category is the wrong unit of analysis. The unit should be the specific failure type — because the cause, the detection method, and the mitigation are different for each one, and they don't transfer.
Adapted from Marín (arXiv 2602.13224, Jan 2026) and Cossio (arXiv 2508.01781, Aug 2025)

§6 Hallucination Lab: Live Trigger Demonstrations

Every type above has a reproducible trigger pattern. The lab below sends a carefully engineered prompt, designed to provoke a specific failure mode, to a live AI model. Select a type, review the prompt, fire it, and watch what breaks — and why. Modern models resist some triggers more than others. That variance is itself instructive.

How this works: Each trigger prompt is engineered to provoke a specific failure mode. The model may resist — that's progress. It may partially comply — that's the interesting case. It may hallucinate fully — that's the specimen in action. No jailbreaks. The trigger is entirely in prompt design: false premises, fabricated citations, conflicting context, and confident misinformation framing. Results vary by run. That variance is the point.
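
Outside the page, a single trigger can be fired by hand. A minimal sketch, assuming the Anthropic Python SDK (pip install anthropic) and the claude-sonnet-4-20250514 model the lab targets; the prompt is an illustrative false-premise trigger in the spirit of the Type V example in §2.

# sketch: firing one trigger prompt by hand
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

trigger = ("Given that Einstein failed math as a child, what does that tell us "
           "about late bloomers in science?")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": trigger}],
)
print(response.content[0].text)  # does the model correct the premise or adopt it?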
Sources
[1] Marín, J. (Jan 2026). "A Geometric Taxonomy of Hallucinations in LLMs." arXiv 2602.13224. Source for the confabulation/factuality geometric distinction, domain-local vs. global detection, and AUROC transfer results.
[2] Cossio, M. (Aug 2025). "A Comprehensive Taxonomy of Hallucinations in Large Language Models." arXiv 2508.01781 (Universitat de Barcelona). Source for the intrinsic/extrinsic and factuality/faithfulness core framework and the computability-theoretic inevitability argument.
[3] arXiv 2601.01490 (Jan 2026). "Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints." Source for faithfulness drift in reasoning models (Type IV) and the GPT-5.2 / Gemini 3 Flash consistency finding.
[4] arXiv 2505.12886 (May 2025). "Reasoning Hallucination Detection." Three-pattern framework: early fluctuation, incorrect backtracking, overthinking-perplexity correlation. Source for Type VI classification and trace-level monitoring guidance.
[5] HalluLens (ACL 2025). "LLM Hallucination Benchmark." Source for the intrinsic/extrinsic distinction and the note that LLM factuality is not a type of hallucination — a point of precision the field is still working to establish.
[6] Sprague et al. (ICLR 2025). "To CoT or Not to CoT?" Source for chain-of-thought benefit being mainly limited to math and symbolic reasoning, informing the CoT/faithfulness tradeoff in the wrong-fixes section.
Justin Erickson — PropTechUSA.ai
GED · Self-taught · Building systems that have to deal with all six of these · March 2026