The Benchmark Problem.
There isn't one hallucination rate. There are five failure modes, three benchmark families, and completely opposite winners depending on which one you read.
Key insight
Every headline that says Claude hallucinates less is citing a different test than every headline that says ChatGPT does.
Summarization benchmarks favor ChatGPT. Calibration benchmarks favor Claude. A large-scale practical test gives ChatGPT a slight edge. These aren't contradictory findings — they're measuring different failure modes.
It depends on the task, and that's not a dodge — it's the finding. On summarization tasks (faithfully representing source documents), ChatGPT wins: GPT-4o produces 1.5% unfaithful outputs versus Claude Sonnet's 4.4% on the Vectara HHEM benchmark. On calibration (knowing what you don't know), Claude wins: Claude Opus 4.1 scores 0% overclaims versus GPT-5.5's 86% on AA-Omniscience. On a 1000-prompt practical test of open-ended tasks, ChatGPT edges out Claude 12% vs 15%. No model wins every category. The question "who hallucinates more" has three different correct answers depending on what you measure.
Yes, but at lower rates than most models on coding-specific tasks. Claude's strongest performance is in knowing when it doesn't know — it's more likely to say "I'm not sure this library method exists" than to invent one confidently. In practice, Claude Code hallucinates most on obscure API details, recent library changes post-training-cutoff, and multi-step reasoning chains where early errors compound. The fix isn't model-switching; it's grounding Claude in real documentation via tool use or RAG.
AI hallucination is when a model generates content that is factually incorrect, fabricated, or unfaithful to source material, and presents it with the same confidence as accurate content. It's not a single failure mode. Different benchmarks use "hallucination" to mean fabrication (inventing facts), confabulation (filling gaps with plausible-sounding guesses), faithfulness drift (departing from what a source document actually says), calibration error (asserting when uncertain), or omission (leaving out critical information). That's why the numbers vary so wildly between studies: they're measuring different things.
They measure different failure modes on different task types. Vectara's HHEM measures faithfulness to source documents (summarization). AA-Omniscience measures calibration (does the model assert when it shouldn't?). HalluHard measures conversational hallucinations on factual questions. SimpleQA measures factual accuracy on closed-form questions. FaithBench measures NLG faithfulness. Each benchmark produces a different ranking because the underlying skill being tested is different. An article that says "Claude hallucinates less" likely cited calibration data. An article that says "ChatGPT is more accurate" likely cited summarization data. Both are correct in their domain.
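As a quick reference, the benchmark-to-skill mapping described above can be written as a small lookup table. This is only an illustrative sketch: the descriptions are paraphrased from the paragraph above, and the `benchmarks_for` helper is a hypothetical name, not part of any benchmark's tooling.

```python
# Mapping paraphrased from the paragraph above; labels are illustrative only.
BENCHMARK_MEASURES = {
    "Vectara HHEM": "faithfulness to source documents (summarization)",
    "AA-Omniscience": "calibration",
    "HalluHard": "conversational hallucination on factual questions",
    "SimpleQA": "factual accuracy on closed-form questions",
    "FaithBench": "NLG faithfulness",
}


def benchmarks_for(keyword: str) -> list[str]:
    """Return the benchmarks whose description mentions the keyword."""
    needle = keyword.lower()
    return [
        name
        for name, measures in BENCHMARK_MEASURES.items()
        if needle in measures.lower()
    ]


print(benchmarks_for("faithfulness"))  # ['Vectara HHEM', 'FaithBench']
print(benchmarks_for("calibration"))   # ['AA-Omniscience']
```

When a headline cites a hallucination number, checking which row it came from tells you which failure mode the number actually describes.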
Fabrication is inventing information that wasn't in the training data or source. Calibration error is asserting with confidence when the model should express uncertainty or decline. These require different fixes. Fabrication is addressed by RAG and grounding — giving the model real source material. Calibration error is addressed by model training (Constitutional AI, RLHF with refusal training). Claude excels at calibration; it's more willing to say "I don't know." ChatGPT has historically prioritized helpfulness, which can mean confident wrong answers on topics outside its training data.
Trustworthy for their specific task, not as general statements about a model. Vectara's HHEM is reliable for summarization faithfulness — but tells you nothing about conversational factuality. AA-Omniscience is reliable for calibration on a specific question type — but doesn't test document tasks. The practical 1000-prompt tests have high ecological validity but poor reproducibility. Use benchmarks as direction-setters, not verdicts. The only benchmark that matters for your use case is one that tests your actual task on your actual data. No published benchmark does that for you.
Head to Head.
Claude wins on calibration. ChatGPT wins on summarization faithfulness. A 1000-prompt practical test gives ChatGPT a 12% vs 15% edge. No model sweeps the board.
Vectara HHEM (summarization faithfulness): GPT-4o 1.5% vs Claude Sonnet 4.4%. ChatGPT wins by a significant margin on staying faithful to source documents. AA-Omniscience (calibration): Claude Opus 4.1 at 0% overclaims vs GPT-5.5 at 86%. Claude wins dramatically on knowing what it doesn't know. hellobuilder.ai 1000-prompt practical test: ChatGPT 12% vs Claude 15%. Perplexity performed best at 3.3%, suggesting real-time web access changes the equation entirely. The pattern: Claude is better at refusing to answer when uncertain. ChatGPT is better at accurately representing documents it's given.
For research requiring accurate recall of facts: Claude has the edge because it's more likely to say "I'm not certain" rather than confidently assert a wrong answer. Claude Opus 4.1 scores near-zero on calibration benchmarks. However, if your research involves summarizing existing sources or analyzing documents, ChatGPT performs more faithfully. For both: ground the model in current sources via RAG or web access to reduce hallucination rates to near-zero regardless of which model you use.
Yes, by a large margin on calibration benchmarks. AA-Omniscience tests whether models assert confident answers on questions where the correct response is to decline or express uncertainty. Claude Opus 4.1 achieves 0% overclaims — it correctly refuses to answer when it lacks reliable knowledge. GPT-5.5 overclaims on 86% of those same questions. This is the result of Claude's Constitutional AI training, which explicitly penalizes confident wrong answers. In practice, Claude is less likely to fabricate authoritative-sounding misinformation — but may frustrate users who want an answer even when none is available.
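To make "overclaim rate" concrete, here is a minimal sketch of how such a number could be computed from labeled outcomes. It assumes a simplified definition (the share of should-abstain questions where the model answered anyway) and is not AA-Omniscience's actual scoring code; the `Outcome` type and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    should_abstain: bool  # ground truth: the right move is to decline
    model_answered: bool  # the model asserted an answer instead of declining


def overclaim_rate(outcomes: list[Outcome]) -> float:
    """Share of should-abstain questions where the model answered anyway."""
    unanswerable = [o for o in outcomes if o.should_abstain]
    if not unanswerable:
        raise ValueError("no should-abstain questions in the sample")
    return sum(o.model_answered for o in unanswerable) / len(unanswerable)


# Hypothetical sample: four questions that call for abstention, one that doesn't.
sample = [
    Outcome(should_abstain=True, model_answered=True),
    Outcome(should_abstain=True, model_answered=False),
    Outcome(should_abstain=True, model_answered=False),
    Outcome(should_abstain=True, model_answered=False),
    Outcome(should_abstain=False, model_answered=True),
]
print(f"{overclaim_rate(sample):.0%}")  # 25%
```

Note that answerable questions are excluded from the denominator, which is why a model can score well here and still be wrong on questions it does answer.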
Yes. All large language models can and do fabricate citations, but rates vary by task and model version. GPT-4o shows higher citation fabrication rates on open-ended research queries. The typical patterns are plausible-sounding journal names, real author names attached to invented paper titles, and real papers cited for claims they don't make. ChatGPT with browsing enabled reduces this significantly. Without browsing, always verify any citation a language model provides.
Yes, though Claude is more likely to flag uncertainty than to fabricate a citation confidently. Claude's calibration training means it will sometimes say "I believe there is research on this but I can't cite a specific paper" rather than inventing one. That's better than a fabricated citation — but you still need to verify any specific claim. Claude with web search enabled dramatically reduces fabrication. Never use any model's citations in academic or legal contexts without independent verification.
For closed-domain factual questions with a single correct answer, Claude Opus 4.1 leads on recent benchmarks including SimpleQA. For open-ended questions where confidence calibration matters, Claude also leads due to its refusal training. ChatGPT edges Claude on summarization accuracy and faithfulness to provided source material. For most factual tasks, the difference is small enough that prompt quality and task framing matter more than model choice.
For writing that requires staying grounded in provided source material, ChatGPT (GPT-4o) has a measurable edge on faithfulness benchmarks. For factual writing requiring accurate recall without source documents, Claude has a slight edge on calibration. In practice: provide source material for either model rather than relying on training-data recall. The model that uses real sources most faithfully wins — and that depends more on your prompting strategy than on model choice.
Neither model is universally better — they excel in different dimensions. Claude leads on calibration, refusal quality, long-context reasoning, and coding. ChatGPT leads on document summarization faithfulness, broad task coverage, and ecosystem integrations. For content where confident wrong answers are dangerous (medical, legal, financial), Claude's calibration advantage matters. For document-based workflows where faithfulness to source material is critical, ChatGPT's summarization accuracy matters.
Task by Task.
The same model that's reliable for summarizing documents will hallucinate more in open-ended conversation. The question isn't which model — it's which task.
ChatGPT (GPT-4o) for strict faithfulness to source content; Claude for long documents over 100k tokens. GPT-4o produces 1.5% unfaithful outputs versus Claude's 4.4% on Vectara HHEM, which makes ChatGPT measurably more likely to stay within what the document actually says. However, Claude handles very long documents with less degradation. For critical summarization tasks (legal documents, research papers), use ChatGPT with explicit "stay strictly within the source material" instructions. For ultra-long documents that exceed GPT-4o's context window, Claude is the practical option of the two.
Claude (Sonnet and Opus) for complex, multi-file coding tasks; ChatGPT for broad language and framework coverage. Claude is less likely to confabulate library methods or invent API signatures. It performs better on SWE-bench and similar agentic coding benchmarks. ChatGPT covers more obscure frameworks and has marginally better JavaScript pattern coverage. The more important variable is grounding: providing current documentation and real error messages as context. Claude Code with MCP tools accessing real docs beats either model running ungrounded, and that gap is wider than the gap between Claude and ChatGPT themselves.
For ungrounded content generation, ChatGPT has a slight edge — 12% vs 15% fabrication rate in a 1000-prompt practical test. But the difference is small, and both models will confidently invent statistics, customer stories, and product claims without real source material. The practical fix: provide actual product data, real customer quotes, and current market information as context. A model grounded in real data produces near-zero marketing hallucinations regardless of model. This is the task where RAG and real web data matter most — the gap between grounded and ungrounded outputs dwarfs the gap between models.
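One rough way to check whether a 12% versus 15% gap is "small" is a two-proportion z-test. The sketch below assumes each model was scored on 1000 independent prompts (so 120 and 150 fabrications), which the source test does not spell out; the function name is hypothetical.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """z statistic for the difference between two failure rates (pooled SE)."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Assumed counts: 12% and 15% of 1000 prompts each.
z = two_proportion_z(fail_a=120, n_a=1000, fail_b=150, n_b=1000)
print(round(z, 2))
```

Under those assumptions the statistic lands close to 1.96, the conventional 95% cutoff, so the gap is borderline rather than decisive. If both models answered the same 1000 prompts, a paired test such as McNemar's would be a better fit than this unpaired one.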
Claude is generally preferred for customer support in regulated or risk-sensitive industries because it's more likely to escalate or say "I'm not sure" rather than provide a confident wrong answer. Both models benefit enormously from grounding in your actual product documentation and FAQs — which reduces hallucination to near-zero regardless of model. Choose Claude when a wrong answer is worse than no answer; choose ChatGPT when breadth of coverage matters more than caution.
For research, the model with access to current sources beats both Claude and ChatGPT without access. Perplexity (with real-time search) scored 3.3% hallucination in the same 1000-prompt test where Claude scored 15% and ChatGPT 12%. Real-time web access is a bigger variable than model choice. When you must choose between ungrounded models: Claude's calibration advantage makes it safer for factual research because it's less likely to invent authoritative-sounding citations. For document-based research, ChatGPT's summarization faithfulness wins.
It's competitive at the top, but the gap with GPT-4o-based tools has narrowed significantly through 2025–2026. Claude Opus 4.1 leads on SWE-bench Verified and multi-step software engineering tasks requiring large codebase reasoning. GPT-4o remains competitive on frontend tasks and has broader plugin ecosystem coverage. The more important variable is tool access: Claude Code with MCP tools (filesystem, docs, real web data) beats either model running ungrounded, and that gap is wider than the gap between Claude and ChatGPT themselves.
Neither model should be used for medical or legal decisions without professional oversight — but for informational use, Claude's calibration advantage is meaningful in these domains. Claude is more likely to say "consult a doctor" rather than providing a specific recommendation. For regulated domains, the required mitigation is human review, not model selection. Both models can fabricate plausible-sounding clinical or legal information — the risk doesn't disappear by choosing one over the other.
Source: Vectara HHEM v2 (2025)
When It Gets Worse.
Both models hallucinate significantly more in three specific conditions. Knowing the conditions matters more than knowing the model.
The fix
Giving an AI real source material eliminates an entire class of hallucinations that no model swap can fix.
When a model has current, specific source material to work from, faithfulness errors drop to near-zero for both ChatGPT and Claude. The debate about which model hallucinates less is mostly a debate about behavior when flying blind.
Training data has a cutoff date, and neither model was trained on events after it. When asked about recent events, a model must either decline to answer (calibration) or generate plausible-sounding content from pattern matching (fabrication). Claude's calibration training makes it more likely to say "my knowledge cuts off in early 2025" rather than invent current information. The fix is the same for both: enable web search or provide current source material as context. This single change reduces hallucination on recent-event queries from high to near-zero regardless of model.
Yes — both models experience "lost in the middle" degradation where content in the middle of long contexts is recalled less faithfully than content at the start or end. This affects Claude less than GPT-4o at very long context lengths (200k+ tokens), which is why Claude is preferred for long-document analysis. For typical business document lengths (under 50k tokens), both models perform similarly. Position your most important source material at the beginning or end of the context window, not the middle.
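The placement advice above can be sketched as a small reordering step. This assumes you already have source chunks ranked from most to least important; `place_at_edges` is a hypothetical helper, not a feature of either model's API.

```python
def place_at_edges(chunks_by_priority: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the context,
    so the lowest-priority material ends up in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_priority):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# "A" is the most important chunk, "E" the least.
print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

The two highest-priority chunks land at the start and end of the context, and the least important one sits in the middle, where recall is weakest.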
Yes — higher temperature increases creativity and variation, which also increases hallucination rate. At temperature 0, both models are most deterministic and most faithful to their training. At temperature 1+, outputs are more creative but more likely to include fabricated details. For factual tasks, use low temperature (0–0.3). Most production hallucination benchmarks test at temperature 0–0.5. The benchmarks that show worst-case hallucination rates are often testing higher-temperature outputs.
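To see why temperature matters, it helps to look at what it does to the next-token probabilities. The sketch below is a generic softmax-with-temperature calculation over made-up scores; it illustrates the standard mechanism and is not a description of either vendor's internal sampler.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is commonly handled as greedy (argmax) decoding instead.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability sits on the top-scoring token. As temperature rises, lower-ranked tokens get sampled more often, which is where both the variety and the extra fabricated details come from.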
Claude's calibration training reduces but doesn't eliminate overconfidence, especially on memorized but incorrect facts. If Claude's training data contained incorrect information stated confidently (a common pattern in web text), it can reproduce that incorrect information with the same confidence. The Constitutional AI training helps Claude catch many cases of uncertainty, but doesn't fix memorized factual errors. Claude most often states wrong information confidently on: specific dates and numbers, obscure historical claims, and technical details from domains underrepresented in training data.
ChatGPT is trained to be helpful, and "helpful" in its RLHF process often meant providing specific, authoritative-sounding answers. When asked for statistics on a niche topic where no data exists in training, the model generates plausible numbers in the style of statistics it has seen — percentage ranges, study sizes, citation formats — because that pattern scored well in human feedback during training. GPT-4o has reduced this tendency compared to earlier models, but not eliminated it. Specific statistics from any LLM should be independently verified.
Three conditions reliably increase hallucination rates for both models: (1) Asking about topics near or past training cutoff without current sources. (2) Asking for specific numbers, dates, or citations on niche topics. (3) Very long conversations where the model loses track of earlier context. Secondary factors: prompts that reward sounding authoritative over expressing uncertainty, asking multiple questions in one prompt, and tasks where the model doesn't have enough information but the prompt implies it should. The single most effective mitigation is providing current, specific source material as context.
For coding tasks, yes — with the same verification practices you'd apply to any junior developer's output. Claude Code generates code that sometimes includes confabulated API parameters, deprecated methods, or incorrect logic for edge cases. It is not safe to deploy AI-generated code without review. What makes Claude Code safer than raw API usage: it reads actual files in your codebase, which grounds it in real context rather than training-data recall. With MCP tools providing real documentation and current library versions, confabulation drops dramatically. The safety ceiling is your review process, not the model.
The Actual Fix.
RAG, source-grounded prompting, and real-time web data reduce hallucination rates more than model choice. This is what the benchmarks miss.
Three signals indicate likely hallucination: suspicious specificity, unfalsifiable claims, and lack of source. Suspicious specificity means a precise number or date on a niche topic that seems too convenient. Unfalsifiable claims are ones where the model can't tell you where to verify them. Lack of source means the model asserts something without citing where to check it. For any claim that will be published, acted on, or that carries risk if wrong — verify it independently. Ask the model "where can I verify this?" A well-calibrated model will give you a source. A hallucinating model will give you another assertion.
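Two of these signals, suspicious specificity and lack of source, can be roughly approximated with a text filter. The sketch below is a crude heuristic under stated assumptions: it treats percentages and four-digit years as "specifics" and looks for a few source markers. The patterns and the function name are illustrative, and it will miss plenty of cases.

```python
import re

# Percentages like "73.4%" or years like "2023" count as specifics here.
SPECIFIC = re.compile(r"\b\d+(?:\.\d+)?%|\b(?:19|20)\d{2}\b")
# A few illustrative markers that a sentence points to a checkable source.
SOURCE_HINT = re.compile(r"https?://|according to|source:", re.IGNORECASE)


def flag_unsourced_specifics(text: str) -> list[str]:
    """Return sentences that contain a precise figure but no source marker."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if SPECIFIC.search(s) and not SOURCE_HINT.search(s)]


draft = (
    "Adoption grew 73.4% in 2023. "
    "According to the 2024 report at https://example.com, churn fell 5%. "
    "The product is popular."
)
print(flag_unsourced_specifics(draft))  # ['Adoption grew 73.4% in 2023.']
```

A flagged sentence is not necessarily wrong. It is a candidate for the "where can I verify this?" follow-up described above.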
Provide source material rather than asking the model to recall. This is the highest-leverage intervention for both models. Instead of "what are the statistics on X?" — paste the relevant document and ask "according to this document, what does it say about X?" Faithfulness to provided content is dramatically higher than recall accuracy. Secondary interventions: use low temperature for factual tasks, ask the model to cite sources for any specific claim, use chain-of-thought prompting for complex reasoning, and enable web search when recency matters.
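The "paste the document and ask what it says" pattern can be wrapped in a small prompt builder. This is a minimal sketch: the instruction wording and the `<document>` delimiters are assumptions chosen for illustration, not a format that either model requires.

```python
def build_grounded_prompt(document: str, question: str) -> str:
    """Wrap a question so the model answers from the supplied text, not recall."""
    return (
        "Answer using only the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"<document>\n{document.strip()}\n</document>\n\n"
        f"According to this document, {question.strip()}"
    )


prompt = build_grounded_prompt(
    document="Refunds are available within 30 days of purchase.",
    question="what is the refund window?",
)
print(prompt)
```

The explicit "say so" instruction gives the model a sanctioned way to decline, which matters when the document doesn't cover the question.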
RAG retrieves relevant source documents at query time and includes them in the model's context, replacing training-data recall with current source material. Yes, it dramatically reduces hallucination for factual tasks. The Perplexity result (3.3% hallucination in a 1000-prompt test where Claude was at 15%) demonstrates the effect: real-time retrieval is a bigger variable than model choice. RAG works best for knowledge that changes (product docs, market data, news, current prices) and knowledge that's too specific for training data (company-internal documents, proprietary research).
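The retrieval step can be illustrated with a toy example. The sketch below ranks documents by plain word overlap with the query, which is a stand-in assumption for the embedding or search index a real RAG system would use. All names and sample documents are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokenize(d)]  # drop zero-overlap hits


docs = [
    "The Pro plan costs 49 dollars per month.",
    "Support hours are 9 to 5 on weekdays.",
    "The Pro plan includes priority support.",
]
context = retrieve("What does the Pro plan cost per month?", docs, k=1)
print(context)  # ['The Pro plan costs 49 dollars per month.']
```

The retrieved text is then placed in the model's context alongside the question, so the answer comes from the document rather than from training-data recall.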
For content grounded in provided source material, use ChatGPT — its summarization faithfulness benchmark is meaningfully better. For content where the model must reason from memory and you prefer a cautious refusal over a confident wrong answer, use Claude. For content where accuracy genuinely matters: don't rely on either model's ungrounded recall. Provide the source material. Use web search. Verify specific claims. An MCP server that provides real SERP data, PAA questions, and current page content to your AI eliminates the most common class of content hallucinations regardless of which model you use.
Build a test set of 50–100 queries in your domain with known correct answers, run both models, and score the outputs. This is the only hallucination benchmark that matters for your use case. The setup: take real queries you receive or generate, find the ground-truth answer for each from authoritative sources, run them through both models at your standard temperature and prompting setup, and score the outputs for factual accuracy and faithfulness. Run the test with and without source material to measure the grounding delta. Budget 4–6 hours for a 100-query test set. This is more actionable than any published benchmark.
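A minimal harness for this kind of test might look like the sketch below. It assumes each model is wrapped as a function taking a question and an optional source, and it scores by simple substring match, which is a deliberately crude stand-in for human or rubric-based grading. The stub model and test cases are hypothetical.

```python
from typing import Callable, Optional

# A model is any callable: (question, source text or None) -> answer text.
Model = Callable[[str, Optional[str]], str]


def accuracy(model: Model, cases: list[dict], grounded: bool) -> float:
    """Fraction of cases whose expected answer appears in the model output."""
    correct = 0
    for case in cases:
        source = case["source"] if grounded else None
        answer = model(case["question"], source)
        correct += case["expected"].lower() in answer.lower()
    return correct / len(cases)


def grounding_delta(model: Model, cases: list[dict]) -> float:
    """Accuracy gained by supplying source material."""
    return accuracy(model, cases, grounded=True) - accuracy(model, cases, grounded=False)


def stub_model(question: str, source: Optional[str]) -> str:
    """Stand-in for a real API call: echoes the source, else guesses 'Paris'."""
    return source if source is not None else "Paris"


cases = [
    {"question": "What is the capital of France?", "expected": "Paris",
     "source": "The capital of France is Paris."},
    {"question": "What year was the policy updated?", "expected": "2021",
     "source": "The policy was updated in 2021."},
]
print(grounding_delta(stub_model, cases))  # 0.5
```

Replacing `stub_model` with thin wrappers around your actual model calls gives you a per-model accuracy and a per-model grounding delta on your own data.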
Yes — this is the single most effective intervention, and it works regardless of which model you use. When a model has current SERP results, actual page content, and real PAA questions as context, it can answer from source rather than from training-data recall. Perplexity's 3.3% result vs Claude's 15% and ChatGPT's 12% in a 1000-prompt test is largely explained by real-time web access. For content workflows — writing blog posts, researching topics, generating marketing copy — grounding the model in current SERP data reduces hallucination risk more than switching between Claude and ChatGPT.
Ground your AI in real data.
MCP Scraper gives your AI workflows real SERP data, People Also Ask harvests, and live page content — eliminating the class of hallucinations that come from outdated or invented facts. Less model-switching, more reliable outputs.