Vectara Hallucination Leaderboard: LLM hallucination rate comparison chart

The AI Tools That Won't Answer Unless They Can Cite a Source

AI hallucination prevention, Grainulator, claim-level verification, Vectara HHEM

Business

A 6-stage verification system for preventing AI citation hallucinations

Grainulator — Research that compiles (GitHub)

Hacker News discussion: The tool that won't let AI say anything it can't cite

Ask ChatGPT for paper references and it will confidently cite studies that don't exist. With GPT-3.5, 39–55% of citations are fabricated; even with GPT-4, 18–29% are still made up. As of July 2025, there were more than 206 documented cases of lawyers being fined for submitting AI-hallucinated case citations in court. "Reduce hallucinations" has become a tired phrase. A new class of tools takes a different approach: if the AI can't cite it, it doesn't say it.

What Is It?

Grainulator, which recently caught attention on Hacker News, is an open-source research tool built around one principle: if it can't cite a source, it won't answer. Toss it a question and it runs through a 3-pass investigation followed by a 7-pass compilation before producing a response. The real story is in how that pipeline is designed.

How Grainulator Works

Input question → 3-pass investigation (multi-angle evidence gathering) → claims tagged by type (fact / constraint / risk / recommendation / estimate) → evidence tier classification (stated / web / documented / tested / production) → 7-pass compiler runs contradiction detection, bias scanning, and gap analysis → confidence score (0–100) generated → unresolved contradictions block the response entirely

What sets Grainulator apart from standard chatbots is that every claim gets an evidence tier: "stated" (just said it), "web" (pulled from a web search), "documented" (verified in a document), "tested" (verified through testing), or "production" (verified in a live environment). If the evidence is weak or contradictions between claims go unresolved, the compiler blocks the output.
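To make the tier-and-block idea concrete, here is a minimal sketch in Python. It is not Grainulator's actual code: the `Claim` class, the `compile_claims` function, and the `min_tier` threshold are hypothetical names chosen for illustration, and contradiction detection is assumed to have already happened upstream and is passed in as pairs of claim IDs.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceTier(IntEnum):
    """Evidence tiers named in the article, ordered weakest to strongest."""
    STATED = 1
    WEB = 2
    DOCUMENTED = 3
    TESTED = 4
    PRODUCTION = 5


@dataclass(frozen=True)
class Claim:
    claim_id: str
    claim_type: str  # fact / constraint / risk / recommendation / estimate
    text: str
    tier: EvidenceTier


def compile_claims(claims, contradictions, min_tier=EvidenceTier.WEB):
    """Hypothetical gate: return (ok, reasons).

    The output is blocked if any claim sits below min_tier (an assumed
    threshold) or if any unresolved contradiction pair remains.
    """
    reasons = []
    for claim in claims:
        if claim.tier < min_tier:
            reasons.append(
                f"{claim.claim_id}: evidence tier '{claim.tier.name.lower()}' is too weak"
            )
    for left, right in contradictions:
        reasons.append(f"{left} contradicts {right}")
    return (not reasons, reasons)


claims = [
    Claim("c1", "fact", "The service handles 10k requests per second.", EvidenceTier.TESTED),
    Claim("c2", "estimate", "Migration will take two weeks.", EvidenceTier.STATED),
]
ok, reasons = compile_claims(claims, contradictions=[])
print(ok, reasons)  # False ["c2: evidence tier 'stated' is too weak"]
```

The point of the design is that the gate is ordinary code sitting outside the model: a response is only released when the list of reasons is empty.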

What Changes?

When people talk about preventing hallucinations, the go-to answer is RAG (Retrieval-Augmented Generation): stuff search results into the context window. But evidence is piling up that RAG alone isn't enough.

Approach	How It Works	Limitations
Basic RAG	Retrieve documents → feed as context to LLM	If search results are inaccurate, hallucinations persist. On Stanford's legal RAG benchmark, the systems still hallucinated about 1 in 6 times
Multi-layer verification (INRA, etc.)	Source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail	Achieves hallucination rates below 0.1%, but specialized for academic citations — limited general applicability
Claim-level verification (Grainulator, CLATTER)	Decompose responses into atomic claims → match each claim to evidence → detect contradictions → block unverified claims	Slower processing (40–70 seconds), but structurally prevents any unsourced statement from getting through
Constrained Decoding	Structurally enforce source mapping at the token output level in code	Most reliable, but high implementation complexity. Requires actual programming, not just prompt engineering

On Vectara's hallucination leaderboard, even the top-performing models hallucinate at rates above 1.8% on summarization tasks; GPT-4o sits at 9.6% and Claude Sonnet 4.6 at 10.6%. However good the model gets, these numbers suggest you won't reach 0% without verification built into the architecture.

The HN Community's Honest Take

Grainulator drew attention on Hacker News, but the community's reaction was mixed. Critics pointed out that "it's prompt-based, so the AI can still say whatever it wants" and that "constrained decoding would block hallucinations at the code level without needing prompts at all." A demo where it got the director of the 1932 film Scarface wrong was also flagged. The tool has real promise, but don't treat it as a silver bullet.

Getting Started

If you want to meaningfully improve your AI hallucination defenses right now, here's a three-step approach.

1. Measure your actual hallucination rate
Use an open-source evaluation model like Vectara's HHEM to put a real number on how often your system hallucinates. Getting from "it seems to mess up sometimes" to "7.2% of outputs don't match verified sources" is where you start.
2. Add a verification layer that breaks responses into atomic claims
Following the approach of the CLATTER framework, build a pipeline that splits AI responses into individual facts and matches each one to a source. Claim-level verification is far more precise than validating the full response as a unit.
3. If you're enterprise, make multi-layer verification your baseline
The 6-layer structure of source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail is the most battle-tested pattern available today. Evaluate specialized tools like Avido or INRA, or consider cloud-native options like Google Vertex AI Grounding.
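The measurement step can be sketched as follows. This is a minimal illustration, assuming a scorer that takes a (source, output) pair and returns a consistency score between 0 and 1, which is the shape an HHEM-style evaluation model would provide. Loading the real model is left out; `toy_scorer` is a deliberately crude word-overlap stand-in, and the 0.5 threshold is an assumption you would tune.

```python
def toy_scorer(source, output):
    """Stand-in scorer (NOT HHEM): fraction of output words found in the source."""
    source_words = set(source.lower().split())
    words = output.lower().split()
    if not words:
        return 1.0
    return sum(word in source_words for word in words) / len(words)


def hallucination_rate(pairs, scorer, threshold=0.5):
    """Share of (source, output) pairs whose consistency score is below threshold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (source, output) pair")
    flagged = sum(1 for source, output in pairs if scorer(source, output) < threshold)
    return flagged / len(pairs)


pairs = [
    ("the cat sat on the mat", "the cat sat"),
    ("the cat sat on the mat", "dogs fly south"),
]
print(f"{hallucination_rate(pairs, toy_scorer):.1%}")  # 50.0%
```

Keeping the scorer as a parameter means the same harness can be rerun when you swap in a real evaluation model, so the number stays comparable over time.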

Deep Dive Resources

The technical evolution of hallucination detection

Hallucination detection has evolved through roughly three generations. The first used text-overlap metrics (ROUGE, BERTScore), measuring only surface-level similarity. The second used NLI (Natural Language Inference) to assess entailment relationships between sentences (SummaC, AlignScore). The current third generation uses atomic fact decomposition: breaking responses into the smallest possible claims and verifying each one independently (MiniCheck, CLATTER, REFIND).
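The shape of a claim-level check can be shown with a small sketch. It is not how MiniCheck, CLATTER, or REFIND are implemented: sentence splitting stands in for LLM-based claim decomposition, token overlap stands in for an entailment model, and the function names and the 0.8 support threshold are assumptions made for illustration.

```python
import re


def split_into_claims(response):
    """Naive decomposition: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [part for part in parts if part]


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(claim, evidence):
    """Fraction of the claim's tokens that also appear in the evidence."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens)


def verify_response(response, sources, min_support=0.8):
    """Match each claim to its best-supporting source, or mark it unsupported."""
    results = []
    for claim in split_into_claims(response):
        best_id, best_score = None, 0.0
        for source_id, text in sources.items():
            score = support_score(claim, text)
            if score > best_score:
                best_id, best_score = source_id, score
        supported = best_score >= min_support
        results.append(
            {"claim": claim, "source": best_id if supported else None, "supported": supported}
        )
    return results


sources = {"doc1": "Grainulator is an open-source research tool."}
response = "Grainulator is an open-source research tool. It was released in 1999."
for result in verify_response(response, sources):
    print(result["supported"], result["source"], result["claim"])
```

Here the first sentence is matched to `doc1`, while the second has no supporting source and would be the one to block or flag.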

Google made an interesting discovery in late 2024: simply asking an LLM "are you hallucinating right now?" reduced follow-up hallucinations by 17%. That suggests the problem isn't fundamentally unsolvable; it's a matter of architectural design. With constrained decoding, you can structurally prevent unsourced output by controlling token generation directly, without relying on prompts at all.
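Constrained decoding can be shown with a toy example. In a real system the constraint is applied to the model's logits inside the sampling loop; the sketch below assumes precomputed per-step logits as plain dictionaries, and the `[S1]`-style citation tokens and function names are hypothetical. At each step, tokens that violate the constraint are removed before the highest-scoring one is picked.

```python
def constrained_greedy_decode(step_logits, is_allowed):
    """Greedy decoding where disallowed tokens are masked out at every step.

    step_logits: list of {token: logit} dicts, one per decoding step.
    is_allowed:  function (prefix_tokens, token) -> bool.
    """
    output = []
    for logits in step_logits:
        allowed = {tok: score for tok, score in logits.items() if is_allowed(output, tok)}
        if not allowed:
            raise ValueError("no token satisfies the constraint at this step")
        output.append(max(allowed, key=allowed.get))
    return output


RETRIEVED = {"[S1]", "[S2]"}  # IDs of the sources actually retrieved (assumed)


def only_retrieved_citations(prefix, token):
    # A token that looks like a citation must name a retrieved source.
    if token.startswith("[S"):
        return token in RETRIEVED
    return True


step_logits = [
    {"Revenue": 2.0, "Profit": 1.0},
    {"grew": 1.5, "fell": 0.2},
    {"[S9]": 3.0, "[S1]": 2.5, "[S2]": 0.1},  # the model prefers a nonexistent source
]
print(constrained_greedy_decode(step_logits, only_retrieved_citations))
# ['Revenue', 'grew', '[S1]']
```

Note what this does and does not guarantee: the citation always points at a source that exists, but whether that source supports the sentence is still a job for claim-level verification.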

FAQ

Does Grainulator completely eliminate hallucinations?

Not completely. Grainulator is a prompt-based system, so there's still a chance the LLM ignores its instructions — and incorrect answers were reported even in the HN demo. That said, its evidence tier classification and contradiction detection provide structurally stronger verification than conventional approaches.

If I'm already using RAG, do I need additional verification on top of that?

Yes. On Stanford's legal RAG benchmark, even a well-designed RAG system hallucinated 1 in 6 times. You need multi-layer verification — source validation, real-time checking, and an audit trail — to get anywhere close to 0%.

If constrained decoding is the most reliable method, why isn't everyone using it?

Implementation complexity. Prompt engineering is just text — constrained decoding requires programming at the API level and designing token decoding strategies. That said, as OpenAI, Google, and others roll out Structured Output APIs, the barrier to entry is coming down.

Written by Rush

Tracking where business meets AI.
