Ask ChatGPT for paper references and it will confidently cite studies that don't exist. With GPT-3.5, 39–55% of citations are fabricated; even with GPT-4, 18–29% are still made up. As of July 2025, there are more than 206 documented cases of lawyers being fined for submitting AI-hallucinated case citations in court. "Reduce hallucinations" has become a tired phrase. A new class of tools takes a different approach: if the AI can't cite it, it doesn't say it.

What Is It?

Grainulator, which recently caught attention on Hacker News, is an open-source research tool built around one principle: if it can't cite a source, it won't answer. Toss it a question and it runs through a 3-pass investigation followed by a 7-pass compilation before producing a response. The real story is in how that pipeline is designed.

How Grainulator Works
  1. Input question
  2. 3-pass investigation (multi-angle evidence gathering)
  3. Claims tagged by type (fact / constraint / risk / recommendation / estimate)
  4. Evidence tier classification (stated / web / documented / tested / production)
  5. 7-pass compiler runs contradiction detection, bias scanning, and gap analysis
  6. Confidence score (0–100) generated
  7. Unresolved contradictions block the response entirely

What sets Grainulator apart from standard chatbots is that every claim gets an evidence tier: "stated" (merely asserted), "web" (pulled from a web search), "documented" (verified in a document), "tested" (verified through testing), or "production" (verified in a live environment). If the evidence is weak or contradictions between claims go unresolved, the compiler blocks the output.
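To make the tiering and blocking rule concrete, here is a minimal sketch of a claim record and a gate that refuses to emit output. It is not Grainulator's actual code: the `Claim` and `compile_claims` names, the `min_tier` threshold, and the confidence formula are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Evidence tiers named in the article, ordered weakest to strongest."""
    STATED = 1
    WEB = 2
    DOCUMENTED = 3
    TESTED = 4
    PRODUCTION = 5


@dataclass(frozen=True)
class Claim:
    text: str
    kind: str                      # fact / constraint / risk / recommendation / estimate
    tier: Tier
    source: Optional[str] = None   # citation backing the claim, if any


def compile_claims(claims, contradictions, min_tier=Tier.WEB):
    """Return (ok, confidence, reasons).

    `contradictions` is a list of (claim_text, claim_text) pairs that are
    still unresolved. The threshold and the scoring rule are hypothetical.
    """
    reasons = [f"unresolved contradiction: {a!r} vs {b!r}" for a, b in contradictions]
    if not claims:
        reasons.append("no claims to report")
    for claim in claims:
        if claim.source is None:
            reasons.append(f"no source: {claim.text!r}")
        elif claim.tier < min_tier:
            reasons.append(f"weak evidence ({claim.tier.name.lower()}): {claim.text!r}")
    if reasons:
        return False, 0, reasons   # block the response entirely
    # Hypothetical score: mean tier scaled to the 0-100 range.
    confidence = round(100 * sum(c.tier for c in claims) / (len(claims) * Tier.PRODUCTION))
    return True, confidence, reasons
```

The design choice worth noting is that the gate returns reasons rather than a bare boolean, so a blocked answer can tell the user which claim lacked a source or which contradiction is still open.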

What Changes?

When people talk about preventing hallucinations, the go-to answer is RAG (Retrieval-Augmented Generation): stuff search results into the context window. But evidence is accumulating that RAG alone isn't enough.

| Approach | How It Works | Limitations |
| --- | --- | --- |
| Basic RAG | Retrieve documents → feed as context to LLM | If search results are inaccurate, hallucinations persist. On Stanford's legal RAG benchmark, 1 in 6 citations was still fabricated |
| Multi-layer verification (INRA, etc.) | Source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail | Achieves hallucination rates below 0.1%, but is specialized for academic citations, so general applicability is limited |
| Claim-level verification (Grainulator, CLATTER) | Decompose responses into atomic claims → match each claim to evidence → detect contradictions → block unverified claims | Slower processing (40–70 seconds), but structurally prevents any unsourced statement from getting through |
| Constrained Decoding | Structurally enforce source mapping at the token output level in code | Most reliable, but high implementation complexity. Requires actual programming, not just prompt engineering |
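As a rough illustration of the basic RAG approach compared above, the sketch below retrieves passages with a toy word-overlap ranker and pastes them into a prompt. The `retrieve` and `build_prompt` functions and the prompt wording are hypothetical; a real system would use a proper search index or embedding store.

```python
def retrieve(question, corpus, k=2):
    """Rank documents by naive word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, corpus, k=2):
    """Assemble a basic RAG prompt: retrieved passages first, then the question."""
    passages = retrieve(question, corpus, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Nothing in this flow inspects what the model writes back. The instruction to "cite their ids" is only a request, which is why bad retrieval or a disobedient model can still produce fabricated citations.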

Looking at Vectara's hallucination leaderboard, even top-performing models hallucinate at rates above 1.8% on summarization tasks. GPT-4o sits at 9.6%, Claude Sonnet 4.6 at 10.6%. The implication is that better models alone are unlikely to get you to 0%; closing the gap takes verification at the architectural level.

The HN Community's Honest Take
Grainulator drew attention on Hacker News, but the community's reaction was mixed. Critics pointed out that "it's prompt-based, so the AI can still say whatever it wants" and that "constrained decoding would block hallucinations at the code level without needing prompts at all." Commenters also flagged a demo in which it got the director of the 1932 film Scarface wrong. The tool has real promise, but don't treat it as a silver bullet.

Getting Started

If you want to meaningfully improve your AI hallucination defenses right now, here's a three-step approach.

  1. Measure your actual hallucination rate
    Use an open-source evaluation model like Vectara's HHEM to put a real number on how often your system hallucinates. Getting from "it seems to mess up sometimes" to "7.2% of outputs don't match verified sources" is where you start.
  2. Add a verification layer that breaks responses into atomic claims
    Following the approach of the CLATTER framework, build a pipeline that splits AI responses into individual facts and matches each one to a source. Claim-level verification is far more precise than validating the full response as a unit.
  3. If you're an enterprise, make multi-layer verification your baseline
    The 6-layer structure of source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail is the most battle-tested pattern available today. Evaluate specialized tools like Avido or INRA, or consider cloud-native options like Google Vertex AI Grounding.
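Steps 1 and 2 above can be sketched together: split each response into claims, try to match every claim to a source, and report the share that has no match. This is a simplified illustration, not the HHEM or CLATTER implementation. Sentence splitting stands in for real atomic-claim decomposition, and the word-coverage scorer `supported_by` (with its 0.8 threshold) is a hypothetical placeholder for an evaluation model.

```python
import re


def split_claims(response):
    """Naive stand-in for atomic claim decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def word_set(text):
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supported_by(claim, sources, threshold=0.8):
    """Toy scorer: id of the first source covering enough of the claim's words."""
    words = word_set(claim)
    for source_id, text in sources.items():
        if words and len(words & word_set(text)) / len(words) >= threshold:
            return source_id
    return None


def hallucination_rate(responses, sources, scorer=supported_by):
    """Return (rate, unsupported_claims) across all responses."""
    claims = [c for r in responses for c in split_claims(r)]
    unsupported = [c for c in claims if scorer(c, sources) is None]
    rate = len(unsupported) / len(claims) if claims else 0.0
    return rate, unsupported
```

Because `scorer` is a parameter, the toy word-coverage check can be swapped for a model-based judge without changing how the rate is computed, and the list of unsupported claims shows exactly which statements to block or review.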

Deep Dive Resources

The technical evolution of hallucination detection

Hallucination detection has evolved through roughly three generations. The first used text-overlap metrics (ROUGE, BERTScore), measuring only surface-level similarity. The second used NLI (Natural Language Inference) to assess entailment relationships between sentences (SummaC, AlignScore). The current third generation uses atomic fact decomposition: breaking responses into the smallest possible claims and verifying each one independently (MiniCheck, CLATTER, REFIND).
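A small example shows why first-generation overlap metrics fall short. The function below is a simplified ROUGE-1-style unigram F1, not the official ROUGE implementation, and the three sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1 over lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


source = "the trial found the drug is safe for adults"
faithful = "the drug is safe for adults"
contradicting = "the drug is not safe for adults"

print(f"{unigram_f1(faithful, source):.2f}")        # 0.80
print(f"{unigram_f1(contradicting, source):.2f}")   # 0.75
```

The contradicting sentence scores almost as high as the faithful one because a single inserted "not" barely changes the token overlap. That gap is what entailment-based and claim-level methods are meant to close.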

Google made an interesting discovery in late 2024: simply asking an LLM "are you hallucinating right now?" reduced follow-up hallucinations by 17%. That suggests the problem isn't fundamentally unsolvable; it's a matter of architectural design. Constrained decoding goes a step further: by controlling output token generation directly, it can enforce source mapping structurally instead of relying on prompts.
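The core idea of constrained decoding can be shown with a toy example: before picking the next token, mask out every token that is not permitted at that position. The scores, token strings, and the `constrained_argmax` helper below are hypothetical; production systems apply the same masking to model logits inside the generation loop.

```python
import math


def constrained_argmax(scores, allowed):
    """Pick the highest-scoring token among `allowed`; all others are masked out."""
    masked = {tok: (s if tok in allowed else -math.inf) for tok, s in scores.items()}
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token available at this position")
    return best


# Hypothetical scores a model might assign to tokens in a citation slot.
citation_scores = {"[smith2021]": 2.3, "[doc-1]": 1.1, "[doc-2]": 0.4}
known_sources = {"[doc-1]", "[doc-2]"}   # ids that actually exist in the corpus

unconstrained = max(citation_scores, key=citation_scores.get)
constrained = constrained_argmax(citation_scores, known_sources)

print(unconstrained)   # [smith2021]  (the model's favorite, but not a real source)
print(constrained)     # [doc-1]
```

Note the limit of this guarantee: masking ensures the cited id exists, not that the cited source actually supports the sentence, so it complements claim-level verification instead of replacing it.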