Ask ChatGPT for paper references and it'll confidently cite studies that don't exist. With GPT-3.5, 39–55% of citations are fabricated. Even with GPT-4, 18–29% are still made up. As of July 2025, there are more than 206 documented cases of lawyers being fined for submitting AI-hallucinated case citations in court. "Reduce hallucinations" has become a tired phrase. A new class of tools is taking a different approach: if the AI can't cite it, it doesn't say it.
What Is It?
Grainulator, which recently caught attention on Hacker News, is an open-source research tool built around one principle: if it can't cite a source, it won't answer. Toss it a question and it runs through a 3-pass investigation followed by a 7-pass compilation before producing a response. The real story is in how that pipeline is designed.
What sets Grainulator apart from standard chatbots is that every claim gets one of five evidence tiers: "stated" (just said it), "web" (pulled from a web search), "documented" (verified in a document), "tested" (verified through testing), or "production" (verified in a live environment). If the evidence is weak or contradictions between claims go unresolved, the compiler blocks the output.
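To make the gating idea concrete, here is a minimal Python sketch of a tiered evidence check. It is not Grainulator's actual code: the tier names come from the description above, but the numeric ordering, the `Claim` fields, the `compile_output` function, and the `min_tier` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceTier(IntEnum):
    """Tier names from the article; the weakest-to-strongest ordering is an assumption."""
    STATED = 1      # just said it
    WEB = 2         # pulled from a web search
    DOCUMENTED = 3  # verified in a document
    TESTED = 4      # verified through testing
    PRODUCTION = 5  # verified in a live environment


@dataclass(frozen=True)
class Claim:
    text: str
    tier: EvidenceTier
    conflicts_with: tuple = ()  # hypothetical field: texts of unresolved conflicting claims


def compile_output(claims, min_tier=EvidenceTier.WEB):
    """Return the claim texts if every claim passes the gate; otherwise block."""
    problems = []
    for claim in claims:
        if claim.tier < min_tier:
            problems.append(f"weak evidence ({claim.tier.name.lower()}): {claim.text}")
        if claim.conflicts_with:
            problems.append(f"unresolved contradiction: {claim.text}")
    if problems:
        # Blocking is all-or-nothing here: one bad claim stops the whole output.
        raise ValueError("output blocked:\n" + "\n".join(problems))
    return [claim.text for claim in claims]


if __name__ == "__main__":
    draft = [
        Claim("The API returns JSON.", EvidenceTier.DOCUMENTED),
        Claim("It handles 10k requests per second.", EvidenceTier.STATED),
    ]
    try:
        print(compile_output(draft))
    except ValueError as err:
        print(err)  # the second claim is only "stated", so the draft is blocked
```

The point of the sketch is that the block happens in code, after generation, rather than relying on the model to police itself.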
What Changes?
When people talk about preventing hallucinations, the go-to answer is RAG (Retrieval-Augmented Generation): stuff search results into the context window. But evidence is piling up that RAG alone isn't enough.
| Approach | How It Works | Limitations |
|---|---|---|
| Basic RAG | Retrieve documents → feed as context to LLM | If search results are inaccurate, hallucinations persist. On Stanford's legal RAG benchmark, 1 in 6 citations was still fabricated |
| Multi-layer verification (INRA, etc.) | Source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail | Achieves hallucination rates below 0.1%, but specialized for academic citations — limited general applicability |
| Claim-level verification (Grainulator, CLATTER) | Decompose responses into atomic claims → match each claim to evidence → detect contradictions → block unverified claims | Slower processing (40–70 seconds), but structurally prevents any unsourced statement from getting through |
| Constrained Decoding | Structurally enforce source mapping at the token output level in code | Most reliable, but high implementation complexity. Requires actual programming, not just prompt engineering |
Looking at Vectara's hallucination leaderboard, even top-performing models hallucinate at rates above 1.8% on summarization tasks. GPT-4o sits at 9.6%, Claude Sonnet 4.6 at 10.6%. That suggests better models alone won't get you to 0%; you need verification at the architectural level.
Grainulator drew attention on Hacker News, but the community's reaction was mixed. Critics pointed out that "it's prompt-based, so the AI can still say whatever it wants" and that "constrained decoding would block hallucinations at the code level without needing prompts at all." A demo where it got the director of the 1932 film Scarface wrong was also flagged. The tool has real promise — just don't treat it as a silver bullet.
Getting Started
If you want to meaningfully improve your AI hallucination defenses right now, here's a three-step approach.
- Measure your actual hallucination rate
Use an open-source evaluation model like Vectara's HHEM to put a real number on how often your system hallucinates. Getting from "it seems to mess up sometimes" to "7.2% of outputs don't match verified sources" is where you start.
- Add a verification layer that breaks responses into atomic claims
Like the CLATTER framework, build a pipeline that splits AI responses into individual facts and matches each one to a source. Claim-level verification is far more precise than validating the full response as a unit.
- If you're enterprise, make multi-layer verification your baseline
The 6-layer structure of source retrieval → context annotation → LLM constraints → real-time verification → post-processing cleanup → audit trail is the most battle-tested pattern available today. Evaluate specialized tools like Avido or INRA, or consider cloud-native options like Google Vertex AI Grounding.
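The first two steps can be sketched together in a few lines of Python. This is a rough illustration, not CLATTER or HHEM: `split_into_claims` is a naive sentence splitter standing in for real atomic-claim decomposition, and `toy_supports` is a deliberately crude word-containment check standing in for an evaluation model. Every function name here is hypothetical, and in practice you would pass your own checker as `supports`.

```python
import re


def split_into_claims(response):
    """Naive stand-in for atomic claim decomposition: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [part for part in parts if part]


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_supports(claim, source):
    """Hypothetical checker: every word of the claim must appear in the source."""
    return _words(claim) <= _words(source)


def verify(response, sources, supports=toy_supports):
    """Pair each claim with the index of the first supporting source, or None."""
    results = []
    for claim in split_into_claims(response):
        match = next((i for i, src in enumerate(sources) if supports(claim, src)), None)
        results.append((claim, match))
    return results


def unsupported_rate(results):
    """Share of claims with no supporting source (0.0 for an empty response)."""
    if not results:
        return 0.0
    return sum(1 for _, match in results if match is None) / len(results)


if __name__ == "__main__":
    # Made-up sources and response, purely for illustration.
    sources = ["The cache service listens on port 8080.", "Logs are rotated daily."]
    response = "The cache service listens on port 8080. Logs are rotated weekly."
    results = verify(response, sources)
    for claim, match in results:
        print("OK     " if match is not None else "BLOCKED", claim)
    print(f"unsupported rate: {unsupported_rate(results):.0%}")
```

Checking per claim means the one wrong sentence gets flagged while the correct one passes, which is exactly what whole-response validation tends to miss.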
Deep Dive Resources
The technical evolution of hallucination detection
Hallucination detection has evolved through roughly three generations. The first used text-overlap metrics (ROUGE, BERTScore), measuring only surface-level similarity. The second used NLI (Natural Language Inference) to assess entailment relationships between sentences (SUMMAC, AlignScore). The current third generation uses atomic fact decomposition — breaking responses into the smallest possible claims and verifying each one independently (MiniCheck, CLATTER, REFIND).
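A toy example shows why first-generation overlap metrics fall short. The function below is a simplified, ROUGE-1-recall-style score written for illustration (no stemming, no official ROUGE tooling), and the two sentences are made up.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(reference, candidate):
    """Fraction of reference tokens that also appear in the candidate."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / max(sum(ref.values()), 1)


if __name__ == "__main__":
    reference = "The trial enrolled 120 patients in 2019."
    candidate = "The trial enrolled 210 patients in 2019."  # wrong patient count
    # Six of the seven reference tokens match, so the score is high
    # even though the one token that differs is the fact that matters.
    print(round(unigram_recall(reference, candidate), 3))
```

A single swapped number barely moves the score, which is the gap that entailment-based and claim-level checks are meant to close.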
Google made an interesting discovery in late 2024: simply asking an LLM "are you hallucinating right now?" reduced follow-up hallucinations by 17%. That suggests the problem isn't fundamentally unsolvable — it's a matter of architectural design. With constrained decoding, you can structurally prevent hallucinations by controlling the output token generation directly, without relying on prompts at all.
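Here is a toy sketch of the constrained-decoding idea applied to a citation slot. There is no real model involved: `logits` is a hand-written score table, and `constrained_pick` is a hypothetical helper. Real implementations apply the same kind of mask to the model's logits at each generation step.

```python
import math


def constrained_pick(logits, allowed):
    """Pick the highest-scoring token among those the constraint permits."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no permitted token available")
    return best


if __name__ == "__main__":
    # Made-up scores for the citation slot. The "model" prefers [doc7],
    # which was never retrieved, so it would be a fabricated citation.
    logits = {"[doc7]": 3.2, "[doc1]": 1.1, "[doc2]": 0.4}
    retrieved = {"[doc1]", "[doc2]"}
    print(constrained_pick(logits, retrieved))  # [doc1]
```

Note what the mask does and doesn't guarantee: the cited ID always refers to a source that was actually retrieved, but whether that source supports the claim still needs a separate check.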



