"The fact that every scientific paper in 2026 is still published as a PDF tells you everything about how far the science system has fallen behind AI."

On March 31st, Wharton's Ethan Mollick posted that one line on X, and 1,200 people liked it. The academic response came almost immediately. Yale SOM's Paul Goldsmith-Pinkham published "LLM-Friendly Academic Papers: A Proposal" the same month, and the whole argument boils down to one sentence — "We're asking AI to read photographs of text (PDFs), while the original intent and context sits with us, the authors."

What Is It?

As of 2026, arXiv has 2.4 million cumulative papers and Overleaf has 15 million users. Almost all of those papers are written in LaTeX, but what ships externally is a single PDF. Here's the thing — PDF was designed in 1993 for print. It's not text; it's a collection of glyphs placed at specific X and Y coordinates.

That's where the AI-era cost kicks in. A 2025 study by Peters & Chin-Yee published in Royal Society Open Science found that LLM-generated paper summaries overgeneralize conclusions about five times more often than human summaries do. A treatment that works for patients under 65 becomes "a treatment that works." The caveats are always the first thing to drop out.

And it's worth noting this isn't just an AI problem.

  • Accessibility is already broken
    A 2024 analysis of 20,000 PDFs by Kumar & Wang found that 74.9% didn't meet any accessibility standard for visually impaired readers. That's a human problem before it's an AI problem.
  • Mobile breaks it too
    A two-column layout on a 6.1-inch screen turns into scrambled line breaks and dropped captions. Researchers don't read mobile PDFs all the way through either.
  • LLMs are reading pixels
    A table comes in not as "data" but as a "rendered image of a table." To an LLM, a regression coefficient and a standard error are just similar-looking pixel clusters.
  • The author's judgment disappears
    The fact that "this result is the main finding and that one is secondary" exists nowhere in the PDF. LLMs can only guess from paragraph length.

So Can't We Just Build Better PDF Parsers?

The industry has already spent years on exactly that. As of May 2026, Firecrawl, Docling (IBM, 58.6k GitHub stars), Marker-PDF (34.4k stars), LlamaParse, Unstructured (14.6k stars), and Reducto are all competing for the title of "best PDF parser".

But the same comparative analysis makes two things clear at once.

| Parser | Strengths | Weaknesses |
| --- | --- | --- |
| Firecrawl /parse (auto/fast/ocr) | Under 400 ms per page; 5x faster than alternatives | Still loses data on complex table structures |
| Docling (IBM) | Unified DoclingDocument representation; MCP server included | Requires a local GPU; performance varies outside trained domains |
| Marker-PDF (--use_llm) | LLM post-processes table structure; cleanest output for human readers | VLM hallucination risk rises with text-dense academic papers |
| Bottom line | Even perfect parsing leaves some problems unsolved | Recovering layout and recovering "author intent" are different problems |

The Firecrawl comparison piece spells out the conclusion explicitly: layout errors have a cascading effect — one break and everything downstream falls apart like dominoes; table structure is the last remaining unsolved problem; and VLM-based parsers carry the highest hallucination risk specifically with text-dense academic papers.

And then Goldsmith-Pinkham lands the key claim — "Even perfect parsing can't solve certain problems." Which result is the main finding, which limitation is the most critical, whether "experience" in this paper means customer count over the past year or years of tenure — none of that exists anywhere in the PDF's pixels. Only the author knows.

What's the Fix Academia Is Proposing?

The core of Goldsmith-Pinkham's proposal is this: leave the PDF alone and put two more files next to it. No code changes required.

  1. llms.txt — A guide written by the author
    A short Markdown file covering "what this paper shows and what it deliberately does not show." Seven recommended sections: what the paper is about / important context / data and methods / key findings / limitations and scope / where to start reading / publication status. The most critical section is limitations — the first thing LLMs tend to drop.
  2. paper bundle — Paper + data + code as a zip
    paper.md (the full Markdown body), figures/, data/ (tables as CSVs), code/ (reproducible with a single reproduce.sh command), references.bib. The key move is including tables as CSVs, not PNGs.
  3. Tiered adoption
    arXiv and Overleaf already run LaTeX-to-Markdown conversion pipelines, so a single "Generate LLM bundle" button would handle it. For PDF-only papers, GROBID, Docling, or Nougat can handle the conversion. The minimum viable step is "write one llms.txt by hand and upload it next to the PDF" — that takes 15 minutes.
  4. Why the author has to write it
    An LLM can draft an llms.txt from paper.md — but "which limitation is actually the binding constraint" is something only the author knows. Which sample restriction really boxes in the conclusion, which robustness check is worth staking your career on — that information isn't in the pixels.
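The bundle layout above is simple enough to scaffold in a few lines. Here is a rough sketch — not the proposal's official tooling — that creates the directory skeleton and an llms.txt stub with the seven recommended section headings, leaving the content (especially limitations) for the author to fill in by hand:

```python
from pathlib import Path

# The seven recommended llms.txt sections from the proposal.
SECTIONS = [
    "What the paper is about",
    "Important context",
    "Data and methods",
    "Key findings",
    "Limitations and scope",
    "Where to start reading",
    "Publication status",
]

def scaffold_bundle(root: str) -> Path:
    """Create the bundle skeleton: paper.md, figures/, data/, code/,
    references.bib, and an llms.txt stub. Returns the llms.txt path."""
    base = Path(root)
    for d in ("figures", "data", "code"):
        (base / d).mkdir(parents=True, exist_ok=True)
    (base / "paper.md").touch()
    (base / "references.bib").touch()
    (base / "code" / "reproduce.sh").touch()
    body = "\n\n".join(f"## {s}\n\n<!-- author fills this in -->" for s in SECTIONS)
    out = base / "llms.txt"
    out.write_text("# llms.txt\n\n" + body + "\n", encoding="utf-8")
    return out
```

The skeleton is the cheap part; the 15 minutes the proposal budgets go into the limitations section, which no script can write for you.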

The comment thread on Mollick's post points the same direction. One researcher writes: "Writing in RMarkdown simultaneously outputs both LaTeX and Markdown — the switching cost is basically zero, but nobody moves." Another comment sums it up in one line: "mdarxiv should exist."

What Changes for Your Company?

"We're not academics, so why does this matter?" might be your first reaction. But the same structural problem applies to every PDF inside your company.

  1. Step 1: Default to Markdown alongside every internal PDF
    When publishing legal reviews, IR materials, internal reports, or quarterly earnings PDFs, upload a .md or .html version alongside it in the same repository. Your RAG pipeline accuracy improves immediately. This is the corporate version of the Goldsmith-Pinkham proposal.
  2. Step 2: Save every table as a separate CSV
    Break the habit of embedding table images in slides and reports. Put the same table as a CSV next to it, and your internal LLM can actually compare and verify numbers.
  3. Step 3: Write author intent as a 1-page llms.txt
    At the top of any long report, add a separate Markdown section covering: "what this report shows / what it does not show / the three most important limitations." AI will read this first when it summarizes — it's the cheapest way to cut that 5x overgeneralization risk.
  4. Step 4: Apply the same approach to external publications
    The PDFs you send to customers and journalists will eventually be fed into an LLM. Distributing press releases and white papers alongside their Markdown originals means your intent gets preserved more accurately — in both search indexes and AI summaries.

Getting Started

  1. Step 1: Bundle the next PDF you publish with a .md file
    Pick one report, paper, or white paper you're currently working on and produce a .md version alongside it. If it's LaTeX, one pandoc command does it. If it's Word, use Pandoc or MarkItDown.
  2. Step 2: Break out the tables as CSVs
    Take three tables from that document and save them as separate CSVs. Just put them in the same folder — that's it.
  3. Step 3: Write one llms.txt
    15 minutes. Of the seven sections, make sure at minimum you're explicit about "what this does not show" and "the most important limitations." An LLM can draft the rest.
  4. Step 4: Compare RAG and search results
    Ask the same question against your index with PDF-only vs. (PDF + md + llms.txt). The difference in answer accuracy and source citation quality shows up immediately.
  5. Step 5: Lock it in as a guideline
    If the results hold, add one line to your publishing guide: "No PDF-only releases — Markdown and CSV companions must accompany all documents." If academia is going to get there within a year, companies can move faster.
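The guideline in Step 5 is easy to enforce mechanically. A minimal sketch of a pre-release check that flags any PDF in a folder lacking a same-named Markdown companion (the folder layout is an assumption; adapt the glob to your repository):

```python
from pathlib import Path

def pdf_only_releases(folder: str) -> list[str]:
    """Return the names of PDFs in `folder` that have no .md companion
    of the same base name sitting next to them."""
    offenders = []
    for pdf in Path(folder).glob("*.pdf"):
        if not pdf.with_suffix(".md").exists():
            offenders.append(pdf.name)
    return sorted(offenders)
```

Wire this into CI or a publish script and "no PDF-only releases" stops being a guideline and becomes a gate.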

Deep Dive Resources

Paul Goldsmith-Pinkham — LLM-Friendly Academic Papers: A Proposal. The full version of the llms.txt + paper bundle proposal, including the seven-section template, the three-tier adoption path, and an arXiv automation code repository. (paulgp.substack.com)

Firecrawl — The Best PDF Parsers in 2025/2026. A head-to-head comparison of Firecrawl, Docling, Marker, LlamaParse, Unstructured, and Reducto, covering specific failure modes — layout cascading, VLM hallucination — with concrete examples. (firecrawl.dev)

Ethan Mollick — Original Post on X (2026-03-31). The one-liner that sparked the conversation about how far the science system has fallen behind AI. Read the comment thread too — the mdarxiv and RMarkdown proposals are buried in the replies. (x.com/emollick)