"The fact that every scientific paper in 2026 is still published as a PDF tells you everything about how far the science system has fallen behind AI."

On March 31st, Wharton's Ethan Mollick posted that one line on X, and 1,200 people liked it. The academic response came almost immediately. Yale SOM's Paul Goldsmith-Pinkham published "LLM-Friendly Academic Papers: A Proposal" the same month, and the whole argument boils down to one sentence — "We're asking AI to read photographs of text (PDFs), while the original intent and context sits with us, the authors."

What Is It?

As of 2026, arXiv has 2.4 million cumulative papers and Overleaf has 15 million users. Almost all of those papers are written in LaTeX, but what ships externally is a single PDF. Here's the thing — PDF was designed in 1993 for print. It's not text; it's a collection of glyphs placed at specific X and Y coordinates.

That's where the AI-era cost kicks in. A 2025 study by Peters & Chin-Yee published in Royal Society Open Science found that LLM-generated paper summaries overgeneralize conclusions about five times more often than human summaries do. A treatment that works for patients under 65 becomes "a treatment that works." The caveats are always the first thing to drop out.

And it's worth noting this isn't just an AI problem.

  • Accessibility is already broken
    A 2024 analysis of 20,000 PDFs by Kumar & Wang found that 74.9% didn't meet any accessibility standard for visually impaired readers. That's a human problem before it's an AI problem.
  • Mobile breaks it too
    A two-column layout on a 6.1-inch screen turns into scrambled line breaks and dropped captions. Researchers don't read mobile PDFs all the way through either.
  • LLMs are reading pixels
    A table comes in not as "data" but as a "rendered image of a table." To an LLM, a regression coefficient and a standard error are just similar-looking pixel clusters.
  • The author's judgment disappears
    The fact that "this result is the main finding and that one is secondary" exists nowhere in the PDF. LLMs can only guess from paragraph length.

So Can't We Just Build Better PDF Parsers?

The industry has already spent years on exactly that. As of May 2026, Firecrawl, Docling (IBM, 58.6k GitHub stars), Marker-PDF (34.4k stars), LlamaParse, Unstructured (14.6k stars), and Reducto are all competing for the title of "best PDF parser".

But the same comparative analysis makes two things clear at once.

| Parser | Strengths | Weaknesses |
| --- | --- | --- |
| Firecrawl /parse (auto/fast/ocr) | Under 400 ms per page; 5x faster than alternatives | Still loses data on complex table structures |
| Docling (IBM) | Unified DoclingDocument representation; MCP server included | Requires a local GPU; performance varies outside trained domains |
| Marker-PDF (--use_llm) | LLM post-processes table structure; cleanest output for human readers | VLM hallucination risk rises with text-dense academic papers |
| Bottom line | Even perfect parsing leaves some problems unsolved | Recovering layout and recovering "author intent" are different problems |

The Firecrawl comparison piece spells out the conclusion explicitly: layout errors have a cascading effect — one break and everything downstream falls apart like dominoes; table structure is the last remaining unsolved problem; and VLM-based parsers carry the highest hallucination risk specifically with text-dense academic papers.

And then Goldsmith-Pinkham lands the key claim — "Even perfect parsing can't solve certain problems." Which result is the main finding, which limitation is the most critical, whether "experience" in this paper means customer count over the past year or years of tenure — none of that exists anywhere in the PDF's pixels. Only the author knows.

What's the Fix Academia Is Proposing?

The core of Goldsmith-Pinkham's proposal is this: leave the PDF alone and put two more files next to it. No code changes required.

  1. llms.txt — A guide written by the author
    A short Markdown file covering "what this paper shows and what it deliberately does not show." Seven recommended sections: what the paper is about / important context / data and methods / key findings / limitations and scope / where to start reading / publication status. The most critical section is limitations — the first thing LLMs tend to drop.
  2. paper bundle — Paper + data + code as a zip
    paper.md (the full Markdown body), figures/, data/ (tables as CSVs), code/ (reproducible with a single reproduce.sh command), references.bib. The key move is including tables as CSVs, not PNGs.
  3. Tiered adoption
    arXiv and Overleaf already run LaTeX-to-Markdown conversion pipelines, so a single "Generate LLM bundle" button would handle it. For PDF-only papers, GROBID, Docling, or Nougat can handle the conversion. The minimum viable step is "write one llms.txt by hand and upload it next to the PDF" — that takes 15 minutes.
  4. Why the author has to write it
    An LLM can draft an llms.txt from paper.md — but "which limitation is actually the binding constraint" is something only the author knows. Which sample restriction really boxes in the conclusion, which robustness check is worth staking your career on — that information isn't in the pixels.
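The bundle layout above is simple enough to scaffold in a few lines. Here is a rough sketch — not the proposal's official tooling — that creates the directory skeleton and an llms.txt stub with the seven recommended section headings, leaving the content (especially limitations) for the author to fill in by hand:

```python
from pathlib import Path

# The seven recommended llms.txt sections from the proposal.
SECTIONS = [
    "What the paper is about",
    "Important context",
    "Data and methods",
    "Key findings",
    "Limitations and scope",
    "Where to start reading",
    "Publication status",
]

def scaffold_bundle(root: str) -> Path:
    """Create the bundle skeleton: paper.md, figures/, data/, code/,
    references.bib, and an llms.txt stub. Returns the llms.txt path."""
    base = Path(root)
    for d in ("figures", "data", "code"):
        (base / d).mkdir(parents=True, exist_ok=True)
    (base / "paper.md").touch()
    (base / "references.bib").touch()
    (base / "code" / "reproduce.sh").touch()
    body = "\n\n".join(f"## {s}\n\n<!-- author fills this in -->" for s in SECTIONS)
    out = base / "llms.txt"
    out.write_text("# llms.txt\n\n" + body + "\n", encoding="utf-8")
    return out
```

The skeleton is the cheap part; the 15 minutes the proposal budgets go into the limitations section, which no script can write for you.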

The comment thread on Mollick's post points the same direction. One researcher writes: "Writing in RMarkdown simultaneously outputs both LaTeX and Markdown — the switching cost is basically zero, but nobody moves." Another comment sums it up in one line: "mdarxiv should exist."

What Changes for Your Company?

"We're not academics, so why does this matter?" might be your first reaction. But the same structural problem applies to every PDF inside your company.

  1. Step 1: Default to Markdown alongside every internal PDF
    When publishing legal reviews, IR materials, internal reports, or quarterly earnings PDFs, upload a .md or .html version alongside it in the same repository. Your RAG pipeline accuracy improves immediately. This is the corporate version of the Goldsmith-Pinkham proposal.
  2. Step 2: Save every table as a separate CSV
    Break the habit of embedding table images in slides and reports. Put the same table as a CSV next to it, and your internal LLM can actually compare and verify numbers.
  3. Step 3: Write author intent as a 1-page llms.txt
    At the top of any long report, add a separate Markdown section covering: "what this report shows / what it does not show / the three most important limitations." AI will read this first when it summarizes — it's the cheapest way to cut that 5x overgeneralization risk.
  4. Step 4: Apply the same approach to external publications
    The PDFs you send to customers and journalists will eventually be fed into an LLM. Distributing press releases and white papers alongside their Markdown originals means your intent gets preserved more accurately — in both search indexes and AI summaries.

Getting Started

  1. Step 1: Bundle the next PDF you publish with a .md file
    Pick one report, paper, or white paper you're currently working on and produce a .md version alongside it. If it's LaTeX, one pandoc command does it. If it's Word, use Pandoc or MarkItDown.
  2. Step 2: Break out the tables as CSVs
    Take three tables from that document and save them as separate CSVs. Just put them in the same folder — that's it.
  3. Step 3: Write one llms.txt
    15 minutes. Of the seven sections, make sure at minimum you're explicit about "what this does not show" and "the most important limitations." An LLM can draft the rest.
  4. Step 4: Compare RAG and search results
    Ask the same question against your index with PDF-only vs. (PDF + md + llms.txt). The difference in answer accuracy and source citation quality shows up immediately.
  5. Step 5: Lock it in as a guideline
    If the results hold, add one line to your publishing guide: "No PDF-only releases — Markdown and CSV companions must accompany all documents." If academia is going to get there within a year, companies can move faster.
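The guideline in Step 5 is easy to enforce mechanically. A minimal sketch of a pre-release check that flags any PDF in a folder lacking a same-named Markdown companion (the folder layout is an assumption; adapt the glob to your repository):

```python
from pathlib import Path

def pdf_only_releases(folder: str) -> list[str]:
    """Return the names of PDFs in `folder` that have no .md companion
    of the same base name sitting next to them."""
    offenders = []
    for pdf in Path(folder).glob("*.pdf"):
        if not pdf.with_suffix(".md").exists():
            offenders.append(pdf.name)
    return sorted(offenders)
```

Wire this into CI or a publish script and "no PDF-only releases" stops being a guideline and becomes a gate.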

Deep Dive Resources

Paul Goldsmith-Pinkham — LLM-Friendly Academic Papers: A Proposal. The full version of the llms.txt + paper bundle proposal, including the seven-section template, the three-tier adoption path, and an arXiv automation code repository. (paulgp.substack.com)

Firecrawl — The Best PDF Parsers in 2025/2026. A head-to-head comparison of Firecrawl, Docling, Marker, LlamaParse, Unstructured, and Reducto, covering specific failure modes — layout cascading, VLM hallucination — with concrete examples. (firecrawl.dev)

Ethan Mollick — Original Post on X (2026-03-31). The one-liner that sparked the conversation about how far the science system has fallen behind AI. Read the comment thread too — the mdarxiv and RMarkdown proposals are buried in the replies. (x.com/emollick)