One of the most common and unassuming file formats, the PDF, is proving to be a significant challenge for even the world's most advanced artificial intelligence models.
The extent of this challenge became evident last November when the House Oversight Committee released 20,000 pages of documents from Jeffrey Epstein's estate. Luke Igel and his friends found themselves struggling to navigate tangled email threads using a PDF viewer Igel described as "gross." In subsequent months, the Department of Justice followed suit, releasing over three million files, all in PDF format.
This presented a substantial problem. According to Igel, despite the Department of Justice applying optical character recognition (OCR) to the text, its quality was so poor that the files were virtually unsearchable.
Igel, cofounder of the AI video editing startup Kino, lamented, "There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for." This frustration sparked an idea: what if they could build a "Gmail clone" to intuitively view and search all this correspondence?
Achieving this would require extracting information from PDFs, a task far more complex than it sounds. Despite AI's remarkable progress in developing sophisticated software and solving advanced physics problems, the ubiquitous PDF format remains a formidable hurdle. Edwin Chen, CEO of the data company Surge, categorizes it among AI’s “unsexy failures” that limit real-world utility. He observed last year that even cutting-edge models attempting to extract PDF information often resort to summarizing, conflating footnotes with main text, or outright fabricating content. In a tongue-in-cheek AI development timeline, researcher Pierre-Carl Langlais humorously placed “PDF parsing is solved!” just before the advent of Artificial General Intelligence (AGI).
Igel's initial attempt involved his friend, "tech jester" Riley Walz, utilizing their remaining credits on Google's Gemini. However, this proved reliable only for the cleanest scans and would be prohibitively expensive for millions of documents. This led Igel to reach out to Adit Abraham, a former MIT classmate who ran Reducto, a PDF-parsing AI company located in the office above his.
Reducto, one of several companies tackling the PDF problem, demonstrated its capability by successfully extracting information from email threads marred by cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. Once the data was in a usable format, Igel and Walz embarked on a rapid development effort, creating an entire "Epstein-themed app ecosystem." This included Jmail, a searchable prototype of Epstein’s inbox; Jflights, an interactive globe displaying flight paths, each clickable to reveal underlying PDFs of flight data, passenger manifests, and scanned invitations; Jamazon, for searching Epstein’s Amazon purchases; and Jikipedia, for finding businesses and individuals mentioned in the files, naturally cross-referencing more PDFs.
“That’s where the magic of extracting information out of PDFs became real for me,” Igel stated, adding, “It’s going to completely change the way a lot of jobs happen.”
The inherent difficulty of PDFs for machines stems from their original purpose. Developed by Adobe in the early 1990s, the format aimed to faithfully reproduce documents, preserving their precise visual layout for printing and later for screen display. Unlike formats such as HTML, which represent text in a logical, ordered structure, PDFs comprise character codes, coordinates, and other instructions that essentially "paint" an image of a page.
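This "painting" can be seen in miniature. The snippet below is a hypothetical, simplified PDF content stream (real streams are more elaborate), parsed with a regex to show that the file records only positioned text fragments, in whatever order they were written, with no notion of reading order or structure:

```python
import re

# A miniature, simplified PDF content stream: each Tj operator "paints" a
# text fragment at coordinates set by Td. Nothing records reading order,
# paragraphs, or columns. (Invented example; real streams are more complex.)
content_stream = """
BT /F1 12 Tf  72 700 Td (Intro)          Tj ET
BT /F1 10 Tf 306 660 Td (col 2, line 1)  Tj ET
BT /F1 10 Tf  72 660 Td (col 1, line 1)  Tj ET
"""

# Pull out (x, y, text) triples in the order they appear in the file,
# which need not match the order a human would read them in.
ops = re.findall(r"([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s*Tj", content_stream)
fragments = [(float(x), float(y), text) for x, y, text in ops]

# Naive extraction follows file order; recovering reading order requires
# sorting by layout (top-to-bottom, then left-to-right).
file_order = [t for _, _, t in fragments]
layout_order = [t for _, _, t in sorted(fragments, key=lambda f: (-f[1], f[0]))]
```

Here the second column's text happens to precede the first column's in the file, so naive extraction interleaves the columns; a parser must reconstruct the layout from coordinates alone.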
While optical character recognition (OCR) can convert these textual images back into machine-readable text, it struggles with complex layouts. For instance, in multi-column academic papers, OCR often processes text linearly from left to right, resulting in an incomprehensible jumble. Although OCR tools are designed to adapt to some formatting variations, elements like tables, images, diagrams, captions, footnotes, and headers introduce further significant obstacles. When an AI assistant like ChatGPT is given a PDF, it typically cycles through various tools, sometimes failing, sometimes passing it to a large vision model for OCR, occasionally hallucinating content, and generally consuming considerable time and computing power for inconsistent results.
“The key issue is that they cannot recognize editorial structure,” explained Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”
Compounding PDF's intrinsic difficulty is the historical lack of AI models trained on them. This trend is now shifting, partly driven by AI developers' increasing demand for high-quality data, which PDFs disproportionately contain. Government reports, textbooks, and academic papers are frequently in PDF format. Researchers at the Allen Institute for AI noted last year in a paper announcing their specialized PDF-reading model that “PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models.”
“The lore has it that the very first PDF ever was an IRS 1040,” recounted Duff Johnson, CEO of the PDF Association, the industry body responsible for the PDF global standard (ISO 32000-2:2020), which itself is a nearly thousand-page PDF. In 1994, the IRS sought a method to distribute forms that would maintain absolute consistency without the need for printing and mailing every document, opting instead to send CDs filled with PDFs. From there, the PDF format proliferated with email, becoming a cornerstone of digital workflows. Whether book publishers sending manuscripts to printers, patent applicants submitting device diagrams, or anyone needing to share a document that would appear identically to all recipients, PDF became the go-to solution.
“There’s no other technology solving the problem the PDF solves,” Johnson asserted. He highlighted the ephemeral and browser-dependent nature of websites, the issue of broken links, and the variability and editability of Word documents. In contrast, a PDF remains immutable, appearing identically regardless of who opens it, when, or how.
“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Johnson emphasized. He shared a personal anecdote: “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly. I would expect no less.” (Coincidentally, it was a PDF about PDFs.)
Luca Soldaini, an AI researcher at the Allen Institute for AI who contributed to their PDF model, olmOCR, noted a recent shift towards specialized PDF-parsing models. His team trained a vision language model—akin to a large language model but operating on pixels instead of word tokens—using approximately 100,000 PDFs. This dataset included public domain books, academic papers, brochures, and documents from the Library of Congress with human-written transcriptions. The model was further refined to excel in specific problem areas, such as accurately parsing tables without conflating rows and columns.
“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” Soldaini explained. He noted that olmOCR was the institute's most popular release last year, even rivaling their more generalist models. While PDF-reading AIs may not garner as much public attention, Soldaini stressed that "people are actually using it."
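The cue Soldaini describes can be sketched as a toy heuristic. The thresholds and function below are invented for illustration; olmOCR learns such cues from training data rather than applying hand-written rules:

```python
from statistics import median

# Toy version of the cue Soldaini describes: text notably larger than the
# page's typical font size is probably a header. The 1.5x threshold is an
# invented illustration, not anything olmOCR actually uses.
def label_spans(spans: list[tuple[str, float]]) -> list[tuple[str, str]]:
    """spans: (text, font_size) pairs -> (text, 'header' | 'body') labels."""
    body_size = median(size for _, size in spans)
    return [
        (text, "header" if size >= 1.5 * body_size else "body")
        for text, size in spans
    ]

page = [("Results", 24.0), ("We evaluate...", 11.0), ("Table 1 shows...", 11.0)]
```

A learned model generalizes where a rule like this breaks down, for instance on pages where pull quotes or cover titles are also set in large type.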
A few months later, researchers at Hugging Face, operators of a prominent open-source AI platform, were contemplating their next steps after publishing a 5-billion-document dataset for multilingual model training. Having already processed the vast Common Crawl—a massive archive of mostly HTML text from the web that underpins many large language models—Hugging Face’s Hynek Kydlíček recalled that, like many AI researchers, they wondered if they had exhausted easily accessible data.
“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” Kydlíček recounted. To their surprise, they discovered approximately 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček stated, acknowledging, “But the format of PDFs is, like, super super hard to extract text from.”
Kydlíček and his team developed a system to categorize PDFs into "easy to parse" (primarily text) and "difficult to parse" (rich in images and charts). The more challenging PDFs were then processed by RolmOCR, a version of olmOCR modified by Reducto. After removing an inexplicably large volume of horse racing results from their corpus, the team triumphantly announced they had “liberated three trillion of the finest tokens,” now available for model training.
However, parsing PDFs sufficiently for model training is distinct from achieving the high degree of accuracy demanded by professionals like lawyers and engineers. Initial tests by the Hugging Face team revealed their model sometimes hallucinated text, populating blank pages with nonsense or inventing descriptions for images. While they trained the model to correct these errors, anticipating every formatting quirk or imperfect scan remains an impossibility.
“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček observed. He added, “I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto this. So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct.”
Kydlíček commended Reducto, the company Igel is using for the Epstein files, as one of the leading teams in this domain. Adit Abraham cofounded Reducto initially to manage customers' long-term interactions with language models, akin to the "memory" feature in modern chatbots. However, Abraham increasingly received requests to manage client files, predominantly in PDF format. He found working with them to be “shockingly hard.”
Abraham articulated one of their core insights: “One of our core intuitions was all these documents were made for humans like you and I to interpret, and there’s a lot of visual information here that we take for granted, like that every gap between two paragraphs is me telling you, ‘Hey, this is a new idea.’ Every indentation is me telling you, ‘Hey, this is a sub idea of the parent idea.’ The question was like, how do you encode all of that context?”
Many of Reducto's team members came from backgrounds in self-driving vehicles, where computer vision models segment data into distinct entities such as cars, pedestrians, or dumpsters. They applied a similar methodology to PDFs, first using a model to divide a page into components like headers, tables, and footnotes, before routing these segments to other specialized models for detailed parsing. Their approach, shared in early 2024, garnered immediate and significant attention.
“This wasn’t supposed to be a pivot,” Abraham admitted. Other developers contacted them, revealing that their own progress had been hindered by PDF parsing difficulties. “It kind of spiraled from there.”
Reducto now employs an expanding suite of small, specialized models that perform multiple passes to parse a PDF. When the segmenting model identifies a table, it directs it to a dedicated table-parsing model. If a chart is detected, different elements are sent to various models: one trained to extract axes, another to interpret legends, and so forth. A vision language model then reviews the combined output to correct any errors. This sophisticated approach enables Reducto to convert charts into spreadsheets with high accuracy, a capability Abraham notes has long been sought by their financial clients and still stumps far larger frontier models.
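The segment-then-route architecture can be sketched as a dispatch table. Everything below—the region types, the specialist handlers, the interfaces—is invented for illustration and is not Reducto's actual API; in a real system each handler would call a trained model rather than a lambda:

```python
from typing import Callable

# One specialist per region type, standing in for Reducto's specialized
# models (table parser, chart-axis extractor, and so on). All names and
# interfaces here are hypothetical.
SPECIALISTS: dict[str, Callable[[dict], str]] = {
    "header":   lambda r: f"# {r['raw']}",
    "table":    lambda r: "\n".join(",".join(row) for row in r["cells"]),
    "footnote": lambda r: f"[^note]: {r['raw']}",
    "text":     lambda r: r["raw"],
}

def parse_page(regions: list[dict]) -> str:
    """Take segmenter output, route each region to its specialist, merge.

    A production pipeline would add a final pass where a vision language
    model reviews the merged result against the page image for errors.
    """
    parts = []
    for region in regions:
        handler = SPECIALISTS.get(region["kind"], SPECIALISTS["text"])
        parts.append(handler(region))
    return "\n\n".join(parts)
```

The design mirrors the self-driving lineage Abraham describes: a perception stage segments the scene into typed objects, and downstream components each handle one object type they were trained for.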
Despite these advancements, PDFs, much like self-driving cars, continue to present a "long tail" of unique and unpredictable challenges.
“There’s a big difference between getting a car to stay in a lane versus getting a car to handle whatever would show up on the street, and we see with PDFs a similar thing. I’ve seen the most insane documents you could imagine,” Abraham shared. He described PDF files containing other PDFs, legal documents with alternating underlined and crossed-out passages, and faxes of medical forms covered in doctors' scribbles and connecting lines. “I don’t think PDFs are a fully solved problem. I wish that were the case. We’re close, but there’s still plenty to do.”
A shortage of PDFs to parse is unlikely. The format shows no signs of disappearing. Duff Johnson of the PDF Association expressed incredulity at the very thought. He recalled past attempts by companies to displace PDF, noting that their products are “now a footnote in history,” while PDFs continue their widespread proliferation.
“Look at the Google Trends for PDF,” Johnson suggested, pointing to a steadily rising curve (with consistent dips in August) year after year. “No other technology looks like that. More and more people over time are including PDF in their searches, because that tends to be where the high-quality content is.”
“What’s going to happen is that all the world’s systems will instead understand and use PDF better and better,” Johnson concluded. “The AI companies didn’t focus on PDF, because PDF is very hard, until they realized that, well, it turns out a lot of the really high-quality stuff is in fact in PDF, and so now we have to deal with it.”
The Editorial Staff at AIChief is a team of professional content writers with extensive experience in AI and marketing.