

Donut and OCR-Free Document Understanding

End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.

Advanced · Tier 3 · ~40 min

Why This Matters

Traditional document intelligence pipelines rely on OCR as a first step: extract text, then reason about it. This creates a hard dependency on OCR quality. When OCR fails (handwritten text, degraded scans, unusual fonts, low-resolution images), every downstream component fails too. Errors compound through the pipeline.

OCR-free models sidestep this entirely. They take a document image as input and produce structured output directly. No text detection, no character recognition, no bounding box alignment. One model, end to end.

Mental Model

Think of the difference between reading a document with your eyes versus having someone transcribe it for you first. If the transcriber makes errors, you reason over corrupted text. OCR-free models "read with their eyes": a vision encoder processes the raw pixels, and a text decoder generates the desired output (JSON fields, LaTeX, answers to questions).

Formal Setup and Notation

Definition

OCR-Free Document Model

An OCR-free document model is a function $f_\theta: \mathcal{I} \to \mathcal{S}$, where $\mathcal{I}$ is the space of document images and $\mathcal{S}$ is the space of structured text sequences. The model consists of:

  • A vision encoder $g_\phi: \mathcal{I} \to \mathbb{R}^{L \times d}$ producing $L$ visual tokens of dimension $d$
  • A text decoder $h_\psi: \mathbb{R}^{L \times d} \times \mathcal{V}^* \to \mathcal{V}$ that autoregressively generates output tokens conditioned on visual features

No OCR module appears in this pipeline. The encoder must learn to "read" directly from pixels.
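The $f_\theta = h_\psi \circ g_\phi$ factorization can be sketched with a toy stand-in. Everything here is invented for illustration (the hash-based "encoder", the scripted "decoder", the token names); it shows only the control flow: encode once, then generate tokens autoregressively until an end-of-sequence marker.

```python
# Toy sketch of the encoder-decoder factorization. All names and the
# scripted "model" below are illustrative, not Donut's real API.

def encode(image):
    # Stand-in vision encoder g_phi: returns a list of "visual tokens".
    return [sum(row) % 7 for row in image]

def decoder_step(visual_tokens, prefix):
    # Stand-in decoder h_psi: deterministically emits the next token of a
    # fixed script, conditioned (trivially) on the visual features.
    script = ["{", "total", ":", "42", "}", "<eos>"]
    return script[len(prefix)]

def generate(image, max_len=10):
    visual = encode(image)          # encode the page once
    out = []
    while len(out) < max_len:       # autoregressive loop
        tok = decoder_step(visual, out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(generate([[1, 2], [3, 4]]))   # a short structured token sequence
```

The real model replaces both stand-ins with a Swin Transformer and a BART-style decoder, but the loop structure is the same.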

Definition

Prompted Output Generation

For tasks like key information extraction, the model is prompted with a task token. Given image $I$ and task prompt $p$ (e.g., "[extract_invoice]"), the model generates:

$$\hat{s} = \arg\max_{s \in \mathcal{V}^*} \prod_{t=1}^{T} p_\theta(s_t \mid s_{<t}, g_\phi(I), p)$$

The output $\hat{s}$ is a structured string (JSON, XML, or LaTeX) that can be parsed into typed fields.
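As a sketch of that parsing step, here is a minimal parser for a Donut-style serialization in which field values are wrapped in `<s_field>...</s_field>` markers. The function name `token2json` and the flat, non-nested format are simplifying assumptions; the real format also supports nesting and repeated groups.

```python
import re

def token2json(seq: str) -> dict:
    """Parse a Donut-style serialized string with <s_field>...</s_field>
    markers into a flat dict (simplified: no nesting, no repeated groups)."""
    fields = {}
    for name, value in re.findall(r"<s_(\w+)>(.*?)</s_\w+>", seq):
        fields[name] = value.strip()
    return fields

seq = "<s_company>ACME Corp</s_company><s_total>12.50</s_total>"
print(token2json(seq))  # {'company': 'ACME Corp', 'total': '12.50'}
```

If the generated string is malformed (a missing closing marker, say), the unmatched field is simply dropped, which is one reason generation errors in OCR-free models surface as missing fields rather than garbled characters.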

Core Definitions

The Donut (Document Understanding Transformer) architecture uses a Swin Transformer as the vision encoder and a BART-style decoder for text generation. The encoder processes the document image at high resolution (typically $2560 \times 1920$ pixels) and outputs a sequence of visual feature vectors. The decoder generates output tokens conditioned on these features.

Nougat (Neural Optical Understanding for Academic Documents) applies the same OCR-free principle to academic papers. Given a PDF page rendered as an image, Nougat outputs the corresponding LaTeX/Markdown source. This is useful for converting legacy papers to machine-readable formats.

The teacher forcing training procedure is standard: given ground-truth output sequence $s_1, \ldots, s_T$, minimize the cross-entropy loss at each position conditioned on the true prefix.
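The teacher-forcing loss can be computed by hand on a toy example. The three-step "vocabulary" and the probabilities below are invented for illustration; the point is that each position is scored against the true prefix, never the model's own previous prediction.

```python
import math

def teacher_forcing_loss(probs, target):
    """Average negative log-likelihood of the target sequence under
    teacher forcing: at each step t the model is conditioned on the
    TRUE prefix s_{<t}, and we score p(s_t | s_{<t}, image).

    probs:  list over positions t of dicts {token: p(token | true prefix)}
    target: the ground-truth token sequence s_1..s_T
    """
    assert len(probs) == len(target)
    nll = -sum(math.log(p_t[s_t]) for p_t, s_t in zip(probs, target))
    return nll / len(target)

# Toy 3-step sequence; the model is fairly confident at each step.
probs = [{"{": 0.9, "}": 0.1},
         {"total": 0.8, "date": 0.2},
         {"}": 0.7, "{": 0.3}]
target = ["{", "total", "}"]
loss = teacher_forcing_loss(probs, target)
print(round(loss, 4))  # 0.2284
```

A perfectly confident model (probability 1 on every true token) would drive this loss to zero; spreading mass onto wrong tokens raises it.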

Main Theorems

Proposition

OCR-Free Training Objective

Statement

The Donut training objective minimizes:

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i} \log p_\theta(s_t^{(i)} \mid s_{<t}^{(i)}, g_\phi(I^{(i)}))$$

where $(I^{(i)}, s^{(i)})$ are image-text pairs. The output sequence $s^{(i)}$ encodes the structured extraction target as a serialized string (e.g., JSON with special tokens for field names).
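A minimal sketch of serializing a JSON target into such a string, in the spirit of Donut's format: the `json2token` helper and the exact `<s_field>` marker syntax are assumptions for illustration, and the real implementation also handles lists and registers the markers as special vocabulary tokens.

```python
def json2token(obj) -> str:
    """Serialize an extraction target into a flat string with
    <s_field>...</s_field> markers (simplified: dicts and strings only)."""
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    return str(obj)

target = {"company": "ACME Corp",
          "total": {"price": "12.50", "currency": "USD"}}
print(json2token(target))
```

The serialized string is what the decoder is trained to emit token by token, so nested structure in the target becomes nested marker pairs in the output sequence.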

For Donut, the Swin Transformer encoder processes the image at resolution $H \times W$ with patch size $p$, producing $L = (H/p) \times (W/p)$ visual tokens. These tokens serve as the cross-attention keys and values for the decoder.
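The token-count formula is easy to compute directly. In the sketch below, `effective_patch` is an assumed simplification that folds the initial patch size and any later patch-merging of a hierarchical encoder into one spatial downsampling factor:

```python
def num_visual_tokens(height, width, effective_patch):
    """L = (H / p_eff) * (W / p_eff), where p_eff is the encoder's total
    spatial downsampling factor (initial patch size times any later
    patch-merging factors in a hierarchical encoder like Swin)."""
    assert height % effective_patch == 0 and width % effective_patch == 0
    return (height // effective_patch) * (width // effective_patch)

# With only the initial 4x4 patch embedding, a 2560x1920 page yields:
print(num_visual_tokens(2560, 1920, 4))  # 307200 tokens
# Each patch-merging stage then shrinks the token count by a factor of 4.
```

This is why resolution and token count trade off so sharply: doubling both image dimensions quadruples $L$, and with it the decoder's cross-attention cost.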

Intuition

This is the same sequence-to-sequence objective used in machine translation, but the "source language" is an image and the "target language" is structured text. The model must learn OCR, layout understanding, and information extraction simultaneously from the single training signal of next-token prediction.

Proof Sketch

No formal proof. This is a training objective, not a theorem about guarantees. The empirical result from Kim et al. (2022) is that Donut achieves competitive performance with OCR-dependent models on standard benchmarks (CORD, RVL-CDIP) despite receiving no explicit text supervision.

Why It Matters

Collapsing the entire document understanding pipeline into a single differentiable model eliminates cascading errors. OCR mistakes cannot propagate because there is no OCR. The model also naturally handles visual cues that OCR discards: font weight, color, spatial grouping.

Failure Mode

OCR-free models require large training sets of image-text pairs. On clean, well-structured documents where OCR achieves over 99% character accuracy, OCR-based pipelines still outperform Donut. The OCR-free approach wins on noisy inputs (handwriting, degraded scans) where OCR fails. Resolution matters: if the input image is too small, the encoder cannot resolve individual characters.

Donut Architecture Details

The Donut encoder uses a Swin Transformer pretrained on document images (IIT-CDIP dataset, 11M document images). Pretraining uses a pseudo-OCR task: given an image, predict the text it contains. This teaches the encoder to extract textual information from pixels without an explicit OCR module.

The decoder uses learned prompt tokens to specify the extraction task. Different prompts produce different output formats from the same encoder. For document classification, the output is a single class token. For key information extraction (KIE), the output is a JSON-like string with field names and values.

Nougat for Academic Papers

Nougat processes each page of a PDF independently. The training data consists of PDF page images paired with LaTeX source from arXiv papers. The model learns to reverse-render: given the visual output of LaTeX compilation, recover the source code.

Key challenge: mathematical notation. LaTeX has many ways to express the same formula, so the training must normalize the target representation. Nougat handles equations, tables, figures (as placeholders), and multi-column layouts.

Observed limitation: Nougat sometimes hallucinates repetitive text on pages with unusual layouts. A repetition detection heuristic is used at inference time to catch and truncate these failures.
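A simple version of such a heuristic can be sketched as follows. This is an invented simplification, not Nougat's actual detector: it looks for a short token window repeated back-to-back and truncates the output to a single copy of it.

```python
def truncate_repetition(tokens, window=10, max_repeats=3):
    """Heuristic repetition check (simplified sketch): if some unit of up
    to `window` tokens repeats back-to-back `max_repeats` times at the end
    of the output, keep only one copy of the repeated unit."""
    n = len(tokens)
    for size in range(1, window + 1):
        if n < size * max_repeats:
            continue
        tail = tokens[n - size:]
        # Compare the last max_repeats blocks of length `size` to the tail.
        if all(tokens[n - (k + 1) * size : n - k * size] == tail
               for k in range(max_repeats)):
            return tokens[: n - size * (max_repeats - 1)]
    return tokens

out = "a b c x y x y x y".split()
print(truncate_repetition(out, window=4, max_repeats=3))
```

In a real decoder this check would run incrementally during generation so that a degenerate loop is stopped early instead of filling the output budget.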

When OCR-Free Wins and Loses

On the CORD receipt extraction benchmark, Donut achieves 84.1% F1 without any OCR, compared to 86.3% for LayoutLMv2 with OCR. On handwritten form understanding, OCR-free models close the gap further because OCR accuracy on handwriting is much lower.

The trade-off: OCR-free models are architecturally simpler (one model instead of a pipeline) but currently less accurate on clean, printed documents. They are better suited to scenarios where OCR quality is unreliable or where deployment simplicity matters more than peak accuracy.

Common Confusions

Watch Out

OCR-free does not mean the model ignores text

Donut learns to read text from pixels during pretraining. It performs implicit character recognition inside the vision encoder. The difference is that this recognition is end-to-end differentiable and jointly optimized with downstream tasks, rather than being a separate, fixed preprocessing step.

Watch Out

High resolution is not optional

OCR-free models need high input resolution to distinguish individual characters. A $224 \times 224$ image (standard for ImageNet classification) is far too small. Donut uses $2560 \times 1920$. Reducing resolution degrades performance sharply because the model literally cannot see the text.

Exercises

ExerciseCore

Problem

A Donut encoder uses a Swin Transformer with patch size $4 \times 4$ on a $2560 \times 1920$ input image. After 4 stages of downsampling by factor 2 each, how many visual tokens does the encoder produce?

ExerciseAdvanced

Problem

Explain why OCR-free models are more robust to document degradation (stains, folds, faded ink) than OCR-based pipelines, even when both use the same vision encoder capacity.

References

Canonical:

  • Kim et al., OCR-free Document Understanding Transformer (Donut) (2022), ECCV, Sections 3-4
  • Blecher et al., Nougat: Neural Optical Understanding for Academic Documents (2023), arXiv:2308.13418

Current:

  • Davis et al., End-to-End Document Recognition and Understanding: A Survey (2023), IJDAR
  • Xu et al., LayoutLMv3 (2022), for comparison with OCR-dependent approaches

Last reviewed: April 2026
