
LLM Construction

Document Intelligence

Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.

Advanced · Tier 2 · Frontier · ~50 min


Why This Matters

Most enterprise data lives in documents: PDFs, scanned invoices, legal contracts, financial reports, medical records. These documents have rich spatial structure (headers, tables, columns, figures) that plain text extraction destroys. Document intelligence recovers this structure and extracts typed, queryable information from unstructured documents.

OCR converts images to text. Document intelligence goes further: it understands what the text means based on where it appears, how it is formatted, and what surrounds it. A number in a table cell means something different from the same number in a page header.

Mental Model

A document page is a 2D spatial arrangement of text, images, and graphical elements. Understanding a document requires three types of information:

  1. Text content: what words appear on the page
  2. Spatial layout: where each word is positioned (bounding boxes)
  3. Visual appearance: font size, boldness, color, surrounding lines and borders

Document intelligence models fuse all three modalities to produce a unified representation of the document.

Formal Setup and Notation

Definition

Document Representation

A document page is represented as a set of tokens $\{(w_i, b_i, v_i)\}_{i=1}^N$ where:

  • $w_i$ is the text token (from OCR or digital extraction)
  • $b_i = (x_0, y_0, x_1, y_1) \in [0, 1]^4$ is the normalized bounding box
  • $v_i$ is a visual feature vector from the image patch containing token $i$

A document model learns $f_\theta: \{(w_i, b_i, v_i)\} \to \mathbb{R}^d$, mapping each token to a contextualized representation that incorporates text, position, and visual context.
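A minimal sketch of this token representation, with a helper that normalizes pixel-space boxes into the $[0, 1]$ range the definition assumes (all names here are hypothetical, not from any particular library):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DocToken:
    """One document token: text, normalized bounding box, visual feature."""
    word: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in [0, 1]
    visual: List[float]                      # feature from the image patch

def normalize_bbox(bbox_px, page_w, page_h):
    """Map pixel coordinates to the [0, 1] range used by the model."""
    x0, y0, x1, y1 = bbox_px
    return (x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h)

# Example: the word "Invoice" at pixels (100, 50)-(200, 70) on a 1000x800 page.
token = DocToken("Invoice", normalize_bbox((100, 50, 200, 70), 1000, 800), [0.0] * 16)
```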

Definition

Key Information Extraction (KIE)

Given a document with tokens $\{(w_i, b_i)\}$ and a predefined schema of field types $\{f_1, \ldots, f_K\}$ (e.g., invoice number, date, total amount), KIE assigns each token to a field or to "none":

$$\hat{y}_i = \arg\max_{k \in \{0, 1, \ldots, K\}} p_\theta(y = k \mid w_i, b_i, \text{context})$$

This is sequence labeling with spatial context. Standard BIO tagging applies, but the "sequence" is a 2D spatial arrangement, not a 1D text stream. The loss function for this classification is typically cross-entropy.
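The classification step above can be sketched as per-token argmax over field logits, trained with the cross-entropy loss mentioned. This is a toy NumPy version with a hypothetical four-class schema, not a real model:

```python
import numpy as np

FIELDS = ["none", "invoice_number", "date", "total"]  # class 0 = "none"

def predict_fields(logits: np.ndarray) -> list:
    """Per-token argmax over field classes; logits has shape (N, K+1)."""
    return [FIELDS[k] for k in logits.argmax(axis=1)]

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over tokens: the standard KIE training loss."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

logits = np.array([[4.0, 0.1, 0.1, 0.1],   # confidently "none"
                   [0.1, 0.2, 0.1, 5.0]])  # confidently "total"
print(predict_fields(logits))  # ['none', 'total']
```

In a real system the logits come from a layout-aware encoder, and labels use BIO tags per field rather than one tag per field type.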

Core Definitions

The layout analysis step segments a document page into regions: text blocks, tables, figures, headers, footers. Each region is classified by type. This is an object detection problem applied to document images.

Table structure recognition identifies rows, columns, and cells within a detected table region. This is harder than it appears: tables can have merged cells, implicit borders (no visible lines), and nested headers.

Reading order determines the sequence in which text regions should be read. For single-column documents this is trivial (top to bottom). For multi-column layouts, magazine-style pages, or documents with sidebars, reading order detection is a nontrivial graph problem.
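A toy XY-style heuristic makes the multi-column case concrete: group blocks into columns by horizontal gaps, then read each column top to bottom, leftmost column first. Real systems handle far messier layouts; the threshold and block format here are assumptions for illustration:

```python
def reading_order(blocks, column_gap=0.05):
    """Toy reading order for column layouts. Each block is
    (text, (x0, y0, x1, y1)) with normalized coordinates."""
    # Sort by left edge, then split into columns wherever a large x gap appears.
    by_x = sorted(blocks, key=lambda b: b[1][0])
    columns, current = [], [by_x[0]]
    for blk in by_x[1:]:
        if blk[1][0] - max(b[1][2] for b in current) > column_gap:
            columns.append(current)
            current = [blk]
        else:
            current.append(blk)
    columns.append(current)
    # Within each column, read top to bottom.
    ordered = []
    for col in columns:
        ordered += sorted(col, key=lambda b: b[1][1])
    return [b[0] for b in ordered]

page = [("B", (0.55, 0.1, 0.95, 0.3)), ("A", (0.05, 0.1, 0.45, 0.3)),
        ("C", (0.05, 0.4, 0.45, 0.6)), ("D", (0.55, 0.4, 0.95, 0.6))]
print(reading_order(page))  # ['A', 'C', 'B', 'D'] for a two-column page
```

Naive top-to-bottom sorting would interleave the columns as A, B, C, D, which is exactly the failure mode this heuristic avoids.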

Main Theorems

Proposition

Layout-Aware Masked Language Modeling

Statement

LayoutLM-style models extend masked language modeling (MLM) to incorporate spatial position, building on the transformer architecture. The pretraining objective is:

$$\mathcal{L} = \mathbb{E}\left[-\sum_{i \in \mathcal{M}} \log p_\theta\left(w_i \mid w_{\setminus \mathcal{M}}, b_1, \ldots, b_N\right)\right]$$

where $\mathcal{M}$ is the set of masked token positions. The model must predict masked tokens using both textual context (surrounding words) and spatial context (bounding box positions of all tokens).

The spatial embedding maps each coordinate to a learned embedding: $e_{\text{layout}} = E_x(x_0) + E_y(y_0) + E_x(x_1) + E_y(y_1) + E_w(x_1 - x_0) + E_h(y_1 - y_0)$, where $E_x, E_y, E_w, E_h$ are learned embedding tables over discretized coordinate bins.
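The lookup-and-sum can be sketched directly, with random arrays standing in for the learned tables (bin count and dimension are illustrative, roughly matching the exercise at the end):

```python
import numpy as np

BINS, DIM = 1000, 768
rng = np.random.default_rng(0)
# One table per coordinate type; in a trained model these are learned.
E_x = rng.normal(size=(BINS, DIM))
E_y = rng.normal(size=(BINS, DIM))
E_w = rng.normal(size=(BINS, DIM))
E_h = rng.normal(size=(BINS, DIM))

def layout_embedding(bbox):
    """Sum of coordinate embeddings, following the formula above.
    bbox = (x0, y0, x1, y1) in [0, 1]; coordinates are discretized into bins."""
    x0, y0, x1, y1 = (min(int(c * BINS), BINS - 1) for c in bbox)
    w = min(int((bbox[2] - bbox[0]) * BINS), BINS - 1)
    h = min(int((bbox[3] - bbox[1]) * BINS), BINS - 1)
    return E_x[x0] + E_y[y0] + E_x[x1] + E_y[y1] + E_w[w] + E_h[h]

emb = layout_embedding((0.1, 0.05, 0.3, 0.08))
print(emb.shape)  # (768,)
```

This vector is added to the token's text embedding before the transformer layers, so position participates in attention from the first layer onward.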

Intuition

By conditioning on spatial position during pretraining, the model learns that tokens at the top of a page are likely headers, tokens aligned in columns are likely table entries, and tokens in the upper-right corner of an invoice are likely dates or invoice numbers. This spatial prior transfers to downstream extraction tasks.

Proof Sketch

No formal proof. This is an architectural design choice motivated by the observation that document understanding requires spatial reasoning. The empirical validation is that LayoutLM models outperform text-only models on document understanding benchmarks by 5-15% F1 on standard KIE tasks.

Why It Matters

This pretraining objective is what separates document AI from standard NLP. By encoding position as a first-class input, the model can distinguish between identical text in different spatial contexts. The word "Total" at the bottom of a column means something different from "Total" in a paragraph heading.

Failure Mode

The spatial embedding assumes a fixed page coordinate system. Documents with unusual layouts (foldouts, rotated pages, free-form designs) break the spatial prior. The model also assumes OCR bounding boxes are accurate; noisy OCR with incorrect positions degrades performance significantly.

Key Approaches

LayoutLM Family

LayoutLM (v1, v2, v3) progressively integrates more modalities:

  • LayoutLM v1: Text embeddings + 2D position embeddings, pretrained with spatial MLM
  • LayoutLM v2: Adds visual tokens from document image via CNN backbone
  • LayoutLM v3: Unified multimodal pretraining with text, layout, and image objectives

Each version improves on KIE, document classification, and document VQA benchmarks. The v3 model uses a shared transformer for all three modalities rather than separate encoders.

Table Extraction Pipeline

A practical table extraction system operates in stages:

  1. Table detection: Locate tables in the page (object detection)
  2. Structure recognition: Identify rows, columns, merged cells
  3. Cell content extraction: OCR or text extraction within each cell
  4. Semantic typing: Determine header rows, data types, relationships

Each stage introduces errors that propagate. End-to-end models that jointly detect and parse tables (e.g., TableFormer) reduce this cascading error.
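The glue between stages 2 and 3 can be made concrete: once structure recognition has produced row and column boundaries, each token is assigned to a grid cell by its center point. This is a simplified sketch that ignores merged cells, and the boundary format is an assumption:

```python
import bisect

def assign_to_cells(tokens, row_edges, col_edges):
    """Map each token's center to a (row, col) cell, given normalized
    row/column boundary positions. tokens = [(text, (x0, y0, x1, y1)), ...]."""
    grid = {}
    for text, (x0, y0, x1, y1) in tokens:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        row = bisect.bisect_right(row_edges, cy) - 1
        col = bisect.bisect_right(col_edges, cx) - 1
        grid.setdefault((row, col), []).append(text)
    return {k: " ".join(v) for k, v in grid.items()}

tokens = [("Qty", (0.1, 0.05, 0.2, 0.1)), ("3", (0.12, 0.25, 0.15, 0.3)),
          ("Price", (0.5, 0.05, 0.6, 0.1)), ("9.99", (0.5, 0.25, 0.58, 0.3))]
cells = assign_to_cells(tokens, row_edges=[0.0, 0.2], col_edges=[0.0, 0.4])
print(cells[(0, 0)], cells[(1, 1)])  # Qty 9.99
```

Errors in the upstream boundary estimates shift every downstream cell assignment, which is the cascading-error problem end-to-end models aim to avoid.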

LLM-Based Document Understanding

Recent systems bypass the traditional pipeline entirely. A vision-language model (GPT-4V, Gemini) receives the document image directly and answers questions about it. This is simpler to deploy but less controllable: you cannot easily guarantee structured output or audit which parts of the document informed the answer.

Canonical Examples

Example

Invoice processing

An invoice has a predictable schema: vendor name, invoice number, date, line items (description, quantity, unit price, amount), subtotal, tax, total. A KIE model trained on labeled invoices can extract these fields with 90-95% F1. The spatial structure (line items in a table, total at the bottom) provides strong signal beyond the text alone.
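Because the schema is typed, extracted fields can be validated with cheap arithmetic constraints, which catch many field-level OCR and extraction errors before they reach downstream systems. A minimal sketch with a hypothetical schema:

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    """Typed KIE output for a simplified invoice schema (names hypothetical)."""
    invoice_number: str
    subtotal: float
    tax: float
    total: float

def is_consistent(inv: Invoice, tol: float = 0.01) -> bool:
    """Post-extraction check: the extracted amounts must add up."""
    return abs(inv.subtotal + inv.tax - inv.total) <= tol

good = Invoice("INV-001", subtotal=100.00, tax=8.00, total=108.00)
bad = Invoice("INV-002", subtotal=100.00, tax=8.00, total=180.00)  # digit swap
print(is_consistent(good), is_consistent(bad))  # True False
```

Failed checks are typically routed to human review rather than silently corrected.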

Common Confusions

Watch Out

OCR accuracy is not the bottleneck

Modern OCR (from Google, AWS, Azure) achieves over 99% character accuracy on clean documents. The bottleneck is understanding structure: which text belongs to which table cell, which lines form a logical paragraph, what the reading order is. Document intelligence is about layout and semantics, not character recognition.

Watch Out

Digital PDFs are not easier than scanned documents

Digital PDFs contain embedded text, so OCR is unnecessary. But extracting structure from digital PDFs is still hard: text is stored as positioned characters with no semantic markup, tables have no explicit structure (just positioned strings), and multi-column layouts require non-trivial reading order inference. The PDF format was designed for display, not for information extraction.

Watch Out

LLMs do not replace document AI pipelines (yet)

Vision-language models can answer questions about documents, but they struggle with precise extraction tasks: exact dollar amounts, complete table parsing, consistent field extraction across thousands of documents. For enterprise-scale processing where accuracy and consistency matter, specialized document AI models still outperform general-purpose LLMs on structured extraction.

Exercises

ExerciseAdvanced

Problem

A LayoutLM model uses discretized bounding box coordinates with 1000 bins per axis. How many total position embedding parameters does the model learn, assuming separate embedding tables for $x_0$, $y_0$, $x_1$, $y_1$, width, and height, each of dimension $d = 768$?

ExerciseCore

Problem

Why does a text-only model (ignoring layout) fail on key information extraction from invoices? Give a concrete example where two tokens have identical text context but different meanings due to spatial position.

References

Canonical:

  • Xu et al., LayoutLM: Pre-training of Text and Layout for Document AI (2020), KDD
  • Xu et al., LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (2022), ACM MM

Current:

  • Huang et al., A Comprehensive Survey on Document AI (2023), arXiv:2305.08098
  • Smock et al., PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents (2022), CVPR

Last reviewed: April 2026
