

PaddleOCR and Practical OCR

A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.


Why This Matters

OCR is the entry point for most document intelligence pipelines. Before you can extract structured information from a document, you need to convert pixel regions into text strings. The choice of OCR engine determines the quality ceiling for everything downstream.

Modern OCR is not a single model. It is a pipeline of specialized components: detect text regions, classify text direction, recognize characters. Each component has its own architecture and failure modes. Knowing these components lets you diagnose and fix extraction failures.

Mental Model

OCR operates in three stages:

  1. Text detection: find rectangular regions containing text in the image
  2. Direction classification: determine if each text region is rotated or mirrored
  3. Text recognition: convert each cropped text region into a character string

Each stage is a separate model. Detection is an object detection problem. Classification is a simple image classifier. Recognition is a sequence prediction problem (image to string).

Formal Setup and Notation

Definition

Text Detection

Given a document image $I \in \mathbb{R}^{H \times W \times 3}$, text detection produces a set of bounding regions $\{R_1, \ldots, R_K\}$ where each $R_k$ is a polygon (typically a quadrilateral) enclosing a text line or word. The Differentiable Binarization (DB) method used in PaddleOCR predicts a probability map $P \in [0,1]^{H \times W}$ and a threshold map $T \in [0,1]^{H \times W}$:

$$B_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}}$$

where $k$ controls binarization sharpness (typically $k = 50$). Text regions are extracted as connected components from the binarized map $B$.
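The binarization step above is easy to reproduce directly from the formula. A minimal sketch (pure Python over small nested lists; PaddleOCR's actual implementation operates on full-resolution tensors on GPU):

```python
import math

def db_binarize(P, T, k=50.0):
    """Approximate (differentiable) binarization: sigmoid of k * (P - T).

    P, T: 2D lists of equal shape with values in [0, 1]
          (predicted probability map and learned threshold map).
    Returns a 2D list in (0, 1); values saturate toward 1 inside text regions.
    """
    return [[1.0 / (1.0 + math.exp(-k * (p - t)))
             for p, t in zip(prow, trow)]
            for prow, trow in zip(P, T)]

# A pixel well above its threshold saturates toward 1; well below, toward 0.
P = [[0.9, 0.2],
     [0.5, 0.7]]
T = [[0.3, 0.3],
     [0.5, 0.9]]
B = db_binarize(P, T)
```

The steep slope ($k = 50$) is what makes the soft sigmoid behave like a hard threshold at inference while remaining differentiable during training.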

Definition

CTC-Based Text Recognition

Given a cropped text image resized to fixed height $h$ and variable width $w$, a recognition model produces a probability distribution over characters at each horizontal position. Let $\mathcal{A}$ be the character alphabet augmented with a blank token $\epsilon$. The model outputs $\pi \in \Delta(\mathcal{A} \cup \{\epsilon\})^T$ where $T = w/s$ for stride $s$.

The CTC decoding collapses repeated characters and removes blanks:

$$\text{CTC-decode}(\pi) = \text{collapse}\big((\arg\max_{a \in \mathcal{A} \cup \{\epsilon\}} \pi_t(a))_{t=1}^{T}\big)$$

For example, the raw output "hh-ee-ll-ll-oo" (where - is blank) decodes to "hello".
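Greedy CTC decoding is short enough to write out in full. A minimal sketch (pure Python; note the order of operations — repeated symbols are collapsed before blanks are removed, which is exactly what lets a blank separate genuine double letters like the "ll" in "hello"):

```python
def ctc_greedy_decode(frames, blank="-"):
    """Collapse a per-timestep best-path sequence into a string.

    frames: the argmax character at each timestep, e.g. from a CRNN.
    """
    out = []
    prev = None
    for ch in frames:
        if ch != prev:       # 1. collapse runs of identical symbols
            if ch != blank:  # 2. then drop blanks
                out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode(list("hh-ee-ll-ll-oo")))  # hello
```

Reversing the two steps would be a bug: removing blanks first would merge the two "l" runs into a single "l".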

Core Definitions

PaddleOCR is an open-source multilingual OCR toolkit from Baidu, supporting 80+ languages. Its default pipeline uses:

  • DB (Differentiable Binarization) for text detection: a segmentation network that predicts text/non-text probability per pixel, then extracts bounding polygons from connected components.
  • SVTR (Scene Text Recognition with a Single Visual Model) or CRNN for text recognition: encodes the cropped text image and decodes character sequences.
  • A lightweight classifier for text direction (0 or 180 degrees).

TrOCR (Microsoft) uses a Vision Transformer encoder and a text transformer decoder. The encoder is initialized from a pretrained ViT (DeiT or BEiT). The decoder is initialized from a pretrained language model (RoBERTa or GPT-2). This exploits large-scale vision and language pretraining for OCR.

EasyOCR is a Python library supporting 80+ languages, using CRAFT for detection and a CRNN for recognition. Simpler to deploy than PaddleOCR, but generally less accurate on complex layouts.

Tesseract (Google, originally HP Labs) is the classical open-source OCR engine. Version 4+ uses an LSTM-based recognizer. It works well on clean, single-column, printed text. It struggles with complex layouts, mixed languages, and unusual fonts.

Main Theorems

Proposition

CTC Loss for Sequence Recognition

Statement

The CTC (Connectionist Temporal Classification) loss marginalizes over all valid alignments between the feature sequence and the target string. For target string $\mathbf{y} = (y_1, \ldots, y_U)$, define the set of valid CTC paths:

$$\mathcal{B}^{-1}(\mathbf{y}) = \{\boldsymbol{\pi} \in (\mathcal{A} \cup \{\epsilon\})^T : \text{collapse}(\boldsymbol{\pi}) = \mathbf{y}\}$$

The CTC loss is:

$$\mathcal{L}_{\text{CTC}} = -\log \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})$$

This sum is computed efficiently in $O(T \cdot U)$ time using a forward-backward algorithm analogous to HMM inference.

Intuition

CTC solves the alignment problem: we know the image says "hello" but we do not know which pixel columns correspond to which characters. CTC considers all possible alignments (including inserting blanks between characters) and maximizes their total probability. The model learns to produce spiky probability peaks at character positions and blanks elsewhere.

Proof Sketch

The forward-backward algorithm defines the forward variable $\alpha_t(s)$ as the total probability of all paths consistent with the first $s$ symbols of the blank-extended target by time $t$. The recurrence handles three cases: staying on the same symbol, advancing to the next symbol, or emitting a blank. With the extended sequence of length $2U+1$ (blanks interleaved between characters), the total CTC probability is $\alpha_T(2U+1) + \alpha_T(2U)$, since a valid path may end on either the trailing blank or the final character. This runs in $O(T \cdot U)$ time, making it tractable for gradient-based training.
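The forward recurrence can be implemented directly. A minimal sketch of the CTC forward pass (pure Python, working in raw probabilities rather than log space, so it is only numerically safe for short sequences):

```python
def ctc_forward(probs, target, blank=0):
    """CTC probability p(target | x) via the forward algorithm.

    probs:  list of T rows, probs[t][c] = p(character c at timestep t)
    target: list of U character ids (no blanks)
    Runs in O(T * U) over the blank-interleaved extended sequence.
    """
    # Extended label sequence: blank, y1, blank, y2, ..., yU, blank
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)                        # S = 2U + 1

    T = len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]       # start on the leading blank
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]  # or on the first character

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on same symbol
            if s >= 1:
                a += alpha[t - 1][s - 1]             # advance one symbol
            # Skipping the in-between blank is allowed only between
            # two *different* characters.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    # A valid path ends on the final character or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

The loss in the proposition is then $-\log$ of this quantity; production implementations work in log space and vectorize over the batch.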

Why It Matters

CTC is the standard training objective for CRNN-based text recognizers in PaddleOCR, EasyOCR, and Tesseract 4+. Without CTC, you would need character-level bounding box annotations for training, which are extremely expensive to obtain. CTC only requires the image and the target string.

Failure Mode

CTC assumes a monotonic alignment: characters appear left-to-right in the image. This fails for curved text, circular text, or bidirectional scripts where the visual order differs from the reading order. CTC also struggles when $T$ is much larger than $U$ (very wide images with short text), as the model must learn to emit many consecutive blanks. Attention-based decoders (as in TrOCR) handle these cases better.

PaddleOCR Pipeline in Detail

PaddleOCR 3.0 extends beyond basic OCR with additional modules:

  • Document parsing: segmenting a page into text blocks, tables, figures, headers, and footers (similar to layout analysis in LayoutLM systems)
  • Table recognition: detecting table structure and extracting cell content
  • Key information extraction: extracting named fields from structured documents like receipts and invoices
  • Seal text recognition: reading text arranged in circular seals (common in Chinese business documents)

The pipeline is modular: you can replace individual components without retraining the entire system. For example, swapping DB for a more accurate detector while keeping the same recognizer.

TrOCR: Transformer-Native OCR

TrOCR replaces both the CNN feature extractor and the CTC/attention decoder with transformers. The encoder (ViT) splits the text line image into $16 \times 16$ patches and processes them with self-attention. The decoder generates characters autoregressively with cross-attention to encoder outputs.

Advantages over CRNN+CTC: TrOCR handles variable-length outputs naturally (no CTC alignment needed), benefits from pretrained vision and language models, and achieves state-of-the-art accuracy on standard benchmarks (IAM handwriting: 3.4% CER, SROIE: 95.2% F1).

Disadvantage: TrOCR is slower than CRNN at inference due to autoregressive decoding. For high-throughput production pipelines processing millions of documents, CRNN+CTC is often preferred.

When to Use Which

| Scenario | Recommended Tool | Reason |
|---|---|---|
| Production pipeline, multilingual | PaddleOCR | Best accuracy-speed trade-off, 80+ languages |
| Research, pushing accuracy | TrOCR | State-of-the-art, pretrained initialization |
| Quick prototype, Python | EasyOCR | Simple API, pip install, reasonable accuracy |
| Clean printed English text | Tesseract | Lightweight, no GPU needed, good enough |
| Handwritten text | TrOCR or PaddleOCR v4 | CRNN-based models struggle with handwriting |

Common Confusions

Watch Out

OCR accuracy metrics can be misleading

Character Error Rate (CER) and Word Error Rate (WER) are computed on cropped, pre-detected text regions. They do not account for detection errors (missed text, false detections). A recognizer with 1% CER (99% character accuracy) paired with only 80% text detection recall still misses 20% of the text entirely. Always evaluate the full pipeline, not just the recognizer.
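CER itself is just normalized edit distance. A minimal sketch (pure Python; real evaluation protocols typically add Unicode normalization and whitespace rules on top):

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(predicted, reference):
    """Character error rate: edit distance / reference length."""
    return levenshtein(predicted, reference) / len(reference)

print(cer("Inv0ice", "Invoice"))  # 1 substitution / 7 chars
```

Note that CER can exceed 1.0 when the prediction is much longer than the reference — another reason headline accuracy numbers need a stated evaluation protocol.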

Watch Out

Language support does not mean equal accuracy

PaddleOCR supports 80+ languages, but accuracy varies dramatically. Chinese, English, and Japanese have millions of training samples. Low-resource languages (Tibetan, Khmer) have far fewer and correspondingly lower accuracy. Check benchmark numbers for your specific language before committing to a toolkit.

Watch Out

GPU is not always necessary for OCR

PaddleOCR and Tesseract run on CPU with acceptable latency for single-document processing (under 1 second per page). GPU acceleration matters for batch processing thousands of pages. Do not over-engineer the infrastructure for a prototype that processes 10 documents per day.

Exercises

ExerciseCore

Problem

A CRNN text recognizer processes a cropped text image of width 320 pixels with a CNN that downsamples width by a factor of 4, producing a feature sequence of length $T = 80$. The target text is "Invoice" (7 characters). Explain why CTC can handle this mismatch between $T = 80$ and $U = 7$.

ExerciseAdvanced

Problem

TrOCR uses autoregressive decoding while CRNN uses CTC (parallel decoding). Analyze the inference time complexity of each approach for a text line of $U$ characters, given encoder output of length $T$.

References

Canonical:

  • Liao et al., Real-Time Scene Text Detection with Differentiable Binarization (DB) (2020), AAAI
  • Graves et al., Connectionist Temporal Classification (2006), ICML, Sections 3-4
  • Li et al., TrOCR: Transformer-based Optical Character Recognition (2023), AAAI

Current:

  • PaddleOCR documentation and benchmarks, github.com/PaddlePaddle/PaddleOCR
  • Du et al., SVTR: Scene Text Recognition with a Single Visual Model (2022), IJCAI

Last reviewed: April 2026
