PaddleOCR and Practical OCR
A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.
Why This Matters
OCR is the entry point for most document intelligence pipelines. Before you can extract structured information from a document, you need to convert pixel regions into text strings. The choice of OCR engine determines the quality ceiling for everything downstream.
Modern OCR is not a single model. It is a pipeline of specialized components: detect text regions, classify text direction, recognize characters. Each component has its own architecture and failure modes. Knowing these components lets you diagnose and fix extraction failures.
Mental Model
OCR operates in three stages:
- Text detection: find rectangular regions containing text in the image
- Direction classification: determine if each text region is rotated or mirrored
- Text recognition: convert each cropped text region into a character string
Each stage is a separate model. Detection is an object detection problem. Classification is a simple image classifier. Recognition is a sequence prediction problem (image to string).
Formal Setup and Notation
Text Detection
Given a document image $I \in \mathbb{R}^{H \times W \times 3}$, text detection produces a set of bounding regions $\{B_1, \ldots, B_n\}$, where each $B_i$ is a polygon (typically a quadrilateral) enclosing a text line or word. The Differentiable Binarization (DB) method used in PaddleOCR predicts a probability map $P \in [0,1]^{H \times W}$ and a threshold map $T \in [0,1]^{H \times W}$:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$

where $k$ controls binarization sharpness (typically $k = 50$). Text regions are extracted as connected components from the binarized map $\hat{B}$.
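The binarization step above is just an element-wise sigmoid of the gap between the probability map and the learned threshold map. A minimal sketch in pure Python (the function name and toy maps are illustrative, not from PaddleOCR):

```python
import math

def db_binarize(prob_map, thresh_map, k=50.0):
    """Approximate binarization from DB: B = sigmoid(k * (P - T)).
    prob_map and thresh_map are 2-D lists of floats in [0, 1]."""
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t)))
         for p, t in zip(prow, trow)]
        for prow, trow in zip(prob_map, thresh_map)
    ]

# A pixel well above its threshold saturates toward 1 (text);
# one below saturates toward 0 (background).
P = [[0.9, 0.1]]
T = [[0.3, 0.3]]
B = db_binarize(P, T)
```

The large $k$ makes the sigmoid nearly a step function, so the map is effectively binary while remaining differentiable for training.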
CTC-Based Text Recognition
Given a cropped text image resized to fixed height $h$ and variable width $W$, a recognition model produces a probability distribution over characters at each horizontal position. Let $\mathcal{A}$ be the character alphabet and $\mathcal{A}' = \mathcal{A} \cup \{\epsilon\}$ its augmentation with a blank token $\epsilon$. The model outputs distributions $p_1, \ldots, p_T$ over $\mathcal{A}'$, where $T = W/s$ for stride $s$.

CTC decoding applies a collapse function $\mathcal{B}$ that first merges consecutive repeated characters and then removes blanks; greedy decoding takes $\hat{y} = \mathcal{B}(\pi^*)$ with $\pi^*_t = \arg\max_{c \in \mathcal{A}'} p_t(c)$.

For example, the raw output "hh-ee-ll-ll-oo" (where - is blank) decodes to "hello": the blank between the two l-runs is what keeps them from collapsing into a single l.
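The collapse rule is a few lines of code. A sketch (the `BLANK` string stand-in is illustrative; real models use a reserved vocabulary index):

```python
BLANK = "-"  # stand-in for the blank token

def ctc_collapse(raw):
    """CTC decoding: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in raw:
        if ch != prev and ch != BLANK:  # new, non-blank symbol
            out.append(ch)
        prev = ch
    return "".join(out)

ctc_collapse("hh-ee-ll-ll-oo")  # → "hello"
ctc_collapse("heelloo")         # → "helo": without blanks, repeats merge
```

The second call shows why the blank token exists at all: doubled letters can only survive decoding if a blank separates the two runs.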
Core Definitions
PaddleOCR is an open-source multilingual OCR toolkit from Baidu, supporting 80+ languages. Its default pipeline uses:
- DB (Differentiable Binarization) for text detection: a segmentation network that predicts text/non-text probability per pixel, then extracts bounding polygons from connected components.
- SVTR (Scene Text Recognition with a Single Visual Model) or CRNN for text recognition: encodes the cropped text image and decodes character sequences.
- A lightweight classifier for text direction (0 or 180 degrees).
TrOCR (Microsoft) uses a Vision Transformer encoder and a text transformer decoder. The encoder is initialized from a pretrained ViT (DeiT or BEiT). The decoder is initialized from a pretrained language model (RoBERTa or GPT-2). This exploits large-scale vision and language pretraining for OCR.
EasyOCR is a Python library supporting 80+ languages, using CRAFT for detection and a CRNN for recognition. Simpler to deploy than PaddleOCR, but generally less accurate on complex layouts.
Tesseract (Google, originally HP Labs) is the classical open-source OCR engine. Version 4+ uses an LSTM-based recognizer. It works well on clean, single-column, printed text. It struggles with complex layouts, mixed languages, and unusual fonts.
Main Theorems
CTC Loss for Sequence Recognition
Statement
The CTC (Connectionist Temporal Classification) loss marginalizes over all valid alignments between the feature sequence and the target string. For a target string $y = (y_1, \ldots, y_U)$, define the set of valid CTC paths:

$$\mathcal{B}^{-1}(y) = \{\pi \in (\mathcal{A}')^T : \mathcal{B}(\pi) = y\}$$

The CTC loss is:

$$\mathcal{L}_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t)$$

This sum is computed efficiently in $O(TU)$ time using a forward-backward algorithm analogous to HMM inference.
Intuition
CTC solves the alignment problem: we know the image says "hello" but we do not know which pixel columns correspond to which characters. CTC considers all possible alignments (including inserting blanks between characters) and maximizes their total probability. The model learns to produce spiky probability peaks at character positions and blanks elsewhere.
Proof Sketch
The forward-backward algorithm defines a forward variable $\alpha_t(s)$ as the total probability of all paths that produce the first $s$ symbols of the extended target (the target with blanks interleaved, length $2U + 1$) by time $t$. The recurrence handles three cases: staying on the same symbol, advancing to the next symbol, or emitting a blank. The total CTC probability is $\alpha_T(2U{+}1) + \alpha_T(2U)$, since a valid path may end on either the final character or the trailing blank. This runs in $O(TU)$ time, making it tractable for gradient-based training.
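The forward recursion can be sketched directly in log space. This is a minimal illustrative implementation (function name and conventions are mine; indexing follows the extended-label construction described above, and it assumes the target fits in the available time steps):

```python
import math

def ctc_forward_logprob(log_probs, target, blank=0):
    """Forward pass of CTC: log of the total probability of all paths
    that collapse to `target`. log_probs[t][c] = log p(char c at step t).
    Uses the extended label sequence with blanks interleaved."""
    ext = [blank]
    for c in target:
        ext += [c, blank]              # extended length 2U + 1
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]    # start on leading blank
    alpha[1] = log_probs[0][ext[1]]    # or on the first character
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            terms = [alpha[s]]                       # stay
            if s > 0:
                terms.append(alpha[s - 1])           # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])           # skip a blank
            best = max(terms)
            if best > NEG:
                new[s] = (best
                          + math.log(sum(math.exp(x - best) for x in terms))
                          + log_probs[t][ext[s]])
        alpha = new
    # Valid paths end on the last character or the trailing blank.
    ends = [x for x in (alpha[-1], alpha[-2]) if x > NEG]
    best = max(ends)
    return best + math.log(sum(math.exp(x - best) for x in ends))
```

For a two-step sequence with uniform probability 0.5 over {blank, 'a'} and target "a", the three valid paths (a,a), (a,-), (-,a) each have probability 0.25, so the total is 0.75, which the recursion reproduces.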
Why It Matters
CTC is the standard training objective for CRNN-based text recognizers in PaddleOCR, EasyOCR, and Tesseract 4+. Without CTC, you would need character-level bounding box annotations for training, which are extremely expensive to obtain. CTC only requires the image and the target string.
Failure Mode
CTC assumes a monotonic alignment: characters appear left-to-right in the image. This fails for curved text, circular text, or bidirectional scripts where the visual order differs from the reading order. CTC also struggles when $T$ is much larger than $U$ (very wide images with short text), as the model must learn to emit many consecutive blanks. Attention-based decoders (as in TrOCR) handle these cases better.
PaddleOCR Pipeline in Detail
PaddleOCR 3.0 extends beyond basic OCR with additional modules:
- Document parsing: segmenting a page into text blocks, tables, figures, headers, and footers (similar to layout analysis in LayoutLM systems)
- Table recognition: detecting table structure and extracting cell content
- Key information extraction: extracting named fields from structured documents like receipts and invoices
- Seal text recognition: reading text arranged in circular seals (common in Chinese business documents)
The pipeline is modular: you can replace individual components without retraining the entire system. For example, swapping DB for a more accurate detector while keeping the same recognizer.
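That modularity amounts to programming against component interfaces rather than concrete models. A minimal sketch of the idea (the stub components and names are illustrative, not PaddleOCR's API):

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]           # x0, y0, x1, y1
Detector = Callable[[bytes], List[Box]]   # image -> text boxes
Recognizer = Callable[[bytes, Box], str]  # image + box -> string

def run_ocr(image: bytes, detect: Detector, recognize: Recognizer) -> List[str]:
    """Two-stage pipeline: any detector can be paired with any
    recognizer, mirroring how PaddleOCR components can be swapped."""
    return [recognize(image, box) for box in detect(image)]

# Stubs standing in for real detection/recognition models:
def stub_detector(image: bytes) -> List[Box]:
    return [(0, 0, 100, 20), (0, 30, 100, 50)]

def stub_recognizer(image: bytes, box: Box) -> str:
    return f"line@y={box[1]}"

run_ocr(b"...", stub_detector, stub_recognizer)
# → ["line@y=0", "line@y=30"]
```

Swapping in a better detector is then a one-argument change, with no retraining of the recognizer.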
TrOCR: Transformer-Native OCR
TrOCR replaces both the CNN feature extractor and the CTC/attention decoder with transformers. The encoder (ViT) splits the text line image into patches and processes them with self-attention. The decoder generates characters autoregressively with cross-attention to encoder outputs.
Advantages over CRNN+CTC: TrOCR handles variable-length outputs naturally (no CTC alignment needed), benefits from pretrained vision and language models, and achieves state-of-the-art accuracy on standard benchmarks (IAM handwriting: 3.4% CER, SROIE: 95.2% F1).
Disadvantage: TrOCR is slower than CRNN at inference due to autoregressive decoding. For high-throughput production pipelines processing millions of documents, CRNN+CTC is often preferred.
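A rough way to see the speed gap: a CTC head scores all positions in one parallel pass, while an autoregressive decoder runs one step per output token, each cross-attending to all encoder outputs. The cost model below is my own simplification (self-attention and constant factors ignored), meant only to show how the work scales:

```python
def decode_cost(method: str, T: int, U: int) -> int:
    """Rough count of attention-cell computations at inference.
    'ctc': one parallel pass over T encoder positions.
    'autoregressive': U sequential steps, each attending to T positions."""
    if method == "ctc":
        return T
    if method == "autoregressive":
        return U * T
    raise ValueError(f"unknown method: {method}")

# A 7-character word over 80 feature columns:
decode_cost("ctc", T=80, U=7)             # → 80
decode_cost("autoregressive", T=80, U=7)  # → 560
```

The sequential dependency matters more than the raw count: the $U$ autoregressive steps cannot be parallelized across the output length, which is what hurts throughput at scale.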
When to Use Which
| Scenario | Recommended Tool | Reason |
|---|---|---|
| Production pipeline, multilingual | PaddleOCR | Best accuracy-speed trade-off, 80+ languages |
| Research, pushing accuracy | TrOCR | State-of-the-art, pretrained initialization |
| Quick prototype, Python | EasyOCR | Simple API, pip install, reasonable accuracy |
| Clean printed English text | Tesseract | Lightweight, no GPU needed, good enough |
| Handwritten text | TrOCR or PaddleOCR v4 | CRNN-based models struggle with handwriting |
Common Confusions
OCR accuracy metrics can be misleading
Character Error Rate (CER) and Word Error Rate (WER) are computed on cropped, pre-detected text regions. They do not account for detection errors (missed text, false detections). A recognizer with 1% CER (99% character accuracy) paired with 80% text detection recall still misses 20% of the text entirely. Always evaluate the full pipeline, not just the recognizer.
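The compounding is just multiplication of stage-level recalls. A simplified model (it ignores false-positive detections and treats CER as a pure miss rate):

```python
def end_to_end_char_recall(detection_recall: float, recognizer_cer: float) -> float:
    """Fraction of ground-truth characters recovered by the full
    pipeline: a character must be detected before it can be recognized."""
    return detection_recall * (1.0 - recognizer_cer)

# A 99%-accurate recognizer behind an 80%-recall detector:
end_to_end_char_recall(detection_recall=0.80, recognizer_cer=0.01)  # ≈ 0.792
```

Roughly 79% of characters survive end to end, far from the 99% the recognizer benchmark suggests.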
Language support does not mean equal accuracy
PaddleOCR supports 80+ languages, but accuracy varies dramatically. Chinese, English, and Japanese have millions of training samples. Low-resource languages (Tibetan, Khmer) have far fewer and correspondingly lower accuracy. Check benchmark numbers for your specific language before committing to a toolkit.
GPU is not always necessary for OCR
PaddleOCR and Tesseract run on CPU with acceptable latency for single-document processing (under 1 second per page). GPU acceleration matters for batch processing thousands of pages. Do not over-engineer the infrastructure for a prototype that processes 10 documents per day.
Exercises
Problem
A CRNN text recognizer processes a cropped text image of width 320 pixels with a CNN that downsamples width by a factor of 4, producing a feature sequence of length $T = 320 / 4 = 80$. The target text is "Invoice" ($U = 7$ characters). Explain why CTC can handle this mismatch between $T$ and $U$.
Problem
TrOCR uses autoregressive decoding while CRNN uses CTC (parallel decoding). Analyze the inference time complexity of each approach for a text line of $U$ characters, given an encoder output of length $T$.
References
Canonical:
- Liao et al., Real-Time Scene Text Detection with Differentiable Binarization (DB) (2020), AAAI
- Graves et al., Connectionist Temporal Classification (2006), ICML, Sections 3-4
- Li et al., TrOCR: Transformer-based Optical Character Recognition (2023), AAAI
Current:
- PaddleOCR documentation and benchmarks, github.com/PaddlePaddle/PaddleOCR
- Du et al., SVTR: Scene Text Recognition with a Single Visual Model (2022), IJCAI
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Document Intelligence (Layer 5)
- Multimodal RAG (Layer 5)
- Context Engineering (Layer 5)
- KV Cache (Layer 5)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)