LLM Construction
Multimodal RAG
RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
Why This Matters
Standard RAG retrieves text chunks and injects them into the context window. But most real-world documents are not pure text. They contain tables, charts, diagrams, images, equations, and complex layouts. A financial report has tables that contradict the surrounding text. A medical paper has figures that contain the key result. A technical manual has diagrams that explain what paragraphs cannot.
If your RAG pipeline can only retrieve text, it is blind to the majority of information in most document collections. Multimodal RAG extends retrieval to handle the full range of content types that appear in real documents.
Mental Model
Think of a research assistant helping you answer questions about a set of documents. A text-only assistant can read the paragraphs but skips every table, chart, and figure. A multimodal assistant can look at the charts, read the tables, interpret the diagrams, and synthesize across all content types.
The engineering challenge is not the language model (modern VLMs can process images and text together). The challenge is the retrieval pipeline: how do you parse, chunk, embed, and retrieve multimodal content so that the right pieces reach the model's context window?
Core Concepts
Multimodal RAG
Multimodal RAG is a retrieval-augmented generation pipeline that can retrieve and reason over multiple content types: text, images, tables, charts, and structured data. The retriever operates over a multimodal index, and the generator (typically a vision-language model) processes both text and visual inputs.
Document Parsing
Document parsing is the process of extracting structured content from raw documents (PDFs, Word files, HTML pages). This includes: text extraction, table detection and extraction, figure/chart extraction, layout analysis (headers, footers, columns), and metadata extraction. The quality of parsing directly determines the quality of downstream retrieval.
Vision-Language Model (VLM)
A vision-language model processes both images and text as input and generates text output. In multimodal RAG, VLMs serve two roles: (1) as retrievers that can embed both text and images into a shared vector space, and (2) as generators that can reason over retrieved images and text together.
The Multimodal RAG Pipeline
A full multimodal RAG pipeline has these stages:
1. Document Parsing
Raw documents (PDFs, scanned images, HTML) must be converted into structured, retrievable units. This is the hardest engineering problem in multimodal RAG.
Text extraction: OCR for scanned documents, PDF text extraction for digital PDFs. Layout-aware extraction preserves reading order and structure.
Table extraction: Detect table boundaries, extract cell contents, and preserve the row/column structure. Tables can be stored as markdown, HTML, or structured data. Naive text extraction destroys table structure.
Figure extraction: Detect figure boundaries, extract the image, and optionally generate a text description (caption or VLM-generated summary). Figures can be stored as images (for VLM retrieval) or as text descriptions (for text-only retrieval).
Layout analysis: Determine which text belongs to which section, column, and page. Multi-column layouts, sidebars, and footnotes require spatial understanding.
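As a concrete illustration of why table extraction must preserve structure, here is a minimal sketch that serializes an extracted table to markdown before chunking. It assumes the parser already yields the table as a list of rows with the header first; real parsers (docling, unstructured.io, marker) expose similar structures under their own APIs.

```python
# Sketch: serialize an extracted table to markdown so its row/column
# structure survives chunking. Assumes the upstream parser yields the
# table as a list of rows, header row first (an illustrative format,
# not any specific library's output).

def table_to_markdown(rows: list[list[str]]) -> str:
    """Render a table (header row first) as a GitHub-style markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

table = [
    ["Quarter", "Revenue", "Growth"],
    ["Q1", "$4.1M", "3%"],
    ["Q2", "$4.4M", "7%"],
]
print(table_to_markdown(table))
```

Storing the markdown string as the chunk text keeps rows and columns aligned for both embedding and generation, which naive text extraction destroys.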
2. Multimodal Chunking
After parsing, content must be chunked for retrieval. Multimodal chunking differs from text-only chunking:
Text chunks: Standard semantic or fixed-size chunking on extracted text.
Table chunks: Each table is typically a single chunk, possibly with its caption and surrounding context. Large tables may be split by row groups.
Figure chunks: Each figure is a chunk containing the image and its caption. The image can be stored as a raw image (for VLM embedding) or as a text description.
Hybrid chunks: A chunk that contains a text paragraph, the table it references, and the figure it describes. These preserve cross-modal relationships.
3. Multimodal Embedding
Chunks must be embedded into a vector space for retrieval. Two approaches:
Shared embedding space: Use a model like CLIP or ColPali that maps both text and images into the same vector space. Text queries can retrieve image chunks and vice versa.
Separate embeddings with fusion: Embed text and images separately, then combine scores at retrieval time. Simpler but loses cross-modal relationships.
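One common way to implement "separate embeddings with fusion" is reciprocal rank fusion (RRF) over the per-modality rankings. The sketch below uses toy chunk IDs; the function itself is the standard RRF formula:

```python
# Sketch: reciprocal rank fusion over separate text and image rankings,
# one way to realize "separate embeddings with fusion". Chunk IDs and
# the example rankings are illustrative.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; k damps the dominance of rank 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["t_chunk_3", "fig_7", "t_chunk_9"]   # from the text index
image_hits = ["fig_7", "fig_2", "t_chunk_3"]      # from the image index
fused = rrf([text_hits, image_hits])
print(fused)  # chunks that both indices rank highly rise to the top
```

Score fusion is simple and modality-agnostic, but as noted above it cannot express cross-modal relationships the way a shared embedding space can.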
4. Retrieval and Generation
Given a query, retrieve the top-k multimodal chunks and inject them into the VLM's context window. The VLM processes text and images together, generating an answer with source attribution.
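In a shared embedding space, top-k retrieval reduces to nearest-neighbor search over the multimodal index. A minimal sketch with toy 3-d vectors standing in for real embeddings (a production system would use a vector store and a CLIP- or ColPali-style encoder):

```python
# Sketch: top-k retrieval over a shared embedding space, assuming chunks
# and the query are already embedded. The 3-d vectors are toy stand-ins
# for real embedding vectors.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
    return ranked[:k]

index = {
    "text_chunk": [0.9, 0.1, 0.0],
    "table_chunk": [0.7, 0.6, 0.2],
    "figure_chunk": [0.1, 0.9, 0.3],
}
query_vec = [0.8, 0.5, 0.1]  # stands in for the embedded user question
print(top_k(query_vec, index))
```

Note that in a shared space a text query can surface the table or figure chunk directly, with no text proxy involved.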
The Retrieval Bottleneck in Multimodal RAG
Statement
In a document collection where a fraction f of the answer-relevant information is contained in non-text elements (tables, figures, charts), a text-only retrieval pipeline has an upper bound on recall:

Recall_text ≤ 1 − f·(1 − c)

where c is the fraction of non-text information that is adequately captured by text proxies (captions, OCR of tables, generated descriptions). For visual information with no text proxy (c = 0), the recall loss equals f.
In practice, for technical and scientific documents, f ranges from 0.2 to 0.5, meaning text-only RAG misses 20-50% of the relevant information.
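Plugging representative numbers into the bound (recall at most 1 minus the fraction of non-text information times the fraction not recovered by text proxies) makes the gap concrete:

```python
# Worked example of the recall bound Recall_text <= 1 - f * (1 - c),
# with f (fraction of relevant info in non-text elements) in the
# 0.2-0.5 range the text cites. Values are illustrative.

def text_only_recall_bound(f: float, c: float) -> float:
    """f: fraction of relevant info in non-text elements;
       c: fraction of that info captured by text proxies."""
    return 1.0 - f * (1.0 - c)

print(text_only_recall_bound(0.3, 0.0))  # no text proxies: bound is 0.7
print(text_only_recall_bound(0.3, 0.5))  # decent captions/OCR: bound is 0.85
print(text_only_recall_bound(0.5, 0.0))  # visually rich docs: bound is 0.5
```

Even good captions only raise the ceiling; they never remove it unless c = 1.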
Intuition
If the answer to a question is in a chart, and your retriever can only search text, it will never find the chart. Text proxies (captions, generated descriptions) partially bridge this gap, but they are lossy: a description of a chart does not contain the precise data points. The gap between text-only and multimodal retrieval is largest for documents that are visually rich.
Why It Matters
This quantifies why text-only RAG fails on real-world document collections. Most enterprise documents (financial reports, technical manuals, research papers) have significant visual content. Ignoring it is not a minor limitation; it is a fundamental gap in the retrieval pipeline.
Failure Mode
The bound assumes that non-text information is genuinely not captured by text. In practice, good captions and OCR can recover substantial information from tables and simple charts. The bound is tightest for complex visualizations (scatter plots, diagrams, photographs) where text descriptions are inherently lossy.
Agentic RAG
Agentic RAG
Agentic RAG extends the RAG pipeline with an LLM agent that decides when, what, and how to retrieve. Instead of a fixed retrieve-then-generate pipeline, the agent can:
- Decide whether retrieval is needed for a given query
- Formulate and reformulate search queries
- Retrieve from multiple sources (text index, image index, database, API)
- Evaluate retrieved results and decide whether to retrieve more
- Synthesize across multiple retrieval rounds
The agent uses tool-calling to interact with retrieval systems, making retrieval a decision within a multi-step reasoning process rather than a fixed pipeline stage.
Agentic RAG is particularly valuable for multimodal content because:
- The agent can decide whether to search text, images, or structured data
- Failed retrievals can be retried with reformulated queries
- Complex questions can be decomposed into sub-questions, each targeting a different content type
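The agentic loop can be sketched as follows. Here simple rules stand in for the LLM's tool-calling decisions (routing, sufficiency judgment, query rewriting); the index names, routing heuristic, and stub retriever are all illustrative assumptions:

```python
# Sketch of an agentic retrieval loop. The route() and search() functions
# are hard-coded stand-ins for an LLM's tool-calling decisions and for
# real per-modality vector stores.

def route(query: str) -> str:
    """Stand-in for the agent deciding which index/tool to call."""
    if "chart" in query or "figure" in query:
        return "image_index"
    if "table" in query or "revenue" in query:
        return "table_index"
    return "text_index"

def search(index: str, query: str) -> list[str]:
    """Stub retriever; a real system would query a vector store per index."""
    fake_results = {"table_index": ["table: Q3 revenue"],
                    "image_index": [],
                    "text_index": ["para: overview"]}
    return fake_results[index]

def agentic_retrieve(query: str, max_rounds: int = 3) -> list[str]:
    evidence, q = [], query
    for _ in range(max_rounds):
        evidence.extend(search(route(q), q))
        if evidence:                     # agent judges evidence sufficient
            break
        q = q + " (reformulated)"        # stand-in for LLM query rewriting
    return evidence

print(agentic_retrieve("What was Q3 revenue?"))
```

The essential point is the control flow: retrieval becomes a decision inside a loop, with routing, evaluation, and reformulation between rounds.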
Reasoning RAG
Reasoning RAG
Reasoning RAG performs multi-step retrieval with chain-of-thought reasoning between retrieval steps. Instead of a single retrieve-generate cycle, the model:
- Retrieves initial evidence
- Reasons about what is missing
- Retrieves additional evidence to fill gaps
- Repeats until sufficient evidence is gathered
- Generates the final answer with full chain of reasoning
This handles multi-hop questions where no single chunk contains the full answer.
Example: "How did the company's revenue growth compare to the industry average?" requires (1) retrieving the company's revenue table, (2) retrieving the industry benchmark data, (3) computing and comparing growth rates. A single retrieval step is unlikely to surface both pieces of information.
For document QA, start with this pipeline: (1) Parse documents with a layout-aware parser (e.g., marker, docling, or unstructured.io) to extract text, tables, and figures. (2) Chunk text semantically; keep tables and figures as individual chunks with their captions. (3) Embed all chunks using a multimodal embedding model (ColPali for visual, or text embeddings plus VLM-generated descriptions). (4) Retrieve top-k chunks with a hybrid search (dense retrieval plus keyword matching). (5) Generate the answer with a VLM that receives both text and images, and require source attribution in the output. Only add agentic or reasoning RAG when single-hop retrieval demonstrably fails on your eval set.
RAG does not solve hallucination. It reduces hallucination for in-scope queries where the answer exists in the retrieved documents. But RAG introduces its own failure modes: (1) the retriever fails to find the relevant chunk (recall failure), (2) the model ignores the retrieved evidence and generates from parametric memory anyway (faithfulness failure), (3) the model synthesizes an answer from multiple retrieved chunks that individually do not support it (compositionality hallucination). RAG changes the hallucination distribution; it does not eliminate hallucination. Measuring retrieval recall and answer faithfulness separately is essential.
Evaluation for Multimodal RAG
Evaluating multimodal RAG requires metrics beyond standard QA accuracy:
- Retrieval recall: Did the retriever surface the relevant chunks? Measured separately for text, table, and image chunks.
- Faithfulness: Is the generated answer supported by the retrieved evidence? Checked by a judge model or human evaluation.
- Source attribution: Can the system point to the specific chunk (paragraph, table, figure) that supports each claim?
- Multimodal coverage: For questions requiring visual information, does the pipeline retrieve and use the relevant images or tables?
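Measuring retrieval recall separately per modality is straightforward once the eval set labels each gold chunk with its modality. A sketch, assuming gold labels of the form chunk_id -> modality (an illustrative format):

```python
# Sketch: per-modality retrieval recall, assuming an eval set whose gold
# labels map each relevant chunk ID to its modality. Data is illustrative.

def recall_by_modality(retrieved: set[str], gold: dict[str, str]) -> dict[str, float]:
    """gold maps relevant chunk_id -> modality; returns recall per modality."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for chunk_id, modality in gold.items():
        totals[modality] = totals.get(modality, 0) + 1
        if chunk_id in retrieved:
            hits[modality] = hits.get(modality, 0) + 1
    return {m: hits.get(m, 0) / n for m, n in totals.items()}

gold = {"t1": "text", "t2": "text", "tab1": "table", "fig1": "figure"}
retrieved = {"t1", "t2", "tab1"}   # the relevant figure chunk was missed
print(recall_by_modality(retrieved, gold))
```

Aggregate recall would look healthy here (3 of 4), while the per-modality breakdown exposes that figure retrieval is failing completely, which is exactly the signal this metric exists to surface.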
Common Confusions
Multimodal RAG is not just adding images to the prompt
Simply including images in the context window is not multimodal RAG. RAG requires a retrieval step: the system must decide which images (out of potentially millions) are relevant to the query and include only those. Without retrieval, you are doing context stuffing, which does not scale beyond a handful of documents.
Better parsing beats better models
The most common failure in multimodal RAG pipelines is bad parsing, not bad generation. If a table is extracted as garbled text, no amount of model quality can recover the information. Investing in document parsing quality typically yields larger gains than upgrading the language model.
VLM-generated descriptions are lossy
A common shortcut is to convert all images to text descriptions and then use text-only RAG. This works for simple images but fails for data-rich content like charts, graphs, and technical diagrams where the description cannot capture the precise information. Use native multimodal retrieval when precision matters.
Summary
- Real documents are multimodal: text, tables, charts, figures, equations
- Text-only RAG misses 20-50% of information in technical documents
- The pipeline: parse, chunk (multimodal), embed (shared space), retrieve, generate
- Document parsing quality is the most important and most underinvested stage
- Agentic RAG: LLM decides when and what to retrieve
- Reasoning RAG: multi-step retrieval with chain-of-thought between steps
- RAG reduces hallucination for in-scope queries but does not eliminate it
- Evaluate retrieval recall and faithfulness separately
Exercises
Problem
You have a collection of 500 financial reports (PDFs). Each report contains approximately 30 pages of text, 15 tables, and 5 charts. You want to build a RAG pipeline that can answer questions about revenue trends shown in charts. Describe the minimal viable pipeline, identifying which stages are text-only and which require multimodal capabilities.
Problem
A user asks: "According to the Q3 report, did marketing spend increase faster than revenue?" This requires retrieving a marketing expense table and a revenue chart from different sections of the same document. Explain why single-hop RAG is likely to fail and design a reasoning RAG approach.
Problem
Propose a method to evaluate whether a multimodal RAG system is actually using retrieved images versus ignoring them and relying on text-only context. Design an experiment that isolates the contribution of visual retrieval.
References
Canonical:
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
Current:
- Faysse et al., "ColPali: Efficient Document Retrieval with Vision Language Models" (2024)
- Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (2024)
Next Topics
The natural next steps from multimodal RAG:
- Hallucination theory: understanding when retrieval fails to prevent confabulation and why faithfulness to retrieved evidence is not guaranteed
- Inference systems overview: the serving infrastructure required to run multimodal RAG pipelines at scale
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Context Engineering (Layer 5)
- KV Cache (Layer 5)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
Builds on This
- Document Intelligence (Layer 5)