
Multimodal RAG

RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.

Advanced · Tier 2 · Frontier · ~55 min


Why This Matters

Standard RAG retrieves text chunks and injects them into the context window. But most real-world documents are not pure text. They contain tables, charts, diagrams, images, equations, and complex layouts. A financial report has tables that contradict the surrounding text. A medical paper has figures that contain the key result. A technical manual has diagrams that explain what paragraphs cannot.

If your RAG pipeline can only retrieve text, it is blind to the majority of information in most document collections. Multimodal RAG extends retrieval to handle the full range of content types that appear in real documents.

Mental Model

Think of a research assistant helping you answer questions about a set of documents. A text-only assistant can read the paragraphs but skips every table, chart, and figure. A multimodal assistant can look at the charts, read the tables, interpret the diagrams, and synthesize across all content types.

The engineering challenge is not the language model (modern VLMs can process images and text together). The challenge is the retrieval pipeline: how do you parse, chunk, embed, and retrieve multimodal content so that the right pieces reach the model's context window?

Core Concepts

Definition

Multimodal RAG

Multimodal RAG is a retrieval-augmented generation pipeline that can retrieve and reason over multiple content types: text, images, tables, charts, and structured data. The retriever operates over a multimodal index, and the generator (typically a vision-language model) processes both text and visual inputs.

Definition

Document Parsing

Document parsing is the process of extracting structured content from raw documents (PDFs, Word files, HTML pages). This includes: text extraction, table detection and extraction, figure/chart extraction, layout analysis (headers, footers, columns), and metadata extraction. The quality of parsing directly determines the quality of downstream retrieval.

Definition

Vision-Language Model (VLM)

A vision-language model processes both images and text as input and generates text output. In multimodal RAG, VLMs serve two roles: (1) as retrievers that can embed both text and images into a shared vector space, and (2) as generators that can reason over retrieved images and text together.

The Multimodal RAG Pipeline

A full multimodal RAG pipeline has these stages:

1. Document Parsing

Raw documents (PDFs, scanned images, HTML) must be converted into structured, retrievable units. This is the hardest engineering problem in multimodal RAG.

Text extraction: OCR for scanned documents, PDF text extraction for digital PDFs. Layout-aware extraction preserves reading order and structure.

Table extraction: Detect table boundaries, extract cell contents, and preserve the row/column structure. Tables can be stored as markdown, HTML, or structured data. Naive text extraction destroys table structure.

Figure extraction: Detect figure boundaries, extract the image, and optionally generate a text description (caption or VLM-generated summary). Figures can be stored as images (for VLM retrieval) or as text descriptions (for text-only retrieval).

Layout analysis: Determine which text belongs to which section, column, and page. Multi-column layouts, sidebars, and footnotes require spatial understanding.
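The output of this parsing stage can be pictured as a flat list of typed elements. The sketch below shows one plausible shape for that output; the class names, fields, and sample data are illustrative, not the API of any particular parser.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical structured output of a layout-aware parser.
# Element kinds mirror the stages above: text, tables, figures.
@dataclass
class ParsedElement:
    kind: str                        # "text" | "table" | "figure"
    page: int
    content: str                     # extracted text, markdown table, or figure caption
    image_path: Optional[str] = None # set for figures stored as raw images

@dataclass
class ParsedDocument:
    source: str
    elements: list = field(default_factory=list)

    def by_kind(self, kind):
        return [e for e in self.elements if e.kind == kind]

doc = ParsedDocument(source="report.pdf", elements=[
    ParsedElement("text", 1, "Revenue grew 12% year over year."),
    ParsedElement("table", 2, "| Quarter | Revenue |\n| Q3 | $4.2M |"),
    ParsedElement("figure", 3, "Figure 1: Revenue by region", "fig1.png"),
])
print(len(doc.by_kind("figure")))  # → 1
```

Keeping the modality tag and page number on every element is what later lets the chunker treat tables and figures differently from prose, and lets the generator attribute answers to a specific page.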

2. Multimodal Chunking

After parsing, content must be chunked for retrieval. Multimodal chunking differs from text-only chunking:

Text chunks: Standard semantic or fixed-size chunking on extracted text.

Table chunks: Each table is typically a single chunk, possibly with its caption and surrounding context. Large tables may be split by row groups.

Figure chunks: Each figure is a chunk containing the image and its caption. The image can be stored as a raw image (for VLM embedding) or as a text description.

Hybrid chunks: A chunk that contains a text paragraph, the table it references, and the figure it describes. These preserve cross-modal relationships.
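The chunking rules above can be sketched as a single function: text is split by size (a stand-in for semantic chunking), while tables and figures stay whole with their captions attached. The dictionary shapes here are illustrative.

```python
# Sketch of multimodal chunking. Assumes parsed elements are dicts with
# "kind", "content", and an optional "caption"; these field names are
# assumptions for illustration, not a specific library's schema.
def chunk_elements(elements, max_chars=500):
    chunks = []
    for el in elements:
        kind, content = el["kind"], el["content"]
        if kind == "text":
            # fixed-size splitting as a stand-in for semantic chunking
            for i in range(0, len(content), max_chars):
                chunks.append({"kind": "text", "content": content[i:i + max_chars]})
        else:
            # tables and figures are kept as single chunks, caption included
            if el.get("caption"):
                content = el["caption"] + "\n" + content
            chunks.append({"kind": kind, "content": content})
    return chunks

parsed = [
    {"kind": "text", "content": "A" * 1200},
    {"kind": "table", "content": "| Q | Rev |", "caption": "Table 1"},
]
out = chunk_elements(parsed)
print([c["kind"] for c in out])  # → ['text', 'text', 'text', 'table']
```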

3. Multimodal Embedding

Chunks must be embedded into a vector space for retrieval. Two approaches:

Shared embedding space: Use a model like CLIP or ColPali that maps both text and images into the same vector space. Text queries can retrieve image chunks and vice versa.

Separate embeddings with fusion: Embed text and images separately, then combine scores at retrieval time. Simpler but loses cross-modal relationships.
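The shared-space approach can be made concrete with a toy example. In practice a model like CLIP or ColPali produces the vectors; here hand-made 3-dimensional vectors stand in for real embeddings so the cross-modal retrieval mechanics are visible.

```python
import math

# Cosine similarity over plain Python lists (no external dependencies).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy shared embedding space: text and image chunks live in the same space,
# so a text query can rank image chunks directly. Vectors are illustrative.
index = {
    "text: revenue discussion":  [0.9, 0.1, 0.0],
    "figure: revenue bar chart": [0.8, 0.2, 0.1],
    "figure: org chart":         [0.0, 0.1, 0.9],
}

query_vec = [1.0, 0.0, 0.0]  # stand-in embedding of "revenue trends"
ranked = sorted(index, key=lambda k: cosine(query_vec, index[k]), reverse=True)
print(ranked)
```

The revenue-related figure ranks above the unrelated one even though the query is pure text; that is exactly the cross-modal behavior a shared embedding space buys you, and what the separate-embeddings approach gives up.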

4. Retrieval and Generation

Given a query, retrieve the top-$k$ multimodal chunks and inject them into the VLM's context window. The VLM processes text and images together, generating an answer with source attribution.
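Assembling the VLM request from retrieved chunks can be sketched as follows. The message shape is illustrative; real VLM APIs differ in field names, and `build_vlm_messages` is a hypothetical helper.

```python
# Sketch: text chunks become text parts, figure chunks become image parts,
# and every part carries a source label so the model can attribute claims.
def build_vlm_messages(query, chunks):
    parts = []
    for i, c in enumerate(chunks):
        if c["kind"] == "figure":
            parts.append({"type": "image", "path": c["path"],
                          "label": f"[source {i}]"})
        else:
            parts.append({"type": "text",
                          "text": f"[source {i}] {c['content']}"})
    parts.append({"type": "text",
                  "text": f"Question: {query}\nCite sources like [source 0]."})
    return [{"role": "user", "content": parts}]

msgs = build_vlm_messages("What drove Q3 revenue?", [
    {"kind": "text", "content": "Q3 revenue rose 12%."},
    {"kind": "figure", "path": "fig1.png"},
])
print(len(msgs[0]["content"]))  # → 3
```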

Proposition

The Retrieval Bottleneck in Multimodal RAG

Statement

In a document collection where a fraction $\alpha$ of the answer-relevant information is contained in non-text elements (tables, figures, charts), a text-only retrieval pipeline has an upper bound on recall:

$$\text{Recall}_{\text{text-only}} \leq 1 - \alpha + \alpha \cdot q$$

where $q$ is the fraction of non-text information that is adequately captured by text proxies (captions, OCR of tables, generated descriptions). For visual information with no text proxy ($q = 0$), the recall loss equals $\alpha$.

In practice, for technical and scientific documents, $\alpha$ ranges from 0.2 to 0.5, meaning text-only RAG misses 20-50% of the relevant information.
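A few numbers give a feel for this bound. Plugging in the practical range of $\alpha$ with varying proxy quality $q$:

```python
# Recall upper bound for text-only retrieval: 1 - alpha + alpha * q,
# where alpha is the non-text fraction of relevant information and
# q is the fraction recovered by text proxies.
def text_only_recall_bound(alpha, q):
    return 1 - alpha + alpha * q

for alpha in (0.2, 0.5):
    for q in (0.0, 0.5, 0.9):
        print(f"alpha={alpha}, q={q}: recall <= {text_only_recall_bound(alpha, q):.2f}")
```

At $\alpha = 0.5$ with no text proxies, recall is capped at 0.50; even excellent proxies ($q = 0.9$) cap it at 0.95, so some loss always remains for visually rich collections.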

Intuition

If the answer to a question is in a chart, and your retriever can only search text, it will never find the chart. Text proxies (captions, generated descriptions) partially bridge this gap, but they are lossy: a description of a chart does not contain the precise data points. The gap between text-only and multimodal retrieval is largest for documents that are visually rich.

Why It Matters

This quantifies why text-only RAG fails on real-world document collections. Most enterprise documents (financial reports, technical manuals, research papers) have significant visual content. Ignoring it is not a minor limitation; it is a fundamental gap in the retrieval pipeline.

Failure Mode

The bound assumes that non-text information is genuinely not captured by text. In practice, good captions and OCR can recover substantial information from tables and simple charts. The bound is tightest for complex visualizations (scatter plots, diagrams, photographs) where text descriptions are inherently lossy.

Agentic RAG

Definition

Agentic RAG

Agentic RAG extends the RAG pipeline with an LLM agent that decides when, what, and how to retrieve. Instead of a fixed retrieve-then-generate pipeline, the agent can:

  1. Decide whether retrieval is needed for a given query
  2. Formulate and reformulate search queries
  3. Retrieve from multiple sources (text index, image index, database, API)
  4. Evaluate retrieved results and decide whether to retrieve more
  5. Synthesize across multiple retrieval rounds

The agent uses tool-calling to interact with retrieval systems, making retrieval a decision within a multi-step reasoning process rather than a fixed pipeline stage.

Agentic RAG is particularly valuable for multimodal content because:

  • The agent can decide whether to search text, images, or structured data
  • Failed retrievals can be retried with reformulated queries
  • Complex questions can be decomposed into sub-questions, each targeting a different content type
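The control flow can be sketched as a loop. The "agent" here is a keyword stub standing in for an LLM with tool-calling; the routing policy, sufficiency check, and reformulation step are all deliberately crude illustrations of decisions a real agent would make.

```python
# Toy routing policy: pick a retrieval source by keyword. In a real
# system this decision comes from an LLM tool call.
def route(query):
    if "chart" in query or "figure" in query:
        return "image_index"
    if "table" in query or "revenue" in query:
        return "table_index"
    return "text_index"

def agentic_retrieve(query, search_fn, max_rounds=3):
    evidence, q = [], query
    for _ in range(max_rounds):
        source = route(q)                 # decide where to search
        results = search_fn(source, q)    # call the retrieval tool
        evidence.extend(results)
        if results:                       # crude sufficiency check
            break
        q = q + " (reformulated)"         # retry failed retrieval
    return evidence

# Toy backend: only the table index has anything for this query.
def fake_search(source, query):
    return ["Q3 revenue table"] if source == "table_index" else []

print(agentic_retrieve("revenue growth in Q3", fake_search))  # → ['Q3 revenue table']
```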

Reasoning RAG

Definition

Reasoning RAG

Reasoning RAG performs multi-step retrieval with chain-of-thought reasoning between retrieval steps. Instead of a single retrieve-generate cycle, the model:

  1. Retrieves initial evidence
  2. Reasons about what is missing
  3. Retrieves additional evidence to fill gaps
  4. Repeats until sufficient evidence is gathered
  5. Generates the final answer with full chain of reasoning

This handles multi-hop questions where no single chunk contains the full answer.

Example: "How did the company's revenue growth compare to the industry average?" requires (1) retrieving the company's revenue table, (2) retrieving the industry benchmark data, (3) computing and comparing growth rates. A single retrieval step is unlikely to surface both pieces of information.
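The revenue-comparison example can be sketched as a multi-step loop. The corpus, the retrieval function, and especially the gap check (a hard-coded list here; an LLM reasoning step in practice) are illustrative stand-ins.

```python
# Toy two-document corpus standing in for the Q3 report sections.
CORPUS = {
    "company revenue": "Company revenue grew 8% in Q3.",
    "industry benchmark": "Industry average growth was 5% in Q3.",
}

def retrieve(query):
    return [v for k, v in CORPUS.items() if k in query]

def reasoning_rag(question, needed=("company revenue", "industry benchmark")):
    # In practice an LLM would derive `needed` from the question and
    # re-derive what is missing after each retrieval round.
    evidence, missing = [], list(needed)
    while missing:
        target = missing.pop(0)        # reason about what is still missing
        evidence += retrieve(target)   # retrieve to fill that gap
    return evidence

ev = reasoning_rag("Did revenue grow faster than the industry average?")
print(len(ev))  # → 2
```

The single-hop failure mode is visible here: a single retrieval for the original question would surface at most one of the two pieces of evidence, while the loop gathers both before generation.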

Build It This Way by Default

For document QA, start with this pipeline: (1) Parse documents with a layout-aware parser (e.g., marker, docling, or unstructured.io) to extract text, tables, and figures. (2) Chunk text semantically; keep tables and figures as individual chunks with their captions. (3) Embed all chunks using a multimodal embedding model (ColPali for visual, or text embeddings plus VLM-generated descriptions). (4) Retrieve top-$k$ chunks with a hybrid search (dense retrieval plus keyword matching). (5) Generate the answer with a VLM that receives both text and images, and require source attribution in the output. Only add agentic or reasoning RAG when single-hop retrieval demonstrably fails on your eval set.

Common Fake Understanding

RAG does not solve hallucination. It reduces hallucination for in-scope queries where the answer exists in the retrieved documents. But RAG introduces its own failure modes: (1) the retriever fails to find the relevant chunk (recall failure), (2) the model ignores the retrieved evidence and generates from parametric memory anyway (faithfulness failure), (3) the model synthesizes an answer from multiple retrieved chunks that individually do not support it (compositionality hallucination). RAG changes the hallucination distribution; it does not eliminate hallucination. Measuring retrieval recall and answer faithfulness separately is essential.

Evaluation for Multimodal RAG

Evaluating multimodal RAG requires metrics beyond standard QA accuracy:

  • Retrieval recall: Did the retriever surface the relevant chunks? Measured separately for text, table, and image chunks.
  • Faithfulness: Is the generated answer supported by the retrieved evidence? Checked by a judge model or human evaluation.
  • Source attribution: Can the system point to the specific chunk (paragraph, table, figure) that supports each claim?
  • Multimodal coverage: For questions requiring visual information, does the pipeline retrieve and use the relevant images or tables?
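The first of these metrics, measured separately per modality, can be sketched as follows. The gold-label format (chunk id plus modality tag) is an assumption about how the eval set is annotated.

```python
# Per-modality retrieval recall: given gold relevant chunks tagged with
# their modality and the set of retrieved chunk ids, compute recall
# separately for text, table, and image chunks.
def recall_by_modality(gold, retrieved):
    counts = {}
    for chunk_id, modality in gold:
        seen, found = counts.get(modality, (0, 0))
        counts[modality] = (seen + 1, found + int(chunk_id in retrieved))
    return {m: found / seen for m, (seen, found) in counts.items()}

gold = [("t1", "text"), ("t2", "text"), ("tab1", "table"), ("fig1", "image")]
retrieved = {"t1", "t2", "tab1"}
print(recall_by_modality(gold, retrieved))
# → {'text': 1.0, 'table': 1.0, 'image': 0.0}
```

A breakdown like this is what surfaces the characteristic failure of text-only pipelines: overall recall can look healthy while image recall sits at zero.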

Common Confusions

Watch Out

Multimodal RAG is not just adding images to the prompt

Simply including images in the context window is not multimodal RAG. RAG requires a retrieval step: the system must decide which images (out of potentially millions) are relevant to the query and include only those. Without retrieval, you are doing context stuffing, which does not scale beyond a handful of documents.

Watch Out

Better parsing beats better models

The most common failure in multimodal RAG pipelines is bad parsing, not bad generation. If a table is extracted as garbled text, no amount of model quality can recover the information. Investing in document parsing quality typically yields larger gains than upgrading the language model.

Watch Out

VLM-generated descriptions are lossy

A common shortcut is to convert all images to text descriptions and then use text-only RAG. This works for simple images but fails for data-rich content like charts, graphs, and technical diagrams where the description cannot capture the precise information. Use native multimodal retrieval when precision matters.

Summary

  • Real documents are multimodal: text, tables, charts, figures, equations
  • Text-only RAG misses 20-50% of information in technical documents
  • The pipeline: parse, chunk (multimodal), embed (shared space), retrieve, generate
  • Document parsing quality is the most important and most underinvested stage
  • Agentic RAG: LLM decides when and what to retrieve
  • Reasoning RAG: multi-step retrieval with chain-of-thought between steps
  • RAG reduces hallucination for in-scope queries but does not eliminate it
  • Evaluate retrieval recall and faithfulness separately

Exercises

ExerciseCore

Problem

You have a collection of 500 financial reports (PDFs). Each report contains approximately 30 pages of text, 15 tables, and 5 charts. You want to build a RAG pipeline that can answer questions about revenue trends shown in charts. Describe the minimal viable pipeline, identifying which stages are text-only and which require multimodal capabilities.

ExerciseAdvanced

Problem

A user asks: "According to the Q3 report, did marketing spend increase faster than revenue?" This requires retrieving a marketing expense table and a revenue chart from different sections of the same document. Explain why single-hop RAG is likely to fail and design a reasoning RAG approach.

ExerciseResearch

Problem

Propose a method to evaluate whether a multimodal RAG system is actually using retrieved images versus ignoring them and relying on text-only context. Design an experiment that isolates the contribution of visual retrieval.

References

Canonical:

  • Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)

Current:

  • Faysse et al., "ColPali: Efficient Document Retrieval with Vision Language Models" (2024)
  • Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (2024)

Next Topics

The natural next steps from multimodal RAG:

  • Hallucination theory: understanding when retrieval fails to prevent confabulation and why faithfulness to retrieved evidence is not guaranteed
  • Inference systems overview: the serving infrastructure required to run multimodal RAG pipelines at scale

Last reviewed: April 2026
