LLM Construction
Multimodal RAG
RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
Why This Matters
Standard RAG retrieves text chunks and injects them into the context window. But most real-world documents are not pure text. They contain tables, charts, diagrams, images, equations, and complex layouts. A financial report has tables that contradict the surrounding text. A medical paper has figures that contain the key result. A technical manual has diagrams that explain what paragraphs cannot.
If your RAG pipeline can only retrieve text, it is blind to the majority of information in most document collections. Multimodal RAG extends retrieval to handle the full range of content types that appear in real documents.
Mental Model
Think of a research assistant helping you answer questions about a set of documents. A text-only assistant can read the paragraphs but skips every table, chart, and figure. A multimodal assistant can look at the charts, read the tables, interpret the diagrams, and synthesize across all content types.
The engineering challenge is not the language model (modern VLMs can process images and text together). The challenge is the retrieval pipeline: how do you parse, chunk, embed, and retrieve multimodal content so that the right pieces reach the model's context window?
Core Concepts
Multimodal RAG
Multimodal RAG is a retrieval-augmented generation pipeline that can retrieve and reason over multiple content types: text, images, tables, charts, and structured data. The retriever operates over a multimodal index, and the generator (typically a vision-language model) processes both text and visual inputs.
Document Parsing
Document parsing is the process of extracting structured content from raw documents (PDFs, Word files, HTML pages). This includes: text extraction, table detection and extraction, figure/chart extraction, layout analysis (headers, footers, columns), and metadata extraction. The quality of parsing directly determines the quality of downstream retrieval.
Vision-Language Model (VLM)
A vision-language model processes both images and text as input and generates text output. In multimodal RAG, VLMs serve two roles: (1) as retrievers that can embed both text and images into a shared vector space, and (2) as generators that can reason over retrieved images and text together.
The Multimodal RAG Pipeline
A full multimodal RAG pipeline has these stages:
1. Document Parsing
Raw documents (PDFs, scanned images, HTML) must be converted into structured, retrievable units. This is the hardest engineering problem in multimodal RAG.
Text extraction: OCR for scanned documents, PDF text extraction for digital PDFs. Layout-aware extraction preserves reading order and structure.
Table extraction: Detect table boundaries, extract cell contents, and preserve the row/column structure. Tables can be stored as markdown, HTML, or structured data. Naive text extraction destroys table structure.
Figure extraction: Detect figure boundaries, extract the image, and optionally generate a text description (caption or VLM-generated summary). Figures can be stored as images (for VLM retrieval) or as text descriptions (for text-only retrieval).
Layout analysis: Determine which text belongs to which section, column, and page. Multi-column layouts, sidebars, and footnotes require spatial understanding.
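As a concrete illustration of why table extraction must preserve structure, here is a minimal sketch that serializes an extracted table to markdown before chunking. It assumes the parser already yields the table as a list of rows with the header first; real parsers (docling, unstructured.io, marker) expose similar structures under their own APIs.

```python
# Sketch: serialize an extracted table to markdown so its row/column
# structure survives chunking. Assumes the upstream parser yields the
# table as a list of rows, header row first (an illustrative format,
# not any specific library's output).

def table_to_markdown(rows: list[list[str]]) -> str:
    """Render a table (header row first) as a GitHub-style markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

table = [
    ["Quarter", "Revenue", "Growth"],
    ["Q1", "$4.1M", "3%"],
    ["Q2", "$4.4M", "7%"],
]
print(table_to_markdown(table))
```

Storing the markdown string as the chunk text keeps rows and columns aligned for both embedding and generation, which naive text extraction destroys.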
2. Multimodal Chunking
After parsing, content must be chunked for retrieval. Multimodal chunking differs from text-only chunking:
Text chunks: Standard semantic or fixed-size chunking on extracted text.
Table chunks: Each table is typically a single chunk, possibly with its caption and surrounding context. Large tables may be split by row groups.
Figure chunks: Each figure is a chunk containing the image and its caption. The image can be stored as a raw image (for VLM embedding) or as a text description.
Hybrid chunks: A chunk that contains a text paragraph, the table it references, and the figure it describes. These preserve cross-modal relationships.
3. Multimodal Embedding
Chunks must be embedded into a vector space for retrieval. Two approaches:
Shared embedding space: Use a model like CLIP or ColPali that maps both text and images into the same vector space. Text queries can retrieve image chunks and vice versa.
Separate embeddings with fusion: Embed text and images separately, then combine scores at retrieval time. Simpler but loses cross-modal relationships.
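One common way to implement "separate embeddings with fusion" is reciprocal rank fusion (RRF) over the per-modality rankings. The sketch below uses toy chunk IDs; the function itself is the standard RRF formula:

```python
# Sketch: reciprocal rank fusion over separate text and image rankings,
# one way to realize "separate embeddings with fusion". Chunk IDs and
# the example rankings are illustrative.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; k damps the dominance of rank 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["t_chunk_3", "fig_7", "t_chunk_9"]   # from the text index
image_hits = ["fig_7", "fig_2", "t_chunk_3"]      # from the image index
fused = rrf([text_hits, image_hits])
print(fused)  # chunks that both indices rank highly rise to the top
```

Score fusion is simple and modality-agnostic, but as noted above it cannot express cross-modal relationships the way a shared embedding space can.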
4. Retrieval and Generation
Given a query, retrieve the top-k multimodal chunks and inject them into the VLM's context window. The VLM processes text and images together, generating an answer with source attribution.
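In a shared embedding space, top-k retrieval reduces to nearest-neighbor search over the multimodal index. A minimal sketch with toy 3-d vectors standing in for real embeddings (a production system would use a vector store and a CLIP- or ColPali-style encoder):

```python
# Sketch: top-k retrieval over a shared embedding space, assuming chunks
# and the query are already embedded. The 3-d vectors are toy stand-ins
# for real embedding vectors.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
    return ranked[:k]

index = {
    "text_chunk": [0.9, 0.1, 0.0],
    "table_chunk": [0.7, 0.6, 0.2],
    "figure_chunk": [0.1, 0.9, 0.3],
}
query_vec = [0.8, 0.5, 0.1]  # stands in for the embedded user question
print(top_k(query_vec, index))
```

Note that in a shared space a text query can surface the table or figure chunk directly, with no text proxy involved.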
The Retrieval Bottleneck in Multimodal RAG
Statement
In a document collection where a fraction f of the answer-relevant information is contained in non-text elements (tables, figures, charts), a text-only retrieval pipeline has an upper bound on recall:

Recall_text ≤ 1 − f·(1 − c)

where c is the fraction of non-text information that is adequately captured by text proxies (captions, OCR of tables, generated descriptions). For visual information with no text proxy (c = 0), the recall loss equals f.
In practice, for technical and scientific documents, f ranges from 0.2 to 0.5, meaning text-only RAG misses 20-50% of the relevant information.
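Plugging representative numbers into the bound (recall at most 1 minus the fraction of non-text information times the fraction not recovered by text proxies) makes the gap concrete:

```python
# Worked example of the recall bound Recall_text <= 1 - f * (1 - c),
# with f (fraction of relevant info in non-text elements) in the
# 0.2-0.5 range the text cites. Values are illustrative.

def text_only_recall_bound(f: float, c: float) -> float:
    """f: fraction of relevant info in non-text elements;
       c: fraction of that info captured by text proxies."""
    return 1.0 - f * (1.0 - c)

print(text_only_recall_bound(0.3, 0.0))  # no text proxies: bound is 0.7
print(text_only_recall_bound(0.3, 0.5))  # decent captions/OCR: bound is 0.85
print(text_only_recall_bound(0.5, 0.0))  # visually rich docs: bound is 0.5
```

Even good captions only raise the ceiling; they never remove it unless c = 1.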
Intuition
If the answer to a question is in a chart, and your retriever can only search text, it will never find the chart. Text proxies (captions, generated descriptions) partially bridge this gap, but they are lossy: a description of a chart does not contain the precise data points. The gap between text-only and multimodal retrieval is largest for documents that are visually rich.
Why It Matters
This quantifies why text-only RAG fails on real-world document collections. Most enterprise documents (financial reports, technical manuals, research papers) have significant visual content. Ignoring it is not a minor limitation; it is a fundamental gap in the retrieval pipeline.
Failure Mode
The bound assumes that non-text information is genuinely not captured by text. In practice, good captions and OCR can recover substantial information from tables and simple charts. The bound is tightest for complex visualizations (scatter plots, diagrams, photographs) where text descriptions are inherently lossy.
Agentic RAG
Agentic RAG
Agentic RAG extends the RAG pipeline with an LLM agent that decides when, what, and how to retrieve. Instead of a fixed retrieve-then-generate pipeline, the agent can:
- Decide whether retrieval is needed for a given query
- Formulate and reformulate search queries
- Retrieve from multiple sources (text index, image index, database, API)
- Evaluate retrieved results and decide whether to retrieve more
- Synthesize across multiple retrieval rounds
The agent uses tool-calling to interact with retrieval systems, making retrieval a decision within a multi-step reasoning process rather than a fixed pipeline stage.
Agentic RAG is particularly valuable for multimodal content because:
- The agent can decide whether to search text, images, or structured data
- Failed retrievals can be retried with reformulated queries
- Complex questions can be decomposed into sub-questions, each targeting a different content type
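The agentic loop can be sketched as follows. Here simple rules stand in for the LLM's tool-calling decisions (routing, sufficiency judgment, query rewriting); the index names, routing heuristic, and stub retriever are all illustrative assumptions:

```python
# Sketch of an agentic retrieval loop. The route() and search() functions
# are hard-coded stand-ins for an LLM's tool-calling decisions and for
# real per-modality vector stores.

def route(query: str) -> str:
    """Stand-in for the agent deciding which index/tool to call."""
    if "chart" in query or "figure" in query:
        return "image_index"
    if "table" in query or "revenue" in query:
        return "table_index"
    return "text_index"

def search(index: str, query: str) -> list[str]:
    """Stub retriever; a real system would query a vector store per index."""
    fake_results = {"table_index": ["table: Q3 revenue"],
                    "image_index": [],
                    "text_index": ["para: overview"]}
    return fake_results[index]

def agentic_retrieve(query: str, max_rounds: int = 3) -> list[str]:
    evidence, q = [], query
    for _ in range(max_rounds):
        evidence.extend(search(route(q), q))
        if evidence:                     # agent judges evidence sufficient
            break
        q = q + " (reformulated)"        # stand-in for LLM query rewriting
    return evidence

print(agentic_retrieve("What was Q3 revenue?"))
```

The essential point is the control flow: retrieval becomes a decision inside a loop, with routing, evaluation, and reformulation between rounds.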
Reasoning RAG
Reasoning RAG
Reasoning RAG performs multi-step retrieval with chain-of-thought reasoning between retrieval steps. Instead of a single retrieve-generate cycle, the model:
- Retrieves initial evidence
- Reasons about what is missing
- Retrieves additional evidence to fill gaps
- Repeats until sufficient evidence is gathered
- Generates the final answer with full chain of reasoning
This handles multi-hop questions where no single chunk contains the full answer.
Example: "How did the company's revenue growth compare to the industry average?" requires (1) retrieving the company's revenue table, (2) retrieving the industry benchmark data, (3) computing and comparing growth rates. A single retrieval step is unlikely to surface both pieces of information.
For document QA, start with this pipeline: (1) Parse documents with a layout-aware parser (e.g., marker, docling, or unstructured.io) to extract text, tables, and figures. (2) Chunk text semantically; keep tables and figures as individual chunks with their captions. (3) Embed all chunks using a multimodal embedding model (ColPali for visual, or text embeddings plus VLM-generated descriptions). (4) Retrieve top-k chunks with a hybrid search (dense retrieval plus keyword matching). (5) Generate the answer with a VLM that receives both text and images, and require source attribution in the output. Only add agentic or reasoning RAG when single-hop retrieval demonstrably fails on your eval set.
RAG does not solve hallucination. It reduces hallucination for in-scope queries where the answer exists in the retrieved documents. But RAG introduces its own failure modes: (1) the retriever fails to find the relevant chunk (recall failure), (2) the model ignores the retrieved evidence and generates from parametric memory anyway (faithfulness failure), (3) the model synthesizes an answer from multiple retrieved chunks that individually do not support it (compositionality hallucination). RAG changes the hallucination distribution; it does not eliminate hallucination. Measuring retrieval recall and answer faithfulness separately is essential.
Evaluation for Multimodal RAG
Evaluating multimodal RAG requires metrics beyond standard QA accuracy:
- Retrieval recall: Did the retriever surface the relevant chunks? Measured separately for text, table, and image chunks.
- Faithfulness: Is the generated answer supported by the retrieved evidence? Checked by a judge model or human evaluation.
- Source attribution: Can the system point to the specific chunk (paragraph, table, figure) that supports each claim?
- Multimodal coverage: For questions requiring visual information, does the pipeline retrieve and use the relevant images or tables?
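Measuring retrieval recall separately per modality is straightforward once the eval set labels each gold chunk with its modality. A sketch, assuming gold labels of the form chunk_id -> modality (an illustrative format):

```python
# Sketch: per-modality retrieval recall, assuming an eval set whose gold
# labels map each relevant chunk ID to its modality. Data is illustrative.

def recall_by_modality(retrieved: set[str], gold: dict[str, str]) -> dict[str, float]:
    """gold maps relevant chunk_id -> modality; returns recall per modality."""
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for chunk_id, modality in gold.items():
        totals[modality] = totals.get(modality, 0) + 1
        if chunk_id in retrieved:
            hits[modality] = hits.get(modality, 0) + 1
    return {m: hits.get(m, 0) / n for m, n in totals.items()}

gold = {"t1": "text", "t2": "text", "tab1": "table", "fig1": "figure"}
retrieved = {"t1", "t2", "tab1"}   # the relevant figure chunk was missed
print(recall_by_modality(retrieved, gold))
```

Aggregate recall would look healthy here (3 of 4), while the per-modality breakdown exposes that figure retrieval is failing completely, which is exactly the signal this metric exists to surface.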
Common Confusions
Multimodal RAG is not just adding images to the prompt
Simply including images in the context window is not multimodal RAG. RAG requires a retrieval step: the system must decide which images (out of potentially millions) are relevant to the query and include only those. Without retrieval, you are doing context stuffing, which does not scale beyond a handful of documents.
Better parsing beats better models
The most common failure in multimodal RAG pipelines is bad parsing, not bad generation. If a table is extracted as garbled text, no amount of model quality can recover the information. Investing in document parsing quality typically yields larger gains than upgrading the language model.
VLM-generated descriptions are lossy
A common shortcut is to convert all images to text descriptions and then use text-only RAG. This works for simple images but fails for data-rich content like charts, graphs, and technical diagrams where the description cannot capture the precise information. Use native multimodal retrieval when precision matters.
Summary
- Real documents are multimodal: text, tables, charts, figures, equations
- Text-only RAG misses 20-50% of information in technical documents
- The pipeline: parse, chunk (multimodal), embed (shared space), retrieve, generate
- Document parsing quality is the most important and most underinvested stage
- Agentic RAG: LLM decides when and what to retrieve
- Reasoning RAG: multi-step retrieval with chain-of-thought between steps
- RAG reduces hallucination for in-scope queries but does not eliminate it
- Evaluate retrieval recall and faithfulness separately
Exercises
Problem
You have a collection of 500 financial reports (PDFs). Each report contains approximately 30 pages of text, 15 tables, and 5 charts. You want to build a RAG pipeline that can answer questions about revenue trends shown in charts. Describe the minimal viable pipeline, identifying which stages are text-only and which require multimodal capabilities.
Problem
A user asks: "According to the Q3 report, did marketing spend increase faster than revenue?" This requires retrieving a marketing expense table and a revenue chart from different sections of the same document. Explain why single-hop RAG is likely to fail and design a reasoning RAG approach.
Problem
Propose a method to evaluate whether a multimodal RAG system is actually using retrieved images versus ignoring them and relying on text-only context. Design an experiment that isolates the contribution of visual retrieval.
References
Canonical:
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
Current:
- Faysse et al., "ColPali: Efficient Document Retrieval with Vision Language Models" (2024)
- Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (2024)
Next Topics
The natural next steps from multimodal RAG:
- Hallucination theory: understanding when retrieval fails to prevent confabulation and why faithfulness to retrieved evidence is not guaranteed
- Inference systems overview: the serving infrastructure required to run multimodal RAG pipelines at scale
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Context Engineering (Layer 5)
- KV Cache (Layer 5)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
Builds on This
- Document Intelligence (Layer 5)