Multimodal RAG: ColPali, DSE, Vision-LLM Citations (2026 Architecture)

Q: When should I deploy a vision-RAG pipeline like ColPali instead of staying on text-only RAG with a good parser?

The trigger is a stratified eval showing that the visual-content slice of the workload still scores materially below the running-text slice after layout-aware extraction has been deployed. The sequence matters. First, replace the basic text extractor with a layout-aware parser (Surya, Marker, or a managed API like Anthropic Document, AWS Textract, OpenAI File Search); this single change lifts the visual slice from 30–55 percent baseline accuracy to 55–70 percent and costs almost nothing operationally. Second, stratify the eval into text-route queries (prose, policy language, definitions) and vision-route queries (charts, diagrams, tables, screenshots, visual annotations). Third, look at the gap. If the vision-route slice is below 15 percent of queries and accuracy on it is 60 percent or higher after layout-aware extraction, the vision pipeline is unjustified and the better investment is hybrid search plus reranking. If the vision-route slice is 15–40 percent of queries and accuracy is 40–60 percent, deploy ColPali or DSE on the subset of the corpus where the visual content is concentrated and use a query classifier to route between the two pipelines. If the vision-route slice is above 40 percent of queries (typical for engineering documentation, financial filings with chart-heavy analysis, insurance underwriting files, regulatory submissions), the vision pipeline is the primary architecture and the text pipeline becomes the fallback for the lexical-match queries that vision is bad at. The mistake teams make is adopting vision-RAG to fix complaints that layout-aware extraction would have solved; the indexing cost premium is 5x to 20x, which is unjustified if the cheaper fix would have worked.
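The three threshold cases above can be written down as a small decision function. This is a minimal sketch, not a definitive rule: the function name and the returned labels are hypothetical, and the "needs-judgment" branch is an assumption covering combinations the thresholds do not address (for example a 25 percent vision slice that already scores 80 percent).

```python
def recommend_architecture(vision_share: float, vision_accuracy: float) -> str:
    """Map stratified-eval results to one of the recommendations in the text.

    vision_share: fraction of queries on the vision route (0.0 to 1.0).
    vision_accuracy: accuracy on that slice AFTER layout-aware extraction.
    Labels are illustrative names, not terms from any library.
    """
    if vision_share > 0.40:
        # Vision is the primary architecture; text is the lexical fallback.
        return "vision-primary"
    if vision_share < 0.15 and vision_accuracy >= 0.60:
        # Vision pipeline unjustified; invest in hybrid search plus reranking.
        return "text-only"
    if 0.15 <= vision_share <= 0.40 and 0.40 <= vision_accuracy <= 0.60:
        # ColPali or DSE on the visual subset, with a query classifier in front.
        return "hybrid-routed"
    # Assumption: combinations outside the stated bands need a human decision.
    return "needs-judgment"


print(recommend_architecture(0.10, 0.65))
print(recommend_architecture(0.25, 0.50))
print(recommend_architecture(0.55, 0.70))
```

The inputs only mean something if the eval was stratified after the layout-aware parser went in; running the same function on pre-parser numbers would overstate the case for vision.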

Q: What is the practical difference between ColPali and DSE (Document Screenshot Embeddings), and which should I pick?

ColPali, from Illuin Technology in mid-2024, applies the late-interaction idea from ColBERT (one vector per token, MaxSim scoring) to vision: a rendered page becomes roughly 1024 patch vectors of 128 dimensions, and a query becomes a set of token vectors of the same dimension; the score for a (query, page) pair is the sum over query tokens of the maximum cosine similarity between that query token and any patch on the page. The advantage is fine-grained spatial detail: the embedding of the chart in the top-right of the page is not destroyed by the embedding of the running text in the body, and query terms can match patches whose visual content corresponds to those terms. DSE (Document Screenshot Embeddings), from researchers at the University of Waterloo, takes the same "embed the rendered page" insight but uses single-vector embeddings: one dense vector per page, standard cosine similarity at retrieval time. The trade-offs follow the same pattern as single-vector versus late-interaction in text retrieval. ColPali has higher recall on long, visually heterogeneous pages and on rare-term queries, a larger index (200x to 1000x versus single-vector text), and slower retrieval at scale because MaxSim is more expensive than single-vector cosine. DSE has lower recall on heterogeneous pages, a much smaller index (comparable to single-vector text embeddings), and faster retrieval. The production-grade composition that several teams have converged on is DSE as the cheap shortlist on the full corpus and ColPali-style late-interaction reranking on the top 50–100 pages from the DSE shortlist; this composition gives most of the recall advantage of ColPali at a fraction of the indexing and retrieval cost. The empirical winner on any specific corpus depends on whether the corpus matches the fine-tuning distribution of the released models; domain-specific fine-tuning is the high-ROI investment for serious deployments.
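To make the MaxSim score concrete, here is a minimal pure-Python sketch of the scoring rule described above, plus a rerank step of the kind used on a single-vector shortlist. The function names and the toy two-dimensional vectors are illustrative assumptions; a real deployment would use 128-dimensional model outputs and a vectorized or engine-side implementation.

```python
import math


def _unit(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, patch_vecs):
    """Sum, over query tokens, of the best cosine similarity against any page patch.

    Assumes both lists are non-empty and all vectors share one dimension.
    """
    patches = [_unit(p) for p in patch_vecs]
    score = 0.0
    for q in map(_unit, query_vecs):
        score += max(sum(a * b for a, b in zip(q, p)) for p in patches)
    return score


def rerank(query_vecs, shortlist):
    """Reorder a shortlist {page_id: patch_vecs} by MaxSim, best first."""
    return sorted(
        shortlist,
        key=lambda pid: maxsim_score(query_vecs, shortlist[pid]),
        reverse=True,
    )


# Toy data: two query tokens, two candidate pages from a hypothetical shortlist.
query = [[1.0, 0.0], [0.0, 1.0]]
shortlist = {
    "page_a": [[1.0, 0.0], [0.0, 2.0]],  # one patch aligned with each query token
    "page_b": [[1.0, 1.0]],              # a single patch, partly aligned with both
}
print(rerank(query, shortlist))
```

In the toy data, page_a wins because each query token finds its own well-matched patch, which is the property that keeps a chart's embedding from being diluted by the surrounding body text.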

Q: What does the hybrid text-plus-vision retrieval architecture look like, and how does the query classifier decide which route to use?

Production multimodal RAG converges on three retrieval routes selected per query by a lightweight classifier. The text-only route runs hybrid search (BM25 plus dense vectors fused with reciprocal rank fusion) and a cross-encoder reranker on a text index built from layout-aware OCR output; this route is cheap and fast and handles prose, policy language, definitions, acronyms, exact-phrase searches, and regulatory citations. The vision-only route runs ColPali MaxSim or DSE cosine similarity against the page-level vision index; this route handles charts, diagrams, tables, screenshots, visual annotations, and any query whose answer lives in pixels rather than tokens. The fused route runs both pipelines and merges the top-k from each with reciprocal rank fusion (score = sum across routes of 1 / (k + rank), typically k=60) before reranking. RRF works on ranks rather than raw scores, so it needs no score normalization and is robust to the score-distribution differences between text and vision embeddings, which come from different models with different score ranges; its one constant, k, rarely needs tuning. The classifier is a small LLM call (10–40ms with gpt-4o-mini or Gemini 2.5 Flash) or a fine-tuned BERT-class model that categorises the query into one of the three routes. The calibration discipline is the same as the GraphRAG router and the Model Router pattern: label a few hundred queries against the correct route, train or prompt-engineer the classifier, monitor accuracy and drift in production, retrain as workload distribution shifts. The architectural payoff is large: on mixed workloads the classifier avoids the vision pipeline (the cost driver) on the 60–80 percent of queries that do not need it, reducing total cost by 40–60 percent versus running vision on everything, while preserving the recall lift on the queries that do need vision. The classifier is the cheap glue that makes the architecture economically viable.
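The fusion step of the fused route can be sketched in a few lines. This is a minimal illustration of reciprocal rank fusion with the formula given above; the function name, the page ids, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked id lists with reciprocal rank fusion.

    rankings: one best-first list of ids per route (e.g. text, vision).
    Each id scores the sum over routes of 1 / (k + rank), with rank starting at 1.
    Ids missing from a route simply contribute nothing for that route.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on the id so the output is deterministic (an illustrative choice).
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


text_route = ["p1", "p2", "p3"]    # hypothetical top-3 from the text pipeline
vision_route = ["p3", "p1", "p4"]  # hypothetical top-3 from the vision pipeline
print(rrf_fuse([text_route, vision_route]))
```

Only ranks enter the computation, so a BM25 score of 14.2 and a cosine of 0.83 never have to be put on a common scale; pages that both routes rank well rise to the top before the reranker sees them.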

Q: How do I handle tables in multimodal RAG — should they go through the text pipeline or the vision pipeline?

Both, selected per-table by whether the structured-extraction pass succeeded. The cheap path is table extraction to Markdown or HTML: a layout-aware parser detects the table region, a table-structure model (Microsoft Table Transformer, the Surya table model, the Marker table extractor) recovers the row-column grid, and the table goes into the text index as a structured chunk with cell boundaries preserved. Retrieval works on these tables because the cell content is now lexically searchable and the Markdown structure preserves the header-to-value relationship; embedding models read structured Markdown tables reasonably well. The cost is one extra model pass per page that contains a table and the recall lift on table-bound questions is large. The vision path is for tables the extractor mangles — tables with multi-line cells, nested headers, sparse layouts with merged cells, hand-drawn tables in scanned documents, tables with footnote-style annotations or visual emphasis that does not survive extraction. For these tables, the ColPali or DSE vision pipeline succeeds where the extractor failed, and the vision-LLM generator reads the table directly from the page image at generation time. The architectural decision is to deploy both paths and measure success rates on the eval set: pages where the extractor produced a clean structured table go to the text pipeline, pages where extraction produced a degraded or empty result go to the vision pipeline. The composition is cheaper than vision-on-everything and more accurate than extraction-on-everything. The tracking metric is per-table extraction confidence (from the parser) plus per-question accuracy stratified by table-success or table-failure pages; the cutover threshold on the confidence score is calibrated against the eval, not chosen in advance.
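The per-table routing and the eval-calibrated cutover can be sketched as follows. Everything here is a hypothetical illustration: `TableResult`, its `confidence` field, and the eval-row format are assumed shapes, not the output format of Surya, Marker, or any specific parser.

```python
from dataclasses import dataclass


@dataclass
class TableResult:
    """Assumed shape of one table-extraction result."""
    page_id: str
    markdown: str      # extracted table as Markdown, empty if extraction failed
    confidence: float  # parser-reported confidence in [0, 1]


def route_table(result: TableResult, threshold: float) -> str:
    """Send clean extractions to the text index, degraded or empty ones to vision."""
    if not result.markdown.strip() or result.confidence < threshold:
        return "vision"
    return "text"


def calibrate_threshold(eval_rows, candidates):
    """Pick the candidate threshold with the best eval accuracy.

    eval_rows: (confidence, text_path_correct, vision_path_correct) per question.
    A question counts as correct via the path its page would be routed to.
    """
    def accuracy(t):
        hits = sum(text_ok if conf >= t else vision_ok
                   for conf, text_ok, vision_ok in eval_rows)
        return hits / len(eval_rows)

    # max() keeps the first maximum, so ties go to the lowest threshold,
    # which routes the fewest pages to the more expensive vision path.
    return max(sorted(candidates), key=accuracy)


eval_rows = [(0.9, True, True), (0.8, True, False), (0.4, False, True), (0.3, False, True)]
threshold = calibrate_threshold(eval_rows, [0.2, 0.5, 0.85])
print(threshold, route_table(TableResult("p7", "| a | b |", 0.9), threshold))
```

Calibrating against the eval rather than picking a number up front matters because parser confidence scores are not comparable across tools; the same 0.5 can mean very different extraction quality on two parsers.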

Q: Why do vision-LLMs hallucinate on charts, and what production patterns reduce the failure rate?

Charts are harder than tables because the meaning of a chart is often a trend or a relationship rather than a discrete value; the vision-LLM has to integrate information across axes, multiple data series, colour-coded categories, and sometimes non-standard scales (logarithmic axes, dual-axis charts, broken axes). The 2026 frontier vision-LLMs (GPT-4o, Gemini 2.5 Pro, Claude Opus 4, Qwen2-VL) read most chart types reliably at moderate complexity but fail with characteristic patterns on harder cases: stacked bar charts with similar adjacent colours produce wrong attributions of which segment belongs to which value; line charts with multiple overlapping series produce wrong trend assignments; charts with non-standard axes produce confidently wrong absolute values; and charts referenced only by figure-number in the prompt without the actual image attached produce complete fabrications because the model invents the chart content from the figure caption. The production patterns that reduce the failure rate are layered on top of one another. Figure-aware retrieval augments every text chunk that references a figure (caption, body-text mention of "Figure 3.2", any "Chart" or "Diagram" reference) with a pointer to the page image at indexing time; when the chunk is retrieved the page image is automatically added to the generation prompt, so the model never has to invent the chart from a figure-number alone. The citation loop catches the wrong-attribution and wrong-trend failures: the model emits a claim like "Q3 revenue grew 18% year-over-year" with a bounding box pointing at the chart region, the verifier pass crops the region and asks "does this support the claim", and the verifier rejects the claim if the chart does not show what was claimed. For genuinely ambiguous charts the verifier rejects and the system either falls back to "the source contains a chart on this topic but the specific value is ambiguous" or escalates to human review, which is sometimes the right outcome rather than guessing.
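The two patterns above, figure-aware retrieval and the citation loop, can be sketched together. This is a simplified illustration under stated assumptions: the regex only catches numbered references such as "Figure 3.2" or "Fig. 4", `figure_index` is a hypothetical label-to-image mapping built at indexing time, and `verifier` is a stand-in callable for the crop-and-ask vision-LLM pass, not a real API.

```python
import re

# Numbered figure-style references; a simplification of "any Chart or Diagram reference".
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.|Chart|Diagram)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

FALLBACK = "the source contains a chart on this topic but the specific value is ambiguous"


def figure_page_images(chunk_text, figure_index):
    """Return page-image paths for every indexed figure the chunk mentions."""
    labels = {m.group(1) for m in FIGURE_REF.finditer(chunk_text)}
    return sorted(figure_index[label] for label in labels if label in figure_index)


def apply_citation_loop(claims, verifier):
    """Keep claims the verifier supports; replace the rest with the fallback text.

    claims: dicts with "text", "page", and "bbox" keys (assumed shape).
    verifier: callable(claim) -> bool, standing in for cropping the bbox
    and asking a vision-LLM whether the region supports the claim.
    """
    return [c["text"] if verifier(c) else FALLBACK for c in claims]


figure_index = {"3.2": "pages/p17.png", "4": "pages/p21.png"}  # hypothetical paths
chunk = "As Figure 3.2 shows, revenue rose; see also Fig. 4."
print(figure_page_images(chunk, figure_index))

claims = [
    {"text": "Q3 revenue grew 18% year-over-year", "page": 17, "bbox": (40, 60, 300, 220)},
    {"text": "Q4 margin doubled", "page": 17, "bbox": (40, 60, 300, 220)},
]
stub_verifier = lambda c: c["text"].startswith("Q3")  # toy rule for the demo only
print(apply_citation_loop(claims, stub_verifier))
```

A production version would also need an escalation path to human review alongside the fallback string; the sketch keeps only the fallback to stay short.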

Q: What is the real cost of a multimodal RAG query versus a text-only RAG query at 2026 prices?

The per-page indexing cost decomposes into page rendering (5–15ms per page on commodity CPU, near-zero cost line item), layout-aware OCR (2–8 pages per second per GPU on Surya on an A100, or $0.50–$2.00 per thousand pages on managed APIs like Anthropic Document or AWS Textract), and vision embedding (50–150 pages per second per A100 on ColPali, or $5–$15 per million pages on hosted inference); index storage is roughly 200KB per page for ColPali versus 5KB per text chunk, so a 10-million-page corpus stores at roughly 2TB on ColPali versus 50GB on the text baseline (assuming about one text chunk per page). The per-query inference cost decomposes into retrieval (50–200ms p99 for ColPali MaxSim on Vespa or Qdrant at 10-million-page scale, comparable to text retrieval) and generation, which is the dominant line item. A vision-LLM generation prompt with five page images (typical top-5 retrieval) at GPT-4o image pricing is roughly $0.02–$0.05 per query, versus $0.002–$0.005 for the same query on text-only RAG with five text chunks, a 10x to 25x cost multiplier on the generation step. The citation verifier adds another 0.5x to 2x on top of the generation cost. The cost levers that work in production are the same shape as the Cost Engineering LLM Features playbook applied to multimodal: the query classifier avoids the vision pipeline on text-only queries (typical 40–60 percent total-cost reduction on mixed workloads); a smaller verifier model reduces the verifier line item by 60–80 percent; page-image caching via prompt caching reduces cost by 30–50 percent on workloads with hot documents (same source documents queried repeatedly); the DSE shortlist plus ColPali rerank composition reduces embedding and retrieval cost by 3x to 5x versus ColPali on everything. The honest summary is that multimodal RAG is meaningfully more expensive than text-only RAG and the cost engineering levers compound; teams that ignore the levers ship architectures that work but cannot scale economically.
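The storage and routing arithmetic above is easy to check with a short script. This is a back-of-the-envelope sketch: the function names are hypothetical, the per-query costs are picked from inside the quoted ranges, and the classifier's own cost is assumed to be negligible.

```python
def index_storage_gb(pages: int, kb_per_page: float) -> float:
    """Index size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return pages * kb_per_page / 1_000_000


def blended_query_cost(vision_share, vision_cost, text_cost, verifier_multiplier=0.0):
    """Average generation cost per query when a classifier routes queries.

    vision_share: fraction of queries sent to the vision route.
    verifier_multiplier: extra cost on vision queries as a multiple of their
    generation cost (the text quotes 0.5x to 2x for the citation verifier).
    Assumption: the classifier call itself is cheap enough to ignore.
    """
    vision_total = vision_cost * (1 + verifier_multiplier)
    return vision_share * vision_total + (1 - vision_share) * text_cost


PAGES = 10_000_000
print(index_storage_gb(PAGES, 200))  # ColPali index, in GB
print(index_storage_gb(PAGES, 5))    # text baseline at one 5KB chunk per page, in GB

# Illustrative costs inside the quoted ranges: $0.03 vision, $0.003 text.
routed = blended_query_cost(0.40, 0.03, 0.003)
everything_on_vision = blended_query_cost(1.0, 0.03, 0.003)
print(round(1 - routed / everything_on_vision, 2))  # fractional saving from routing
```

With a 40 percent vision share and these illustrative costs, the routed blend comes to about $0.0138 per query against $0.03 for vision on everything, a reduction of roughly 54 percent, which sits inside the 40–60 percent range quoted for mixed workloads.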