Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

1 South China University of Technology
2 Huazhong University of Science and Technology
3 University of Maryland

In this work, we introduce Double-Bench, a comprehensive benchmark for evaluating document Retrieval-Augmented Generation (RAG) systems that use Multimodal Large Language Models (MLLMs). We identify critical limitations in existing evaluation approaches and propose solutions that enable more realistic and thorough assessment. Our contributions are three-fold:
  1. Comprehensive Problem Analysis. We diagnose four major limitations in existing document RAG evaluation: incomplete scope focusing only on specific components, unrealistic prior knowledge assumptions, ambiguous or non-unique evidence labels, and poorly designed multi-hop query synthesis that fails to evaluate genuine reasoning capabilities.
  2. A Novel Large-Scale Benchmark. We introduce Double-Bench, the first comprehensive evaluation system for multilingual and multimodal document RAG, featuring 3,276 documents (72,880 pages) and 5,168 human-validated single- and multi-hop queries across 6 languages and 4 document types. The benchmark includes exhaustively verified evidence pages, fine-grained component assessment, and dynamic update support to address data contamination.
  3. Extensive Evaluation and Critical Insights. We conducted comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 document RAG frameworks, revealing that the gap between text and visual embedding models is narrowing, document RAG frameworks suffer from an "over-confidence dilemma" where they provide answers without sufficient evidence, and retrieval accuracy remains the primary bottleneck rather than generation capabilities.

Abstract

Retrieval-Augmented Generation (RAG) systems built on Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific parts of the document RAG pipeline and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that provides fine-grained assessment of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic update support to counter potential data contamination. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks show that the gap between text and visual embedding models is narrowing, highlighting the need for stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide answers even without evidential support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to collect a fresh corpus and release new benchmark versions on an annual basis.

Double-Bench Construction

We introduce Double-Bench, a comprehensive multilingual and multimodal benchmark containing 3,276 documents (72,880 pages) and 5,168 human-validated queries across 6 languages and 4 document types, designed to evaluate document RAG systems holistically. The documents span high-quality PDFs, scanned documents, slides, and HTML pages collected from diverse sources, including academic papers, the CommonCrawl corpus, and Wikipedia entries. The Double-Bench dataset is constructed via a rigorous three-stage pipeline with extensive human validation and quality control:
  1. Metadata Collection and Preprocessing: This stage involves collecting diverse document types and applying systematic filtering. Raw documents undergo coarse-grained filtering (10-50 pages, language verification using GPT-4o) followed by modality decomposition using tools like Docling and MinerU to split each page into constituent text, table, and figure components. A fine-grained content filter then reviews parsed chunks with adjacent context to ensure semantic coherence and filter out irrelevant content.
  2. Iterative Query Synthesis with Validation: Single-hop queries are generated following four core principles (self-containment, targeting significant unimodal content, no explicit source referencing, variety and naturalness) with iterative refinement using GPT-4o and validation against the corpus using high-performance embedding models until each query yields ≤5 ground-truth pages (see the sketch after this list). Multi-hop queries employ a knowledge graph-based approach using LightRAG, where LLM agents perform guided graph walks to generate complex reasoning chains, with each step iteratively nested to form grammatically natural questions requiring sequential reasoning.
  3. Post-processing and Human Refinement: All generated queries undergo quality inspection against strict checklists, followed by exhaustive evidence labeling in which every page is thoroughly searched and marked as evidence only if it directly provides or leads to the answer. Human annotators review and adjust labels, with a 92% initial agreement rate, ensuring precise ground truth data and benchmark reliability.
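The corpus-validation loop in stage 2 can be illustrated with a minimal sketch. This is a hedged illustration rather than the released construction code: generate_query, refine_query, and embed_query are hypothetical stand-ins for the GPT-4o prompts and embedding models mentioned above.

```python
# Minimal sketch of the single-hop query validation loop (illustrative only).
# generate_query, refine_query, and embed_query are hypothetical helpers, not
# the authors' actual implementation; page_embeddings are assumed L2-normalized.
import numpy as np

MAX_GROUND_TRUTH_PAGES = 5   # a query is accepted once it matches <= 5 pages
MAX_REFINEMENT_ROUNDS = 3

def matching_pages(query_text, page_embeddings, embed_query, threshold=0.8):
    """Return indices of corpus pages whose cosine similarity exceeds the threshold."""
    q = embed_query(query_text)                      # shape (d,)
    sims = page_embeddings @ q                       # shape (num_pages,)
    return np.where(sims >= threshold)[0]

def synthesize_single_hop(chunk, page_embeddings, generate_query, refine_query, embed_query):
    query = generate_query(chunk)                    # LLM call following the four principles
    for _ in range(MAX_REFINEMENT_ROUNDS):
        matches = matching_pages(query, page_embeddings, embed_query)
        if len(matches) <= MAX_GROUND_TRUTH_PAGES:
            return query, matches.tolist()           # accepted: query plus candidate evidence pages
        # Too many matching pages: the query is ambiguous, ask the LLM to sharpen it.
        query = refine_query(query, chunk)
    return None, []                                  # discard queries that never converge
```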

Double-Bench Comparison

📄 PDFs, 📰 Scanned documents, 🎯 Slides, 🌐 HTML pages; Lang.: number of supported languages

| Benchmark | #Docs | Avg. #Pages | #Queries | Lang. | Document Types |
| --- | --- | --- | --- | --- | --- |
| DocVQA | 6,071 | 1.0 | 50,000 | 1 | 📄 📰 |
| MMLongbench-Doc | 135 | 47.5 | 1,082 | 1 | 📄 |
| MMDocIR | 6,818 | 65.1 | 73,843 | 1 | 📄 |
| UDA-QA | 2,965 | 46.3 | 29,590 | 1 | 📄 |
| ViDoRe v1 | 5,000 | 1.0 | 500 | 2 | 📄 📰 |
| ViDoRe v2 | 65 | 48.6 | 913 | 2 | 📄 |
| ViDoSeek | 1,142 | 18.4 | 1,142 | 1 | 🎯 |
| REAL-MM-RAG | 163 | 49.1 | 4,553 | 1 | 🎯 |
| Double-Bench | 3,276 | 22.3 | 5,168 | 6 | 📄 📰 🎯 🌐 |

Benchmark Results for Double-Bench

Our comprehensive evaluation across 9 state-of-the-art embedding models, 4 MLLMs, and 4 document RAG frameworks reveals critical insights into the current state of document RAG systems. We systematically assessed performance across multiple dimensions including query complexity (single-hop vs. multi-hop), linguistic diversity (6 languages), document types (PDFs, scanned documents, slides, HTML), and content modalities (text, tables, figures). Most notably, we discovered a pervasive "overconfidence dilemma" where advanced RAG frameworks indiscriminately attempt to answer queries regardless of evidence quality, trading trustworthiness for coverage, and identified that retrieval accuracy—rather than generation capability—remains the primary bottleneck limiting overall system performance.
Retrieval accuracy of state-of-the-art text and multimodal embedding models across query types, showing performance degradation as reasoning complexity increases. Each cell reports hit@1 / hit@3 / hit@5.

| Model | Average | Single-hop | 2-hop | 3-hop |
| --- | --- | --- | --- | --- |
| Text embedding models | | | | |
| Qwen3-Embedding-4B | 0.489 / 0.699 / 0.776 | 0.726 / 0.852 / 0.886 | 0.314 / 0.598 / 0.663 | 0.235 / 0.531 / 0.668 |
| NV-Embed-v2 | 0.443 / 0.650 / 0.724 | 0.626 / 0.756 / 0.796 | 0.333 / 0.604 / 0.689 | 0.240 / 0.526 / 0.641 |
| gte-Qwen2-7B-instruct | 0.404 / 0.611 / 0.697 | 0.585 / 0.749 / 0.804 | 0.288 / 0.503 / 0.603 | 0.205 / 0.466 / 0.588 |
| bge-m3 | 0.355 / 0.525 / 0.591 | 0.527 / 0.648 / 0.695 | 0.180 / 0.366 / 0.428 | 0.182 / 0.412 / 0.502 |
| Visual & multimodal embedding models | | | | |
| colqwen2.5-3b-multilingual | 0.533 / 0.727 / 0.795 | 0.778 / 0.865 / 0.895 | 0.326 / 0.622 / 0.693 | 0.277 / 0.579 / 0.696 |
| vdr-2b-multi | 0.463 / 0.648 / 0.725 | 0.688 / 0.813 / 0.847 | 0.283 / 0.491 / 0.589 | 0.225 / 0.482 / 0.606 |
| jina-embeddings-v4 | 0.451 / 0.641 / 0.720 | 0.671 / 0.804 / 0.844 | 0.264 / 0.468 / 0.570 | 0.222 / 0.479 / 0.603 |
| gme-Qwen2-VL-7B-Instruct | 0.428 / 0.614 / 0.697 | 0.638 / 0.775 / 0.822 | 0.249 / 0.472 / 0.579 | 0.208 / 0.449 / 0.570 |
| colpali-v1.3 | 0.403 / 0.571 / 0.646 | 0.584 / 0.679 / 0.717 | 0.230 / 0.440 / 0.525 | 0.220 / 0.469 / 0.588 |
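The table reports hit@k: whether any labeled evidence page appears among the top-k retrieved pages for a query. Below is a minimal sketch of how such a metric can be computed; the per-query data layout is an assumption for illustration, not the Double-Bench evaluation code.

```python
# Minimal sketch of hit@k over labeled evidence pages (illustrative data layout).
def hit_at_k(ranked_pages, gold_pages, k):
    """1.0 if any gold evidence page appears among the top-k retrieved pages."""
    return float(any(p in gold_pages for p in ranked_pages[:k]))

def evaluate_retrieval(results, k_values=(1, 3, 5)):
    """results: list of (ranked_page_ids, set_of_gold_page_ids), one per query."""
    return {
        f"hit@{k}": sum(hit_at_k(r, g, k) for r, g in results) / len(results)
        for k in k_values
    }

# Example: two queries with gold evidence on pages {3} and {7, 9}.
results = [([3, 12, 5, 8, 1], {3}), ([4, 2, 9, 7, 6], {7, 9})]
print(evaluate_retrieval(results))   # {'hit@1': 0.5, 'hit@3': 1.0, 'hit@5': 1.0}
```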

Empirical Results from Double-Bench

Document-specialized embedding models outperform general ones, and the gap between text and image embedding models is narrowing.

Double-Bench reveals a clear divergence in the retrieval performance of different embedding models. The model rankings on Double-Bench align well with the popular text embedding leaderboard MTEB and the document retrieval benchmark ViDoRe v2, demonstrating the robustness of our benchmark.

ColQwen2.5-3B significantly outperforms general multimodal embedding models such as jina-embeddings-v4 and GME, achieving a 9% higher average hit rate and demonstrating strong potential for document retrieval. Other multimodal embedding models show limited capability, even underperforming the purely textual embedding model Qwen3-Embedding. We attribute this to the text embedding community's recent advances in sophisticated training techniques, including complex multi-stage training, dedicated hard-negative sampling, and large-scale synthesis of high-quality data. These techniques are difficult to transfer to visual embedding models due to training costs, limited paired text-and-image data, and model structural constraints. Although visual embedding models have inherent advantages for retrieving visual content, the semantic complexity of document RAG tasks negates this advantage. The critical influence of both visual observation and textual understanding incentivizes combined strategies such as interleaved embedding models and advanced multimodal understanding pipelines.
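As one illustration of such a combined strategy, the sketch below performs late score fusion between a text page embedder and a visual page embedder at retrieval time. The encoders, page fields (OCR text and page image), and weighting are illustrative assumptions, not components of Double-Bench.

```python
# Illustrative late fusion of text and visual page embeddings (not part of Double-Bench).
import numpy as np

def normalize(x):
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def fused_retrieval(query, pages, text_encode, visual_encode, alpha=0.5, top_k=5):
    """Rank pages by a weighted sum of text and visual cosine similarities.

    `pages` is assumed to be a list of dicts with "page_id", "ocr_text", and "image";
    `text_encode` / `visual_encode` are assumed batch encoders returning embeddings.
    """
    q_text = normalize(text_encode([query]))                            # (1, d_t)
    q_vis = normalize(visual_encode([query]))                           # (1, d_v)
    p_text = normalize(text_encode([p["ocr_text"] for p in pages]))     # (N, d_t)
    p_vis = normalize(visual_encode([p["image"] for p in pages]))       # (N, d_v)
    scores = alpha * (p_text @ q_text.T).ravel() + (1 - alpha) * (p_vis @ q_vis.T).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(pages[i]["page_id"], float(scores[i])) for i in order]
```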

Double-Bench is high-quality and exhibits low contamination: MLLMs still need retrieved evidence to answer questions correctly.

State-of-the-art MLLMs such as GPT-4o, Gemini, and Qwen can produce generic responses without any context, with 50% to 70% of responses being partially correct. Providing evidence pages to the MLLMs substantially boosts accuracy, yielding 3x to 5x as many completely correct responses as the no-RAG setting. This indicates that our benchmark is well-suited for evaluating the retrieval and synthesis components of RAG systems, as it clearly distinguishes context-grounded reasoning from a model's inherent knowledge. Notably, the robust performance of Qwen2.5-VL in the upper-bound setting, which closely mirrors our benchmark curation pipeline, further suggests that our pipeline reliably identifies the correct evidence pages for each query.
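The three answering settings contrasted here (closed-book, retrieval-augmented, and the gold-evidence upper bound) can be sketched as follows. ask_mllm, render_pages, and retrieve are hypothetical interfaces used only for illustration, not the actual evaluation harness.

```python
# Sketch of the three answering settings: closed-book, standard RAG, and the
# gold-evidence upper bound. Helper functions are assumed, not real APIs.
def answer_under_setting(query, setting, ask_mllm, render_pages, retrieve, gold_pages=None):
    if setting == "closed_book":
        context_images = []                           # model relies on parametric knowledge only
    elif setting == "rag":
        context_images = render_pages(retrieve(query, top_k=5))
    elif setting == "upper_bound":
        context_images = render_pages(gold_pages)     # human-verified evidence pages
    else:
        raise ValueError(f"unknown setting: {setting}")
    prompt = (
        "Answer the question using the provided document pages, if any. "
        "If the pages do not contain the answer, say so explicitly.\n"
        f"Question: {query}"
    )
    return ask_mllm(prompt, images=context_images)
```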

The bottleneck of document RAG frameworks still lies in retrieval accuracy, where designing more advanced retrieval strategies may help.

Most frameworks strive to design complex information-mining pipelines to extract maximum value from retrieved pages, yet pay little attention to the retrieval stage itself. However, our experiments demonstrate strong correlations between retrieval accuracy and answer accuracy. Equipped with only a single MLLM pass, Colqwen-gen even partially outperforms MDocAgent on multi-hop queries, despite the latter orchestrating multiple agents to produce its final answers. This underscores the critical importance of optimizing the retrieval stage, potentially through finer-grained document preprocessing, exploiting the hierarchical and semantic structure of documents, and developing more powerful or integrated embedding models.
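The correlation between retrieval and answer accuracy can be quantified, for instance, with a phi coefficient over per-query binary outcomes. The record format below is an assumption made for illustration; this is a sketch of the analysis, not released code.

```python
# Phi coefficient between retrieval success and answer correctness, computed
# from per-query binary outcomes (assumed record format: (hit, correct) pairs).
def phi_coefficient(records):
    """records: list of (retrieval_hit: bool, answer_correct: bool)."""
    records = list(records)
    n11 = sum(1 for h, c in records if h and c)          # retrieval hit, answer correct
    n10 = sum(1 for h, c in records if h and not c)      # retrieval hit, answer wrong
    n01 = sum(1 for h, c in records if not h and c)      # retrieval miss, answer correct
    n00 = sum(1 for h, c in records if not h and not c)  # retrieval miss, answer wrong
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```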

The overconfidence dilemma: trading trustworthiness for answers.

To investigate the bottleneck in existing RAG frameworks, we break down each response of M3DocRAG and MDocAgent to determine whether the error comes from retrieval or answering, and we examine the trade-off between answer accuracy and the ability to identify insufficient information (i.e., honesty).
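A minimal sketch of this breakdown is shown below: each response is bucketed by whether retrieval hit a labeled evidence page and whether the framework chose to answer, which also exposes the honesty dimension (refusing when evidence is missing). The category names are illustrative shorthand, not the paper's official taxonomy.

```python
# Illustrative breakdown of a framework response into retrieval vs. answering
# outcomes; category labels are shorthand for this sketch only.
def classify_response(retrieved_pages, gold_pages, answered, answer_correct):
    retrieval_hit = bool(set(retrieved_pages) & set(gold_pages))
    if retrieval_hit:
        if not answered:
            return "over_cautious"        # evidence was retrieved, but the model refused
        return "correct" if answer_correct else "generation_error"
    # Retrieval missed all labeled evidence pages.
    if not answered:
        return "honest_refusal"           # correctly identified insufficient information
    return "hallucination_risk"           # answered without evidence support
```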

Our experiments reveal a striking divergence in agent behavior. Simpler agents like M3DocRAG adopt a cautious strategy, answering a lower proportion of queries with successfully retrieved context but reliably identifying retrieval failures and refusing to respond. In contrast, more complex agents like MDocAgent and ViDoRAG exhibit significant overconfidence. While they achieve higher accuracy on retrieval hits, they indiscriminately attempt to answer nearly every query, regardless of whether sufficient information was retrieved. This frequently leads to speculative or entirely hallucinated content when evidence pages are missed.

This observation indicates that recent document RAG development has over-emphasized maximizing answer generation at the expense of "epistemic humility", i.e., the crucial skill of knowing what the system does not know and admitting when an answer cannot be found. Consequently, we argue that future research should pursue more trustworthy RAG frameworks in which identifying informational gaps is valued as highly as answer accuracy.

Inference patterns of MLLMs as response models.

We also observe distinct answering strategies in MLLMs. When directly given a multi-hop query, response models tend not to process it hop by hop. Instead, they first collect signature information---the most distinguishing or identifiable pieces---from the various hops, and then perform a direct inclusion-based elimination to arrive at the final answer. This mechanism differs significantly from our expectation of how models would sequentially solve multi-hop queries, and it offers a compelling insight: merely increasing the number of hops may not increase a query's difficulty, which warrants further investigation. A case study in the Appendix illustrates this behavior.

Acknowledgement

Many thanks to the members of ONE Lab for their invaluable effort on this project. Also thanks to the video game PEAK for infinite happiness and fun. This website is based on templates from LiveVQA.

BibTeX

