MMDR (Multi-Modal Deep Research) is an automated pipeline for end-to-end multimodal deep research, fact verification, and citation-grounded synthesis. It follows a dual-role architecture (Writer & Judge) to generate research reports and evaluate them with strict, grounded criteria; a minimal sketch of how the Judge's checks compose appears after the feature list below.
- FLAE (Formula-LLM Adaptive Evaluation): Measures report quality (readability, insightfulness, structure).
- TRACE (Trustworthy Retrieval-Aligned Citation Evaluation): Verifies citation support and claim–URL alignment.
- VEF (Visual Evidence Fidelity): A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
- MOSAIC (Multimodal Support-Aligned Integrity Check): Validates consistency between generated text and visual artifacts (Charts, Diagrams, Photos).
- Smart Resume: Skips already-completed tasks to reduce time and API cost.
- Graceful Stop: Safe shutdown via CLI (`stop`, `exit`) or `Ctrl+C`, ensuring partial results are flushed.
- Precision Debugging: Run a single case with `--quiz_first` or `--quiz_index`.
- Multi-Provider Support: Google Gemini, Azure OpenAI, OpenRouter.
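As a minimal sketch of the dual-role flow (all names are illustrative, not the actual MMDR API), the Judge's checks might compose into a final verdict like this, with VEF acting as a hard gate:

```python
# Illustrative composition of the Judge's checks; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class JudgeResult:
    flae: float      # FLAE: report quality (readability, insight, structure)
    trace: float     # TRACE: citation support / claim-URL alignment
    mosaic: float    # MOSAIC: text <-> visual artifact consistency
    vef_pass: bool   # VEF: strict visual-evidence gate (PASS/FAIL)

def final_verdict(r: JudgeResult, threshold: float = 0.7) -> str:
    """VEF is a gatekeeper: a report that fails it is rejected outright,
    no matter how well it scores on the graded metrics."""
    if not r.vef_pass:
        return "FAIL"
    avg = (r.flae + r.trace + r.mosaic) / 3
    return "PASS" if avg >= threshold else "FAIL"
```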
```bash
git clone https://github.com/YourUsername/MMDR.git
cd MMDR
pip install -r requirements.txt
```
Copy the environment template (assumed here to load as `.env`; adjust if your setup differs) and fill in your keys:

```bash
cp env.txt .env
```
Example (adjust to your providers/models):
```env
# --- Roles ---
MMDR_REPORT_PROVIDER=gemini        # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure          # recommended: a strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
```
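For intuition, these variables might be consumed roughly as follows; `resolve_role` below is purely illustrative and not part of the actual codebase:

```python
# Illustrative only: how the role/provider variables above could be resolved.
import os

KEY_VARS = {
    "gemini": "GEMINI_API_KEY",
    "azure": "AZURE_OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def resolve_role(role: str) -> dict:
    """Read MMDR_<ROLE>_PROVIDER / MMDR_<ROLE>_MODEL and the matching API key."""
    provider = os.getenv(f"MMDR_{role}_PROVIDER", "gemini")
    model = os.getenv(f"MMDR_{role}_MODEL")
    api_key = os.getenv(KEY_VARS[provider])
    if not (model and api_key):
        raise RuntimeError(f"Incomplete configuration for role {role!r}")
    return {"provider": provider, "model": model, "api_key": api_key}

writer = resolve_role("REPORT")   # e.g. gemini / gemini-1.5-pro
judge = resolve_role("JUDGE")     # e.g. azure / gpt-4o
```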
Run the first question only to confirm API + paths:
```bash
python run_pipeline.py --quiz_first
```
Process all tasks in `quiz.jsonl`:

```bash
python run_pipeline.py --run_id experiment_v1
```
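Re-running with the same `--run_id` triggers Smart Resume: tasks already present in the run's results file are skipped. A rough sketch of the idea (field and path names are assumptions, mirroring the output layout shown later):

```python
# Sketch of the Smart Resume idea: skip tasks whose results already exist.
import json
from pathlib import Path

def completed_task_ids(results_file: Path) -> set[str]:
    """Collect IDs of tasks already recorded in the run's results JSONL."""
    done = set()
    if results_file.exists():
        for line in results_file.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["task_id"])  # hypothetical field name
    return done

def pending_tasks(all_tasks: list[dict], results_file: Path) -> list[dict]:
    done = completed_task_ids(results_file)
    return [t for t in all_tasks if t["task_id"] not in done]
```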
Re-run a single item by 1-based index:
```bash
python run_pipeline.py --quiz_index 5 --run_id debug_q5
```
Process tasks in parallel by raising the worker count:

```bash
python run_pipeline.py --max_workers 4
```
| Command | Action |
|---|---|
| `stop` + Enter | Safely stop after current tasks finish; saves outputs |
| `Ctrl+C` | Triggers the same graceful shutdown behavior |
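Both paths typically funnel into one shared flag that workers check between tasks; the following is a pattern sketch under that assumption, not MMDR's actual shutdown code:

```python
# Pattern sketch for Graceful Stop: 'stop'/'exit' on stdin and Ctrl+C both
# flip a shared event; workers finish their current task, then exit cleanly.
import signal
import sys
import threading
import time

stop_event = threading.Event()

def watch_stdin() -> None:
    """Console listener: typing 'stop' or 'exit' requests a graceful shutdown."""
    for line in sys.stdin:
        if line.strip().lower() in {"stop", "exit"}:
            stop_event.set()
            break

# Ctrl+C (SIGINT) triggers the same path as the console commands.
signal.signal(signal.SIGINT, lambda *_: stop_event.set())
threading.Thread(target=watch_stdin, daemon=True).start()

def run(tasks):
    for task in tasks:
        if stop_event.is_set():
            print("Stopping: completed results are already flushed.")
            break
        time.sleep(1)  # placeholder for real per-task work + result flush
        print(f"finished {task}")

run(["Q1", "Q2", "Q3"])
```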
Outputs are written to `reports_runs/<RUN_ID>/`:

```
reports_runs/experiment_v1/
├── reports/                   # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl    # detailed logs (scores/errors/timings)
├── summary/
│   └── experiment_v1.txt      # aggregated stats (pass rate / avg scores)
└── mm/                        # multimodal intermediate artifacts
```
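The aggregated numbers in `summary/` can also be recomputed from the results JSONL if needed; for example (the `vef` field name is an assumption, check your actual schema):

```python
# Recompute the pass rate from a run's results JSONL.
import json
from pathlib import Path

path = Path("reports_runs/experiment_v1/results/experiment_v1.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
passed = sum(1 for r in records if r.get("vef") == "PASS")  # hypothetical field
print(f"pass rate: {passed / len(records):.1%}")
```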
If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:
```bibtex
@article{mmdrbench2025,
  title={MMDeepResearch-Bench: Grounded Evaluation and Alignment for Multimodal Deep Research Agents},
  author={Anonymous},
  journal={arXiv preprint},
  year={2025}
}
```
This project is released under the Apache-2.0 License. See LICENSE.