
# 🚀 MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents


MMDR (Multi-Modal Deep Research) is an automated pipeline for end-to-end multimodal deep research, fact verification, and citation-grounded synthesis. It follows a dual-role architecture (Writer & Judge) to generate research reports and evaluate them with strict, grounded criteria.


## ✨ Key Features

### 🔬 Evaluation Framework

- **FLAE (Formula-LLM Adaptive Evaluation):** Measures report quality (readability, insightfulness, structure).
- **TRACE (Trustworthy Retrieval-Aligned Citation Evaluation):** Verifies citation support and claim–URL alignment.
  - **VEF (Visual Evidence Fidelity):** A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
- **MOSAIC (Multimodal Support-Aligned Integrity Check):** Validates consistency between generated text and visual artifacts (charts, diagrams, photos).
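
As a rough mental model, a per-task result combining these four signals might look like the sketch below; the class and field names are illustrative assumptions, not the benchmark's actual data model:

```python
from dataclasses import dataclass

@dataclass
class TaskEvaluation:
    """Hypothetical per-task record combining the four metric families."""
    task_id: str
    flae_score: float          # report quality (e.g., a rubric score)
    trace_supported: int       # citations whose URLs support their claims
    trace_total: int           # citations checked in total
    vef_pass: bool             # strict visual-evidence gate (PASS/FAIL)
    mosaic_consistent: bool    # text vs. visual-artifact consistency

    @property
    def citation_precision(self) -> float:
        """Fraction of checked citations that are actually supported."""
        return self.trace_supported / self.trace_total if self.trace_total else 0.0

    def passes_gates(self) -> bool:
        """VEF acts as a gatekeeper: a FAIL overrides the other scores."""
        return self.vef_pass and self.mosaic_consistent
```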

### 🛠️ Engineering & Usability

- **Smart Resume:** Skips already-completed tasks to reduce time and API cost.
- **Graceful Stop:** Safe shutdown via CLI (`stop`, `exit`) or `Ctrl+C`, ensuring partial results are flushed.
- **Precision Debugging:** Run a single case with `--quiz_first` or `--quiz_index`.
- **Multi-Provider Support:** Google Gemini, Azure OpenAI, OpenRouter.
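
Smart Resume is typically just bookkeeping over the results file. The sketch below is a hypothetical illustration; the JSONL layout and `task_id` field are assumptions, not the pipeline's actual internals:

```python
import json
from pathlib import Path

def load_completed_ids(results_path: Path) -> set:
    """Collect task IDs already present in the results JSONL (assumed schema)."""
    done = set()
    if results_path.exists():
        with results_path.open() as f:
            for line in f:
                record = json.loads(line)
                done.add(record["task_id"])  # hypothetical field name
    return done

def pending_tasks(all_tasks: list, results_path: Path) -> list:
    """Skip tasks already written to disk, so reruns only pay for new work."""
    done = load_completed_ids(results_path)
    return [t for t in all_tasks if t["id"] not in done]
```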

## 📦 Installation

### 1) Clone

```bash
git clone https://github.com/YourUsername/MMDR.git
cd MMDR
```

### 2) Install dependencies

```bash
pip install -r requirements.txt
```

## ⚙️ Configuration

### 1) Create .env

```bash
cp env.txt .env
```

### 2) Edit .env

Example (adjust to your providers/models):

```env
# --- Roles ---
MMDR_REPORT_PROVIDER=gemini       # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure         # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
```
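
To sanity-check the configuration before a full run, a small loader along these lines can help. This is a hypothetical sketch assuming the `python-dotenv` package; only the variable names above come from the example:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory

report_provider = os.getenv("MMDR_REPORT_PROVIDER", "gemini")
judge_provider = os.getenv("MMDR_JUDGE_PROVIDER", "azure")

# Keys each provider is assumed to require, per the example above.
required = {
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openrouter": ["OPENROUTER_API_KEY"],
}

for provider in {report_provider, judge_provider}:
    missing = [k for k in required.get(provider, []) if not os.getenv(k)]
    if missing:
        raise SystemExit(f"Missing env vars for {provider}: {missing}")
print("Configuration looks complete.")
```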

## 🚀 Usage

### 1) Quick verification (recommended first run)

Run the first question only to confirm API access and file paths:

```bash
python run_pipeline.py --quiz_first
```

### 2) Full batch run

Process all tasks in `quiz.jsonl`:

```bash
python run_pipeline.py --run_id experiment_v1
```

### 3) Targeted debugging

Re-run a single item by 1-based index:

```bash
python run_pipeline.py --quiz_index 5 --run_id debug_q5
```

### 4) Parallel mode

```bash
python run_pipeline.py --max_workers 4
```
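
A `--max_workers` flag usually maps onto a bounded worker pool. The sketch below shows the general pattern with Python's `concurrent.futures`; `run_task` is a stand-in, not the pipeline's real entry point:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task: dict) -> dict:
    """Placeholder for one report-generation + judging round trip."""
    return {"id": task["id"], "status": "ok"}

def run_batch(tasks: list, max_workers: int = 4) -> list:
    """Fan tasks out across a bounded thread pool, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```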

## 🎮 Runtime Controls

| Command | Action |
| --- | --- |
| `stop` + Enter | Safely stop after current tasks finish; saves outputs |
| `Ctrl+C` | Triggers the same graceful shutdown behavior |
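
Both controls usually converge on one shared flag that is checked between tasks. The following is a hypothetical sketch of that pattern, not the repository's implementation:

```python
import signal
import threading
import time

stop_requested = threading.Event()

def _handle_sigint(signum, frame):
    """Ctrl+C sets the flag instead of killing the process mid-task."""
    stop_requested.set()

signal.signal(signal.SIGINT, _handle_sigint)

def _watch_stdin():
    """Background thread: typing 'stop' or 'exit' sets the same flag."""
    while not stop_requested.is_set():
        try:
            line = input()
        except EOFError:
            return
        if line.strip().lower() in {"stop", "exit"}:
            stop_requested.set()
            return

threading.Thread(target=_watch_stdin, daemon=True).start()

for task_id in range(10):          # stand-in for the real task queue
    if stop_requested.is_set():
        print("Stopping after current task; partial results stay on disk.")
        break
    time.sleep(1)                  # stand-in for one generate+judge round trip
```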

## 📂 Output Structure

Outputs are written to `reports_runs/<RUN_ID>/`:

```
reports_runs/experiment_v1/
├── reports/                  # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl   # detailed logs (scores/errors/timings)
├── summary/
│   └── experiment_v1.txt     # aggregated stats (pass rate/avg scores)
└── mm/                       # multimodal intermediate artifacts
```
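
If you want to aggregate a finished run yourself, the per-run JSONL is easy to fold into summary numbers. In this sketch the `score` and `error` fields are assumptions, since the exact log schema isn't documented here:

```python
import json
from pathlib import Path

results_path = Path("reports_runs/experiment_v1/results/experiment_v1.jsonl")

with results_path.open() as f:
    records = [json.loads(line) for line in f if line.strip()]

# "score" and "error" are assumed field names, used for illustration only.
scores = [r["score"] for r in records if isinstance(r.get("score"), (int, float))]
errors = sum(1 for r in records if r.get("error"))

print(f"tasks: {len(records)}  errors: {errors}")
if scores:
    print(f"avg score: {sum(scores) / len(scores):.2f}")
```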


## 🧾 Citation

If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:

```bibtex
@article{mmdrbench2025,
  title={MMDeepResearch-Bench: Grounded Evaluation and Alignment for Multimodal Deep Research Agents},
  author={Anonymous},
  journal={arXiv preprint},
  year={2025}
}
```

## 📜 License

This project is released under the Apache-2.0 License. See `LICENSE`.
