
# 🚀 MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents


MMDR (Multi-Modal Deep Research) is an automated pipeline for end-to-end multimodal deep research, fact verification, and citation-grounded synthesis. It follows a dual-role architecture (Writer & Judge) to generate research reports and evaluate them with strict, grounded criteria.


## ✨ Key Features

### 🔬 Evaluation Framework

- **FLAE (Formula-LLM Adaptive Evaluation):** Measures report quality (readability, insightfulness, structure).
- **TRACE (Trustworthy Retrieval-Aligned Citation Evaluation):** Verifies citation support and claim–URL alignment.
  - **VEF (Visual Evidence Fidelity):** A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
- **MOSAIC (Multimodal Support-Aligned Integrity Check):** Validates consistency between generated text and visual artifacts (charts, diagrams, photos).
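
As a rough mental model, a per-task result combining these four signals might look like the sketch below; the class and field names are illustrative assumptions, not the benchmark's actual data model:

```python
from dataclasses import dataclass

@dataclass
class TaskEvaluation:
    """Hypothetical per-task record combining the four metric families."""
    task_id: str
    flae_score: float          # report quality (e.g., a rubric score)
    trace_supported: int       # citations whose URLs support their claims
    trace_total: int           # citations checked in total
    vef_pass: bool             # strict visual-evidence gate (PASS/FAIL)
    mosaic_consistent: bool    # text vs. visual-artifact consistency

    @property
    def citation_precision(self) -> float:
        """Fraction of checked citations that are actually supported."""
        return self.trace_supported / self.trace_total if self.trace_total else 0.0

    def passes_gates(self) -> bool:
        """VEF acts as a gatekeeper: a FAIL overrides the other scores."""
        return self.vef_pass and self.mosaic_consistent
```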

### 🛠️ Engineering & Usability

- **Smart Resume:** Skips already-completed tasks to reduce time and API cost.
- **Graceful Stop:** Safe shutdown via CLI (`stop`, `exit`) or `Ctrl+C`, ensuring partial results are flushed.
- **Precision Debugging:** Run a single case with `--quiz_first` or `--quiz_index`.
- **Multi-Provider Support:** Google Gemini, Azure OpenAI, OpenRouter.
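
Smart Resume is typically just bookkeeping over the results file. The sketch below is a hypothetical illustration; the JSONL layout and `task_id` field are assumptions, not the pipeline's actual internals:

```python
import json
from pathlib import Path

def load_completed_ids(results_path: Path) -> set:
    """Collect task IDs already present in the results JSONL (assumed schema)."""
    done = set()
    if results_path.exists():
        with results_path.open() as f:
            for line in f:
                record = json.loads(line)
                done.add(record["task_id"])  # hypothetical field name
    return done

def pending_tasks(all_tasks: list, results_path: Path) -> list:
    """Skip tasks already written to disk, so reruns only pay for new work."""
    done = load_completed_ids(results_path)
    return [t for t in all_tasks if t["id"] not in done]
```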

## 📦 Installation

### 1) Clone

```bash
git clone https://github.com/YourUsername/MMDR.git
cd MMDR
```

### 2) Install dependencies

```bash
pip install -r requirements.txt
```

## ⚙️ Configuration

### 1) Create .env

```bash
cp env.txt .env
```

### 2) Edit .env

Example (adjust to your providers/models):

```env
# --- Roles ---
MMDR_REPORT_PROVIDER=gemini       # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure         # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
```
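
To sanity-check the configuration before a full run, a small loader along these lines can help. This is a hypothetical sketch assuming the `python-dotenv` package; only the variable names above come from the example:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory

report_provider = os.getenv("MMDR_REPORT_PROVIDER", "gemini")
judge_provider = os.getenv("MMDR_JUDGE_PROVIDER", "azure")

# Keys each provider is assumed to require, per the example above.
required = {
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openrouter": ["OPENROUTER_API_KEY"],
}

for provider in {report_provider, judge_provider}:
    missing = [k for k in required.get(provider, []) if not os.getenv(k)]
    if missing:
        raise SystemExit(f"Missing env vars for {provider}: {missing}")
print("Configuration looks complete.")
```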

## 🚀 Usage

### 1) Quick verification (recommended first run)

Run the first question only to confirm API access and file paths:

```bash
python run_pipeline.py --quiz_first
```

### 2) Full batch run

Process all tasks in `quiz.jsonl`:

```bash
python run_pipeline.py --run_id experiment_v1
```

### 3) Targeted debugging

Re-run a single item by 1-based index:

```bash
python run_pipeline.py --quiz_index 5 --run_id debug_q5
```

### 4) Parallel mode

```bash
python run_pipeline.py --max_workers 4
```
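
A `--max_workers` flag usually maps onto a bounded worker pool. The sketch below shows the general pattern with Python's `concurrent.futures`; `run_task` is a stand-in, not the pipeline's real entry point:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(task: dict) -> dict:
    """Placeholder for one report-generation + judging round trip."""
    return {"id": task["id"], "status": "ok"}

def run_batch(tasks: list, max_workers: int = 4) -> list:
    """Fan tasks out across a bounded thread pool, collecting results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```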

## 🎮 Runtime Controls

| Command | Action |
| --- | --- |
| `stop` + Enter | Safely stop after current tasks finish; saves outputs |
| `Ctrl+C` | Triggers the same graceful shutdown behavior |
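
Both controls usually converge on one shared flag that is checked between tasks. The following is a hypothetical sketch of that pattern, not the repository's implementation:

```python
import signal
import threading
import time

stop_requested = threading.Event()

def _handle_sigint(signum, frame):
    """Ctrl+C sets the flag instead of killing the process mid-task."""
    stop_requested.set()

signal.signal(signal.SIGINT, _handle_sigint)

def _watch_stdin():
    """Background thread: typing 'stop' or 'exit' sets the same flag."""
    while not stop_requested.is_set():
        try:
            line = input()
        except EOFError:
            return
        if line.strip().lower() in {"stop", "exit"}:
            stop_requested.set()
            return

threading.Thread(target=_watch_stdin, daemon=True).start()

for task_id in range(10):          # stand-in for the real task queue
    if stop_requested.is_set():
        print("Stopping after current task; partial results stay on disk.")
        break
    time.sleep(1)                  # stand-in for one generate+judge round trip
```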

## 📂 Output Structure

Outputs are written to `reports_runs/<RUN_ID>/`:

```
reports_runs/experiment_v1/
├── reports/                  # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl   # detailed logs (scores/errors/timings)
├── summary/
│   └── experiment_v1.txt     # aggregated stats (pass rate/avg scores)
└── mm/                       # multimodal intermediate artifacts
```
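
If you want to aggregate a finished run yourself, the per-run JSONL is easy to fold into summary numbers. In this sketch the `score` and `error` fields are assumptions, since the exact log schema isn't documented here:

```python
import json
from pathlib import Path

results_path = Path("reports_runs/experiment_v1/results/experiment_v1.jsonl")

with results_path.open() as f:
    records = [json.loads(line) for line in f if line.strip()]

# "score" and "error" are assumed field names, used for illustration only.
scores = [r["score"] for r in records if isinstance(r.get("score"), (int, float))]
errors = sum(1 for r in records if r.get("error"))

print(f"tasks: {len(records)}  errors: {errors}")
if scores:
    print(f"avg score: {sum(scores) / len(scores):.2f}")
```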


## 🧾 Citation

If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:

```bibtex
@article{mmdrbench2025,
  title={MMDeepResearch-Bench: Grounded Evaluation and Alignment for Multimodal Deep Research Agents},
  author={Anonymous},
  journal={arXiv preprint},
  year={2025}
}
```

## 📜 License

This project is released under the Apache-2.0 License. See `LICENSE`.
