
Commit d9393b8

Merge pull request #172 from truthstriver/develop-upload-retake-codes
[feature] upload retake codes
2 parents: 62ed31f + e563a54


49 files changed (+7203 −0 lines)
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
/dataset
/results
*/__pycache__
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# 🌟 AdaReTaKe: Adaptive Redundancy Reduction for Long-Context Video-Language Understanding

[![Paper](https://img.shields.io/badge/arXiv-2503.12559-b31b1b.svg)](https://arxiv.org/abs/2503.12559)

*Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression*

<p align="center">
  <img src="misc/flexreduc_pipeline.png" alt="AdaReTaKe Framework" width="70%">
</p>

---

## 🔍 Overview

**AdaReTaKe** is a video compression framework for Multimodal Large Language Models (MLLMs). By adaptively reducing the unevenly distributed visual redundancy across timestamps and model layers, it:

- **Extends context capacity** from 256 to **2048 frames**
- **Theoretically minimizes compression loss** via adaptive ratio allocation
- **Outperforms the state of the art** by **+2.3% (7B)** and **+2.8% (72B)** across four benchmarks

---
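The adaptive ratio allocation above can be illustrated with a toy sketch. This is *not* the paper's algorithm (which allocates ratios to minimize a loss upper bound); the function name, the redundancy scores, and the simple proportional rule are all illustrative assumptions. The idea it demonstrates: layers (or timestamps) judged more redundant keep fewer tokens, while the average keep-ratio still meets a fixed overall budget.

```python
# Illustrative sketch only (not the repo's implementation): allocate per-layer
# keep-ratios in proportion to how informative each layer is, so the mean
# keep-ratio across layers matches an overall compression budget.

def allocate_keep_ratios(redundancy, overall_keep=0.25, floor=0.05):
    """redundancy: per-layer redundancy scores in [0, 1] (hypothetical inputs).

    Returns one keep-ratio per layer; more redundant layers keep fewer tokens.
    """
    # Informativeness is taken as the complement of redundancy.
    info = [1.0 - r for r in redundancy]
    total = sum(info) or 1.0
    n = len(info)
    # Distribute the total token budget proportionally to informativeness.
    ratios = [overall_keep * n * w / total for w in info]
    # Clip to a valid range; a real allocator would redistribute the excess.
    return [min(1.0, max(floor, x)) for x in ratios]

# Three layers with decreasing redundancy, mean keep-ratio budget of 0.3.
ratios = allocate_keep_ratios([0.9, 0.5, 0.1], overall_keep=0.3)
print(ratios)
```

The least redundant layer receives the largest keep-ratio, and the mean of the returned ratios equals the budget whenever no clipping occurs.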
## 🎯 Key Contributions

| Feature | Innovation |
|---------|------------|
| **Adaptive Redundancy Reduction** | Layer-wise + timestamp-wise compression for maximal context retention |
| **Scalability** | Validated on 7B to 72B MLLMs with consistent gains |
| **Theoretical Guarantee** | Compression-ratio allocation minimizes the loss upper bound |

---

## 🛠️ Setup

### 🌐 Environment

```bash
# For GPU users
conda create -n retake python=3.11
conda activate retake
pip install -r requirements.txt

# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml

# Additional dependencies
pip install git+https://github.com/huggingface/transformers.git@f3f6c86582611976e72be054675e2bf0abb5f775
apt-get install ffmpeg  # Required for full video processing
```
---

## 🚦 Quick Start

### 1️⃣ Configure Paths

Edit `demo.py`:

```python
hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"
# NPU users: config_path = 'configs/demo_npu.yaml'
```
### 2️⃣ (Optional) Convert LLaVA-Video Weights

```bash
python scripts/utils/convert_llava_video_weights_to_hf.py \
    --text_model_id /path_to/Qwen2-7B-Instruct \
    --vision_model_id /path_to/siglip-so400m-patch14-384 \
    --output_hub_path /path_to/llava-video-qwen2-7b-hf \
    --old_state_dict_id /path_to/LLaVAVideoQwen2_7B
```

### 3️⃣ Run Demo

```bash
python demo.py
```
---

## 📈 Reproduce Results

### Dataset Preparation

- [VideoMME](docs/prepare_videomme.md)
- [MLVU](docs/prepare_mlvu.md)
- [LongVideoBench](docs/prepare_longvideobench.md)
- [LVBench](docs/prepare_lvbench.md)

### Evaluation Scripts

```bash
# Example for VideoMME (adjust the config for other datasets)
bash scripts/infer_eval.sh ${Qwen2.5-VL-7B-PATH} configs/qwen2_5_vl/flexreduc_qwen2-5-vl_videomme.yaml 8
```

*Results are saved in `./results`.*
---

## Citation

Please cite the following if you use the data, code, or experimental findings from this repository.

```bibtex
@misc{wang2025retakereducingtemporalknowledge,
      title={ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding},
      author={Xiao Wang and Qingyi Si and Jianlong Wu and Shiyu Zhu and Li Cao and Liqiang Nie},
      year={2025},
      eprint={2412.20504},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.20504},
}
@misc{wang2025adaretakeadaptiveredundancyreduction,
      title={AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding},
      author={Xiao Wang and Qingyi Si and Jianlong Wu and Shiyu Zhu and Li Cao and Liqiang Nie},
      year={2025},
      eprint={2503.12559},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12559},
}
```
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
### model
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 64,
    'chunked_prefill_frames': 32,
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'compression_method': 'stdvidlkv',
        'dynamic_compression_ratio': True,
        'prompt_guided_compression': True,
        'pos_embed_reforge': False,
        'max_input_length': 16000,
        # Temporal
        'enable_temporal_adaptation': True,
        'temporal_adaptation_ratio': 4,
        # Layer
        'budget_allocation_method': 'adakv',
    },
}

### data
sample_fps: 4
max_num_frames: 2048
longsize_resolution: 448

### generate
do_sample: false
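In the config above, `sample_fps` and `max_num_frames` jointly bound how much video the model sees at full temporal resolution: 2048 frames at 4 fps cover at most 2048 / 4 = 512 seconds before the sampler must subsample. A tiny helper (not part of the repo) makes the relationship explicit:

```python
# Helper (not in the repo) relating the sampling settings above to the
# maximum video duration covered without dropping below the target fps.

def max_covered_seconds(max_num_frames, sample_fps):
    """Longest duration fully covered at `sample_fps` given a frame budget."""
    return max_num_frames / sample_fps

# Settings from the config above: 2048 frames sampled at 4 fps.
print(max_covered_seconds(2048, 4))  # 512.0
```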
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
### model
method: retake
scaling_factor: 4
# attn_implementation: "sdpa"
attn_implementation: "eager"  # If your NPU does not support SDPA attention
longvideo_kwargs: {
    'frame_chunk_size': 16,
    'chunked_prefill_frames': 16,
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'compression_method': 'stdvidlkv',
        'dynamic_compression_ratio': True,
        'prompt_guided_compression': True,
        'pos_embed_reforge': False,
        'max_input_length': 16000,
        # Temporal
        'enable_temporal_adaptation': True,
        'temporal_adaptation_ratio': 4,
        # Layer
        'budget_allocation_method': 'adakv',
    },
}

### data
sample_fps: 4
max_num_frames: 2048
longsize_resolution: 448

### generate
do_sample: false
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: lvbench
anno_file: dataset/lvbench/lvbench.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_lvbench_f64_2fps_r682/base
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: mlvu
anno_file: dataset/mlvu/mlvu.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_mlvu_f64_2fps_r682/base
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: videomme
anno_file: dataset/video_mme/video_mme.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_video_mme_f64_2fps_r682/base
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: lvbench
anno_file: dataset/lvbench/lvbench.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_f1024_2fps_r682/retake_dp1-async_pivot-40k
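The `'compression_method': 'Keyframe'` setting above enables keyframe-based visual compression. As a toy illustration of the general idea (not the repo's implementation; the function name, threshold, and toy feature vectors are assumptions), a difference-based selector keeps a frame only when it has moved far enough from the last kept frame:

```python
# Toy difference-based keyframe selection (illustrative; the repo's
# 'Keyframe' method may differ): keep a frame when its feature vector is
# far enough from the most recently kept frame.

def select_keyframes(frames, threshold=0.5):
    """frames: list of equal-length feature vectors; returns kept indices."""
    if not frames:
        return []
    keep = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        prev = frames[keep[-1]]
        # Euclidean distance to the last kept frame.
        dist = sum((a - b) ** 2 for a, b in zip(frames[i], prev)) ** 0.5
        if dist >= threshold:
            keep.append(i)
    return keep

# A static shot followed by a cut: only the first frame of each shot survives.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
print(select_keyframes(frames, threshold=1.0))  # [0, 3]
```

Redundant frames inside a static shot collapse onto one representative, which is what lets the frame budget stretch to long videos.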
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: mlvu
anno_file: dataset/mlvu/mlvu.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_rope4_mlvu_f1024_2fps_r682/retake_dp1-async_pivot-40k
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: videomme
anno_file: dataset/video_mme/video_mme.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_rope4_video_mme_f1024_2fps_r682/retake_dp1-async_pivot-40k
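The `'compression_method': 'pivotkv'` setting with `'max_input_length': 40000` caps the KV cache by evicting low-importance entries. A generic score-based pruning sketch (the repo's actual `pivotkv` method may differ; the function name and toy scores are assumptions) keeps the highest-scoring cache positions while preserving their temporal order:

```python
# Generic score-based KV-cache pruning sketch (illustrative, not the repo's
# 'pivotkv' implementation): keep the `budget` highest-scoring positions,
# returned in their original temporal order.

def prune_kv_cache(scores, budget):
    """scores: per-position importance, e.g. accumulated attention mass."""
    if budget >= len(scores):
        return list(range(len(scores)))
    # Pick the top-`budget` positions by score, then restore temporal order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)

# Five cached positions, budget of three: the two weakest are evicted.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
print(prune_kv_cache(scores, budget=3))  # [1, 3, 4]
```

Restoring temporal order after selection matters because position embeddings and causal attention assume the cache stays sorted by timestep.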
