
Commit d9393b8

Merge pull request #172 from truthstriver/develop-upload-retake-codes
[feature] upload retake codes
2 parents: 62ed31f + e563a54


49 files changed (+7203 −0 lines)
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
/dataset
/results
*/__pycache__
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# 🌟 AdaReTaKe: Adaptive Redundancy Reduction for Long-Context Video-Language Understanding

[![Paper](https://img.shields.io/badge/arXiv-2503.12559-b31b1b.svg)](https://arxiv.org/abs/2503.12559)

*Breaking the "Memory Wall" for MLLMs with Adaptive Video Compression*

<p align="center">
  <img src="misc/flexreduc_pipeline.png" alt="AdaReTaKe Framework" width="70%">
</p>

---

## 🔍 Overview

**AdaReTaKe** is a video compression framework for Multimodal Large Language Models (MLLMs). By adaptively reducing the unevenly distributed visual redundancy across timestamps and model layers, it:

- **Extends context capacity** from 256 to **2048 frames**
- **Theoretically minimizes compression loss** via adaptive ratio allocation
- **Outperforms the state of the art** by **+2.3% (7B)** and **+2.8% (72B)** across four benchmarks

---
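The adaptive ratio allocation above can be illustrated with a toy sketch. This is *not* the paper's algorithm (which allocates ratios to minimize a loss upper bound); the function name, the redundancy scores, and the simple proportional rule are all illustrative assumptions. The idea it demonstrates: layers (or timestamps) judged more redundant keep fewer tokens, while the average keep-ratio still meets a fixed overall budget.

```python
# Illustrative sketch only (not the repo's implementation): allocate per-layer
# keep-ratios in proportion to how informative each layer is, so the mean
# keep-ratio across layers matches an overall compression budget.

def allocate_keep_ratios(redundancy, overall_keep=0.25, floor=0.05):
    """redundancy: per-layer redundancy scores in [0, 1] (hypothetical inputs).

    Returns one keep-ratio per layer; more redundant layers keep fewer tokens.
    """
    # Informativeness is taken as the complement of redundancy.
    info = [1.0 - r for r in redundancy]
    total = sum(info) or 1.0
    n = len(info)
    # Distribute the total token budget proportionally to informativeness.
    ratios = [overall_keep * n * w / total for w in info]
    # Clip to a valid range; a real allocator would redistribute the excess.
    return [min(1.0, max(floor, x)) for x in ratios]

# Three layers with decreasing redundancy, mean keep-ratio budget of 0.3.
ratios = allocate_keep_ratios([0.9, 0.5, 0.1], overall_keep=0.3)
print(ratios)
```

The least redundant layer receives the largest keep-ratio, and the mean of the returned ratios equals the budget whenever no clipping occurs.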
## 🎯 Key Contributions

| Feature | Innovation |
|---------|------------|
| **Adaptive Redundancy Reduction** | Layer-wise + timestamp-wise compression for maximal context retention |
| **Scalability** | Validated on 7B to 72B MLLMs with consistent gains |
| **Theoretical Guarantee** | Compression-ratio allocation minimizes the loss upper bound |

---

## 🛠️ Setup

### 🌐 Environment

```bash
# For GPU users
conda create -n retake python=3.11
conda activate retake
pip install -r requirements.txt

# For NPU users (e.g., Ascend)
conda env create -f environment_npu.yaml

# Additional dependencies
pip install git+https://github.com/huggingface/transformers.git@f3f6c86582611976e72be054675e2bf0abb5f775
apt-get install ffmpeg  # Required for full video processing
```
---

## 🚦 Quick Start

### 1️⃣ Configure Paths

Edit `demo.py`:

```python
hf_qwen2vl7b_path = "your/local/path/to/Qwen2-VL-7B-Instruct"
# NPU users: config_path = 'configs/demo_npu.yaml'
```
### 2️⃣ (Optional) Convert LLaVA-Video Weights

```bash
python scripts/utils/convert_llava_video_weights_to_hf.py \
    --text_model_id /path_to/Qwen2-7B-Instruct \
    --vision_model_id /path_to/siglip-so400m-patch14-384 \
    --output_hub_path /path_to/llava-video-qwen2-7b-hf \
    --old_state_dict_id /path_to/LLaVAVideoQwen2_7B
```

### 3️⃣ Run Demo

```bash
python demo.py
```
---

## 📈 Reproduce Results

### Dataset Preparation

- [VideoMME](docs/prepare_videomme.md)
- [MLVU](docs/prepare_mlvu.md)
- [LongVideoBench](docs/prepare_longvideobench.md)
- [LVBench](docs/prepare_lvbench.md)

### Evaluation Scripts

```bash
# Example for VideoMME (adjust the config for other datasets)
bash scripts/infer_eval.sh ${Qwen2.5-VL-7B-PATH} configs/qwen2_5_vl/flexreduc_qwen2-5-vl_videomme.yaml 8
```

*Results are saved in `./results`.*
---

## Citation

Please cite the following if you use the data, code, or experimental findings from this repository.

```bibtex
@misc{wang2025retakereducingtemporalknowledge,
      title={ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding},
      author={Xiao Wang and Qingyi Si and Jianlong Wu and Shiyu Zhu and Li Cao and Liqiang Nie},
      year={2025},
      eprint={2412.20504},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.20504},
}
@misc{wang2025adaretakeadaptiveredundancyreduction,
      title={AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding},
      author={Xiao Wang and Qingyi Si and Jianlong Wu and Shiyu Zhu and Li Cao and Liqiang Nie},
      year={2025},
      eprint={2503.12559},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12559},
}
```
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
### model
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 64,
    'chunked_prefill_frames': 32,
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'compression_method': 'stdvidlkv',
        'dynamic_compression_ratio': True,
        'prompt_guided_compression': True,
        'pos_embed_reforge': False,
        'max_input_length': 16000,
        # Temporal
        'enable_temporal_adaptation': True,
        'temporal_adaptation_ratio': 4,
        # Layer
        'budget_allocation_method': 'adakv',
    },
}

### data
sample_fps: 4
max_num_frames: 2048
longsize_resolution: 448

### generate
do_sample: false
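In the config above, `sample_fps` and `max_num_frames` jointly bound how much video the model sees at full temporal resolution: 2048 frames at 4 fps cover at most 2048 / 4 = 512 seconds before the sampler must subsample. A tiny helper (not part of the repo) makes the relationship explicit:

```python
# Helper (not in the repo) relating the sampling settings above to the
# maximum video duration covered without dropping below the target fps.

def max_covered_seconds(max_num_frames, sample_fps):
    """Longest duration fully covered at `sample_fps` given a frame budget."""
    return max_num_frames / sample_fps

# Settings from the config above: 2048 frames sampled at 4 fps.
print(max_covered_seconds(2048, 4))  # 512.0
```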
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
### model
method: retake
scaling_factor: 4
# attn_implementation: "sdpa"
attn_implementation: "eager"  # If your NPU does not support SDPA attention
longvideo_kwargs: {
    'frame_chunk_size': 16,
    'chunked_prefill_frames': 16,
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'compression_method': 'stdvidlkv',
        'dynamic_compression_ratio': True,
        'prompt_guided_compression': True,
        'pos_embed_reforge': False,
        'max_input_length': 16000,
        # Temporal
        'enable_temporal_adaptation': True,
        'temporal_adaptation_ratio': 4,
        # Layer
        'budget_allocation_method': 'adakv',
    },
}

### data
sample_fps: 4
max_num_frames: 2048
longsize_resolution: 448

### generate
do_sample: false
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: lvbench
anno_file: dataset/lvbench/lvbench.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_lvbench_f64_2fps_r682/base
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: mlvu
anno_file: dataset/mlvu/mlvu.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_mlvu_f64_2fps_r682/base
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
### model
model_name: llava_video
method: retake
attn_implementation: "flash_attention_2"

### dataset
dataset_name: videomme
anno_file: dataset/video_mme/video_mme.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 64
longsize_resolution: 682  # short side can be 384

### generate
do_sample: false

### output
output_dir: results/llava-video_video_mme_f64_2fps_r682/base
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: lvbench
anno_file: dataset/lvbench/lvbench.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_f1024_2fps_r682/retake_dp1-async_pivot-40k
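The `'compression_method': 'Keyframe'` setting above enables keyframe-based visual compression. As a toy illustration of the general idea (not the repo's implementation; the function name, threshold, and toy feature vectors are assumptions), a difference-based selector keeps a frame only when it has moved far enough from the last kept frame:

```python
# Toy difference-based keyframe selection (illustrative; the repo's
# 'Keyframe' method may differ): keep a frame when its feature vector is
# far enough from the most recently kept frame.

def select_keyframes(frames, threshold=0.5):
    """frames: list of equal-length feature vectors; returns kept indices."""
    if not frames:
        return []
    keep = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        prev = frames[keep[-1]]
        # Euclidean distance to the last kept frame.
        dist = sum((a - b) ** 2 for a, b in zip(frames[i], prev)) ** 0.5
        if dist >= threshold:
            keep.append(i)
    return keep

# A static shot followed by a cut: only the first frame of each shot survives.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
print(select_keyframes(frames, threshold=1.0))  # [0, 3]
```

Redundant frames inside a static shot collapse onto one representative, which is what lets the frame budget stretch to long videos.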
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: mlvu
anno_file: dataset/mlvu/mlvu.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_rope4_mlvu_f1024_2fps_r682/retake_dp1-async_pivot-40k
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
### model
model_name: llava_video
method: retake
scaling_factor: 4
attn_implementation: "flash_attention_2"
longvideo_kwargs: {
    'frame_chunk_size': 32,
    'chunked_prefill_frames': 32,
    # Keyframe compression
    'visual_compression': True,
    'visual_compression_kwargs': {
        'compression_ratio': 1.0,
        'compression_method': 'Keyframe',
        'patch_sync': False,
        'return_keyframe_mask': True
    },
    # KVCache compression
    'kvcache_compression': True,
    'kvcache_compression_kwargs': {
        'dynamic_compression_ratio': True,
        'compression_method': 'pivotkv',
        'pos_embed_reforge': True,
        'max_input_length': 40000
    },
}

### dataset
dataset_name: videomme
anno_file: dataset/video_mme/video_mme.json
dataloader_num_workers: 4

### data
sample_fps: 2
max_num_frames: 1024
longsize_resolution: 682

### generate
do_sample: false

### output
output_dir: results/llava-video_rope4_video_mme_f1024_2fps_r682/retake_dp1-async_pivot-40k
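The `'compression_method': 'pivotkv'` setting with `'max_input_length': 40000` caps the KV cache by evicting low-importance entries. A generic score-based pruning sketch (the repo's actual `pivotkv` method may differ; the function name and toy scores are assumptions) keeps the highest-scoring cache positions while preserving their temporal order:

```python
# Generic score-based KV-cache pruning sketch (illustrative, not the repo's
# 'pivotkv' implementation): keep the `budget` highest-scoring positions,
# returned in their original temporal order.

def prune_kv_cache(scores, budget):
    """scores: per-position importance, e.g. accumulated attention mass."""
    if budget >= len(scores):
        return list(range(len(scores)))
    # Pick the top-`budget` positions by score, then restore temporal order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)

# Five cached positions, budget of three: the two weakest are evicted.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
print(prune_kv_cache(scores, budget=3))  # [1, 3, 4]
```

Restoring temporal order after selection matters because position embeddings and causal attention assume the cache stays sorted by timestep.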
