Commit 62685d6

Merge pull request #182 from leideng/develop-kvcomp-rebase
[Feat] Add KVComp sparse attention implementation in UCM

2 parents 1da231f + dd7f640, commit 62685d6

14 files changed, +1819 -0 lines changed

ucm/integration/vllm/ucm_sparse/factory.py (3 additions, 0 deletions)

```diff
@@ -44,3 +44,6 @@ def create_sparse_method(

 # Register available sparse methods
 UcmSparseFactory.register_sparse_method("ESA", "ucm.ucm_sparse.esa", "ESA")
+UcmSparseFactory.register_sparse_method(
+    "KvComp", "ucm.sandbox.sparse.kvcomp.kvcomp", "KvComp"
+)
```
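The registration maps the name `KvComp` to a module path and class name, which suggests a lazily importing registry. A rough sketch of that pattern follows; UCM's actual `UcmSparseFactory` may be implemented differently, and this class is purely illustrative.

```python
import importlib

# Hypothetical sketch of a name -> (module, class) registry with lazy import;
# not UCM's actual implementation.
class SparseMethodFactory:
    _registry: dict[str, tuple[str, str]] = {}

    @classmethod
    def register_sparse_method(cls, name: str, module_path: str, class_name: str) -> None:
        cls._registry[name] = (module_path, class_name)

    @classmethod
    def create_sparse_method(cls, name: str, *args, **kwargs):
        module_path, class_name = cls._registry[name]
        module = importlib.import_module(module_path)  # imported only on first use
        return getattr(module, class_name)(*args, **kwargs)
```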
---

New file (195 additions, 0 deletions):
# KVComp: Hash-Aware Top-k Attention for Scalable Large Model Inference

<div align="center">

![KVComp Scheme](figs/kvcomp_scheme.jpg)

**🚀 Hash-Aware Sparse Attention Algorithm | 📄 ACL 2025 Paper | ⚡ NPU/GPU Hardware-Efficient**

[![Paper](https://img.shields.io/badge/Paper-ACL%202025-blue)](paper/kvcomp-ACL-2025-paper.pdf)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)

</div>
## 🌟 What is KVComp (HATA)?

**KVComp** (Key-Value Compression) is a sparse attention algorithm that accelerates large language model inference through **Hash-Aware Top-k Attention** (HATA). Published at ACL 2025, the method selects only the most relevant KV cache blocks for each query using trainable hash-based similarity computation, avoiding most of the exact attention-score work.
### 🎯 Key Innovations

- **🔍 Hash-Aware Similarity**: Uses trainable hash functions to estimate attention relevance, which is significantly faster than exact attention-score ($QK^\top$) computation (see the sketch below)
- **⚡ Hardware-Efficient**: Optimized for both CUDA and NPU architectures with specialized kernels
- **🎛️ Adaptive Sparsity**: Layer-wise sparsity ratios that adapt to model characteristics
- **🔄 Dynamic Retrieval**: Real-time **query-aware** block selection based on query-key similarity
- **💾 Memory-Efficient**: Dramatically reduces peak KV cache HBM usage by leveraging UCM's offloading capability
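To make the first point concrete, here is a minimal sketch of hash-based relevance scoring, assuming the common sign-of-projection scheme: queries and keys pass through a (trainable) projection, are binarized by sign, and are compared by Hamming similarity. Function and variable names here are illustrative, not UCM's API.

```python
import torch

def hash_encode(x: torch.Tensor, hash_weight: torch.Tensor) -> torch.Tensor:
    """Project and binarize: x [n, head_dim] -> boolean codes [n, hash_bits]."""
    return (x @ hash_weight) > 0  # hash_weight may be random or trained offline

def hamming_similarity(q_code: torch.Tensor, k_codes: torch.Tensor) -> torch.Tensor:
    """Count matching bits between one query code and each key code."""
    return (q_code == k_codes).sum(dim=-1)

# Toy sizes matching the configs below: 128-dim heads hashed to 128 bits.
head_dim, hash_bits, n_keys = 128, 128, 1024
hash_weight = torch.randn(head_dim, hash_bits)   # "hash_weight_type": "random"
q, k = torch.randn(1, head_dim), torch.randn(n_keys, head_dim)
scores = hamming_similarity(hash_encode(q, hash_weight), hash_encode(k, hash_weight))
print(scores.shape)  # torch.Size([1024]) -- one relevance score per key
```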
### 🔥 Key Results

- **3-5x speedup** in attention computation for long sequences
- **Minimal accuracy loss** (< 2%) on downstream tasks
- **Scalable to 128K+ context lengths** with linear complexity
## 🏆 Performance Highlights

<div align="center">

### End-to-End Performance
![End-to-End Performance](figs/kvcomp_end_to_end_performance.jpg)

### Single Layer Performance
![Single Layer Performance](figs/kvcomp_single_layer_performance.jpg)

</div>

## 📈 Accuracy Benchmarks

<div align="center">

### LongBench Evaluation
![LongBench Results](figs/kvcomp_longbench.jpg)

</div>
## 🧠 How It Works

### Core Algorithm

KVComp operates through a three-stage process:

1. **🔐 Hash Encoding**: Convert attention keys and queries into compact hash codes
2. **🎯 Similarity Computation**: Use efficient hash-based similarity to identify relevant blocks
3. **📦 Selective Loading**: Load only the top-k most relevant KV blocks for attention
```python
# Simplified algorithm flow (illustrative; see ucm/sandbox/sparse/kvcomp for the real code)
import torch

def kvcomp_attention(query, key_cache, value_cache, top_k_ratio):
    # 1. Hash encoding: binarize the query and the cached keys
    hash_query = hash_encoder.compute_hash(query)
    hash_keys = hash_encoder.compute_hash(key_cache)

    # 2. Similarity computation: cheap Hamming scores instead of full QK^T
    scores = hamming_score(hash_query, hash_keys)

    # 3. Top-k selection: indices of the highest-scoring blocks
    top_k = int(len(key_cache) * top_k_ratio)
    topk_blocks = torch.topk(scores, top_k).indices

    # 4. Selective attention over only the chosen blocks
    return attention(query, key_cache[topk_blocks], value_cache[topk_blocks])
```
### 🏗️ Architecture

The algorithm maintains three critical windows:

- **Initial Window**: First few blocks (always loaded)
- **Sparse Window**: Top-k selected blocks (dynamically chosen)
- **Local Window**: Recent blocks (always loaded)

This design ensures both **efficiency** and **accuracy** by preserving essential context while sparsifying the middle range, as sketched below.
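A small sketch of how the three windows might combine into the final block set. The window sizes and function names are illustrative assumptions, not UCM's actual parameters:

```python
import torch

def select_blocks(scores: torch.Tensor, top_k_ratio: float,
                  initial_window: int = 1, local_window: int = 2) -> torch.Tensor:
    """Union of always-kept initial/local blocks and the top-k sparse window."""
    num_blocks = scores.shape[0]
    always = set(range(initial_window)) | set(range(num_blocks - local_window, num_blocks))
    top_k = int(num_blocks * top_k_ratio)
    sparse = set(torch.topk(scores, top_k).indices.tolist())
    return torch.tensor(sorted(always | sparse))

# E.g. with must_select_blocks [0, -2, -1] from the configs below: block 0 and
# the last two blocks are always loaded, and top-k selection fills the middle.
print(select_blocks(torch.randn(32), top_k_ratio=0.3))
```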
## 🚀 Quick Start

### Installation

KVComp is part of the UCM Sparse Attention module. For installation instructions, please refer to [UCM's top-level README](../../../../README.md). Once UCM is installed, KVComp works out of the box; try the following example script:

```bash
python ucm/sandbox/sparse/kvcomp/offline_inference_kvcomp.py
```
### Basic Usage

Usage is similar to UCM's `offline_inference_esa.py` example: set `ucm_sparse_method` to `KvComp` and point `kvcomp_config_path` at a KVComp config file, as shown below.

```python
...
ktc = KVTransferConfig(
    kv_connector=name,
    kv_connector_module_path=module_path,
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connector_name": "UcmDram",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,
            "kv_block_size": 262144,
        },
        "ucm_sparse_method": "KvComp",
        "kvcomp_config_path": "configs/kvcomp_qwen3_4B_config.json",
    },
)
...
```
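The `ktc` object is then handed to vLLM's offline entry point. A hedged sketch, assuming a recent vLLM with `kv_transfer_config` support (the model name and prompt are placeholders; mirror `offline_inference_kvcomp.py` for the exact setup):

```python
from vllm import LLM, SamplingParams

# Hypothetical wiring of the KVTransferConfig from the snippet above.
llm = LLM(model="Qwen/Qwen3-4B", kv_transfer_config=ktc)
outputs = llm.generate(["Summarize this document: ..."],
                       SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```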
### Configuration

KvComp needs a JSON configuration file. Several ready-made configs are included in the `configs` folder, covering Qwen3-4B, Qwen3-32B, and QwQ-32B.

```json
{
    "model_name": "Qwen/Qwen3-4B",
    "is_mla": false,
    "hash_weight_type": "random",
    "num_hidden_layers": 36,
    "seq_len_threshhold": 2048,
    "chunk_size": 128,
    "chunk_repre_method": "max",
    "head_dim": 128,
    "hash_bits": 128,
    "top_k_ratio_per_layer": [0.3, 0.3, ..., 0.3],
    "top_k_index_reuse": [-1, -1, ..., -1],
    "must_select_blocks": [0, -2, -1]
}
```
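A quick, hedged sanity check one might run on such a config, using only the standard library (the field names come straight from the shipped configs, including the `seq_len_threshhold` spelling):

```python
import json

with open("configs/kvcomp_qwen3_4B_config.json") as f:
    cfg = json.load(f)

# The per-layer lists should line up with the model depth.
n = cfg["num_hidden_layers"]
assert len(cfg["top_k_ratio_per_layer"]) == n
assert len(cfg["top_k_index_reuse"]) == n
print(f"{cfg['model_name']}: {n} layers, {cfg['hash_bits']}-bit hashes, "
      f"chunk size {cfg['chunk_size']}")
```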
## 📊 Supported Models

| Model | Size | Hash Bits | Top-k Ratio | Performance Gain |
|-------|------|-----------|-------------|------------------|
| Qwen3-4B | 4B | 128 | 0.3 | xx |
| Qwen3-32B | 32B | 128 | 0.3 | xx |
| QwQ-32B | 32B | 128 | 0.3 | xx |
| DeepSeek-R1 | 671B | 512+64 | 0.3 | xx |

## 🔧 Advanced Features
### Custom Hash Weights

```python
# Use pre-trained hash weights instead of random ones
config.set_hash_weight(custom_hash_weights)
```

### Hardware Optimization

- **CUDA**: Optimized kernels for bit-packing, Hamming scoring, and top-k selection
- **NPU**: Native `npu_sign_bits_pack` operations, plus fused kernels for `hamming_dist_top_k` and `kv_select`
- **CPU**: SIMD-optimized implementations

A device-agnostic reference for the bit-packing and Hamming steps is sketched below.
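As a rough illustration of what those kernels compute, here is a plain-PyTorch sketch of bit-packing and XOR+popcount Hamming distance. This is not the optimized CUDA/NPU path, just the reference semantics:

```python
import torch

# Classic byte-wise popcount lookup table (256 entries).
POPCOUNT = torch.tensor([bin(i).count("1") for i in range(256)])

def pack_bits(codes: torch.Tensor) -> torch.Tensor:
    """Pack boolean codes [..., hash_bits] into byte values [..., hash_bits // 8]."""
    weights = 2 ** torch.arange(8)
    return (codes.reshape(*codes.shape[:-1], -1, 8).long() * weights).sum(dim=-1)

def hamming_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hamming distance between packed codes via XOR + popcount."""
    return POPCOUNT[a ^ b].sum(dim=-1)

codes_q = pack_bits(torch.randn(1, 128) > 0)      # query: 128 bits -> 16 bytes
codes_k = pack_bits(torch.randn(1024, 128) > 0)   # 1024 key chunks
dist = hamming_distance(codes_q, codes_k)         # [1024]; smaller = more similar
top = torch.topk(-dist, k=int(1024 * 0.3)).indices  # keep the closest 30%
```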
## 🎓 Citation

If you use KvComp in your research, please cite our ACL 2025 paper:

```bibtex
@inproceedings{kvcomp2025,
  title={HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference},
  author={Ping Gong and Jiawei Yi and Shengnan Wang and Juncheng Zhang and Zewen Jin and Ouxiang Zhou and Ruibo Liu and Guanbin Xu and Youhui Bai and Bowen Ye and Kun Yuan and Tong Yang and Gong Zhang and Renhai Chen and Feng Wu and Cheng Li},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}
```
## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](../../../../docs/source/developer_guide/contributing.md) for details.

---

<div align="center">

**🌟 Star the [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find KvComp useful!**

</div>
---

New file (155 additions, 0 deletions): KVComp configuration for Qwen3-32B
```json
{
    "model_name": "Qwen/Qwen3-32B",
    "is_mla": false,
    "hash_weight_type": "random",
    "num_hidden_layers": 64,
    "seq_len_threshhold": 2048,
    "chunk_size": 128,
    "chunk_repre_method": "max",
    "head_dim": 128,
    "hash_bits": 128,
    "top_k_ratio_per_layer": [
        1, 1,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
        1, 1, 1
    ],
    "top_k_index_reuse": [
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
    ],
    "must_select_blocks": [0, -2, -1],
    "hash_weight": null,
    "kv_lora_rank": null,
    "qk_rope_head_dim": null,
    "hash_bits_kv_lora": null,
    "hash_bits_qk_rope": null,
    "hash_weight_kv_lora": null,
    "hash_weight_qk_rope": null
}
```
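Note that in this config the first two and last three layers keep a ratio of 1 (dense attention) while the 59 middle layers are sparsified at 0.3, and both per-layer lists have exactly `num_hidden_layers` entries. When writing configs for other depths, a short snippet like this (a convenience, not part of UCM) reproduces the lists:

```python
import json

# 64 layers: keep the first 2 and last 3 dense, sparsify the middle at 0.3.
top_k_ratio_per_layer = [1] * 2 + [0.3] * 59 + [1] * 3
top_k_index_reuse = [-1] * 64
print(json.dumps(top_k_ratio_per_layer))
```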
