# KVComp: Hash-Aware Top-k Attention for Scalable Large Model Inference

<div align="center">

**🚀 Hash-Aware Sparse Attention Algorithm | 📄 ACL 2025 Paper | ⚡ NPU/GPU Hardware-Efficient**

[📄 Paper](paper/kvcomp-ACL-2025-paper.pdf)
[📜 License](LICENSE)
[🐍 Python](https://python.org)

</div>

## 🌟 What is KVComp (HATA)?

**KVComp** (Key-Value Compression) is a sparse attention algorithm that accelerates large language model inference through **Hash-Aware Top-k Attention** (HATA). Published at ACL 2025, the method selects the most relevant KV cache blocks for each query using trainable hash-based similarity computation, avoiding exact attention over the full cache.

### 🎯 Key Innovations

- **🔍 Hash-Aware Similarity**: Uses trainable hash functions to estimate attention relevance, which is significantly faster than computing exact $QK^\top$ attention scores (see the sketch after this list)
- **⚡ Hardware-Efficient**: Optimized for both CUDA and NPU architectures with specialized kernels
- **🎛️ Adaptive Sparsity**: Layer-wise sparsity ratios that adapt to model characteristics
- **🔄 Dynamic Retrieval**: Real-time **query-aware** block selection based on query-key similarity
- **💾 Memory-Efficient**: Dramatically reduces peak KV cache HBM usage by leveraging UCM's offloading capability
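
As a rough illustration of the hash-aware similarity, the sketch below encodes a query and block-level keys with a trainable sign-of-projection hash and scores blocks by bit agreement. The names and shapes (`hash_weight`, `hash_bits`, one representative key per block) are illustrative assumptions, not the KVComp API.

```python
import torch

head_dim, hash_bits, num_blocks = 128, 128, 1024

# Trainable hash: a learned projection followed by sign binarization
# (weights are trained offline in HATA; random here for illustration only).
hash_weight = torch.randn(head_dim, hash_bits)

def hash_encode(x: torch.Tensor) -> torch.Tensor:
    # x: [..., head_dim] -> bipolar codes in {-1, +1} of length hash_bits
    return torch.sign(x @ hash_weight)

query = torch.randn(head_dim)
block_keys = torch.randn(num_blocks, head_dim)   # one representative key per KV block

q_code = hash_encode(query)                      # [hash_bits]
k_codes = hash_encode(block_keys)                # [num_blocks, hash_bits]

# Bit agreement (equivalently, negated Hamming distance) stands in for the
# exact QK^T score: a handful of bit ops per block instead of a full dot product.
scores = (k_codes * q_code).sum(dim=-1)          # [num_blocks]
```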

### 🔥 Key Results

- **3-5x speedup** in attention computation for long sequences
- **Minimal accuracy loss** (< 2%) on downstream tasks
- **Scalable to 128K+ context lengths** with linear complexity

## 🏆 Performance Highlights

<div align="center">

### End-to-End Performance

### Single Layer Performance

</div>

## 📈 Accuracy Benchmarks

<div align="center">

### LongBench Evaluation

</div>

## 🧠 How It Works

### Core Algorithm

KVComp operates through a sophisticated three-stage process:

1. **🔐 Hash Encoding**: Convert attention keys and queries into compact hash codes
2. **🎯 Similarity Computation**: Use efficient hash-based similarity to identify relevant blocks
3. **📦 Selective Loading**: Load only the top-k most relevant KV blocks for attention

```python
# Simplified algorithm flow (illustrative; the real kernels operate on packed hash codes)
import torch

def kvcomp_attention(query, key_cache, value_cache, top_k_ratio):
    # 1. Hash encoding: map the query and block-level keys to compact binary codes
    hash_query = hash_encoder.compute_hash(query)
    hash_keys = hash_encoder.compute_hash(key_cache)

    # 2. Similarity computation: hash agreement approximates attention relevance
    scores = hamming_score(hash_query, hash_keys)

    # 3. Top-k selection: keep only the most relevant KV blocks
    k = int(len(key_cache) * top_k_ratio)
    topk_blocks = torch.topk(scores, k).indices

    # 4. Selective attention over the chosen blocks only
    return attention(query, key_cache[topk_blocks], value_cache[topk_blocks])
```

### 🏗️ Architecture

The algorithm maintains three critical windows:
- **Initial Window**: First few blocks (always loaded)
- **Sparse Window**: Top-k selected blocks (dynamically chosen)
- **Local Window**: Recent blocks (always loaded)

This design ensures both **efficiency** and **accuracy** by preserving essential context while sparsifying the middle range.
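
A hedged sketch of how these windows could be combined into one list of block indices (window sizes and the helper name below are assumptions for illustration, not the actual UCM/KVComp interface):

```python
import torch

def select_blocks(scores: torch.Tensor, init_window: int, local_window: int, top_k: int):
    """Always keep initial and local blocks; pick top-k blocks from the middle range."""
    num_blocks = scores.shape[0]
    initial = torch.arange(0, init_window)                        # always loaded
    local = torch.arange(num_blocks - local_window, num_blocks)   # always loaded

    # Rank only the middle range by hash similarity and keep its top-k blocks.
    middle = torch.arange(init_window, num_blocks - local_window)
    top_k = min(top_k, middle.numel())
    sparse = middle[torch.topk(scores[middle], top_k).indices]

    return torch.cat([initial, sparse.sort().values, local])

# Example: 100 blocks, keep 2 initial, 4 local, and the 16 best-matching middle blocks.
selected = select_blocks(torch.randn(100), init_window=2, local_window=4, top_k=16)
```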

## 🚀 Quick Start

### Installation

KVComp is part of the UCM Sparse Attention module. For installation instructions, please refer to the [UCM top-level README](../../../../README.md). Once UCM is installed, KVComp works out of the box; run the following example script:

```bash
python ucm/sandbox/sparse/kvcomp/offline_inference_kvcomp.py
```

### Basic Usage
Similar to UCM's `offline_inference_esa.py` example, we only need to set `ucm_sparse_method` to `KvComp` and point `kvcomp_config_path` to a KVComp config file, as shown below.

```python
...
ktc = KVTransferConfig(
    kv_connector=name,
    kv_connector_module_path=module_path,
    kv_role="kv_both",
    kv_connector_extra_config={
        "ucm_connector_name": "UcmDram",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,
            "kv_block_size": 262144,
        },
        "ucm_sparse_method": "KvComp",
        "kvcomp_config_path": "configs/kvcomp_qwen3_4B_config.json",
    },
)
...
```

### Configuration
KvComp needs a JSON configuration file. Several configs are already included in the `configs` folder, covering Qwen3-4B, Qwen3-32B, and QwQ-32B.

```json
{
    "model_name": "Qwen/Qwen3-4B",
    "is_mla": false,
    "hash_weight_type": "random",
    "num_hidden_layers": 36,
    "seq_len_threshhold": 2048,
    "chunk_size": 128,
    "chunk_repre_method": "max",
    "head_dim": 128,
    "hash_bits": 128,
    "top_k_ratio_per_layer": [0.3, 0.3, ... , 0.3],
    "top_k_index_reuse": [-1, -1, ... , -1],
    "must_select_blocks": [0, -2, -1]
}
```

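For reference, a minimal helper that loads such a config and sanity-checks the per-layer fields (field names follow the example above; the validation logic itself is only an illustration, not part of KVComp):

```python
import json

def load_kvcomp_config(path: str) -> dict:
    with open(path) as f:
        cfg = json.load(f)

    layers = cfg["num_hidden_layers"]
    # Per-layer lists must provide exactly one entry per transformer layer.
    assert len(cfg["top_k_ratio_per_layer"]) == layers
    assert len(cfg["top_k_index_reuse"]) == layers
    # Sparsity ratios are fractions of KV blocks to keep per layer.
    assert all(0.0 < r <= 1.0 for r in cfg["top_k_ratio_per_layer"])
    return cfg

cfg = load_kvcomp_config("configs/kvcomp_qwen3_4B_config.json")
```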

## 📊 Supported Models

| Model | Size | Hash Bits | Top-k Ratio | Performance Gain |
|-------|------|-----------|-------------|------------------|
| Qwen3-4B | 4B | 128 | 0.3 | xx |
| Qwen3-32B | 32B | 128 | 0.3 | xx |
| QwQ-32B | 32B | 128 | 0.3 | xx |
| DeepSeek-R1 | 671B | 512+64 | 0.3 | xx |

## 🔧 Advanced Features

### Custom Hash Weights
```python
# Use pre-trained hash weights
config.set_hash_weight(custom_hash_weights)
```

### Hardware Optimization
- **CUDA**: Optimized kernels for bit-packing, Hamming scoring, and top-k selection (a PyTorch-level sketch of the idea follows this list)
- **NPU**: Native `npu_sign_bits_pack` operations plus fused kernels for `hamming_dist_top_k` and `kv_select`
- **CPU**: SIMD-optimized implementations
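
To make the bit-packing idea concrete, here is a plain PyTorch sketch of the operations the optimized kernels fuse: packing sign bits into bytes, XOR + popcount Hamming distance, and top-k selection. The real CUDA/NPU kernels are separate implementations; everything below is illustrative only.

```python
import torch

def pack_bits(codes: torch.Tensor) -> torch.Tensor:
    # codes: bipolar {-1, +1} hash codes, [..., hash_bits] with hash_bits % 8 == 0
    bits = (codes > 0).to(torch.uint8).view(*codes.shape[:-1], -1, 8)
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)
    return (bits * weights).sum(dim=-1).to(torch.uint8)   # packed: [..., hash_bits // 8]

def hamming_distance(packed_q: torch.Tensor, packed_k: torch.Tensor) -> torch.Tensor:
    # XOR the packed codes, then count set bits per byte via a 256-entry popcount table.
    popcount = torch.tensor([bin(i).count("1") for i in range(256)], dtype=torch.uint8)
    diff = torch.bitwise_xor(packed_q, packed_k)          # broadcasts over blocks
    return popcount[diff.long()].sum(dim=-1)              # [num_blocks]

q_code = torch.sign(torch.randn(128))
k_codes = torch.sign(torch.randn(1024, 128))
dist = hamming_distance(pack_bits(q_code), pack_bits(k_codes))
topk_blocks = torch.topk(dist, k=16, largest=False).indices   # smallest distance wins
```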

## 🎓 Citation

If you use KvComp in your research, please cite our ACL 2025 paper:

```bibtex
@inproceedings{kvcomp2025,
  title={HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference},
  author={Ping Gong and Jiawei Yi and Shengnan Wang and Juncheng Zhang and Zewen Jin and Ouxiang Zhou and Ruibo Liu and Guanbin Xu and Youhui Bai and Bowen Ye and Kun Yuan and Tong Yang and Gong Zhang and Renhai Chen and Feng Wu and Cheng Li},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}
```

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](../../../../docs/source/developer_guide/contributing.md) for details.

---

<div align="center">

**🌟 Star the [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find KvComp useful!**

</div>