A lightweight, high-performance Key-Value cache implementation for Large Language Models with PagedAttention support. Built for educational purposes and portfolio demonstration.
```
litecache/
├── README.md
├── pyproject.toml
├── .gitignore
├── LICENSE
│
├── litecache/
│   ├── __init__.py
│   ├── config.py              # Configuration classes
│   ├── block_manager.py       # Block allocation and management
│   ├── cache/
│   │   ├── __init__.py
│   │   ├── base.py            # Abstract cache interface
│   │   ├── paged_attention.py # PagedAttention implementation
│   │   └── utils.py           # Cache utilities
│   ├── kernels/
│   │   ├── __init__.py
│   │   ├── triton_kernels.py  # Triton GPU kernels
│   │   └── torch_fallback.py  # PyTorch CPU/fallback implementations
│   ├── models/
│   │   ├── __init__.py
│   │   ├── adapter.py         # Model integration adapter
│   │   └── hooks.py           # HuggingFace integration hooks
│   └── memory/
│       ├── __init__.py
│       ├── allocator.py       # Physical block allocator
│       └── sequence.py        # Logical sequence management
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py            # Pytest fixtures
│   ├── test_block_manager.py
│   ├── test_cache.py
│   ├── test_kernels.py
│   ├── test_integration.py
│   └── test_models.py
│
├── benchmarks/
│   ├── __init__.py
│   ├── run_benchmarks.py
│   ├── throughput.py
│   └── memory_profile.py
│
└── examples/
    ├── basic_usage.py
    ├── huggingface_integration.py
    └── benchmark_comparison.py
```
LiteCache implements an efficient KV cache system for LLM inference, featuring:
- PagedAttention: Memory-efficient attention mechanism with block-based memory management
- Pluggable Architecture: Easy to extend with different caching mechanisms (RadixAttention, StreamingLLM, etc.); see the interface sketch after this list
- GPU Acceleration: Triton kernels for optimized GPU operations
- Model Agnostic: Clean adapter interface for integration with existing models
- Quantization Support: FP16/BF16 precision modes
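To illustrate the pluggable architecture, a new backend would subclass the abstract interface defined in `cache/base.py`. The sketch below is hypothetical: `KVCacheBase`, `allocate_sequence`, `append`, and `free_sequence` are illustrative names, not necessarily the library's actual API.

```python
# Hypothetical sketch only; the actual interface in litecache/cache/base.py may differ.
from abc import ABC, abstractmethod

import torch


class KVCacheBase(ABC):
    """The kind of contract an abstract cache backend could define."""

    @abstractmethod
    def allocate_sequence(self, seq_len: int) -> int:
        """Reserve cache space for a new sequence and return its id."""

    @abstractmethod
    def append(self, seq_id: int, key: torch.Tensor, value: torch.Tensor) -> None:
        """Store the KV tensors for the newest token(s) of a sequence."""

    @abstractmethod
    def free_sequence(self, seq_id: int) -> None:
        """Release all memory owned by a sequence."""


# A PagedAttention backend, a RadixAttention backend, or a StreamingLLM-style
# sliding-window backend would each subclass this and plug into the same adapter.
```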
Implemented:

- ✅ PagedAttention cache with block management
- ✅ Copy-on-Write (CoW) for shared prefixes
- ✅ Triton GPU kernels with PyTorch fallback
- ✅ HuggingFace Transformers integration
- ✅ Support for lightweight decoder-only models (GPT-2, TinyLlama, Phi, Qwen)
- ✅ FP16/BF16 quantization support
- ✅ Comprehensive test suite
In progress / planned:

- 🔄 Dynamic batching and continuous batching
- 🔄 RadixAttention for prefix caching
- 🔄 Multi-GPU support
- 🔄 FP8 quantization
- 🔄 Speculative decoding integration
Requirements:

- Python 3.10+
- PyTorch 2.0+ (CPU or GPU build)
- Triton 2.0+ (GPU support only)
- CUDA 11.8+ (GPU support only)
- transformers (HuggingFace)
For CPU (development on a laptop):

```bash
git clone https://github.com/yourusername/litecache.git
cd litecache
uv venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate

# Install PyTorch CPU version
uv pip install torch --index-url https://download.pytorch.org/whl/cpu

# Install litecache with dev dependencies
uv pip install -e ".[dev]"
```

For GPU (production/benchmarking):

```bash
git clone https://github.com/yourusername/litecache.git
cd litecache
uv venv
source .venv/bin/activate
# Install PyTorch GPU version with CUDA 11.8
uv pip install torch triton --index-url https://download.pytorch.org/whl/cu118
# Install litecache with dev dependencies
uv pip install -e ".[dev]"
```

Alternatively, using the setup script:

```bash
chmod +x setup.sh
./setup.sh cpu   # or: ./setup.sh gpu
```

Basic usage:

```python
from litecache import PagedAttentionCache, CacheConfig
import torch
# Configure cache
config = CacheConfig(
    block_size=16,       # tokens per block
    num_blocks=1024,     # total blocks
    num_heads=32,
    head_dim=128,
    num_layers=32,
    dtype=torch.float16,
    device="cuda",
)
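# Rough footprint check (an assumption about the layout, not a measured figure):
# this configuration pre-allocates num_blocks * block_size = 1024 * 16 = 16,384
# token slots per layer. Storing K and V for every layer in fp16 costs roughly
#   2 * num_layers * num_blocks * block_size * num_heads * head_dim * 2 bytes
#   = 2 * 32 * 1024 * 16 * 32 * 128 * 2 B ≈ 8 GiB,
# so size num_blocks to fit your GPU before initializing the cache.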
# Initialize cache
cache = PagedAttentionCache(config)
# Allocate sequence
seq_id = cache.allocate_sequence(seq_len=512)
# Use in attention computation
attention_output = cache.paged_attention(
    query=q,           # [batch, num_heads, seq_len, head_dim]
    block_tables=...,  # [batch, max_blocks]
    context_lens=...,  # [batch]
)
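# Standard PagedAttention bookkeeping: block_tables[i, j] is the physical block
# backing logical block j of sequence i, and context_lens[i] is how many tokens
# of sequence i are valid; K/V are gathered from those blocks rather than from
# one contiguous buffer.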
# Free when done
cache.free_sequence(seq_id)
```

HuggingFace integration:

```python
from litecache.models import KVCacheAdapter
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Wrap with cache adapter
cached_model = KVCacheAdapter(model, cache_config=config)
# Generate with efficient caching
outputs = cached_model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
)
```

Running the tests:

```bash
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_cache.py -v
# Run with coverage
pytest tests/ --cov=litecache --cov-report=html
```

Running benchmarks:

```bash
# Run throughput benchmarks
python benchmarks/run_benchmarks.py --model gpt2 --batch-size 1
# Compare with baseline
python examples/benchmark_comparison.py
```

Expected improvements over the standard HuggingFace KV cache:
- Memory Efficiency: ~40-50% reduction in peak memory usage (see the sketch after this list for where the savings come from)
- Throughput: ~1.5-2x tokens/second for long sequences
- Batch Scaling: Better memory scaling with increasing batch sizes
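As a back-of-the-envelope illustration of the memory saving (the sequence lengths below are made up for the example, not benchmark output): a conventional cache reserves the full maximum length for every sequence, while a paged cache only holds the blocks each sequence actually uses.

```python
num_seqs, max_len, block_size = 8, 2048, 16
actual_lens = [130, 640, 90, 1500, 300, 210, 75, 980]  # hypothetical sequence lengths

# Conventional cache: every sequence reserves max_len token slots up front.
contiguous_slots = num_seqs * max_len                   # 16,384 slots

# Paged cache: each sequence holds only ceil(len / block_size) blocks.
paged_slots = sum(
    ((length + block_size - 1) // block_size) * block_size for length in actual_lens
)                                                       # 3,984 slots

print(f"reduction in reserved KV slots: {1 - paged_slots / contiguous_slots:.0%}")  # ~76%
```

The exact reduction depends on how much shorter real sequences are than the reserved maximum, which is why the expected figure above is a range rather than a single number.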
Core components:

- Block Manager (`block_manager.py`); a minimal allocator sketch follows this list
  - Physical memory allocation
  - Free block tracking
  - Block recycling
- Cache Backend (`cache/paged_attention.py`)
  - KV tensor storage
  - Logical-to-physical block mapping
  - Attention computation orchestration
- Triton Kernels (`kernels/triton_kernels.py`)
  - Paged attention kernel
  - Block copy operations
  - Optimized memory access patterns
- Model Adapter (`models/adapter.py`)
  - Framework-agnostic integration
  - Transparent cache management
  - Generation loop handling
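For intuition about the Block Manager's role, here is a minimal, hypothetical free-list allocator with reference counting; the class and method names are illustrative assumptions, and the real `memory/allocator.py` / `block_manager.py` may be organized differently. Reference counts are also what make Copy-on-Write prefix sharing cheap: forked sequences keep pointing at the same physical blocks until one of them diverges.

```python
# Hypothetical sketch of free-block tracking with reference counts (enables CoW
# prefix sharing); not the actual litecache implementation.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # free-list of physical block ids
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("out of KV cache blocks")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> None:
        # A forked sequence reuses the same physical block (shared prefix).
        self.ref_counts[block] += 1

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)  # recycle for future sequences


allocator = BlockAllocator(num_blocks=4)
b = allocator.allocate()   # new sequence gets a block
allocator.share(b)         # a second sequence shares the same prefix block
allocator.free(b)          # first sequence finishes; block stays alive
allocator.free(b)          # last reference dropped; block returns to the free list
```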
Design goals:

- Extensibility: Abstract base classes for cache backends
- Performance: Triton kernels with PyTorch fallback
- Correctness: Comprehensive test coverage
- Usability: Simple API with sane defaults
Detailed documentation is planned for a /docs folder (coming soon), covering:
- Architecture deep-dive
- API reference
- Performance tuning guide
- Kernel implementation details
This is primarily an educational project, but suggestions and improvements are welcome!
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project draws inspiration from:
- vLLM - PagedAttention implementation
- SGLang - RadixAttention concepts
- FlexFlow - Research foundations
Note: This is a portfolio/educational project demonstrating systems programming and ML optimization skills. For production use cases, consider battle-tested solutions like vLLM or SGLang.