A Python library for memory-efficient file reading through speculative precaching. Instead of loading entire datasets into memory, this framework uses a user-defined predictive function to anticipate data access patterns and proactively read and cache the items most likely to be needed next.
This library is designed for scenarios where you need to process large files or datasets sequentially or with predictable access patterns, but cannot or do not want to load everything into memory at once.
Primary use case: Video frame analysis and MOT (Multiple Object Tracking) data processing, where users typically navigate through frames sequentially but may jump to specific positions. The library includes optimized providers and predictors for this scenario.
Other applications: any situation where you can predict future data access patterns, such as time series analysis, log file processing, document processing pipelines, or any sequential workload where memory usage needs to be controlled. Because data providers and access predictors are pluggable, the library adapts to a wide range of scenarios.
Rather than reactive caching (loading data only after it's requested), this system implements speculative precaching:
- Predict: Uses a user-defined predictive function to identify the most likely next items
- Prefetch: Loads predicted data in a background thread before it's needed
- Serve: Returns cached data when requested (if prediction was correct) or loads synchronously as fallback
- Manage: Automatically evicts old data to maintain memory limits
from dynamic_prefetching_cache.predictors import DynamicDataPredictor
from dynamic_prefetching_cache.providers import MOTDataProvider
from dynamic_prefetching_cache.cache import DynamicPrefetchingCache
provider = MOTDataProvider("examples/data/large_data.txt")  # Note: use scripts/generate_large_mot_data.py to generate this file
predictor = DynamicDataPredictor(possible_jumps=[-5, -1, 1, 5, 15])
# Create cache with automatic resource management
with DynamicPrefetchingCache(provider, predictor, max_keys_cached=512) as cache:
    for key in range(100):
        data = cache.get(key)  # Returns immediately if prefetched, else loads synchronously
        print(data)

    # Monitor performance
    stats = cache.stats()
    print(f"Hit rate: {stats['hits'] / (stats['hits'] + stats['misses']):.2%}")

Implement this interface to connect your data source:
from typing import Any

class MyDataProvider:
    def load(self, key: int) -> Any:
        """Load data for the given key. Must be thread-safe."""
        # May be reading line(s) from a file or database
        # May be parsing a line from a text file
        # See src/dynamic_prefetching_cache/providers.py for an example
        return fetch_from_file_or_database(key)

    def get_available_frames(self) -> set[int]:
        """Return set of valid keys."""
        return {1, 2, 3, 4, 5}

    def get_total_frames(self) -> int:
        """Return total number of available keys."""
        return 5

    def get_stats(self) -> dict:
        """Return provider statistics."""
        return {"status": "ok"}
Implement this interface to define prediction logic:

class MyAccessPredictor:
    def get_likelihoods(self, current_key: int, history: list[int]) -> dict[int, float]:
        """Return likelihood scores for potential next keys."""
        return {
            current_key + 1: 0.8,  # High likelihood
            current_key + 2: 0.3,  # Medium likelihood
            current_key - 1: 0.1,  # Low likelihood
        }
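For example, a minimal custom predictor that assumes mostly forward, sequential access could assign exponentially decaying likelihoods to the next few keys. A sketch (the class name and parameters are illustrative, not part of the library):

class SequentialDecayPredictor:
    """Hypothetical predictor for mostly forward, sequential access."""

    def __init__(self, lookahead: int = 5, decay: float = 0.5):
        self.lookahead = lookahead
        self.decay = decay

    def get_likelihoods(self, current_key: int, history: list[int]) -> dict[int, float]:
        # key+1 gets likelihood 1.0, key+2 gets 0.5, key+3 gets 0.25, ...
        likelihoods = {
            current_key + step: self.decay ** (step - 1)
            for step in range(1, self.lookahead + 1)
        }
        # Keep a small weight on stepping one key backwards
        likelihoods[current_key - 1] = 0.1
        return likelihoods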
The library includes several ready-to-use predictors:

- DistanceDecayPredictor: Simple distance-based prediction with configurable decay rates
- DynamicDistanceDecayPredictor: Forward-biased predictor optimized for media playback
- DynamicDataPredictor: Advanced predictor with jump detection and history analysis
from dynamic_prefetching_cache.predictors import DynamicDataPredictor
# Optimized for video/media navigation patterns
predictor = DynamicDataPredictor(
    possible_jumps=[-15, -5, -1, 1, 5, 15, 30],  # Common seek distances
    forward_bias=2.0,     # Favor forward progression
    jump_boost=5.0,       # Boost exact jump targets
    proximity_boost=2.0   # Boost areas near jump targets
)

High-performance provider for MOT (Multiple Object Tracking) data files:
from dynamic_prefetching_cache.providers import MOTDataProvider
# Optimized for MOT format files with built-in indexing and direct seeking
provider = MOTDataProvider('data/tracking_results.txt')
# Includes comprehensive statistics
stats = provider.get_stats()
print(f"Indexed frames: {stats['total_frames']}, avg direct load: {stats['avg_direct_load_time']:.6f}s")cache = DynamicPrefetchingCache(
    provider=my_provider,
    predictor=my_predictor,
    max_keys_cached=1000,                  # Maximum items in cache
    max_keys_prefetched=8,                 # Max queued prefetch keys (queue size)
    history_size=30,                       # Access history for prediction
    eviction_policy=EvictionPolicyOldest,  # Cache eviction strategy
    on_event=my_event_handler              # Optional event monitoring
)

Monitor cache operations for debugging and optimization:
import logging

logger = logging.getLogger(__name__)

def handle_cache_events(event_name: str, **kwargs):
    if event_name == 'prefetch_error':
        logger.warning(f"Prefetch failed for key {kwargs['key']}: {kwargs['error']}")
    elif event_name == 'cache_evict':
        logger.debug(f"Evicted key {kwargs['key']} from cache")
cache = DynamicPrefetchingCache(provider, predictor, on_event=handle_cache_events)

Available events: cache_load_start/complete/error, prefetch_start/success/error, cache_evict, worker_error
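For instance, when tuning prefetch behaviour it can help to simply tally how often each event fires. A small sketch (the Counter-based handler is illustrative, not part of the library):

from collections import Counter

event_counts = Counter()

def count_cache_events(event_name: str, **kwargs):
    # Count every event by name, e.g. to compare prefetch successes vs. errors
    event_counts[event_name] += 1

cache = DynamicPrefetchingCache(provider, predictor, on_event=count_cache_events)
# ... exercise the cache ...
print(event_counts.most_common())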
stats = cache.stats()
print(f"Cache hits: {stats['hits']}")
print(f"Cache misses: {stats['misses']}")
print(f"Hit rate: {stats['hits'] / (stats['hits'] + stats['misses']):.2%}")
print(f"Queued prefetch keys: {stats['active_prefetch_tasks']}")get()method is thread-safe for concurrent access- Background worker thread handles all prefetching operations
- close() method ensures clean resource cleanup
- All internal state is properly synchronized
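Because get() is safe to call concurrently, several worker threads can share a single cache instance. A minimal sketch, assuming a provider and predictor set up as in the earlier examples:

from concurrent.futures import ThreadPoolExecutor

with DynamicPrefetchingCache(provider, predictor, max_keys_cached=256) as cache:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each worker calls cache.get(); the cache's own background thread
        # keeps prefetching based on the shared access history.
        results = list(pool.map(cache.get, range(100)))

print(f"Loaded {len(results)} items across 4 worker threads")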
from dynamic_prefetching_cache.cache import DynamicPrefetchingCache
from dynamic_prefetching_cache.predictors import DynamicDataPredictor
from dynamic_prefetching_cache.providers import MOTDataProvider
# Set up for video frame analysis
provider = MOTDataProvider('examples/data/example_data.txt')
predictor = DynamicDataPredictor(possible_jumps=[-5, -1, 1, 5, 15])
with DynamicPrefetchingCache(provider, predictor, max_keys_cached=200) as cache:
    for frame_id in range(100):
        detections = cache.get(frame_id)
        print(f"Frame {frame_id}: {len(detections.detections)} objects detected")

Run the included examples:

python examples/visable_example.py
python examples/profile_example.py

# Install from PyPI (when published)
pip install dynamic-prefetching-cache
# Or install from source
git clone <repository-url>
cd dynamic-prefetching-cache
pip install -e .
# Install with development dependencies
pip install -e ".[dev]"
# Install with example dependencies (for GUI demo)
pip install -e ".[examples]"- Implement DataProvider - Connect to your data source
- Choose or implement AccessPredictor - Define prediction logic for your use case
- Configure cache parameters - Set memory limits and prefetch behavior
- Use cache.get(key) - The system handles prefetching automatically
The library abstracts away the complexity of memory management, concurrent prefetching, and prediction logic, allowing you to focus on your core data processing tasks.
The repository includes a script to generate realistic MOT (Multiple Object Tracking) format test data for testing and development. This eliminates the need to download or upload large data files.
# Generate small test file (100 tracks, 1000 frames, ~1MB)
python scripts/generate_large_mot_data.py -o examples/data/test_data.txt -t 100 -f 1000
# Generate medium test file (500 tracks, 10000 frames, ~50MB)
python scripts/generate_large_mot_data.py -o examples/data/medium_data.txt -t 500 -f 10000
# Generate large test file (1000 tracks, 100000 frames, ~500MB)
python scripts/generate_large_mot_data.py -o examples/data/large_data.txt -t 1000 -f 100000
# Custom data generation with full options:
#   --tracks            number of object tracks
#   --frames            number of frames
#   --width / --height  image size in pixels
#   --min-track-length  minimum track duration
#   --max-track-length  maximum track duration
#   --seed              random seed for reproducibility
python scripts/generate_large_mot_data.py \
    --output examples/data/custom_data.txt \
    --tracks 200 \
    --frames 5000 \
    --width 1920 \
    --height 1080 \
    --min-track-length 10 \
    --max-track-length 200 \
    --seed 42