# True Async Paging in async-cassandra

## Key Concepts

### 1. Always Use Context Managers (CRITICAL)

```python
# ✅ CORRECT - Prevents resource leaks
async with await session.execute_stream("SELECT * FROM table") as result:
    async for row in result:
        await process_row(row)

# ❌ WRONG - Will leak resources!
result = await session.execute_stream("SELECT * FROM table")
async for row in result:  # Missing context manager!
    await process_row(row)
```

### 2. How Paging Actually Works

The Cassandra driver implements **true streaming** with these characteristics:

- **On-Demand Fetching**: Pages are fetched as you consume data, NOT all at once
- **Async Fetching**: While you process page N, the driver can fetch page N+1
- **Memory Efficient**: Only one page is held in memory at a time
- **No Pre-Fetching of All Data**: The driver never loads the entire result set
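
The on-demand behavior above can be sketched with a toy async generator. This is purely illustrative Python, not the real driver or the async-cassandra API: `fetch_page` is a hypothetical stand-in for one network round trip.

```python
import asyncio

# Illustrative sketch only: mimics on-demand paging. `fetch_page` is a
# hypothetical stand-in for one network round trip to the database.
async def fetch_page(page_num: int, fetch_size: int) -> list[str]:
    await asyncio.sleep(0)  # pretend network latency
    return [f"row-{page_num}-{i}" for i in range(fetch_size)]

async def stream_rows(total_pages: int, fetch_size: int):
    for page_num in range(total_pages):
        # The next page is requested only after the previous one is consumed,
        # so at most one page's worth of rows is alive at a time.
        page = await fetch_page(page_num, fetch_size)
        for row in page:
            yield row

async def main() -> list[str]:
    return [row async for row in stream_rows(total_pages=3, fetch_size=2)]

print(asyncio.run(main()))
# → ['row-0-0', 'row-0-1', 'row-1-0', 'row-1-1', 'row-2-0', 'row-2-1']
```

The real driver additionally overlaps fetching page N+1 with processing page N; the lazy-fetch shape is the same.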

### 3. Page Size Recommendations

```python
# Small pages (1000-5000 rows)
# ✅ Best for: Real-time processing, low memory usage, better responsiveness
# ❌ Trade-off: More network round trips
config = StreamConfig(fetch_size=1000)

# Medium pages (5000-10000 rows)
# ✅ Best for: General-purpose use, good balance
config = StreamConfig(fetch_size=5000)

# Large pages (10000-50000 rows)
# ✅ Best for: Bulk exports, batch processing, fewer round trips
# ❌ Trade-off: Higher memory usage, slower first results
config = StreamConfig(fetch_size=20000)
```

### 4. LIMIT vs Paging

**You don't need LIMIT with paging!**

```python
# ❌ UNNECESSARY - fetch_size already controls data flow
stmt = await session.prepare("SELECT * FROM users LIMIT ?")
async with await session.execute_stream(stmt, [1000]) as result:
    # This limits total results, not page size!
    ...

# ✅ CORRECT - Let paging handle the data flow
stmt = await session.prepare("SELECT * FROM users")
config = StreamConfig(fetch_size=1000)  # This controls page size
async with await session.execute_stream(stmt, stream_config=config) as result:
    # Process all data efficiently, page by page
    ...
```

### 5. Processing Patterns

#### Row-by-Row Processing
```python
# Process each row as it arrives
async with await session.execute_stream("SELECT * FROM large_table") as result:
    async for row in result:
        await process_row(row)  # Non-blocking, pages fetched as needed
```

#### Page-by-Page Processing
```python
# Process entire pages at once (e.g., for batch operations)
config = StreamConfig(fetch_size=5000)
async with await session.execute_stream("SELECT * FROM large_table", stream_config=config) as result:
    async for page in result.pages():
        # Process an entire page (a list of rows)
        await bulk_insert_to_warehouse(page)
```

### 6. Common Misconceptions

**Myth**: "The driver pre-fetches all pages"
**Reality**: Pages are fetched on demand as you consume data

**Myth**: "I need LIMIT to control memory usage"
**Reality**: `fetch_size` controls memory usage; LIMIT only caps the total number of results

**Myth**: "Larger pages are always better"
**Reality**: It depends on your use case; see the recommendations above

**Myth**: "I can skip the context manager"
**Reality**: Context managers are MANDATORY to prevent resource leaks

### 7. Performance Tips

1. **Match fetch_size to your processing speed**
   - Fast processing → larger pages
   - Slow processing → smaller pages

2. **Use page callbacks for monitoring**
   ```python
   config = StreamConfig(
       fetch_size=5000,
       page_callback=lambda page_num, total_rows:
           logger.info(f"Processing page {page_num}, total: {total_rows:,}")
   )
   ```

3. **Consider network latency**
   - High latency → larger pages (fewer round trips)
   - Low latency → smaller pages are fine

4. **Monitor memory usage**
   - Each page holds up to `fetch_size` rows in memory
   - Adjust based on row size and available memory
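
Tip 4 lends itself to a back-of-the-envelope estimate. The sketch assumes you know an approximate average encoded row size (e.g. from sampling a few rows); `estimated_page_bytes` is a hypothetical helper, not part of async-cassandra.

```python
# Rough per-page memory estimate: each in-flight page holds ~fetch_size rows.
def estimated_page_bytes(fetch_size: int, avg_row_bytes: int) -> int:
    return fetch_size * avg_row_bytes

# Example: 20,000-row pages with ~1 KiB rows ≈ 19.5 MiB held per page.
mib = estimated_page_bytes(20_000, 1024) / (1024 * 1024)
print(f"~{mib:.1f} MiB per page")
# → ~19.5 MiB per page
```

If that figure is uncomfortable for your available memory, lower `fetch_size` rather than reaching for LIMIT.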