perf(core): Deep dive optimizations for hot path#355

Open
cofin wants to merge 66 commits into main from feat/performance

Conversation

@cofin cofin commented Feb 2, 2026

This PR implements the deep-dive optimizations identified in the core-hotpath-opt flow.

Key Changes

  • Query Cache (_qc_*): LRU cache for prepared statements - bypasses SQL parsing and parameter transformation on repeated queries
  • Micro-caching: Single-slot cache in SQLProcessor to bypass dictionary lookups for repeated queries
  • String Fast Paths: Internal SQL object caching for raw string statements in prepare_statement
  • Parameter Optimization: Optimized SQL.copy to fast-track parameter updates and streamlined parameter fingerprinting
  • Observability: Added an is_idle check that skips instrumentation overhead when observability is disabled
  • Result Construction: Optimized ExecutionResult creation and metadata handling
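
To make the two caching layers concrete, here is a minimal sketch. The QueryCache and CachedQuery names appear later in the commit log, but the fields, sizes, and the single-slot micro-cache shown here are illustrative, not the actual implementation:

```python
from collections import OrderedDict
from typing import Any, NamedTuple


class CachedQuery(NamedTuple):
    """Artifacts reused on a cache hit; the real fields differ."""
    compiled_sql: str
    param_count: int


class QueryCache:
    """Small LRU keyed by the raw SQL string, evicting the oldest entry."""

    def __init__(self, max_size: int = 256) -> None:
        self._entries: OrderedDict[str, CachedQuery] = OrderedDict()
        self._max_size = max_size

    def get(self, sql: str) -> CachedQuery | None:
        entry = self._entries.get(sql)
        if entry is not None:
            self._entries.move_to_end(sql)  # mark as most recently used
        return entry

    def store(self, sql: str, entry: CachedQuery) -> None:
        self._entries[sql] = entry
        self._entries.move_to_end(sql)
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry


class MicroCache:
    """Single-slot cache: skips even the dict lookup when the same SQL repeats back-to-back."""

    __slots__ = ("_sql", "_value")

    def __init__(self) -> None:
        self._sql: str | None = None
        self._value: Any = None

    def lookup(self, sql: str) -> Any:
        return self._value if sql == self._sql else None

    def remember(self, sql: str, value: Any) -> None:
        self._sql, self._value = sql, value
```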

Benchmark Results (10k rows, sqlite)

┏━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Driver ┃ Library    ┃ Scenario          ┃ Time (s) ┃ % Slower vs Raw ┃
┡━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ sqlite │ raw        │ initialization    │   0.0018 │               — │
│ sqlite │ sqlspec    │ initialization    │   0.0035 │           98.2% │
│ sqlite │ sqlalchemy │ initialization    │   0.0852 │         4694.1% │
│ sqlite │ raw        │ write_heavy       │   0.0817 │               — │
│ sqlite │ sqlspec    │ write_heavy       │   0.0773 │           -5.4% │
│ sqlite │ sqlalchemy │ write_heavy       │   0.0493 │          -39.6% │
│ sqlite │ raw        │ read_heavy        │   0.0238 │               — │
│ sqlite │ sqlspec    │ read_heavy        │   0.0393 │           65.1% │
│ sqlite │ sqlalchemy │ read_heavy        │   0.0346 │           45.6% │
│ sqlite │ raw        │ iterative_inserts │   0.0188 │               — │
│ sqlite │ sqlspec    │ iterative_inserts │   0.2947 │         1465.1% │
│ sqlite │ sqlalchemy │ iterative_inserts │   0.4270 │         2167.7% │
│ sqlite │ raw        │ repeated_queries  │   9.1935 │               — │
│ sqlite │ sqlspec    │ repeated_queries  │   9.3915 │            2.2% │
│ sqlite │ sqlalchemy │ repeated_queries  │   9.9289 │            8.0% │
└────────┴────────────┴───────────────────┴──────────┴─────────────────┘

How to interpret these results

| Scenario | What it tests | sqlspec (vs raw) | sqlalchemy (vs raw) |
| --- | --- | --- | --- |
| write_heavy | Bulk insert via execute_many | -5% (faster!) | -40% |
| read_heavy | Bulk read via fetchall | +65% | +46% |
| iterative_inserts | Individual inserts in a loop | +1465% | +2167% |
| repeated_queries | Same SELECT with varying params | +2.2% | +8.0% |

Key insight: The repeated_queries scenario shows the query cache in action. When the same SQL statement is executed repeatedly with different parameters:

  1. First execution: Full parsing, parameter transformation, and statement preparation
  2. Subsequent executions: Cache hit → skip parsing → directly bind new parameters

This reduces sqlspec's overhead from ~1500% (iterative inserts) to just ~2% (repeated queries).
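
A minimal usage sketch of the repeated_queries pattern; only the session.execute() entry point is taken from the PR description, the surrounding setup and the query itself are hypothetical:

```python
# Hypothetical query and session; only session.execute() is from the PR text.
sql = "SELECT name, value FROM items WHERE id = ?"

for item_id in range(10_000):
    # Iteration 1: full parse, parameter transform, statement build.
    # Iterations 2+: query-cache hit, only the new parameters are bound.
    result = session.execute(sql, (item_id,))
```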

Why iterative_inserts is slow

Each call to session.execute() must:

  • Parse the SQL string
  • Transform parameters to native format
  • Build the Statement object
  • Execute and build result

For bulk operations, use execute_many() which amortizes this cost across all rows.
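
A sketch contrasting the two patterns, assuming the execute()/execute_many() method names used elsewhere in this PR; the table and columns are made up:

```python
rows = [(i, f"name-{i}") for i in range(10_000)]

# Slow: every call re-enters the parse/transform/build pipeline.
for row in rows:
    session.execute("INSERT INTO items (id, name) VALUES (?, ?)", row)

# Fast: one pass through the pipeline; parameters are bound per row by the driver.
session.execute_many("INSERT INTO items (id, name) VALUES (?, ?)", rows)
```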

Benchmark Tooling

Added scripts/bench.py (originally from @euri10's PR #354) with enhancements:

uv run python scripts/bench.py --driver sqlite --rows 10000

Scenarios:

  • initialization - Connection and table setup overhead
  • write_heavy - Bulk insert via execute_many
  • read_heavy - Bulk insert + fetchall
  • iterative_inserts - Individual execute calls in a loop
  • repeated_queries - Single-row queries with varying params (tests query cache)

@cofin changed the title from "perf(core): Deep dive optimizations for hot path (~42% faster)" to "perf(core): Deep dive optimizations for hot path" on Feb 3, 2026
cofin added 29 commits February 3, 2026 15:40
- Add internal SQL object cache for string statements
- Optimize SQL.copy to bypass initialization
- Implement micro-cache in SQLProcessor for repeated queries
- Optimize observability idle check
- Streamline parameter processing and result construction
- Remove unnecessary dict() copy in _unpack_parse_cache_entry
- Remove expression.copy() on parse cache store (only copy on retrieve when needed)
- Defer expression.copy() to _apply_ast_transformers when transformers active
- Fast type dispatch (type(x) is dict) vs ABC isinstance checks
- Remove sorted() for dict keys in structural fingerprinting (use insertion order)
- Cache is_idle check in ObservabilityRuntime (lifecycle/observers immutable)
- Use frozenset intersection for parameter char detection in validator
- Optimize ParameterProfile.styles computation for single-style case
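
For illustration, the fast-type-dispatch item above looks roughly like this; the function and its behaviour are illustrative, not the actual sqlspec code:

```python
from collections.abc import Mapping


def _coerce_params(params):
    """Illustrative dispatch: exact dict/tuple instances are the overwhelmingly common case."""
    if type(params) is dict:            # fast path: exact type check, no ABC machinery
        return params
    if type(params) is tuple:
        return params
    if isinstance(params, Mapping):     # slow path: subclasses and ABC-registered types
        return dict(params)
    return tuple(params)                # any other iterable of positional parameters
```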

Benchmark (10,000 INSERTs):
- Before: ~20x slowdown vs raw sqlite3
- After: ~15.5x slowdown (tuple params), ~18.8x (dict params)
- Function calls reduced: 1.33M → 1.18M (11% fewer)
- isinstance() calls reduced: 280k → 200k (28% fewer)
Add benchmark functions to isolate SQLGlot overhead:
- bench_sqlite_sqlglot: Cached SQL (minimal overhead)
- bench_sqlite_sqlglot_copy: expression.copy() per call
- bench_sqlite_sqlglot_nocache: .sql() regeneration per call

These help identify whether overhead comes from SQLGlot
parsing/generation vs SQLSpec's own processing.

Key findings:
- SQLGlot cached parsing adds ~0% overhead
- expression.copy() per call: 16x overhead (synthetic)
- SQLSpec actual overhead: distributed across pipeline
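
The three variants can be reproduced in isolation with the public sqlglot API; this is a rough standalone sketch, not the bench_sqlite_sqlglot* functions from scripts/bench.py:

```python
import timeit

import sqlglot

QUERY = "SELECT id, name FROM items WHERE id = ?"
expression = sqlglot.parse_one(QUERY, read="sqlite")
cached_sql = expression.sql(dialect="sqlite")

# Variant 1: fully cached SQL string (what the query cache effectively buys us).
t_cached = timeit.timeit(lambda: cached_sql, number=100_000)

# Variant 2: defensive expression.copy() on every call, measured in isolation.
t_copy = timeit.timeit(lambda: expression.copy(), number=100_000)

# Variant 3: regenerate the SQL text from the AST on every call.
t_regen = timeit.timeit(lambda: expression.sql(dialect="sqlite"), number=100_000)

print(f"cached={t_cached:.4f}s  copy={t_copy:.4f}s  regen={t_regen:.4f}s")
```
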
cofin added 16 commits February 3, 2026 15:40
- Updated type hints to use the new syntax for union types in driver.py, _async.py, and _common.py.
- Improved readability by formatting long lines and breaking them into multiple lines in driver.py and _common.py.
- Removed unnecessary comments and cleaned up import statements in config.py and typing.py.
- Enhanced exception handling in AsyncMigrationCommands to use async input for user confirmation.
- Refactored logic in CorrelationExtractor to simplify return statements.
- Updated the write_fixture_async function to use AsyncPath for resolving paths asynchronously.
- Improved test readability and consistency in test_sync_adapters.py and test_fast_path.py by formatting long lines.
- Create new sqlspec/driver/_query_cache.py module
- Move CachedQuery namedtuple and QueryCache class
- Rename _QueryCache to QueryCache (now public)
- Rename _FAST_PATH_QUERY_CACHE_SIZE to QC_MAX_SIZE
- Add clear() and __len__() methods to QueryCache
- Update test imports
- Remove unused OrderedDict import from _common.py

Part of driver-arch-cleanup PRD, Chapter 1: qc-extract
Attribute renames:
- _fast_path_binder → _qc_binder
- _fast_path_enabled → _qc_enabled
- _query_cache → _qc

Method renames:
- _update_fast_path_flag → _update_qc_flag
- _fast_rebind → qc_rebind
- _build_fast_statement → qc_build
- _try_cached_compiled → qc_lookup
- _execute_compiled → qc_execute
- _maybe_cache_fast_path → qc_store
- _configure_fast_path_binder → _configure_qc_binder

Test file renamed: test_fast_path.py → test_query_cache.py

Part of driver-arch-cleanup PRD, Chapter 2: qc-rename

Move eligibility checks and preparation logic from qc_lookup into new
qc_prepare method in _common.py. This eliminates ~15 lines of duplicated
logic between sync and async implementations.

Before: qc_lookup in both _common.py and _async.py contained identical
eligibility checking, cache lookup, rebinding, and statement building.

After: qc_prepare does all preparation work, qc_lookup becomes a thin
wrapper that calls qc_prepare then qc_execute.

Chapter 3 of driver-arch-cleanup_20260203 PRD.
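
The shape of that refactor, with illustrative signatures; the real methods live on the driver base class in _common.py and use its attributes:

```python
class _QcShapeSketch:
    """Shape only; not the actual sqlspec driver base class."""

    def qc_prepare(self, sql: str, parameters: tuple):
        """Shared preparation: enabled check, cache lookup, parameter rebind."""
        if not self._qc_enabled:
            return None
        cached = self._qc.get(sql)
        if cached is None or cached.param_count != len(parameters):
            return None
        return self.qc_rebind(cached, parameters)

    def qc_lookup(self, sql: str, parameters: tuple):
        """Thin wrapper: prepare, then delegate to the driver-specific executor."""
        prepared = self.qc_prepare(sql, parameters)
        return None if prepared is None else self.qc_execute(prepared)
```
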
cofin and others added 13 commits February 3, 2026 21:42
Move eligibility validation from qc_prepare (hot lookup path) to
qc_store (store path, executed once per unique query).

Before: qc_prepare had 6 condition checks including needs_static_script_compilation
and many-params guard.

After: qc_prepare has only 2 essential checks:
1. _qc_enabled flag
2. cache lookup + param count match

All detailed validation happens at store time, ensuring only valid
queries enter the cache in the first place.

Chapter 4 of driver-arch-cleanup_20260203 PRD.
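
A sketch of the resulting split: detailed validation runs once at store time, so the hot lookup path only needs the enabled flag and a parameter-count match. MAX_CACHED_PARAMS and the statement attributes are illustrative placeholders, though is_script, is_many, and needs_static_script_compilation are named in the commits:

```python
MAX_CACHED_PARAMS = 32  # hypothetical guard; the real threshold lives in the driver code


def qc_store(self, sql: str, statement, parameters) -> None:
    """Gatekeeper sketch: only statements that are safe to replay enter the cache."""
    if statement.is_script or statement.is_many:
        return  # scripts and executemany batches are never cached
    if statement.needs_static_script_compilation:
        return
    if len(parameters) > MAX_CACHED_PARAMS:
        return  # many-params guard, applied once at store time
    self._qc.store(sql, statement)  # cache whatever prepared artifacts the driver replays
```
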
The base class _qc_execute now handles the full fast-path execution:
- Removed SqliteDriver.qc_execute (redundant with base class)
- Removed AiosqliteDriver.qc_execute (redundant with base class)
- Renamed qc_lookup -> _qc_lookup (internal API)
- Added unreachable assertion to _qc_execute (all paths return/raise)
- Fixed return type cast in execute() fast-path

The `is_script`/`is_many` branches were dead code since _qc_store
filters them out before caching.
Add comprehensive benchmark tooling originally contributed by euri10
in PR #354, with enhancements for testing query cache effectiveness.

Scenarios:
- initialization: Connection and table setup overhead
- write_heavy: Bulk insert performance (execute_many)
- read_heavy: Bulk read with fetchall
- repeated_queries: Single-row queries with varying params (tests _qc_*)

Compares: raw driver vs sqlspec vs SQLAlchemy
Drivers: sqlite (asyncpg requires PostgreSQL server)

Usage:
  uv run python scripts/bench.py --driver sqlite --rows 10000

Co-authored-by: euri10 <benoit.barthelet@gmail.com>
- Remove SQLSPEC_RS_INSTALLED flag and get_sqlspec_rs() from _typing.py
- Remove _configure_qc_binder() method and calls from config.py
- Remove _qc_binder attribute and fast_path_binder handling from driver
- Simplify qc_rebind() to use Python-only parameter binding
- Fix anyio.to_thread.run_sync pyright errors in migrations
- Fix _fast_path_enabled -> _qc_enabled rename in tests
- Remove test_cached_compiled_binder_override test (tested removed feature)

The query cache (_qc_*) optimizations remain fully functional - only the
speculative Rust binder hook was removed until sqlspec_rs is ready.
Add aiosqlite scenarios to benchmark script:
- initialization, write_heavy, read_heavy
- iterative_inserts, repeated_queries
- raw aiosqlite, sqlspec, and sqlalchemy variants

Note: Revealed a bug in sqlspec aiosqlite pool - connections are not
properly isolated between different database paths. See issue tracking.
- Fix "table already exists" errors by ensuring pools are closed
  before temp files are deleted
- Add leak detection helper `_check_pool_leak()` to detect
  connection leaks in benchmarks
- Use `delete=False` with NamedTemporaryFile and manually unlink
  after pool.close_pool() to ensure proper cleanup order
- Add DROP_TEST_TABLE to all aiosqlite scenarios for consistency

Closes #360
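
A minimal sketch of the cleanup ordering, using only stdlib tempfile/os; the pool and scenario helpers are hypothetical, and async pool closing is shown synchronously for brevity:

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix=".db", delete=False)
tmp.close()                     # keep only the path; the file must outlive this handle
db_path = tmp.name

try:
    pool = make_pool(db_path)   # hypothetical benchmark helper
    run_scenarios(pool)         # hypothetical benchmark helper
    pool.close_pool()           # close every connection BEFORE removing the file
finally:
    os.unlink(db_path)          # now safe: nothing still holds the file open
```
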
- Add fast path for recently-used connections (skip full health check)
- Inline mark_as_in_use/mark_as_idle to reduce method call overhead
- Skip asyncio.wait_for wrapper on acquire when connection is available
- Skip timeout wrapper on release rollback (SQLite rollback is fast)
- Check pool capacity without lock first before acquiring lock
- Check closed state directly instead of through property

Also add --pool-size parameter to benchmark CLI for testing different
pool configurations.

Results (repeated_queries with 1000 rows):
- Before: 95.7% slower than raw
- After:  43.9% slower than raw (2.2x improvement)
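
A rough sketch of the acquire fast path with illustrative names: inside a single asyncio event loop there is no await between the idle-deque check and the pop, so no other coroutine can interleave, which is what makes skipping the lock and the asyncio.wait_for wrapper safe:

```python
import asyncio
from collections import deque


class TinyPool:
    """Illustrative queue-based pool, not the sqlspec aiosqlite pool."""

    def __init__(self) -> None:
        self._idle: deque = deque()
        self._lock = asyncio.Lock()

    async def acquire(self, timeout: float = 5.0):
        if self._idle:                      # fast path: a connection is ready now,
            return self._idle.popleft()     # so skip the lock and the wait_for wrapper
        async with self._lock:              # slow path: open a new connection or wait
            return await asyncio.wait_for(self._open_or_wait(), timeout)

    async def _open_or_wait(self):
        ...  # placeholder: create a connection or wait for a release
```
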
- Add raw, sqlspec, and sqlalchemy duckdb scenarios for all 5 benchmarks
- Fix temp file handling for duckdb (needs to create file itself)
- Add duckdb_engine lazy import for sqlalchemy compatibility
- Confirms duckdb pool is already efficient (thread-local design)

Results show duckdb sqlspec overhead is 3-12% vs raw driver,
compared to 20-30% for aiosqlite after optimization.
Thread-local pools (sqlite, duckdb) don't need the same
hot-path optimization as queue-based pools (aiosqlite).
- Move duckdb-engine from dev to benchmarks group
- Add aiosqlite to benchmarks group for async benchmark scenarios
- dev group includes benchmarks via include-group