Add bloom filters and disable compression by shefeek-jinnah · Pull Request #98 · hotdata-dev/runtimedb

shefeek-jinnah · 2026-02-04T10:51:33Z

Enable bloom filters for all columns (1% FPP)
Disable ZSTD compression for faster reads
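
As a rough back-of-envelope for what a 1% FPP target costs in space, the sketch below applies the textbook sizing formula for a classic Bloom filter. This is an assumption made for illustration: Parquet uses a split-block Bloom filter, so the actual filter sizes written by this PR may differ. The function name `classic_bloom_parameters` is hypothetical.

```python
import math


def classic_bloom_parameters(fpp: float) -> tuple[float, float]:
    """Return (bits per key, optimal hash count) for a classic Bloom filter.

    Textbook formulas, not the Parquet split-block variant:
      bits_per_key = -ln(fpp) / (ln 2)^2
      hash_count   = bits_per_key * ln 2
    """
    bits_per_key = -math.log(fpp) / (math.log(2) ** 2)
    hash_count = bits_per_key * math.log(2)
    return bits_per_key, hash_count


bits_per_key, hash_count = classic_bloom_parameters(0.01)
print(f"bits per key at 1% FPP: {bits_per_key:.2f}")
print(f"optimal hash count:     {hash_count:.2f}")
```

Under this model a 1% target needs a little under 10 bits per distinct value, which is why enabling filters on every column is usually cheap next to the column data itself.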

Benchmark results (SF=1.0):

Point lookups: 2.8x faster with bloom filters
Range queries: 8% faster without compression

Bloom Filter Performance (TPC-H SF=1.0, ~1GB data)

Point lookup queries on primary keys:

| Query | With Bloom | Without Bloom | Speedup |
|---|---|---|---|
| `SELECT * FROM lineitem WHERE l_orderkey = 12345` | 1.46ms | 4.47ms | 3.1x |
| `SELECT * FROM orders WHERE o_orderkey = 54321` | 1.00ms | 3.85ms | 3.9x |
| `SELECT * FROM customer WHERE c_custkey = 7500` | 2.28ms | 5.28ms | 2.3x |
| `SELECT * FROM lineitem WHERE l_orderkey = 99999 AND l_linenumber = 3` | 1.25ms | 3.27ms | 2.6x |
| Total | 6ms | 17ms | 2.8x |
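
The speedup column can be re-derived from the raw timings. The snippet below is a small check using only the numbers reported in the table; the short query labels and the `speedup` helper are made up for this example.

```python
# (label, with-bloom ms, without-bloom ms), copied from the table above
rows = [
    ("lineitem by l_orderkey", 1.46, 4.47),
    ("orders by o_orderkey", 1.00, 3.85),
    ("customer by c_custkey", 2.28, 5.28),
    ("lineitem by l_orderkey and l_linenumber", 1.25, 3.27),
]


def speedup(with_ms: float, without_ms: float) -> float:
    """How many times faster the bloom-filter run is."""
    return without_ms / with_ms


for label, with_ms, without_ms in rows:
    print(f"{label}: {speedup(with_ms, without_ms):.2f}x")

# The overall figure is the ratio of summed times, not the mean of the ratios.
total_with = sum(with_ms for _, with_ms, _ in rows)
total_without = sum(without_ms for _, _, without_ms in rows)
overall = speedup(total_with, total_without)
print(f"total: {total_with:.2f}ms vs {total_without:.2f}ms -> {overall:.2f}x")
```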

Uncompressed vs ZSTD (TPC-H SF=1.0)

| Metric | Uncompressed | ZSTD | Difference (Uncompressed relative to ZSTD) |
|---|---|---|---|
| Cached read (TPC-H queries) | 182ms | 198ms | 8% faster |
| Write time | 3.81s | 2.92s | 30% slower |
| Disk usage | ~3x larger | baseline | |

Benchmark Results

TPC-H benchmark (SF=1.0, ~1GB data) with all 22 queries. Bloom filters enabled.

Summary

| Configuration | Warmup (Write) | Cached (Read) | vs Baseline |
|---|---|---|---|
| UNCOMPRESSED | 5.94s | 1.168s | baseline |
| LZ4 | 5.72s | 1.189s | +2% slower |
| SNAPPY | 5.75s | 1.226s | +5% slower |
| ZSTD(1) | 6.15s | 1.323s | +13% slower |
| ZSTD(3) | 6.35s | 1.350s | +16% slower |

Per-Query Cached Read Performance (ms)

| Query | Pattern | UNCOMPRESSED | LZ4 | SNAPPY | ZSTD(1) | ZSTD(3) |
|---|---|---|---|---|---|---|
| Q1 | Aggregation | 67 | 65 | 67 | 68 | 69 |
| Q2 | Correlated subquery | 48 | 47 | 49 | 54 | 54 |
| Q3 | 3-way join | 43 | 45 | 44 | 49 | 50 |
| Q4 | EXISTS subquery | 26 | 28 | 28 | 30 | 31 |
| Q5 | 6-way join | 48 | 51 | 50 | 57 | 57 |
| Q6 | Scan + filter | 19 | 19 | 19 | 21 | 21 |
| Q7 | 6-way join + CASE | 75 | 74 | 79 | 81 | 83 |
| Q8 | 8-way join | 59 | 59 | 61 | 69 | 71 |
| Q9 | 6-way join + LIKE | 86 | 89 | 92 | 99 | 99 |
| Q10 | 4-way join | 60 | 63 | 69 | 81 | 82 |
| Q11 | Nested subquery | 34 | 36 | 36 | 37 | 38 |
| Q12 | 3-way join + CASE | 64 | 64 | 65 | 67 | 69 |
| Q13 | LEFT OUTER join | 79 | 88 | 99 | 113 | 112 |
| Q14 | 2-way join | 22 | 22 | 23 | 25 | 26 |
| Q15 | CTE + MAX | 30 | 29 | 31 | 33 | 34 |
| Q16 | NOT IN subquery | 27 | 26 | 26 | 29 | 29 |
| Q17 | AVG subquery | 66 | 70 | 68 | 71 | 74 |
| Q18 | IN + HAVING | 97 | 90 | 92 | 96 | 105 |
| Q19 | OR predicates | 53 | 54 | 54 | 58 | 58 |
| Q20 | IN + EXISTS | 38 | 40 | 44 | 47 | 47 |
| Q21 | EXISTS + NOT EXISTS | 102 | 102 | 100 | 107 | 106 |
| Q22 | NOT EXISTS + SUBSTRING | 26 | 27 | 29 | 30 | 32 |
| Total | | 1.168s | 1.189s | 1.226s | 1.323s | 1.350s |
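
To see which queries drive the gap between configurations, the per-query numbers can be ranked by relative slowdown. The sketch below copies three columns from the table above (UNCOMPRESSED, LZ4, ZSTD(3)); the `slowdown_pct` helper and the variable names are illustrative and not part of the PR.

```python
# query -> (UNCOMPRESSED ms, LZ4 ms, ZSTD(3) ms), copied from the table above
timings = {
    "Q1": (67, 65, 69), "Q2": (48, 47, 54), "Q3": (43, 45, 50),
    "Q4": (26, 28, 31), "Q5": (48, 51, 57), "Q6": (19, 19, 21),
    "Q7": (75, 74, 83), "Q8": (59, 59, 71), "Q9": (86, 89, 99),
    "Q10": (60, 63, 82), "Q11": (34, 36, 38), "Q12": (64, 64, 69),
    "Q13": (79, 88, 112), "Q14": (22, 22, 26), "Q15": (30, 29, 34),
    "Q16": (27, 26, 29), "Q17": (66, 70, 74), "Q18": (97, 90, 105),
    "Q19": (53, 54, 58), "Q20": (38, 40, 47), "Q21": (102, 102, 106),
    "Q22": (26, 27, 32),
}


def slowdown_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percent slower than the baseline (negative means faster)."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Rank queries by how much ZSTD(3) slows them down relative to UNCOMPRESSED.
worst_zstd3 = sorted(
    timings,
    key=lambda q: slowdown_pct(timings[q][0], timings[q][2]),
    reverse=True,
)

for query in worst_zstd3[:3]:
    base, lz4, zstd3 = timings[query]
    print(
        f"{query}: ZSTD(3) {slowdown_pct(base, zstd3):+.1f}%, "
        f"LZ4 {slowdown_pct(base, lz4):+.1f}%"
    )
```

The rounded per-query milliseconds do not sum exactly to the reported totals, so small differences from the Total row are expected.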

| Configuration | Cache Size | Cached Read | Size vs Baseline | Read vs Baseline |
|---|---|---|---|---|
| UNCOMPRESSED | 939 MB | 1.148s | baseline | baseline |
| LZ4 | 657 MB | 1.210s | -30% | +5% slower |
| SNAPPY | 644 MB | 1.306s | -31% | +14% slower |
| ZSTD(1) | 583 MB | 1.344s | -38% | +17% slower |
| ZSTD(3) | 576 MB | 1.319s | -39% | +15% slower |
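
One way to read this table is as a trade-off: how much read time each codec gives up per unit of cache space it saves. The snippet below recomputes both percentage columns from the reported sizes and timings and adds that ratio. The ratio is a heuristic chosen for this example, not a metric used in the PR.

```python
# configuration -> (cache size in MB, cached read in seconds), from the table above
configs = {
    "UNCOMPRESSED": (939, 1.148),
    "LZ4": (657, 1.210),
    "SNAPPY": (644, 1.306),
    "ZSTD(1)": (583, 1.344),
    "ZSTD(3)": (576, 1.319),
}


def pct_change(baseline: float, value: float) -> float:
    """Signed percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


base_size, base_read = configs["UNCOMPRESSED"]
summary = {}
for name, (size_mb, read_s) in configs.items():
    if name == "UNCOMPRESSED":
        continue
    size_pct = pct_change(base_size, size_mb)  # negative: smaller cache
    read_pct = pct_change(base_read, read_s)   # positive: slower reads
    # Percent of read time lost per percent of cache size saved (lower is better).
    cost_per_saving = read_pct / -size_pct
    summary[name] = (size_pct, read_pct, cost_per_saving)
    print(f"{name}: size {size_pct:+.0f}%, read {read_pct:+.0f}%, "
          f"cost per saving {cost_per_saving:.2f}")

best = min(summary, key=lambda name: summary[name][2])
print(f"lowest read cost per unit of space saved: {best}")
```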

zfarrell · 2026-02-04T16:02:49Z

Very cool, thanks @shefeek-jinnah!
A couple thoughts:

What do you think about going through a couple more iterations of compression algorithms and levels? E.g. set the zstd level to 2, 1, ... down to -3 and re-run the test. Also try snappy, etc., and then add them to the comparison tables above. Maybe there's a sweet spot?
Do you think we should make any of this configurable? I'm leaning towards no for now, but I know we'll likely need to add it in the future. Curious about your thoughts.

shefeek-jinnah · 2026-02-06T02:42:59Z

LZ4
Hi @zfarrell
I’m planning to go with LZ4 for compression due to its low CPU overhead and fast decompression, which suits our read-heavy access patterns. I’ll also be adding a Bloom filter to quickly rule out non-existent keys and reduce unnecessary data scans. Please let me know your thoughts.
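
For readers unfamiliar with how a Bloom filter rules out non-existent keys, here is a minimal sketch of the idea. It is a classic Bloom filter written for illustration, not the split-block variant that Parquet stores per column chunk, and the class and method names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Classic Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_keys: int, fpp: float = 0.01):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.num_bits = max(
            8, math.ceil(-expected_keys * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_keys * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key) -> bool:
        # False means the key is definitely absent; True means "maybe present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Pretend one chunk of data holds only the even order keys below 2000.
bloom = BloomFilter(expected_keys=1000, fpp=0.01)
for order_key in range(0, 2000, 2):
    bloom.add(order_key)

if not bloom.might_contain(12345):
    print("key 12345 is definitely absent, so this chunk can be skipped")
else:
    print("key 12345 may be present, so the chunk must be read")
```

A point lookup only pays for reading a chunk when `might_contain` returns true, which is the effect the bloom filter benchmarks measure.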

zfarrell · 2026-02-06T04:36:27Z

Love it! 🚀

shefeek-jinnah requested review from anoop-narang and zfarrell February 4, 2026 10:53

shefeek-jinnah changed the title ~~feat(parquet): add bloom filters and disable compression~~ Add bloom filters and disable compression Feb 4, 2026

shefeek-jinnah force-pushed the shefeek/uncompress_and_bloom_filter branch 2 times, most recently from 0965e7f to 541a2dc Compare February 5, 2026 17:50

feat(parquet): add bloom filters and LZ4 compression

92de061

shefeek-jinnah force-pushed the shefeek/uncompress_and_bloom_filter branch from 541a2dc to 92de061 Compare February 6, 2026 05:17

shefeek-jinnah merged commit 8a64340 into main Feb 6, 2026
7 checks passed

shefeek-jinnah deleted the shefeek/uncompress_and_bloom_filter branch February 6, 2026 05:51

shefeek-jinnah merged 1 commit into main from shefeek/uncompress_and_bloom_filter
