Add bloom filters and disable compression #98

Merged: shefeek-jinnah merged 1 commit into main from shefeek/uncompress_and_bloom_filter on Feb 6, 2026

@shefeek-jinnah shefeek-jinnah commented Feb 4, 2026

  • Enable bloom filters for all columns (1% FPP)
  • Disable ZSTD compression for faster reads
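
For a rough sense of what a 1% false-positive probability (FPP) costs in space, the textbook Bloom filter sizing formulas can be evaluated directly. This is only a sketch: Parquet's split-block Bloom filters are sized somewhat differently, so the numbers are an order-of-magnitude estimate, and the function names are illustrative rather than taken from this PR.

```python
import math


def bloom_bits_per_key(fpp: float) -> float:
    """Optimal bits per key for a classic Bloom filter: -ln(p) / (ln 2)^2."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be in (0, 1)")
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_num_hashes(fpp: float) -> int:
    """Optimal number of hash functions: (bits per key) * ln 2, rounded."""
    return max(1, round(bloom_bits_per_key(fpp) * math.log(2)))


def bloom_size_bytes(num_distinct: int, fpp: float) -> int:
    """Approximate filter size in bytes for `num_distinct` keys."""
    return math.ceil(num_distinct * bloom_bits_per_key(fpp) / 8)


if __name__ == "__main__":
    fpp = 0.01  # the 1% FPP used in this PR
    print(f"bits per key: {bloom_bits_per_key(fpp):.2f}")
    print(f"hash functions: {bloom_num_hashes(fpp)}")
    # Hypothetical column chunk with one million distinct values.
    print(f"bytes for 1M keys: {bloom_size_bytes(1_000_000, fpp)}")
```

At 1% FPP this works out to a little under 10 bits per distinct value, which is why enabling filters on every column adds measurable but modest file overhead.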

Benchmark results (SF=1.0):

  • Point lookups: 2.8x faster with bloom filters
  • Range queries: 8% faster without compression

Bloom Filter Performance (TPC-H SF=1.0, ~1GB data)

Point lookup queries on primary keys:

| Query | With Bloom | Without Bloom | Speedup |
|---|---|---|---|
| `SELECT * FROM lineitem WHERE l_orderkey = 12345` | 1.46ms | 4.47ms | 3.1x |
| `SELECT * FROM orders WHERE o_orderkey = 54321` | 1.00ms | 3.85ms | 3.9x |
| `SELECT * FROM customer WHERE c_custkey = 7500` | 2.28ms | 5.28ms | 2.3x |
| `SELECT * FROM lineitem WHERE l_orderkey = 99999 AND l_linenumber = 3` | 1.25ms | 3.27ms | 2.6x |
| Total | 6ms | 17ms | 2.8x |
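
The mechanism behind these point-lookup speedups can be sketched with a toy model: each row group carries a Bloom filter over its keys, and a lookup only scans groups whose filter says the key might be present. Everything below (the `BloomFilter` class, the SHA-256 double hashing, the list-based "row groups") is a simplified assumption for illustration, not the Parquet reader's actual implementation.

```python
import hashlib


class BloomFilter:
    """Classic Bloom filter using double hashing over one SHA-256 digest."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: int):
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_row_groups(keys, group_size):
    """Split keys into groups, each paired with a filter (~10 bits per key)."""
    groups = []
    for start in range(0, len(keys), group_size):
        chunk = keys[start:start + group_size]
        bloom = BloomFilter(num_bits=10 * len(chunk), num_hashes=7)
        for key in chunk:
            bloom.add(key)
        groups.append((bloom, chunk))
    return groups


def point_lookup(groups, key):
    """Return (matching keys, number of row groups actually scanned)."""
    hits, scanned = [], 0
    for bloom, chunk in groups:
        if not bloom.might_contain(key):
            continue  # filter rules the key out: skip this group entirely
        scanned += 1
        hits.extend(k for k in chunk if k == key)
    return hits, scanned


if __name__ == "__main__":
    groups = build_row_groups(list(range(20_000)), group_size=2_000)
    hits, scanned = point_lookup(groups, 12345)
    print(f"found {hits}, scanned {scanned} of {len(groups)} row groups")
```

A filter never produces false negatives, so the group that holds the key is always scanned; the saving comes from the other groups, which are skipped except for occasional false positives.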

Uncompressed vs ZSTD (TPC-H SF=1.0)

| Metric | Uncompressed | ZSTD | Difference |
|---|---|---|---|
| Cached read (TPC-H queries) | 182ms | 198ms | 8% faster |
| Write time | 3.81s | 2.92s | 30% slower |
| Disk usage | ~3x larger | baseline | |
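
The "Difference" figures are consistent with measuring the uncompressed run relative to ZSTD as the baseline. A small check, assuming that reading of the table (the helper name `percent_change` is illustrative):

```python
def percent_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Uncompressed vs. ZSTD, numbers taken from the comparison above.
read_change = percent_change(182, 198)      # cached read, ms
write_change = percent_change(3.81, 2.92)   # write time, s

print(f"cached read: {read_change:+.1f}%")   # cached read: -8.1%
print(f"write time:  {write_change:+.1f}%")  # write time:  +30.5%
```

Note that the later tables in this description use UNCOMPRESSED as the baseline instead, so their percentages are not directly comparable with these.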

Benchmark Results

TPC-H benchmark (SF=1.0, ~1GB data) with all 22 queries. Bloom filters enabled.

Summary

| Configuration | Warmup (Write) | Cached (Read) | vs Baseline |
|---|---|---|---|
| UNCOMPRESSED | 5.94s | 1.168s | baseline |
| LZ4 | 5.72s | 1.189s | +2% slower |
| SNAPPY | 5.75s | 1.226s | +5% slower |
| ZSTD(1) | 6.15s | 1.323s | +13% slower |
| ZSTD(3) | 6.35s | 1.350s | +16% slower |

Per-Query Cached Read Performance (ms)

| Query | Pattern | UNCOMPRESSED | LZ4 | SNAPPY | ZSTD(1) | ZSTD(3) |
|---|---|---|---|---|---|---|
| Q1 | Aggregation | 67 | 65 | 67 | 68 | 69 |
| Q2 | Correlated subquery | 48 | 47 | 49 | 54 | 54 |
| Q3 | 3-way join | 43 | 45 | 44 | 49 | 50 |
| Q4 | EXISTS subquery | 26 | 28 | 28 | 30 | 31 |
| Q5 | 6-way join | 48 | 51 | 50 | 57 | 57 |
| Q6 | Scan + filter | 19 | 19 | 19 | 21 | 21 |
| Q7 | 6-way join + CASE | 75 | 74 | 79 | 81 | 83 |
| Q8 | 8-way join | 59 | 59 | 61 | 69 | 71 |
| Q9 | 6-way join + LIKE | 86 | 89 | 92 | 99 | 99 |
| Q10 | 4-way join | 60 | 63 | 69 | 81 | 82 |
| Q11 | Nested subquery | 34 | 36 | 36 | 37 | 38 |
| Q12 | 3-way join + CASE | 64 | 64 | 65 | 67 | 69 |
| Q13 | LEFT OUTER join | 79 | 88 | 99 | 113 | 112 |
| Q14 | 2-way join | 22 | 22 | 23 | 25 | 26 |
| Q15 | CTE + MAX | 30 | 29 | 31 | 33 | 34 |
| Q16 | NOT IN subquery | 27 | 26 | 26 | 29 | 29 |
| Q17 | AVG subquery | 66 | 70 | 68 | 71 | 74 |
| Q18 | IN + HAVING | 97 | 90 | 92 | 96 | 105 |
| Q19 | OR predicates | 53 | 54 | 54 | 58 | 58 |
| Q20 | IN + EXISTS | 38 | 40 | 44 | 47 | 47 |
| Q21 | EXISTS + NOT EXISTS | 102 | 102 | 100 | 107 | 106 |
| Q22 | NOT EXISTS + SUBSTRING | 26 | 27 | 29 | 30 | 32 |
| Total | | 1.168s | 1.189s | 1.226s | 1.323s | 1.350s |

| Configuration | Cache Size | Cached Read | Size vs Baseline | Read vs Baseline |
|---|---|---|---|---|
| UNCOMPRESSED | 939 MB | 1.148s | baseline | baseline |
| LZ4 | 657 MB | 1.210s | -30% | +5% slower |
| SNAPPY | 644 MB | 1.306s | -31% | +14% slower |
| ZSTD(1) | 583 MB | 1.344s | -38% | +17% slower |
| ZSTD(3) | 576 MB | 1.319s | -39% | +15% slower |
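
One way to read the size-versus-read-time numbers is to ask how many megabytes each codec saves per millisecond of added cached-read latency. The metric and the `tradeoff` helper below are an illustrative heuristic, not something this PR computes; the input numbers are copied from the cache-size comparison above.

```python
# (cache size in MB, cached read time in s) per configuration.
RESULTS = {
    "UNCOMPRESSED": (939, 1.148),
    "LZ4": (657, 1.210),
    "SNAPPY": (644, 1.306),
    "ZSTD(1)": (583, 1.344),
    "ZSTD(3)": (576, 1.319),
}


def tradeoff(results, baseline="UNCOMPRESSED"):
    """Rank codecs by MB saved per ms of extra read time vs. the baseline.

    Assumes every codec is slower than the baseline (true for these numbers);
    codecs that are not slower are skipped to avoid a meaningless ratio.
    """
    base_size, base_read = results[baseline]
    rows = []
    for name, (size_mb, read_s) in results.items():
        if name == baseline:
            continue
        saved_mb = base_size - size_mb
        added_ms = (read_s - base_read) * 1000.0
        if added_ms <= 0:
            continue
        rows.append((name, saved_mb, added_ms, saved_mb / added_ms))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for name, saved_mb, added_ms, ratio in tradeoff(RESULTS):
        print(f"{name:8s} saves {saved_mb:4d} MB for +{added_ms:5.0f} ms "
              f"({ratio:.2f} MB/ms)")
```

On these measurements LZ4 ranks first by a wide margin: it recovers most of the size reduction of the ZSTD levels for roughly a third of their read-time penalty.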

@shefeek-jinnah shefeek-jinnah changed the title from "feat(parquet): add bloom filters and disable compression" to "Add bloom filters and disable compression" Feb 4, 2026

zfarrell commented Feb 4, 2026

Very cool, thanks @shefeek-jinnah!
A couple thoughts:

  • What do you think about going through a couple more iterations of compression algorithms and levels? E.g., set the zstd level to 2, 1, ..., -3 and re-run the test. Also try snappy, etc., and then add them to the comparison tables above. Maybe there's a sweet spot?
  • Do you think we should make any of this configurable? I'm leaning towards no for now, but I know that we'll likely need to add it in the future. Curious about your thoughts.

@shefeek-jinnah shefeek-jinnah force-pushed the shefeek/uncompress_and_bloom_filter branch 2 times, most recently from 0965e7f to 541a2dc February 5, 2026 17:50

shefeek-jinnah commented Feb 6, 2026

LZ4
Hi @zfarrell
I’m planning to go with LZ4 for compression due to its low CPU overhead and fast decompression, which suits our read-heavy access patterns. I’ll also be adding a Bloom filter to quickly rule out non-existent keys and reduce unnecessary data scans. Please let me know your thoughts.


zfarrell commented Feb 6, 2026

> LZ4
> Hi @zfarrell
> I’m planning to go with LZ4 for compression due to its low CPU overhead and fast decompression, which suits our read-heavy access patterns. I’ll also be adding a Bloom filter to quickly rule out non-existent keys and reduce unnecessary data scans. Please let me know your thoughts.

Love it! 🚀

@shefeek-jinnah shefeek-jinnah force-pushed the shefeek/uncompress_and_bloom_filter branch from 541a2dc to 92de061 February 6, 2026 05:17
@shefeek-jinnah shefeek-jinnah merged commit 8a64340 into main Feb 6, 2026
7 checks passed
@shefeek-jinnah shefeek-jinnah deleted the shefeek/uncompress_and_bloom_filter branch February 6, 2026 05:51