Contributor

@a10y a10y commented Jan 14, 2026

Fixes a potential bug where decoding large FSST and VarBin arrays results in an invalid VarBinViewArray.

Currently, when decoding a large buffer, we generate a new VBV with that single buffer plus some views built against it. There will be trouble if the buffer is > 2GiB, though.

This PR splits out a separate build_views function that takes a max_buffer_len parameter and as it generates views, it splits (zero-copy) the underlying buffer into segments of no more than max_buffer_len.
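
To make the splitting concrete, here is a minimal sketch of the idea, not the code from this PR. The names `ViewRef` and `build_views_sketch` are hypothetical, the view struct is a simplified stand-in for the real `BinaryView`, and segments are returned as byte ranges into the original buffer to model the zero-copy slicing.

```rust
use std::ops::Range;

/// Simplified, hypothetical stand-in for a non-inlined view:
/// which segment it points into, at what offset, and for how many bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ViewRef {
    pub buffer_index: u32,
    pub offset: u32,
    pub len: u32,
}

/// Walk VarBin-style `offsets` (n + 1 entries for n values) and assign each
/// value to a segment of at most `max_buffer_len` bytes.
///
/// Returns the segments as byte ranges into the original buffer (no bytes
/// are copied) together with one view per value.
pub fn build_views_sketch(
    offsets: &[usize],
    max_buffer_len: usize,
) -> (Vec<Range<usize>>, Vec<ViewRef>) {
    // Assumption of this sketch: offsets must fit the signed 32-bit range.
    assert!(max_buffer_len <= i32::MAX as usize);

    let mut segments = Vec::new();
    let mut views = Vec::with_capacity(offsets.len().saturating_sub(1));
    let mut seg_start = offsets.first().copied().unwrap_or(0);
    let mut seg_len = 0usize; // bytes already assigned to the open segment

    for w in offsets.windows(2) {
        let len = w[1] - w[0];
        assert!(len <= max_buffer_len, "values cannot exceed max_buffer_len");

        if seg_len + len > max_buffer_len {
            // Close the open segment and start a new one at this value.
            segments.push(seg_start..seg_start + seg_len);
            seg_start = w[0];
            seg_len = 0;
        }

        views.push(ViewRef {
            buffer_index: segments.len() as u32,
            offset: seg_len as u32,
            len: len as u32,
        });
        seg_len += len;
    }

    if seg_len > 0 {
        segments.push(seg_start..seg_start + seg_len);
    }
    (segments, views)
}
```

With offsets `[0, 4, 8, 12, 14]` and a limit of 8 bytes, this sketch yields two segments, `0..8` and `8..14`, and the third value becomes the first view of the second segment.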

@codspeed-hq

codspeed-hq bot commented Jan 14, 2026

Merging this PR will degrade performance by 29.75%

⚡ 2 improved benchmarks
❌ 7 regressed benchmarks
✅ 1245 untouched benchmarks
⏩ 1254 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | canonical_into_nullable[(10000, 10, 0.0)] | 529.8 µs | 445.1 µs | +19.04% |
| Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.75% |
| Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.9 ms | 4.1 ms | +19.84% |
| Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] | 3.7 ms | 4.5 ms | -18.15% |
| Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] | 2.1 ms | 2.9 ms | -27.39% |
| Simulation | into_canonical_non_nullable[(10000, 100, 0.1)] | 3.8 ms | 4.6 ms | -17.55% |
| Simulation | into_canonical_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.31% |
| Simulation | into_canonical_non_nullable[(10000, 100, 0.01)] | 2.2 ms | 3 ms | -26.7% |
| Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 4.4 ms | 5.2 ms | -15.41% |

Comparing fsst-canonical (78cea9c) with develop (5483037)

Footnotes

  1. 1254 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y a10y force-pushed the fsst-canonical branch 5 times, most recently from af7d307 to 6a79200, on January 14, 2026 20:27
let len = len.as_();
assert!(len <= max_buffer_len, "values cannot exceed max_buffer_len");

if (offset + len) > max_buffer_len {
Contributor Author

This is actually very conservative, and doesn't account for the presence of inlined strings.

There isn't really a way around this without a lot of data copying.

Contributor Author

(FWIW this is already a problem on develop as well, except there everything lives in one giant buffer which sometimes is too large for VarBinView)
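
A rough illustration of why the check is conservative, under the assumption that values of 12 bytes or fewer are inlined into the 16-byte view (as in the Arrow view layout). The helper `buffer_bytes` is hypothetical: it compares the bytes the check counts toward a segment with the bytes that non-inlined views actually reference.

```rust
/// Values up to this length live inside the view itself (Arrow view layout).
const MAX_INLINED_LEN: usize = 12;

/// Hypothetical helper: given the byte length of every value, return
/// `(counted, referenced)` where `counted` is what a length-based check
/// charges against `max_buffer_len` and `referenced` is what non-inlined
/// views actually point at in the buffer.
pub fn buffer_bytes(lens: &[usize]) -> (usize, usize) {
    let counted = lens.iter().sum::<usize>();
    let referenced = lens
        .iter()
        .filter(|&&len| len > MAX_INLINED_LEN)
        .sum::<usize>();
    (counted, referenced)
}
```

The gap between the two numbers is the slack in the check. Reclaiming it would mean compacting the buffer so that inlined bytes no longer sit between referenced ones, which is the data copying mentioned above.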

@a10y a10y added the fix label Jan 14, 2026
@a10y a10y requested a review from gatesn January 14, 2026 20:35
@codecov

codecov bot commented Jan 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.89%. Comparing base (4bbafe7) to head (14d4764).

)]
fn fsst_decode_views(fsst_array: &FSSTArray, buf_index: u32) -> (ByteBuffer, Buffer<BinaryView>) {
/// Maximum number of buffer bytes that can be referenced by a single `BinaryView`
const MAX_BUFFER_LEN: usize = i32::MAX as usize;
Contributor

@AdamGS AdamGS Jan 15, 2026

is this a Java heresy?

Contributor Author

Yeah, there's some ambiguity in the spec here, and arrow-rs diverges from arrow-cpp.

arrow-rs uses u32 for all of these fields.

But the spec states that all offsets should be treated as signed. That's how the Java implementation works, and also how the arrow-cpp implementation works: https://github.com/apache/arrow/blob/7820f672edbbd516661740db9c355f2bc42bf602/cpp/src/arrow/util/binary_view_util.h#L52-L58

So, just to be on the safe side, I use i32 here instead of u32.
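
A small sketch of what goes wrong if a writer treats the offset as u32 and a reader treats the same bits as i32. The helper names are hypothetical; only the `MAX_BUFFER_LEN` constant mirrors the diff above.

```rust
/// Largest offset that means the same thing under both readings.
pub const MAX_BUFFER_LEN: usize = i32::MAX as usize;

/// What a reader with signed 32-bit fields sees for the same 32 bits.
pub fn as_signed_reader_sees(offset: u32) -> i32 {
    // `as` between same-width integers reinterprets the bits.
    offset as i32
}

/// True if the offset is valid for both signed and unsigned readers.
pub fn is_portable_offset(offset: u32) -> bool {
    offset as usize <= MAX_BUFFER_LEN
}
```

An offset of 2_147_483_648 is fine as a u32 but reads as `i32::MIN` when interpreted as signed, so capping segments at `i32::MAX` bytes keeps every offset valid under either interpretation.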

Base automatically changed from bufferhandles to develop January 15, 2026 14:15
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y added 2 commits January 20, 2026 12:26
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y changed the title from "fix: FSST canonicalize when > 2GB of buffers" to "fix: canonicalize VarBin/FSST with >2GB buffers" Jan 20, 2026
use crate::arrays::build_views::build_views;

#[test]
fn test_to_canonical_large() {
Contributor

is this the test that takes a while to run?

Contributor Author

no this one is very small b/c i artificially constrain the max buffer len

Contributor Author

but now that we're hitting the same codepath for varbin/fsst it's ok

@a10y a10y merged commit c655272 into develop Jan 20, 2026
50 of 52 checks passed
@a10y a10y deleted the fsst-canonical branch January 20, 2026 21:32