fix: canonicalize VarBin/FSST with >2GB buffers #5961
Merging this PR will degrade performance by 29.75%
encodings/fsst/src/canonical.rs
Outdated
    let len = len.as_();
    assert!(len <= max_buffer_len, "values cannot exceed max_buffer_len");

    if (offset + len) > max_buffer_len {
This is actually very conservative: it doesn't account for the presence of inlined strings.
There isn't really a way around this without lots of data copying.
(FWIW, this is already a problem on develop as well, except that there everything lives in one giant buffer, which is sometimes too large for VarBinView.)
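To make the "conservative" point concrete, here is a minimal sketch (not the PR's code) comparing the bytes a running `offset + len` check counts against the bytes that views actually point into. It assumes the Arrow view layout in which values of at most 12 bytes are stored inline in the view; the names `MAX_INLINED_LEN`, `counted_bytes` and `referenced_bytes` are hypothetical.

```rust
/// Values of at most this many bytes are stored inline in a view
/// (Arrow view layout); the constant name is hypothetical.
pub const MAX_INLINED_LEN: usize = 12;

/// Bytes a running `offset + len` check charges against a segment:
/// every value is counted, whether or not its view references the buffer.
pub fn counted_bytes(lens: &[usize]) -> usize {
    lens.iter().sum()
}

/// Bytes that views actually reference: only values too long to be inlined.
pub fn referenced_bytes(lens: &[usize]) -> usize {
    lens.iter()
        .copied()
        .filter(|&len| len > MAX_INLINED_LEN)
        .sum()
}
```

The inlined bytes still sit between the longer values in the shared buffer, so tightening the check would mean compacting the buffer, which is the data copying mentioned above.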
Codecov Report: All modified and coverable lines are covered by tests.
encodings/fsst/src/canonical.rs
Outdated
    )]
    fn fsst_decode_views(fsst_array: &FSSTArray, buf_index: u32) -> (ByteBuffer, Buffer<BinaryView>) {
        /// Maximum number of buffer bytes that can be referenced by a single `BinaryView`
        const MAX_BUFFER_LEN: usize = i32::MAX as usize;
is this a Java heresy?
Yea, there's some ambiguity here in the spec, and arrow-rs diverges from arrow-cpp.
arrow-rs uses u32 for all of these fields.
But the spec states that all offsets should be treated as signed. That's how the Java implementation works, and also how the arrow-cpp implementation works: https://github.com/apache/arrow/blob/7820f672edbbd516661740db9c355f2bc42bf602/cpp/src/arrow/util/binary_view_util.h#L52-L58
So, just to be on the safe side, I use i32 here instead of u32.
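A small sketch of why the signed bound is the safe choice: an offset that fits in a u32 but exceeds `i32::MAX` turns negative when a reader reinterprets the same four bytes as signed. The helper names `offset_as_signed` and `safe_for_signed_readers` are hypothetical and only illustrate the reasoning; they are not part of any Arrow or Vortex API.

```rust
/// Same bound as the `MAX_BUFFER_LEN` constant in the diff above.
pub const MAX_BUFFER_LEN: usize = i32::MAX as usize;

/// What a reader that treats view offsets as signed sees for a stored u32.
/// `as` between same-width integers keeps the bit pattern unchanged.
pub fn offset_as_signed(offset: u32) -> i32 {
    offset as i32
}

/// True if an offset reads the same under both signed and unsigned
/// interpretations (hypothetical helper).
pub fn safe_for_signed_readers(offset: usize) -> bool {
    offset <= MAX_BUFFER_LEN
}
```

Offsets up to `i32::MAX` read identically either way, so capping buffers at that length sidesteps the disagreement between implementations.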
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
    use crate::arrays::build_views::build_views;

    #[test]
    fn test_to_canonical_large() {
is this the test that takes a while to run?
No, this one is very small because I artificially constrain the max buffer len.
but now that we're hitting the same codepath for varbin/fsst it's ok
Fixes a potential bug where decoding large FSST and VarBin arrays results in an invalid VarBinViewArray.
When you have a large buffer, we currently generate a new VBV with the single buffer plus some views built against it. There will be trouble if the buffer is > 2GiB, though.
This PR splits out a separate `build_views` function that takes a `max_buffer_len` parameter; as it generates views, it splits (zero-copy) the underlying buffer into segments of no more than `max_buffer_len`.
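The splitting described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `build_views` from this PR: `View` is a hypothetical stand-in for `BinaryView`, segments are returned as byte ranges into the original buffer (standing in for zero-copy slices), and the signature of `build_views_sketch` is invented for the example.

```rust
use std::ops::Range;

/// Hypothetical, simplified stand-in for a 16-byte `BinaryView`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum View {
    /// Values of at most 12 bytes are stored in the view itself.
    Inlined(Vec<u8>),
    /// Longer values point into one of the data buffers.
    Ref { buffer_index: u32, offset: u32, len: u32 },
}

const MAX_INLINED_LEN: usize = 12;

/// Walks the values in `bytes` (lengths given by `lens`) and returns the byte
/// range of each segment plus one view per value. No bytes are moved: each
/// range marks where the original buffer would be sliced.
pub fn build_views_sketch(
    start_buf_index: u32,
    max_buffer_len: usize,
    bytes: &[u8],
    lens: &[usize],
) -> (Vec<Range<usize>>, Vec<View>) {
    let mut segments = Vec::new();
    let mut views = Vec::with_capacity(lens.len());
    let mut buf_index = start_buf_index;
    let mut seg_start = 0usize; // start of the current segment in `bytes`
    let mut pos = 0usize; // start of the next value in `bytes`

    for &len in lens {
        assert!(len <= max_buffer_len, "values cannot exceed max_buffer_len");
        // Conservative check: inlined values count toward the segment too.
        if (pos - seg_start) + len > max_buffer_len {
            segments.push(seg_start..pos);
            seg_start = pos;
            buf_index += 1;
        }
        let value = &bytes[pos..pos + len];
        views.push(if len <= MAX_INLINED_LEN {
            View::Inlined(value.to_vec())
        } else {
            View::Ref {
                buffer_index: buf_index,
                offset: (pos - seg_start) as u32, // offset within its segment
                len: len as u32,
            }
        });
        pos += len;
    }
    // Close the final segment if it holds any bytes.
    if pos > seg_start {
        segments.push(seg_start..pos);
    }
    (segments, views)
}
```

Passing a small `max_buffer_len`, as the test in this PR does, exercises the splitting path without allocating gigabytes: three values of 13, 13 and 14 bytes with a limit of 30 end up in two segments, and the third view points at offset 0 of the second buffer.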