Pyarrow data type, default to small type and fix large type override #1859
Conversation
```diff
 def list(self, list_type: ListType, element_result: pa.DataType) -> pa.DataType:
     element_field = self.field(list_type.element_field, element_result)
-    return pa.large_list(value_type=element_field)
+    return pa.list_(value_type=element_field)
```
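For context on what this one-line change swaps: `pa.list_` stores its offsets as 32-bit integers, while `pa.large_list` uses 64-bit offsets. A minimal sketch (not from the PR) showing the difference:

```python
import pyarrow as pa

small = pa.array([[1, 2], [3]], type=pa.list_(pa.int8()))
large = small.cast(pa.large_list(pa.int8()))

print(small.offsets.type)  # int32 -> capped at 2**31 - 1 elements per chunk
print(large.offsets.type)  # int64
```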
Fokko commented:
I'm not convinced that we need to change this. We use `schema_to_pyarrow` in many places:

- `Schema.as_arrow()`: this can be problematic when people already allocate buffers that are larger than what fits in the small ones.
- `_ConvertToArrowExpression.{visit_in,visit_not_in}`: I checked manually, and it looks like we can mix large and normal types here :)
- `ArrowProjectionVisitor` has an issue similar to what you've described in Arrow: Infer the types when reading #1669 (comment). I think the other way around is also an issue: if you would promote a `large_string`, it would now produce a `binary` and not a `large_binary`.
- `ArrowScan.to_table()` will return the schema when there is no data; both small and large are okay.
- `DataScan.to_arrow_batch_reader()`: I think we should always upgrade to the large type. Since this is streaming, we don't know upfront if the small buffers are big enough, so it is safe to go with the large ones (see the sketch after this list).
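A minimal sketch of what "always upgrade to the large type" could look like for a streaming reader; the `widen` helper below is hypothetical and not code from this PR:

```python
import pyarrow as pa

def widen(t: pa.DataType) -> pa.DataType:
    # hypothetical helper: promote small variable-width types to their large variants
    if pa.types.is_string(t):
        return pa.large_string()
    if pa.types.is_binary(t):
        return pa.large_binary()
    if pa.types.is_list(t):
        return pa.large_list(widen(t.value_type))
    return t

schema = pa.schema({"name": pa.string(), "tags": pa.list_(pa.string())})
large_schema = pa.schema({f.name: widen(f.type) for f in schema})
print(large_schema)  # name: large_string, tags: large_list<... large_string>
```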
The PR author replied:
@Fokko Just coming back to this PR. Is there a reason why we'd want to default to `large_list`?
The difference between `list_` and `large_list` is the number of elements the list can hold. According to the `large_list` docs:

> Unless you need to represent data larger than 2**31 elements, you should prefer list_().

2**31 is 2_147_483_648; more than 2 billion items in a single list seems pretty rare.
I did a small experiment; this works with `list_`:

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 2
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
but this will crash Python and would require `large_list`:

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 1
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
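For completeness, the same data should fit once 64-bit offsets are in play; a sketch of the `large_list` counterpart (assuming roughly 2 GiB of free memory):

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 1
# 64-bit offsets, so the 2**31 - 1 element boundary is no longer a problem
pa.array([np.zeros(size, dtype=np.int8)], type=pa.large_list(pa.int8()))
```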
Fokko replied:
Sorry, I missed this one. I don't think `list` and `large_list` are the problem, but rather `string` and `large_string`: it is possible to have a buffer that contains more than 2 GB of strings.
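To illustrate the constraint: a `string` array stores 32-bit offsets into its data buffer, so a single chunk can address at most about 2 GiB of character data, while `large_string` uses 64-bit offsets. A small sketch (not from the PR):

```python
import pyarrow as pa

s = pa.array(["spark", "iceberg"], type=pa.string())
ls = s.cast(pa.large_string())  # widening the offsets requires a cast/copy

# buffers() -> [validity, offsets, data]; offsets are int32 vs int64
print(s.buffers()[1].size, ls.buffers()[1].size)  # 4 bytes/offset vs 8 bytes/offset
```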
Rationale for this change
#1669 made the change to infer the type when reading instead of defaulting PyArrow data types to the large type. Defaulting to the large type was originally introduced by #986.
I found a bug in #1669 where type promotion from string->binary defaults to `large_binary` (#1669 (comment)). This led me to find that we still use the large type in `_ConvertToArrowSchema`. Furthermore, I found that we did not respect `PYARROW_USE_LARGE_TYPES_ON_READ=True` when reading.

This PR is a continuation of #1669. It:

- Defaults `pyarrow.use-large-types-on-read` to `False`
- Changes `_ConvertToArrowSchema` to use the small data type instead of the large one
- Makes `ArrowScan` and `ArrowProjectionVisitor` cast to the large type when `PYARROW_USE_LARGE_TYPES_ON_READ` is enabled (set to `True`)
- Deprecates setting `PYARROW_USE_LARGE_TYPES_ON_READ` to `True`

This PR should help us infer the data type when reading while keeping the `PYARROW_USE_LARGE_TYPES_ON_READ` override behavior until deprecation.
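A quick way to see the new default in action; a minimal sketch assuming a `pyiceberg` build that includes this change:

```python
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

schema = Schema(NestedField(1, "name", StringType(), required=False))
# after this PR, the converted field is pa.string(), not pa.large_string()
print(schema_to_pyarrow(schema))
```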
Are these changes tested?
Yes
Are there any user-facing changes?
No