Pyarrow data type, default to small type and fix large type override #1859
Conversation
```diff
 def list(self, list_type: ListType, element_result: pa.DataType) -> pa.DataType:
     element_field = self.field(list_type.element_field, element_result)
-    return pa.large_list(value_type=element_field)
+    return pa.list_(value_type=element_field)
```
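For context on what this one-line change swaps: `pa.list_` stores its offsets as 32-bit integers, while `pa.large_list` uses 64-bit offsets. A minimal sketch (not from the PR) showing the difference:

```python
import pyarrow as pa

small = pa.array([[1, 2], [3]], type=pa.list_(pa.int8()))
large = small.cast(pa.large_list(pa.int8()))

print(small.offsets.type)  # int32 -> capped at 2**31 - 1 elements per chunk
print(large.offsets.type)  # int64
```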
Fokko commented:
I'm not convinced that we need to change this. We use `schema_to_pyarrow` in many places:

- `Schema.as_arrow()`: this can be problematic when people already allocate buffers that are larger than what fits in the small ones.
- `_ConvertToArrowExpression.{visit_in,visit_not_in}`: I checked manually, and it looks like we can mix large and normal types here :)
- `ArrowProjectionVisitor` has an issue similar to what you've described in Arrow: Infer the types when reading #1669 (comment). I think the other way around is also an issue: if you would promote a `large_string`, it would now produce a `binary` and not a `large_binary`.
- `ArrowScan.to_table()` will return the schema when there is no data; both small and large are okay.
- `DataScan.to_arrow_batch_reader()`: I think we should always upgrade to the large type. Since this is streaming, we don't know upfront if the small buffers are big enough, so it is safe to go with the large ones (see the sketch after this list).
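A minimal sketch of what "always upgrade to the large type" could look like for a streaming reader; the `widen` helper below is hypothetical and not code from this PR:

```python
import pyarrow as pa

def widen(t: pa.DataType) -> pa.DataType:
    # hypothetical helper: promote small variable-width types to their large variants
    if pa.types.is_string(t):
        return pa.large_string()
    if pa.types.is_binary(t):
        return pa.large_binary()
    if pa.types.is_list(t):
        return pa.large_list(widen(t.value_type))
    return t

schema = pa.schema({"name": pa.string(), "tags": pa.list_(pa.string())})
large_schema = pa.schema({f.name: widen(f.type) for f in schema})
print(large_schema)  # name: large_string, tags: large_list<... large_string>
```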
The PR author replied:
@Fokko Just coming back to this PR. Is there a reason why we'd want to default to `large_list`?
The difference between `list_` and `large_list` is the number of elements the list can hold. According to the `large_list` docs:

> Unless you need to represent data larger than 2**31 elements, you should prefer list_().

2**31 is 2_147_483_648; more than 2 billion items in a single list seems pretty rare.
I did a small experiment; this works with `list_`:

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 2
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
but this will crash Python and would require `large_list`:

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 1
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
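For completeness, the same data should fit once 64-bit offsets are in play; a sketch of the `large_list` counterpart (assuming roughly 2 GiB of free memory):

```python
import pyarrow as pa
import numpy as np

size = 2**31 - 1
# 64-bit offsets, so the 2**31 - 1 element boundary is no longer a problem
pa.array([np.zeros(size, dtype=np.int8)], type=pa.large_list(pa.int8()))
```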
Fokko replied:
Sorry, I missed this one. I don't think `list` and `large_list` are the problem, but rather `string` and `large_string`: it is possible to have a buffer that contains more than 2 GB of strings.
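To illustrate the constraint: a `string` array stores 32-bit offsets into its data buffer, so a single chunk can address at most about 2 GiB of character data, while `large_string` uses 64-bit offsets. A small sketch (not from the PR):

```python
import pyarrow as pa

s = pa.array(["spark", "iceberg"], type=pa.string())
ls = s.cast(pa.large_string())  # widening the offsets requires a cast/copy

# buffers() -> [validity, offsets, data]; offsets are int32 vs int64
print(s.buffers()[1].size, ls.buffers()[1].size)  # 4 bytes/offset vs 8 bytes/offset
```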
Rationale for this change
#1669 made the change to infer the type when reading instead of defaulting PyArrow data types to the large type. Defaulting to the large type was originally introduced by #986.
I found a bug in #1669 where type promotion from string->binary defaults to `large_binary` (#1669 (comment)). This led me to find that we still use the large type in `_ConvertToArrowSchema`. Furthermore, I found that we did not respect `PYARROW_USE_LARGE_TYPES_ON_READ=True` when reading.

This PR is a continuation of #1669. It:

- Defaults `pyarrow.use-large-types-on-read` to `False`
- Changes `_ConvertToArrowSchema` to use the small data type instead of the large one
- Makes `ArrowScan` and `ArrowProjectionVisitor` cast to the large type when `PYARROW_USE_LARGE_TYPES_ON_READ` is enabled (set to `True`)
- Deprecates setting `PYARROW_USE_LARGE_TYPES_ON_READ` to `True`

This PR should help us infer the data type when reading while keeping the `PYARROW_USE_LARGE_TYPES_ON_READ` override behavior until deprecation.
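A quick way to see the new default in action; a minimal sketch assuming a `pyiceberg` build that includes this change:

```python
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

schema = Schema(NestedField(1, "name", StringType(), required=False))
# after this PR, the converted field is pa.string(), not pa.large_string()
print(schema_to_pyarrow(schema))
```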
Are these changes tested?
Yes
Are there any user-facing changes?
No