You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
### Rationale for this change
Time to give this another go 😆
When reading a Parquet file using PyArrow, there is some metadata stored
in the Parquet file to either make it a large type (eg `large_string`,
or a normal type (`string`). The difference is that the large types use
a 64 bit offset to encode their arrays. This is not always needed, and
we can could first check all the in the types of which it is stored, and
let PyArrow decide here:
https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/io/pyarrow.py#L1579
In PyArrow today we just bump everything to a large type, which might
lead to additional memory consumption because it allocates an int64
array to allocate the offsets, instead of an int32.
I thought we would be good to go for this now with the new lower bound
of PyArrow to 17. But, it looks like we still have to wait for Arrow 18
to fix the issue with the `date` types:
apache/arrow#43183Fixes: apache#1049
### Are these changes tested?
Yes, existing tests :)
### Are there any user-facing changes?
Before, PyIceberg would always return the large Arrow types (eg,
`large_string` instead of `string`). After this change, it will return
the type it was written with.
=="""locations: map<large_string, struct<latitude: double not null, longitude: double not null, altitude: double>>
1474
-
child 0, entries: struct<key: large_string not null, value: struct<latitude: double not null, longitude: double not null, altitude: double> not null> not null
1475
-
child 0, key: large_string not null
1473
+
=="""locations: map<string, struct<latitude: double not null, longitude: double not null, altitude: double>>
1474
+
child 0, entries: struct<key: string not null, value: struct<latitude: double not null, longitude: double not null, altitude: double> not null> not null
1475
+
child 0, key: string not null
1476
1476
child 1, value: struct<latitude: double not null, longitude: double not null, altitude: double> not null
0 commit comments