Open
Description
Apache Iceberg version
main (development)
Please describe the bug 🐞
In the development version, I noticed that the to_arrow_batch_reader method casts all string types to large_string, whereas the to_arrow method returns the schema as defined in the Parquet file. At first glance this looks like a bug, likely a regression from #1669.
Here is a script you can use to reproduce the issue:
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from uuid import uuid4
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType
catalog = load_catalog("default")
df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)
schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
tbl = catalog.create_table(f"default.cities-{uuid4()}", schema=schema)
tbl.overwrite(df)
schema_to_arrow = tbl.scan().to_arrow().schema
schema_to_arrow_batch_reader = tbl.scan().to_arrow_batch_reader().schema
print("schema_to_arrow == schema_to_arrow_batch_reader", schema_to_arrow == schema_to_arrow_batch_reader)
print("\nschema_to_arrow")
print(schema_to_arrow)
print("\nschema_to_arrow_batch_reader")
print(schema_to_arrow_batch_reader)
output:
schema_to_arrow == schema_to_arrow_batch_reader False
schema_to_arrow:
city: string
lat: double
long: double
schema_to_arrow_batch_reader:
city: large_string
-- field metadata --
PARQUET:field_id: '1'
lat: double
-- field metadata --
PARQUET:field_id: '2'
long: double
-- field metadata --
PARQUET:field_id: '3'
Notice that the to_arrow schema reports city: string, while the to_arrow_batch_reader schema reports city: large_string.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time