to_arrow_batch_reader returns a different schema than to_arrow #2250

@enkidulan

Description

Apache Iceberg version

main (development)

Please describe the bug 🐞

In the development version, I noticed that the to_arrow_batch_reader method casts all string types to large_string, whereas the to_arrow method returns the schema as defined in the Parquet file. At first glance this looks like a bug, likely a regression from #1669.

Here is a script you can use to reproduce the issue:

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from uuid import uuid4
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, DoubleType

catalog = load_catalog("default")

df = pa.Table.from_pylist(
    [
        {"city": "Amsterdam", "lat": 52.371807, "long": 4.896029},
        {"city": "San Francisco", "lat": 37.773972, "long": -122.431297},
        {"city": "Drachten", "lat": 53.11254, "long": 6.0989},
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)

tbl = catalog.create_table(f"default.cities-{uuid4()}", schema=schema)

tbl.overwrite(df)


schema_to_arrow = tbl.scan().to_arrow().schema

schema_to_arrow_batch_reader = tbl.scan().to_arrow_batch_reader().schema

print("schema_to_arrow == schema_to_arrow_batch_reader", schema_to_arrow == schema_to_arrow_batch_reader)
print("\nschema_to_arrow")
print(schema_to_arrow)
print("\nschema_to_arrow_batch_reader")
print(schema_to_arrow_batch_reader)

output:

schema_to_arrow == schema_to_arrow_batch_reader False

schema_to_arrow:
city: string
lat: double
long: double

schema_to_arrow_batch_reader:
city: large_string
  -- field metadata --
  PARQUET:field_id: '1'
lat: double
  -- field metadata --
  PARQUET:field_id: '2'
long: double
  -- field metadata --
  PARQUET:field_id: '3'

Notice that the to_arrow schema reports city: string, while the to_arrow_batch_reader schema reports city: large_string.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
