Skip to content

Conversation

@ldsantos0911
Copy link

Rationale for this change

When attempting to write a PyArrow dataframe that has a null field corresponding to a nullable Table field, I see this error:

ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field              ┃ Dataframe field           ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: col1: optional string │ 1: col1: optional unknown │
│ ✅ │ 2: col2: optional string │ 2: col2: optional string  │
└────┴──────────────────────────┴───────────────────────────┘

Per the V3 spec, UnknownType (as pa.null() is cast) should be promotable to any type. This change attempts to unblock situations where a DataFrame may end up having a null type for an optional field while also incorporating the V3 spec.

Note: This issue is related but was based on an older version. Nonetheless, the underlying situation is still blocked.

Are these changes tested?

Yes. Added a couple of relevant unit tests. Additionally, I did a manual sanity check:

Python 3.10.18 (main, Jun  4 2025, 08:06:20) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> from pyiceberg.catalog import load_catalog
>>> from pyiceberg.schema import Schema
>>> from pyiceberg.types import StringType, NestedField
>>> warehouse_path = '/tmp/warehouse'
>>> catalog = load_catalog('default', **{'type': 'sql', 'uri': f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", 'warehouse': f"file://{warehouse_path}"})
>>> table = catalog.load_table('default.test2')
>>> table.schema()
Schema(NestedField(field_id=1, name='col1', field_type=StringType(), required=False), NestedField(field_id=2, name='col2', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])
>>> df = pa.Table.from_pylist([{'col1': None, 'col2': 'blah'}])
>>> df.schema
col1: null
col2: string
>>> table.append(df)
>>> x = table.scan().to_arrow()
>>> x
pyarrow.Table
col1: large_string
col2: string
----
col1: [[null]]
col2: [["blah"]]

Are there any user-facing changes?

Yes, this will allow UnknownType promotion.

@ldsantos0911 ldsantos0911 marked this pull request as ready for review June 26, 2025 21:56
@Fokko Fokko self-requested a review June 27, 2025 06:19
@gabeiglio
Copy link
Contributor

Thanks for working on this! The changes LGTM. Tagging @Fokko for another pair of eyes :)

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 2 different issues here.

For V3 tables, yes, I think we can map pyarrow's null type to UnknownType
For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.

The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table

import pyarrow as pa

# table created with the below pyarrow schema
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

df = pa.Table.from_pylist(
    [
        {"col1": None}
    ],
    schema=schema
)

df.schema

@ldsantos0911
Copy link
Author

I think there are 2 different issues here.

For V3 tables, yes, I think we can map pyarrow's null type to UnknownType For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.

The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table

import pyarrow as pa

# table created with the below pyarrow schema
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

df = pa.Table.from_pylist(
    [
        {"col1": None}
    ],
    schema=schema
)

df.schema

Thank you for that additional context regarding V1/V2; I believe that makes sense. As far as V3 tables go, it looks like null is already being cast to UnknownType. However, as far as I can tell, the missing link is supporting type promotion to make it so UnknownType is evaluated as compatible with XType. Otherwise, as noted above, we see ValueError: Mismatch in fields.


@promote.register(UnknownType)
def _(file_type: UnknownType, read_type: IcebergType) -> IcebergType:
return read_type # Per V3 Spec, "Unknown" can be promoted to any type
Copy link
Contributor

@Fokko Fokko Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read the spec carefully, I think it only allows for promoting to primitive types:

image

So, it cannot be promoted to a {List,Map,Struct}Type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I interpreted this as promotions available for a primitive type rather than promotions to a primitive type. i.e. UnknownType is a primitive type and these are the types it can be promoted to. It should be easy enough to change, so I'm fine either way. But, to me it would make sense that if a {List,Map,Struct}Type field is nullable and a null PyArrow field is being evaluated for compatibility, that evaluation should succeed. What are your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to extend this logic a bit:

Suggested change
return read_type # Per V3 Spec, "Unknown" can be promoted to any type
# Per V3 Spec, "Unknown" can be promoted to any Primitive type
if isinstance(read_type, PrimitiveType):
return file_type
else:
raise ResolveError(f"Cannot promote {file_type} to {read_type}")


def test_schema_conversion_null_column(table_schema_simple: Schema) -> None:
pyarrow_schema: pa.Schema = schema_to_pyarrow(table_schema_simple)
new_field = pyarrow_schema.field(2).with_type(pa.null()) # Make the optional boolean field null for testing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating a new schema with just a field with a NestedType instead? I don't think we want to re-assign fields as that makes the test harder te read.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can make a new schema. I think I was just trying to get it working and forgot to clean this up 😀

@Fokko
Copy link
Contributor

Fokko commented Aug 12, 2025

@ldsantos0911 Thanks for working on this, left a few comments, but this PR makes a lot of sense to me 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants