-
Notifications
You must be signed in to change notification settings - Fork 414
Adding promotion for UnknownType per V3+ spec #2155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for working on this! The changes LGTM. Tagging @Fokko for another pair of eyes :) |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are 2 different issues here.
For V3 tables, yes, I think we can map pyarrow's null type to UnknownType
For V1/V2 tables, I think its correct that visit_pyarrow should throw for pyarrow null type. There's not an equivalent iceberg type mapping for pyarrow's null type.
The example in #2119, the workaround is to make sure that the pyarrow table does not contain null type. you can do that by explicitly passing the schema when creating the pyarrow table
import pyarrow as pa
# table created with the below pyarrow schema
schema = pa.schema(
[
pa.field("col1", pa.string(), nullable=True),
]
)
df = pa.Table.from_pylist(
[
{"col1": None}
],
schema=schema
)
df.schema
Thank you for that additional context regarding V1/V2; I believe that makes sense. As far as V3 tables go, it looks like |
|
|
||
| @promote.register(UnknownType) | ||
| def _(file_type: UnknownType, read_type: IcebergType) -> IcebergType: | ||
| return read_type # Per V3 Spec, "Unknown" can be promoted to any type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I interpreted this as promotions available for a primitive type rather than promotions to a primitive type. i.e. UnknownType is a primitive type and these are the types it can be promoted to. It should be easy enough to change, so I'm fine either way. But, to me it would make sense that if a {List,Map,Struct}Type field is nullable and a null PyArrow field is being evaluated for compatibility, that evaluation should succeed. What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to extend this logic a bit:
| return read_type # Per V3 Spec, "Unknown" can be promoted to any type | |
| # Per V3 Spec, "Unknown" can be promoted to any Primitive type | |
| if isinstance(read_type, PrimitiveType): | |
| return file_type | |
| else: | |
| raise ResolveError(f"Cannot promote {file_type} to {read_type}") |
|
|
||
| def test_schema_conversion_null_column(table_schema_simple: Schema) -> None: | ||
| pyarrow_schema: pa.Schema = schema_to_pyarrow(table_schema_simple) | ||
| new_field = pyarrow_schema.field(2).with_type(pa.null()) # Make the optional boolean field null for testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating a new schema with just a field with a NestedType instead? I don't think we want to re-assign fields as that makes the test harder te read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can make a new schema. I think I was just trying to get it working and forgot to clean this up 😀
|
@ldsantos0911 Thanks for working on this, left a few comments, but this PR makes a lot of sense to me 👍 |

Rationale for this change
When attempting to write a PyArrow dataframe that has a
nullfield corresponding to a nullable Table field, I see this error:Per the V3 spec,
UnknownType(aspa.null()is cast) should be promotable to any type. This change attempts to unblock situations where a DataFrame may end up having a null type for an optional field while also incorporating the V3 spec.Note: This issue is related but was based on an older version. Nonetheless, the underlying situation is still blocked.
Are these changes tested?
Yes. Added a couple of relevant unit tests. Additionally, I did a manual sanity check:
Are there any user-facing changes?
Yes, this will allow UnknownType promotion.