When writing data from a PyArrow DataFrame, how should we handle 'null' Fields? #2119

@ldsantos0911

Question

import pyarrow as pa

# The Iceberg table (`table` below) was created with this pyarrow schema
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

# No schema is passed here, so PyArrow infers col1 as the null type
df = pa.Table.from_pylist(
    [
        {"col1": None}
    ]
)

table.overwrite(df)

In the above example, we encounter the following error: UnsupportedPyArrowTypeException: Column 'col1' has an unsupported type: null. The underlying cause is:

in _ConvertToIceberg.primitive(self, primitive)
   1211     return FixedType(primitive.byte_width)
-> 1213 raise TypeError(f"Unsupported type: {primitive}")

TypeError: Unsupported type: null

Is there any reason we wouldn't want to support the case where PyArrow has typed a field as null? As a workaround or fix, I was thinking that we could exclude pa.null() fields in visit_pyarrow(obj: pa.StructType, visitor: PyArrowSchemaVisitor[T]). The column would then effectively be missing from the schema, and any required/nullable enforcement would be applied as it is for any other missing column. Would this have any undesired consequences?
