
Enable stats collection for nested fields and use write.metadata.metrics.max-inferred-column-defaults to control stats growth #2699

@greenlaw


Feature Request / Improvement

I recently discovered that full stats collection (i.e. lower_bounds/upper_bounds) is explicitly disabled in PyIceberg for nested (i.e. struct child) fields.

This change was made in this PR and specifically this commit.

The change appears to have been made to limit the number of fields whose stats are collected when full stats collection is the default. However, after discussion, it seems that adding support for the write.metadata.metrics.max-inferred-column-defaults table property would be the preferred way to control stats growth. Once that is implemented, re-enabling stats collection for nested fields should be a non-issue.
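To make the idea concrete, here is a minimal sketch of how a column cap could bound stats growth even when nested fields are included. It is not PyIceberg code: the dict-based schema, `leaf_paths`, and `columns_with_inferred_metrics` are hypothetical names for illustration, and the depth-first column ordering is an assumption rather than the behavior of any particular Iceberg implementation.

```python
from typing import Dict, List, Union

# Hypothetical schema representation (not a PyIceberg type):
# field name -> primitive type name, or a nested dict for a struct.
Schema = Dict[str, Union[str, "Schema"]]


def leaf_paths(schema: Schema, prefix: str = "") -> List[str]:
    """Return dotted paths of all primitive (leaf) fields, depth-first."""
    paths: List[str] = []
    for name, field_type in schema.items():
        path = f"{prefix}{name}"
        if isinstance(field_type, dict):
            # Struct: descend so that child fields are also candidates.
            paths.extend(leaf_paths(field_type, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths


def columns_with_inferred_metrics(schema: Schema, max_inferred_column_defaults: int) -> List[str]:
    """Pick the columns that would get default metrics under the cap.

    Assumption: the cap simply keeps the first N leaf columns.
    """
    return leaf_paths(schema)[:max_inferred_column_defaults]


example_schema: Schema = {
    "id": "long",
    "geometry": "binary",
    "bbox": {"xmin": "double", "ymin": "double", "xmax": "double", "ymax": "double"},
}

capped = columns_with_inferred_metrics(example_schema, 4)
uncapped = columns_with_inferred_metrics(example_schema, 100)
```

With a small cap, some nested fields fall outside the limit; with a cap larger than the number of leaf columns, every nested field is covered, so the cap alone controls how many columns carry bounds.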

Stats collection for nested struct fields matters for schemas such as GeoParquet, which store key primitive fields (in this case, the bounding box values xmin, ymin, xmax, ymax) in structs.
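As a rough illustration of why these bounds matter, the sketch below prunes data files for a spatial query using per-file lower and upper bounds on the bbox struct children. Everything here is hypothetical: `FileStats` and `may_intersect` are made-up names, and the bounds are keyed by dotted column name for readability, whereas Iceberg metadata keys bounds by field ID and stores them as serialized values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Query window as (xmin, ymin, xmax, ymax).
BBox = Tuple[float, float, float, float]


@dataclass
class FileStats:
    """Simplified stand-in for per-file column bounds (hypothetical)."""
    path: str
    lower_bounds: Dict[str, float] = field(default_factory=dict)
    upper_bounds: Dict[str, float] = field(default_factory=dict)


def may_intersect(stats: FileStats, query: BBox) -> bool:
    """Return False only when the bounds prove no row can intersect the query."""
    qxmin, qymin, qxmax, qymax = query
    try:
        return (
            stats.lower_bounds["bbox.xmin"] <= qxmax
            and stats.upper_bounds["bbox.xmax"] >= qxmin
            and stats.lower_bounds["bbox.ymin"] <= qymax
            and stats.upper_bounds["bbox.ymax"] >= qymin
        )
    except KeyError:
        # No bounds collected for the nested fields: the file cannot be skipped.
        return True


files: List[FileStats] = [
    FileStats("a.parquet", {"bbox.xmin": 0.0, "bbox.ymin": 0.0}, {"bbox.xmax": 10.0, "bbox.ymax": 10.0}),
    FileStats("b.parquet", {"bbox.xmin": 50.0, "bbox.ymin": 50.0}, {"bbox.xmax": 60.0, "bbox.ymax": 60.0}),
    FileStats("c.parquet"),  # written without nested-field stats
]

to_scan = [f.path for f in files if may_intersect(f, (5.0, 5.0, 20.0, 20.0))]
```

In this sketch, a file written without bounds on the nested fields must always be scanned, which is the cost of disabling stats collection for struct children.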

See also this Slack thread for discussion.
