Conversation

@muttcg (Contributor) commented Jul 29, 2025

Purpose

Linked issue: close #5875 ([Bug] field-id missing for Iceberg-compatible manifest Avro schema)

The Apache Iceberg specification for Avro requires field-ids in order to support ID-based column pruning.

For example, the Google BigQuery engine as well as PyIceberg do not work at all without field-ids, causing critical issues:

Error while reading data, error message: The Apache Avro library failed to read data with the following error: Cannot resolve: { "type": "record", "name": "r2_null_value_counts", "fields": [ { "name": "key", "type": "int" }, { "name": "value", "type": "long" } ] } with { "type": "record", "name": "k121_v122", "fields": [ { "name": "key", "type": "int", "field-id": 121 }, { "name": "value", "type": "long", "field-id": 122 } ] }; Failed to dispatch pruner query for ***.example.flair_paimon.meta original query id: *** with new query_id: *** File: bigstore/*****-test/table_v2/iceberg/example/flair_paimon/metadata/7d8b21d4-e536-49cd-bc1c-ee68b154c178-m8.avro

The change affects only Iceberg Avro schema creation: it adds the Iceberg custom properties required by the specification and fixes the wrong ID for partition values.
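
For illustration, the Iceberg spec attaches IDs to Avro schemas through custom properties such as "field-id" on record fields (and "element-id" / "key-id" / "value-id" for lists and maps). A minimal, hand-written Python sketch of the shape BigQuery expects, mirroring the k121_v122 record from the error above (an illustration, not output of this PR):

# Illustrative Avro record schema carrying Iceberg field IDs.
# Names and IDs mirror the "k121_v122" record from the error message above.
iceberg_avro_field_schema = {
    "type": "record",
    "name": "k121_v122",
    "fields": [
        {"name": "key", "type": "int", "field-id": 121},
        {"name": "value", "type": "long", "field-id": 122},
    ],
}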

Tests

All existing tests must still pass. Added the org.apache.paimon.iceberg.IcebergCompatibilityTest.testIcebergAvroFieldIds test to cover all required IDs.
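
Outside of the Java test, one quick way to eyeball the result is to inspect a generated manifest's embedded Avro schema for the ID properties, e.g. with fastavro (a sketch; the path is a placeholder, and snappy-compressed files may additionally need the python-snappy package):

import json
import fastavro

# Placeholder path to a Paimon-generated Iceberg manifest file.
manifest_path = "path/to/iceberg/metadata/some-manifest.avro"

with open(manifest_path, "rb") as fo:
    # The writer schema is stored verbatim in the Avro file header metadata.
    schema = json.loads(fastavro.reader(fo).metadata["avro.schema"])

def check_field_ids(node):
    # Recursively assert that every record field carries a "field-id" property.
    if isinstance(node, dict):
        if node.get("type") == "record":
            for field in node["fields"]:
                assert "field-id" in field, f"missing field-id on {field['name']}"
                check_field_ids(field["type"])
        else:
            for value in node.values():
                check_field_ids(value)
    elif isinstance(node, list):
        for item in node:
            check_field_ids(item)

check_field_ids(schema)
print("all record fields carry field-id")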

API and Format

No

Documentation

@LsomeYeah (Contributor) left a comment

Thanks a lot for your contribution! Looks good to me!

Additionally, I was wondering if you could try reading the data using PyIceberg after this change? I ran some tests locally and found the following:

  1. Compression strategy
  • snappy: when the manifest's compression is set to snappy, I encountered ValueError: Checksum failure when reading manifests.
  • null: when the manifest's compression is null, the manifests can be read correctly without any issues.
  2. File format
  • avro: I ran into errors when trying to read Avro-formatted data files. This seems to be because PyArrow does not support reading Avro files.
  • parquet: when the manifest's compression is null and the data file format is Parquet, I was able to read the data successfully using PyIceberg (see the read sketch below).
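
For reference, these cases can be reproduced with reads roughly of the following shape, using PyIceberg's StaticTable against the generated metadata (a sketch; the metadata path is a placeholder):

from pyiceberg.table import StaticTable

# Placeholder path to the Iceberg metadata produced by Paimon's compatibility mode.
table = StaticTable.from_metadata("path/to/iceberg/metadata/v3.metadata.json")

scan = table.scan()
# plan_files() reads the manifest list and manifests (where the snappy
# checksum failure surfaced); to_arrow() then reads the data files themselves.
print(list(scan.plan_files()))
print(scan.to_arrow())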

@muttcg (Contributor, Author) commented Jul 30, 2025

@LsomeYeah Thank you for checking.
We found and fixed another bug in PyIceberg: apache/iceberg-python#2252.
Initially we tested with 'deflate'; as I understand, by default nobody uses Avro compression for Iceberg manifests anyway.

@LsomeYeah (Contributor) replied:

> @LsomeYeah Thank you for checking. We found and fixed another bug in PyIceberg: apache/iceberg-python#2252. Initially we tested with 'deflate'; as I understand, by default nobody uses Avro compression for Iceberg manifests anyway.

Thanks for your reply.
The default compression for Iceberg manifest files in Paimon is Snappy. Perhaps we should consider changing the default compression for Iceberg manifest files later. CC @JingsongLi

@LsomeYeah (Contributor) left a comment

+1

@muttcg (Contributor, Author) commented Jul 30, 2025

@LsomeYeah
I have re-checked Google BigQuery and Polars (which uses PyIceberg and PyArrow under the hood) with the fix and 'deflate' compression, and everything works as I expect. Examples I used:

BQ query:

SELECT * 
FROM `***.example.flair_paimon_iceberg_v7`
WHERE partition_date = "2024-07-01"
AND partition_hh = "17"
AND app_version = '10.335.0'
ORDER BY country DESC
LIMIT 1000

Polars query:

import polars as pl

table_path = "gs://***/table_v2/iceberg/example/flair_paimon_iceberg_v7/metadata/v3.metadata.json"

df = (
  pl.scan_iceberg(table_path)
  .filter(
    (pl.col("partition_date") == "2024-07-01") &
    (pl.col("partition_hh") == "17") &
    (pl.col("app_version") == "10.335.0")
  )
  .sort("country", descending=True)
  .limit(1000)
  .collect()
)

print(df)

@brunsgaard commented:

I am excited about this PR, thanks @muttcg and @LsomeYeah for looking into this ❤️ 🥇

@JingsongLi (Contributor) left a comment

Thanks @muttcg and @LsomeYeah, +1

@JingsongLi merged commit 9126349 into apache:master Jul 31, 2025
21 checks passed
jerry-024 added a commit to jerry-024/paimon that referenced this pull request Aug 1, 2025
* github/master: (41 commits)
  [Python] Support data writer for PyPaimon (apache#5997)
  [Python] Support scan and plan for PyPaimon (apache#5996)
  [flink-cdc] Provide option to disable use of source primary keys if primary keys in action command are not specified for CDC ingestion. (apache#5793)
  Revert "[core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)" (apache#5995)
  [core] Add total compaction count metric (apache#5963)
  [hotfix] Rename to SchemaManager.applyRenameColumnsToOptions
  [core] fix column rename when columns referenced by table options. (apache#5964)
  [core] Log a warning for invalid partition values instead of throwing an exception when enable partition mark done.  (apache#5978)
  [core] Add required Field IDs to support ID-based column pruning (apache#5981)
  [core] Row-tracking row should keep their row_id and sequence_number in compaction (apache#5991)
  [core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)
  [format] Introduce 'write.batch-memory' to control memory in arrow (apache#5988)
  [flink] Change filesystem.job-level-settings.enabled default value to true (apache#5971)
  [clone] support including some tables when clone all tables in a catalog or database. (apache#5993)
  [iceberg] Support TINYINT and SMALLINT in Iceberg Compatibility (apache#5984)
  [Python] Support snapshot and manifest for PyPaimon (apache#5987)
  [python] Change Schema to TableSchema in Class GetTableResponse.  (apache#5990)
  [core] Introduce 'compaction.total-size-threshold' to do full compaction (apache#5973)
  [Python] Support filesystem catalog for PyPaimon (apache#5986)
  [core] Add lance table type for rest catalog (apache#5977)
  ...