Conversation

@muttcg (Contributor) commented Jul 29, 2025

Purpose

Linked issue: close #5875 ([Bug] field-id missing for Iceberg-compatible manifest Avro schema)

The Apache Iceberg specification for Avro requires field-ids in order to support ID-based column pruning.

For example, the Google BigQuery engine as well as PyIceberg do not work at all without field-ids, causing critical issues:

Error while reading data, error message: The Apache Avro library failed to read data with the following error: Cannot resolve: { "type": "record", "name": "r2_null_value_counts", "fields": [ { "name": "key", "type": "int" }, { "name": "value", "type": "long" } ] } with { "type": "record", "name": "k121_v122", "fields": [ { "name": "key", "type": "int", "field-id": 121 }, { "name": "value", "type": "long", "field-id": 122 } ] }; Failed to dispatch pruner query for ***.example.flair_paimon.meta original query id: *** with new query_id: *** File: bigstore/*****-test/table_v2/iceberg/example/flair_paimon/metadata/7d8b21d4-e536-49cd-bc1c-ee68b154c178-m8.avro

The change affects only Iceberg Avro schema creation: it adds the Iceberg custom properties required by the specification and fixes the wrong ID for partition values.
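
For illustration, the Iceberg spec attaches IDs to Avro schemas through custom properties such as "field-id" on record fields (and "element-id" / "key-id" / "value-id" for lists and maps). A minimal, hand-written Python sketch of the shape BigQuery expects, mirroring the k121_v122 record from the error above (an illustration, not output of this PR):

# Illustrative Avro record schema carrying Iceberg field IDs.
# Names and IDs mirror the "k121_v122" record from the error message above.
iceberg_avro_field_schema = {
    "type": "record",
    "name": "k121_v122",
    "fields": [
        {"name": "key", "type": "int", "field-id": 121},
        {"name": "value", "type": "long", "field-id": 122},
    ],
}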

Tests

All existing tests must still pass. Added the org.apache.paimon.iceberg.IcebergCompatibilityTest.testIcebergAvroFieldIds test to cover all required IDs.
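
Outside of the Java test, one quick way to eyeball the result is to inspect a generated manifest's embedded Avro schema for the ID properties, e.g. with fastavro (a sketch; the path is a placeholder, and snappy-compressed files may additionally need the python-snappy package):

import json
import fastavro

# Placeholder path to a Paimon-generated Iceberg manifest file.
manifest_path = "path/to/iceberg/metadata/some-manifest.avro"

with open(manifest_path, "rb") as fo:
    # The writer schema is stored verbatim in the Avro file header metadata.
    schema = json.loads(fastavro.reader(fo).metadata["avro.schema"])

def check_field_ids(node):
    # Recursively assert that every record field carries a "field-id" property.
    if isinstance(node, dict):
        if node.get("type") == "record":
            for field in node["fields"]:
                assert "field-id" in field, f"missing field-id on {field['name']}"
                check_field_ids(field["type"])
        else:
            for value in node.values():
                check_field_ids(value)
    elif isinstance(node, list):
        for item in node:
            check_field_ids(item)

check_field_ids(schema)
print("all record fields carry field-id")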

API and Format

No

Documentation

@LsomeYeah (Contributor) left a comment

Thanks a lot for your contribution! Looks good to me!

Additionally, I was wondering if you could try reading the data using PyIceberg after this change? I ran some tests locally and found the following:

  1. Compression strategy
  • snappy: when the manifest's compression is set to snappy, I encountered ValueError: Checksum failure when reading manifests.
  • null: when the manifest's compression is null, the manifests can be read correctly without any issues.
  2. File format
  • avro: I ran into errors when trying to read Avro-formatted data files. This seems to be because PyArrow does not support reading Avro files.
  • parquet: when the manifest's compression is null and the data file format is Parquet, I was able to read the data successfully using PyIceberg (see the read sketch below).
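
For reference, these cases can be reproduced with reads roughly of the following shape, using PyIceberg's StaticTable against the generated metadata (a sketch; the metadata path is a placeholder):

from pyiceberg.table import StaticTable

# Placeholder path to the Iceberg metadata produced by Paimon's compatibility mode.
table = StaticTable.from_metadata("path/to/iceberg/metadata/v3.metadata.json")

scan = table.scan()
# plan_files() reads the manifest list and manifests (where the snappy
# checksum failure surfaced); to_arrow() then reads the data files themselves.
print(list(scan.plan_files()))
print(scan.to_arrow())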

@muttcg (Contributor, Author) commented Jul 30, 2025

@LsomeYeah Thank you for checking.
We found and fixed another bug in PyIceberg: apache/iceberg-python#2252.
Initially we tested with 'deflate'; as I understand, by default nobody uses Avro compression for Iceberg manifests anyway.

@LsomeYeah (Contributor) replied:

> @LsomeYeah Thank you for checking. We found and fixed another bug in PyIceberg: apache/iceberg-python#2252. Initially we tested with 'deflate'; as I understand, by default nobody uses Avro compression for Iceberg manifests anyway.

Thanks for your reply.
The default compression for Iceberg manifest files in Paimon is Snappy. Perhaps we should consider changing the default compression for Iceberg manifest files later. CC @JingsongLi

@LsomeYeah (Contributor) left a comment

+1

@muttcg (Contributor, Author) commented Jul 30, 2025

@LsomeYeah
I have re-checked Google BigQuery and Polars (which uses PyIceberg and PyArrow under the hood) with the fix and 'deflate' compression, and everything works as I expect. Examples I used:

BQ query:

SELECT * 
FROM `***.example.flair_paimon_iceberg_v7`
WHERE partition_date = "2024-07-01"
AND partition_hh = "17"
AND app_version = '10.335.0'
ORDER BY country DESC
LIMIT 1000

Polars query:

import polars as pl

table_path = "gs://***/table_v2/iceberg/example/flair_paimon_iceberg_v7/metadata/v3.metadata.json"

df = (
  pl.scan_iceberg(table_path)
  .filter(
    (pl.col("partition_date") == "2024-07-01") &
    (pl.col("partition_hh") == "17") &
    (pl.col("app_version") == "10.335.0")
  )
  .sort("country", descending=True)
  .limit(1000)
  .collect()
)

print(df)

@brunsgaard commented:

I am excited about this PR, thanks @muttcg and @LsomeYeah for looking into this ❤️ 🥇

@JingsongLi (Contributor) left a comment

Thanks @muttcg and @LsomeYeah, +1

@JingsongLi merged commit 9126349 into apache:master Jul 31, 2025
21 checks passed
jerry-024 added a commit to jerry-024/paimon that referenced this pull request Aug 1, 2025
* github/master: (41 commits)
  [Python] Support data writer for PyPaimon (apache#5997)
  [Python] Support scan and plan for PyPaimon (apache#5996)
  [flink-cdc] Provide option to disable use of source primary keys if primary keys in action command are not specified for CDC ingestion. (apache#5793)
  Revert "[core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)" (apache#5995)
  [core] Add total compaction count metric (apache#5963)
  [hotfix] Rename to SchemaManager.applyRenameColumnsToOptions
  [core] fix column rename when columns referenced by table options. (apache#5964)
  [core] Log a warning for invalid partition values instead of throwing an exception when enable partition mark done.  (apache#5978)
  [core] Add required Field IDs to support ID-based column pruning (apache#5981)
  [core] Row-tracking row should keep their row_id and sequence_number in compaction (apache#5991)
  [core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)
  [format] Introduce 'write.batch-memory' to control memory in arrow (apache#5988)
  [flink] Change filesystem.job-level-settings.enabled default value to true (apache#5971)
  [clone] support including some tables when clone all tables in a catalog or database. (apache#5993)
  [iceberg] Support TINYINT and SMALLINT in Iceberg Compatibility (apache#5984)
  [Python] Support snapshot and manifest for PyPaimon (apache#5987)
  [python] Change Schema to TableSchema in Class GetTableResponse.  (apache#5990)
  [core] Introduce 'compaction.total-size-threshold' to do full compaction (apache#5973)
  [Python] Support filesystem catalog for PyPaimon (apache#5986)
  [core] Add lance table type for rest catalog (apache#5977)
  ...