[core] Add required Iceberg Field IDs to support ID-based column pruning #5981
Conversation
LsomeYeah
left a comment
Thanks a lot for your contribution! Looks good to me!
Additionally, I was wondering if you could try reading the data using PyIceberg after this change? I ran some tests locally and found the following (a sketch of the read path I used is shown after this list):
- compression strategy
  - snappy: When the manifest's compression strategy is set to `snappy`, I encountered `ValueError: Checksum failure` when reading manifests.
  - null: When the manifest's compression strategy is `null`, the manifests can be read correctly without any issues.
- file format
  - avro: I ran into errors when trying to read Avro-formatted data files. This seems to be due to PyArrow not supporting reading Avro files.
  - parquet: When the manifest's compression strategy is `null` and the data file's format is Parquet, I was able to successfully read the data using PyIceberg.
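For reference, the read path looked roughly like the sketch below; the metadata location is a placeholder for wherever the Paimon-generated Iceberg metadata actually lives.
```python
# Minimal sketch of the PyIceberg read path described above, assuming the
# Iceberg metadata written by Paimon's compatibility layer is on the local
# filesystem. The metadata path is a placeholder.
from pyiceberg.table import StaticTable

# Load the table straight from a metadata file; no catalog is needed for a
# static, read-only view of the table.
table = StaticTable.from_metadata(
    "/path/to/warehouse/mydb.db/mytable/metadata/v3.metadata.json"  # placeholder
)

# Scanning materializes manifests and data files, so this is the step that
# raised "ValueError: Checksum failure" when manifests were snappy-compressed.
arrow_table = table.scan().to_arrow()
print(arrow_table)
```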
@LsomeYeah Thank you for checking.
Thanks for your reply.
LsomeYeah
left a comment
+1
@LsomeYeah BQ query: (output not preserved) Polars query: (output not preserved)
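For context, the Polars side of such a check might look like the sketch below; `scan_iceberg` reads Iceberg tables via PyIceberg, and the path and column names here are placeholders.
```python
# Rough sketch of a Polars query against the same table; the metadata path
# and the column names ("id", "name") are placeholders.
import polars as pl

# scan_iceberg builds a lazy frame over the Iceberg table.
lf = pl.scan_iceberg(
    "/path/to/warehouse/mydb.db/mytable/metadata/v3.metadata.json"  # placeholder
)

# Selecting a subset of columns exercises column pruning, which is exactly
# the code path that needs the field IDs added by this PR.
print(lf.select("id", "name").collect())
```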
I am excited about this PR, thanks @muttcg and @LsomeYeah for looking into this ❤️ 🥇
JingsongLi
left a comment
Thanks @muttcg and @LsomeYeah , +1
* github/master: (41 commits)
  [Python] Support data writer for PyPaimon (apache#5997)
  [Python] Support scan and plan for PyPaimon (apache#5996)
  [flink-cdc] Provide option to disable use of source primary keys if primary keys in action command are not specified for CDC ingestion. (apache#5793)
  Revert "[core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)" (apache#5995)
  [core] Add total compaction count metric (apache#5963)
  [hotfix] Rename to SchemaManager.applyRenameColumnsToOptions
  [core] fix column rename when columns referenced by table options. (apache#5964)
  [core] Log a warning for invalid partition values instead of throwing an exception when enable partition mark done. (apache#5978)
  [core] Add required Field IDs to support ID-based column pruning (apache#5981)
  [core] Row-tracking row should keep their row_id and sequence_number in compaction (apache#5991)
  [core] Add compaction.force-wait to support force waiting compaction finish when preparing commit (apache#5994)
  [format] Introduce 'write.batch-memory' to control memory in arrow (apache#5988)
  [flink] Change filesystem.job-level-settings.enabled default value to true (apache#5971)
  [clone] support including some tables when clone all tables in a catalog or database. (apache#5993)
  [iceberg] Support TINYINT and SMALLINT in Iceberg Compatibility (apache#5984)
  [Python] Support snapshot and manifest for PyPaimon (apache#5987)
  [python] Change Schema to TableSchema in Class GetTableResponse. (apache#5990)
  [core] Introduce 'compaction.total-size-threshold' to do full compaction (apache#5973)
  [Python] Support filesystem catalog for PyPaimon (apache#5986)
  [core] Add lance table type for rest catalog (apache#5977)
  ...
Purpose
Linked issue: close #5875
The Apache Iceberg specification for Avro requires field IDs in order to support ID-based column pruning.
For example, the Google BigQuery engine as well as PyIceberg do not work at all without field IDs, causing critical issues. A sketch of the field-id annotations the spec expects is shown below.
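For illustration, the Iceberg spec's Avro appendix requires every record field to carry a field-id attribute; a minimal sketch of an annotated Avro schema (with made-up field names) looks like this:
```python
# Illustrative only: an Avro record schema annotated as the Iceberg spec
# requires. Field names are made up; the essential part is the "field-id"
# attribute on each field, which lets engines prune columns by ID instead
# of by name.
avro_schema = {
    "type": "record",
    "name": "iceberg_schema",
    "fields": [
        {"name": "id", "type": "long", "field-id": 1},
        {"name": "name", "type": ["null", "string"], "default": None, "field-id": 2},
    ],
}
```
Without these IDs, engines that resolve columns by ID (such as BigQuery and PyIceberg) cannot map Avro fields back to the Iceberg schema.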
The change affects only Iceberg Avro schema creation: it adds new Iceberg custom properties following the specification and fixes the wrong ID for partition values.
Tests
All existing tests must keep passing. Added the org.apache.paimon.iceberg.IcebergCompatibilityTest.testIcebergAvroFieldIds test to cover all required IDs.
API and Format
No
Documentation