Description
Feature Request / Improvement
The current spec defines Avro file metadata requirements for manifest files in a clear table:
| v1 | v2 | Key | Value |
|---|---|---|---|
| required | required | schema | JSON representation of the table schema at the time the manifest was written |
| optional | required | schema-id | ID of the schema used to write the manifest as a string |
| required | required | partition-spec | JSON representation of the partition spec used to write the manifest |
| optional | required | partition-spec-id | ID of the partition spec used to write the manifest as a string |
| optional | required | format-version | Table format version number of the manifest as a string |
| | required | content | Type of content files tracked by the manifest: "data" or "deletes" |
But manifest list files have no equivalent specification for their Avro metadata, despite the Java implementation writing metadata such as format-version, snapshot-id, parent-snapshot-id, and sequence-number to manifest list files since 2020.
For manifests: #913 added format-version to code (2020-04), and #1499 added the spec (2020-10).
For manifest lists: #907 added format-version to code (2020-04), but there are no corresponding spec changes.
As a result, implementations have no standard way to detect the format version from a manifest list file alone. They are forced to either infer the version based on the presence of certain fields, or simply trust the table metadata version. The latter is unreliable in upgrade scenarios where a v2 table may contain v1 snapshots, introducing unnecessary complexity.
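To illustrate what reading the version "from a manifest list file alone" could look like, here is a minimal sketch in Python that parses the key/value metadata from the header of an Avro object container file using only the standard library. The function names are hypothetical, and the fallback to version 1 when `format-version` is absent is an assumption for illustration, not something the spec currently defines for manifest lists.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, ...
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then the payload."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_avro_file_metadata(stream):
    """Return the header metadata map (str -> bytes) of an Avro container file."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count: a block byte size follows, unused here
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return metadata


def detect_format_version(metadata, default=1):
    """Read `format-version` if present; the default of 1 is an assumption."""
    raw = metadata.get("format-version")
    return int(raw.decode("utf-8")) if raw is not None else default
```

A caller would open the manifest list in binary mode and pass the file object to `read_avro_file_metadata`. The same map also holds Avro's own `avro.schema` and `avro.codec` entries, so the Iceberg keys are just additional entries in it.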
The following table might be a reasonable addition, though I'm not entirely certain about the requirements:
| v1 | v2 | v3 | Key | Value |
|---|---|---|---|---|
| required | required | required | snapshot-id | The snapshot ID for this manifest list as a string |
| required | required | required | parent-snapshot-id | The parent snapshot ID as a string |
| | required | required | sequence-number | The sequence number of the snapshot as a string |
| | | required | first-row-id | The first row ID for row lineage as a string |
| optional | required | required | format-version | Table format version number as a string |
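If requirements along these lines were adopted, a reader could validate a manifest list's metadata with a simple lookup. The sketch below transcribes the proposed table into a Python dictionary. The names `PROPOSED_REQUIRED_KEYS` and `missing_required_keys` are hypothetical, and the key sets inherit the uncertainty of the proposal itself.

```python
# Required metadata keys per format version, transcribed from the proposed
# table above. These are a draft proposal, not part of the current spec.
PROPOSED_REQUIRED_KEYS = {
    1: {"snapshot-id", "parent-snapshot-id"},
    2: {"snapshot-id", "parent-snapshot-id", "sequence-number", "format-version"},
    3: {
        "snapshot-id",
        "parent-snapshot-id",
        "sequence-number",
        "first-row-id",
        "format-version",
    },
}


def missing_required_keys(metadata, format_version):
    """Return the sorted list of proposed-required keys absent from `metadata`."""
    if format_version not in PROPOSED_REQUIRED_KEYS:
        raise ValueError(f"unsupported format version: {format_version}")
    return sorted(PROPOSED_REQUIRED_KEYS[format_version] - set(metadata))
```

For example, a v1 manifest list carrying only `snapshot-id` and `parent-snapshot-id` would pass this check, while the same metadata checked as v2 would report `format-version` and `sequence-number` as missing.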
Query engine
None
Willingness to contribute
- I can contribute this improvement/feature independently
- I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- I cannot contribute this improvement/feature at this time