Spec: Add implementation note for determining manifest list format version #14926

@yshcz

Feature Request / Improvement

The current spec defines Avro file metadata requirements for manifest files in a clear table:

| v1 | v2 | Key | Value |
|----|----|-----|-------|
| required | required | schema | JSON representation of the table schema at the time the manifest was written |
| optional | required | schema-id | ID of the schema used to write the manifest as a string |
| required | required | partition-spec | JSON representation of the partition spec used to write the manifest |
| optional | required | partition-spec-id | ID of the partition spec used to write the manifest as a string |
| optional | required | format-version | Table format version number of the manifest as a string |
| | required | content | Type of content files tracked by the manifest: "data" or "deletes" |

But manifest list files have no equivalent specification for their Avro metadata, despite the Java implementation writing metadata such as format-version, snapshot-id, parent-snapshot-id, and sequence-number to manifest list files since 2020.

For manifests: #913 added format-version to code (2020-04), and #1499 added the spec (2020-10).
For manifest lists: #907 added format-version to code (2020-04), but there are no corresponding spec changes.

As a result, implementations have no standard way to detect the format version from a manifest list file alone. They must either infer the version from the presence of certain fields or simply trust the table metadata version. The latter is unreliable in upgrade scenarios, where a v2 table may still contain v1 snapshots, and handling that case adds unnecessary complexity.
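
To make the two detection strategies concrete, here is a minimal sketch of what a reader might do. It is not taken from any Iceberg implementation: the function name `detect_format_version` is hypothetical, and the field names used for inference (`first_row_id`, `sequence_number`) are my assumption about which manifest list fields distinguish the versions. It assumes the Avro file metadata has already been read into a mapping of string keys to raw values.

```python
from typing import Mapping, Set, Union


def detect_format_version(
    avro_metadata: Mapping[str, Union[bytes, str]],
    field_names: Set[str],
) -> int:
    """Hypothetical helper: guess the format version of a manifest list.

    avro_metadata: key/value metadata from the Avro file header.
    field_names:   names of the fields in the manifest list record schema.
    """
    # Preferred path: the writer recorded format-version explicitly.
    raw = avro_metadata.get("format-version")
    if raw is not None:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        return int(raw)

    # Fallback: infer from schema fields (assumed discriminators, not
    # something the spec currently defines for manifest lists).
    if "first_row_id" in field_names:
        return 3
    if "sequence_number" in field_names:
        return 2
    return 1
```

With a specified `format-version` key, only the first branch would be needed; the fallback is the kind of inference that the missing spec text currently forces on readers.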

The following table might be a reasonable addition, though I'm not entirely certain about the requirements:

| v1 | v2 | v3 | Key | Value |
|----|----|----|-----|-------|
| required | required | required | snapshot-id | The snapshot ID for this manifest list as a string |
| required | required | required | parent-snapshot-id | The parent snapshot ID as a string |
| | required | required | sequence-number | The sequence number of the snapshot as a string |
| | | required | first-row-id | The first row ID for row lineage as a string |
| optional | required | required | format-version | Table format version number as a string |
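
As a rough illustration of how the tentative table above could be checked by a reader or test suite, here is a small sketch. The requirement levels are copied from the proposal as written and may well change; `REQUIRED_KEYS` and `missing_required_keys` are hypothetical names, not part of any existing API.

```python
from typing import Dict, List, Mapping

# Keys marked "required" per format version in the tentative table.
# Optional and blank cells are omitted on purpose.
REQUIRED_KEYS: Dict[int, List[str]] = {
    1: ["snapshot-id", "parent-snapshot-id"],
    2: ["snapshot-id", "parent-snapshot-id", "sequence-number", "format-version"],
    3: [
        "snapshot-id",
        "parent-snapshot-id",
        "sequence-number",
        "first-row-id",
        "format-version",
    ],
}


def missing_required_keys(version: int, avro_metadata: Mapping[str, object]) -> List[str]:
    """Return the required metadata keys absent from a manifest list file."""
    if version not in REQUIRED_KEYS:
        raise ValueError(f"unsupported format version: {version}")
    return [key for key in REQUIRED_KEYS[version] if key not in avro_metadata]
```

For example, a v2 manifest list carrying only `snapshot-id` and `parent-snapshot-id` would report `sequence-number` and `format-version` as missing under this proposal.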

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
