Support merge manifests on writes (MergeAppend) #363
Conversation
Fokko
left a comment
Great start @HonahX! Maybe we want to see if there are any things we can split out, such as the rolling manifest writer.
pyiceberg/table/__init__.py (outdated)

# TODO: need to re-consider the name here: manifest containing positional deletes and manifest containing deleted entries
unmerged_deletes_manifests = [manifest for manifest in existing_manifests if manifest.content == ManifestContent.DELETES]

data_manifest_merge_manager = ManifestMergeManager(
We're changing the append operation from a fast-append to a regular append when it hits a threshold. I would be more comfortable with keeping the compaction separate. This way we know that an append/overwrite is always fast and in constant time. For example, if you have a process that appends data, you know how fast it will run (actually it is a function of the number of manifests).
Thanks for the explanation! Totally agree! I was thinking it might be a good time to bring FastAppend and MergeAppend to pyiceberg, making them inherit from a _SnapshotProducer
pyiceberg/table/__init__.py (outdated)

raise ValueError("Cannot write to partitioned tables")

merge = _MergingSnapshotProducer(operation=Operation.APPEND, table=self)
# TODO: need to consider how to support both _MergeAppend and _FastAppend
Do we really want to support both? This part of the Java code has been a major source of (hard to debug) problems. Splitting out the commit and compaction path completely would simplify that quite a bit.
I think it is a good idea to have a separate API in UpdateSnapshot in #446 to compact manifests only. However, I believe retaining MergeAppend is also necessary due to the commit.manifest-merge.enabled setting. When this setting is enabled (which is the default), users expect manifests to be merged automatically when they append/overwrite data, rather than having to compact manifests via a separate API. What do you think?
Hey @HonahX thanks for working on this and sorry for the late reply. I wanted to take the time to test this properly.
It looks like either the snapshot inheritance is not working properly, or something is off with the writer. I converted the Avro manifest files to JSON using avro-tools, and noticed the following:
{
"status": 1,
"snapshot_id": {
"long": 6972473597951752000
},
"data_sequence_number": {
"long": -1
},
"file_sequence_number": {
"long": -1
},
...
}
{
"status": 0,
"snapshot_id": {
"long": 3438738529910612500
},
"data_sequence_number": {
"long": -1
},
"file_sequence_number": {
"long": -1
},
...
}
{
"status": 0,
"snapshot_id": {
"long": 1638533332780464400
},
"data_sequence_number": {
"long": 1
},
"file_sequence_number": {
"long": 1
},
....
}

Looks like the snapshot inheritance is not working properly when rewriting the manifests.
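For context, a minimal sketch of the inheritance rule the dump above seems to violate (plain Python with hypothetical names, not pyiceberg's actual classes): an ADDED entry written with a null sequence number inherits the sequence number of the snapshot that added it, so persisting -1 instead of null blocks that inheritance.

from dataclasses import dataclass
from typing import Optional

ADDED = 1  # manifest entry status code for "added"

@dataclass
class Entry:
    """Hypothetical stand-in for a manifest entry."""
    status: int
    data_sequence_number: Optional[int]

def effective_sequence_number(entry: Entry, manifest_sequence_number: int) -> Optional[int]:
    # An ADDED entry with a null data_sequence_number inherits the sequence
    # number of the manifest/snapshot that added it; -1 is treated as a real value.
    if entry.data_sequence_number is None and entry.status == ADDED:
        return manifest_sequence_number
    return entry.data_sequence_number

assert effective_sequence_number(Entry(status=ADDED, data_sequence_number=None), 3) == 3
assert effective_sequence_number(Entry(status=ADDED, data_sequence_number=-1), 3) == -1  # inheritance blocked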
tests/integration/test_writes.py (outdated)

assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]

@pytest.mark.integration
Can you parameterize the test for both V1 and V2 tables?
We want to assert the manifest-entries as well (only for the merge-appended one).
sungwy
left a comment
Thank you very much for adding this @HonahX. Just one small nit, and otherwise looks good to me!
pyiceberg/table/__init__.py (outdated)

 with self.transaction() as txn:
-    with txn.update_snapshot().fast_append() as update_snapshot:
+    with txn.update_snapshot().merge_append() as update_snapshot:
Could we update the new add_files method to also use merge_append?
That seems to be the default choice of snapshot producer in Java
@syun64 Could you elaborate on the motivation to pick merge-append over fast-append? For Java, it is for historical reasons, since fast-append was added later. The fast-append creates more metadata, but it also has these advantages:
- Takes less time to commit, since it doesn't rewrite any existing manifests. This reduces the chances of having a conflict.
- The time it takes to commit is more predictable and fairly constant to the number of data files that are written.
- When you static-overwrite partitions as you do in your typical ETL, it will speed up the deletes since it can just drop a whole manifest that the previous fast-append has produced.
The main downside is that full-table scans need to evaluate more metadata.
That's a good argument @Fokko. Especially in a world where we are potentially moving the work of doing table scans into the REST catalog, compacting manifests on write isn't important for a function that already looks to prioritize commit speed over everything else.
I think it makes sense to leave the function to use fast_append and let the users rely on other means of optimizing their table scans.
57eba6a to bf63c03
Sorry for the long wait. I've fixed the sequence number inheritance issue. Previously, some manifest entries incorrectly persisted the sequence number as -1 instead of leaving it null for inheritance. I will add tests and update the doc soon.
HonahX
left a comment
Tests and doc are pushed! @Fokko @syun64 Could you please review this again when you have a chance?
sungwy
left a comment
Just a few nits, otherwise looks good @HonahX
sungwy
left a comment
This looks good to me @HonahX 👍
I'm seeing some odd behavior:

from pyiceberg.catalog.sql import SqlCatalog
from datetime import datetime, timezone, date
import uuid
import pyarrow as pa
pa_schema = pa.schema([
("bool", pa.bool_()),
("string", pa.large_string()),
("string_long", pa.large_string()),
("int", pa.int32()),
("long", pa.int64()),
("float", pa.float32()),
("double", pa.float64()),
# Not supported by Spark
# ("time", pa.time64('us')),
("timestamp", pa.timestamp(unit="us")),
("timestamptz", pa.timestamp(unit="us", tz="UTC")),
("date", pa.date32()),
# Not supported by Spark
# ("time", pa.time64("us")),
# Not natively supported by Arrow
# ("uuid", pa.fixed(16)),
("binary", pa.large_binary()),
("fixed", pa.binary(16)),
])
TEST_DATA_WITH_NULL = {
"bool": [False, None, True],
"string": ["a", None, "z"],
# Go over the 16 bytes to kick in truncation
"string_long": ["a" * 22, None, "z" * 22],
"int": [1, None, 9],
"long": [1, None, 9],
"float": [0.0, None, 0.9],
"double": [0.0, None, 0.9],
# 'time': [1_000_000, None, 3_000_000], # Example times: 1s, none, and 3s past midnight #Spark does not support time fields
"timestamp": [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
"timestamptz": [
datetime(2023, 1, 1, 19, 25, 00, tzinfo=timezone.utc),
None,
datetime(2023, 3, 1, 19, 25, 00, tzinfo=timezone.utc),
],
"date": [date(2023, 1, 1), None, date(2023, 3, 1)],
# Not supported by Spark
# 'time': [time(1, 22, 0), None, time(19, 25, 0)],
# Not natively supported by Arrow
# 'uuid': [uuid.UUID('00000000-0000-0000-0000-000000000000').bytes, None, uuid.UUID('11111111-1111-1111-1111-111111111111').bytes],
"binary": [b"\01", None, b"\22"],
"fixed": [
uuid.UUID("00000000-0000-0000-0000-000000000000").bytes,
None,
uuid.UUID("11111111-1111-1111-1111-111111111111").bytes,
],
}
catalog = SqlCatalog("test_sql_catalog", uri="sqlite:///:memory:", warehouse=f"/tmp/")
pa_table = pa.Table.from_pydict(TEST_DATA_WITH_NULL, schema=pa_schema)
catalog.create_namespace(('some',))
tbl = catalog.create_table(identifier="some.table", schema=pa_schema, properties={
"commit.manifest.min-count-to-merge": "2"
})
for num in range(5):
print(f"Appended: {num}")
tbl.merge_append(pa_table)It tries to read a corrupt file (or a bug in our reader): It tries to read this file, which turns out to be empty? avro-tools tojson /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:44:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
avro-tools getmeta /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:45:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
schema {"type":"struct","fields":[{"id":1,"name":"bool","type":"boolean","required":false},{"id":2,"name":"string","type":"string","required":false},{"id":3,"name":"string_long","type":"string","required":false},{"id":4,"name":"int","type":"int","required":false},{"id":5,"name":"long","type":"long","required":false},{"id":6,"name":"float","type":"float","required":false},{"id":7,"name":"double","type":"double","required":false},{"id":8,"name":"timestamp","type":"timestamp","required":false},{"id":9,"name":"timestamptz","type":"timestamptz","required":false},{"id":10,"name":"date","type":"date","required":false},{"id":11,"name":"binary","type":"binary","required":false},{"id":12,"name":"fixed","type":"fixed[16]","required":false}],"schema-id":0,"identifier-field-ids":[]}
partition-spec {"spec-id":0,"fields":[]}
partition-spec-id 0
format-version 2
content data
avro.schema {"type": "record", "fields": [{"name": "status", "field-id": 0, "type": "int"}, {"name": "snapshot_id", "field-id": 1, "type": ["null", "long"], "default": null}, {"name": "data_sequence_number", "field-id": 3, "type": ["null", "long"], "default": null}, {"name": "file_sequence_number", "field-id": 4, "type": ["null", "long"], "default": null}, {"name": "data_file", "field-id": 2, "type": {"type": "record", "fields": [{"name": "content", "field-id": 134, "type": "int", "doc": "File format name: avro, orc, or parquet"}, {"name": "file_path", "field-id": 100, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "file_format", "field-id": 101, "type": "string", "doc": "File format name: avro, orc, or parquet"}, {"name": "partition", "field-id": 102, "type": {"type": "record", "fields": [], "name": "r102"}, "doc": "Partition data tuple, schema based on the partition spec"}, {"name": "record_count", "field-id": 103, "type": "long", "doc": "Number of records in the file"}, {"name": "file_size_in_bytes", "field-id": 104, "type": "long", "doc": "Total file size in bytes"}, {"name": "column_sizes", "field-id": 108, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k117_v118", "fields": [{"name": "key", "type": "int", "field-id": 117}, {"name": "value", "type": "long", "field-id": 118}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total size on disk"}, {"name": "value_counts", "field-id": 109, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k119_v120", "fields": [{"name": "key", "type": "int", "field-id": 119}, {"name": "value", "type": "long", "field-id": 120}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total count, including null and NaN"}, {"name": "null_value_counts", "field-id": 110, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k121_v122", "fields": [{"name": "key", "type": "int", "field-id": 121}, {"name": "value", "type": "long", "field-id": 122}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to null value count"}, {"name": "nan_value_counts", "field-id": 137, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k138_v139", "fields": [{"name": "key", "type": "int", "field-id": 138}, {"name": "value", "type": "long", "field-id": 139}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to number of NaN values in the column"}, {"name": "lower_bounds", "field-id": 125, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k126_v127", "fields": [{"name": "key", "type": "int", "field-id": 126}, {"name": "value", "type": "bytes", "field-id": 127}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to lower bound"}, {"name": "upper_bounds", "field-id": 128, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k129_v130", "fields": [{"name": "key", "type": "int", "field-id": 129}, {"name": "value", "type": "bytes", "field-id": 130}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to upper bound"}, {"name": "key_metadata", "field-id": 131, "type": ["null", "bytes"], "default": null, "doc": "Encryption key metadata blob"}, {"name": "split_offsets", "field-id": 132, "type": ["null", {"type": "array", "element-id": 133, "items": "long"}], "default": null, "doc": "Splittable offsets"}, {"name": "equality_ids", "field-id": 135, "type": ["null", {"type": "array", "element-id": 136, "items": "long"}], "default": null, "doc": 
"Field ids used to determine row equality in equality delete files."}, {"name": "sort_order_id", "field-id": 140, "type": ["null", "int"], "default": null, "doc": "ID representing sort order for this file"}], "name": "r2"}}], "name": "manifest_entry"}
avro.codec null

Looks like we're writing empty files: #876
Fokko
left a comment
Looking good @HonahX ! 🙌
mkdocs/docs/api.md (outdated)

# or
tbl.merge_append(df)
I'm reluctant to expose this to the public API for a couple of reasons:
- Unsure if folks know what the impact is between choosing fast- or merge appends.
- It might also be that we do appends as part of the operation (upserts as an obvious one).
- Another method to the public API :)
How about having something similar to Java, where this is controlled using a table property: https://iceberg.apache.org/docs/1.5.2/configuration/#table-behavior-properties
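For instance, a sketch of what the property-driven approach could look like from the user's side (property names from this PR's description; whether merging actually happens on a plain append depends on how the final implementation wires these up):

import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog("demo", uri="sqlite:///:memory:", warehouse="/tmp/warehouse")
catalog.create_namespace(("db",))

schema = pa.schema([("id", pa.int64()), ("name", pa.large_string())])
tbl = catalog.create_table(
    identifier="db.events",
    schema=schema,
    properties={
        "commit.manifest-merge.enabled": "true",          # merge manifests on write
        "commit.manifest.min-count-to-merge": "100",      # manifests to accumulate before merging
        "commit.manifest.target-size-bytes": str(8 * 1024 * 1024),  # target manifest size
    },
)
tbl.append(pa.Table.from_pydict({"id": [1, 2], "name": ["a", "b"]}, schema=schema))  # plain append; merging is property-driven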
Sounds great! I am also +1 on letting this be controlled by the config. I made merge_append a separate API to mirror the Java implementation, which has newAppend and newFastAppend APIs, but it seems better to just make commit.manifest-merge.enabled default to False on the Python side.
I will still keep FastAppend and MergeAppend as separate classes, and keep merge_append in the UpdateSnapshot class for clarity, although the current MergeAppend is purely FastAppend + manifest merge.
Just curious: why doesn't the Java newAppend return a FastAppend implementation when commit.manifest-merge.enabled is False? Is it due to some backward compatibility issue?
Thanks! I think the use-case of the Java library is slightly different, since that's mostly used in query engines.
Is it due to some backward compatibiilty issue?
I think it is for historical reasons, since the fast-append was added later on :)
btw, I like how you split it out into classes, it is much cleaner now 👍
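For readers following along, a rough sketch of the split being discussed (illustrative placeholder classes, not the exact pyiceberg code): MergeAppend behaves like FastAppend plus a manifest-merge step before the snapshot is committed.

from typing import Callable, List

Manifest = str  # placeholder type for illustration

class FastAppend:
    """Adds the new manifests and leaves existing manifests untouched."""

    def manifests(self, added: List[Manifest], existing: List[Manifest]) -> List[Manifest]:
        return added + existing

class MergeAppend(FastAppend):
    """FastAppend plus a merge pass driven by the commit.manifest.* properties."""

    def __init__(self, merge: Callable[[List[Manifest]], List[Manifest]]):
        self._merge = merge  # e.g. something like ManifestMergeManager.merge_manifests

    def manifests(self, added: List[Manifest], existing: List[Manifest]) -> List[Manifest]:
        return self._merge(super().manifests(added, existing))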
-output_file_location = _new_manifest_path(
-    location=self._transaction.table_metadata.location, num=0, commit_uuid=self.commit_uuid
-)
 with write_manifest(
     format_version=self._transaction.table_metadata.format_version,
     spec=self._transaction.table_metadata.spec(),
     schema=self._transaction.table_metadata.schema(),
-    output_file=self._io.new_output(output_file_location),
+    output_file=self.new_manifest_output(),
@Fokko Thanks for the detailed code example and stacktrace! With their help and #876, I found the root cause of the bug: a collision between the names of manifest files written within a single commit. I've modified the code to avoid that.
It is hard to catch because, when the file is in object storage, opening a new OutputFile at the same location leaves the existing file readable until the OutputFile is "committed". So for the integration tests that use minio everything works fine; we won't see any issue until we roll back to a previous snapshot.
For the in-memory SqlCatalog test, since the file is on the local filesystem, the existing file becomes empty/corrupted immediately after we open a new OutputFile at the same location. This causes the ManifestMergeManager to write some empty files, and the issue emerges.
I've included a temporary test in test_sql.py to ensure the correctness of the current change. I will try to formalize that tomorrow.
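For illustration, a hypothetical sketch of that kind of fix (a per-commit counter so every manifest written during a commit gets its own location; the real change goes through pyiceberg's path helpers):

import itertools
import uuid
from typing import Callable

def manifest_path_factory(table_location: str, commit_uuid: uuid.UUID) -> Callable[[], str]:
    """Hand out a unique manifest path per call so that two manifests written
    during the same commit can never collide on one location."""
    counter = itertools.count()

    def new_manifest_path() -> str:
        return f"{table_location}/metadata/{commit_uuid}-m{next(counter)}.avro"

    return new_manifest_path

next_path = manifest_path_factory("/tmp/some.db/table", uuid.uuid4())
assert next_path() != next_path()  # -m0.avro, -m1.avro, ...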
Thanks for digging into this and fixing it 🙌
Doing some testing with:

V1 Table

Manifest-list

5th manifest-list

{
"manifest_path": "/tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro",
"manifest_length": 6878,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 5,
"min_sequence_number": 1,
"added_snapshot_id": 6508090689697406000,
"added_files_count": 1,
"existing_files_count": 4,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 12,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}

4th manifest-list

{
"manifest_path": "/tmp/some.db/table/metadata/88807344-0e23-413c-827e-2a9ec63c6233-m1.avro",
"manifest_length": 6436,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 4,
"min_sequence_number": 1,
"added_snapshot_id": 3455109142449701000,
"added_files_count": 1,
"existing_files_count": 3,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 9,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}

Manifests

We have 5 manifests as expected. Last one:

{
"status": 1,
"snapshot_id": {
"long": 6508090689697406000
},
"data_sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}

First one:

{
"status": 0,
"snapshot_id": {
"long": 6508090689697406000
},
"data_sequence_number": {
"long": 1
},
"file_sequence_number": {
"long": 1
},
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}

This looks good, except for one thing: the snapshot_id of the first entry. This should be the ID of the first append operation.

V2 Table

Manifest list

5th manifest-list

{
"manifest_path": "/tmp/some.db/tablev2/metadata/93717a88-1cea-4e3d-a69a-00ce3d087822-m1.avro",
"manifest_length": 6883,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 5,
"min_sequence_number": 1,
"added_snapshot_id": 898025966831056900,
"added_files_count": 1,
"existing_files_count": 4,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 12,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}

4th manifest-list

{
"manifest_path": "/tmp/some.db/tablev2/metadata/5c64a07c-4b8a-4be1-a751-d4fd339560e2-m0.avro",
"manifest_length": 5127,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 1,
"min_sequence_number": 1,
"added_snapshot_id": 1343032504684197000,
"added_files_count": 1,
"existing_files_count": 0,
"deleted_files_count": 0,
"added_rows_count": 3,
"existing_rows_count": 0,
"deleted_rows_count": 0,
"partitions": {
"array": []
},
"key_metadata": null
}

Manifests

Last manifest file in manifest-list:

{
"status": 1,
"snapshot_id": {
"long": 898025966831056900
},
"data_sequence_number": null,
"file_sequence_number": null,
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}

First manifest in manifest-list:

{
"status": 0,
"snapshot_id": {
"long": 898025966831056900
},
"data_sequence_number": {
"long": 1
},
"file_sequence_number": {
"long": 1
},
"data_file": {
"content": 0,
"file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet",
"file_format": "PARQUET",
"partition": {},
"record_count": 3,
"file_size_in_bytes": 5459,
"column_sizes": { ... },
"value_counts": { ... },
"null_value_counts": { ... },
"nan_value_counts": { ... },
"lower_bounds": { ... },
"upper_bounds": { ... },
"key_metadata": null,
"split_offsets": {
"array": [
4
]
},
"equality_ids": null,
"sort_order_id": null
}
}

Except for the snapshot-id and #893, this looks great! 🥳

Another test: I don't think it merges the manifests as it should. I would expect the manifest entries to be distributed more evenly over the manifests to ensure maximum parallelization.
# Conflicts:
#   pyiceberg/table/__init__.py
#   tests/integration/test_writes/test_writes.py
I think the observed behavior aligns with Java's merge_append. Each append adds one manifest. At the 100th append, when the number of manifests reaches 100, the merge manager merges all of them into a new manifest file because they are all in the same "bin". This happens whenever the number of manifests reaches 100, thus leaving us with one large manifest and 4 small ones. I used Spark to do a similar thing and got a similar result:

@pytest.mark.integration
def test_spark_ref_behavior(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
    identifier = "default.test_spark_ref_behavior"
    tbl = _create_table(
        session_catalog,
        identifier,
        {"commit.manifest-merge.enabled": "true", "commit.manifest.min-count-to-merge": "10", "format-version": 2},
        [],
    )
    spark_df = spark.createDataFrame(arrow_table_with_null.to_pandas())
    for i in range(50):
        spark_df.writeTo(f"integration.{identifier}").append()
    tbl = session_catalog.load_table(identifier)
    tbl_a_manifests = tbl.current_snapshot().manifests(tbl.io)
    for manifest in tbl_a_manifests:
        print(
            f"Manifest: added: {manifest.added_files_count}, existing: {manifest.existing_files_count}, deleted: {manifest.deleted_files_count}"
        )
=====
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 135, deleted: 0

To distribute manifest entries more evenly, I think we need to adjust the … I think this also reveals the value of the fast_append + compaction model, which makes things more explicit.
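To make that behavior concrete, a toy sketch of the bin-packing decision (assumed semantics based on the Java behavior described above, not pyiceberg's actual implementation): manifests are packed into bins capped by the target size, and only bins that have accumulated at least min-count-to-merge manifests get rewritten into a single manifest.

from typing import List

def plan_merges(manifest_sizes: List[int], target_size_bytes: int, min_count_to_merge: int) -> List[List[int]]:
    """Pack manifest sizes into size-capped bins and return the bins that would be merged."""
    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in manifest_sizes:
        if current and current_size + size > target_size_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    # Many tiny manifests all land in one bin, so they merge in a single pass,
    # which matches the one-big-plus-a-few-small pattern seen above.
    return [b for b in bins if len(b) >= min_count_to_merge]

print(plan_merges([7_000] * 50, 8 * 1024 * 1024, 10))  # one bin holding all 50 manifests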
assert tbl_a_data_file["file_path"].startswith("s3://warehouse/default/merge_manifest_a/data/")
if tbl_a_data_file["file_path"] == first_data_file_path:
    # verify that the snapshot id recorded should be the one where the file was added
    assert tbl_a_entries["snapshot_id"][i] == first_snapshot_id
Added a test to verify the snapshot_id issue.

Thanks, that actually makes a lot of sense 👍

Whoo 🥳 Thanks @HonahX for working on this, and thanks @syun64 for the review 🙌
Add MergeAppendFiles. This PR will enable the following configurations:

- commit.manifest-merge.enabled: Controls whether to automatically merge manifests on writes.
- commit.manifest.min-count-to-merge: Minimum number of manifests to accumulate before merging.
- commit.manifest.target-size-bytes: Target size when merging manifest files.

Since commit.manifest-merge.enabled defaults to True, we need to make MergeAppend the default way to append data, to align with the property definition and the Java implementation.
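As a rough sketch of what "default on" could mean for the append path (hypothetical helper, not the PR's exact code), the snapshot producer can be chosen from the table property:

from typing import Dict

MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"   # property name from this PR
MANIFEST_MERGE_ENABLED_DEFAULT = "true"                    # default mirrors Java

def use_merge_append(table_properties: Dict[str, str]) -> bool:
    """Return True when appends should merge manifests (MergeAppend) instead of FastAppend."""
    value = table_properties.get(MANIFEST_MERGE_ENABLED, MANIFEST_MERGE_ENABLED_DEFAULT)
    return value.strip().lower() == "true"

assert use_merge_append({}) is True                                   # merging on by default
assert use_merge_append({MANIFEST_MERGE_ENABLED: "false"}) is False   # opt out per table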