HDDS-10611. Design document for MPU GC Optimization#9793

Open
devabhishekpal wants to merge 5 commits into apache:master from devabhishekpal:HDDS-10611-design
Conversation

@devabhishekpal
Contributor

What changes were proposed in this pull request?

HDDS-10611. Design document for MPU GC Optimization

Please describe your PR in detail:
This PR adds the design document for reducing the GC pressure that MPU file handling puts on the OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10611

How was this patch tested?

N/A

Contributor

@ivandika3 ivandika3 left a comment


Thanks @devabhishekpal for the design on this long-overdue issue. I am +1 on the overall direction. Left some comments.

@ivandika3 ivandika3 requested a review from szetszwo February 21, 2026 09:24
@devabhishekpal
Contributor Author

Thanks for the exhaustive review and inputs @ivandika3.
I have updated the document with the new details. Please let me know if I have misunderstood anything or if anything else could be improved.

FYI, I have a sample/PoC patch created if anybody wants to check the changes.
master...devabhishekpal:ozone:HDDS-10611

Contributor

@errose28 errose28 left a comment


Thanks for the design @devabhishekpal @rakeshadr. Overall LGTM, just a few things we can clarify in the doc.

* Value = full `OmMultipartKeyInfo` with all parts inline.

**Implications:**
1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
Contributor


It's worth noting that this is how the regular open key write works as well, and this alone isn't what makes MPU worse than other writes. The overhead increases with the number of blocks. Since MPU defaults to 5 MB parts, it is roughly 50x more expensive in this area compared to our regular key write, which uses 256 MB blocks.
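The arithmetic behind that 50x figure can be sanity-checked with a quick sketch (a toy illustration; the class and method names are made up, and the sizes are the ones quoted above):

```java
public class MpuOverheadSketch {
    // Sizes quoted in the comment above.
    static final long MPU_PART_BYTES = 5L * 1024 * 1024;         // default MPU part size, 5 MB
    static final long REGULAR_BLOCK_BYTES = 256L * 1024 * 1024;  // regular key block size, 256 MB

    // Number of commit operations (each a full read-modify-write of the key
    // metadata) needed to write an object of the given size.
    static long commits(long objectBytes, long unitBytes) {
        return (objectBytes + unitBytes - 1) / unitBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long oneGib = 1024L * 1024 * 1024;
        System.out.println(commits(oneGib, MPU_PART_BYTES));      // 205 commits via MPU parts
        System.out.println(commits(oneGib, REGULAR_BLOCK_BYTES)); // 4 commits via regular blocks
        // ~51x more metadata read-modify-write cycles, which the comment rounds to 50x.
    }
}
```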

* **Metadata table:** Lightweight per-MPU metadata (no part list).
* **Parts table:** One row per part (flat structure).

**New MultipartPartInfo Structure:**
Contributor


My understanding is that we are going to keep MultipartKeyInfo and the MultipartInfoTable, but deprecate some fields in MultipartKeyInfo while potentially adding new ones. I think defining the MultipartKeyInfo structure for the new flow first is important to set up context for the individual part structure. For example, it seems like we don't need to duplicate volume, bucket, key, metadata, encryption info, and file checksum for each part; that should be contained in the metadata object.

Contributor Author


We are not deprecating existing fields from MultipartKeyInfo because that might introduce compatibility issues with older clients.
Instead, we are only appending one new field, schemaVersion, to the object to determine what type of write is happening: an upgraded write, or an older write using the existing flow.

I also checked the S3 Multipart specification. It seems the S3 client does preserve checksums for the parts as well as the object as a whole.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpuchecksums

> When using CRC-64/NVME, Amazon S3 calculates the checksum of the full object after the multipart or single part upload is complete.

and

> For individual parts, you can use GetObject or HeadObject. If you want to retrieve the checksum values for individual parts of multipart uploads while they're still in process, you can use ListParts.
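A minimal sketch of how the read path could fork on the appended schemaVersion field (the constants, values, and class here are hypothetical, not the actual Ozone types):

```java
public class SchemaVersionFork {
    // Hypothetical values: records written before the upgrade carry no
    // schemaVersion field, so a missing field decodes to the 0 default and
    // falls back to the existing inline-parts flow.
    static final int LEGACY_INLINE_PARTS = 0;
    static final int SEPARATE_PARTS_TABLE = 1;

    static String readFlow(int schemaVersion) {
        return schemaVersion >= SEPARATE_PARTS_TABLE
            ? "read parts from the new parts table"
            : "read parts inline from MultipartKeyInfo";
    }

    public static void main(String[] args) {
        System.out.println(readFlow(0)); // older write: existing flow
        System.out.println(readFlow(1)); // upgraded write: new flow
    }
}
```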

Contributor Author


I agree, though, that storing the volume and bucket name in the MultipartKeyInfo might be a better approach. In fact, since we already know that the volume is fixed to /s3v, we could even skip that information, unless we plan on changing this name sometime in the future.
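Under that split, a slimmed-down per-part message could look something like this (a hedged sketch only; field names and numbers are illustrative, not the actual proposal, with shared fields like volume, bucket, key, encryption info, and the object checksum kept on the metadata object):

```protobuf
message MultipartPartInfo {
  required uint32 partNumber = 1;
  required uint64 dataSize = 2;
  required uint64 modificationTime = 3;
  repeated KeyLocationList keyLocationList = 4; // blocks backing this part
  optional string partChecksum = 5;             // per-part checksum, per S3 semantics
}
```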

```protobuf
required string keyName = 5;
required uint64 dataSize = 6;
required uint64 modificationTime = 7;
repeated KeyLocationList keyLocationList = 8;
```
Contributor


Currently we map each part to a single block. Is this repeated to potentially support multiple blocks per part in the future?

Contributor Author


What about larger part sizes?
I think AWS supports a minimum part size of 5 MB, but it goes up to a maximum of 5 GB per part.

Ref: Q/A answered here: https://repost.aws/questions/QU1c1UQ6LuTceus0VCxGmntg/upload-large-files-to-s3-via-cli

> The maximum object size that can be uploaded in a single PUT operation is 5GB.

Would it still maintain a single block map in this case? I am not sure on this.

Contributor Author


I just checked this by uploading a 2 GiB part locally.
Inspecting the table with the ldb command, here is a snippet of the block information:

more blocks above ...

```json
{
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 1,
      "localID" : 117883640217600004
    },
    "blockCommitSequenceId" : 627
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}, {
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 2,
      "localID" : 117883640217600005
    },
    "blockCommitSequenceId" : 787
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}
```

...

So a part is written to multiple blocks when it exceeds the configured block size.
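The block lengths in the dump (268435456 bytes = 256 MiB) line up with simple ceiling division; a quick sketch (the class and method names are made up):

```java
public class BlocksPerPart {
    // How many blocks a part of the given size occupies at a fixed block size.
    static long blockCount(long partBytes, long blockBytes) {
        return (partBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long twoGib = 2L * 1024 * 1024 * 1024;
        long blockSize = 268435456L;  // 256 MiB, matching the "length" fields above
        System.out.println(blockCount(twoGib, blockSize)); // 8 blocks for the 2 GiB part
    }
}
```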

Comment on lines +137 to +138
* Prefix scan for all parts in one upload uses:
* `uploadId(UTF-8 bytes)` + `0x00`
Contributor

@errose28 errose28 Feb 23, 2026


Just to clarify, only the ordering of the single metadata entry needs to match the order defined by S3 for list multipart upload requests, right? Are we free to order the parts internally as needed?

* `uploadId` (`String`)
* `partNumber` (`int32`)
* Persisted key bytes are encoded as:
* `uploadId(UTF-8 bytes)` + `0x00` + `partNumber(4-byte big-endian int)`
Contributor


Is there any advantage to this byte encoding scheme vs a simple string with separator like "uploadID/partnumber" which we use elsewhere? String keys are much easier to debug.

Contributor Author


Yes, this comment by @ivandika3 gives more insight into why using a byte encoding is better.
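To make the trade-off concrete, here is a toy sketch of the proposed layout (the class and method names are made up, and Ozone's actual codec may differ). Under a bytewise comparator, string keys sort part 10 before part 2, while fixed-width big-endian part numbers keep numeric order:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartKeyCodec {
    // uploadId(UTF-8 bytes) + 0x00 + partNumber as a 4-byte big-endian int,
    // matching the encoding quoted above.
    static byte[] encode(String uploadId, int partNumber) {
        byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(id.length + 1 + Integer.BYTES);
        buf.put(id).put((byte) 0x00).putInt(partNumber); // putInt is big-endian by default
        return buf.array();
    }

    // Prefix for scanning all parts of one upload: uploadId bytes + 0x00.
    static byte[] scanPrefix(String uploadId) {
        byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(id, id.length + 1); // copyOf zero-fills the extra 0x00 byte
    }

    public static void main(String[] args) {
        // String keys with a separator mis-order numerically: "10" sorts before "2".
        System.out.println("upload/10".compareTo("upload/2") < 0); // true

        // The binary keys compare numerically under an unsigned bytewise
        // comparator (RocksDB's default ordering).
        System.out.println(
            Arrays.compareUnsigned(encode("upload", 2), encode("upload", 10)) < 0); // true
    }
}
```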

* `uploadId(UTF-8 bytes)` + `0x00`

```protobuf
message MultipartKeyInfo {
```
Contributor


This should be the same as the current MultipartKeyInfo, but we should label the now unused fields as deprecated instead of removing them.

* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag.
* **Approach-2:** Dedicated metadata table, cleanest separation, requires broader refactor.

#### Pros and Cons
Contributor


This is sort of implied here, but for me the biggest reason the existing table works best for metadata is migration. Finalization can happen between keys being written and committed, so the OM would always have to check both tables, even after finalizing. The proposed approach of always reading from the existing table and forking on a schema version handles this much better.
