HDDS-10611. Design document for MPU GC Optimization#9793

Open
devabhishekpal wants to merge 5 commits into apache:master from devabhishekpal:HDDS-10611-design
Conversation

@devabhishekpal
Contributor

What changes were proposed in this pull request?

HDDS-10611. Design document for MPU GC Optimization

Please describe your PR in detail:
This PR adds the design document for reducing the GC pressure that MPU file handling puts on the OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10611

How was this patch tested?

N/A

Contributor

@ivandika3 ivandika3 left a comment


Thanks @devabhishekpal for the design on this long-overdue issue. I am +1 on the overall direction. Left some comments.

@ivandika3 ivandika3 requested a review from szetszwo February 21, 2026 09:24
@devabhishekpal
Contributor Author

Thanks for the exhaustive review and inputs @ivandika3.
I have updated the document with the new details. Please let me know if I have misunderstood anything or if anything else could be improved.

FYI, I have a sample/PoC patch created if anybody wants to check the changes.
master...devabhishekpal:ozone:HDDS-10611

Contributor

@errose28 errose28 left a comment


Thanks for the design @devabhishekpal @rakeshadr. Overall LGTM, just a few things we can clarify in the doc.

* Value = full `OmMultipartKeyInfo` with all parts inline.

**Implications:**
1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
Contributor


It's worth noting that this is how the regular open key write works as well, and this alone isn't what makes MPU worse than other writes. The overhead increases with the number of blocks. Since MPU defaults to 5 MB parts, it is roughly 50x more expensive in this area compared to our regular key write, which uses 256 MB blocks.
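The arithmetic behind that 50x figure can be sanity-checked with a quick sketch (a toy illustration; the class and method names are made up, and the sizes are the ones quoted above):

```java
public class MpuOverheadSketch {
    // Sizes quoted in the comment above.
    static final long MPU_PART_BYTES = 5L * 1024 * 1024;         // default MPU part size, 5 MB
    static final long REGULAR_BLOCK_BYTES = 256L * 1024 * 1024;  // regular key block size, 256 MB

    // Number of commit operations (each a full read-modify-write of the key
    // metadata) needed to write an object of the given size.
    static long commits(long objectBytes, long unitBytes) {
        return (objectBytes + unitBytes - 1) / unitBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long oneGib = 1024L * 1024 * 1024;
        System.out.println(commits(oneGib, MPU_PART_BYTES));      // 205 commits via MPU parts
        System.out.println(commits(oneGib, REGULAR_BLOCK_BYTES)); // 4 commits via regular blocks
        // ~51x more metadata read-modify-write cycles, which the comment rounds to 50x.
    }
}
```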

* **Metadata table:** Lightweight per-MPU metadata (no part list).
* **Parts table:** One row per part (flat structure).

**New MultipartPartInfo Structure:**
Contributor


My understanding is that we are going to keep MultipartKeyInfo and the MultipartInfoTable, but deprecate some fields in MultipartKeyInfo while potentially adding new ones. I think defining the MultipartKeyInfo structure for the new flow first is important to set up context for the individual part structure. For example, it seems like we don't need to duplicate volume, bucket, key, metadata, encryption info, and file checksum for each part; that should be contained in the metadata object.

Contributor Author


We are not deprecating existing fields from MultipartKeyInfo because that might introduce compatibility issues with older clients.
Instead, we are only appending one new field, schemaVersion, to the object to determine what type of write is happening: an upgraded write, or an older write using the existing flow.

I also checked the S3 Multipart specification. It seems the S3 client does preserve checksums for the parts as well as the object as a whole.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpuchecksums

> When using CRC-64/NVME, Amazon S3 calculates the checksum of the full object after the multipart or single part upload is complete.

and

> For individual parts, you can use GetObject or HeadObject. If you want to retrieve the checksum values for individual parts of multipart uploads while they're still in process, you can use ListParts.
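A minimal sketch of how the read path could fork on the appended schemaVersion field (the constants, values, and class here are hypothetical, not the actual Ozone types):

```java
public class SchemaVersionFork {
    // Hypothetical values: records written before the upgrade carry no
    // schemaVersion field, so a missing field decodes to the 0 default and
    // falls back to the existing inline-parts flow.
    static final int LEGACY_INLINE_PARTS = 0;
    static final int SEPARATE_PARTS_TABLE = 1;

    static String readFlow(int schemaVersion) {
        return schemaVersion >= SEPARATE_PARTS_TABLE
            ? "read parts from the new parts table"
            : "read parts inline from MultipartKeyInfo";
    }

    public static void main(String[] args) {
        System.out.println(readFlow(0)); // older write: existing flow
        System.out.println(readFlow(1)); // upgraded write: new flow
    }
}
```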

Contributor Author


I agree, though, that storing the volume and bucket name in the MultipartKeyInfo might be a better approach. In fact, since we already know that the volume is fixed to /s3v, we could even skip that information, unless we plan on changing this name sometime in the future.
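Under that split, a slimmed-down per-part message could look something like this (a hedged sketch only; field names and numbers are illustrative, not the actual proposal, with shared fields like volume, bucket, key, encryption info, and the object checksum kept on the metadata object):

```protobuf
message MultipartPartInfo {
  required uint32 partNumber = 1;
  required uint64 dataSize = 2;
  required uint64 modificationTime = 3;
  repeated KeyLocationList keyLocationList = 4; // blocks backing this part
  optional string partChecksum = 5;             // per-part checksum, per S3 semantics
}
```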

```protobuf
required string keyName = 5;
required uint64 dataSize = 6;
required uint64 modificationTime = 7;
repeated KeyLocationList keyLocationList = 8;
```
Contributor


Currently we map each part to a single block. Is this repeated to potentially support multiple blocks per part in the future?

Contributor Author


What about larger part sizes?
I think AWS supports a minimum part size of 5 MB, but it goes up to a maximum of 5 GB per part.

Ref: Q/A answered here: https://repost.aws/questions/QU1c1UQ6LuTceus0VCxGmntg/upload-large-files-to-s3-via-cli

> The maximum object size that can be uploaded in a single PUT operation is 5GB.

Would it still maintain a single block map in this case? I am not sure on this.

Contributor Author


I just checked this by uploading a 2 GiB part locally.
Inspecting the table with the ldb command, here is a snippet of the block information:

more blocks above ...

```json
{
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 1,
      "localID" : 117883640217600004
    },
    "blockCommitSequenceId" : 627
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}, {
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 2,
      "localID" : 117883640217600005
    },
    "blockCommitSequenceId" : 787
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}
```

...

So a part is written to multiple blocks when it exceeds the configured block size.
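The block lengths in the dump (268435456 bytes = 256 MiB) line up with simple ceiling division; a quick sketch (the class and method names are made up):

```java
public class BlocksPerPart {
    // How many blocks a part of the given size occupies at a fixed block size.
    static long blockCount(long partBytes, long blockBytes) {
        return (partBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long twoGib = 2L * 1024 * 1024 * 1024;
        long blockSize = 268435456L;  // 256 MiB, matching the "length" fields above
        System.out.println(blockCount(twoGib, blockSize)); // 8 blocks for the 2 GiB part
    }
}
```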

Comment on lines +137 to +138
* Prefix scan for all parts in one upload uses:
* `uploadId(UTF-8 bytes)` + `0x00`
Contributor

@errose28 errose28 Feb 23, 2026


Just to clarify, only the ordering of the single metadata entry needs to match the order defined by S3 for list multipart upload requests, right? Are we free to order the parts internally as needed?

* `uploadId` (`String`)
* `partNumber` (`int32`)
* Persisted key bytes are encoded as:
* `uploadId(UTF-8 bytes)` + `0x00` + `partNumber(4-byte big-endian int)`
Contributor


Is there any advantage to this byte encoding scheme vs a simple string with separator like "uploadID/partnumber" which we use elsewhere? String keys are much easier to debug.

Contributor Author


Yes, this comment by @ivandika3 gives more insight into why using a byte encoding is better.
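To make the trade-off concrete, here is a toy sketch of the proposed layout (the class and method names are made up, and Ozone's actual codec may differ). Under a bytewise comparator, string keys sort part 10 before part 2, while fixed-width big-endian part numbers keep numeric order:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartKeyCodec {
    // uploadId(UTF-8 bytes) + 0x00 + partNumber as a 4-byte big-endian int,
    // matching the encoding quoted above.
    static byte[] encode(String uploadId, int partNumber) {
        byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(id.length + 1 + Integer.BYTES);
        buf.put(id).put((byte) 0x00).putInt(partNumber); // putInt is big-endian by default
        return buf.array();
    }

    // Prefix for scanning all parts of one upload: uploadId bytes + 0x00.
    static byte[] scanPrefix(String uploadId) {
        byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(id, id.length + 1); // copyOf zero-fills the extra 0x00 byte
    }

    public static void main(String[] args) {
        // String keys with a separator mis-order numerically: "10" sorts before "2".
        System.out.println("upload/10".compareTo("upload/2") < 0); // true

        // The binary keys compare numerically under an unsigned bytewise
        // comparator (RocksDB's default ordering).
        System.out.println(
            Arrays.compareUnsigned(encode("upload", 2), encode("upload", 10)) < 0); // true
    }
}
```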

* `uploadId(UTF-8 bytes)` + `0x00`

```protobuf
message MultipartKeyInfo {
```
Contributor


This should be the same as the current MultipartKeyInfo, but we should label the now unused fields as deprecated instead of removing them.

* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag.
* **Approach-2:** Dedicated metadata table, cleanest separation, requires broader refactor.

#### Pros and Cons
Contributor


This is sort of implied here, but for me the biggest reason the existing table works best for metadata is migration. Finalization can happen between keys being written and committed, so the OM would always have to check both tables, even after finalizing. The proposed approach of always reading from the existing table and forking on a schema version handles this much better.
