HDDS-10611. Design document for MPU GC Optimization #9793
devabhishekpal wants to merge 5 commits into apache:master
Conversation
Thanks @devabhishekpal for the design on this long overdue issue. I am +1 on the overall direction. Left some comments.
Thanks for the exhaustive review and inputs @ivandika3. FYI, I have a sample/PoC patch created if anybody wants to check the changes.
errose28
left a comment
Thanks for the design @devabhishekpal @rakeshadr. Overall LGTM, just a few things we can clarify in the doc.
> * Value = full `OmMultipartKeyInfo` with all parts inline.
>
> **Implications:**
> 1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
It's worth noting that this is how the regular open key write works as well, and alone isn't what causes MPU to be worse than other writes. The overhead increases with the number of blocks. Since MPU defaults to 5 MB parts, it is roughly 50x more expensive in this area compared to our regular key write, which uses 256 MB blocks.
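To make that ratio concrete, a quick back-of-the-envelope sketch (Java only for illustration; the 5 MB and 256 MB figures are the defaults mentioned above):

```java
public class MpuOverheadEstimate {
    public static void main(String[] args) {
        // Default sizes referenced in the comment above.
        long regularBlockSize = 256L * 1024 * 1024; // regular key write: 256 MB blocks
        long mpuPartSize = 5L * 1024 * 1024;        // MPU default: 5 MB parts
        // For the same amount of data, MPU performs roughly this many more
        // read-modify-write cycles on OmMultipartKeyInfo than a regular write.
        long ratio = regularBlockSize / mpuPartSize;
        System.out.println(ratio + "x more metadata rewrites"); // 51x, i.e. roughly 50x
    }
}
```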
> * **Metadata table:** Lightweight per-MPU metadata (no part list).
> * **Parts table:** One row per part (flat structure).
>
> **New MultipartPartInfo Structure:**
My understanding is that we are going to keep MultipartKeyInfo and the MultipartInfoTable, but deprecate some fields in MultipartKeyInfo while potentially adding new ones. I think defining the MultipartKeyInfo structure for the new flow first is important to set up context for the individual part structure. For example, it seems like we don't need to duplicate volume, bucket, key, metadata, encryption info, and file checksum for each part, that should be contained in the metadata object.
We are not deprecating existing fields from MultipartKeyInfo because that might introduce compatibility issues with older clients.
Instead, we are only appending one new field, `schemaVersion`, to the object to determine which type of write is happening, i.e. an upgraded write or an older write using the existing flow.
I also checked the S3 Multipart specification. It seems the S3 client does preserve checksums for the parts as well as the object as a whole.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html#mpuchecksums
> When using CRC-64/NVME, Amazon S3 calculates the checksum of the full object after the multipart or single part upload is complete.

and

> For individual parts, you can use GetObject or HeadObject. If you want to retrieve the checksum values for individual parts of multipart uploads while they're still in process, you can use ListParts.
Though I agree that storing the volume and bucket name in the MultipartKeyInfo might be a better approach. In fact, since we already know the volume is fixed to `/s3v`, we could even skip that information, unless we plan on changing this name sometime in the future.
> required string keyName = 5;
> required uint64 dataSize = 6;
> required uint64 modificationTime = 7;
> repeated KeyLocationList keyLocationList = 8;
Currently we map each part to a single block. Is this repeated to potentially support multiple blocks per part in the future?
What about larger part sizes?
I think AWS supports a minimum part size of 5 MB, but it goes up to 5 GB per part.
Ref Q/A answered here: https://repost.aws/questions/QU1c1UQ6LuTceus0VCxGmntg/upload-large-files-to-s3-via-cli

> The maximum object size that can be uploaded in a single PUT operation is 5GB.

Would it still maintain a single block map in this case? I am not sure about this.
I just checked on this by trying to upload a 2 GiB part locally.
Upon inspecting the table using the `ldb` command, here is a snippet of the block information:
more blocks above...

```json
{
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 1,
      "localID" : 117883640217600004
    },
    "blockCommitSequenceId" : 627
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}, {
  "blockID" : {
    "containerBlockID" : {
      "containerID" : 2,
      "localID" : 117883640217600005
    },
    "blockCommitSequenceId" : 787
  },
  "length" : 268435456,
  "offset" : 0,
  "createVersion" : 0,
  "partNumber" : 0,
  "underConstruction" : false
}
```

...
So this does write to multiple blocks when the part size exceeds the configured block size.
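The fan-out observed above (each block of length 268435456, i.e. 256 MB) matches a simple ceiling division, assuming a 256 MB configured block size (sketch only, not from the design doc):

```java
public class PartBlockCount {
    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024;     // configured block size (assumed default)
        long partSize = 2L * 1024 * 1024 * 1024; // the 2 GiB part uploaded above
        // Ceiling division: number of blocks the part is split across.
        long blocks = (partSize + blockSize - 1) / blockSize;
        System.out.println(blocks); // 8
    }
}
```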
> * Prefix scan for all parts in one upload uses:
>   * `uploadId(UTF-8 bytes)` + `0x00`
Just to clarify, only the ordering of the single metadata entry needs to match the order defined by S3 for list multipart upload requests, right? Are we free to order the parts internally as needed?
> * `uploadId` (`String`)
> * `partNumber` (`int32`)
> * Persisted key bytes are encoded as:
>   * `uploadId(UTF-8 bytes)` + `0x00` + `partNumber(4-byte big-endian int)`
Is there any advantage to this byte encoding scheme vs a simple string with separator like "uploadID/partnumber" which we use elsewhere? String keys are much easier to debug.
Yes, this comment by @ivandika3 gives more insight into why using a byte encoding is better.
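A minimal sketch of the proposed encoding and why it sorts correctly (class and method names here are hypothetical, not from the patch): a 4-byte big-endian part number keeps byte-wise key ordering aligned with numeric part order, whereas a decimal-string suffix would sort "10" before "2".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public final class PartKeyCodec {
    // uploadId(UTF-8 bytes) + 0x00 + partNumber(4-byte big-endian int)
    public static byte[] encode(String uploadId, int partNumber) {
        byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + Integer.BYTES)
                .put(id)
                .put((byte) 0x00)    // separator; also terminates the prefix-scan prefix
                .putInt(partNumber)  // ByteBuffer defaults to big-endian byte order
                .array();
    }

    public static void main(String[] args) {
        // Big-endian ints preserve numeric order under unsigned byte-wise comparison.
        byte[] part2 = encode("upload-1", 2);
        byte[] part10 = encode("upload-1", 10);
        System.out.println(Arrays.compareUnsigned(part2, part10) < 0); // true
    }
}
```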
> * `uploadId(UTF-8 bytes)` + `0x00`
>
> ```protobuf
> message MultipartKeyInfo {
> ```
This should be the same as the current MultipartKeyInfo, but we should label the now unused fields as deprecated instead of removing them.
> * **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag.
> * **Approach-2:** Dedicated metadata table, cleanest separation, requires broader refactor.
>
> #### Pros and Cons
This is sort of implied here, but for me the biggest reason the existing table works best for metadata is migration. Finalization can happen in between keys being written and committed, so the OM would always have to check both tables, even after finalizing. The proposed approach of always reading from the existing table and forking on a schema version handles this much better.
What changes were proposed in this pull request?
HDDS-10611. Design document for MPU GC Optimization
Please describe your PR in detail:
This PR adds the design doc for reducing the OM GC pressure caused by MPU file handling.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10611
How was this patch tested?
N/A