-
Notifications
You must be signed in to change notification settings - Fork 432
feat(storage): implement multi stream manager for async downloads #15823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(storage): implement multi stream manager for async downloads #15823
Conversation
Summary of ChangesHello @shubham-up-47, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the asynchronous object download capabilities of the Google Cloud Storage C++ client. It introduces a dedicated Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a MultiStreamManager for handling multiple concurrent streams in asynchronous downloads, a significant enhancement. The refactoring of ObjectDescriptorImpl to utilize this manager is substantial and well-executed, incorporating proactive stream creation and the reuse of idle streams to boost performance. The accompanying tests are thorough, covering a good range of stream management scenarios.
I've identified a few areas for improvement. There is a code structure issue with the definition of ReadStream that creates a layering violation and should be addressed by moving it to its own header. Additionally, some temporary developer comments and FIXMEs remain in the code and tests that need to be cleaned up. Some documentation for public methods in ObjectDescriptorImpl was also removed and should be restored. Addressing these points will improve the overall quality and maintainability of the code.
google/cloud/storage/internal/async/object_descriptor_impl_test.cc
Outdated
Show resolved
Hide resolved
google/cloud/storage/internal/async/object_descriptor_impl_test.cc
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #15823 +/- ##
==========================================
+ Coverage 92.95% 92.96% +0.01%
==========================================
Files 2458 2460 +2
Lines 227589 228210 +621
==========================================
+ Hits 211553 212155 +602
- Misses 16036 16055 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/gcbrun |
| if (it->active_ranges.size() < min_ranges) { | ||
| best_it = it; | ||
| min_ranges = it->active_ranges.size(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the active_ranges.size() value is 1 but range itself is 10GB, and in second case active_ranges.size() value is 5 but all the ranges are under 1 MB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we cannot predict how fast a large file will download, counting the active requests is the standard and most reliable way to balance the load.
| void Cancel() override { | ||
| if (stream_) stream_->Cancel(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a private: missing here or are these member variables intended to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have switched ReadStream to a struct and made its fields public (no trailing underscores) per our style guide for structs. They stay public because MultiStreamManager mutates them directly on the hot path.
91abbce
This PR implements scalable multi-stream reading for Zonal Buckets by introducing a MultiStreamManager to pool and reuse gRPC connections. It mitigates performance regressions from frequent stream creation by reusing idle streams and load-balancing requests, while maintaining a proactive background connection to minimize latency during ramp-up.
Class UML diagrams of the final architecture: Link
Benchmarking script used: Link
Bechmarking the changes by reading 500 GB data (thread count = 1),
Bechmarking the changes by reading 500 GB data (thread count = 30),
Benchmarking by changing the MaximumRangeSizeOption values,
Code accurate Control flow diagram for this snippet,