Fix: Stream Json Content #3099
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes: Hello @nathanmascitelli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly optimizes the process of sending large JSON payloads, particularly to services like BigQuery, by refactoring the serialization mechanism. It addresses high memory consumption caused by intermediate string allocations on the large object heap. The core change involves implementing a custom JsonStreamContent (an HttpContent implementation).
Code Review
This pull request introduces a significant performance improvement for uploading large JSON payloads by streaming the serialization process directly to the network stream. This avoids allocating large strings on the heap, addressing the memory pressure issue. The introduction of JsonStreamContent is a clean way to encapsulate this logic. The changes to the ISerializer interface are non-breaking and well-implemented. My review includes one critical fix for the new JsonStreamContent class to ensure gzipped content is correctly formatted.
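As an illustration of the approach the review describes, a custom HttpContent in the spirit of JsonStreamContent might look roughly like the sketch below. The class name and details here are assumptions, not the PR's actual code; the idea is simply to let Newtonsoft.Json write directly into the request stream instead of building an intermediate string.

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

// Illustrative sketch only: an HttpContent that serializes the body straight
// into the request stream, so no full JSON string is ever allocated.
public class StreamingJsonContent : HttpContent
{
    private readonly object _value;
    private readonly JsonSerializer _serializer;

    public StreamingJsonContent(object value, JsonSerializer serializer = null)
    {
        _value = value;
        _serializer = serializer ?? JsonSerializer.CreateDefault();
        Headers.ContentType = new MediaTypeHeaderValue("application/json") { CharSet = "utf-8" };
    }

    protected override Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        // leaveOpen: true because HttpClient owns and disposes the underlying stream.
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false), 1024, leaveOpen: true))
        using (var jsonWriter = new JsonTextWriter(writer))
        {
            _serializer.Serialize(jsonWriter, _value);
            jsonWriter.Flush();
        }
        return Task.CompletedTask;
    }

    // The length is unknown up front, so the request is sent chunked.
    protected override bool TryComputeLength(out long length)
    {
        length = -1;
        return false;
    }
}
```

A production version would also need to cooperate with gzip compression (the issue flagged in the review above) and would ideally write asynchronously.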
Force-pushed from 845490e to 488e844
amanda-tarafa left a comment
Aside from the breaking change, the code this PR is changing is used by all the Discovery based .NET Client Libraries, so we would have to confirm and test that this change works for 400+ APIs. It's unlikely that we can merge this change as is (even without the breaking change) as it changes long-established default behaviour.
In principle, you should be able to implement your own custom serializer and configure your services to use that. Did you try that? (A sketch of this approach follows below.)
In addition, for working with large amounts of data, the recommendation is to use the BigQuery Storage API with the Google.Cloud.BigQuery.Storage.V1 library.
I'll close this PR now, but feel free to create an issue for further discussion. It'd be best if you can include the original problem statement as well as a link to this PR there for easier discoverability.
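As a rough sketch of the custom-serializer suggestion above: a custom ISerializer can be plugged into a Discovery-based service through the initializer. The ISerializer member list and the Serializer property on BaseClientService.Initializer should be verified against the Google.Apis.Core version in use; this is illustrative, not authoritative.

```csharp
using System;
using System.IO;
using Google.Apis;
using Google.Apis.Bigquery.v2;
using Google.Apis.Json;
using Google.Apis.Services;

// Illustrative only: a delegating ISerializer that wraps the default
// NewtonsoftJsonSerializer. A real implementation could change how the
// stream-based Serialize overload behaves.
public class CustomSerializer : ISerializer
{
    private static readonly NewtonsoftJsonSerializer Inner = NewtonsoftJsonSerializer.Instance;

    public string Format => Inner.Format;
    public void Serialize(object obj, Stream target) => Inner.Serialize(obj, target);
    public string Serialize(object obj) => Inner.Serialize(obj);
    public T Deserialize<T>(string input) => Inner.Deserialize<T>(input);
    public object Deserialize(string input, Type type) => Inner.Deserialize(input, type);
    public T Deserialize<T>(Stream input) => Inner.Deserialize<T>(input);
}

public static class ServiceSetup
{
    // "my-app" is a placeholder application name.
    public static BigqueryService Create() => new BigqueryService(new BaseClientService.Initializer
    {
        Serializer = new CustomSerializer(),
        ApplicationName = "my-app",
    });
}
```

As noted further down in the thread, this alone does not remove the intermediate string if the request pipeline calls the string-returning Serialize(object) overload.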
@amanda-tarafa I left a comment on the diff but I don't understand how an optional parameter is a breaking change and would appreciate some more info.
I don't think this fixes the problem, as the code in my initial post shows that it still needs to create a string in memory before sending it over the network, causing the GC to have to clean that string up later. If I'm misunderstanding what you're suggesting could you please give me an example?
I can look at this but to be clear we are inserting one row that is just very wide and so the string that is created ends up being large. Is the Storage API still a good idea for inserting single rows?
Adding an optional parameter is a binary breaking change.
Yes, you are right, I missed this.
But unless you are inserting very wide rows many times over, even a very wide row shouldn't cause memory issues? And if you are inserting the wide rows many times over, then maybe BigQuery Storage is indeed an option. Don't get me wrong, I agree allocating the very wide string is not ideal, but these libraries are not optimized for "very big requests" as it's not a common case.
If BigQuery Storage is not an option, then maybe consider uploading the row via a job that uses a media upload? You can see how we've used that on Google.Cloud.BigQuery.V2. Also, I'm not sure which you are using, but note that Google.Cloud.BigQuery.V2 wraps and is recommended over Google.Apis.Bigquery.v2.
It continues to be unlikely that we make the change as you are proposing. The threshold for such a wide-reaching change is high, as in something needs to be fundamentally broken, and that doesn't seem to be the case here. Please do create an issue for further discussion. Discussions on closed PRs are not as easily discovered.
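A hedged sketch of the load-job route mentioned above, using Google.Cloud.BigQuery.V2. The project, dataset, and table names are placeholders, and the exact UploadJson overloads should be checked against the current library surface.

```csharp
using Google.Cloud.BigQuery.V2;
using Newtonsoft.Json;

public static class LoadJobExample
{
    // Inserts a single (very wide) row via a load job instead of a streaming insert.
    // "my-project", "my_dataset" and "wide_table" are placeholders.
    public static void InsertWideRow(object wideRow)
    {
        BigQueryClient client = BigQueryClient.Create("my-project");

        // Each element is one newline-delimited JSON row.
        string[] rows = { JsonConvert.SerializeObject(wideRow) };

        // Passing null for the schema assumes the destination table already
        // exists with a compatible schema.
        BigQueryJob job = client.UploadJson("my_dataset", "wide_table", null, rows);

        // Load jobs run server-side; wait for completion and surface any errors.
        job.PollUntilCompleted().ThrowOnAnyError();
    }
}
```

If building even the per-row JSON string is a concern, the Stream-based UploadJson overload can be fed serialized bytes instead.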
Thanks @amanda-tarafa. Let me take a look at the Storage API as you suggested and if it doesn't solve the problem I'll open an issue with a reproduction of what's causing the allocations I'm seeing and we can continue to discuss there. We are using Google.Cloud.BigQuery.V2, the allocations just come from Google.Apis.Bigquery.v2. Thanks again for taking the time to review and explain.
Just to be clear, I was proposing you look into the BigQuery Storage API. The (plain) Storage API serves different purposes altogether.
PR description (@nathanmascitelli):
When inserting a large amount of data into BigQuery, it was noticed that a large number of strings on the large object heap were rooted in Google.Apis.Requests.HttpRequestMessageExtenstions. This is because HttpRequestMessageExtenstions serializes the object being sent to BQ into a string before sending it over the network. These allocations can be removed if the object is serialized and its bytes are pushed to the network stream as the serialization is done, which I've done in a custom implementation of HttpContent in this PR.
Please let me know if there are any questions or anything I can do. I've kept the dependency on Newtonsoft.JSON and extended the ISerializer interface in a non-breaking way.
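For context, the allocation pattern the description refers to is roughly the following paraphrase (not the library's actual source):

```csharp
using System.Net.Http;
using System.Text;
using Google.Apis;

// Paraphrase of the pattern described above, not the actual library source:
// the body is serialized into one string (a large-object-heap allocation for
// very wide rows) and only then wrapped in HttpContent.
public static class StringBodySketch
{
    public static void AttachJsonBody(HttpRequestMessage request, ISerializer serializer, object body)
    {
        string json = serializer.Serialize(body); // entire payload materialized as one string
        request.Content = new StringContent(json, Encoding.UTF8, "application/json");
    }
}
```

The streaming HttpContent sketched earlier in the thread avoids materializing the JSON string by writing serialized bytes to the stream as they are produced.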