
Conversation


@nathanmascitelli nathanmascitelli commented Dec 30, 2025

When inserting large amounts of data into BigQuery, we noticed that a large number of strings on the large object heap were rooted in Google.Apis.Requests.HttpRequestMessageExtenstions:

[screenshot]

The above happens because, in HttpRequestMessageExtenstions, we serialize the object we are sending to BigQuery into a string before sending it over the network:

[screenshot]

These allocations can be removed by serializing the object and pushing its bytes to the network stream as serialization proceeds. I've done this in a custom implementation of HttpContent in this PR.
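A minimal sketch of the idea, with a MemoryStream standing in for the real network stream. The PR itself keeps Newtonsoft.Json and subclasses HttpContent; this sketch uses System.Text.Json purely to stay self-contained, and the payload shape is illustrative only:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.Json;

// Stand-in for the network stream that the custom HttpContent writes to.
using var network = new MemoryStream();

var payload = new { kind = "bigquery#tableDataInsertAllRequest", rows = new[] { new { json = new { a = 1 } } } };

// Serialize straight into a GZip-wrapped stream: the request body never
// exists as one large string, so nothing lands on the large object heap.
using (var gzip = new GZipStream(network, CompressionMode.Compress, leaveOpen: true))
{
    await JsonSerializer.SerializeAsync(gzip, payload);
}

// Round-trip to show the bytes are valid gzipped JSON.
network.Position = 0;
using var inflate = new GZipStream(network, CompressionMode.Decompress);
using var reader = new StreamReader(inflate, Encoding.UTF8);
string json = reader.ReadToEnd();
Console.WriteLine(json.Contains("\"rows\"")); // prints True
```

Because the GZipStream is created with `leaveOpen: true`, disposing it flushes the gzip footer without closing the underlying stream, which is what a real HttpContent.SerializeToStreamAsync implementation would need.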

Please let me know if there are any questions or anything I can do. I've kept the dependency on Newtonsoft.Json and extended the ISerializer interface in a non-breaking way.

@nathanmascitelli nathanmascitelli requested a review from a team as a code owner December 30, 2025 22:10

google-cla bot commented Dec 30, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


Summary of Changes

Hello @nathanmascitelli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the process of sending large JSON payloads, particularly to services like BigQuery, by refactoring the serialization mechanism. It addresses high memory consumption caused by intermediate string allocations on the large object heap. The core change involves implementing a custom HttpContent that streams serialized JSON directly to the network, optionally applying GZip compression on the fly, thus improving performance and reducing memory footprint.

Highlights

  • New JsonStreamContent Class: Introduced a new JsonStreamContent class that inherits from HttpContent to enable direct streaming of JSON serialized objects to the network, bypassing intermediate string allocations.
  • Optimized Serialization to Stream: Modified the ISerializer interface and its NewtonsoftJsonSerializer implementation to include a leaveOpen parameter, allowing the underlying stream to remain open after serialization, which is crucial for the new streaming approach.
  • Reduced Memory Allocations: Refactored HttpRequestMessageExtenstions to utilize the new JsonStreamContent, eliminating the previous pattern of serializing objects into large strings before sending, thereby significantly reducing large object heap allocations, especially for large BigQuery data insertions.
  • Integrated GZip Compression: The new JsonStreamContent class now directly handles GZip compression by wrapping the target stream in a GZipStream when enabled, further optimizing data transfer without intermediate buffers.
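The leaveOpen parameter described above follows the standard BCL convention for stream wrappers. A small illustrative sketch, using only StreamWriter rather than the PR's actual ISerializer signature, shows why it matters for the streaming approach:

```csharp
using System;
using System.IO;
using System.Text;

using var target = new MemoryStream();

// Without leaveOpen: true, disposing the writer would also close `target`,
// which is fatal when `target` is the network stream that HttpContent is
// in the middle of writing to.
using (var writer = new StreamWriter(target, new UTF8Encoding(false), bufferSize: 1024, leaveOpen: true))
{
    writer.Write("{\"a\":1}");
}

Console.WriteLine(target.CanWrite); // prints True: the stream survived the writer
```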




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance improvement for uploading large JSON payloads by streaming the serialization process directly to the network stream. This avoids allocating large strings on the heap, addressing the memory pressure issue. The introduction of JsonStreamContent is a clean way to encapsulate this logic. The changes to the ISerializer interface are non-breaking and well-implemented. My review includes one critical fix for the new JsonStreamContent class to ensure gzipped content is correctly formatted.

@nathanmascitelli nathanmascitelli changed the title from "Stream Json Content" to "Fix: Stream Json Content" on Dec 31, 2025
Contributor

@amanda-tarafa amanda-tarafa left a comment


Aside from the breaking change, the code this PR modifies is used by all of the Discovery-based .NET client libraries, so we would have to confirm and test that this change works for 400+ APIs. It's unlikely that we can merge this change as is (even without the breaking change), as it changes long-established default behaviour.

In principle, you should be able to implement your own custom serializer and configure your services to use it. Have you tried that?

In addition, for working with large amounts of data, the recommendation is to use the BigQuery Storage API with the Google.Cloud.BigQuery.Storage.V1 library.

I'll close this PR now, but feel free to create an issue for further discussion. It'd be best if you could include the original problem statement as well as a link to this PR there, for easier discoverability.

@amanda-tarafa amanda-tarafa self-assigned this Jan 7, 2026
@nathanmascitelli (Author) commented Jan 7, 2026

@amanda-tarafa I left a comment on the diff but I don't understand how an optional parameter is a breaking change and would appreciate some more info.

In principle, you should be able to implement your own custom serializer and configure your services to use it. Have you tried that?

I don't think this fixes the problem: as the code in my initial post shows, it still needs to create a string in memory before sending it over the network, leaving the GC to clean that string up later. If I'm misunderstanding what you're suggesting, could you please give me an example?

...the recommendation is to use BigQuery Storage API...

I can look at this, but to be clear, we are inserting one row that is just very wide, so the string that is created ends up being large. Is the Storage API still a good idea for inserting single rows?

@amanda-tarafa (Contributor)

Adding an optional parameter is a binary breaking change.
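For context on why: the C# compiler bakes optional-parameter defaults into each call site, and the parameter itself becomes part of the method's runtime signature. An assembly compiled against the old Serialize(object) overload therefore references a signature that no longer exists once the parameter is added, and fails at runtime with MissingMethodException until it is recompiled. A hedged illustration using the BCL's string.Split (which declares an optional StringSplitOptions parameter) shows that the optional parameter really is part of the signature:

```csharp
using System;

// "a,b".Split(',') looks like a one-argument call, but the declared overload
// is Split(char, StringSplitOptions options = StringSplitOptions.None); the
// compiler supplies the default at the call site. Reflection confirms that no
// one-parameter Split(char) signature exists for old binaries to bind to:
var oneArg = typeof(string).GetMethod("Split", new[] { typeof(char) });
Console.WriteLine(oneArg is null); // prints True
```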

I don't think this fixes the problem ...

Yes you are right, I missed this.

I can look at this but to be clear we are inserting one row that is just very wide...

But unless you are inserting very wide rows many times over, even a very wide row shouldn't cause memory issues. And if you are inserting wide rows many times over, then BigQuery Storage may indeed be an option. Don't get me wrong, I agree that allocating the very wide string is not ideal, but these libraries are not optimized for "very big requests", as that's not a common case. If BigQuery Storage is not an option, then maybe consider uploading the row via a job that uses a media upload? You can see how we've done that in Google.Cloud.BigQuery.V2.

Also, I'm not sure which you are using, but note that Google.Cloud.BigQuery.V2 wraps and is recommended over Google.Apis.Bigquery.v2.

It continues to be unlikely that we'll make the change you are proposing. The threshold for such a wide-reaching change is high: something needs to be fundamentally broken, and that doesn't seem to be the case here.

Please do create an issue for further discussion. Discussions on closed PRs are not as easily discovered.

@nathanmascitelli (Author)

Thanks @amanda-tarafa. Let me take a look at the Storage API as you suggested, and if it doesn't solve the problem I'll open an issue with a reproduction of what's causing the allocations I'm seeing, and we can continue the discussion there.

We are using Google.Cloud.BigQuery.V2; the allocations just come from Google.Apis.Bigquery.v2.

Thanks again for taking the time to review and explain.

@amanda-tarafa (Contributor)

Let me take a look at the Storage API

Just to be clear, I was proposing you look into the BigQuery Storage API. The (plain) Storage API serves different purposes altogether.

