
Conversation

@TrevorBergeron (Contributor)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

product-auto-label bot added labels on Oct 9, 2025: size: l (Pull request size is large), api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API), samples (Issues that are directly related to samples).
product-auto-label bot added the size: xl label (Pull request size is extra large) and removed size: l on Oct 16, 2025.
@TrevorBergeron requested a review from tswast on October 16, 2025 17:45
@TrevorBergeron marked this pull request as ready for review on October 16, 2025 17:45
@TrevorBergeron requested review from a team as code owners on October 16, 2025 17:45
@tswast (Collaborator) left a comment

At a high level, I think this makes a lot of sense. My main worry would be the case of ExecuteResult with local data. If we are keeping these around more, is there a risk of memory leaks?

dfs = iter(map(self._copy_index_to_pandas, dfs))

- total_rows = execute_result.total_rows
+ total_rows = result_batches.approx_total_rows
Collaborator

As we discussed offline, I think it'd be helpful to have cases where we know (or at least have very high confidence) that this count is exact, such as when reading the destination table of a query.

Contributor Author


We do insert the row count directly into the tree when it is known exactly (or at least some of the time we know it exactly; we might be missing some opportunities). At a certain point we intermingle known and approximate counts before surfacing them, though. We could keep them separate if really needed, but that adds a bit more complexity.
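
For illustration, a minimal sketch of the bookkeeping described above, where exact counts get demoted to approximate once they are combined with estimates; the names here (`RowCountEstimate`, `concat`, `filtered`) are hypothetical and not the actual bigframes internals:

```python
from __future__ import annotations

import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class RowCountEstimate:
    """Row count for a plan node, tagged with whether it is known exactly."""

    count: Optional[int]  # None when there is no estimate at all
    exact: bool = False

    def concat(self, other: "RowCountEstimate") -> "RowCountEstimate":
        # Concatenation adds counts; the result is exact only if both inputs are.
        if self.count is None or other.count is None:
            return RowCountEstimate(None, False)
        return RowCountEstimate(self.count + other.count, self.exact and other.exact)

    def filtered(self, selectivity: float) -> "RowCountEstimate":
        # Any filter turns an exact count into an approximation.
        if self.count is None:
            return self
        return RowCountEstimate(int(self.count * selectivity), exact=False)


# An exact count (e.g. from a query's destination table) combined with an
# estimated one surfaces as approximate.
exact_part = RowCountEstimate(1_000, exact=True)
approx_part = RowCountEstimate(250, exact=False)
print(exact_part.concat(approx_part))  # RowCountEstimate(count=1250, exact=False)
```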

@TrevorBergeron (Contributor Author)

> At a high level, I think this makes a lot of sense. My main worry would be the case of ExecuteResult with local data. If we are keeping these around more, is there a risk of memory leaks?

Yeah, this refactor is intended to help with caching, but doesn't actually add any caching behavior for now. So, these data handles will be garbage collected just fine.
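
A small sketch of why the memory concern doesn't bite yet: as long as no cache layer keeps a strong reference to the result handle, dropping the caller's reference is enough for the local data to be collected. `LocalExecuteResult` here is a hypothetical stand-in, not the real ExecuteResult class, and the example assumes pyarrow is installed:

```python
import gc
import weakref

import pyarrow as pa


class LocalExecuteResult:
    """Hypothetical stand-in for a result handle that owns local data."""

    def __init__(self, table: pa.Table):
        self._table = table

    def batches(self):
        return self._table.to_batches()


# No global cache holds a strong reference, so releasing the caller's
# reference lets the handle (and the data it owns) be garbage collected.
result = LocalExecuteResult(pa.table({"a": [1, 2, 3]}))
probe = weakref.ref(result)
del result
gc.collect()
assert probe() is None  # the handle and its local data are gone
```

If caching is added later, a weak-value registry would preserve this behavior while still allowing live results to be reused.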

@TrevorBergeron requested a review from tswast on October 29, 2025 00:13
  for scan_item in node.scan_list.items:
      if (
-         scan_item.dtype == dtypes.JSON_DTYPE
+         node.source.schema.get_type(scan_item.source_id) == dtypes.JSON_DTYPE
Collaborator

I'm curious what the purpose of this extra layer of indirection is compared to the previous? I guess it's partially undoing some of the changes Shenyang and Chelsea made to plumb type info into the tree?

Contributor Author

It's mostly from moving the schema to the source definition rather than the scan definition, which brings it more in line with the local source, where a source object is not just data but also its interpretation (this mostly matters for virtual types).
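
Loosely mirroring the diff above, a hypothetical sketch of what moving the schema onto the source definition looks like; `SourceDefinition`, `SourceSchema`, and `ScanItem` are illustrative names, not the real bigframes classes:

```python
import dataclasses
from typing import Mapping, Tuple


@dataclasses.dataclass(frozen=True)
class SourceSchema:
    # Maps source column ids to dtype names (e.g. "JSON", "INT64").
    types: Mapping[str, str]

    def get_type(self, source_id: str) -> str:
        return self.types[source_id]


@dataclasses.dataclass(frozen=True)
class SourceDefinition:
    """The source carries both where the data lives and how to interpret it."""

    table_id: str
    schema: SourceSchema


@dataclasses.dataclass(frozen=True)
class ScanItem:
    # The scan only records which source column it reads and the output id;
    # the dtype is resolved through the source's schema rather than copied here.
    source_id: str
    output_id: str


def has_json_columns(source: SourceDefinition, scan_items: Tuple[ScanItem, ...]) -> bool:
    return any(
        source.schema.get_type(item.source_id) == "JSON" for item in scan_items
    )


src = SourceDefinition(
    table_id="proj.dataset.table",
    schema=SourceSchema(types={"col_a": "JSON", "col_b": "INT64"}),
)
assert has_json_columns(src, (ScanItem("col_a", "out_a"), ScanItem("col_b", "out_b")))
```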

@tswast (Collaborator) commented on Oct 31, 2025

The e2e failures are "Blob"-related. Probably flakes.

@TrevorBergeron merged commit b23cf83 into main on Oct 31, 2025
24 of 25 checks passed
@TrevorBergeron deleted the new_execute_result branch on October 31, 2025 17:44