Commit 3b2587c

feat: Add read_arrow methods to Session and pandas
Adds a `read_arrow` method to `bigframes.session.Session` and a `bigframes.pandas.read_arrow` function for creating BigQuery DataFrames `DataFrame` objects from PyArrow Tables. The implementation refactors the existing logic in `bigframes.session._io.bigquery.read_gbq_query` that converts Arrow data into BigFrames DataFrames.

Includes:

- New file `bigframes/session/_io/arrow.py` with the core conversion logic.
- `read_arrow(pa.Table) -> bpd.DataFrame` in the `Session` class.
- `read_arrow(pa.Table) -> bpd.DataFrame` in the `pandas` module.
- Unit and system tests for the new functionality.
- Docstrings for the new methods and functions.

Note: Unit tests for direct DataFrame operations (`shape`, `to_pandas`) on the result of `read_arrow` are currently failing because of the complexity of mocking the session and executor for `LocalDataNode` interactions. System tests are recommended for full end-to-end validation.
1 parent 38cc43f commit 3b2587c
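The commit message pins down the new surface as `read_arrow(pa.Table) -> bpd.DataFrame` on both `Session` and the `pandas` module. A minimal usage sketch under that assumption (the table contents are illustrative):

```python
import pyarrow as pa

import bigframes.pandas as bpd

# A small PyArrow Table built client-side.
table = pa.Table.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Module-level entry point added by this commit.
df = bpd.read_arrow(table)

# Session-level equivalent.
session = bpd.get_global_session()
df2 = session.read_arrow(table)
```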

File tree

105 files changed: +5154 -1476 lines changed

CHANGELOG.md

Lines changed: 48 additions & 0 deletions

@@ -4,6 +4,54 @@
 
 [1]: https://pypi.org/project/bigframes/#history
 
+## [2.8.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.7.0...v2.8.0) (2025-06-23)
+
+
+### ⚠ BREAKING CHANGES
+
+* add required param 'engine' to multimodal functions ([#1834](https://github.com/googleapis/python-bigquery-dataframes/issues/1834))
+
+
+### Features
+
+* Add `bpd.options.compute.maximum_result_rows` option to limit client data download ([#1829](https://github.com/googleapis/python-bigquery-dataframes/issues/1829)) ([e22a3f6](https://github.com/googleapis/python-bigquery-dataframes/commit/e22a3f61a02cc1b7a5155556e5a07a1a2fea1d82))
+* Add `bpd.options.display.repr_mode = "anywidget"` to create an interactive display of the results ([#1820](https://github.com/googleapis/python-bigquery-dataframes/issues/1820)) ([be0a3cf](https://github.com/googleapis/python-bigquery-dataframes/commit/be0a3cf7711dadc68d8366ea90b99855773e2a2e))
+* Add DataFrame.ai.forecast() support ([#1828](https://github.com/googleapis/python-bigquery-dataframes/issues/1828)) ([7bc7f36](https://github.com/googleapis/python-bigquery-dataframes/commit/7bc7f36fc20d233f4cf5ed688cc5dcaf100ce4fb))
+* Add describe() method to Series ([#1827](https://github.com/googleapis/python-bigquery-dataframes/issues/1827)) ([a4205f8](https://github.com/googleapis/python-bigquery-dataframes/commit/a4205f882012820c034cb15d73b2768ec4ad3ac8))
+* Add required param 'engine' to multimodal functions ([#1834](https://github.com/googleapis/python-bigquery-dataframes/issues/1834)) ([37666e4](https://github.com/googleapis/python-bigquery-dataframes/commit/37666e4c137d52c28ab13477dfbcc6e92b913334))
+
+
+### Performance Improvements
+
+* Produce simpler sql ([#1836](https://github.com/googleapis/python-bigquery-dataframes/issues/1836)) ([cf9c22a](https://github.com/googleapis/python-bigquery-dataframes/commit/cf9c22a09c4e668a598fa1dad0f6a07b59bc6524))
+
+
+### Documentation
+
+* Add ai.forecast notebook ([#1840](https://github.com/googleapis/python-bigquery-dataframes/issues/1840)) ([2430497](https://github.com/googleapis/python-bigquery-dataframes/commit/24304972fdbdfd12c25c7f4ef5a7b280f334801a))
+
+## [2.7.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.6.0...v2.7.0) (2025-06-16)
+
+
+### Features
+
+* Add bbq.json_query_array and warn bbq.json_extract_array deprecated ([#1811](https://github.com/googleapis/python-bigquery-dataframes/issues/1811)) ([dc9eb27](https://github.com/googleapis/python-bigquery-dataframes/commit/dc9eb27fa75e90c2c95a0619551bf67aea6ef63b))
+* Add bbq.json_value_array and deprecate bbq.json_extract_string_array ([#1818](https://github.com/googleapis/python-bigquery-dataframes/issues/1818)) ([019051e](https://github.com/googleapis/python-bigquery-dataframes/commit/019051e453d81769891aa398475ebd04d1826e81))
+* Add groupby cumcount ([#1798](https://github.com/googleapis/python-bigquery-dataframes/issues/1798)) ([18f43e8](https://github.com/googleapis/python-bigquery-dataframes/commit/18f43e8b58e03a27b021bce07566a3d006ac3679))
+* Support custom build service account in `remote_function` ([#1796](https://github.com/googleapis/python-bigquery-dataframes/issues/1796)) ([e586151](https://github.com/googleapis/python-bigquery-dataframes/commit/e586151df81917b49f702ae496aaacbd02931636))
+
+
+### Bug Fixes
+
+* Correct read_csv behaviours with use_cols, names, index_col ([#1804](https://github.com/googleapis/python-bigquery-dataframes/issues/1804)) ([855031a](https://github.com/googleapis/python-bigquery-dataframes/commit/855031a316a6957731a5d1c5e59dedb9757d9f7a))
+* Fix single row broadcast with null index ([#1803](https://github.com/googleapis/python-bigquery-dataframes/issues/1803)) ([080eb7b](https://github.com/googleapis/python-bigquery-dataframes/commit/080eb7be3cde591e08cad0d5c52c68cc0b25ade8))
+
+
+### Documentation
+
+* Document how to use ai.map() for information extraction ([#1808](https://github.com/googleapis/python-bigquery-dataframes/issues/1808)) ([b586746](https://github.com/googleapis/python-bigquery-dataframes/commit/b5867464a5bf30300dcfc069eda546b11f03146c))
+* Rearrange README.rst to include a short code sample ([#1812](https://github.com/googleapis/python-bigquery-dataframes/issues/1812)) ([f6265db](https://github.com/googleapis/python-bigquery-dataframes/commit/f6265dbb8e22de81bb59c7def175cd325e85c041))
+* Use pandas API instead of pandas-like or pandas-compatible ([#1825](https://github.com/googleapis/python-bigquery-dataframes/issues/1825)) ([aa32369](https://github.com/googleapis/python-bigquery-dataframes/commit/aa323694e161f558bc5e60490c2f21008961e2ca))
+
 ## [2.6.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.5.0...v2.6.0) (2025-06-09)
 
 
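Of the 2.8.0 features above, `Series.describe()` mirrors the pandas API; a minimal sketch (the summary statistics are assumed to follow the pandas convention):

```python
import bigframes.pandas as bpd

s = bpd.Series([1, 2, 3, 4, 5])
# New in 2.8.0: describe() on Series.
print(s.describe().to_pandas())
```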

bigframes/_config/compute_options.py

Lines changed: 39 additions & 30 deletions

@@ -55,29 +55,7 @@ class ComputeOptions:
        {'test2': 'abc', 'test3': False}
 
    Attributes:
-        maximum_bytes_billed (int, Options):
-            Limits the bytes billed for query jobs. Queries that will have
-            bytes billed beyond this limit will fail (without incurring a
-            charge). If unspecified, this will be set to your project default.
-            See `maximum_bytes_billed`: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJobConfig#google_cloud_bigquery_job_QueryJobConfig_maximum_bytes_billed.
-
-        enable_multi_query_execution (bool, Options):
-            If enabled, large queries may be factored into multiple smaller queries
-            in order to avoid generating queries that are too complex for the query
-            engine to handle. However this comes at the cost of increase cost and latency.
-
-        extra_query_labels (Dict[str, Any], Options):
-            Stores additional custom labels for query configuration.
-
-        semantic_ops_confirmation_threshold (int, optional):
-            .. deprecated:: 1.42.0
-                Semantic operators are deprecated. Please use AI operators instead
-
-        semantic_ops_threshold_autofail (bool):
-            .. deprecated:: 1.42.0
-                Semantic operators are deprecated. Please use AI operators instead
-
-        ai_ops_confirmation_threshold (int, optional):
+        ai_ops_confirmation_threshold (int | None):
            Guards against unexpected processing of large amount of rows by semantic operators.
            If the number of rows exceeds the threshold, the user will be asked to confirm
            their operations to resume. The default value is 0. Set the value to None
@@ -87,26 +65,57 @@ class ComputeOptions:
            Guards against unexpected processing of large amount of rows by semantic operators.
            When set to True, the operation automatically fails without asking for user inputs.
 
-        allow_large_results (bool):
+        allow_large_results (bool | None):
            Specifies whether query results can exceed 10 GB. Defaults to False. Setting this
            to False (the default) restricts results to 10 GB for potentially faster execution;
            BigQuery will raise an error if this limit is exceeded. Setting to True removes
            this result size limit.
+
+        enable_multi_query_execution (bool | None):
+            If enabled, large queries may be factored into multiple smaller queries
+            in order to avoid generating queries that are too complex for the query
+            engine to handle. However this comes at the cost of increase cost and latency.
+
+        extra_query_labels (Dict[str, Any] | None):
+            Stores additional custom labels for query configuration.
+
+        maximum_bytes_billed (int | None):
+            Limits the bytes billed for query jobs. Queries that will have
+            bytes billed beyond this limit will fail (without incurring a
+            charge). If unspecified, this will be set to your project default.
+            See `maximum_bytes_billed`: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJobConfig#google_cloud_bigquery_job_QueryJobConfig_maximum_bytes_billed.
+
+        maximum_result_rows (int | None):
+            Limits the number of rows in an execution result. When converting
+            a BigQuery DataFrames object to a pandas DataFrame or Series (e.g.,
+            using ``.to_pandas()``, ``.peek()``, ``.__repr__()``, direct
+            iteration), the data is downloaded from BigQuery to the client
+            machine. This option restricts the number of rows that can be
+            downloaded. If the number of rows to be downloaded exceeds this
+            limit, a ``bigframes.exceptions.MaximumResultRowsExceeded``
+            exception is raised.
+
+        semantic_ops_confirmation_threshold (int | None):
+            .. deprecated:: 1.42.0
+                Semantic operators are deprecated. Please use AI operators instead
+
+        semantic_ops_threshold_autofail (bool):
+            .. deprecated:: 1.42.0
+                Semantic operators are deprecated. Please use AI operators instead
    """
 
-    maximum_bytes_billed: Optional[int] = None
+    ai_ops_confirmation_threshold: Optional[int] = 0
+    ai_ops_threshold_autofail: bool = False
+    allow_large_results: Optional[bool] = None
    enable_multi_query_execution: bool = False
    extra_query_labels: Dict[str, Any] = dataclasses.field(
        default_factory=dict, init=False
    )
+    maximum_bytes_billed: Optional[int] = None
+    maximum_result_rows: Optional[int] = None
    semantic_ops_confirmation_threshold: Optional[int] = 0
    semantic_ops_threshold_autofail = False
 
-    ai_ops_confirmation_threshold: Optional[int] = 0
-    ai_ops_threshold_autofail: bool = False
-
-    allow_large_results: Optional[bool] = None
-
    def assign_extra_query_labels(self, **kwargs: Any) -> None:
        """
        Assigns additional custom labels for query configuration. The method updates the
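A sketch of the new `maximum_result_rows` guard in use; the option and exception names come from the docstring above, while the table and the limit are illustrative:

```python
import bigframes.exceptions
import bigframes.pandas as bpd

# Cap client-side downloads at 1,000 rows.
bpd.options.compute.maximum_result_rows = 1_000

df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")
try:
    local_df = df.to_pandas()  # downloads rows to the client machine
except bigframes.exceptions.MaximumResultRowsExceeded:
    print("Result exceeds the row limit; filter or aggregate first.")
```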

bigframes/_config/display_options.py

Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ class DisplayOptions:
    max_columns: int = 20
    max_rows: int = 25
    progress_bar: Optional[str] = "auto"
-    repr_mode: Literal["head", "deferred"] = "head"
+    repr_mode: Literal["head", "deferred", "anywidget"] = "head"
 
    max_info_columns: int = 100
    max_info_rows: Optional[int] = 200000
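The new `"anywidget"` mode in use; a sketch assuming the optional `anywidget` package is installed and the code runs in a notebook:

```python
import bigframes.pandas as bpd

# Opt in to the interactive, widget-based representation (new in 2.8.0).
bpd.options.display.repr_mode = "anywidget"

df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")
df  # now renders as an interactive widget instead of a static preview
```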

bigframes/core/bigframe_node.py

Lines changed: 20 additions & 3 deletions

@@ -20,9 +20,19 @@
 import functools
 import itertools
 import typing
-from typing import Callable, Dict, Generator, Iterable, Mapping, Sequence, Set, Tuple
-
-from bigframes.core import field, identifiers
+from typing import (
+    Callable,
+    Dict,
+    Generator,
+    Iterable,
+    Mapping,
+    Sequence,
+    Set,
+    Tuple,
+    Union,
+)
+
+from bigframes.core import expression, field, identifiers
 import bigframes.core.schema as schemata
 import bigframes.dtypes
 
@@ -278,6 +288,13 @@ def _dtype_lookup(self) -> dict[identifiers.ColumnId, bigframes.dtypes.Dtype]:
    def field_by_id(self) -> Mapping[identifiers.ColumnId, field.Field]:
        return {field.id: field for field in self.fields}
 
+    @property
+    def _node_expressions(
+        self,
+    ) -> Sequence[Union[expression.Expression, expression.Aggregation]]:
+        """List of scalar expressions. Intended for checking engine compatibility with used ops."""
+        return ()
+
    # Plan algorithms
    def unique_nodes(
        self: BigFrameNode,
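A hypothetical consumer of the new `_node_expressions` hook: an execution engine could walk a plan and vet every scalar expression for op support. `unique_nodes` appears in the surrounding context; the helper below is illustrative, not part of this commit:

```python
def plan_is_supported(root, supported_expression_types) -> bool:
    # Inspect every node's scalar expressions (the default is an empty tuple).
    return all(
        isinstance(expr, supported_expression_types)
        for node in root.unique_nodes()
        for expr in node._node_expressions
    )
```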

bigframes/core/blocks.py

Lines changed: 19 additions & 73 deletions

@@ -36,8 +36,7 @@
 import google.cloud.bigquery as bigquery
 import numpy
 import pandas as pd
-# pyarrow is imported below where needed, but not at top-level if only used for type hints by Session
-# import pyarrow as pa
+import pyarrow as pa
 
 from bigframes import session
 from bigframes._config import sampling_options
@@ -160,89 +159,38 @@ def __init__(
    @classmethod
    def from_local(
        cls,
-        data: Union[pd.DataFrame, pd.Series],
+        data: pd.DataFrame,
        session: bigframes.Session,
        *,
        cache_transpose: bool = True,
    ) -> Block:
-        # Assumes caller has already converted datatypes to bigframes ones where appropriate (e.g. for pandas inputs)
-        index_cols: typing.Sequence[str]
-        value_cols: typing.Sequence[str]
-        index_names: typing.Sequence[typing.Optional[Label]]
-        column_names: pd.Index
-        managed_data: local_data.ManagedArrowTable
-        array_value_column_ids: typing.Sequence[str]
-
-
-        if isinstance(data, pd.Series):
-            # Standardize column names to avoid collisions, eg. index named "value" and series also named "value"
-            original_index_names = list(name if name is not None else f"level_{i}" for i, name in enumerate(data.index.names))
-            original_series_name = data.name if data.name is not None else "value"
-
-            # Ensure series name doesn't clash with index names
-            series_name_std = utils.get_standardized_id(original_series_name)
-            index_names_std = [utils.get_standardized_id(name) for name in original_index_names]
-            while series_name_std in index_names_std:
-                series_name_std = series_name_std + "_series"
-
-            value_cols = [series_name_std]
-            index_cols = index_names_std
-
-            pd_data_reset = data.rename(series_name_std).reset_index(names=index_names_std)
-            managed_data = local_data.ManagedArrowTable.from_pandas(pd_data_reset)
-            index_names = list(data.index.names)
-            column_names = pd.Index([data.name])
-            array_value_column_ids = [*index_cols, *value_cols]
-
-        elif isinstance(data, pd.DataFrame):
-            original_index_names = list(name if name is not None else f"level_{i}" for i, name in enumerate(data.index.names))
-            original_column_names = list(data.columns)
-
-            # Standardize all names
-            index_names_std = [utils.get_standardized_id(name) for name in original_index_names]
-            column_names_std = [utils.get_standardized_id(name) for name in original_column_names]
-
-            # Resolve clashes between index and column names after standardization
-            final_column_names_std = []
-            for name_std in column_names_std:
-                temp_name_std = name_std
-                while temp_name_std in index_names_std:
-                    temp_name_std = temp_name_std + "_col"
-                final_column_names_std.append(temp_name_std)
-
-            value_cols = final_column_names_std
-            index_cols = index_names_std
-
-            pd_data_prepared = data.copy(deep=False)
-            pd_data_prepared.columns = value_cols
-            pd_data_prepared = pd_data_prepared.reset_index(names=index_cols)
-
-            managed_data = local_data.ManagedArrowTable.from_pandas(pd_data_prepared)
-            index_names = list(data.index.names)
-            column_names = data.columns.copy()
-            array_value_column_ids = [*index_cols, *value_cols]
-        else:
-            raise TypeError(
-                f"data must be pandas DataFrame or Series. Got: {type(data)}"
-            )
-
-        array_value = core.ArrayValue.from_managed(managed_data, session=session, default_column_ids=array_value_column_ids)
-
+        # Assumes caller has already converted datatypes to bigframes ones.
+        pd_data = data
+        column_labels = pd_data.columns
+        index_labels = list(pd_data.index.names)
+
+        # unique internal ids
+        column_ids = [f"column_{i}" for i in range(len(pd_data.columns))]
+        index_ids = [f"level_{level}" for level in range(pd_data.index.nlevels)]
+
+        pd_data = pd_data.set_axis(column_ids, axis=1)
+        pd_data = pd_data.reset_index(names=index_ids)
+        managed_data = local_data.ManagedArrowTable.from_pandas(pd_data)
+        array_value = core.ArrayValue.from_managed(managed_data, session=session)
        block = cls(
            array_value,
-            column_labels=column_names,
-            index_columns=index_cols,
-            index_labels=index_names,
+            column_labels=column_labels,
+            index_columns=index_ids,
+            index_labels=index_labels,
        )
-
        if cache_transpose:
            try:
                # this cache will help when aligning on axis=1
                block = block.with_transpose_cache(
                    cls.from_local(data.T, session, cache_transpose=False)
                )
            except Exception:
-                pass  # Transposition might fail for various reasons, non-critical.
+                pass
        return block
 
    @property
@@ -3397,5 +3345,3 @@ def _pd_index_to_array_value(
        rows.append(row)
 
    return core.ArrayValue.from_pyarrow(pa.Table.from_pylist(rows), session=session)
-
-[end of bigframes/core/blocks.py]
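The refactored `from_local` replaces label standardization with positional internal ids, which sidesteps name collisions entirely. A pandas-only illustration of the id scheme used above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Same scheme as from_local: positional ids for columns and index levels.
column_ids = [f"column_{i}" for i in range(len(df.columns))]
index_ids = [f"level_{level}" for level in range(df.index.nlevels)]

flat = df.set_axis(column_ids, axis=1).reset_index(names=index_ids)
print(list(flat.columns))  # ['level_0', 'column_0', 'column_1']
```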

bigframes/core/compile/compiler.py

Lines changed: 1 addition & 0 deletions

@@ -65,6 +65,7 @@ def compile_sql(request: configs.CompileRequest) -> configs.CompileResult:
    ordering: Optional[bf_ordering.RowOrdering] = result_node.order_by
    result_node = dataclasses.replace(result_node, order_by=None)
    result_node = cast(nodes.ResultNode, rewrites.column_pruning(result_node))
+    result_node = cast(nodes.ResultNode, rewrites.defer_selection(result_node))
    sql = compile_result_node(result_node)
    # Return the ordering iff no extra columns are needed to define the row order
    if ordering is not None:

bigframes/core/compile/googlesql/query.py

Lines changed: 8 additions & 2 deletions

@@ -83,7 +83,13 @@ def _select_field(self, field) -> SelectExpression:
            return SelectExpression(expression=expr.ColumnExpression(name=field))
 
        else:
-            alias = field[1] if (field[0] != field[1]) else None
+            alias = (
+                expr.AliasExpression(field[1])
+                if isinstance(field[1], str)
+                else field[1]
+                if (field[0] != field[1])
+                else None
+            )
            return SelectExpression(
                expression=expr.ColumnExpression(name=field[0]), alias=alias
            )
@@ -119,7 +125,7 @@ def sql(self) -> str:
        return "\n".join(text)
 
 
-@dataclasses.dataclass
+@dataclasses.dataclass(frozen=True)
class SelectExpression(abc.SQLSyntax):
    """This class represents `select_expression`."""
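Why `frozen=True` matters here: frozen dataclasses are immutable and hashable, so equal select expressions can live in sets and dicts and be deduplicated. A simplified stand-in (not the real `SelectExpression`):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class SelectExpr:  # simplified stand-in for SelectExpression
    name: str
    alias: Optional[str] = None

a = SelectExpr("col_a", "a")
b = SelectExpr("col_a", "a")
assert a == b              # value equality from the dataclass
assert len({a, b}) == 1    # hashable, so duplicates collapse in a set
```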
