Commit 5d9e208

Merge remote-tracking branch 'github/main' into polars_semi
2 parents: 98b300c + ac55aae

File tree: 49 files changed, +1621 / -388 lines


CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -4,6 +4,29 @@

 [1]: https://pypi.org/project/bigframes/#history

+## [2.7.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.6.0...v2.7.0) (2025-06-16)
+
+
+### Features
+
+* Add bbq.json_query_array and warn bbq.json_extract_array deprecated ([#1811](https://github.com/googleapis/python-bigquery-dataframes/issues/1811)) ([dc9eb27](https://github.com/googleapis/python-bigquery-dataframes/commit/dc9eb27fa75e90c2c95a0619551bf67aea6ef63b))
+* Add bbq.json_value_array and deprecate bbq.json_extract_string_array ([#1818](https://github.com/googleapis/python-bigquery-dataframes/issues/1818)) ([019051e](https://github.com/googleapis/python-bigquery-dataframes/commit/019051e453d81769891aa398475ebd04d1826e81))
+* Add groupby cumcount ([#1798](https://github.com/googleapis/python-bigquery-dataframes/issues/1798)) ([18f43e8](https://github.com/googleapis/python-bigquery-dataframes/commit/18f43e8b58e03a27b021bce07566a3d006ac3679))
+* Support custom build service account in `remote_function` ([#1796](https://github.com/googleapis/python-bigquery-dataframes/issues/1796)) ([e586151](https://github.com/googleapis/python-bigquery-dataframes/commit/e586151df81917b49f702ae496aaacbd02931636))
+
+
+### Bug Fixes
+
+* Correct read_csv behaviours with use_cols, names, index_col ([#1804](https://github.com/googleapis/python-bigquery-dataframes/issues/1804)) ([855031a](https://github.com/googleapis/python-bigquery-dataframes/commit/855031a316a6957731a5d1c5e59dedb9757d9f7a))
+* Fix single row broadcast with null index ([#1803](https://github.com/googleapis/python-bigquery-dataframes/issues/1803)) ([080eb7b](https://github.com/googleapis/python-bigquery-dataframes/commit/080eb7be3cde591e08cad0d5c52c68cc0b25ade8))
+
+
+### Documentation
+
+* Document how to use ai.map() for information extraction ([#1808](https://github.com/googleapis/python-bigquery-dataframes/issues/1808)) ([b586746](https://github.com/googleapis/python-bigquery-dataframes/commit/b5867464a5bf30300dcfc069eda546b11f03146c))
+* Rearrange README.rst to include a short code sample ([#1812](https://github.com/googleapis/python-bigquery-dataframes/issues/1812)) ([f6265db](https://github.com/googleapis/python-bigquery-dataframes/commit/f6265dbb8e22de81bb59c7def175cd325e85c041))
+* Use pandas API instead of pandas-like or pandas-compatible ([#1825](https://github.com/googleapis/python-bigquery-dataframes/issues/1825)) ([aa32369](https://github.com/googleapis/python-bigquery-dataframes/commit/aa323694e161f558bc5e60490c2f21008961e2ca))
+
 ## [2.6.0](https://github.com/googleapis/python-bigquery-dataframes/compare/v2.5.0...v2.6.0) (2025-06-09)
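Two of the feature entries above deprecate `bbq` functions in favor of newer JSON APIs, warning at call time rather than breaking callers. A minimal sketch of that call-time deprecation pattern, using only the standard `warnings` module (the function name and pass-through behavior here are hypothetical stand-ins, not the bigframes code):

```python
import warnings


def json_extract_string_array_sketch(values):
    # Hypothetical stand-in for a deprecated API: warn once per call,
    # then still perform the old behavior so existing code keeps working.
    warnings.warn(
        "`json_extract_string_array` is deprecated and will be removed in a "
        "future version. Use `json_value_array` instead.",
        category=UserWarning,
        stacklevel=2,
    )
    return values  # the real function would compute its usual result


# Callers can verify (or silence) the warning with warnings.catch_warnings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = json_extract_string_array_sketch(["[1, 2]"])
```

Warning at call time (rather than import time) means only code paths that actually use the deprecated function see the message.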

README.rst

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ BigQuery DataFrames (BigFrames)

 BigQuery DataFrames (also known as BigFrames) provides a Pythonic DataFrame
 and machine learning (ML) API powered by the BigQuery engine.

-* ``bigframes.pandas`` provides a pandas-compatible API for analytics.
+* ``bigframes.pandas`` provides a pandas API for analytics. Many workloads can be
+  migrated from pandas to bigframes by just changing a few imports.
 * ``bigframes.ml`` provides a scikit-learn-like API for ML.

 BigQuery DataFrames is an open-source package.

bigframes/bigquery/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -43,6 +43,7 @@
     json_query_array,
     json_set,
     json_value,
+    json_value_array,
     parse_json,
 )
 from bigframes.bigquery._operations.search import create_vector_index, vector_search
@@ -71,6 +72,7 @@
     "json_query_array",
     "json_set",
     "json_value",
+    "json_value_array",
     "parse_json",
     # search ops
     "create_vector_index",

bigframes/bigquery/_operations/json.py

Lines changed: 65 additions & 1 deletion
@@ -196,6 +196,10 @@ def json_extract_string_array(
     values in the array. This function uses single quotes and brackets to escape
     invalid JSONPath characters in JSON keys.

+    .. deprecated:: 2.6.0
+        The ``json_extract_string_array`` is deprecated and will be removed in a future version.
+        Use ``json_value_array`` instead.
+
     **Examples:**

         >>> import bigframes.pandas as bpd
@@ -233,6 +237,11 @@ def json_extract_string_array(
     Returns:
         bigframes.series.Series: A new Series with the parsed arrays from the input.
     """
+    msg = (
+        "The `json_extract_string_array` is deprecated and will be removed in a future version. "
+        "Use `json_value_array` instead."
+    )
+    warnings.warn(bfe.format_message(msg), category=UserWarning)
     array_series = input._apply_unary_op(
         ops.JSONExtractStringArray(json_path=json_path)
     )
@@ -334,7 +343,7 @@ def json_query_array(

 def json_value(
     input: series.Series,
-    json_path: str,
+    json_path: str = "$",
 ) -> series.Series:
     """Extracts a JSON scalar value and converts it to a SQL ``STRING`` value. In
     addtion, this function:
@@ -366,6 +375,61 @@ def json_value(
     return input._apply_unary_op(ops.JSONValue(json_path=json_path))


+def json_value_array(
+    input: series.Series,
+    json_path: str = "$",
+) -> series.Series:
+    """
+    Extracts a JSON array of scalar values and converts it to a SQL ``ARRAY<STRING>``
+    value. In addition, this function:
+
+    - Removes the outermost quotes and unescapes the values.
+    - Returns a SQL ``NULL`` if the selected value isn't an array or not an array
+      containing only scalar values.
+    - Uses double quotes to escape invalid ``JSON_PATH`` characters in JSON keys.
+
+    **Examples:**
+
+        >>> import bigframes.pandas as bpd
+        >>> import bigframes.bigquery as bbq
+        >>> bpd.options.display.progress_bar = None
+
+        >>> s = bpd.Series(['[1, 2, 3]', '[4, 5]'])
+        >>> bbq.json_value_array(s)
+        0    ['1' '2' '3']
+        1        ['4' '5']
+        dtype: list<item: string>[pyarrow]
+
+        >>> s = bpd.Series([
+        ...     '{"fruits": ["apples", "oranges", "grapes"]',
+        ...     '{"fruits": ["guava", "grapes"]}'
+        ... ])
+        >>> bbq.json_value_array(s, "$.fruits")
+        0    ['apples' 'oranges' 'grapes']
+        1              ['guava' 'grapes']
+        dtype: list<item: string>[pyarrow]
+
+        >>> s = bpd.Series([
+        ...     '{"fruits": {"color": "red", "names": ["apple","cherry"]}}',
+        ...     '{"fruits": {"color": "green", "names": ["guava", "grapes"]}}'
+        ... ])
+        >>> bbq.json_value_array(s, "$.fruits.names")
+        0    ['apple' 'cherry']
+        1    ['guava' 'grapes']
+        dtype: list<item: string>[pyarrow]
+
+    Args:
+        input (bigframes.series.Series):
+            The Series containing JSON data (as native JSON objects or JSON-formatted strings).
+        json_path (str):
+            The JSON path identifying the data that you want to obtain from the input.
+
+    Returns:
+        bigframes.series.Series: A new Series with the parsed arrays from the input.
+    """
+    return input._apply_unary_op(ops.JSONValueArray(json_path=json_path))
+
+
 @utils.preview(name="The JSON-related API `parse_json`")
 def parse_json(
     input: series.Series,
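In bigframes, `json_value_array` compiles down to BigQuery's `JSON_VALUE_ARRAY` SQL function, so nothing runs locally. The semantics the docstring describes can still be approximated in plain Python for intuition. This sketch (`json_value_array_py` is a hypothetical helper, not bigframes code) handles only simple `$.a.b` dot paths, not the full JSONPath quoting rules:

```python
import json
from typing import List, Optional


def json_value_array_py(doc: str, json_path: str = "$") -> Optional[List[str]]:
    """Approximate BigQuery JSON_VALUE_ARRAY for simple dot paths."""
    try:
        value = json.loads(doc)
    except json.JSONDecodeError:
        return None  # invalid JSON document yields SQL NULL in BigQuery
    # Walk a simple "$.a.b" dot path (no quoted keys or brackets handled here).
    for key in [p for p in json_path.lstrip("$").split(".") if p]:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    # NULL unless the selected value is an array containing only scalars.
    if not isinstance(value, list) or any(
        isinstance(v, (dict, list)) for v in value
    ):
        return None
    return [str(v) for v in value]


print(json_value_array_py('[1, 2, 3]'))  # ['1', '2', '3']
print(json_value_array_py('{"fruits": ["guava", "grapes"]}', "$.fruits"))
```

Note the second docstring example above feeds one malformed row (a missing closing brace); in BigQuery that row simply produces `NULL`, which the `try`/`except` branch here mirrors.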

bigframes/core/bigframe_node.py

Lines changed: 20 additions & 3 deletions
@@ -20,9 +20,19 @@
 import functools
 import itertools
 import typing
-from typing import Callable, Dict, Generator, Iterable, Mapping, Sequence, Set, Tuple
-
-from bigframes.core import field, identifiers
+from typing import (
+    Callable,
+    Dict,
+    Generator,
+    Iterable,
+    Mapping,
+    Sequence,
+    Set,
+    Tuple,
+    Union,
+)
+
+from bigframes.core import expression, field, identifiers
 import bigframes.core.schema as schemata
 import bigframes.dtypes

@@ -278,6 +288,13 @@ def _dtype_lookup(self) -> dict[identifiers.ColumnId, bigframes.dtypes.Dtype]:
     def field_by_id(self) -> Mapping[identifiers.ColumnId, field.Field]:
         return {field.id: field for field in self.fields}

+    @property
+    def _node_expressions(
+        self,
+    ) -> Sequence[Union[expression.Expression, expression.Aggregation]]:
+        """List of scalar expressions. Intended for checking engine compatibility with used ops."""
+        return ()
+
     # Plan algorithms
     def unique_nodes(
         self: BigFrameNode,
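The new `_node_expressions` property defaults to an empty tuple on the base `BigFrameNode`, so only node types that actually carry scalar expressions need to override it; an engine can then walk a plan tree and collect every expression to check whether it supports the ops used. A generic sketch of that base-default pattern, with hypothetical node classes and string stand-ins for expressions:

```python
import dataclasses
from typing import List, Sequence, Tuple


@dataclasses.dataclass(frozen=True)
class Node:
    children: Tuple["Node", ...] = ()

    @property
    def node_expressions(self) -> Sequence[str]:
        # Base default mirrors the commit's `_node_expressions`: most nodes
        # carry no scalar expressions, so subclasses override only when needed.
        return ()


@dataclasses.dataclass(frozen=True)
class ProjectionNode(Node):
    exprs: Tuple[str, ...] = ()

    @property
    def node_expressions(self) -> Sequence[str]:
        return self.exprs


def all_expressions(root: Node) -> List[str]:
    """Collect expressions over the whole tree, e.g. for engine-support checks."""
    out = list(root.node_expressions)
    for child in root.children:
        out.extend(all_expressions(child))
    return out


leaf = Node()
plan = ProjectionNode(children=(leaf,), exprs=("a + b", "upper(c)"))
print(all_expressions(plan))  # ['a + b', 'upper(c)']
```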

bigframes/core/compile/scalar_op_compiler.py

Lines changed: 12 additions & 0 deletions
@@ -1448,6 +1448,11 @@ def json_value_op_impl(x: ibis_types.Value, op: ops.JSONValue):
     return json_value(json_obj=x, json_path=op.json_path)


+@scalar_op_compiler.register_unary_op(ops.JSONValueArray, pass_op=True)
+def json_value_array_op_impl(x: ibis_types.Value, op: ops.JSONValueArray):
+    return json_value_array(json_obj=x, json_path=op.json_path)
+
+
 # Blob Ops
 @scalar_op_compiler.register_unary_op(ops.obj_fetch_metadata_op)
 def obj_fetch_metadata_op_impl(obj_ref: ibis_types.Value):
@@ -2157,6 +2162,13 @@ def json_value(  # type: ignore[empty-body]
     """Retrieve value of a JSON field as plain STRING."""


+@ibis_udf.scalar.builtin(name="json_value_array")
+def json_value_array(  # type: ignore[empty-body]
+    json_obj: ibis_dtypes.JSON, json_path: ibis_dtypes.String
+) -> ibis_dtypes.Array[ibis_dtypes.String]:
+    """Extracts a JSON array and converts it to a SQL ARRAY of STRINGs."""
+
+
 @ibis_udf.scalar.builtin(name="INT64")
 def cast_json_to_int64(json_str: ibis_dtypes.JSON) -> ibis_dtypes.Int64:  # type: ignore[empty-body]
     """Converts a JSON number to a SQL INT64 value."""
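The `register_unary_op` decorator seen above is a registry pattern: each op class maps to a compile function, and `pass_op=True` lets implementations read op parameters such as `json_path`. A self-contained sketch of that mechanism (class names and the SQL it emits are illustrative, not the ibis-backed bigframes compiler):

```python
from typing import Any, Callable, Dict, Type


class ScalarOpCompiler:
    """Minimal sketch of the registry behind `register_unary_op`."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[Any, Any], str]] = {}

    def register_unary_op(
        self, op_type: Type, pass_op: bool = False
    ) -> Callable:
        def decorator(impl: Callable) -> Callable:
            # Normalize every implementation to a (value, op) signature.
            if pass_op:
                self._registry[op_type.__name__] = impl
            else:
                self._registry[op_type.__name__] = lambda x, op: impl(x)
            return impl

        return decorator

    def compile(self, x: Any, op: Any) -> str:
        return self._registry[type(op).__name__](x, op)


class JSONValueArray:
    def __init__(self, json_path: str = "$") -> None:
        self.json_path = json_path


compiler = ScalarOpCompiler()


@compiler.register_unary_op(JSONValueArray, pass_op=True)
def json_value_array_op_impl(x, op):
    # Stand-in for the ibis builtin call in the real compiler.
    return f"JSON_VALUE_ARRAY({x}, '{op.json_path}')"


print(compiler.compile("col", JSONValueArray("$.fruits")))
# JSON_VALUE_ARRAY(col, '$.fruits')
```

Dispatching on the op's class keeps the compiler open for extension: supporting a new op, as this commit does, is one new registration rather than an edit to a central switch.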

bigframes/core/compile/sqlglot/compiler.py

Lines changed: 33 additions & 5 deletions
@@ -119,15 +119,33 @@ def _remap_variables(self, node: nodes.ResultNode) -> nodes.ResultNode:
         return typing.cast(nodes.ResultNode, result_node)

     def _compile_result_node(self, root: nodes.ResultNode) -> str:
-        sqlglot_ir = self.compile_node(root.child)
-
+        # Have to bind schema as the final step before compilation.
+        root = typing.cast(nodes.ResultNode, schema_binding.bind_schema_to_tree(root))
         selected_cols: tuple[tuple[str, sge.Expression], ...] = tuple(
             (name, scalar_compiler.compile_scalar_expression(ref))
             for ref, name in root.output_cols
         )
-        sqlglot_ir = sqlglot_ir.select(selected_cols)
+        # Skip squashing selections to ensure the right ordering and limit keys
+        sqlglot_ir = self.compile_node(root.child).select(
+            selected_cols, squash_selections=False
+        )
+
+        if root.order_by is not None:
+            ordering_cols = tuple(
+                sge.Ordered(
+                    this=scalar_compiler.compile_scalar_expression(
+                        ordering.scalar_expression
+                    ),
+                    desc=ordering.direction.is_ascending is False,
+                    nulls_first=ordering.na_last is False,
+                )
+                for ordering in root.order_by.all_ordering_columns
+            )
+            sqlglot_ir = sqlglot_ir.order_by(ordering_cols)
+
+        if root.limit is not None:
+            sqlglot_ir = sqlglot_ir.limit(root.limit)

-        # TODO: add order_by, limit to sqlglot_expr
         return sqlglot_ir.sql

     @functools.lru_cache(maxsize=5000)
@@ -190,9 +208,19 @@ def compile_projection(
         )
         return child.project(projected_cols)

+    @_compile_node.register
+    def compile_concat(
+        self, node: nodes.ConcatNode, *children: ir.SQLGlotIR
+    ) -> ir.SQLGlotIR:
+        output_ids = [id.sql for id in node.output_ids]
+        return ir.SQLGlotIR.from_union(
+            [child.expr for child in children],
+            output_ids=output_ids,
+            uid_gen=self.uid_gen,
+        )
+

 def _replace_unsupported_ops(node: nodes.BigFrameNode):
     node = nodes.bottom_up(node, rewrite.rewrite_slice)
-    node = nodes.bottom_up(node, schema_binding.bind_schema_to_expressions)
     node = nodes.bottom_up(node, rewrite.rewrite_range_rolling)
     return node
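The key change in `_compile_result_node` is that `ORDER BY` and `LIMIT` are now appended only at the outermost query, after the child plan has been compiled, resolving the old TODO. Stripped of the sqlglot IR, the shape of that finalization step looks like this (pure string-building sketch with hypothetical names, not the bigframes compiler):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OrderingColumn:
    name: str
    ascending: bool = True
    na_last: bool = True


def finalize_sql(
    base_sql: str,
    order_by: Tuple[OrderingColumn, ...] = (),
    limit: Optional[int] = None,
) -> str:
    """Append ORDER BY / LIMIT only on the outermost query, as the commit does."""
    sql = base_sql
    if order_by:
        parts = [
            f"`{c.name}` {'ASC' if c.ascending else 'DESC'} "
            f"{'NULLS LAST' if c.na_last else 'NULLS FIRST'}"
            for c in order_by
        ]
        sql += " ORDER BY " + ", ".join(parts)
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql


print(finalize_sql("SELECT a, b FROM t", (OrderingColumn("a", ascending=False),), 10))
# SELECT a, b FROM t ORDER BY `a` DESC NULLS LAST LIMIT 10
```

Keeping these clauses out of inner subqueries matters because SQL engines are free to ignore ordering in subqueries; only the outermost `ORDER BY` is guaranteed to shape the result.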

bigframes/core/compile/sqlglot/scalar_compiler.py

Lines changed: 22 additions & 7 deletions
@@ -13,15 +13,25 @@
 # limitations under the License.
 from __future__ import annotations

+import dataclasses
 import functools

 import sqlglot.expressions as sge

+from bigframes import dtypes
 from bigframes.core import expression
 import bigframes.core.compile.sqlglot.sqlglot_ir as ir
 import bigframes.operations as ops


+@dataclasses.dataclass(frozen=True)
+class TypedExpr:
+    """SQLGlot expression with type."""
+
+    expr: sge.Expression
+    dtype: dtypes.ExpressionType
+
+
 @functools.singledispatch
 def compile_scalar_expression(
     expression: expression.Expression,
@@ -50,9 +60,12 @@ def compile_constant_expression(


 @compile_scalar_expression.register
-def compile_op_expression(expr: expression.OpExpression):
+def compile_op_expression(expr: expression.OpExpression) -> sge.Expression:
     # Non-recursively compiles the children scalar expressions.
-    args = tuple(map(compile_scalar_expression, expr.inputs))
+    args = tuple(
+        TypedExpr(compile_scalar_expression(input), input.output_type)
+        for input in expr.inputs
+    )

     op = expr.op
     op_name = expr.op.__class__.__name__
@@ -79,8 +92,10 @@ def compile_op_expression(expr: expression.OpExpression):


 # TODO: add parenthesize for operators
-def compile_addop(
-    op: ops.AddOp, left: sge.Expression, right: sge.Expression
-) -> sge.Expression:
-    # TODO: support addop for string dtype.
-    return sge.Add(this=left, expression=right)
+def compile_addop(op: ops.AddOp, left: TypedExpr, right: TypedExpr) -> sge.Expression:
+    if left.dtype == dtypes.STRING_DTYPE and right.dtype == dtypes.STRING_DTYPE:
+        # String addition
+        return sge.Concat(expressions=[left.expr, right.expr])
+
+    # Numerical addition
+    return sge.Add(this=left.expr, expression=right.expr)
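Wrapping each compiled child in `TypedExpr` is what lets `compile_addop` become type-aware: with the dtype carried alongside the SQL expression, `+` can lower to string concatenation or numeric addition as appropriate. A self-contained sketch of that dispatch, emitting SQL strings instead of sqlglot nodes (the dtype labels are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedExpr:
    """SQL fragment paired with its dtype, mirroring the commit's TypedExpr."""

    sql: str
    dtype: str  # e.g. "string" or "int64"


def compile_addop(left: TypedExpr, right: TypedExpr) -> str:
    # With dtypes attached, `+` picks CONCAT for string operands and
    # arithmetic addition otherwise, as the new sqlglot compile_addop does.
    if left.dtype == "string" and right.dtype == "string":
        return f"CONCAT({left.sql}, {right.sql})"
    return f"({left.sql} + {right.sql})"


print(compile_addop(TypedExpr("first_name", "string"), TypedExpr("last_name", "string")))
# CONCAT(first_name, last_name)
print(compile_addop(TypedExpr("x", "int64"), TypedExpr("y", "int64")))
# (x + y)
```

Without the dtype, the compiler would have to guess from the SQL text alone; this is why the commit threads `output_type` through every compiled child rather than passing bare expressions.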
