Commit 80bac0f
chore: add script to generate _read_gbq_colab BigQuery benchmark tables (#1846)
* Add script to generate BigQuery benchmark tables
This script creates 10 BigQuery tables with varying schemas and data volumes based on predefined statistics.
Key features:
- Dynamically generates table schemas to match target average row sizes, maximizing data type diversity (a sketch follows after this list).
- Generates random data for each table, respecting BigQuery data types.
- Includes placeholders for GCP project and dataset IDs.
- Handles very large table data generation by capping row numbers for in-memory processing and printing warnings (actual BQ load for huge tables would require GCS load jobs).
- Adds a specific requirements file for this script: `scripts/requirements-create_tables.txt`.
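For illustration, here is a minimal sketch of the schema-sizing idea. The fixed-size lookup table, column-naming scheme, and `target_bytes` bookkeeping key below are assumptions, not the script's actual layout; the byte sizes follow BigQuery's documented logical sizes for these types.

```python
# Hypothetical sketch only; the real get_bq_schema lives in
# scripts/create_read_gbq_colab_benchmark_tables.py.
FIXED_TYPE_SIZES = {"BOOL": 1, "INT64": 8, "FLOAT64": 8, "DATE": 8, "TIMESTAMP": 8}

def get_bq_schema(target_row_bytes: int) -> list:
    schema, used = [], 0
    # Add one column of each fixed-size type while it still fits,
    # maximizing data type diversity.
    for i, (bq_type, size) in enumerate(FIXED_TYPE_SIZES.items()):
        if used + size <= target_row_bytes:
            schema.append({"name": f"col_{i}_{bq_type.lower()}", "type": bq_type})
            used += size
    # Let a flexible STRING column absorb any remaining bytes.
    if used < target_row_bytes:
        schema.append(
            {
                "name": "col_flex_string",
                "type": "STRING",
                "target_bytes": target_row_bytes - used,  # assumed bookkeeping key
            }
        )
    return schema
```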
* Refactor: Vectorize data generation in benchmark script
Vectorized the `generate_random_data` function in
`scripts/create_read_gbq_colab_benchmark_tables.py`.
Changes include:
- Using NumPy's vectorized operations (`size` parameter in random
functions, `np.vectorize`) to generate arrays of random values for
most data types at once (illustrated below).
- Employing list comprehensions for transformations on these arrays (e.g.,
formatting dates, generating strings from character arrays).
- Retaining loops for types where full vectorization is overly complex
or offers little benefit (e.g., precise byte-length JSON strings, BYTES
generation via `rng.bytes`).
- Assembling the final list of row dictionaries from the generated
columnar data.
This should improve performance for data generation, especially for
tables with a large number of rows.
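A short sketch of what this vectorized style looks like; the column choices and value ranges here are illustrative only, not the script's actual settings:

```python
import string

import numpy as np

rng = np.random.default_rng(seed=42)
n = 10_000  # rows generated in one shot

# One vectorized call per column instead of one call per row:
int_col = rng.integers(-(2**31), 2**31, size=n)     # INT64
float_col = rng.random(size=n)                      # FLOAT64
bool_col = rng.integers(0, 2, size=n).astype(bool)  # BOOL

# DATE: vectorized day offsets from an epoch, formatted per element.
dates = np.datetime64("2000-01-01") + rng.integers(0, 10_000, size=n)
date_col = [str(d) for d in dates]  # 'YYYY-MM-DD'

# STRING: a character matrix joined row-wise via a list comprehension.
chars = rng.choice(list(string.ascii_lowercase), size=(n, 12))
str_col = ["".join(row) for row in chars]
```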
* Implement batched data generation and loading
Refactored the script to process data in batches, significantly
improving memory efficiency for large tables.
Changes include:
1. `generate_random_data` function:
* Modified to be a generator, yielding data in chunks of a
specified `batch_size` (sketched after this list).
* The core vectorized logic for creating column data within each
batch is retained.
2. `create_and_load_table` function:
* Updated to consume data from the `generate_random_data` generator.
* No longer accepts a full list of data rows.
* For actual BigQuery loads, it iterates through generated batches
and further sub-batches them (if necessary) for optimal
`client.insert_rows_json` calls.
* Simulation mode now reflects this batched processing by showing
details of the first generated batch and estimated total batches.
3. `main` function:
* Removed pre-generation of the entire dataset or a capped sample.
* The call to `create_and_load_table` now passes parameters required
for it to invoke and manage the data generator (total `num_rows`,
`rng` object, and `DATA_GENERATION_BATCH_SIZE`).
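A condensed sketch of the generator/consumer shape described above. The signature mirrors the parameters listed here but is otherwise assumed, and the per-column logic is reduced to a single INT64 case:

```python
import numpy as np

def generate_random_data(schema, num_rows, rng, batch_size):
    """Yield lists of row dicts, at most batch_size rows per batch."""
    for start in range(0, num_rows, batch_size):
        n = min(batch_size, num_rows - start)
        # Vectorized per-column generation for this batch; the real script
        # branches on field["type"] instead of assuming INT64 everywhere.
        columns = {
            field["name"]: rng.integers(0, 1_000_000, size=n) for field in schema
        }
        # Assemble the columnar arrays into row dictionaries.
        yield [
            {name: int(values[i]) for name, values in columns.items()}
            for i in range(n)
        ]

# Consumer side (create_and_load_table streams batches; simulation shown here):
rng = np.random.default_rng(0)
schema = [{"name": "col_int", "type": "INT64"}]
for batch in generate_random_data(schema, num_rows=25_000, rng=rng, batch_size=10_000):
    print(len(batch))  # a real load sub-batches and calls client.insert_rows_json
```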
* Optimize DATETIME/TIMESTAMP generation with numpy.datetime_as_string
Refactored the `generate_random_data` function to use
`numpy.datetime_as_string` for converting `numpy.datetime64` arrays
to ISO-formatted strings for DATETIME and TIMESTAMP columns.
- For DATETIME:
- Python `datetime.datetime` objects are created in a list first
(to ensure date component validity) then converted to
`numpy.datetime64[us]`.
- `numpy.datetime_as_string` is used, and the output 'T' separator
is replaced with a space.
- For TIMESTAMP:
- `numpy.datetime64[us]` arrays are constructed directly from epoch
seconds and microsecond offsets.
- `numpy.datetime_as_string` is used with `timezone='UTC'` to
produce a 'Z'-suffixed UTC string.
This change improves performance and code clarity for generating these
timestamp string formats.
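A small, self-contained illustration of both conversions; the value ranges are arbitrary, while `numpy.datetime_as_string` and the dtype arithmetic are standard NumPy behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# TIMESTAMP: build datetime64[us] directly from epoch seconds + microseconds.
epoch_s = rng.integers(0, 2_000_000_000, size=n).astype("datetime64[s]")
micros = rng.integers(0, 1_000_000, size=n).astype("timedelta64[us]")
ts = epoch_s.astype("datetime64[us]") + micros

# timezone="UTC" yields 'Z'-suffixed strings, e.g. '2033-05-18T03:33:20.000008Z'.
timestamp_strs = np.datetime_as_string(ts, timezone="UTC")

# DATETIME: same array, ISO string with the 'T' separator swapped for a space.
datetime_strs = [s.replace("T", " ") for s in np.datetime_as_string(ts)]
```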
* Add argparse for project and dataset IDs
Implemented command-line arguments for specifying Google Cloud Project ID
and BigQuery Dataset ID, replacing hardcoded global constants.
Changes:
- Imported `argparse` module.
- Added optional `--project_id` (-p) and `--dataset_id` (-d) arguments
to `main()`.
- If either `project_id` or `dataset_id` is not provided, the script defaults
to simulation mode.
- `create_and_load_table` now checks for the presence of both IDs to
determine if it should attempt actual BigQuery operations or run in
simulation.
- Error handling in `create_and_load_table` for BQ operations was
adjusted to log errors per table and continue processing remaining
tables, rather than halting the script.
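A minimal sketch of the argument handling described above (the parser description string is invented):

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Create and load BigQuery benchmark tables."
    )
    parser.add_argument("-p", "--project_id", help="Google Cloud project ID")
    parser.add_argument("-d", "--dataset_id", help="BigQuery dataset ID")
    args = parser.parse_args()

    # Either ID missing -> simulation mode; no BigQuery calls are made.
    simulate = not (args.project_id and args.dataset_id)
    print(f"simulation mode: {simulate}")

if __name__ == "__main__":
    main()
```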
* Add unit tests for table generation script
Added unit tests for `get_bq_schema` and `generate_random_data`
functions in `create_read_gbq_colab_benchmark_tables.py`.
- Created `scripts/create_read_gbq_colab_benchmark_tables_test.py`.
- Implemented pytest-style tests covering various scenarios:
- For `get_bq_schema`:
- Zero and small target byte sizes.
- Exact fits with fixed-size types.
- Inclusion and expansion of flexible types.
- Generation of all fixed types where possible.
- Uniqueness of column names.
- Helper function `_calculate_row_size` used for validation.
- For `generate_random_data`:
- Zero rows case.
- Basic schema and batching logic (single batch, multiple full
batches, partial last batches).
- Generation of all supported data types, checking Python types,
string formats (using regex and `fromisoformat`),
lengths for string/bytes, and JSON validity.
- Added `pytest` and `pandas` (for pytest compatibility in the current project environment) to `scripts/requirements-create_tables.txt`.
- All tests pass.
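For example, the batching behavior could be pinned down with a test along these lines; the import path and exact signature are assumptions based on the description above:

```python
import numpy as np

# Hypothetical import, mirroring the test module described above.
from create_read_gbq_colab_benchmark_tables import generate_random_data

def test_batching_yields_partial_last_batch():
    schema = [{"name": "col_int", "type": "INT64"}]
    rng = np.random.default_rng(42)
    batches = list(
        generate_random_data(schema, num_rows=25, rng=rng, batch_size=10)
    )
    # Two full batches plus a partial last batch.
    assert [len(batch) for batch in batches] == [10, 10, 5]
```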
* refactor
* reduce duplicated work
* only use percentile in table name
* use annotations so type hints do not fail on Python 3.9
* 🦉 Updates from OwlBot post-processor
See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md
* Update scripts/create_read_gbq_colab_benchmark_tables.py
* Delete scripts/requirements-create_tables.txt
* base64 encode
* refactor batch generation
* adjust test formatting
* parallel processing
---------
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
File tree: 5 files changed, +888 −2 lines changed
- scripts
  - readme-gen