feat: unified benchmark runner with composable config [will not merge]#3534

Closed
andygrove wants to merge 7 commits into apache:main from andygrove:unified-benchmark-runner
Conversation

@andygrove andygrove commented Feb 16, 2026

Summary

Part of #3440

I am going to break this PR down into smaller PRs for easier review.

Prior to this PR, we had three types of benchmarks:

  • dev/benchmark for TPC-*
  • benchmarks/pyspark for an ETL use case
  • Microbenchmarks implemented in Scala

This PR implements a unified benchmark runner Python script, making it easy to run benchmarks with Spark, Spark+Comet, and Spark+Gluten.

It also adds memory profiling features, along with support for local execution and docker-compose. We can add k8s support in a future PR.

I ported one microbenchmark over as an example, and will create follow-on PRs to port the remaining microbenchmarks, once this is merged.

The rest of this description was written by Claude Code.

Claude's Summary

Replace scattered benchmark scripts (10 duplicated shell scripts in dev/benchmarks/ and separate pyspark shuffle benchmarks) with a single composable framework under benchmarks/.

  • Composable config system: profile configs (cluster shape, memory, master URL) + engine configs (plugin JARs, shuffle manager, engine-specific settings) + CLI overrides, with clear merge precedence
  • Single entry point (run.py): builds and executes spark-submit with --dry-run support
  • Three benchmark suites: TPC-H/TPC-DS, shuffle (hash + round-robin), and microbenchmarks (29 string expressions ported from CometStringExpressionBenchmark.scala)
  • Level 1 JVM profiling via Spark REST API with CSV output
  • Analysis tools: comparison chart generation (compare.py) and memory reports (memory_report.py)
  • Multi-engine support: Spark, Comet, Comet-Iceberg, Gluten (+ shuffle variants)
  • Infrastructure: Docker (with memory-constrained overlay + cgroup metrics) and Kubernetes manifests
  • 7 engine configs and 5 profile configs (local, standalone, docker)
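The merge precedence described above (profile config, then engine config, then CLI overrides, with later sources winning) can be sketched as follows. This is an illustrative sketch, not run.py's actual code; the file contents and helper name are hypothetical, though `org.apache.spark.CometPlugin` is Comet's real plugin class.

```python
# Sketch of the composable config merge: profile < engine < CLI overrides.
# Keys and values are illustrative Spark conf entries.

def merge_configs(profile: dict, engine: dict, cli_overrides: dict) -> dict:
    """Merge config layers; later sources win on key conflicts."""
    merged: dict = {}
    for source in (profile, engine, cli_overrides):
        merged.update(source)
    return merged

# Hypothetical contents of a profile config, an engine config, and CLI flags.
profile = {"spark.master": "local[8]", "spark.executor.memory": "16g"}
engine = {"spark.plugins": "org.apache.spark.CometPlugin",
          "spark.executor.memory": "8g"}   # engine layer overrides profile
cli = {"spark.executor.memory": "4g"}      # CLI layer overrides both

conf = merge_configs(profile, engine, cli)
```

With this precedence, `spark.executor.memory` resolves to the CLI value while untouched keys fall through from the lower layers.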

Test plan

  • --dry-run produces correct spark-submit command for each suite
  • TPC suite runs with --engine comet --profile local
  • Shuffle suite runs with all three modes (spark, jvm, native)
  • Micro suite produces JSON with all 29 string expressions
  • --expression ascii runs only a single expression
  • JSON output works with analysis/compare.py
  • --profile flag produces metrics CSV

🤖 Generated with Claude Code

andygrove and others added 3 commits February 16, 2026 08:35
Replace scattered benchmark scripts (10 duplicated shell scripts in
dev/benchmarks/ and separate pyspark shuffle benchmarks) with a single
composable framework under benchmarks/.

Key changes:
- Config system: profile (cluster shape) + engine (plugin/JARs) + CLI
  overrides with clear merge precedence
- Python entry point (run.py) that builds and executes spark-submit
- TPC-H/TPC-DS and shuffle benchmark suites
- Level 1 JVM profiling via Spark REST API
- Analysis tools for comparison charts and memory reports
- Docker and Kubernetes infrastructure
- 8 engine configs (spark, comet, comet-iceberg, gluten, blaze, plus
  3 shuffle variants) and 5 profile configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port CometStringExpressionBenchmark as a micro suite in the unified
benchmark runner. All 29 string expressions are supported, and the
suite works with any engine (Comet, Gluten, Blaze, vanilla Spark).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove changed the title feat: add microbenchmark suite with string expressions feat: unified benchmark runner with composable config Feb 16, 2026
@andygrove andygrove changed the title feat: unified benchmark runner with composable config feat: unified benchmark runner with composable config [WIP] Feb 16, 2026
andygrove and others added 2 commits February 16, 2026 08:50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch base image to Spark 3.5.2 for Gluten compatibility
- Install both Java 8 (Gluten) and Java 17 (Comet) in Docker image
- Fix run.py injecting --name before subcommand, breaking argparse
- Document Gluten Java 8 requirement and JAVA_HOME override workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andygrove commented Feb 16, 2026

TPC-H q21 Memory Profile: Comet vs Gluten (SF100, Docker, 1 executor / 8 cores / 16 GiB)

Ran TPC-H query 21 with JVM memory profiling enabled (--profile --profile-interval 1.0) for both engines on the same Docker cluster.
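The profiling approach (polling Spark's REST API and writing metrics to CSV) can be sketched as below. The `/api/v1/applications/<app-id>/executors` endpoint and the `peakMemoryMetrics` field names are real Spark 3.x monitoring API features, but the helper and the sample payload are fabricated for illustration:

```python
# Sketch: turn executor JSON from Spark's REST API into CSV rows of
# peak memory metrics. A real poller would fetch the JSON periodically, e.g.
#   urllib.request.urlopen(
#       f"http://{driver}:4040/api/v1/applications/{app_id}/executors")
import csv
import io

FIELDS = ["JVMHeapMemory", "JVMOffHeapMemory",
          "OnHeapExecutionMemory", "OffHeapExecutionMemory"]

def executors_to_csv_rows(executors: list) -> list:
    """Extract per-executor peak memory metrics, defaulting missing ones to 0."""
    rows = []
    for ex in executors:
        peaks = ex.get("peakMemoryMetrics", {})
        row = {"id": ex["id"], "memoryUsed": ex.get("memoryUsed", 0)}
        row.update({f: peaks.get(f, 0) for f in FIELDS})
        rows.append(row)
    return rows

# Fabricated sample response for illustration only.
sample = [{"id": "0", "memoryUsed": 1375731,
           "peakMemoryMetrics": {"JVMHeapMemory": 6335076761,
                                 "OffHeapExecutionMemory": 4477503406}}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "memoryUsed"] + FIELDS)
writer.writeheader()
writer.writerows(executors_to_csv_rows(sample))
print(buf.getvalue())
```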

Configuration

  • Spark: 3.5.2 (docker image comet-bench)
  • Comet: Java 17, comet-baseline.jar
  • Gluten: Java 8, gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar
  • Cluster: 1 executor, 8 cores, 16 GiB executor memory, 16 GiB off-heap

Results (Executor 0)

| Engine | Wall-clock time |
| --- | --- |
| Comet | 80.03s |
| Gluten | 43.55s |
| Delta | +36.48s (Comet 1.84x slower) |

| Metric | Comet | Gluten | Delta |
| --- | --- | --- | --- |
| Peak memoryUsed | 1.31 MB | 1.91 MB | -605.79 KB (0.69x) |
| Peak JVMHeapMemory | 5.90 GB | 2.43 GB | +3.47 GB (2.43x) |
| Peak JVMOffHeapMemory | 123.64 MB | 116.36 MB | +7.27 MB (1.06x) |
| Peak OnHeapExecutionMemory | 0.00 B | 0.00 B | 0 |
| Peak OffHeapExecutionMemory | 4.17 GB | 2.79 GB | +1.38 GB (1.50x) |
| Peak OnHeapUnifiedMemory | 9.49 MB | 5.32 MB | +4.17 MB (1.78x) |
| Peak OffHeapUnifiedMemory | 4.17 GB | 2.79 GB | +1.38 GB (1.50x) |
| Peak ProcessTreeJVMRSSMemory | 0.00 B | 0.00 B | 0 |
| memoryUsed % of maxMemory | 0.01% | 0.01% | -0.00 pp |

maxMemory (executor 0): Comet = 25.42 GB | Gluten = 24.36 GB

Key Takeaways

  • Speed: Gluten finished in 43.55s vs Comet's 80.03s (1.84x faster on q21)
  • JVM Heap: Comet peaked at 5.90 GB on-heap — 2.43x higher than Gluten's 2.43 GB. This is the single largest memory difference.
  • Off-Heap Execution Memory: Comet used 4.17 GB vs Gluten's 2.79 GB (1.50x more), reflecting native engine working memory during q21's multi-way join + anti-join + aggregation.
  • JVM Off-Heap: Roughly comparable (~120 MB each, only 6% difference).
  • On-Heap Execution: Both 0 — expected since both engines execute natively off-heap.
  • ProcessTreeJVMRSSMemory: 0 for both (RSS tracking not enabled).

Notes

  • This is a single query (q21) on a single run — results may vary across queries and iterations.
  • The Docker image was updated to include both Java 8 (Gluten) and Java 17 (Comet), with a --name arg injection bugfix in run.py.

- Remove K8s manifests and k8s.conf profile (to be added in a follow-up
  issue once the core runner is merged)
- Add TPC-H (22) and TPC-DS (100) SQLBench query files under
  benchmarks/queries/ and embed them in the Docker image
- Update Dockerfile to COPY queries into /opt/benchmarks/queries/

TODO: file upstream issue for K8s support once PR apache#3534 is merged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TPC-H and TPC-DS query files use TPC Fair Use Policy headers,
not Apache License headers, so they must be excluded from both the
Maven RAT plugin and the release RAT script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove changed the title feat: unified benchmark runner with composable config [WIP] feat: unified benchmark runner with composable config [will not merge] Feb 16, 2026
@andygrove

This is now replaced by #3538 and #3539 which focus just on the TPC-* benchmarks. We can add other benchmarks to this framework later.

@andygrove andygrove closed this Feb 16, 2026
