feat: unified benchmark runner with composable config [will not merge]#3534

Closed
andygrove wants to merge 7 commits into apache:main from andygrove:unified-benchmark-runner
Conversation

@andygrove andygrove commented Feb 16, 2026

Summary

Part of #3440

I am going to break this PR down into smaller PRs for easier review.

Prior to this PR, we had three types of benchmarks:

  • dev/benchmark for TPC-*
  • benchmarks/pyspark for an ETL use case
  • Microbenchmarks implemented in Scala

This PR implements a unified benchmark runner Python script, making it easy to run benchmarks with Spark, Spark+Comet, and Spark+Gluten.

It also adds memory profiling features, along with support for local execution and docker-compose. We can add k8s support in a future PR.

I ported one microbenchmark over as an example, and will create follow-on PRs to port the remaining microbenchmarks, once this is merged.

The rest of this description was written by Claude Code.

Claude's Summary

Replace scattered benchmark scripts (10 duplicated shell scripts in dev/benchmarks/ and separate pyspark shuffle benchmarks) with a single composable framework under benchmarks/.

  • Composable config system: profile configs (cluster shape, memory, master URL) + engine configs (plugin JARs, shuffle manager, engine-specific settings) + CLI overrides, with clear merge precedence
  • Single entry point (run.py): builds and executes spark-submit with --dry-run support
  • Three benchmark suites: TPC-H/TPC-DS, shuffle (hash + round-robin), and microbenchmarks (29 string expressions ported from CometStringExpressionBenchmark.scala)
  • Level 1 JVM profiling via Spark REST API with CSV output
  • Analysis tools: comparison chart generation (compare.py) and memory reports (memory_report.py)
  • Multi-engine support: Spark, Comet, Comet-Iceberg, Gluten (+ shuffle variants)
  • Infrastructure: Docker (with memory-constrained overlay + cgroup metrics) and Kubernetes manifests
  • 7 engine configs and 5 profile configs (local, standalone, docker)
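The merge precedence described above (profile config, then engine config, then CLI overrides, with later sources winning) can be sketched as follows. This is an illustrative sketch, not run.py's actual code; the file contents and helper name are hypothetical, though `org.apache.spark.CometPlugin` is Comet's real plugin class.

```python
# Sketch of the composable config merge: profile < engine < CLI overrides.
# Keys and values are illustrative Spark conf entries.

def merge_configs(profile: dict, engine: dict, cli_overrides: dict) -> dict:
    """Merge config layers; later sources win on key conflicts."""
    merged: dict = {}
    for source in (profile, engine, cli_overrides):
        merged.update(source)
    return merged

# Hypothetical contents of a profile config, an engine config, and CLI flags.
profile = {"spark.master": "local[8]", "spark.executor.memory": "16g"}
engine = {"spark.plugins": "org.apache.spark.CometPlugin",
          "spark.executor.memory": "8g"}   # engine layer overrides profile
cli = {"spark.executor.memory": "4g"}      # CLI layer overrides both

conf = merge_configs(profile, engine, cli)
```

With this precedence, `spark.executor.memory` resolves to the CLI value while untouched keys fall through from the lower layers.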

Test plan

  • --dry-run produces correct spark-submit command for each suite
  • TPC suite runs with --engine comet --profile local
  • Shuffle suite runs with all three modes (spark, jvm, native)
  • Micro suite produces JSON with all 29 string expressions
  • --expression ascii runs only a single expression
  • JSON output works with analysis/compare.py
  • --profile flag produces metrics CSV

🤖 Generated with Claude Code

andygrove and others added 3 commits February 16, 2026 08:35
Replace scattered benchmark scripts (10 duplicated shell scripts in
dev/benchmarks/ and separate pyspark shuffle benchmarks) with a single
composable framework under benchmarks/.

Key changes:
- Config system: profile (cluster shape) + engine (plugin/JARs) + CLI
  overrides with clear merge precedence
- Python entry point (run.py) that builds and executes spark-submit
- TPC-H/TPC-DS and shuffle benchmark suites
- Level 1 JVM profiling via Spark REST API
- Analysis tools for comparison charts and memory reports
- Docker and Kubernetes infrastructure
- 8 engine configs (spark, comet, comet-iceberg, gluten, blaze, plus
  3 shuffle variants) and 5 profile configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port CometStringExpressionBenchmark as a micro suite in the unified
benchmark runner. All 29 string expressions are supported, and the
suite works with any engine (Comet, Gluten, Blaze, vanilla Spark).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove changed the title feat: add microbenchmark suite with string expressions feat: unified benchmark runner with composable config Feb 16, 2026
@andygrove andygrove changed the title feat: unified benchmark runner with composable config feat: unified benchmark runner with composable config [WIP] Feb 16, 2026
andygrove and others added 2 commits February 16, 2026 08:50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch base image to Spark 3.5.2 for Gluten compatibility
- Install both Java 8 (Gluten) and Java 17 (Comet) in Docker image
- Fix run.py injecting --name before subcommand, breaking argparse
- Document Gluten Java 8 requirement and JAVA_HOME override workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andygrove commented Feb 16, 2026

TPC-H q21 Memory Profile: Comet vs Gluten (SF100, Docker, 1 executor / 8 cores / 16 GiB)

Ran TPC-H query 21 with JVM memory profiling enabled (--profile --profile-interval 1.0) for both engines on the same Docker cluster.
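The profiling approach (polling Spark's REST API and writing metrics to CSV) can be sketched as below. The `/api/v1/applications/<app-id>/executors` endpoint and the `peakMemoryMetrics` field names are real Spark 3.x monitoring API features, but the helper and the sample payload are fabricated for illustration:

```python
# Sketch: turn executor JSON from Spark's REST API into CSV rows of
# peak memory metrics. A real poller would fetch the JSON periodically, e.g.
#   urllib.request.urlopen(
#       f"http://{driver}:4040/api/v1/applications/{app_id}/executors")
import csv
import io

FIELDS = ["JVMHeapMemory", "JVMOffHeapMemory",
          "OnHeapExecutionMemory", "OffHeapExecutionMemory"]

def executors_to_csv_rows(executors: list) -> list:
    """Extract per-executor peak memory metrics, defaulting missing ones to 0."""
    rows = []
    for ex in executors:
        peaks = ex.get("peakMemoryMetrics", {})
        row = {"id": ex["id"], "memoryUsed": ex.get("memoryUsed", 0)}
        row.update({f: peaks.get(f, 0) for f in FIELDS})
        rows.append(row)
    return rows

# Fabricated sample response for illustration only.
sample = [{"id": "0", "memoryUsed": 1375731,
           "peakMemoryMetrics": {"JVMHeapMemory": 6335076761,
                                 "OffHeapExecutionMemory": 4477503406}}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "memoryUsed"] + FIELDS)
writer.writeheader()
writer.writerows(executors_to_csv_rows(sample))
print(buf.getvalue())
```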

Configuration

  • Spark: 3.5.2 (docker image comet-bench)
  • Comet: Java 17, comet-baseline.jar
  • Gluten: Java 8, gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar
  • Cluster: 1 executor, 8 cores, 16 GiB executor memory, 16 GiB off-heap

Results (Executor 0)

| Engine | Wall-clock time |
| --- | --- |
| Comet | 80.03s |
| Gluten | 43.55s |
| Delta | +36.48s (Comet 1.84x slower) |

| Metric | Comet | Gluten | Delta |
| --- | --- | --- | --- |
| Peak memoryUsed | 1.31 MB | 1.91 MB | -605.79 KB (0.69x) |
| Peak JVMHeapMemory | 5.90 GB | 2.43 GB | +3.47 GB (2.43x) |
| Peak JVMOffHeapMemory | 123.64 MB | 116.36 MB | +7.27 MB (1.06x) |
| Peak OnHeapExecutionMemory | 0.00 B | 0.00 B | 0 |
| Peak OffHeapExecutionMemory | 4.17 GB | 2.79 GB | +1.38 GB (1.50x) |
| Peak OnHeapUnifiedMemory | 9.49 MB | 5.32 MB | +4.17 MB (1.78x) |
| Peak OffHeapUnifiedMemory | 4.17 GB | 2.79 GB | +1.38 GB (1.50x) |
| Peak ProcessTreeJVMRSSMemory | 0.00 B | 0.00 B | 0 |
| memoryUsed % of maxMemory | 0.01% | 0.01% | -0.00 pp |

maxMemory (executor 0): Comet = 25.42 GB | Gluten = 24.36 GB

Key Takeaways

  • Speed: Gluten finished in 43.55s vs Comet's 80.03s (1.84x faster on q21)
  • JVM Heap: Comet peaked at 5.90 GB on-heap — 2.43x higher than Gluten's 2.43 GB. This is the single largest memory difference.
  • Off-Heap Execution Memory: Comet used 4.17 GB vs Gluten's 2.79 GB (1.50x more), reflecting native engine working memory during q21's multi-way join + anti-join + aggregation.
  • JVM Off-Heap: Roughly comparable (~120 MB each, only 6% difference).
  • On-Heap Execution: Both 0 — expected since both engines execute natively off-heap.
  • ProcessTreeJVMRSSMemory: 0 for both (RSS tracking not enabled).

Notes

  • This is a single query (q21) on a single run — results may vary across queries and iterations.
  • The Docker image was updated to include both Java 8 (Gluten) and Java 17 (Comet), with a --name arg injection bugfix in run.py.

- Remove K8s manifests and k8s.conf profile (to be added in a follow-up
  issue once the core runner is merged)
- Add TPC-H (22) and TPC-DS (100) SQLBench query files under
  benchmarks/queries/ and embed them in the Docker image
- Update Dockerfile to COPY queries into /opt/benchmarks/queries/

TODO: file upstream issue for K8s support once PR apache#3534 is merged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TPC-H and TPC-DS query files use TPC Fair Use Policy headers,
not Apache License headers, so they must be excluded from both the
Maven RAT plugin and the release RAT script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove changed the title feat: unified benchmark runner with composable config [WIP] feat: unified benchmark runner with composable config [will not merge] Feb 16, 2026
@andygrove

This is now replaced by #3538 and #3539 which focus just on the TPC-* benchmarks. We can add other benchmarks to this framework later.

@andygrove andygrove closed this Feb 16, 2026
