ci: run TPC-H benchmarks on a Kind Kubernetes cluster #3549
Shekharrajak wants to merge 7 commits into apache:main
Branch updated from 9f78801 to 0b44de0.
.github/workflows/k8s_benchmark.yml
on:
  pull_request:
    paths:
      - "native/**/*.rs"
Only code changes will trigger this workflow.
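To reason about which changes would start the workflow, here is a small Python approximation of GitHub's path-filter globbing: `**/` matches zero or more directories and `*` matches within a single path segment. This is a sketch, not GitHub's actual matcher. `glob_to_regex` and `should_trigger` are hypothetical helpers, and the pattern list is taken from the trigger paths listed in the PR description.

```python
import re
from typing import Iterable

# Trigger patterns from the PR description (native Rust plus Spark Scala/Java sources).
TRIGGER_PATHS = ["native/**/*.rs", "spark/**/*.scala", "spark/**/*.java"]


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Approximate GitHub path-filter globbing with a regular expression.

    `**/` matches zero or more leading directories, `**` matches anything,
    and `*` matches any run of characters except `/`.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def should_trigger(changed_files: Iterable[str]) -> bool:
    """Return True if any changed file matches one of the trigger patterns."""
    regexes = [glob_to_regex(p) for p in TRIGGER_PATHS]
    return any(rx.match(path) for path in changed_files for rx in regexes)


if __name__ == "__main__":
    print(should_trigger(["native/core/src/lib.rs"]))  # a Rust source change
    print(should_trigger(["docs/source/index.md"]))    # a docs-only change
```

With these rules, a docs-only PR would not start the benchmark job, while any `.rs`, `.scala` or `.java` change under the listed roots would.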
env:
  RUST_VERSION: stable
  K8S_VERSION: "1.32.0"
  SPARK_VERSION: "3.5"
We can expand this to cover different Spark versions.
- name: Install K8s tools
  run: |
    curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.26.0/kind-linux-amd64"
This uses a stable Kind release (v0.26.0) to create the Kubernetes cluster.
print(f"Required: {min_speedup:.2f}x")
print("-" * 50)

if speedup >= min_speedup:
This check ensures that we do not regress performance. We can cover a range of workloads here: queries, joins, reads, and writes.
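The gate above can be sketched as a standalone Python function. The sketch assumes per-query wall-clock timings in seconds for each engine and computes speedup as total Spark time divided by total Comet time, compared against the 1.1x threshold from the PR description. `validate_speedup` and its input dictionaries are hypothetical; the PR's actual script is only partially shown.

```python
from typing import Dict, Tuple


def validate_speedup(
    spark_times: Dict[str, float],
    comet_times: Dict[str, float],
    min_speedup: float = 1.1,
) -> Tuple[float, bool]:
    """Compare total Spark vs. Comet runtime against a minimum speedup."""
    if set(spark_times) != set(comet_times):
        raise ValueError("Spark and Comet runs must cover the same queries")
    spark_total = sum(spark_times.values())
    comet_total = sum(comet_times.values())
    if spark_total <= 0 or comet_total <= 0:
        raise ValueError("total runtimes must be positive")
    speedup = spark_total / comet_total  # > 1.0 means Comet is faster
    print(f"Spark total: {spark_total:.2f}s")
    print(f"Comet total: {comet_total:.2f}s")
    print(f"Speedup:     {speedup:.2f}x")
    print(f"Required:    {min_speedup:.2f}x")
    print("-" * 50)
    return speedup, speedup >= min_speedup


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    spark = {"q1": 10.0, "q2": 12.0}
    comet = {"q1": 8.0, "q2": 10.0}
    speedup, passed = validate_speedup(spark, comet)
    print("PASS" if passed else "FAIL")
```

In a CI step, the script would exit non-zero (for example with `sys.exit(1)`) when `passed` is false, so the job fails. Summing runtimes weights long-running queries more heavily. A geometric mean of per-query speedups is a common alternative if every query should count equally.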
benchmarks/Dockerfile.k8s
# See the License for the specific language governing permissions and
# limitations under the License.

FROM apache/spark:3.5.8 AS builder
We will extend this to support multiple Spark versions.
---
apiVersion: v1
kind: ServiceAccount
A ServiceAccount and RBAC rules are needed so that the driver and executors can access the Kubernetes API.
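As an illustration, here is a Python sketch that builds a ServiceAccount, a namespaced Role and a RoleBinding as plain dictionaries. The account name `spark`, the namespace and the exact rule set are assumptions, not what the PR's manifest contains. The rules grant roughly what a Spark driver needs to manage executor pods, services and configmaps; a real setup may want them tightened. The driver would then be pointed at the account with `spark.kubernetes.authenticate.driver.serviceAccountName`.

```python
import json
from typing import Dict, List


def spark_rbac_manifests(namespace: str = "default", account: str = "spark") -> List[Dict]:
    """Return ServiceAccount, Role and RoleBinding manifests for a Spark driver."""
    role_name = f"{account}-role"  # hypothetical naming scheme
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": account, "namespace": namespace},
    }
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                # "" is the core API group (pods, services, configmaps, PVCs).
                "apiGroups": [""],
                "resources": ["pods", "services", "configmaps", "persistentvolumeclaims"],
                "verbs": ["create", "get", "list", "watch", "update", "patch",
                          "delete", "deletecollection"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{account}-role-binding", "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": account, "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": role_name},
    }
    return [service_account, role, binding]


if __name__ == "__main__":
    # Wrap in a v1 List so the output can be piped to `kubectl apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List",
                      "items": spark_rbac_manifests()}, indent=2))
```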
helm repo add spark-operator https://kubeflow.github.io/spark-operator 2>/dev/null || true
helm repo update

if helm list -n spark-operator 2>/dev/null | grep -q spark-operator; then
This uses kubeflow/spark-operator.
We could extend this to apache-spark-operator as well, but under the hood both run Spark components in a similar way.
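One subtlety in the install check above: `helm list -n spark-operator` prints a NAMESPACE column. Because of that, `grep -q spark-operator` would likely succeed for any release in that namespace, not just one named `spark-operator`. Below is a minimal Python sketch of an exact match on the NAME column. It assumes Helm's default tabular output with a header row, and `release_installed` is a hypothetical helper. In shell, `helm upgrade --install` sidesteps the check entirely by being idempotent.

```python
def release_installed(helm_list_output: str, release: str) -> bool:
    """Return True if `release` appears in the NAME column of `helm list` output."""
    rows = helm_list_output.strip().splitlines()
    for row in rows[1:]:  # the first row is the column header
        fields = row.split()
        if fields and fields[0] == release:
            return True
    return False


if __name__ == "__main__":
    # Sample output shaped like `helm list -n spark-operator` (columns truncated).
    sample = (
        "NAME          \tNAMESPACE     \tREVISION\n"
        "other-release \tspark-operator\t1\n"
    )
    print(release_installed(sample, "spark-operator"))  # namespace matches, name does not
```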
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
Worker node configuration for the Kind Kubernetes cluster.
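The node list can be generated rather than hand-written. This Python sketch emits a Kind cluster config with one control-plane node and a configurable number of workers. It optionally pins each node to `kindest/node:v<K8S_VERSION>`, on the assumption that the workflow's `K8S_VERSION` maps to a Kind node image tag. The worker count is illustrative, and `kind_cluster_config` is a hypothetical helper. Since JSON is valid YAML, the output can be saved and passed to `kind create cluster --config`. The Kind release notes list node images with digests, which are safer to pin than bare tags.

```python
import json
from typing import Dict, Optional


def kind_cluster_config(workers: int, k8s_version: Optional[str] = None) -> Dict:
    """Build a Kind cluster config: one control-plane node plus `workers` workers."""
    if workers < 0:
        raise ValueError("workers must be non-negative")

    def node(role: str) -> Dict:
        entry = {"role": role}
        if k8s_version:
            # Assumed mapping from K8S_VERSION to a Kind node image tag.
            entry["image"] = f"kindest/node:v{k8s_version}"
        return entry

    return {
        "kind": "Cluster",
        "apiVersion": "kind.x-k8s.io/v1alpha4",
        "nodes": [node("control-plane")] + [node("worker") for _ in range(workers)],
    }


if __name__ == "__main__":
    print(json.dumps(kind_cluster_config(workers=2, k8s_version="1.32.0"), indent=2))
```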
@andygrove, can I get access to trigger the workflow? This would help me validate changes more quickly.
Branch updated from 945c8c4 to 348ea4a.
on:
  pull_request:
    # paths:
Commented out so that the CI check gets triggered.
COPY --from=builder /comet/spark/target/comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}-*.jar $SPARK_HOME/jars/
ARG COMET_JAR
COPY ${COMET_JAR} $SPARK_HOME/jars/
Building from source in the Dockerfile takes a long time.
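With the `ARG COMET_JAR` approach, the JAR is built beforehand and its path is passed at image build time. Here is a Python sketch of that handoff. It picks the main Comet JAR from the build output, skipping sources, javadoc and test JARs that share the prefix, and builds a `docker build` command with `--build-arg COMET_JAR=...`. The path is made relative to the build context because `COPY` can only read files inside the context. The helper names, the context layout and the image tag are hypothetical; the JAR name pattern and the Dockerfile path come from the snippets above.

```python
import glob
import os
from typing import Iterable, List

# Secondary artifacts that may share the main JAR's name prefix (assumed set).
_EXCLUDED_SUFFIXES = ("-sources.jar", "-javadoc.jar", "-tests.jar")


def pick_main_jar(paths: Iterable[str]) -> str:
    """Return the single JAR that is not a sources/javadoc/tests artifact."""
    jars = sorted(p for p in paths if not p.endswith(_EXCLUDED_SUFFIXES))
    if len(jars) != 1:
        raise FileNotFoundError(f"expected exactly one main JAR, found {jars}")
    return jars[0]


def find_comet_jar(target_dir: str, spark_version: str, scala_version: str) -> str:
    """Locate the prebuilt Comet Spark JAR in a Maven target directory."""
    pattern = os.path.join(
        target_dir, f"comet-spark-spark{spark_version}_{scala_version}-*.jar"
    )
    return pick_main_jar(glob.glob(pattern))


def docker_build_command(jar_path: str, context_dir: str, tag: str) -> List[str]:
    """Build a `docker build` invocation that passes the JAR via COMET_JAR."""
    rel = os.path.relpath(jar_path, context_dir)
    if rel.startswith(".."):
        # COPY can only read files inside the build context.
        raise ValueError(f"{jar_path} is outside the build context {context_dir}")
    return [
        "docker", "build",
        "-f", os.path.join(context_dir, "benchmarks", "Dockerfile.k8s"),
        "--build-arg", f"COMET_JAR={rel.replace(os.sep, '/')}",
        "-t", tag,
        context_dir,
    ]
```

Usage would look like `docker_build_command(find_comet_jar("spark/target", "3.5", "2.12"), ".", "comet-k8s-bench:dev")`, where the Scala version and tag are placeholders.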
Branch updated from ac4eed3 to d65fffd.
I don't have a way to do that, I'm afraid.
@Shekharrajak Thanks for looking at this, but I have some concerns about this approach. CI already takes far too long, and adding this workflow would add even more overhead. I am not sure that GitHub runners will provide consistent enough performance for benchmarking, and we really need to test with large data sets to get meaningful results. Committers can already trigger TPC-H benchmarks at 100 GB by commenting on PRs. This is quite new and experimental, so it hasn't been documented yet. These benchmarks run on dedicated hardware.
So should we expect a script that helps run the benchmarks on a local Kubernetes cluster?
The benchmarks already run in k8s, but in local mode rather than truly distributed. I am planning to make that change, and I also need to align it with the benchmark scripts currently in the repo, which I am refactoring in #3538 and #3539. I planned to support k8s as a future step after the docker-compose one is merged.
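The difference between local mode and a truly distributed run comes down mostly to how the job is submitted. This Python sketch builds `spark-submit` arguments for both cases. In local mode, `local[*]` runs the driver and executors in a single JVM. In Kubernetes cluster mode, the driver and executors run as separate pods. The configuration keys are standard Spark on Kubernetes settings. The API server URL, image name, executor count, service account and application path in the example are placeholders, and `spark_submit_args` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def spark_submit_args(
    app: str,
    mode: str = "local",
    api_server: Optional[str] = None,
    image: Optional[str] = None,
    executors: int = 2,
    namespace: str = "default",
    service_account: str = "spark",
) -> List[str]:
    """Build spark-submit arguments for local mode or Kubernetes cluster mode."""
    if mode == "local":
        # Driver and executors share one JVM; local[*] uses all available cores.
        master = ["--master", "local[*]"]
    elif mode == "k8s":
        if not api_server or not image:
            raise ValueError("k8s mode needs api_server and image")
        # Driver and executors run as separate pods scheduled by Kubernetes.
        master = [
            "--master", f"k8s://{api_server}",
            "--deploy-mode", "cluster",
            "--conf", f"spark.kubernetes.container.image={image}",
            "--conf", f"spark.kubernetes.namespace={namespace}",
            "--conf", f"spark.kubernetes.authenticate.driver.serviceAccountName={service_account}",
            "--conf", f"spark.executor.instances={executors}",
        ]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["spark-submit", *master, app]


if __name__ == "__main__":
    # local:// refers to a path inside the container image (placeholder path).
    print(shlex.join(spark_submit_args(
        "local:///opt/bench/tpch.py", mode="k8s",
        api_server="https://127.0.0.1:6443", image="comet-bench:dev",
    )))
```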
Thanks for sharing. In that case, I think no more work is required for #3537 until those PRs are merged. We can close this PR for now.
Which issue does this PR close?
Adds a GitHub CI workflow that runs TPC-H benchmarks on a Kind Kubernetes cluster and validates that Comet achieves at least a 1.1x speedup over the Spark baseline.
Closes #3537
Rationale for this change
Run the Spark baseline benchmark
Run the Comet benchmark
Validate that the speedup is ≥ 1.1x (a 10% improvement)
What changes are included in this PR?
The workflow triggers on PRs modifying:
native/**/*.rs
spark/**/*.scala
spark/**/*.java
How are these changes tested?
Local Testing