Kubernetes Job service #5113
base: master
7edfb48 to
b271b83
Compare
jonathanmetzman
left a comment
left some very surface level comments.
d5684e5 to
44737d3
Compare
This commit introduces the Kubernetes job client and service, providing a mechanism to schedule tasks on Kubernetes clusters (including GKE and Kind), supporting both standard and Kata Containers.
Key Features & Changes:
- **Kubernetes Service**: Implemented `KubernetesService` in `clusterfuzz._internal.k8s.service` to manage job creation.
- **Kata Support**: Added specialized job creation for Kata Containers (`create_kata_container_job`) with required security context (`privileged`, `capabilities: ALL`), networking (`hostNetwork: True`), and environment variables (`HOST_UID`).
- **Dependency Management**: Added `kubernetes` and necessary Google Cloud dependencies (`google-api-python-client`, `google-cloud-storage`, `google-cloud-ndb`, etc.) to `Pipfile`.
- **E2E Testing**:
  - Created `tests.core.k8s.k8s_service_e2e_test` to verify job lifecycle on a local Kind cluster.
  - Updated `local/tests/kubernetes_e2e_test.bash` to provision the test environment.
  - Updated CI workflow (`.github/workflows/kubernetes-e2e-tests.yaml`) to install JDK 21 (required for Datastore emulator).
  - Tests now verify job "Running" status to avoid timeouts with long-running commands.
  - `KubernetesService` skips default credential loading when `K8S_E2E` is set to utilize the test-provided kubeconfig.
- **Unit Tests**: Added comprehensive unit tests in `tests.core.k8s.k8s_service_test` and `tests.core.kubernetes.kubernetes_test`, including mocking of `load_kube_config` and `_load_gke_credentials` to ensure robust testing without external dependencies.
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
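To make the Kata settings listed in the commit message concrete, here is a minimal sketch of a job manifest builder. It is not the PR's actual `create_kata_container_job` code: the function name `build_job_manifest`, the `runtimeClassName` value `"kata"`, the container name, and the default `HOST_UID` value are all assumptions for illustration. Only the `privileged`, `capabilities: ALL`, `hostNetwork: True`, and `HOST_UID` settings come from the description above.

```python
def build_job_manifest(name, image, command, use_kata=False, host_uid="1337"):
  """Returns a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
  container = {
      "name": "task",  # Assumed container name.
      "image": image,
      "command": list(command),
  }
  pod_spec = {"restartPolicy": "Never", "containers": [container]}

  if use_kata:
    # Assumption: the cluster exposes a RuntimeClass named "kata".
    pod_spec["runtimeClassName"] = "kata"
    # Settings named in the commit message for Kata jobs.
    pod_spec["hostNetwork"] = True
    container["securityContext"] = {
        "privileged": True,
        "capabilities": {"add": ["ALL"]},
    }
    # The default value is a placeholder, not taken from the PR.
    container["env"] = [{"name": "HOST_UID", "value": str(host_uid)}]

  return {
      "apiVersion": "batch/v1",
      "kind": "Job",
      "metadata": {"name": name},
      "spec": {"backoffLimit": 0, "template": {"spec": pod_spec}},
  }


standard = build_job_manifest("fuzz-task", "example/image", ["run"])
kata = build_job_manifest("fuzz-task-kata", "example/image", ["run"],
                          use_kata=True)
```

A single builder with a `use_kata` switch keeps the standard and Kata paths from drifting apart, at the cost of a conditional block inside the function.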
b771b50 to
ae2e936
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
pip install pipenv

# Install dependencies.
pipenv --python 3.11
pipenv install

class KubernetesJobClient(RemoteTaskInterface):
  """A remote task execution client for Kubernetes.
  This class is a placeholder for a future implementation of a remote task
  execution client that uses Kubernetes. It is not yet implemented.
  """

./local/install_deps.bash
This is only intended to be used in CI?
Yes!
# If we get here the task succeeded in running. Acknowledge the message.
self._pubsub_message.ack()
if not self.do_not_ack:
What is this change?
It's part of the job limiter for the Kubernetes service. We can probably use this to implement the job limiter for Batch as well, using the new feature they implemented for us. The rationale is that if a task cannot be scheduled on Kubernetes because the job limit has already been reached, the message should not be acked, which allows another adapter, such as Batch, to process it.
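A rough sketch of that flow, for readers following along. Only `do_not_ack` and `_pubsub_message` appear in the diff; `FakeMessage`, `PubSubTask`, `finish`, and `schedule_on_kubernetes` are hypothetical names used to keep the example self-contained, and the real limiter reads its state from elsewhere.

```python
class FakeMessage:
  """Stand-in for a Pub/Sub message; records whether ack() was called."""

  def __init__(self):
    self.acked = False

  def ack(self):
    self.acked = True


class PubSubTask:
  """Hypothetical task wrapper mirroring the guarded ack in the diff."""

  def __init__(self, pubsub_message):
    self._pubsub_message = pubsub_message
    self.do_not_ack = False

  def finish(self):
    # Only acknowledge when no adapter asked to leave the message unacked.
    if not self.do_not_ack:
      self._pubsub_message.ack()


def schedule_on_kubernetes(task, pending_jobs, limit):
  """Returns True if scheduled; otherwise marks the task as not-to-ack."""
  if pending_jobs >= limit:
    # Limit reached: leave the message unacked so it can be redelivered
    # and picked up by another adapter, such as Batch.
    task.do_not_ack = True
    return False
  return True


over_limit_message = FakeMessage()
over_limit_task = PubSubTask(over_limit_message)
scheduled = schedule_on_kubernetes(over_limit_task, pending_jobs=100, limit=100)
over_limit_task.finish()
```

In this sketch the over-limit task finishes without acking, so the message stays available for redelivery.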
@@ -0,0 +1,61 @@
# Copyright 2026 Google LLC
What's the difference between this and the next template?
It might have been a good idea to consider knative instead of rebuilding batch.
Initially I created separate templates for raw Kubernetes Jobs and for Jobs over Kata, but I have since updated it to use a single template with conditionals.
As for Knative, it seems good, but I wouldn't like to tackle it in this PR since the current approach is working fine as is. We should definitely explore it, though.
This is cool. I might have tried Cloud Run before Kata because 1. it is probably less management, and 2. it might be more performant because, as far as I know, it doesn't use nested virtualization.
Are we using preemptibles, btw?
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
eb601c2 to
092e27f
Compare
8adc287 to
9b3a503
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
6040f7d to
941c7d1
Compare
src/clusterfuzz/_internal/tests/core/k8s/k8s_service_e2e_test.py
Outdated
# If we get here the task succeeded in running. Acknowledge the message.
self._pubsub_message.ack()
if not self.do_not_ack:
IMO readability would be improved by using ack instead of do_not_ack (go/tott/764).
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
9ba397a to
5a92336
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
74cf7da to
5a3a5a2
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
7c57a49 to
1aa9ae5
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
a5b0d43 to
f4e61b7
Compare
Signed-off-by: Javan Lacerda <javanlacerda@google.com>
30cc0ae to
506f583
Compare
This PR introduces full support for scheduling and managing fuzzing tasks on Kubernetes clusters,
specifically targeting GKE. It implements a new KubernetesService to
handle batch job creation, supports Kata Containers for isolation, and includes robust testing
and configuration mechanisms.
Key Features:
Jobs. It supports both standard and Kata Container runtimes, automatic Service Account
creation with Workload Identity, and intelligent job limiting to prevent cluster overload.
routes tasks between the legacy GCP Batch service and the new Kubernetes service based on
configurable probabilities, allowing for a gradual, controlled migration.
behaviors like job concurrency limits.
Detailed Changes by Module:
Kubernetes Integration (src/clusterfuzz/_internal/k8s/):
monitoring, limiting). Includes GKE credential loading, Kata Container spec generation, and Service Account provisioning.
k8s_service_e2e_test.py (integration on Kind).
Remote Task Management (src/clusterfuzz/_internal/remote_task/):
RemoteTaskInterface. It initializes both GcpBatchService and KubernetesService and distributes tasks between them based on probabilities defined in job_frequency.py. This enables traffic splitting (e.g., 10% to K8s, 90% to Batch) for safe rollout.
abstractions.
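The probability-based traffic split described above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: `choose_service` and the `"kubernetes"` / `"gcp_batch"` keys are hypothetical, and the real weights live in job_frequency.py.

```python
import random


def choose_service(weights, rng=random):
  """Picks a service name at random, proportionally to its weight.

  `weights` maps a service name to a non-negative weight, for example
  {"kubernetes": 0.1, "gcp_batch": 0.9} for a 10%/90% split.
  """
  names = list(weights)
  return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# A seeded generator keeps the example reproducible.
rng = random.Random(0)
split = {"kubernetes": 0.1, "gcp_batch": 0.9}
picks = [choose_service(split, rng) for _ in range(10000)]
kubernetes_share = picks.count("kubernetes") / len(picks)
```

With weights like these, roughly one task in ten goes to Kubernetes, and raising the weight over time shifts more traffic without any code change.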
Datastore & Configuration (src/clusterfuzz/_internal/datastore/):
K8S_PENDING_JOBS_LIMITER).
Batch & Legacy Refactoring (src/clusterfuzz/_internal/batch/):
structure.
Infrastructure & CI:
cluster.
Bot & Metrics:
gate.
Evidence: