Feat: Enhance GPU Tests, Docs & Private Cluster Setup #181
base: main
Conversation
Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the Dataproc environment reproduction scripts by introducing a robust framework for GPU testing, enhancing the clarity and completeness of the documentation, and improving the reliability and idempotency of private cluster setups utilizing Secure Web Proxy. These changes aim to streamline development, testing, and deployment workflows for GPU-accelerated and secure Dataproc environments.
Code Review
This pull request significantly enhances the Dataproc reproduction scripts by adding comprehensive GPU testing, overhauling the documentation for clarity, and improving the setup for private clusters with Secure Web Proxy. The changes are well-structured, particularly the move to versioned resources for SWP components, which improves idempotency. My review focuses on improving script robustness, maintainability, and clarity. Key suggestions include refactoring duplicated logic, ensuring IAM permissions are applied robustly, and cleaning up test scripts and documentation examples for better readability and correctness. Overall, this is a solid set of improvements.
```diff
   --master-accelerator "type=${M_ACCELERATOR_TYPE}" \
   --worker-accelerator "type=${PRIMARY_ACCELERATOR_TYPE}" \
   --secondary-worker-accelerator "type=${SECONDARY_ACCELERATOR_TYPE}" \
-  --master-machine-type "${MASTER_MACHINE_TYPE}" \
+  --master-machine-type "${M_MACHINE_TYPE}" \
```
While these variable renames improve consistency, this file (shared-functions.sh) appears to contain a significant amount of logic that is either duplicated or superseded by the functions in the more specific lib/ subdirectories (e.g., lib/dataproc/cluster.sh). For instance, it defines its own create_dpgce_cluster. To improve maintainability and reduce confusion, this entire file should be evaluated for refactoring or deprecation, with its functionality fully migrated to the newer, more modular scripts.
gcloud/README.md (Outdated)
```bash
# After building the image using scripts in custom-images/examples/secure-boot/

# In lib/dataproc/cluster.sh
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
```
The example for CLUSTER_IMAGE contains a hardcoded project ID (cjac-2025-01). For better consistency and usability, it would be clearer to use a placeholder like YOUR_PROJECT, similar to the example on line 110.
Suggested change:

```diff
- CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
+ CLUSTER_IMAGE="projects/YOUR_PROJECT/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
```
```diff
  if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-   master_num="$1"
-   echo "master_num: $master_num"
-   MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
+   m_num="$1"
+   echo "m_num: $m_num"
+   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
    shift 1
  else
-   MASTER_HOSTNAME="${MASTER_HOSTNAME}"
+   M_HOSTNAME="${M_HOSTNAME}"
  fi
```
The logic to detect the node index can be improved for readability and robustness. The perl command can be replaced with a more idiomatic and native bash regular expression. Additionally, the else block is redundant and can be removed. While the ambiguity of a numeric filename vs. an index remains, these changes make the script cleaner.
Suggested change:

```diff
- if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-   m_num="$1"
-   echo "m_num: $m_num"
-   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
-   shift 1
- else
-   M_HOSTNAME="${M_HOSTNAME}"
- fi
+ if [[ "$1" =~ ^[0-9]+$ ]]; then
+   m_num="$1"
+   echo "m_num: $m_num"
+   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
+   shift 1
+ fi
```
There is no condition for setting the M_HOSTNAME value when the regex doesn't match.
```bash
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20251108-180659-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-002929-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-tf"
```
This section contains a large number of commented-out image URIs. This makes the active configuration difficult to parse and increases maintenance overhead. To improve clarity, consider moving these examples to a separate documentation file (like docs/dataproc_cluster_examples.md) or removing them if they are for temporary development purposes.
Let's mark these as to be removed before release. Until then, they are helping me to recall which images function properly.
```python
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

#is_cuda_gpu_min_3 = tf.config.list_physical_devices('GPU',True, (3,0))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3,0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))
```
Let's see if we can verify that both the deprecated code and the code that replaces it work. That seems like a reasonable way to do this for now, right? In the next release we can change the default provider and allow the use of the old provider with an argument or something, but eventually the plan is to remove that code from the codebase, because it has been fully replaced in all of its use cases by another chunk of code.
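For illustration only (not code from this PR), a check along those lines could exercise both the deprecated `tf.test.is_gpu_available` call and its documented replacement, `tf.config.list_physical_devices`, and report whether they agree. This is a minimal sketch assuming TensorFlow is installed on the node:

```python
# Minimal sketch (not from this PR): compare the deprecated GPU check with its replacement.
import tensorflow as tf

# Replacement API: enumerate the physical GPU devices TensorFlow can see.
gpus = tf.config.list_physical_devices("GPU")
print("tf.config.list_physical_devices('GPU'):", gpus)

# Deprecated API: boolean CUDA availability check (emits a deprecation warning).
legacy_available = tf.test.is_gpu_available(cuda_only=True)
print("tf.test.is_gpu_available(cuda_only=True):", legacy_available)

# The two paths should agree on whether at least one GPU is visible.
print("APIs agree:", bool(gpus) == bool(legacy_available))
```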
```bash
echo gcloud dataproc jobs submit pyspark \
  --properties="spark:spark.executor.resource.gpu.amount=1" \
  --properties="spark:spark.task.resource.gpu.amount=1" \
  --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
echo gcloud dataproc jobs submit spark \
  --project "${PROJECT_ID}" \
  --cluster="${CLUSTER_NAME}" \
  --region "${REGION}" \
  --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
  --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
  --properties \
"spark.driver.resource.gpu.amount=1,"\
"spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
"spark.executor.resource.gpu.amount=1,"\
"spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"
```
These gcloud dataproc jobs submit commands are only printed to the console via echo, not executed. This can be misleading for anyone running the script. If they are intended to be examples, they should be commented out with an explanation. If they are meant to be executed, the echo should be removed.
```bash
# Example of submitting a PySpark job with GPU resources:
# gcloud dataproc jobs submit pyspark \
#   --properties="spark:spark.executor.resource.gpu.amount=1" \
#   --properties="spark:spark.task.resource.gpu.amount=1" \
#   --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
#   --cluster=${CLUSTER_NAME} \
#   --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"

# Example of submitting a Spark job with GPU resources for driver and executor:
# gcloud dataproc jobs submit spark \
#   --project "${PROJECT_ID}" \
#   --cluster="${CLUSTER_NAME}" \
#   --region "${REGION}" \
#   --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
#   --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
#   --properties \
#     "spark.driver.resource.gpu.amount=1,"\
#     "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
#     "spark.executor.resource.gpu.amount=1,"\
#     "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"
```
Yeah, I haven't verified whether they work or not. I'm busy, and the tests that I am running are passing. I don't know if I want to add new ones right this VERY second.
This set of changes introduces significant enhancements to the Dataproc environment reproduction scripts, focusing on documentation, private cluster setups with Secure Web Proxy (SWP), GPU testing, and code clarity.
**Key Changes:**
1. **README.md Overhaul:**
* The README has been substantially restructured and expanded to improve clarity and usability.
* A "Prerequisites" section was added.
* The "Setup" section now provides more detail on `env.json` variables.
* The "Scripts" section is now better organized into "Environment Management," "Audit Scripts," and "Utility Scripts," with more descriptive entries for each script, including the new test scripts.
* The "Using Custom Images" section was added to guide users on building specialized images, particularly for Secure Boot, pointing to the `custom-images` repository.
* A "Script Details" section provides deeper explanations of the `create-*`, `destroy-*`, and `recreate-*` scripts.
* Language has been updated to use "-m node" instead of potentially problematic terms.
2. **New Test Scripts (`t/` directory):**
* `t/scripts/verify_tensorflow.py`: A Python script to check TensorFlow GPU detection.
* `t/scripts/verify_torch.py`: A Python script to check PyTorch CUDA availability and GPU details (a sketch of this kind of check appears after this summary).
* `t/spark-gpu-test.sh`: A new test runner script that:
* Copies the `t/scripts` directory to the cluster's -m node using `bin/scp-m`.
* Executes the Python GPU verification scripts on the -m node via SSH.
* Runs a series of Spark jobs (SparkPi and JavaIndexToStringExample) with various GPU-related properties to test GPU allocation and functionality within YARN.
3. **Script Modifications:**
* **`bin/destroy-dpgce-private`**: Adjusted the order of resource deletion for the SWP teardown.
* **`bin/recreate-dpgce`**: Enabled the `gsutil cp` command to ensure initialization actions are always updated from the local `init/` directory during cluster recreation.
* **`bin/scp-m`**: Updated usage text to reflect its purpose of copying to the `-m` node.
* **`bin/ssh-m`**: Renamed internal variables for clarity (e.g., `M_HOSTNAME`).
* **`lib/dataproc/cluster.sh`**: Renamed environment variables (`MASTER_MACHINE_TYPE` to `M_MACHINE_TYPE`, `MASTER_ACCELERATOR_TYPE` to `M_ACCELERATOR_TYPE`) for consistency. Includes many commented-out options, indicating active development and testing of various cluster configurations (proxy, image versions, secure boot).
* **`lib/env.sh`**: Updated to use the new `M_` prefixed variables and added many new variables to support the private cluster and SWP configurations.
* **`lib/shared-functions.sh`**: Adapted to the variable renames in `env.sh`.
* **`lib/swp/certs.sh`**: Enhanced to support versioned certificates and sentinels, making SWP component management more robust and idempotent per run.
* **`lib/swp/policy.sh`**: Added export for `SWP_POLICY_URI_PARTIAL`.
In essence, this change adds comprehensive GPU validation, improves documentation, refines the private cluster setup, and makes various scripts more robust.
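For orientation only, a minimal PyTorch CUDA probe in the spirit of `t/scripts/verify_torch.py` might look like the sketch below; the exact contents and output format of the script in this PR may differ:

```python
# Illustrative sketch of a PyTorch CUDA availability check, similar in spirit to
# t/scripts/verify_torch.py; the actual script in this PR may differ.
import torch

cuda_available = torch.cuda.is_available()
print("torch.cuda.is_available():", cuda_available)

if cuda_available:
    count = torch.cuda.device_count()
    print("torch.cuda.device_count():", count)
    for i in range(count):
        # Report the model name of each visible GPU.
        print(f"device {i}:", torch.cuda.get_device_name(i))
```

On a node with the NVIDIA driver and a CUDA-enabled PyTorch build, this prints the visible device count and model names; on a CPU-only node it simply reports that CUDA is unavailable.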