Conversation


@cjac commented Jan 27, 2026

This set of changes introduces significant enhancements to the Dataproc environment reproduction scripts, focusing on documentation, private cluster setups with Secure Web Proxy (SWP), GPU testing, and code clarity.

Key Changes:

  1. README.md Overhaul:

    • The README has been substantially restructured and expanded to improve clarity and usability.
    • A "Prerequisites" section was added.
    • The "Setup" section now provides more detail on env.json variables.
    • The "Scripts" section is now better organized into "Environment Management," "Audit Scripts," and "Utility Scripts," with more descriptive entries for each script, including the new test scripts.
    • The "Using Custom Images" section was added to guide users on building specialized images, particularly for Secure Boot, pointing to the custom-images repository.
    • A "Script Details" section provides deeper explanations of the create-*, destroy-*, and recreate-* scripts.
    • Language has been updated to use "-m node" instead of potentially problematic terms.
  2. New Test Scripts (t/ directory):

    • t/scripts/verify_tensorflow.py: A Python script to check TensorFlow GPU detection.
    • t/scripts/verify_torch.py: A Python script to check PyTorch CUDA availability and GPU details.
    • t/spark-gpu-test.sh: A new test runner script (see the sketch following this list) that:
      • Copies the t/scripts directory to the cluster's -m node using bin/scp-m.
      • Executes the Python GPU verification scripts on the -m node via SSH.
      • Runs a series of Spark jobs (SparkPi and JavaIndexToStringExample) with various GPU-related properties to test GPU allocation and functionality within YARN.
  3. Script Modifications:

    • bin/destroy-dpgce-private: Adjusted the order of resource deletion for the SWP teardown.
    • bin/recreate-dpgce: Enabled the gsutil cp command to ensure initialization actions are always updated from the local init/ directory during cluster recreation.
    • bin/scp-m: Updated usage text to reflect its purpose of copying to the -m node.
    • bin/ssh-m: Renamed internal variables for clarity (e.g., M_HOSTNAME).
    • lib/dataproc/cluster.sh: Renamed environment variables (MASTER_MACHINE_TYPE to M_MACHINE_TYPE, MASTER_ACCELERATOR_TYPE to M_ACCELERATOR_TYPE) for consistency. Includes many commented-out options, indicating active development and testing of various cluster configurations (proxy, image versions, secure boot).
    • lib/env.sh: Updated to use the new M_ prefixed variables and added many new variables to support the private cluster and SWP configurations.
    • lib/shared-functions.sh: Adapted to the variable renames in env.sh.
    • lib/swp/certs.sh: Enhanced to support versioned certificates and sentinels, making SWP component management more robust and idempotent per run.
    • lib/swp/policy.sh: Added export for SWP_POLICY_URI_PARTIAL.
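
For orientation, here is a minimal sketch of the flow t/spark-gpu-test.sh is described as performing above; the exact bin/scp-m and bin/ssh-m argument forms, the script paths on the -m node, and the Spark properties shown are assumptions for illustration, not excerpts from the scripts:

# Sketch only: copy the verification scripts to the -m node and run them there.
bin/scp-m -r t/scripts                              # argument form assumed
bin/ssh-m 'python3 scripts/verify_tensorflow.py'    # TensorFlow GPU detection
bin/ssh-m 'python3 scripts/verify_torch.py'         # PyTorch CUDA availability

# Sketch only: one representative GPU-enabled Spark job (SparkPi) submitted through YARN.
gcloud dataproc jobs submit spark \
  --cluster="${CLUSTER_NAME}" --region="${REGION}" \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  --properties="spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1" \
  -- 1000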

In essence, this change adds comprehensive GPU validation, improves documentation, refines the private cluster setup, and makes various scripts more robust.

@gemini-code-assist

Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the Dataproc environment reproduction scripts by introducing a robust framework for GPU testing, enhancing the clarity and completeness of the documentation, and improving the reliability and idempotency of private cluster setups utilizing Secure Web Proxy. These changes aim to streamline development, testing, and deployment workflows for GPU-accelerated and secure Dataproc environments.

Highlights

  • Enhanced GPU Testing Framework: Introduced new Python scripts (verify_tensorflow.py, verify_torch.py) to check GPU detection and CUDA availability, along with a comprehensive spark-gpu-test.sh script to run Spark jobs with various GPU configurations on YARN.
  • Improved Documentation: The README.md has been significantly expanded and reorganized, adding sections for prerequisites, detailed env.json variables, structured script categories, guidance on using custom images (especially for Secure Boot), and in-depth explanations of core create-*, destroy-*, and recreate-* scripts.
  • Robust Private Cluster Setup (SWP): Modifications to lib/swp/certs.sh introduce versioned certificates and sentinels for Secure Web Proxy (SWP) components, making their creation and deletion more robust and idempotent (see the sketch following this list). The destroy-dpgce-private script now handles SWP teardown more effectively by adjusting the order of resource deletion.
  • Script Refinements and Consistency: Key scripts like bin/scp-m, bin/ssh-m, lib/dataproc/cluster.sh, and lib/env.sh have undergone internal variable renames (e.g., MASTER_MACHINE_TYPE to M_MACHINE_TYPE) for better consistency. bin/recreate-dpgce now reliably updates initialization actions to GCS.
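
To illustrate the versioned-certificate-and-sentinel pattern mentioned above, here is a minimal sketch; the variable names, sentinel path, and the create_swp_certificate helper are hypothetical and not taken from lib/swp/certs.sh:

# Hypothetical sketch: give each run its own certificate version and guard creation
# with a sentinel so repeated runs are idempotent and old versions can be torn down by name.
CERT_VERSION="$(date +%Y%m%d%H%M%S)"
CERT_NAME="swp-cert-${CERT_VERSION}"
SENTINEL="${TMPDIR:-/tmp}/${CERT_NAME}.created"

if [[ ! -f "${SENTINEL}" ]]; then
  create_swp_certificate "${CERT_NAME}"   # hypothetical helper wrapping the actual gcloud call
  touch "${SENTINEL}"
fi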


@gemini-code-assist (bot) left a comment


Code Review

This pull request significantly enhances the Dataproc reproduction scripts by adding comprehensive GPU testing, overhauling the documentation for clarity, and improving the setup for private clusters with Secure Web Proxy. The changes are well-structured, particularly the move to versioned resources for SWP components, which improves idempotency. My review focuses on improving script robustness, maintainability, and clarity. Key suggestions include refactoring duplicated logic, ensuring IAM permissions are applied robustly, and cleaning up test scripts and documentation examples for better readability and correctness. Overall, this is a solid set of improvements.

Comment on lines +26 to +29
  --master-accelerator "type=${M_ACCELERATOR_TYPE}" \
  --worker-accelerator "type=${PRIMARY_ACCELERATOR_TYPE}" \
  --secondary-worker-accelerator "type=${SECONDARY_ACCELERATOR_TYPE}" \
- --master-machine-type "${MASTER_MACHINE_TYPE}" \
+ --master-machine-type "${M_MACHINE_TYPE}" \


Severity: high

While these variable renames improve consistency, this file (shared-functions.sh) appears to contain a significant amount of logic that is either duplicated or superseded by the functions in the more specific lib/ subdirectories (e.g., lib/dataproc/cluster.sh). For instance, it defines its own create_dpgce_cluster. To improve maintainability and reduce confusion, this entire file should be evaluated for refactoring or deprecation, with its functionality fully migrated to the newer, more modular scripts.
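
For instance, one direction for such a consolidation (a sketch only, assuming shared-functions.sh lives in lib/ alongside the dataproc/ subdirectory; not code from this PR) would be to source the modular implementation instead of keeping a second create_dpgce_cluster definition:

# lib/shared-functions.sh (sketch): delegate to the modular library rather than duplicating it.
_LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${_LIB_DIR}/dataproc/cluster.sh"   # assumed to provide create_dpgce_cluster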

gcloud/README.md Outdated
# After building the image using scripts in custom-images/examples/secure-boot/

# In lib/dataproc/cluster.sh
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"


Severity: medium

The example for CLUSTER_IMAGE contains a hardcoded project ID (cjac-2025-01). For better consistency and usability, it would be clearer to use a placeholder like YOUR_PROJECT, similar to the example on line 110.

Suggested change
CLUSTER_IMAGE="projects/cjac-2025-01/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"
CLUSTER_IMAGE="projects/YOUR_PROJECT/global/images/dataproc-2-2-deb12-20260122-213213-secure-boot"

Comment on lines 26 to 33
  if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-   master_num="$1"
-   echo "master_num: $master_num"
-   MASTER_HOSTNAME="${MASTER_HOSTNAME}-${master_num}"
+   m_num="$1"
+   echo "m_num: $m_num"
+   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
    shift 1
  else
-   MASTER_HOSTNAME="${MASTER_HOSTNAME}"
+   M_HOSTNAME="${M_HOSTNAME}"
  fi


Severity: medium

The logic to detect the node index can be improved for readability and robustness. The perl command can be replaced with a more idiomatic and native bash regular expression. Additionally, the else block is redundant and can be removed. While the ambiguity of a numeric filename vs. an index remains, these changes make the script cleaner.

Suggested change
- if [[ $(perl -e "print 1 if q{$1} =~ /^\d+$/") == "1" ]]; then
-   m_num="$1"
-   echo "m_num: $m_num"
-   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
-   shift 1
- else
-   M_HOSTNAME="${M_HOSTNAME}"
- fi
+ if [[ "$1" =~ ^[0-9]+$ ]]; then
+   m_num="$1"
+   echo "m_num: $m_num"
+   M_HOSTNAME="${M_HOSTNAME}-${m_num}"
+   shift 1
+ fi

@cjac (author) replied:

There is no condition for setting the M_HOSTNAME value when the regex doesn't match.
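
For completeness, a variant that adopts the bash regex while keeping an explicit default branch might look like this (a sketch, not code from the PR):

if [[ "$1" =~ ^[0-9]+$ ]]; then
  m_num="$1"
  echo "m_num: $m_num"
  M_HOSTNAME="${M_HOSTNAME}-${m_num}"
  shift 1
else
  : # no index argument given; M_HOSTNAME keeps the value it was initialized with
fi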

Comment on lines +76 to +81
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20251108-180659-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-002929-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-20260121-213919-tf"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-tf"


Severity: medium

This section contains a large number of commented-out image URIs. This makes the active configuration difficult to parse and increases maintenance overhead. To improve clarity, consider moving these examples to a separate documentation file (like docs/dataproc_cluster_examples.md) or removing them if they are for temporary development purposes.

@cjac (author) replied:

Let's mark these as to be removed before release. Until then, they are helping me to recall which images function properly.
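
One lightweight way to do that (a sketch; the marker text is arbitrary) is to tag the block so it is easy to grep for before release:

# TODO(remove-before-release): development-only image URIs, kept to recall which images work.
# --image "projects/${PROJECT_ID}/global/images/dataproc-2-3-deb12-20260127-175333-secure-boot"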

Comment on lines 14 to 21
is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

#is_cuda_gpu_min_3 = tf.config.list_physical_devices('GPU',True, (3,0))
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3,0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))


Severity: medium

The function tf.test.is_gpu_available() is deprecated in TensorFlow 2.x and should be avoided. The check for available GPUs is already performed on line 10 using the recommended tf.config.list_physical_devices('GPU'). These lines can be removed to simplify and modernize the script.

@cjac (author) replied:

Let's see if we can verify that both the deprecated code and the code that replaces it work; that seems like a reasonable way to do this for now, right? In the next release we can change the default provider and allow use of the old provider via an argument or something, and eventually remove that code from the codebase once it has been fully replaced in all of its use cases.
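
A minimal way to exercise both calls side by side during that verification might be the following, run on the -m node (a sketch; assumes TensorFlow 2.x, where tf.test.is_gpu_available is deprecated but typically still present and emits a warning):

python3 - <<'EOF'
import tensorflow as tf
# Recommended API
print("list_physical_devices('GPU'):", tf.config.list_physical_devices('GPU'))
# Deprecated API, kept temporarily for comparison
print("is_gpu_available(cuda_only=True):", tf.test.is_gpu_available(cuda_only=True))
EOF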

Comment on lines +24 to +42
echo gcloud dataproc jobs submit pyspark \
--properties="spark:spark.executor.resource.gpu.amount=1" \
--properties="spark:spark.task.resource.gpu.amount=1" \
--properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
--cluster=${CLUSTER_NAME} \
--region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
echo gcloud dataproc jobs submit spark \
--project "${PROJECT_ID}" \
--cluster="${CLUSTER_NAME}" \
--region "${REGION}" \
--jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
--class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
--properties \
"spark.driver.resource.gpu.amount=1,"\
"spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
"spark.executor.resource.gpu.amount=1,"\
"spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"


Severity: medium

These gcloud dataproc jobs submit commands are only printed to the console via echo, not executed. This can be misleading for anyone running the script. If they are intended to be examples, they should be commented out with an explanation. If they are meant to be executed, the echo should be removed.

# Example of submitting a PySpark job with GPU resources:
# gcloud dataproc jobs submit pyspark \
#   --properties="spark:spark.executor.resource.gpu.amount=1" \
#   --properties="spark:spark.task.resource.gpu.amount=1" \
#   --properties="spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
#   --cluster=${CLUSTER_NAME} \
#   --region ${REGION} gs://${BUCKET}/test.py

get_gpu_resources_script="/usr/lib/spark/scripts/gpu/getGpusResources.sh"
# Example of submitting a Spark job with GPU resources for driver and executor:
# gcloud dataproc jobs submit spark \
#   --project "${PROJECT_ID}" \
#   --cluster="${CLUSTER_NAME}" \
#   --region "${REGION}" \
#   --jars "file:///usr/lib/spark/examples/jars/spark-examples.jar" \
#   --class "org.apache.spark.examples.ml.JavaIndexToStringExample" \
#   --properties \
# "spark.driver.resource.gpu.amount=1,"\
# "spark.driver.resource.gpu.discoveryScript=${get_gpu_resources_script},"\
# "spark.executor.resource.gpu.amount=1,"\
# "spark.executor.resource.gpu.discoveryScript=${get_gpu_resources_script}"

@cjac (author) replied:

Yeah, I haven't verified whether they work or not. I'm busy, and the tests that I am running are passing. I don't know if I want to add new ones right this very second.
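
If the intent is to keep the commands as documentation until they are verified, one low-effort option (a sketch, not part of the PR) is to gate execution behind a dry-run toggle so they stay visible but only run when explicitly requested:

# Prints the command by default; set DRY_RUN=0 to actually execute it.
maybe_run() { if [[ "${DRY_RUN:-1}" == "1" ]]; then echo "$@"; else "$@"; fi; }

maybe_run gcloud dataproc jobs submit pyspark \
  --properties="spark:spark.executor.resource.gpu.amount=1" \
  --cluster="${CLUSTER_NAME}" \
  --region="${REGION}" "gs://${BUCKET}/test.py"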

@cjac force-pushed the proxy-sync-2026-01 branch from 7887d15 to 7a5a629 on January 28, 2026 at 02:10.