Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script #1374
base: main
Conversation
Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly improves the GPU driver installation script's reliability and adaptability, particularly in complex network environments requiring HTTP/HTTPS proxies and for systems utilizing Secure Boot. The changes focus on making the installation process more robust, configurable, and resilient to common issues like network restrictions and module signing requirements, while also refining the Conda environment setup and updating documentation.
Code Review
This pull request significantly enhances the GPU driver installation script by introducing robust proxy handling, improving Secure Boot integration, and refining Conda environment setup. Key improvements include flexible proxy configuration with support for HTTPS proxies and custom CA certificates, a new import_gpg_keys function for reliable GPG key fetching, and more thorough verification steps for signed kernel modules under Secure Boot. The documentation has also been updated to reflect these new features and provide better troubleshooting guidance. Overall, these changes make the script more resilient and configurable for diverse network environments and security requirements.
# Verify signature after installation
if [[ -n "${PSN}" ]]; then
  configure_dkms_certs

# Verify signatures and load
local signed=true
for module in $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko'); do
  if ! modinfo "${module}" | grep -qi sig ; then
    echo "ERROR: Module ${module} is NOT signed after installation."
    signed=false
  fi
done
if [[ "${signed}" != "true" ]]; then
  echo "ERROR: Module signing failed."
  exit 1
fi
execute_with_retries make -j$(nproc) modules \
  > kernel-open/build.log \
  2> kernel-open/build_error.log
# Sign kernel modules
if [[ -n "${PSN}" ]]; then
  configure_dkms_certs
  for module in $(find open-gpu-kernel-modules/kernel-open -name '*.ko'); do
    "/lib/modules/${uname_r}/build/scripts/sign-file" sha256 \
      "${mok_key}" \
      "${mok_der}" \
      "${module}"
  done
  clear_dkms_key

  if ! modprobe nvidia; then
    echo "ERROR: Failed to load nvidia module after build and sign."
    exit 1
  fi
  make modules_install \
    >> kernel-open/build.log \
    2>> kernel-open/build_error.log
  # Collect build logs and installed binaries
  tar czvf "${local_tarball}" \
    "${workdir}/open-gpu-kernel-modules/kernel-open/"*.log \
    $(find /lib/modules/${uname_r}/ -iname 'nvidia*.ko')
  ${gsutil_cmd} cp "${local_tarball}" "${gcs_tarball}"
  if ${gsutil_stat_cmd} "${gcs_tarball}.building" ; then ${gsutil_cmd} rm "${gcs_tarball}.building" || true ; fi
  building_file=""
  rm "${local_tarball}"
  make clean
  popd
  echo "NVIDIA modules built, signed, and loaded successfully."
fi
gpu/install_gpu_driver.sh (Outdated)
if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
  if [[ -z "${trusted_pem_path:-}" ]]; then
    echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
  else
    curl_retry_args+=(--cacert "${trusted_pem_path}")
  fi
The warning METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined indicates a potential issue. trusted_pem_path is only set within set_proxy if both a proxy (http-proxy/https-proxy) and a PEM URI are provided. If http-proxy-pem-uri is provided but no http-proxy or https-proxy is set, set_proxy returns early, leaving trusted_pem_path undefined. This could lead to GPG key imports failing to use the custom CA, even if the PEM URI is present.
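A minimal sketch of one way to close that gap, assuming `set_proxy` is the only place `trusted_pem_path` is exported; the fallback to the distro's default CA bundle is an assumption, not the PR's actual behavior:

```bash
# Hypothetical guard: if a PEM URI is supplied but set_proxy never exported
# trusted_pem_path (e.g. no http-proxy/https-proxy was given), fall back to
# the system bundle instead of silently skipping --cacert.
if [[ -n "${METADATA_HTTP_PROXY_PEM_URI:-}" && -z "${trusted_pem_path:-}" ]]; then
  if [[ -f /etc/ssl/certs/ca-certificates.crt ]]; then    # Debian/Ubuntu bundle
    trusted_pem_path="/etc/ssl/certs/ca-certificates.crt"
  elif [[ -f /etc/ssl/certs/ca-bundle.crt ]]; then        # Rocky/RHEL bundle
    trusted_pem_path="/etc/ssl/certs/ca-bundle.crt"
  fi
fi
[[ -n "${trusted_pem_path:-}" ]] && curl_retry_args+=(--cacert "${trusted_pem_path}")
```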
pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
cat > "${pkg_proxy_conf_file}" <<EOF
Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}";
Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}";
EOF
echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}"
echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}"
echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
elif is_rocky ; then
  pkg_proxy_conf_file="/etc/dnf/dnf.conf"

  touch "${pkg_proxy_conf_file}"

  if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then
    sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}"
  elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
    sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}"
  sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
  if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
    sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}"
  else
    local TMP_FILE=$(mktemp)
    printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}"

    cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new
    mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}"
    echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}"
  fi
  echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
fi
The apt and dnf proxy configurations (Acquire::http::Proxy "http://${effective_proxy}"; and proxy=http://${effective_proxy}) use an http:// prefix. If effective_proxy is derived solely from https_proxy_val (meaning only an HTTPS proxy was specified), this could lead to apt/dnf attempting to connect to an HTTPS proxy using an HTTP scheme. While a later sed command attempts to correct this if http-proxy-pem-uri is set, it might be incorrect if http-proxy-pem-uri is not provided.
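One hedged way to avoid hard-coding the scheme, sketched under the assumption that a custom CA (`http-proxy-pem-uri`) is the only case in which the proxy speaks TLS (which is how the later revision of the PR behaves):

```bash
# Sketch: pick the proxy scheme once, then reuse it for apt and dnf so the
# package managers agree with the exported HTTP_PROXY/HTTPS_PROXY values.
proxy_scheme="http"
if [[ -n "${METADATA_HTTP_PROXY_PEM_URI:-}" ]]; then proxy_scheme="https"; fi

if is_debuntu ; then
  {
    echo "Acquire::http::Proxy \"${proxy_scheme}://${effective_proxy}\";"
    echo "Acquire::https::Proxy \"${proxy_scheme}://${effective_proxy}\";"
  } > /etc/apt/apt.conf.d/99proxy
elif is_rocky ; then
  # Simplified: the real script also handles the [main] section placement.
  sed -i '/^proxy=/d' /etc/dnf/dnf.conf
  echo "proxy=${proxy_scheme}://${effective_proxy}" >> /etc/dnf/dnf.conf
fi
```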
Suggested change (replacing the apt block quoted above):

echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}"
echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}"
echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
break
fi
sleep 5m
sleep 1m # could take up to 180 minutes on single core nodes
if ! modprobe nvidia > /dev/null 2>&1; then
  echo "NVIDIA module not loading. Removing completion marker to force re-install."
  mark_incomplete gpu-driver
fi
The early check for nvidia module loadability and the subsequent mark_incomplete gpu-driver call significantly improve the resilience of the driver installation. If the module isn't loading, it correctly forces a re-installation attempt, which can resolve transient issues or incomplete previous installations.
gpu/install_gpu_driver.sh (Outdated)
| if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then | ||
| time dd if=/dev/zero of=/zero status=none ; sync ; sleep 3s ; rm -f /zero | ||
| ) fi | ||
| fi |
Similar to the exit_handler, disabling the dd if=/dev/zero operation in prepare_to_install using [[ "0" == "1" ]] is an effective way to prevent it from running. However, it would be clearer for future maintainers to either remove or explicitly comment out the code block if it's permanently disabled.
| if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then | |
| time dd if=/dev/zero of=/zero status=none ; sync ; sleep 3s ; rm -f /zero | |
| ) fi | |
| fi | |
| # zero free disk space (only if creating image) | |
| # This operation is currently disabled. | |
| # if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then |
curl ${curl_retry_args} "${signing_key_url}" \
  | gpg --import --no-default-keyring --keyring "${kr_path}"
import_gpg_keys --keyring-file "${kr_path}" --key-url "${signing_key_url}"
# Function to download GPG keys from URLs or Keyservers and import them to a specific keyring
# Usage:
#   import_gpg_keys --keyring-file <PATH> \
#     [--key-url <URL1> [--key-url <URL2> ...]] \
#     [--key-id <ID1> [--key-id <ID2> ...]] \
#     [--keyserver <KEYSERVER_URI>]
function import_gpg_keys() {
  local keyring_file=""
  local key_urls=()
  local key_ids=()
  local keyserver="hkp://keyserver.ubuntu.com:80" # Default keyserver
  local tmpdir="${TMPDIR:-/tmp}" # Use TMPDIR if set, otherwise /tmp
  local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args

  # Parse named arguments
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --keyring-file)
        keyring_file="$2"
        shift 2
        ;;
      --key-url)
        key_urls+=("$2")
        shift 2
        ;;
      --key-id)
        key_ids+=("$2")
        shift 2
        ;;
      --keyserver)
        keyserver="$2"
        shift 2
        ;;
      *)
        echo "Unknown option: $1" >&2
        return 1
        ;;
    esac
  done

  # Validate arguments
  if [[ -z "${keyring_file}" ]]; then
    echo "ERROR: --keyring-file is required." >&2
    return 1
  fi
  if [[ ${#key_urls[@]} -eq 0 && ${#key_ids[@]} -eq 0 ]]; then
    echo "ERROR: At least one --key-url or --key-id must be specified." >&2
    return 1
  fi

  # Ensure the directory for the keyring file exists
  local keyring_dir
  keyring_dir=$(dirname "${keyring_file}")
  if [[ ! -d "${keyring_dir}" ]]; then
    echo "Creating directory for keyring: ${keyring_dir}"
    mkdir -p "${keyring_dir}"
  fi

  local tmp_key_file=""
  local success=true

  # Setup curl proxy arguments if environment variables are set
  local proxy_to_use=""
  if [[ -n "${HTTPS_PROXY:-}" ]]; then
    proxy_to_use="${HTTPS_PROXY}"
  elif [[ -n "${HTTP_PROXY:-}" ]]; then
    proxy_to_use="${HTTP_PROXY}"
  fi

  if [[ -n "${proxy_to_use}" ]]; then
    curl_retry_args+=(-x "${proxy_to_use}")
  fi

  if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
    if [[ -z "${trusted_pem_path:-}" ]]; then
      echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2
    else
      curl_retry_args+=(--cacert "${trusted_pem_path}")
    fi
  fi

  # Process Key URLs
  for current_key_url in "${key_urls[@]}"; do
    echo "Attempting to download GPG key from URL: ${current_key_url}"
    tmp_key_file="${tmpdir}/key_$(basename "${current_key_url}")_$(date +%s).asc"

    if curl "${curl_retry_args[@]}" "${current_key_url}" -o "${tmp_key_file}"; then
      if [[ -s "${tmp_key_file}" ]]; then
        echo "Key file downloaded to ${tmp_key_file}."
        if gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then
          echo "Key from ${current_key_url} imported successfully to ${keyring_file}."
        else
          echo "ERROR: gpg --import failed for ${tmp_key_file} from ${current_key_url}." >&2
          success=false
        fi
      else
        echo "ERROR: Downloaded key file ${tmp_key_file} from ${current_key_url} is empty." >&2
        success=false
      fi
    else
      echo "ERROR: curl failed to download key from ${current_key_url}." >&2
      success=false
    fi
    [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}"
  done

  # Process Key IDs
  for key_id in "${key_ids[@]}"; do
    # Strip 0x prefix if present
    clean_key_id="${key_id#0x}"
    echo "Attempting to fetch GPG key ID ${clean_key_id} using curl from ${keyserver}"

    local fallback_key_url
    local server_host
    server_host=$(echo "${keyserver}" | sed -e 's#hkp[s]*://##' -e 's#:[0-9]*##')

    # Common keyserver URL patterns
    if [[ "${server_host}" == "keyserver.ubuntu.com" ]]; then
      fallback_key_url="https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x${clean_key_id}"
    elif [[ "${server_host}" == "pgp.mit.edu" ]]; then
      fallback_key_url="https://pgp.mit.edu/pks/lookup?op=get&search=0x${clean_key_id}"
    elif [[ "${server_host}" == "keys.openpgp.org" ]]; then
      fallback_key_url="https://keys.openpgp.org/vks/v1/by-fpr/${clean_key_id}"
    else
      fallback_key_url="https://${server_host}/pks/lookup?op=get&search=0x${clean_key_id}"
      echo "WARNING: Using best-guess fallback URL for ${keyserver}: ${fallback_key_url}"
    fi

    tmp_key_file="${tmpdir}/${clean_key_id}.asc"
    if curl "${curl_retry_args[@]}" "${fallback_key_url}" -o "${tmp_key_file}"; then
      if [[ -s "${tmp_key_file}" ]]; then
        if grep -q -iE '<html|<head|<!DOCTYPE' "${tmp_key_file}"; then
          echo "ERROR: Output from keyserver for ${clean_key_id} appears to be HTML, not a key. Key likely not found at ${fallback_key_url}." >&2
          success=false
        elif gpg --no-default-keyring --keyring "${keyring_file}" --import "${tmp_key_file}"; then
          echo "Key ${clean_key_id} imported successfully to ${keyring_file}."
        else
          echo "ERROR: gpg --import failed for ${clean_key_id} from ${fallback_key_url}." >&2
          success=false
        fi
      else
        echo "ERROR: Downloaded key file for ${clean_key_id} is empty from ${fallback_key_url}." >&2
        success=false
      fi
    else
      echo "ERROR: curl failed to download key ${clean_key_id} from ${fallback_key_url}." >&2
      success=false
    fi
    [[ -f "${tmp_key_file}" ]] && rm -f "${tmp_key_file}"
  done

  if [[ "${success}" == "true" ]]; then
    return 0
  else
    echo "ERROR: One or more keys failed to import." >&2
    return 1
  fi
}
The new import_gpg_keys function is a well-designed and crucial addition. It centralizes the logic for fetching GPG keys from URLs or keyservers, correctly handles proxy settings, and includes retry mechanisms. This significantly improves the robustness and maintainability of repository configuration across the script.
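For reference, a couple of hypothetical invocations matching the usage block above (the keyring paths, URLs, and key IDs are placeholders, not values used by the script):

```bash
# Import a vendor key from a URL into a dedicated keyring.
import_gpg_keys --keyring-file /usr/share/keyrings/example-archive-keyring.gpg \
                --key-url "https://example.com/keys/archive-signing-key.asc"

# Fetch two keys by ID from a non-default keyserver.
import_gpg_keys --keyring-file /usr/share/keyrings/example.gpg \
                --key-id 0xDEADBEEFDEADBEEF \
                --key-id CAFEBABECAFEBABE \
                --keyserver "hkps://keys.openpgp.org"
```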
gpu/install_gpu_driver.sh (Outdated)
local key_ids=()
local keyserver="hkp://keyserver.ubuntu.com:80" # Default keyserver
local tmpdir="${TMPDIR:-/tmp}" # Use TMPDIR if set, otherwise /tmp
local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args
The curl_retry_args in import_gpg_keys are set to (-sSLf --retry 3 --retry-delay 5). This is less robust than the curl_retry_args (-fsSL --retry-connrefused --retry 10 --retry-max-time 30) used elsewhere in the script. Given that GPG key imports are critical for repository setup, using the more aggressive retry settings would improve reliability, especially in flaky network environments.
Suggested change:

-local curl_retry_args=(-sSLf --retry 3 --retry-delay 5) # Basic curl retry args
+local curl_retry_args=(-fsSL --retry-connrefused --retry 10 --retry-max-time 30) # Basic curl retry args
  If `yes`, installs PyTorch, Numba, RAPIDS, and PySpark in a Conda
  environment.
* `gpu-conda-env`: (Optional) Name for the PyTorch Conda environment.
  Default: `dpgce`.
* `container-runtime`: (Optional) E.g., `docker`, `containerd`, `crio`.
  For NVIDIA Container Toolkit configuration. Auto-detected if not specified.
`http-proxy`: (Optional) URL of an HTTP proxy for downloads.
`http-proxy`: (Optional) Proxy address and port for HTTP requests (e.g., `your-proxy.com:3128`).
* `https-proxy`: (Optional) Proxy address and port for HTTPS requests (e.g., `your-proxy.com:3128`). Defaults to `http-proxy` if not set.
* `proxy-uri`: (Optional) A single proxy URI for both HTTP and HTTPS. Overridden by `http-proxy` or `https-proxy` if they are set.
* `no-proxy`: (Optional) Comma or space-separated list of hosts/domains to bypass the proxy. Defaults include localhost, metadata server, and Google APIs. User-provided values are appended to the defaults.
* `http-proxy-pem-uri`: (Optional) A `gs://` path to the
  PEM-encoded certificate file used by the proxy specified in
  `http-proxy`. This is needed if the proxy uses TLS and its
  certificate is not already trusted by the cluster's default trust
  store (e.g., if it's a self-signed certificate or signed by an
  internal CA). The script will install this certificate into the
  system and Java trust stores.
  PEM-encoded CA certificate file for the proxy specified in
  `http-proxy`/`https-proxy`. Required if the proxy uses TLS with a certificate not in the default system trust store. This certificate will be added to the system, Java, and Conda trust stores, and proxy connections will use HTTPS.
The updated descriptions for include-pytorch, http-proxy, https-proxy, proxy-uri, no-proxy, and http-proxy-pem-uri metadata parameters provide much-needed clarity on the new proxy configuration options and their usage. Replacing TensorFlow with Numba in the include-pytorch description also reflects the updated Conda environment package list.
/gcbrun
This commit significantly enhances the robustness and configurability of the GPU driver installation script, particularly for environments with HTTP/HTTPS proxies and those using Secure Boot.
**Key Changes:**
* **Enhanced Proxy Configuration (`set_proxy`):**
* Added support for `https-proxy` and `proxy-uri` metadata, providing more flexibility in proxy setups.
* Improved `NO_PROXY` handling with sensible defaults (including Google APIs) and user-configurable additions.
* Integrated support for custom proxy CA certificates via `http-proxy-pem-uri`, including installation into system, Java, and Conda trust stores.
* Connections to the proxy now use HTTPS when a custom CA is provided.
* Added proxy connection and reachability tests to fail fast on misconfiguration.
* Ensures `curl`, `apt`, `dnf`, `gpg`, and Java all respect the proxy settings.
* **Robust GPG Key Import (`import_gpg_keys`):**
* Introduced a new function to reliably import GPG keys from URLs or keyservers, fully respecting the configured proxy and custom CA settings.
* This replaces direct `curl | gpg --import` calls, making key fetching more resilient in restricted network environments.
* **Secure Boot Signing Refinements:**
* The `configure_dkms_certs` function now always fetches keys from Secret Manager if `private_secret_name` is set, ensuring `modulus_md5sum` is available for GCS cache paths.
* Kernel module signing is now more clearly integrated into the build process.
* Improved checks to ensure modules are actually signed and loadable after installation when Secure Boot is active.
* **Resilient Driver Installation:**
* The script now checks if the `nvidia` module can be loaded at the beginning of `install_nvidia_gpu_driver` and will re-attempt installation if it fails.
* `curl` calls for downloading drivers and other artifacts now use retry flags and honor proxy settings.
* **Conda Environment for PyTorch:**
* Adjusted package list for Conda environment, removing TensorFlow to streamline.
* Added specific workarounds for Debian 10, using `conda` instead of `mamba`.
* **Documentation Updates (`gpu/README.md`):**
* Added details on the new proxy metadata: `https-proxy`, `proxy-uri`, `no-proxy`.
* Created a new section "Enhanced Proxy Support" explaining the features.
* Updated `http-proxy-pem-uri` description.
* Added proxy considerations to the "Troubleshooting" section.
These changes aim to make the GPU initialization action more reliable across a wider range of network environments and improve the Secure Boot workflow.
/gcbrun
This commit refines the `import_gpg_keys` function and a few other areas:
- Moves the definition of `curl_retry_args` earlier in `prepare_to_install` to ensure it's available for `set_proxy`.
- Removes local `tmpdir` and `curl_retry_args` definitions within `import_gpg_keys`, using the global settings instead.
- Eliminates manual proxy and certificate configuration for `curl` within `import_gpg_keys`. The function now relies on the environment variables (`HTTP_PROXY`, `HTTPS_PROXY`, `REQUESTS_CA_BUNDLE`) properly set by the `set_proxy` function.
- Restores `tensorflow[and-cuda]` to the default Conda package list in `install_pytorch`.

These changes simplify the script, reduce redundancy, and ensure consistent use of proxy settings and curl arguments.
I've made substantial changes. Please perform another review, discarding previous review comments. Let's start from the top once again.

/gemini review
Code Review
This pull request significantly refactors the GPU driver installation script, enhancing proxy handling, improving Secure Boot integration, and refining Conda environment setup. The documentation has been updated to reflect these changes, providing clearer guidance on proxy configuration and custom image building with Secure Boot. The introduction of the import_gpg_keys function centralizes GPG key management, making the script more robust in various network environments. Overall, these changes contribute to a more reliable and configurable GPU initialization process.
gpu/install_gpu_driver.sh (Outdated)
local local_tarball="${workdir}/${build_tarball}"
local gcs_tarball="${pkg_bucket}/conda/${_shortname}/${build_tarball}"

# If the completion sentinel doesn't exist, force a full rebuild
The comment states "If the completion sentinel doesn't exist, force a full rebuild", but the code if is_complete install_gpu_driver-main; then actually checks if the sentinel does exist. This appears to be a logical inversion or a misleading comment. If the intention is to force a rebuild when the main driver installation is already marked complete (perhaps to ensure the Conda environment is always fresh after a driver install), the comment should reflect that. Otherwise, if a rebuild should happen when the driver install is not complete, the condition needs to be if ! is_complete install_gpu_driver-main; then.
we are in a block where the pyspark completion sentinel does not exist. We are checking here whether despite the pyspark completion sentinel not existing, the main sentinel does. This is an uncommon and contrived situation, and when we find ourselves in this situation, it is our job to clear the cache and re-build.
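A hedged sketch of the control flow described in that reply, assuming the script's `is_complete` helper and using an environment path and cleanup step that are illustrative only:

```bash
# We only reach this point when the pyspark completion sentinel is absent.
if is_complete install_gpu_driver-main ; then
  # Driver install finished earlier but the Conda environment never did:
  # treat the cached environment as suspect and rebuild it from scratch.
  echo "pytorch sentinel missing but main sentinel present; clearing cache and rebuilding"
  ${gsutil_cmd} rm -f "${gcs_tarball}" || true                   # drop the stale GCS cache entry
  rm -rf "/opt/conda/miniconda3/envs/${gpu_conda_env:-dpgce}"    # hypothetical local env path
fi
```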
tar czvf "${local_tarball}" \
  "${workdir}/open-gpu-kernel-modules/kernel-open/"*.log \
The local_tarball variable is used here without being explicitly defined within the execute_github_driver_build function's scope. It appears to rely on being set in the calling function (build_driver_from_github) and implicitly available in the global scope. While this might work in Bash, it's generally considered bad practice and can lead to unexpected behavior or bugs if the call stack or variable scoping changes. It's best to pass local_tarball as an argument to execute_github_driver_build or define it locally if its value can be determined within the function.
gpu/install_gpu_driver.sh (Outdated)
building_file=""
rm "${local_tarball}"
make clean
popd
The execute_github_driver_build function includes a popd command. However, this function is called from build_driver_from_github, which also has a popd at its end (line 1331). This results in an extra popd call, which could lead to a directory stack underflow or unexpected behavior if the script's current directory is not what's expected. Ensure that pushd and popd calls are balanced.
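A minimal sketch of the shape both of these comments point toward (explicit parameters plus a balanced directory stack). The signature and elided body are illustrative, not the PR's final code, and whether the stack is managed in the callee (as here) or entirely in the caller is a design choice the PR iterates on:

```bash
function execute_github_driver_build() {
  local local_tarball="$1"   # passed in explicitly by build_driver_from_github
  local gcs_tarball="$2"
  pushd open-gpu-kernel-modules
  # ... build, sign, package into "${local_tarball}", upload to "${gcs_tarball}" ...
  make clean
  popd                       # balances the pushd above, leaving the caller's stack untouched
}

# caller
execute_github_driver_build "${local_tarball}" "${gcs_tarball}"
```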
This initialization action installs GPU driver for NVIDIA GPUs on -m node(s) and
-w nodes in a Dataproc cluster.
The change from "master and worker nodes" to "-m node(s) and -w nodes" uses internal abbreviations that might not be immediately clear to all users. For documentation, it's generally better to use full, descriptive terms to ensure maximum clarity for a broad audience.
Suggested change:

-This initialization action installs GPU driver for NVIDIA GPUs on -m node(s) and
--w nodes in a Dataproc cluster.
+This initialization action installs GPU driver for NVIDIA GPUs on master node(s) and
+worker nodes in a Dataproc cluster.
this is a part of a larger inclusive language initiative. We are intentionally replacing potentially less inclusive language
gpu/install_gpu_driver.sh (Outdated)
if version_ge "${DATAPROC_IMAGE_VERSION}" "2.1" ; then
  conda_pkg_list+=("tensorflow")
fi
The conda_pkg_list initially removes tensorflow[and-cuda] (line 931) but then conditionally adds tensorflow back for Dataproc 2.1+ images. This might lead to confusion or unintended package installations. It would be clearer to explicitly manage TensorFlow's inclusion based on the DATAPROC_IMAGE_VERSION from the start, or provide a comment explaining why it's removed and then re-added.
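One hedged way to make that intent explicit, assuming the `version_ge` and `DATAPROC_IMAGE_VERSION` helpers used elsewhere in the script; the stated reason for the condition is an assumption recorded in the comment:

```bash
# Build the package list without TensorFlow, then add it only where it is
# expected to resolve, so the conditional is self-documenting.
conda_pkg_list=("numba" "pytorch" "rapids" "pyspark" "cuda-version<=${CUDA_VERSION}")
if version_ge "${DATAPROC_IMAGE_VERSION}" "2.1" ; then
  # TensorFlow is only added on 2.1+ images; older solver/toolchain combinations
  # are assumed not to resolve tensorflow[and-cuda] cleanly.
  conda_pkg_list+=("tensorflow[and-cuda]")
fi
```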
gpu/install_gpu_driver.sh (Outdated)
# zero free disk space (only if creating image)
if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
yeah, we might want to remove that 0==1 check, I agree.
gpu/install_gpu_driver.sh (Outdated)
fi
fi

# Configure package managers
The previous version of set_proxy included explicit gcloud config set proxy/type http, proxy/address, and proxy/port commands. While environment variables are now set, explicitly configuring gcloud might still be necessary for all its operations to respect the proxy, as it doesn't always rely solely on environment variables. Consider re-adding the gcloud config set commands to ensure comprehensive proxy support for gcloud tools.
no. gcloud config set proxy args do not support no_proxy and so we do not use them. I have opened a github issue to have it added, but I do not have the time to do it myself....
oh. It was done in November.
547.0.0 release on November 11th
Thank you team!
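A hedged sketch of the version-gated `gcloud` configuration discussed here; the `proxy/type`, `proxy/address`, and `proxy/port` properties come from the review comment above, while the method of reading the SDK version and splitting host from port is an assumption:

```bash
# Only push proxy settings into gcloud config when the installed SDK is new
# enough to honor them alongside NO_PROXY (per the 547.0.0 release noted above).
gcloud_version="$(gcloud version --format='value(core)' 2>/dev/null || echo 0.0.0)"
if version_ge "${gcloud_version}" "547.0.0" && [[ -n "${http_proxy_val:-}" ]]; then
  gcloud config set proxy/type http
  gcloud config set proxy/address "${http_proxy_val%%:*}"   # host portion of host:port
  gcloud config set proxy/port    "${http_proxy_val##*:}"   # port portion of host:port
fi
```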
gpu/install_gpu_driver.sh (Outdated)
# zero free disk space (only if creating image)
if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then ( set +e
if [[ "0" == "1" ]] && [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
fef918e to 25df48a

GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This update significantly improves the `install_gpu_driver.sh` script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.
**1. `gpu/README.md`:**
* **Comprehensive Documentation for Secure Boot & Proxy:**
* Added a major section: "Building Custom Images with Secure Boot and Proxy Support". This details the end-to-end process using the `GoogleCloudDataproc/custom-images` repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
* Added a major section: "Launching a Cluster with the Secure Boot Custom Image". This explains how to use the custom-built images to launch Dataproc clusters with `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the `GoogleCloudDataproc/cloud-dataproc` repository for VPC, subnet, and proxy configuration.
* Includes essential verification steps for checking driver status, module signatures, and system logs on the cluster nodes.
**2. `gpu/install_gpu_driver.sh`:**
* **Conda/Mamba Environment (`install_pytorch`):**
* The package list for the Conda environment now omits the explicit CUDA runtime specification, allowing the solver more flexibility based on other dependencies and the `cuda-version` constraint.
* Mamba is now used preferentially for faster environment creation, with a fallback to Conda.
* Implements a cache/environment clearing logic: If the main driver installation is marked complete (`install_gpu_driver-main` sentinel exists) but the PyTorch environment setup is not (`pytorch` sentinel missing), it purges the GCS cache and local Conda environment to ensure a clean rebuild.
* Enhanced error handling for Conda/Mamba environment creation.
* **NVIDIA Driver Handling:**
* `set_driver_version`: Uses `curl -I` for lightweight URL HEAD requests.
* `build_driver_from_github`: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
* `execute_github_driver_build`: Refactored to accept tarball paths as arguments. `popd` removed to balance `pushd` in caller. Removed a debug echo of the `sign-file` exit code.
* Added `make -j$(nproc)` to `modules_install` for parallelization.
* Post-install verification loop checks `modinfo` for `signer:` to confirm modules are signed.
* **Secure Boot Check:** Changed the script to issue a warning rather than exit if Secure Boot is enabled on unsupported Debian versions (<= 2.1).
* **Completion Sentinel:** Added `mark_complete install_gpu_driver-main` at the end of the `main` function.
* **Proxy Configuration (`set_proxy`):** Conditionally adds `gcloud config set proxy/...` commands if the gcloud SDK version is >= 547.0.0. Corrected `sed` command for DNF proxy configuration.
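As a concrete illustration of the `modinfo` / `signer:` verification bullet above, a minimal sketch; variable names follow the snippets quoted earlier in this review:

```bash
# Fail the build if any installed nvidia module lacks a signer field,
# which is what Secure Boot will ultimately enforce at module load time.
signed=true
for module in $(find "/lib/modules/${uname_r}/" -iname 'nvidia*.ko'); do
  if ! modinfo "${module}" | grep -q '^signer:' ; then
    echo "ERROR: ${module} is not signed" >&2
    signed=false
  fi
done
[[ "${signed}" == "true" ]] || exit 1
```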
I've made substantial changes in response to your previous review. Please perform another review.

/gemini review
Code Review
This pull request introduces significant enhancements to the GPU installation script, primarily focusing on robust proxy handling and support for Secure Boot. The documentation in gpu/README.md has been extensively updated with detailed instructions for these complex scenarios, which is a great improvement. The script gpu/install_gpu_driver.sh has been refactored to centralize GPG key imports, improve caching logic for driver and source builds, and add comprehensive proxy configuration, including custom CA certificate handling.
My review focuses on ensuring the new logic is robust and consistent. I've identified a minor documentation inconsistency and a couple of potential issues in the script related to command execution order and error handling. Overall, this is a very strong refactoring that greatly improves the reliability and maintainability of the script.
gpu/install_gpu_driver.sh (Outdated)
trusted_pem_dir="/usr/local/share/ca-certificates"
mkdir -p "${trusted_pem_dir}"
proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt"
gsutil cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
This direct call to gsutil is inconsistent with the rest of the script, which uses a ${gsutil_cmd} variable defined in prepare_to_install. Using the variable ensures the correct command wrapper is used based on the gcloud version. Please change this to ${gsutil_cmd}. Note that the variable definition in prepare_to_install will also need to be moved before set_proxy is called.
Suggested change:

-gsutil cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
+${gsutil_cmd} cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
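A hedged sketch of defining the wrapper before `set_proxy` runs; how the real script chooses between `gsutil` and `gcloud storage` is not shown in this thread, so the selection logic here is an assumption:

```bash
# Pick a storage CLI once, early in prepare_to_install, so every later
# function (set_proxy included) can call ${gsutil_cmd} consistently.
if gcloud storage --help >/dev/null 2>&1 ; then
  gsutil_cmd="gcloud storage"
  gsutil_stat_cmd="gcloud storage objects describe"
else
  gsutil_cmd="gsutil"
  gsutil_stat_cmd="gsutil -q stat"
fi
```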
mount_ramdisk # Updates tmpdir if successful
install_log="${tmpdir}/install.log" # Set install log path based on final tmpdir
curl_retry_args="-fsSL --retry-connrefused --retry 10 --retry-max-time 30"
set_proxy
the refactor for this is substantial. Can we maybe not for right now if the tests pass?
diff --git a/gpu/install_gpu_driver.sh b/gpu/install_gpu_driver.sh
index a4adf64..ab31529 100644
--- a/gpu/install_gpu_driver.sh
+++ b/gpu/install_gpu_driver.sh
@@ -2693,10 +2693,61 @@ function set_proxy(){
meta_http_proxy=$(get_metadata_attribute 'http-proxy' '')
meta_https_proxy=$(get_metadata_attribute 'https-proxy' '')
meta_proxy_uri=$(get_metadata_attribute 'proxy-uri' '')
+ METADATA_HTTP_PROXY_PEM_URI="$(get_metadata_attribute http-proxy-pem-uri '')"
echo "DEBUG: set_proxy: meta_http_proxy='${meta_http_proxy}'"
echo "DEBUG: set_proxy: meta_https_proxy='${meta_https_proxy}'"
echo "DEBUG: set_proxy: meta_proxy_uri='${meta_proxy_uri}'"
+ echo "DEBUG: set_proxy: METADATA_HTTP_PROXY_PEM_URI='${METADATA_HTTP_PROXY_PEM_URI}'"
+
+ # Install the HTTPS proxy's certificate FIRST, so trusted_pem_path is available
+ if [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]] ; then
+ if [[ ! "${METADATA_HTTP_PROXY_PEM_URI}" =~ ^gs:// ]] ; then echo "ERROR: http-proxy-pem-uri value must start with gs://" ; exit 1 ; fi
+ echo "DEBUG: set_proxy: Processing http-proxy-pem-uri='${METADATA_HTTP_PROXY_PEM_URI}'"
+ local trusted_pem_dir proxy_ca_pem
+ if is_debuntu ; then
+ trusted_pem_dir="/usr/local/share/ca-certificates"
+ proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt"
+ mkdir -p "${trusted_pem_dir}"
+ ${gsutil_cmd} cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
+ update-ca-certificates
+ export trusted_pem_path="/etc/ssl/certs/ca-certificates.crt"
+ elif is_rocky ; then
+ trusted_pem_dir="/etc/pki/ca-trust/source/anchors"
+ proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt"
+ mkdir -p "${trusted_pem_dir}"
+ ${gsutil_cmd} cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
+ update-ca-trust
+ export trusted_pem_path="/etc/ssl/certs/ca-bundle.crt"
+ fi
+ export REQUESTS_CA_BUNDLE="${trusted_pem_path}"
+ echo "DEBUG: set_proxy: trusted_pem_path set to '${trusted_pem_path}'"
+
+ # Add to Java trust store
+ if [[ -f "/etc/environment" ]]; then
+ JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)"
+ if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
+ echo "DEBUG: Importing proxy CA into Java cacerts"
+ "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}"
+ fi
+ fi
+
+ # Add to Conda trust store
+ if command -v conda &> /dev/null ; then
+ local conda_cert_file="/opt/conda/default/ssl/cacert.pem"
+ if [[ -f "${conda_cert_file}" ]]; then
+ local ca_subject
+ ca_subject="$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject)"
+ openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || {
+ echo "DEBUG: Appending proxy CA to Conda cacert.pem"
+ cat "${proxy_ca_pem}" >> "${conda_cert_file}"
+ }
+ fi
+ fi
+ else
+ echo "DEBUG: set_proxy: No http-proxy-pem-uri metadata found. Skipping custom cert install."
+ export trusted_pem_path="" # Explicitly empty
+ fi
local http_proxy_val=""
local https_proxy_val=""
@@ -2716,7 +2767,7 @@ function set_proxy(){
fi
if [[ -z "${http_proxy_val}" && -z "${https_proxy_val}" ]]; then
- echo "DEBUG: set_proxy: No valid proxy metadata found (http-proxy, https-proxy, or proxy-uri). Skipping proxy setup."
+ echo "DEBUG: set_proxy: No valid proxy metadata found (http-proxy, https-proxy, or proxy-uri). Proxy env vars not set."
return 0
fi
@@ -2730,7 +2781,7 @@ function set_proxy(){
".google.com"
".googleapis.com"
)
-
+
local user_no_proxy
user_no_proxy=$(get_metadata_attribute 'no-proxy' '')
local user_no_proxy_list=()
@@ -2738,17 +2789,22 @@ function set_proxy(){
# Replace spaces with commas, then split by comma
IFS=',' read -r -a user_no_proxy_list <<< "${user_no_proxy// /,}"
fi
-
+
local combined_no_proxy_list=( "${default_no_proxy_list[@]}" "${user_no_proxy_list[@]}" )
local no_proxy
no_proxy=$( IFS=',' ; echo "${combined_no_proxy_list[*]}" )
export NO_PROXY="${no_proxy}"
export no_proxy="${no_proxy}"
-
+
+ local proxy_protocol="http"
+ if [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then
+ proxy_protocol="https"
+ fi
+
# Export environment variables
if [[ -n "${http_proxy_val}" ]]; then
- export HTTP_PROXY="http://${http_proxy_val}"
- export http_proxy="http://${http_proxy_val}"
+ export HTTP_PROXY="${proxy_protocol}://${http_proxy_val}"
+ export http_proxy="${proxy_protocol}://${http_proxy_val}"
else
unset HTTP_PROXY
unset http_proxy
@@ -2756,8 +2812,8 @@ function set_proxy(){
echo "DEBUG: set_proxy: Initial HTTP_PROXY='${HTTP_PROXY:-}'"
if [[ -n "${https_proxy_val}" ]]; then
- export HTTPS_PROXY="http://${https_proxy_val}"
- export https_proxy="http://${https_proxy_val}"
+ export HTTPS_PROXY="${proxy_protocol}://${https_proxy_val}"
+ export https_proxy="${proxy_protocol}://${https_proxy_val}"
else
unset HTTPS_PROXY
unset https_proxy
@@ -2827,23 +2883,23 @@ function set_proxy(){
local pkg_proxy_conf_file
local effective_proxy="${http_proxy_val:-${https_proxy_val}}" # Use a single value for apt/dnf
- if [[ -z "${effective_proxy}" ]]; then
- echo "DEBUG: set_proxy: No HTTP or HTTPS proxy set for package managers."
- elif is_debuntu ; then
- pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
- echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}"
- echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}"
- echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
- elif is_rocky ; then
- pkg_proxy_conf_file="/etc/dnf/dnf.conf"
- touch "${pkg_proxy_conf_file}"
- sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
- if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
- sed -i.bak "/^\\[main\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}"
- else
- echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}"
+ if [[ -n "${effective_proxy}" ]]; then
+ if is_debuntu ; then
+ pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy"
+ echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}"
+ echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}"
+ echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}"
+ elif is_rocky ; then
+ pkg_proxy_conf_file="/etc/dnf/dnf.conf"
+ touch "${pkg_proxy_conf_file}"
+ sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}"
+ if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then
+ sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}"
+ else
+ echo -e "[main]\nproxy=${HTTP_PROXY}" >> "${pkg_proxy_conf_file}"
+ fi
+ echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
fi
- echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}"
fi
# Configure dirmngr to use the HTTP proxy if set
@@ -2858,114 +2914,31 @@ function set_proxy(){
execute_with_retries dnf install -y -q gnupg2-smime
fi
fi
-
+
mkdir -p /etc/gnupg
local dirmngr_conf="/etc/gnupg/dirmngr.conf"
touch "${dirmngr_conf}" # Ensure the file exists
-
+
sed -i.bak '/^http-proxy/d' "${dirmngr_conf}"
- if [[ -n "${http_proxy_val}" ]]; then
- echo "http-proxy http://${http_proxy_val}" >> "${dirmngr_conf}"
+ if [[ -n "${HTTP_PROXY:-}" ]]; then
+ echo "http-proxy ${HTTP_PROXY}" >> "${dirmngr_conf}"
echo "DEBUG: set_proxy: Configured dirmngr proxy in ${dirmngr_conf}"
fi
- # Install the HTTPS proxy's certificate
- METADATA_HTTP_PROXY_PEM_URI="$(get_metadata_attribute http-proxy-pem-uri '')"
- if [[ -z "${METADATA_HTTP_PROXY_PEM_URI}" ]] ; then
- echo "DEBUG: set_proxy: No http-proxy-pem-uri metadata found. Skipping cert install."
- return 0
- fi
- if [[ ! "${METADATA_HTTP_PROXY_PEM_URI}" =~ ^gs:// ]] ; then echo "ERROR: http-proxy-pem-uri value must start with gs://" ; exit 1 ; fi
+ if [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]] ; then
+ local proxy_host="${http_proxy_val:-${https_proxy_val}}"
+ # Verification steps from original script...
+ if [[ -n "$proxy_host" ]]; then
+ # ca_subject is defined above when PEM is processed
+ openssl s_client -connect "${proxy_host}" -CAfile "${proxy_ca_pem}" < /dev/null || { echo "ERROR: proxy cert verification failed" ; exit 1 ; }
+ openssl s_client -connect "${proxy_host}" -CAfile "${trusted_pem_path}" < /dev/null || { echo "ERROR: proxy ca not in system bundle" ; exit 1 ; }
- echo "DEBUG: set_proxy: http-proxy-pem-uri='${METADATA_HTTP_PROXY_PEM_URI}'"
- local trusted_pem_dir proxy_ca_pem ca_subject
- if is_debuntu ; then
- trusted_pem_dir="/usr/local/share/ca-certificates"
- proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt"
- ${gsutil_cmd} cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
- update-ca-certificates
- export trusted_pem_path="/etc/ssl/certs/ca-certificates.crt"
- if [[ -n "${effective_proxy}" ]]; then
- sed -i -e 's|http://|https://|' "${pkg_proxy_conf_file}"
- fi
- elif is_rocky ; then
- trusted_pem_dir="/etc/pki/ca-trust/source/anchors"
- proxy_ca_pem="${trusted_pem_dir}/proxy_ca.crt"
- ${gsutil_cmd} cp "${METADATA_HTTP_PROXY_PEM_URI}" "${proxy_ca_pem}"
- update-ca-trust
- export trusted_pem_path="/etc/ssl/certs/ca-bundle.crt"
- if [[ -n "${effective_proxy}" ]]; then
- sed -i -e "s|^proxy=http://|proxy=https://|" "${pkg_proxy_conf_file}"
+ curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://google.com" || { echo "ERROR: curl rejects proxy config for google.com" ; exit 1 ; }
+ curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://developer.download.nvidia.com" || { echo "ERROR: curl rejects proxy config for nvidia.com" ; exit 1 ; }
fi
+ pip install pip-system-certs
+ unset REQUESTS_CA_BUNDLE
fi
- export REQUESTS_CA_BUNDLE="${trusted_pem_path}"
- echo "DEBUG: set_proxy: trusted_pem_path='${trusted_pem_path}'"
-
- local proxy_host="${http_proxy_val:-${https_proxy_val}}"
-
- # Update env vars to use https
- if [[ -n "${http_proxy_val}" ]]; then
- export HTTP_PROXY="https://${http_proxy_val}"
- export http_proxy="https://${http_proxy_val}"
- fi
- if [[ -n "${https_proxy_val}" ]]; then
- export HTTPS_PROXY="https://${https_proxy_val}"
- export https_proxy="https://${https_proxy_val}"
- fi
- sed -i -e 's|http://|https://|g' /etc/environment
- echo "DEBUG: set_proxy: Final HTTP_PROXY='${HTTP_PROXY:-}'"
- echo "DEBUG: set_proxy: Final HTTPS_PROXY='${HTTPS_PROXY:-}'"
-
- if [[ -n "${http_proxy_val}" ]]; then
- sed -i -e "s|^http-proxy http://.*|http-proxy https://${http_proxy_val}|" /etc/gnupg/dirmngr.conf
- fi
-
- # Verification steps from original script...
- ca_subject="$(openssl crl2pkcs7 -nocrl -certfile "${proxy_ca_pem}" | openssl pkcs7 -print_certs -noout | grep ^subject)"
- openssl s_client -connect "${proxy_host}" -CAfile "${proxy_ca_pem}" < /dev/null || { echo "ERROR: proxy cert verification failed" ; exit 1 ; }
- openssl s_client -connect "${proxy_host}" -CAfile "${trusted_pem_path}" < /dev/null || { echo "ERROR: proxy ca not in system bundle" ; exit 1 ; }
-
- curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://google.com" || { echo "ERROR: curl rejects proxy config for google.com" ; exit 1 ; }
- curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 10 --retry-max-time 30 --head "https://developer.download.nvidia.com" || { echo "ERROR: curl rejects proxy config for nvidia.com" ; exit 1 ; }
-
- pip install pip-system-certs
- unset REQUESTS_CA_BUNDLE
-
- if command -v conda &> /dev/null ; then
- local conda_cert_file="/opt/conda/default/ssl/cacert.pem"
- if [[ -f "${conda_cert_file}" ]]; then
- openssl crl2pkcs7 -nocrl -certfile "${conda_cert_file}" | openssl pkcs7 -print_certs -noout | grep -Fxq "${ca_subject}" || {
- cat "${proxy_ca_pem}" >> "${conda_cert_file}"
- }
- fi
- fi
-
- if [[ -f "/etc/environment" ]]; then
- JAVA_HOME="$(awk -F= '/^JAVA_HOME=/ {print $2}' /etc/environment)"
- if [[ -n "${JAVA_HOME:-}" && -f "${JAVA_HOME}/bin/keytool" ]]; then
- "${JAVA_HOME}/bin/keytool" -import -cacerts -storepass changeit -noprompt -alias swp_ca -file "${proxy_ca_pem}"
- fi
- fi
-
- echo "DEBUG: set_proxy: Verifying proxy connectivity..."
-
- # Test fetching a file through the proxy
- local test_url="https://www.gstatic.com/generate_204"
-# local test_url="https://raw.githubusercontent.com/GoogleCloudDataproc/initialization-actions/master/README.md"
- local test_output="${tmpdir}/proxy_test.md"
-
- echo "DEBUG: set_proxy: Attempting to download ${test_url} via proxy ${HTTPS_PROXY}"
-# if curl --verbose --cacert "${trusted_pem_path}" -x "${HTTPS_PROXY}" -fsSL --retry-connrefused --retry 3 --retry-max-time 10 -o "${test_output}" "${test_url}"; then
- if curl -vL ${curl_retry_args} -o /dev/null "${test_url}"; then
- echo "DEBUG: set_proxy: Successfully downloaded test file through proxy."
- rm -f "${test_output}"
- else
- echo "ERROR: Proxy test failed. Unable to download ${test_url} via ${HTTPS_PROXY}"
- # Optionally print more debug info from curl if needed
- exit 1
- fi
-
- echo "DEBUG: set_proxy: Proxy verification successful."
echo "DEBUG: set_proxy: Proxy setup complete."
}

This commit incorporates several enhancements and fixes into the `install_gpu_driver.sh` script, particularly improving the Conda environment setup, caching mechanisms, robustness in different environments, and documentation for Secure Boot and Proxy scenarios.
**Script Changes (`install_gpu_driver.sh`):**
1. **Conda/Mamba Environment (`install_pytorch`):**
* Integrates Mamba for faster package installation, with fallback to Conda.
* The `conda_pkg_list` uses a flexible specification including `numba`, `pytorch`, `tensorflow[and-cuda]`, `rapids`, `pyspark`, and `cuda-version<=${CUDA_VERSION}` to guide the solver, balancing `rapids` compatibility with the desired CUDA version.
* Implements logic to force-clean and rebuild the Conda environment if the main driver installation has run, ensuring a fresh state by removing GCS cache and local environment when the `pytorch` sentinel is missing but `install_gpu_driver-main` is present.
* Enhances error handling during environment creation, providing specific guidance for Mamba multi-download failures in proxied environments.
* Cleans up the local Conda pack tarball after uploading to GCS.
2. **Source Tarball Caching (`build_driver_from_github`):**
* Adds GCS caching for the NVIDIA open kernel module source code downloaded from GitHub, reducing external fetches. Checks local, GCS, then GitHub.
3. **Driver Download (`set_driver_version`):**
* Improves NVIDIA driver URL validation using a more robust `curl` and `grep` check for HTTP 200 status.
* Simplifies `curl` download commands, relying on globally set options.
4. **Bug Fixes & Refinements:**
* Fixes module signature check in `build_driver_from_github` by correctly extracting the module name before calling `modinfo`.
* Consolidates `curl` retry arguments (`curl_retry_args`) and removes redundant proxy settings in functions, relying on `set_proxy`.
* Adds a main completion sentinel (`install_gpu_driver-main`) at the end of the `main` function.
**Documentation Changes (`gpu/README.md`):**
* **Secure Boot & Proxy:** Added a detailed section on "Building Custom Images with Secure Boot and Proxy Support".
* Provides a step-by-step guide on using the `GoogleCloudDataproc/custom-images` repository to build custom images with signed NVIDIA drivers, including configuration for proxy environments.
* Includes instructions for setting up the build environment (`env.json`, keys in Secret Manager).
* Details how to launch a cluster using the custom image, including setup for private networks with Secure Web Proxy using scripts from the `GoogleCloudDataproc/cloud-dataproc` repo.
* Adds verification steps for the driver and signature on the created cluster.
* Clarified that the init action applies to master (-m) and worker (-w) nodes.
These changes collectively improve the robustness and reliability of the GPU initialization script and provide crucial documentation for complex setups involving Secure Boot and network proxies. Successfully tested on Dataproc 2.2-debian12.
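The Mamba-preferred, Conda-fallback behavior described above could look roughly like this; the environment name, channels, and error message are illustrative only:

```bash
# Prefer mamba for speed, but fall back to conda if mamba is missing or fails
# (e.g. the multi-download failures seen behind some proxies).
env_name="${gpu_conda_env:-dpgce}"
if command -v mamba >/dev/null 2>&1 && \
   mamba create -y -n "${env_name}" -c conda-forge -c nvidia "${conda_pkg_list[@]}"; then
  echo "environment ${env_name} created with mamba"
elif conda create -y -n "${env_name}" -c conda-forge -c nvidia "${conda_pkg_list[@]}"; then
  echo "mamba unavailable or failed; environment ${env_name} created with conda"
else
  echo "ERROR: could not create ${env_name} with mamba or conda" >&2
  exit 1
fi
```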
/gemini review
Code Review
This is a significant and valuable pull request that greatly enhances the GPU installation script's reliability, especially in complex environments with proxies and Secure Boot. The refactoring for proxy handling is comprehensive, and the new GPG key import mechanism is a major improvement for robustness. The documentation updates in the README are thorough and will be very helpful for users. The changes are well-executed, but I've identified one high-severity issue related to directory stack management (pushd/popd) in the new execute_github_driver_build function that should be addressed to prevent potential issues with script execution flow.
if ${gsutil_stat_cmd} "${gcs_tarball}.building" ; then ${gsutil_cmd} rm "${gcs_tarball}.building" || true ; fi
building_file=""
rm "${local_tarball}"
make clean
The pushd on line 1137 is not balanced with a popd within this function. This causes the caller (build_driver_from_github) to have an unbalanced directory stack, which can lead to subsequent commands executing in an unexpected directory. To ensure this function is self-contained and doesn't alter the directory stack for its caller, a popd should be added before it exits.
Suggested change:

 make clean
+popd
/gcbrun
- Add GCS caching for CUDA runfiles in `set_cuda_runfile_url`. The script now checks gs://[PKG_BUCKET]/nvidia/ for the runfile before downloading from NVIDIA, and uploads after a fresh download.
- Define `tmpdir` in the deferred GPU config script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) to prevent unbound variable errors.
- Minor adjustments to `set_proxy` to ensure METADATA_HTTP_PROXY_PEM_URI is read.
- Make disk zeroing errors non-fatal during custom image builds.
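A minimal sketch of the cache-then-download pattern described in the first bullet; the bucket layout follows the commit message, while the runfile URL variable is a placeholder:

```bash
# Check the cluster's package bucket before going out to NVIDIA, and seed the
# cache after any fresh download so later nodes can reuse it.
runfile_name="$(basename "${NVIDIA_CUDA_URL}")"            # NVIDIA_CUDA_URL is a placeholder
gcs_runfile="gs://${PKG_BUCKET}/nvidia/${runfile_name}"
if ${gsutil_stat_cmd} "${gcs_runfile}" ; then
  ${gsutil_cmd} cp "${gcs_runfile}" "${tmpdir}/${runfile_name}"
else
  curl ${curl_retry_args} -o "${tmpdir}/${runfile_name}" "${NVIDIA_CUDA_URL}"
  ${gsutil_cmd} cp "${tmpdir}/${runfile_name}" "${gcs_runfile}" || true
fi
```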
/gcbrun

1 similar comment

/gcbrun

/gcbrun

/gcbrun

/gcbrun

/gcbrun
I had to disable Rocky 8 tests entirely. None of them would work. I am not committed to Rocky on 2.0 and 2.1 for this release. Further work would require a customer issue filed to support the platform.

Note to self: this commit is 739fb91

GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This large update significantly improves the `install_gpu_driver.sh` script and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.

**1. `gpu/README.md`:**
* Documents the end-to-end process of using the `GoogleCloudDataproc/custom-images` repository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.
* Explains how to launch clusters from the custom image with `--shielded-secure-boot`. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from the `GoogleCloudDataproc/cloud-dataproc` repository for VPC, subnet, and proxy configuration.
* Documents the proxy metadata attributes: `http-proxy`, `https-proxy`, `proxy-uri`, `no-proxy`, and `http-proxy-pem-uri`.

**2. `gpu/install_gpu_driver.sh`:**
* **Proxy Configuration (`set_proxy`):**
  * Reads `http-proxy`, `https-proxy`, and `proxy-uri` metadata, determining the correct proxy values for HTTP and HTTPS.
  * Exports the `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` environment variables.
  * Updates `/etc/environment` with the current proxy settings.
  * Applies `gcloud` proxy settings only if the gcloud SDK version is 547.0.0 or greater.
  * Configures `apt` and `dnf` to use the proxy.
  * Ensures `dirmngr` or `gnupg2-smime` is installed and configures `dirmngr.conf` to use the HTTP proxy.
  * Installs the CA certificate from `http-proxy-pem-uri` into the system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.
* **GPG Key Import (`import_gpg_keys`):**
  * Introduces `import_gpg_keys` to handle GPG key fetching and importing in a proxy-aware manner using `curl` over HTTPS, replacing direct `gpg --recv-keys` calls to keyservers.
* **Conda Environment (`install_pytorch`):**
  * The package list now includes `numba`, `pytorch`, `tensorflow[and-cuda]`, `rapids`, `pyspark`, and `cuda-version<=${CUDA_VERSION}`. An explicit CUDA runtime (e.g., `cudart_spec`) is no longer added, allowing the solver more flexibility.
  * Uses the `install_gpu_driver-main` and `pytorch` sentinels to allow forced refreshes.
* **NVIDIA Driver Handling:**
  * `set_driver_version`: Uses `curl -I` for a more lightweight HEAD request to check URL validity.
  * `build_driver_from_github`: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.
  * `execute_github_driver_build`: Refactored to accept tarball paths. `popd` removed to balance `pushd` in caller. Removed a debug echo of the `sign-file` exit code.
  * Added `make -j$(nproc)` to `modules_install` for parallelization.
  * Post-install verification checks `modinfo` for `signer:` to confirm modules are signed.
  * `prepare_to_install`: Moved the `curl_retry_args` definition earlier.
  * `install_nvidia_gpu_driver`: Checks whether the `nvidia` module loads at the start and marks the install incomplete if not.
  * `main`: Added `mark_complete install_gpu_driver-main` at the end.
  * `configure_dkms_certs`: Always fetches keys from Secret Manager if `PSN` is set, to ensure `modulus_md5sum` is available.
  * `install_gpu_agent`: Checks that `METADATA_HTTP_PROXY_PEM_URI` is non-empty before using it.