From fd364cec2299ed32ccb4ad428807f4b891736274 Mon Sep 17 00:00:00 2001 From: Pablo Fontanilla Date: Wed, 24 Dec 2025 13:00:04 +0100 Subject: [PATCH 1/4] NO-JIRA: Fix documentation inconsistencies with actual code MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - CLAUDE.md: Fix kcli vars filename, remove non-existent kcli/ directory references - helpers/README.md: Add missing code block closure - README-external-host.md: Fix vars sample filename - README-kcli.md: Fix defaults path, ocp_version, ocp_tag, ksushy_port values - proxy-setup/README.md: Clarify proxy_user is auto-detected - install-dev/README.md: Fix method variable docs and task structure - kcli-install/README.md: Fix defaults file link path 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 --- CLAUDE.md | 15 +-------------- deploy/openshift-clusters/README-external-host.md | 2 +- deploy/openshift-clusters/README-kcli.md | 10 +++++----- .../roles/dev-scripts/install-dev/README.md | 9 +++++---- .../roles/kcli/kcli-install/README.md | 2 +- .../roles/proxy-setup/README.md | 2 +- helpers/README.md | 1 + 7 files changed, 15 insertions(+), 26 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 9fc8463..ea34093 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -115,7 +115,7 @@ make shellcheck - `roles/dev-scripts/install-dev/files/pull-secret.json`: OpenShift pull secret #### Kcli Method -- `vars/kcli-install.yml`: Variable override file for persistent configuration +- `vars/kcli.yml`: Variable override file for persistent configuration - `roles/kcli/kcli-install/files/pull-secret.json`: OpenShift pull secret - SSH key automatically read from `~/.ssh/id_ed25519.pub` on ansible controller @@ -150,10 +150,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters ## Development Guidelines and Standards -### Critical Repository Structure Rules - -**IMPORTANT**: The `kcli/` directory is included for reference 
only and should NEVER be modified. It contains the upstream kcli tool that we integrate with, but all development work happens in the `deploy/` and `docs/` directories. - ### File Organization **Development Areas:** @@ -161,7 +157,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters - `deploy/aws-hypervisor/`: AWS hypervisor setup scripts - `deploy/openshift-clusters/`: OpenShift cluster deployment with Ansible - **`docs/`**: Project documentation for different topologies -- **`kcli/`**: **READ-ONLY** - Reference copy of upstream kcli tool (DO NOT MODIFY) ### Coding Standards @@ -213,7 +208,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters ### Development Workflow Rules #### When Making Changes -- **NEVER modify anything in the `kcli/` directory** - it's reference material only - Focus changes on `deploy/` scripts and `docs/` documentation - Consider impact on multiple virtualization providers when updating deployment scripts - Test deployment scenarios end-to-end @@ -222,13 +216,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters - Check for credential exposure in logs or output - Validate Ansible playbooks and shell scripts before committing -#### Working with kcli Integration -- Use `kcli/` directory as reference for understanding kcli capabilities -- Study `kcli/kvirt/providers/` to understand provider implementations -- Reference `kcli/kvirt/cluster/openshift/` for OpenShift deployment patterns -- Check `kcli/samples/` for configuration examples -- **Remember**: Read from kcli for understanding, implement in `deploy/` for our use - ### Dependencies and Configuration #### Dependencies diff --git a/deploy/openshift-clusters/README-external-host.md b/deploy/openshift-clusters/README-external-host.md index b376e0b..f32ed96 100644 --- a/deploy/openshift-clusters/README-external-host.md +++ b/deploy/openshift-clusters/README-external-host.md @@ -57,7 +57,7 @@ See 
[hands-off deployment](../aws-hypervisor/README.md#automated-rhsm-registrati #### Option B: Local Variable File ```bash -cp vars/init-host.yml.sample vars/init-host.yml.local +cp vars/init-host.yml vars/init-host.yml.local # Edit vars/init-host.yml.local with your credentials ``` diff --git a/deploy/openshift-clusters/README-kcli.md b/deploy/openshift-clusters/README-kcli.md index 761fbab..10cb1a5 100644 --- a/deploy/openshift-clusters/README-kcli.md +++ b/deploy/openshift-clusters/README-kcli.md @@ -91,7 +91,7 @@ You can configure the deployment using any combination of these methods (in prec 1. **Command line variables** (highest precedence) 2. **Playbook vars section** 3. **vars/kcli.yml** (user configuration file) -4. **Role defaults** (lowest precedence) (`roles/kcli/kcli-install/defaults/main.yml`) +4. **Role defaults** (lowest precedence) (`vars/kcli.yml.template`) For simple overrides, the command line is recommended. For setting your preferred permanent config, copy [kcli.yml.template](vars/kcli.yml.template) to [kcli.yml](vars/kcli.yml) and update the values to your preference. This file is not tracked by Git and will persist between TNT updates. @@ -126,8 +126,8 @@ ansible-playbook kcli-install.yml \ | `vm_memory` | `32768` | Memory per node (MB) | | `vm_numcpus` | `16` | CPU cores per node | | `vm_disk_size` | `120` | Disk size per node (GB) | -| `ocp_version` | `"stable"` | OpenShift version channel | -| `ocp_tag` | `"4.19"` | Specific version tag | +| `ocp_version` | `"candidate"` | OpenShift version channel | +| `ocp_tag` | `"4.20"` | Specific version tag | | `network_name` | `"default"` | kcli network name | | `bmc_user` | `"admin"` | BMC username (fencing) | | `bmc_password` | `"admin123"` | BMC password (fencing) | @@ -141,7 +141,7 @@ topology: "fencing" bmc_user: "admin" bmc_password: "admin123" bmc_driver: "redfish" -ksushy_port: 8000 +ksushy_port: 9000 ``` ## 5. 
Deployment @@ -305,7 +305,7 @@ The playbook uses reasonable defaults that work for typical kcli deployments: | `ksushy_ip` | `192.168.122.1` | Standard libvirt network gateway | | `bmc_user` | `admin` | From kcli-install defaults | | `bmc_password` | `admin123` | From kcli-install defaults | -| `ksushy_port` | `8000` | From kcli-install defaults | +| `ksushy_port` | `9000` | From kcli-install defaults | These defaults work for standard kcli deployments where VMs use the default libvirt network (`192.168.122.x/24`). diff --git a/deploy/openshift-clusters/roles/dev-scripts/install-dev/README.md b/deploy/openshift-clusters/roles/dev-scripts/install-dev/README.md index 22839f7..e40a36a 100644 --- a/deploy/openshift-clusters/roles/dev-scripts/install-dev/README.md +++ b/deploy/openshift-clusters/roles/dev-scripts/install-dev/README.md @@ -24,7 +24,7 @@ The install-dev role handles the complete setup of OpenShift bare metal developm - `dev_scripts_path`: Path to dev-scripts directory (default: "openshift-metal3/dev-scripts") - `dev_scripts_branch`: Git branch to use (default: "master") - `test_cluster_name`: OpenShift cluster name (default: "ostest") -- `method`: Deployment method (default: "ipi") +- `method`: Deployment method (set by calling playbook, e.g., "ipi") ### Computed Variables (vars/main.yml) @@ -42,10 +42,11 @@ ansible-playbook setup.yml ## Task Structure -- `dev-scripts.yml`: Dev-scripts environment setup -- `create.yml`: OpenShift cluster creation (conditional) -- `proxy.yml`: Proxy configuration setup - `main.yml`: Orchestrates all tasks and configures aliases +- `bounce.yml`: Cluster bounce/restart operations +- `check_vars.yml`: Variable validation +- `config.yml`: Configuration setup +- `teardown.yml`: Cluster teardown operations ## Notes diff --git a/deploy/openshift-clusters/roles/kcli/kcli-install/README.md b/deploy/openshift-clusters/roles/kcli/kcli-install/README.md index 0f3b638..e8c06b8 100644 --- 
a/deploy/openshift-clusters/roles/kcli/kcli-install/README.md +++ b/deploy/openshift-clusters/roles/kcli/kcli-install/README.md @@ -78,7 +78,7 @@ This role follows the same authentication file conventions as the dev-scripts ro - `vm_disk_size`: Disk size per node in GB (default: 120) ### OpenShift Version -See [defaults](../kcli-install/defaults/main.yml.template) for default values +See [defaults](../../../vars/kcli.yml.template) for default values If you're installing a specific openshift release image, you will need to set the proper channel in ocp_version - `ocp_version`: OpenShift version channel diff --git a/deploy/openshift-clusters/roles/proxy-setup/README.md b/deploy/openshift-clusters/roles/proxy-setup/README.md index 3fb409c..fece8c0 100644 --- a/deploy/openshift-clusters/roles/proxy-setup/README.md +++ b/deploy/openshift-clusters/roles/proxy-setup/README.md @@ -31,7 +31,7 @@ This role enables easy access to OpenShift clusters deployed in restricted netwo ### Optional Variables - `proxy_port`: Port for proxy service (default: 8213) -- `proxy_user`: Default user for squid configuration (default: ec2-user) +- `proxy_user`: User for squid configuration (auto-detected from system) ## Usage diff --git a/helpers/README.md b/helpers/README.md index a6faaf1..150350b 100644 --- a/helpers/README.md +++ b/helpers/README.md @@ -103,6 +103,7 @@ The `build-and-patch-resource-agents.yml` playbook automates the entire workflow # From the deploy/ directory # Simplest, no customization. 
Uses resource-agents repo, main branch, auto sets next version make patch-nodes +``` #### Using Ansible Directly From cfd94a626bcbe771b1cf98c2c9afb16c0e6c3a6c Mon Sep 17 00:00:00 2001 From: Pablo Fontanilla Date: Wed, 24 Dec 2025 13:09:21 +0100 Subject: [PATCH 2/4] Fix SSH key propagation to cluster VMs failing silently MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The task was using `|| true` which hid failures when adding local SSH keys to cluster VMs, causing ProxyJump connections to fail with "Permission denied". Now properly fails with retries and shows results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../common/tasks/update-cluster-inventory.yml | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml b/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml index 3d74ffb..c982774 100644 --- a/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml +++ b/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml @@ -108,14 +108,26 @@ - local_ssh_pubkey_content is defined - local_ssh_pubkey_content | length > 0 shell: | + set -e VM_IP="{{ item.ip }}" # ssh-keyscan can use bare IP for both IPv4 and IPv6 ssh-keyscan -H "$VM_IP" >> ~/.ssh/known_hosts 2>/dev/null || true # SSH to bare IP (works for both IPv4 and IPv6) - ssh -o StrictHostKeyChecking=no core@"$VM_IP" "echo '{{ local_ssh_pubkey_content }}' >> ~/.ssh/authorized_keys && sort -u ~/.ssh/authorized_keys -o ~/.ssh/authorized_keys" || true + ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 core@"$VM_IP" \ + "echo '{{ local_ssh_pubkey_content }}' >> ~/.ssh/authorized_keys && sort -u ~/.ssh/authorized_keys -o ~/.ssh/authorized_keys" + register: ssh_key_propagation changed_when: false + retries: 3 + delay: 5 + until: ssh_key_propagation.rc == 0 + + - name: Display SSH 
key propagation results + debug: + msg: "SSH key added to {{ item.item.name }} ({{ item.item.ip }})" + loop: "{{ ssh_key_propagation.results }}" + when: ssh_key_propagation is defined - name: Update inventory file with cluster VMs delegate_to: localhost From 9284f6c5bf328b6897c66cc31bf362ff69e51768 Mon Sep 17 00:00:00 2001 From: Pablo Fontanilla Date: Wed, 24 Dec 2025 15:08:48 +0100 Subject: [PATCH 3/4] Start sushy-tools BMC simulator for fencing topology on cluster startup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When starting a fencing topology cluster, ensure the BMC simulator is running so STONITH fencing works properly. Supports both dev-scripts (sushy-tools container) and kcli (ksushy systemd service) deployments. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../scripts/startup-cluster.sh | 63 ++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/deploy/openshift-clusters/scripts/startup-cluster.sh b/deploy/openshift-clusters/scripts/startup-cluster.sh index 4e31bb5..95adb52 100755 --- a/deploy/openshift-clusters/scripts/startup-cluster.sh +++ b/deploy/openshift-clusters/scripts/startup-cluster.sh @@ -26,6 +26,14 @@ fi INSTANCE_ID=$(cat "${SHARED_DIR}/aws-instance-id") echo "Starting up OpenShift cluster VMs on instance ${INSTANCE_ID}..." 
+# Check cluster topology from state file +CLUSTER_STATE_FILE="${SHARED_DIR}/cluster-vm-state.json" +CLUSTER_TOPOLOGY="" +if [[ -f "${CLUSTER_STATE_FILE}" ]]; then + CLUSTER_TOPOLOGY=$(grep -o '"topology":[[:space:]]*"[^"]*"' "${CLUSTER_STATE_FILE}" | cut -d'"' -f4 2>/dev/null || echo "") + echo "Detected cluster topology: ${CLUSTER_TOPOLOGY:-unknown}" +fi + # Check current instance state INSTANCE_STATE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].State.Name' --output text --no-cli-pager) @@ -152,11 +160,64 @@ ssh "$(cat "${SHARED_DIR}/ssh_user")@${HOST_PUBLIC_IP}" << 'EOF' echo "" echo "You can check the cluster status as usual, depending on your setup." echo "It might take a few minutes for the cluster to be fully ready." - + # Clean up the cluster VMs list rm -f ~/cluster-vms.txt EOF +# Start sushy-tools BMC simulator for fencing topology +if [[ "${CLUSTER_TOPOLOGY}" == "fencing" ]]; then + echo "" + echo "Fencing topology detected. Ensuring sushy-tools BMC simulator is running..." + + ssh "$(cat "${SHARED_DIR}/ssh_user")@${HOST_PUBLIC_IP}" << 'EOF' + # Check if sushy-tools container exists (dev-scripts deployment) + if sudo podman container exists sushy-tools 2>/dev/null; then + CONTAINER_STATUS=$(sudo podman inspect sushy-tools --format '{{.State.Status}}' 2>/dev/null || echo "unknown") + echo "sushy-tools container status: ${CONTAINER_STATUS}" + + if [[ "${CONTAINER_STATUS}" == "running" ]]; then + echo "sushy-tools BMC simulator is already running" + else + echo "Starting sushy-tools container..." 
+ sudo podman start sushy-tools + + # Wait and verify + sleep 2 + CONTAINER_STATUS=$(sudo podman inspect sushy-tools --format '{{.State.Status}}' 2>/dev/null || echo "unknown") + if [[ "${CONTAINER_STATUS}" == "running" ]]; then + echo "sushy-tools container started successfully" + else + echo "Warning: Failed to start sushy-tools container" + echo "STONITH fencing may not work properly" + echo "You can try manually: sudo podman start sushy-tools" + fi + fi + # Fallback: check for ksushy user service (kcli deployment) + elif systemctl --user list-unit-files ksushy.service &>/dev/null; then + KSUSHY_STATUS=$(systemctl --user is-active ksushy.service 2>/dev/null || echo "inactive") + + if [[ "${KSUSHY_STATUS}" == "active" ]]; then + echo "ksushy BMC simulator is already running" + else + echo "Starting ksushy BMC simulator..." + systemctl --user start ksushy.service + + sleep 2 + if systemctl --user is-active ksushy.service &>/dev/null; then + echo "ksushy BMC simulator started successfully" + else + echo "Warning: Failed to start ksushy service" + echo "STONITH fencing may not work properly" + fi + fi + else + echo "Warning: No BMC simulator found (sushy-tools container or ksushy service)" + echo "STONITH fencing may not work properly" + fi +EOF +fi + echo "" echo "OpenShift cluster startup completed successfully!" echo "If you need to redeploy the cluster, use: make redeploy-cluster" \ No newline at end of file From 85b930d7461974fce47ffbc20e30b060aff0cd40 Mon Sep 17 00:00:00 2001 From: Pablo Fontanilla Date: Wed, 24 Dec 2025 17:46:18 +0100 Subject: [PATCH 4/4] Fix SSH key detection when ssh-agent has no identities MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The ssh-add -L command returns exit code 0 even when it outputs "The agent has no identities", causing garbage to be added to authorized_keys. Now filters out error messages and validates the key format before propagating. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../common/tasks/update-cluster-inventory.yml | 24 +++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml b/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml index c982774..a295847 100644 --- a/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml +++ b/deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml @@ -79,8 +79,12 @@ shell: | # Get the SSH key Ansible is using (check in order of preference) if [ -n "$SSH_AUTH_SOCK" ]; then - # If using ssh-agent, get the first key - ssh-add -L 2>/dev/null | head -n1 && exit 0 + # If using ssh-agent, get the first key (filter out error messages) + KEY=$(ssh-add -L 2>/dev/null | grep -v "^The agent has no identities" | head -n1) + if [ -n "$KEY" ]; then + echo "$KEY" + exit 0 + fi fi # Check common key locations @@ -102,11 +106,23 @@ local_ssh_pubkey_content: "{{ detected_ssh_key.stdout | trim }}" when: detected_ssh_key.rc == 0 + - name: Validate SSH public key format + delegate_to: localhost + set_fact: + ssh_key_valid: "{{ local_ssh_pubkey_content is defined and local_ssh_pubkey_content is match('^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp256|ecdsa-sha2-nistp384|ecdsa-sha2-nistp521|ssh-dss) ') }}" + + - name: Warn if no valid SSH key found + debug: + msg: | + WARNING: No valid SSH public key detected on the local machine. + ProxyJump SSH access to cluster VMs will not work. + Ensure you have an SSH key pair in ~/.ssh/ (id_ed25519, id_rsa, etc.) + when: not (ssh_key_valid | default(false)) + - name: Add local user's SSH key to cluster VMs loop: "{{ parsed_vm_entries }}" when: - - local_ssh_pubkey_content is defined - - local_ssh_pubkey_content | length > 0 + - ssh_key_valid | default(false) shell: | set -e VM_IP="{{ item.ip }}"
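
The ssh-agent pitfall that PATCH 4/4 fixes — `ssh-add -L` exiting 0 while printing "The agent has no identities" — can be reproduced and guarded against outside Ansible. The sketch below mirrors the patch's filter-then-validate approach as a plain POSIX shell snippet; it is illustrative only, and the function names (`detect_pubkey`, `is_valid_pubkey`) are mine, not from the repository:

```shell
#!/bin/sh
# Sketch of the key-detection logic from PATCH 4/4 (assumed helper names).

detect_pubkey() {
    # Prefer a key loaded in ssh-agent, filtering out the
    # "no identities" message that ssh-add -L emits with exit code 0.
    if [ -n "${SSH_AUTH_SOCK:-}" ]; then
        KEY=$(ssh-add -L 2>/dev/null | grep -v "^The agent has no identities" | head -n1)
        if [ -n "$KEY" ]; then
            printf '%s\n' "$KEY"
            return 0
        fi
    fi
    # Fall back to common on-disk public keys.
    for f in "$HOME/.ssh/id_ed25519.pub" "$HOME/.ssh/id_rsa.pub"; do
        if [ -f "$f" ]; then
            cat "$f"
            return 0
        fi
    done
    return 1
}

is_valid_pubkey() {
    # Accept only lines that start with a known OpenSSH key-type token,
    # matching the regex used in the "Validate SSH public key format" task.
    printf '%s\n' "$1" |
        grep -Eq '^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)|ssh-dss) '
}
```

Gating the propagation step on `is_valid_pubkey` (rather than on a non-empty string) is what prevents the error message itself from landing in `authorized_keys`.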