15 changes: 1 addition & 14 deletions CLAUDE.md
@@ -115,7 +115,7 @@ make shellcheck
- `roles/dev-scripts/install-dev/files/pull-secret.json`: OpenShift pull secret

#### Kcli Method
- `vars/kcli-install.yml`: Variable override file for persistent configuration
- `vars/kcli.yml`: Variable override file for persistent configuration
- `roles/kcli/kcli-install/files/pull-secret.json`: OpenShift pull secret
- SSH key automatically read from `~/.ssh/id_ed25519.pub` on ansible controller

@@ -150,18 +150,13 @@ The repository includes comprehensive README files in `deploy/openshift-clusters

## Development Guidelines and Standards

### Critical Repository Structure Rules

**IMPORTANT**: The `kcli/` directory is included for reference only and should NEVER be modified. It contains the upstream kcli tool that we integrate with, but all development work happens in the `deploy/` and `docs/` directories.

### File Organization

**Development Areas:**
- **`deploy/`**: All deployment automation and infrastructure code
- `deploy/aws-hypervisor/`: AWS hypervisor setup scripts
- `deploy/openshift-clusters/`: OpenShift cluster deployment with Ansible
- **`docs/`**: Project documentation for different topologies
- **`kcli/`**: **READ-ONLY** - Reference copy of upstream kcli tool (DO NOT MODIFY)

### Coding Standards

@@ -213,7 +208,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters
### Development Workflow Rules

#### When Making Changes
- **NEVER modify anything in the `kcli/` directory** - it's reference material only
- Focus changes on `deploy/` scripts and `docs/` documentation
- Consider impact on multiple virtualization providers when updating deployment scripts
- Test deployment scenarios end-to-end
@@ -222,13 +216,6 @@ The repository includes comprehensive README files in `deploy/openshift-clusters
- Check for credential exposure in logs or output
- Validate Ansible playbooks and shell scripts before committing

#### Working with kcli Integration
- Use `kcli/` directory as reference for understanding kcli capabilities
- Study `kcli/kvirt/providers/` to understand provider implementations
- Reference `kcli/kvirt/cluster/openshift/` for OpenShift deployment patterns
- Check `kcli/samples/` for configuration examples
- **Remember**: Read from kcli for understanding, implement in `deploy/` for our use

### Dependencies and Configuration

#### Dependencies
2 changes: 1 addition & 1 deletion deploy/openshift-clusters/README-external-host.md
@@ -57,7 +57,7 @@ See [hands-off deployment](../aws-hypervisor/README.md#automated-rhsm-registrati

#### Option B: Local Variable File
```bash
cp vars/init-host.yml.sample vars/init-host.yml.local
cp vars/init-host.yml vars/init-host.yml.local
# Edit vars/init-host.yml.local with your credentials
```

10 changes: 5 additions & 5 deletions deploy/openshift-clusters/README-kcli.md
@@ -91,7 +91,7 @@ You can configure the deployment using any combination of these methods (in prec
1. **Command line variables** (highest precedence)
2. **Playbook vars section**
3. **vars/kcli.yml** (user configuration file)
4. **Role defaults** (lowest precedence) (`roles/kcli/kcli-install/defaults/main.yml`)
4. **Role defaults** (lowest precedence) (`vars/kcli.yml.template`)

For simple overrides, the command line is recommended. For setting your preferred permanent config, copy [kcli.yml.template](vars/kcli.yml.template) to [kcli.yml](vars/kcli.yml) and update the values to your preference. This file is not tracked by Git and will persist between TNT updates.
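
For example, a minimal sketch of both approaches (assuming you run from `deploy/openshift-clusters/`, where `kcli-install.yml` and `vars/` live):

```bash
# Sketch only: persist preferred settings in the untracked vars file,
# then edit the copied file to taste.
cp vars/kcli.yml.template vars/kcli.yml
"${EDITOR:-vi}" vars/kcli.yml

# Or override a single value for one run (highest precedence);
# vm_memory is one of the parameters listed in the table below.
ansible-playbook kcli-install.yml -e vm_memory=65536
```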

@@ -126,8 +126,8 @@ ansible-playbook kcli-install.yml \
| `vm_memory` | `32768` | Memory per node (MB) |
| `vm_numcpus` | `16` | CPU cores per node |
| `vm_disk_size` | `120` | Disk size per node (GB) |
| `ocp_version` | `"stable"` | OpenShift version channel |
| `ocp_tag` | `"4.19"` | Specific version tag |
| `ocp_version` | `"candidate"` | OpenShift version channel |
| `ocp_tag` | `"4.20"` | Specific version tag |
| `network_name` | `"default"` | kcli network name |
| `bmc_user` | `"admin"` | BMC username (fencing) |
| `bmc_password` | `"admin123"` | BMC password (fencing) |
@@ -141,7 +141,7 @@ topology: "fencing"
bmc_user: "admin"
bmc_password: "admin123"
bmc_driver: "redfish"
ksushy_port: 8000
ksushy_port: 9000
```

## 5. Deployment
@@ -305,7 +305,7 @@ The playbook uses reasonable defaults that work for typical kcli deployments:
| `ksushy_ip` | `192.168.122.1` | Standard libvirt network gateway |
| `bmc_user` | `admin` | From kcli-install defaults |
| `bmc_password` | `admin123` | From kcli-install defaults |
| `ksushy_port` | `8000` | From kcli-install defaults |
| `ksushy_port` | `9000` | From kcli-install defaults |

These defaults work for standard kcli deployments where VMs use the default libvirt network (`192.168.122.x/24`).

@@ -79,8 +79,12 @@
shell: |
# Get the SSH key Ansible is using (check in order of preference)
if [ -n "$SSH_AUTH_SOCK" ]; then
# If using ssh-agent, get the first key
ssh-add -L 2>/dev/null | head -n1 && exit 0
# If using ssh-agent, get the first key (filter out error messages)
KEY=$(ssh-add -L 2>/dev/null | grep -v "^The agent has no identities" | head -n1)
if [ -n "$KEY" ]; then
echo "$KEY"
exit 0
fi
fi

# Check common key locations
@@ -102,20 +106,44 @@
local_ssh_pubkey_content: "{{ detected_ssh_key.stdout | trim }}"
when: detected_ssh_key.rc == 0

- name: Validate SSH public key format
delegate_to: localhost
set_fact:
ssh_key_valid: "{{ local_ssh_pubkey_content is defined and local_ssh_pubkey_content is match('^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp256|ecdsa-sha2-nistp384|ecdsa-sha2-nistp521|ssh-dss) ') }}"

- name: Warn if no valid SSH key found
debug:
msg: |
WARNING: No valid SSH public key detected on the local machine.
ProxyJump SSH access to cluster VMs will not work.
Ensure you have an SSH key pair in ~/.ssh/ (id_ed25519, id_rsa, etc.)
when: not (ssh_key_valid | default(false))

- name: Add local user's SSH key to cluster VMs
loop: "{{ parsed_vm_entries }}"
when:
- local_ssh_pubkey_content is defined
- local_ssh_pubkey_content | length > 0
- ssh_key_valid | default(false)
shell: |
set -e
VM_IP="{{ item.ip }}"

# ssh-keyscan can use bare IP for both IPv4 and IPv6
ssh-keyscan -H "$VM_IP" >> ~/.ssh/known_hosts 2>/dev/null || true

# SSH to bare IP (works for both IPv4 and IPv6)
ssh -o StrictHostKeyChecking=no core@"$VM_IP" "echo '{{ local_ssh_pubkey_content }}' >> ~/.ssh/authorized_keys && sort -u ~/.ssh/authorized_keys -o ~/.ssh/authorized_keys" || true
ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 core@"$VM_IP" \
"echo '{{ local_ssh_pubkey_content }}' >> ~/.ssh/authorized_keys && sort -u ~/.ssh/authorized_keys -o ~/.ssh/authorized_keys"
register: ssh_key_propagation
changed_when: false
retries: 3
delay: 5
until: ssh_key_propagation.rc == 0

- name: Display SSH key propagation results
debug:
msg: "SSH key added to {{ item.item.name }} ({{ item.item.ip }})"
loop: "{{ ssh_key_propagation.results }}"
when: ssh_key_propagation is defined

- name: Update inventory file with cluster VMs
delegate_to: localhost
@@ -24,7 +24,7 @@ The install-dev role handles the complete setup of OpenShift bare metal developm
- `dev_scripts_path`: Path to dev-scripts directory (default: "openshift-metal3/dev-scripts")
- `dev_scripts_branch`: Git branch to use (default: "master")
- `test_cluster_name`: OpenShift cluster name (default: "ostest")
- `method`: Deployment method (default: "ipi")
- `method`: Deployment method (set by calling playbook, e.g., "ipi")
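
A hedged usage sketch, assuming these role variables can be overridden as standard Ansible extra-vars when running the `setup.yml` playbook referenced under Usage (values are illustrative only):

```bash
# Sketch only: override selected install-dev role defaults for a single run.
ansible-playbook setup.yml \
  -e dev_scripts_branch=master \
  -e test_cluster_name=mytest
```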

### Computed Variables (vars/main.yml)

@@ -42,10 +42,11 @@ ansible-playbook setup.yml

## Task Structure

- `dev-scripts.yml`: Dev-scripts environment setup
- `create.yml`: OpenShift cluster creation (conditional)
- `proxy.yml`: Proxy configuration setup
- `main.yml`: Orchestrates all tasks and configures aliases
- `bounce.yml`: Cluster bounce/restart operations
- `check_vars.yml`: Variable validation
- `config.yml`: Configuration setup
- `teardown.yml`: Cluster teardown operations

## Notes

@@ -78,7 +78,7 @@ This role follows the same authentication file conventions as the dev-scripts ro
- `vm_disk_size`: Disk size per node in GB (default: 120)

### OpenShift Version
See [defaults](../kcli-install/defaults/main.yml.template) for default values
See [defaults](../../../vars/kcli.yml.template) for default values

If you're installing a specific OpenShift release image, you will need to set the proper channel in `ocp_version`:
- `ocp_version`: OpenShift version channel
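
A hedged sketch of pinning a specific release in `vars/kcli.yml` (variable names taken from the parameters table in README-kcli.md):

```bash
# Sketch: persist a specific OpenShift channel/tag in the untracked vars file.
# ocp_version selects the channel, ocp_tag the release stream.
cat >> vars/kcli.yml <<'EOF'
ocp_version: "candidate"
ocp_tag: "4.20"
EOF
```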
2 changes: 1 addition & 1 deletion deploy/openshift-clusters/roles/proxy-setup/README.md
@@ -31,7 +31,7 @@ This role enables easy access to OpenShift clusters deployed in restricted netwo
### Optional Variables

- `proxy_port`: Port for proxy service (default: 8213)
- `proxy_user`: Default user for squid configuration (default: ec2-user)
- `proxy_user`: User for squid configuration (auto-detected from system)

## Usage

63 changes: 62 additions & 1 deletion deploy/openshift-clusters/scripts/startup-cluster.sh
@@ -26,6 +26,14 @@ fi
INSTANCE_ID=$(cat "${SHARED_DIR}/aws-instance-id")
echo "Starting up OpenShift cluster VMs on instance ${INSTANCE_ID}..."

# Check cluster topology from state file
CLUSTER_STATE_FILE="${SHARED_DIR}/cluster-vm-state.json"
CLUSTER_TOPOLOGY=""
if [[ -f "${CLUSTER_STATE_FILE}" ]]; then
CLUSTER_TOPOLOGY=$(grep -o '"topology":[[:space:]]*"[^"]*"' "${CLUSTER_STATE_FILE}" | cut -d'"' -f4 2>/dev/null || echo "")
echo "Detected cluster topology: ${CLUSTER_TOPOLOGY:-unknown}"
fi

# Check current instance state
INSTANCE_STATE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].State.Name' --output text --no-cli-pager)

@@ -152,11 +160,64 @@ ssh "$(cat "${SHARED_DIR}/ssh_user")@${HOST_PUBLIC_IP}" << 'EOF'
echo ""
echo "You can check the cluster status as usual, depending on your setup."
echo "It might take a few minutes for the cluster to be fully ready."

# Clean up the cluster VMs list
rm -f ~/cluster-vms.txt
EOF

# Start sushy-tools BMC simulator for fencing topology
if [[ "${CLUSTER_TOPOLOGY}" == "fencing" ]]; then
echo ""
echo "Fencing topology detected. Ensuring sushy-tools BMC simulator is running..."

ssh "$(cat "${SHARED_DIR}/ssh_user")@${HOST_PUBLIC_IP}" << 'EOF'
# Check if sushy-tools container exists (dev-scripts deployment)
if sudo podman container exists sushy-tools 2>/dev/null; then
CONTAINER_STATUS=$(sudo podman inspect sushy-tools --format '{{.State.Status}}' 2>/dev/null || echo "unknown")
echo "sushy-tools container status: ${CONTAINER_STATUS}"

if [[ "${CONTAINER_STATUS}" == "running" ]]; then
echo "sushy-tools BMC simulator is already running"
else
echo "Starting sushy-tools container..."
sudo podman start sushy-tools

# Wait and verify
sleep 2
CONTAINER_STATUS=$(sudo podman inspect sushy-tools --format '{{.State.Status}}' 2>/dev/null || echo "unknown")
if [[ "${CONTAINER_STATUS}" == "running" ]]; then
echo "sushy-tools container started successfully"
else
echo "Warning: Failed to start sushy-tools container"
echo "STONITH fencing may not work properly"
echo "You can try manually: sudo podman start sushy-tools"
fi
fi
# Fallback: check for ksushy user service (kcli deployment)
elif systemctl --user list-unit-files ksushy.service &>/dev/null; then
KSUSHY_STATUS=$(systemctl --user is-active ksushy.service 2>/dev/null || echo "inactive")

if [[ "${KSUSHY_STATUS}" == "active" ]]; then
echo "ksushy BMC simulator is already running"
else
echo "Starting ksushy BMC simulator..."
systemctl --user start ksushy.service

sleep 2
if systemctl --user is-active ksushy.service &>/dev/null; then
echo "ksushy BMC simulator started successfully"
else
echo "Warning: Failed to start ksushy service"
echo "STONITH fencing may not work properly"
fi
fi
else
echo "Warning: No BMC simulator found (sushy-tools container or ksushy service)"
echo "STONITH fencing may not work properly"
fi
EOF
fi

echo ""
echo "OpenShift cluster startup completed successfully!"
echo "If you need to redeploy the cluster, use: make redeploy-cluster"
1 change: 1 addition & 0 deletions helpers/README.md
@@ -103,6 +103,7 @@ The `build-and-patch-resource-agents.yml` playbook automates the entire workflow
# From the deploy/ directory
# Simplest, no customization. Uses resource-agents repo, main branch, auto sets next version
make patch-nodes
```

#### Using Ansible Directly
