
Commit 157deee

ccardenosa and claude authored
feat(telco-kpi): add lock-free job preemption based on OCP version priority (#72894)
Problem:
Multiple Telco KPI Prow jobs compete for the same baremetal host. Lower OCP version jobs can block higher version jobs for extended periods, delaying critical testing for newer releases.

Solution:
Implement a lock-free preemption mechanism where higher OCP version jobs can signal lower version jobs to quit, freeing the baremetal host sooner.

How it works:

1. WAITING PHASE: Each job creates a unique waiting file on the bastion BEFORE attempting to acquire the lock:
     <lock>.waiting.<nanosecond_timestamp>.<ocp_version>
   Example: spoke-baremetal-50-7c-6f-5c-47-8c.lock.waiting.1766568440841947242.4.22
   This ensures the job's presence is visible even if it immediately gets the lock.

2. LOCK ACQUISITION: When a job acquires the lock, it checks for higher priority waiters BEFORE removing its own waiting file (deferred deletion). If higher priority found, it releases the lock and keeps its waiting file for retry. Only when no higher priority is found does it remove its waiting file.

3. PERIODIC CHECKS: While holding the lock, the job periodically checks for higher priority waiters at key points:
   - cluster-install: every QUIT_CHECK_INTERVAL iterations (default: 3)
   - oslat test: before running tests
   - cpu-util test: before running tests

4. QUIT MODES:
   - 'graceful' (exit 0): Used by test steps. Allows remaining steps like PTP reporting to complete. Job exits cleanly.
   - 'force' (exit 1): Used by cluster-install. If installation is interrupted, remaining steps are meaningless. Job aborts immediately.

5. CLEANUP: Each job always removes its own waiting file during cleanup, regardless of whether it acquired the lock. This prevents orphaned files.

Priority logic:
- ONLY the OCP version determines priority (e.g., 4.22 > 4.20)
- The nanosecond timestamp is NOT used for priority decisions
- Same-version jobs (e.g., two 4.22 jobs) compete equally for the lock
- Timestamp is used solely for: (1) unique filenames, (2) self-cleanup

Key benefits:
- Lock-free: No shared mutable state, each job manages its own file
- Race-safe: Nanosecond timestamps ensure unique filenames
- Deferred deletion: Waiting file persists until validation passes
- Self-cleaning: Jobs clean up only their own files
- Configurable: QUIT_CHECK_INTERVAL controls check frequency

New shared functions:
- extract_ocp_version: Gets version from JOB_NAME
- create_waiting_request_file: Creates unique waiting file
- remove_own_waiting_file: Removes job's waiting file
- check_for_higher_priority_waiter: Scans waiting files for higher version
- should_quit: Determines if quit is needed
- check_for_quit: Main entry point (supports graceful/force modes)

Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
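The waiting-file naming scheme and the version-only priority rule can be sketched locally as follows. This is a minimal illustration, not the commit's shared functions: `make_waiting_name`, `version_of`, `is_newer`, and `find_higher_waiter` are hypothetical names, the scan runs against a local directory instead of the bastion over SSH, the `continue` result string is an assumption (only the `quit:<version>` form appears in the diff), and comparing versions with `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helpers; names are illustrative and not taken from the commit.

# Build "<lock>.waiting.<nanosecond_timestamp>.<ocp_version>".
make_waiting_name() {
  local lock="$1" version="$2"
  printf '%s.waiting.%s.%s\n' "${lock}" "$(date +%s%N)" "${version}"
}

# Recover the OCP version: everything after ".waiting.<timestamp>.".
version_of() {
  local name="$1"
  local tail="${name##*.waiting.}"   # "<timestamp>.<version>"
  printf '%s\n' "${tail#*.}"         # drop the timestamp field
}

# Succeed when version $1 is strictly newer than version $2.
is_newer() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Print "quit:<version>" for the newest waiter above our version,
# otherwise print "continue". The timestamp plays no role in the decision.
find_higher_waiter() {
  local lock="$1" own_version="$2" best="" f v
  for f in "${lock}".waiting.*; do
    [ -e "${f}" ] || continue        # glob matched nothing
    v="$(version_of "${f}")"
    if is_newer "${v}" "${own_version}" &&
       { [ -z "${best}" ] || is_newer "${v}" "${best}"; }; then
      best="${v}"
    fi
  done
  if [ -n "${best}" ]; then
    printf 'quit:%s\n' "${best}"
  else
    printf 'continue\n'
  fi
}
```

With waiting files for 4.22 and 4.20 present, a 4.20 job would see `quit:4.22`, while a 4.22 job would see `continue`, which matches the rule that same-version jobs compete equally.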
1 parent 86776ad commit 157deee

10 files changed: +599 -18 lines changed

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/hacks/clean-up/telcov10n-metal-single-node-spoke-kpis-hacks-clean-up-commands.sh

Lines changed: 33 additions & 11 deletions
@@ -15,16 +15,11 @@ function release_locked_host {
   network_spoke_mac_address=$(cat $SHARED_DIR/hosts.yaml|grep 'mac:'|awk -F'mac:' '{print $2}'|tr -d '[:blank:]')
   local spoke_lock_filename="/var/run/lock/ztp-baremetal-pool/spoke-baremetal-${network_spoke_mac_address//:/-}.lock"

-  echo "************ telcov10n Releasing Lock for the host used by this Spoke cluster deployemnt ************"
+  echo "************ telcov10n Releasing Lock for the host used by this Spoke cluster deployment ************"

   set -x
-  timeout -s 9 10m ssh "${SSHOPTS[@]}" "root@${AUX_HOST}" bash -s -- \
-    "${spoke_lock_filename}" << 'EOF'
-set -o nounset
-set -o errexit
-set -o pipefail
-sudo rm -fv ${1}
-EOF
+  timeout -s 9 10m ssh "${SSHOPTS[@]}" "root@${AUX_HOST}" \
+    "sudo rm -fv ${spoke_lock_filename} && echo 'Lock released successfully.'"
   set +x
 }

@@ -48,17 +43,44 @@ function server_poweroff {

 }

+function cleanup_waiting_file {
+  # Always clean up our waiting file, regardless of lock status
+
+  if [ ! -f "${SHARED_DIR}/own_waiting_file.txt" ]; then
+    echo "[INFO] No waiting file to clean up."
+    return 0
+  fi
+
+  local own_waiting_file
+  own_waiting_file=$(cat "${SHARED_DIR}/own_waiting_file.txt")
+
+  if [ -z "${own_waiting_file}" ]; then
+    echo "[INFO] Waiting file path is empty, nothing to clean up."
+    return 0
+  fi
+
+  echo "************ telcov10n Cleaning up waiting file ************"
+  echo "[INFO] Removing waiting file: ${own_waiting_file}"
+
+  timeout -s 9 2m ssh "${SSHOPTS[@]}" "root@${AUX_HOST}" \
+    "rm -fv ${own_waiting_file} 2>/dev/null || true"
+}
+
 function main {

+  # Setup SSH access once for all operations
+  setup_aux_host_ssh_access
+
+  # Always clean up our waiting file first (handles both timeout and normal cases)
+  cleanup_waiting_file
+
+  # Only do lock-related cleanup if we hold the lock
   local does_the_current_job_hold_a_lock_to_use_a_baremetal_server
   does_the_current_job_hold_a_lock_to_use_a_baremetal_server=$( \
     cat ${SHARED_DIR}/do_you_hold_the_lock_for_the_sno_spoke_cluster_server.txt || echo "no")

   if [ "${does_the_current_job_hold_a_lock_to_use_a_baremetal_server}" == "yes" ]; then
-
-    setup_aux_host_ssh_access
     server_poweroff
-
     # This must be run the latest one since it releases its server lock
     release_locked_host
   fi
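Both removal commands in this file are now passed to SSH as a single string, so the remote shell parses the path a second time. A minimal sketch of one defensive way to build such a command is shown below. `shell_quote` and `build_remote_rm` are hypothetical helpers that do not appear in the commit, and the approach is only a suggestion for paths that might contain spaces or shell metacharacters.

```shell
# Hypothetical helpers (not from the commit).

# Wrap a string in single quotes so a remote shell treats it as one word;
# embedded single quotes are rewritten as '\''.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Build the removal command. An empty path is refused, because it would
# otherwise produce a bare "rm -fv" with nothing to remove.
build_remote_rm() {
  local path="$1"
  [ -n "${path}" ] || return 1
  printf 'rm -fv -- %s\n' "$(shell_quote "${path}")"
}
```

Under these assumptions the cleanup call could take the form `ssh "${SSHOPTS[@]}" "root@${AUX_HOST}" "$(build_remote_rm "${own_waiting_file}")"`.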

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/hacks/deploy/telcov10n-metal-single-node-spoke-kpis-hacks-deploy-commands.sh

Lines changed: 116 additions & 6 deletions
@@ -9,6 +9,27 @@ echo "************ telcov10n Fix user IDs in a container ************"

 source ${SHARED_DIR}/common-telcov10n-bash-functions.sh

+function extract_and_set_ocp_version {
+
+  echo "************ telcov10n Extracting OCP version from JOB_NAME ************"
+
+  echo "[INFO] JOB_NAME: ${JOB_NAME:-not set}"
+
+  OCP_VERSION=$(extract_ocp_version)
+
+  if [ -z "${OCP_VERSION}" ]; then
+    echo "[ERROR] Could not extract OCP version from JOB_NAME"
+    exit 1
+  fi
+
+  echo "[INFO] OCP Version: ${OCP_VERSION}"
+
+  # Store OCP version for other steps
+  echo -n "${OCP_VERSION}" >| ${SHARED_DIR}/ocp_version.txt
+
+  export OCP_VERSION
+}
+
 function define_spoke_cluster_name {

   #### Spoke cluster
@@ -36,6 +57,83 @@ function set_spoke_cluster_kubeconfig {
   export KUBECONFIG="${SHARED_DIR}/spoke-${secret_kubeconfig}.yaml"
 }

+# Track if we've already created the waiting request file (stored path for cleanup)
+WAITING_FILE_PATH=""
+
+function create_waiting_request_on_bastion {
+
+  local spoke_lock_filename="${1}"
+
+  # Only create the waiting file once per session
+  # Each job gets a unique file (with timestamp) that only it will delete
+  if [ -n "${WAITING_FILE_PATH}" ]; then
+    return 0
+  fi
+
+  echo
+  echo "************ telcov10n Registering wait request before lock attempt ************"
+  echo
+
+  local waiting_file
+  waiting_file=$(create_waiting_request_file "${AUX_HOST}" "${spoke_lock_filename}" "${OCP_VERSION}")
+
+  if [ -n "${waiting_file}" ]; then
+    echo "[INFO] Created waiting request file: ${waiting_file}"
+    echo " This signals that a job with OCP version ${OCP_VERSION} is waiting."
+    WAITING_FILE_PATH="${waiting_file}"
+    # Store the path in SHARED_DIR for cleanup step
+    echo -n "${waiting_file}" >| ${SHARED_DIR}/own_waiting_file.txt
+  else
+    echo "[WARNING] Failed to create waiting request file."
+  fi
+
+  echo
+}
+
+function validate_lock_for_higher_priority {
+
+  local spoke_lock_filename="${1}"
+
+  echo
+  echo "************ telcov10n Validating lock acquisition for priority ************"
+  echo
+
+  # Check if there's a higher priority job waiting BEFORE removing our waiting file
+  # This way, if we need to release the lock, our waiting file stays intact
+  local check_result
+  check_result=$(check_for_higher_priority_waiter "${AUX_HOST}" "${spoke_lock_filename}" "${OCP_VERSION}")
+
+  if [[ "${check_result}" == quit:* ]]; then
+    local higher_version=${check_result#quit:}
+    echo
+    echo "[WARNING] Lock acquired but a higher priority job is waiting!"
+    echo " Current job version: ${OCP_VERSION}"
+    echo " Higher version waiting: ${higher_version}"
+    echo " Releasing lock to allow higher priority job to proceed..."
+    echo " (Keeping own waiting file for next attempt)"
+    echo
+    # Release the lock to let the higher priority job acquire it
+    # Keep our waiting file - we're still waiting!
+    timeout -s 9 10m ssh "${SSHOPTS[@]}" "root@${AUX_HOST}" "rm -fv ${spoke_lock_filename}"
+    return 1
+  fi
+
+  echo "[INFO] No higher priority jobs waiting. Proceeding with lock."
+
+  # NOW remove our waiting file since we're proceeding
+  if [ -n "${WAITING_FILE_PATH}" ]; then
+    echo "[INFO] Removing own waiting file: ${WAITING_FILE_PATH}"
+    remove_own_waiting_file "${AUX_HOST}" "${WAITING_FILE_PATH}"
+    WAITING_FILE_PATH=""
+    rm -f ${SHARED_DIR}/own_waiting_file.txt 2>/dev/null || true
+  fi
+
+  # Store lock filename for later use by other steps
+  echo -n "${spoke_lock_filename}" >| ${SHARED_DIR}/spoke_lock_filename.txt
+
+  return 0
+}
+
 function select_baremetal_host_from_pool {

   echo "************ telcov10n select a baremetal host from the pool ************"
@@ -60,13 +158,24 @@ function select_baremetal_host_from_pool {
       local network_spoke_mac_address
       network_spoke_mac_address="$(cat ${baremetal_host_path}/network_spoke_mac_address)"
       local spoke_lock_filename="/var/run/lock/ztp-baremetal-pool/spoke-baremetal-${network_spoke_mac_address//:/-}.lock"
+
+      # Create waiting request file BEFORE trying to acquire lock (only once)
+      # This ensures our presence is visible even if we immediately get the lock
+      create_waiting_request_on_bastion "${spoke_lock_filename}"
+
       try_to_lock_host "${AUX_HOST}" "${spoke_lock_filename}" "${host_lock_timestamp}" "${LOCK_TIMEOUT}"
-      [[ "$(check_the_host_was_locked "${AUX_HOST}" "${spoke_lock_filename}" "${host_lock_timestamp}")" == "locked" ]] &&
-      {
-        update_host_and_master_yaml_files "$(dirname ${host})" ;
-        echo -n "yes" >| ${SHARED_DIR}/do_you_hold_the_lock_for_the_sno_spoke_cluster_server.txt
-        return 0 ;
-      }
+      if [[ "$(check_the_host_was_locked "${AUX_HOST}" "${spoke_lock_filename}" "${host_lock_timestamp}")" == "locked" ]]; then
+        # Validate that no higher priority job is waiting
+        if validate_lock_for_higher_priority "${spoke_lock_filename}"; then
+          update_host_and_master_yaml_files "$(dirname ${host})"
+          echo -n "yes" >| ${SHARED_DIR}/do_you_hold_the_lock_for_the_sno_spoke_cluster_server.txt
+          return 0
+        else
+          # Higher priority job is waiting, lock was released
+          # Our waiting file is still intact (not removed until validation passes)
+          echo "[INFO] Will retry acquiring lock..."
+        fi
+      fi
     fi
   done

@@ -218,6 +327,7 @@ function hack_spoke_deployment {
 function main {

   setup_aux_host_ssh_access
+  extract_and_set_ocp_version
   define_spoke_cluster_name
   set_spoke_cluster_kubeconfig
   hack_spoke_deployment
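`extract_ocp_version` lives in the shared functions file, which is not part of this excerpt. As a rough illustration only, a version could be pulled out of a job name as sketched below. The job-name layout and the `<major>.<minor>` pattern are assumptions, `guess_ocp_version` is a hypothetical name, and the job name in the usage line is made up.

```shell
# Hypothetical: print the first "<major>.<minor>" token found in a job name,
# or nothing when the name carries no such token.
guess_ocp_version() {
  local job_name="${1:-${JOB_NAME:-}}"
  # "|| true" keeps an empty result from failing the caller when it runs
  # with errexit and pipefail enabled; the caller checks for emptiness.
  { printf '%s\n' "${job_name}" | grep -oE '[0-9]+\.[0-9]+' | head -n 1; } || true
}

# Usage with a made-up job name:
guess_ocp_version "periodic-ci-example-telco-kpi-4.22-nightly"
```

An empty result is what triggers the `[ERROR] Could not extract OCP version from JOB_NAME` branch in `extract_and_set_ocp_version`.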

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/hacks/deploy/telcov10n-metal-single-node-spoke-kpis-hacks-deploy-ref.yaml

Lines changed: 3 additions & 1 deletion
@@ -46,4 +46,6 @@ ref:
       If cluster endpoints are reachables through a socks5 proxy
   documentation: |-
     This step allows to adapt the SNO Spoke cluster deployment for
-    the new baremetal server pool in the new lab location
+    the new baremetal server pool in the new lab location.
+    OCP version is automatically extracted from RELEASE_IMAGE_LATEST for
+    graceful quit priority when multiple jobs compete for the same baremetal host.

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/tests/cpu-util/telcov10n-metal-single-node-spoke-kpis-tests-cpu-util-commands.sh

Lines changed: 7 additions & 0 deletions
@@ -247,14 +247,21 @@ function test_kpis {

   echo "************ telcov10n Run CPU Utilization Telco KPIs test ************"

+  # Check for graceful quit request before starting this test
+  check_for_quit "cpu_utils_test" "graceful"
+
   make_up_inventory
   make_up_remote_test_command
   make_up_ansible_playbook
   run_ansible_playbook
   setup_test_result_for_component_readiness
+
+  # Mark successful completion
+  echo -n "completed" >| ${SHARED_DIR}/cpu_util_test_status.txt
 }

 function main {
+  setup_ssh_and_lock_info
   set_spoke_cluster_kubeconfig
   test_kpis
 }
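The two quit modes used by the steps can be pictured with the following sketch. It is not the shared `check_for_quit` implementation, which is outside this excerpt: `quit_with_mode` is a hypothetical helper, the quit decision is passed in as an argument instead of being derived from waiting files on the bastion, and the exit code for an unknown mode is an assumption.

```shell
# Hypothetical: leave the current step according to the quit mode.
#   graceful -> exit 0, so later steps (e.g. PTP reporting) still run
#   force    -> exit 1, so the job aborts immediately
quit_with_mode() {
  local checkpoint="$1" mode="$2" quit_requested="$3"
  # Nothing to do unless a higher priority job asked us to quit.
  [ "${quit_requested}" = "yes" ] || return 0
  echo "[INFO] Quit requested at ${checkpoint} (mode: ${mode})"
  case "${mode}" in
    graceful) exit 0 ;;
    force)    exit 1 ;;
    # Assumption: treat an unknown mode as a usage error.
    *)        echo "[ERROR] Unknown quit mode: ${mode}" >&2; exit 2 ;;
  esac
}
```

Because the helper calls `exit`, a test step that invokes it before `make_up_inventory` skips the whole test body when a quit is requested, while the step still reports success in graceful mode.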

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/tests/cpu-util/telcov10n-metal-single-node-spoke-kpis-tests-cpu-util-ref.yaml

Lines changed: 4 additions & 0 deletions
@@ -53,3 +53,7 @@ ref:
       [see: https://github.com/neisw/ci-test-mapping/blob/main/pkg/components/telcoperformance/component.go]
   documentation: |-
     This step allows to verify the SNO Spoke cluster deployed through its kubeconfig.
+    OCP version is loaded from SHARED_DIR/ocp_version.txt (set by deploy step).
+    If a graceful quit is requested by a higher version job, this test will be skipped
+    to release the baremetal host lock faster. The oslat test will have already completed
+    by this point, so PTP reporting can still collect those results.

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/tests/oslat/telcov10n-metal-single-node-spoke-kpis-tests-oslat-commands.sh

Lines changed: 7 additions & 0 deletions
@@ -239,14 +239,21 @@ function test_kpis {

   echo "************ telcov10n Run oslat Telco KPIs test ************"

+  # Check for graceful quit request before starting this test
+  check_for_quit "oslat_test" "graceful"
+
   make_up_inventory
   make_up_remote_test_command
   make_up_ansible_playbook
   run_ansible_playbook
   setup_test_result_for_component_readiness
+
+  # Mark successful completion
+  echo -n "completed" >| ${SHARED_DIR}/oslat_test_status.txt
 }

 function main {
+  setup_ssh_and_lock_info
   set_spoke_cluster_kubeconfig
   test_kpis
 }

ci-operator/step-registry/telcov10n/metal-single-node-spoke-kpis/tests/oslat/telcov10n-metal-single-node-spoke-kpis-tests-oslat-ref.yaml

Lines changed: 3 additions & 0 deletions
@@ -46,3 +46,6 @@ ref:
       [see: https://github.com/neisw/ci-test-mapping/blob/main/pkg/components/telcoperformance/component.go]
   documentation: |-
     This step allows to verify the SNO Spoke cluster deployed through its kubeconfig.
+    OCP version is loaded from SHARED_DIR/ocp_version.txt (set by deploy step).
+    If a graceful quit is requested by a higher version job, this test will be skipped
+    and the job will exit gracefully to release the baremetal host lock.

ci-operator/step-registry/telcov10n/metal-single-node-spoke/cluster/install/telcov10n-metal-single-node-spoke-cluster-install-commands.sh

Lines changed: 16 additions & 0 deletions
@@ -160,6 +160,10 @@ function checking_installation_progress {
   timeout=$(date -d "${ABORT_INSTALLATION_TIMEOUT}" +%s)
   abort_installation=/tmp/abort.installation

+  # Counter for quit check - only check every QUIT_CHECK_INTERVAL iterations
+  local quit_check_counter=0
+  local quit_check_interval="${QUIT_CHECK_INTERVAL:-3}"
+
   while true; do

     test -f ${abort_installation} && {
@@ -203,6 +207,14 @@ function checking_installation_progress {
         echo "$ touch ${abort_installation}"
       fi

+      # Check for quit request every N iterations (QUIT_CHECK_INTERVAL)
+      # Use "force" mode since if interrupted, the rest of the steps are meaningless (cluster not ready)
+      ((quit_check_counter++))
+      if [ "${quit_check_counter}" -ge "${quit_check_interval}" ]; then
+        check_for_quit "cluster_installation_progress" "force"
+        quit_check_counter=0
+      fi
+
       sleep ${refresh_timing:="10m"} ;
     } || echo
   done
@@ -238,6 +250,10 @@ function get_and_save_kubeconfig_and_creds {
 }

 function main {
+
+  # Setup SSH and load lock info for quit checks
+  setup_ssh_and_lock_info
+
   set_hub_cluster_kubeconfig
   generate_cluster_image_set
   create_spoke_namespace
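The every-N-iterations check in the installation loop can be sketched in isolation. `run_every_n`, `tick_counter`, and `count_check` are hypothetical names. The sketch increments with `counter=$((counter + 1))` because `((counter++))` returns a non-zero status when the counter is 0, which can abort a script in contexts where `set -o errexit` is honoured.

```shell
# Hypothetical: run the given command once every ${interval} calls.
tick_counter=0
run_every_n() {
  local interval="$1"
  shift
  tick_counter=$((tick_counter + 1))
  if [ "${tick_counter}" -ge "${interval}" ]; then
    tick_counter=0
    "$@"
  fi
}

# Example: 7 loop iterations with an interval of 3 trigger the check twice
# (on iterations 3 and 6).
checks=0
count_check() { checks=$((checks + 1)); }

i=0
while [ "${i}" -lt 7 ]; do
  run_every_n 3 count_check
  i=$((i + 1))
done
echo "checks run: ${checks}"
```

The effective delay between quit checks is the interval multiplied by the loop's sleep time, so a larger `QUIT_CHECK_INTERVAL` means fewer SSH round trips to the bastion but a slower reaction to a waiting higher-version job.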

ci-operator/step-registry/telcov10n/metal-single-node-spoke/cluster/install/telcov10n-metal-single-node-spoke-cluster-install-ref.yaml

Lines changed: 8 additions & 0 deletions
@@ -39,6 +39,14 @@ ref:
       Set the amount of time, the step must wait at most, before unconditionally forcing the workflow to continue
       with the next steps which will clean up and free all resources used in the deployment of this cluster
       (MIN: "REFRESH_TIME value", default: "1 hours + 45 min", MAX: "2 hours - REFRESH_TIME value").
+  - name: QUIT_CHECK_INTERVAL
+    default: "3"
+    documentation: |-
+      Number of REFRESH_TIME iterations between each quit condition check. A higher priority job (newer OCP version)
+      waiting for the baremetal host lock can request the current job to quit. This setting controls how often
+      the installation loop checks for such quit requests. Uses "force" mode (exit 1) since an interrupted
+      installation leaves the cluster unusable. Default is 3 (check every 3 iterations, i.e., every 9 min
+      with default REFRESH_TIME of 3m).
   - name: BIOS_SETTINGS
     default: "{}"
     documentation: |-
