
Conversation

@csibbitt
Contributor

@csibbitt csibbitt commented Dec 3, 2025

  • Currently NVIDIA GPU drivers only
  • Can install from:
    • DNF Repo
    • .rpm from custom URL
    • .rpm from local file (like NFS mount)
    • .run file for testing on CS9/ubi9 systems

Jira: OSPRH-22193
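Drawing on the test invocations shown later in this thread, a typical set of role variables might look like the following (values are illustrative; the variable names are taken from the commands exercised below):

```yaml
# Illustrative values only -- variable names come from the test
# invocations in this PR, the values are examples.
edpm_accel_drivers_install_nvidia: true

# DNF-repo install path:
edpm_accel_drivers_nvidia_dnf_module_stream: open-dkms
edpm_accel_drivers_nvidia_package_name: nvidia-open

# Or, custom-URL install path (.rpm or .run):
# edpm_accel_drivers_nvidia_custom_url: https://example.com/nvidia-driver.rpm
```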

@openshift-ci openshift-ci bot requested review from olliewalsh and stuggi December 3, 2025 03:08
@csibbitt
Contributor Author

csibbitt commented Dec 3, 2025

I'm still working on improved molecule coverage, but wanted this up for review as early as possible.

In addition to the default molecule scenario, I ran some tests against an EDPM node from my install_yamls deployment, restoring it to a snapshot in between each test:

$ ANSIBLE_ROLES_PATH=./roles ansible-playbook -i inventory playbooks/accel_drivers.yml -e edpm_accel_drivers_install_nvidia=true -e edpm_accel_drivers_nvidia_dnf_module_stream=open-dkms -e edpm_accel_drivers_nvidia_package_name=nvidia-open
[...]
TASK [edpm_accel_drivers : Enable NVIDIA driver DNF module for selected version stream] **********************************************************************************
changed: [edpm-compute-0]

TASK [edpm_accel_drivers : Install the nvidia driver package] ************************************************************************************************************
changed: [edpm-compute-0]
[...]
PLAY RECAP ***************************************************************************************************************************************************************
edpm-compute-0             : ok=9    changed=6    unreachable=0    failed=0    skipped=13   rescued=0    ignored=0   

✓ Test 'From RPM repo' PASSED
$ ANSIBLE_ROLES_PATH=./roles ansible-playbook -i inventory playbooks/accel_drivers.yml -e edpm_accel_drivers_install_nvidia=true -e edpm_accel_drivers_nvidia_custom_url=https://*****/driver-storage/nvidia/grid/latest-GRID/host-driver-aie-9.6.rpm
[...]
TASK [edpm_accel_drivers : Download NVIDIA driver from custom URL] *************************************************************************************************************
changed: [edpm-compute-0]

TASK [edpm_accel_drivers : Install NVIDIA driver from RPM] *********************************************************************************************************************
changed: [edpm-compute-0]
[...]
PLAY RECAP *********************************************************************************************************************************************************************
edpm-compute-0             : ok=13   changed=6    unreachable=0    failed=0    skipped=9    rescued=0    ignored=0   

✓ Test 'AIE host driver .rpm from lookaside' PASSED
$ ANSIBLE_ROLES_PATH=./roles ansible-playbook -i inventory playbooks/accel_drivers.yml -e edpm_accel_drivers_install_nvidia=true -e edpm_accel_drivers_nvidia_custom_url=https://*****/driver-storage/nvidia/grid/latest-GRID/host-driver-aie.run
[...]
TASK [edpm_accel_drivers : Execute .run file (UNSUPPORTED)] ********************************************************************************************************************
changed: [edpm-compute-0]
[...]
PLAY RECAP *********************************************************************************************************************************************************************
edpm-compute-0             : ok=17   changed=7    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0   

✓ Test 'AIE host driver .run from lookaside' PASSED

@bogdando bogdando requested a review from slagle December 3, 2025 09:11
Contributor

@bogdando bogdando left a comment


LGTM, but some design adjustments for download cache role

@SeanMooney
Contributor

I have not reviewed this yet, but you should be rebasing and squashing the lint and naming fixes into the original commit, not building up a long chain of fix-up patches.
Each commit should be independently correct before we merge the PR.
Can you clean up the git history to make this easier to review and more correct?

@csibbitt
Contributor Author

csibbitt commented Dec 3, 2025

I have not reviewed this yet, but you should be rebasing and squashing the lint and naming fixes into the original commit, not building up a long chain of fix-up patches. Each commit should be independently correct before we merge the PR. Can you clean up the git history to make this easier to review and more correct?

Most of the GH projects I work on do the squash at merge time so that the back/forth of review/changes is captured on the record, and you don't end up with missing commits (GH isn't as friendly with force-pushes as gerrit is). I see that your preferences for this repo are the opposite, so I'll squash these on the next posting. Thanks.

@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from 8809a27 to 8029acc Compare December 4, 2025 19:42
@@ -0,0 +1,20 @@
---

- name: Check if grub2-mkconfig has --update-bls-cmdline option
Contributor Author

@csibbitt csibbitt Dec 4, 2025


This comes from the nova04delta DT that @bogdando linked[1].

I found that the version of grub2-mkconfig on my CS9 EDPM node has the option.

[cloud-admin@edpm-compute-0 ~]$ cat /etc/redhat-release 
CentOS Stream release 9
[cloud-admin@edpm-compute-0 ~]$ grub2-mkconfig --help | grep bls
  --update-bls-cmdline    overwrite BLS cmdline args with default args

But the version on my CS9 vagrant node does not, and would fail without this detection code.

[vagrant@edpm-compute-0 ~]$ cat /etc/redhat-release 
CentOS Stream release 9
[vagrant@edpm-compute-0 ~]$ grub2-mkconfig --help | grep bls
[vagrant@edpm-compute-0 ~]$ echo $?
1

I did a bit of reading, but haven't managed to track down exactly why, or whether it matters. I gather that --update-bls-cmdline is required when available, and hopefully this task does what we need even when it's not available. As previously mentioned[2], I've never rebuilt the initramfs while doing vGPU testing and I'm not sure why any of this is required; I'm not arguing that it isn't, I just don't know.

[1] https://github.com/openstack-k8s-operators/architecture/blob/b8c7d29a61f91f6100b8bba6c1b5c8d26ac78b97/dt/nova/nova04delta/edpm/nodeset/nova_gpu.yaml#L75
[2] #1074 (comment)

Contributor


@jamepark4 hi, do you recall why are we doing this --update-bls-cmdline magic for vGPU passthrough cases?

Contributor Author


I followed up with James. He said:

"could you reach out to @sbauza, tbh in my initial testing downstream I never needed to rebuild the initramfs, I carried over his work from another repo to the dt/va so I'm not sure if rebuilding it is always necessary.  I believe he got that advice from the gpu team." (emphasis mine)

In my own testing I also never needed to rebuild the initramfs. Perhaps Sylvain will have some more info.

@SeanMooney - any insight into this?

Contributor


The --update-bls-cmdline magic is related to the fact that this code supported RHEL 9.2 and 9.4, and between those releases RHEL changed how we use grub:

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index#new-features-boot-loader

That is because we needed to support mixed RHEL nodes for adoption and upgrade testing: starting with 17.1 on RHEL 9.2, adopting to 18.0 on RHEL 9.2, and greenfield 18.0 on RHEL 9.4.

The reason for regenerating the initramfs in general is that we may need to prebind a device to a non-default driver in the initramdisk, to prevent the default kernel driver from initializing the device, or because the device driver is needed in early boot.

We may or may not need this for GPUs; it depends on the use case (passthrough vs vGPU/MIG) and whether the drivers are included in base RHEL or not.

This is not something we should be afraid of doing when required; I just don't know off the top of my head whether it is required in this instance.
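As an illustration of the prebinding described above (hypothetical tasks, not code from this PR), forcing a GPU onto vfio-pci before the default driver claims it might be sketched as:

```yaml
# Hypothetical sketch -- not part of this PR. The PCI vendor:device
# ID is a placeholder example. Prefers vfio-pci over the default GPU
# driver, then regenerates the initramfs so the binding holds at
# early boot.
- name: Prefer vfio-pci over the default GPU driver
  become: true
  ansible.builtin.copy:
    dest: /etc/modprobe.d/vfio.conf
    content: |
      options vfio-pci ids=10de:1eb8
      softdep nouveau pre: vfio-pci
    mode: "0644"

- name: Ensure vfio-pci is included in the initramfs
  become: true
  ansible.builtin.copy:
    dest: /etc/dracut.conf.d/vfio.conf
    content: |
      force_drivers+=" vfio-pci "
    mode: "0644"

- name: Regenerate initramfs
  become: true
  ansible.builtin.command: dracut --force
  changed_when: true
```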

@csibbitt
Contributor Author

csibbitt commented Dec 4, 2025

The latest push contains most of the requested changes. Still working on more complete molecule coverage.

The only requested change I haven't done is Sean's suggestion to "rework this to install the correct package for the running kernel". We discussed this out of band, and the situation boils down to the following:

  • There are no kernel-specific vGPU host driver RPMs available for CS9; so for this role to work (i.e. actually install the useful drivers) on CS9, compiling from a .run file is required. This is for lab testing and community use only, not supported in the downstream product, and the code makes that clear.
  • There are no non-dkms NVIDIA driver RPMs of any sort available in any DNF repo that will work on CS9; so to test the rpm-from-dnf-repo path in this code, I will continue to use the dkms-enabled packages. This is for lab testing and community use only, not supported in the downstream product, and the code makes that clear.

The compromise we agreed to is that in addition to all the "(UNSUPPORTED)" labels and warning outputs in the code, I've added asserts that simply prevent these forbidden code paths from executing on RHEL nodes at all, preventing the accidental tainting of an otherwise supported kernel.
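A guard along the lines described could be sketched as follows (hypothetical task, not the exact code from the PR):

```yaml
# Hypothetical sketch of the RHEL guard described above -- not the
# exact task from this PR. Blocks the unsupported install paths
# (.run file, dkms packages) on RHEL nodes so a supported kernel
# cannot be accidentally tainted.
- name: Refuse unsupported NVIDIA install paths on RHEL
  ansible.builtin.assert:
    that: ansible_facts['distribution'] != 'RedHat'
    fail_msg: >-
      Installing from a .run file or a dkms package is UNSUPPORTED
      and would taint the kernel; this path is blocked on RHEL nodes.
```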

csibbitt added a commit to csibbitt/openstack-operator that referenced this pull request Dec 8, 2025
@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from 8029acc to 0c45c11 Compare December 8, 2025 21:46
@csibbitt
Contributor Author

csibbitt commented Dec 8, 2025

I've just pushed an update with significantly enhanced molecule coverage.

All molecule tests are passing locally, but there are hiccups that I'm not sure the best way to solve, or if they even matter.

I locally changed the delegated scenarios to SSH to an EDPM node instead of using ansible_connection: local, and ran into a problem where the "prepare" stage in the delegated scenarios was only running for the first (default) scenario, and not working in the custom_url_ scenarios:

INFO     Running custom_url_rpm > destroy
WARNING  Skipping, instances are delegated.
INFO     Running custom_url_rpm > create
WARNING  Skipping, instances are delegated.
INFO     Running custom_url_rpm > prepare
WARNING  Skipping, instances already prepared.
[...]
fatal: [compute-1]: FAILED! => {"attempts": 5, "changed": false, "dest": "/var/cache/nvidia-drivers/mock-nvidia-driver.rpm", "elapsed": 0, "msg": "Request failed: <urlopen error [Errno 2] No such file or directory: '/tmp/mock-nvidia-driver.rpm'>", "url": "file:///tmp/mock-nvidia-driver.rpm"}

The file is missing because the "prepare" step didn't run. I think it won't prepare it because that instance was already prepared by the default scenario, and the "destroy" step, which would usually reset this, doesn't do anything on a delegated instance. If I use molecule prepare --force then it's fine, but I'm not sure if this will be a problem in CI or not. Do I need to move everything to a single "prepare" step and ensure that scenario runs first somehow?
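One possible arrangement (hypothetical molecule.yml fragment, not from this PR) is to point every scenario at a single shared prepare playbook, so whichever scenario runs first performs the preparation:

```yaml
# Hypothetical molecule.yml fragment -- each scenario references the
# same shared prepare playbook instead of carrying its own copy, so
# ordering between scenarios matters less.
provisioner:
  name: ansible
  playbooks:
    prepare: ../shared/prepare.yml
scenario:
  test_sequence:
    - destroy
    - create
    - prepare
    - converge
    - verify
    - destroy
```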

I restore a snapshot of my EDPM node between scenarios to keep them independent (reset.sh); I'm not sure how this matches up to the delegated nodes in CI - are they reused, or destroyed between scenarios?

As you know, the vagrant-based scenarios have test_sequence: [], so I need to destroy/converge instead of "test" for those. Ultimately, this is what I end up running instead of molecule test --all:

$ ../../reset.sh && molecule prepare --force -s default && \
molecule test -s default && \
../../reset.sh && molecule prepare --force -s custom_url_rpm && \
molecule test -s custom_url_rpm && \
../../reset.sh && molecule prepare --force -s custom_url_run && \
molecule test -s custom_url_run && \
molecule destroy -s default_vagrant && \
molecule converge -s default_vagrant && \
molecule destroy -s custom_url_rpm_vagrant && \
molecule converge -s custom_url_rpm_vagrant && \
molecule destroy -s custom_url_run_vagrant && \
molecule converge -s custom_url_run_vagrant 

[...]

$ echo $?
0

So all the tests are passing, but I wanted to document exactly how I run them, and see if perhaps I'm doing anything wrong.

@softwarefactory-project-zuul

This change depends on a change that failed to merge.

Change openstack-k8s-operators/openstack-operator#1722 is needed.

@bogdando
Contributor

bogdando commented Dec 10, 2025

if this will be a problem in CI or not

@csibbitt you can add a CI job and see how it works for this PR exactly.

For example, see

zuul.d/projects.yaml:        - edpm-ansible-molecule-edpm_nova
zuul.d/jobs.yaml:    name: edpm-ansible-molecule-edpm_nova

then you can also speed up CI test suit by only including the role being worked out, like https://github.com/openstack-k8s-operators/edpm-ansible/pull/900/files

- create
- prepare
- converge
- verify
Contributor


There is no verify.yml to verify. I guess we should verify that the drivers are installed and the configs created.

Contributor Author


I will add a check for this. There aren't really any configs to check for, but I can verify that one of the binaries from the package made it onto the system at least.
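A minimal verify task along those lines might look like this (hypothetical sketch; the binary path is an assumed example, not taken from the PR):

```yaml
# Hypothetical verify sketch -- checks that a binary shipped by the
# driver package landed on the system. nvidia-smi is an assumed
# example of such a binary.
- name: Check for a binary installed by the driver package
  ansible.builtin.stat:
    path: /usr/bin/nvidia-smi
  register: _driver_binary

- name: Assert the driver binary is present
  ansible.builtin.assert:
    that: _driver_binary.stat.exists
    fail_msg: Expected driver binary was not installed
```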

Contributor Author


I've pushed an update that adds a simple verification to all scenarios.

{{ '--update-bls-cmdline'
if _edpm_accel_drivers_check_update_bls_cmdline.rc == 0
else '' }}
changed_when: true
Contributor


Why do you need changed_when: true here?

Contributor

@bogdando bogdando Dec 12, 2025


good point. the original code from the arch DT was:

        - name: Check if grub2-mkconfig has --update-bls-cmdline option
          ansible.builtin.shell:
            cmd: grub2-mkconfig --help | grep '\-\-update-bls-cmdline'
          ignore_errors: true
          register: check_update_bls_cmdline
          changed_when: false

        - name: Regenerate initramfs
          become: true
          ansible.builtin.command: "{{ item }}"
          loop:
            - 'dracut --force'
            - >-
              grub2-mkconfig -o /boot/grub2/grub.cfg
              {{ '--update-bls-cmdline'
              if check_update_bls_cmdline.rc == 0
              else '' }}
          when: (_blacklist_nvidia.changed or _load_vfio.changed)

Contributor Author


The simple answer is that the original code Bogdan is showing (with no changed_when) fails lint tests.

roles/edpm_accel_drivers/tasks/boot_config.yml:10: no-changed-when: Commands should not change things if nothing needs doing.

IMO, it's correct to assume that dracut --force will have made changes (at the very least overwriting the existing initramfs file, since that's what --force means), and therefore the Ansible changed flag should always be true.

It's certainly possible to do deeper inspection to see if the contents of the initramfs or grub.conf actually changed as a result, but I'm not sure any added complexity is warranted here unless this changed state is pivotal for some other purpose I'm not aware of. Would you like me to add before and after comparisons of these files to more correctly set this flag?

FWIW, it's also up in the air whether we need to do any of this or not. See #1074 (comment)
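For reference, the before/after comparison floated above could be sketched like this (hypothetical, not in the PR):

```yaml
# Hypothetical sketch of checksum-based change detection for the
# regenerated initramfs, as an alternative to changed_when: true.
- name: Record initramfs checksum before regeneration
  ansible.builtin.stat:
    path: "/boot/initramfs-{{ ansible_facts['kernel'] }}.img"
    checksum_algorithm: sha256
  register: _initramfs_before

- name: Regenerate initramfs
  become: true
  ansible.builtin.command: dracut --force
  changed_when: false  # the real change status is derived below

- name: Record initramfs checksum after regeneration
  ansible.builtin.stat:
    path: "/boot/initramfs-{{ ansible_facts['kernel'] }}.img"
    checksum_algorithm: sha256
  register: _initramfs_after

- name: Report whether the initramfs actually changed
  ansible.builtin.debug:
    msg: "initramfs changed: {{ _initramfs_before.stat.checksum != _initramfs_after.stat.checksum }}"
```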

@bogdando bogdando removed the approved label Dec 12, 2025
@bogdando bogdando self-requested a review December 12, 2025 10:36
@bogdando bogdando removed the lgtm label Dec 12, 2025
@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from fbc115b to d8cc41a Compare December 15, 2025 18:10
@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from d8cc41a to 22395c3 Compare December 15, 2025 20:01
@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from 22395c3 to 280a7d4 Compare December 16, 2025 05:28
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/655c07ffb56c46ff9d7f2e1621a9b2e7

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 22m 55s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 18m 32s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 38m 59s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 28s
✔️ cifmw-crc-podified-edpm-baremetal-bootc SUCCESS in 1h 41m 09s
✔️ noop SUCCESS in 0s
edpm-ansible-tempest-multinode RETRY_LIMIT in 20m 25s
✔️ edpm-ansible-baremetal-minor-update SUCCESS in 2h 08m 34s
✔️ edpm-ansible-molecule-edpm_accel_drivers SUCCESS in 15m 46s
adoption-standalone-to-crc-ceph-provider RETRY_LIMIT in 20m 32s

@csibbitt
Contributor Author

All three of the failed tests failed with this error:

    error: cannot update repo 'ubi-9-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried; Last error: Curl error (52): Server returned nothing (no headers, no data) for https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/x86_64/baseos/os/repodata/repomd.xml [Empty reply from server]

This has nothing to do with my PR, and currently that URL is returning a proper repomd.xml to me, so I'm going to assume that "the failure cause has been resolved" and try running the tests again (after another rebase).

@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from 280a7d4 to 8bf729e Compare December 16, 2025 18:37
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/72e7d5c07af3414b933d6cfa86c8217b

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 22m 24s
podified-multinode-edpm-deployment-crc RETRY_LIMIT in 20m 14s
cifmw-crc-podified-edpm-baremetal FAILURE in 49m 58s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 23s
✔️ cifmw-crc-podified-edpm-baremetal-bootc SUCCESS in 1h 31m 03s
✔️ noop SUCCESS in 0s
edpm-ansible-tempest-multinode RETRY_LIMIT in 22m 42s
✔️ edpm-ansible-baremetal-minor-update SUCCESS in 2h 03m 55s
✔️ edpm-ansible-molecule-edpm_accel_drivers SUCCESS in 17m 27s
adoption-standalone-to-crc-ceph-provider RETRY_LIMIT in 18m 47s

@csibbitt
Contributor Author

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d509b43a2e844a84bc61cd13fd2175e8

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 28m 49s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 25m 11s
cifmw-crc-podified-edpm-baremetal FAILURE in 43m 45s
✔️ cifmw-pod-zuul-files SUCCESS in 22m 00s
✔️ cifmw-crc-podified-edpm-baremetal-bootc SUCCESS in 1h 37m 11s
✔️ noop SUCCESS in 0s
✔️ edpm-ansible-tempest-multinode SUCCESS in 2h 03m 29s
edpm-ansible-baremetal-minor-update FAILURE in 28m 33s
✔️ edpm-ansible-molecule-edpm_accel_drivers SUCCESS in 15m 42s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 12m 19s

@csibbitt
Contributor Author

Current CI failures - unrelated to this code because these both passed yesterday and the only change since then has been a rebase onto the main branch.

FAILED - RETRYING: [localhost]: Wait for OpenStack subscription creation (1 retries left).  
task path: /home/zuul/src/github.com/openstack-k8s-operators/ci-framework/roles/edpm_prepare/tasks/main.yml:98

and

    - oc apply -f /home/zuul/ci-framework-data/artifacts/manifests/openstack/nncp/cr/
    - 'Error from server (InternalError): error when creating "/home/zuul/ci-framework-data/artifacts/manifests/openstack/nncp/cr/crc_nncp.yaml":
      Internal error occurred: failed calling webhook "nodenetworkconfigurationpolicies-mutate.nmstate.io":
      failed to call webhook: Post "https://nmstate-webhook.openshift-nmstate.svc:443/nodenetworkconfigurationpolicies-mutate?timeout=10s":
      tls: failed to verify certificate: x509: certificate signed by unknown authority'

@csibbitt
Contributor Author

@olliewalsh @bogdando @stuggi - Is there any work remaining before this can be merged?

@bogdando
Contributor

recheck

Copy link
Contributor

@bogdando bogdando left a comment


LGTM

@openshift-ci openshift-ci bot added the lgtm label Dec 18, 2025
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/32e3c19b864f4a44a7a0915383fb7e44

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 20m 23s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 28m 46s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 14m 48s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 32s
✔️ cifmw-crc-podified-edpm-baremetal-bootc SUCCESS in 1h 17m 35s
✔️ noop SUCCESS in 0s
✔️ edpm-ansible-tempest-multinode SUCCESS in 1h 46m 50s
edpm-ansible-baremetal-minor-update FAILURE in 2h 03m 45s
✔️ edpm-ansible-molecule-edpm_accel_drivers SUCCESS in 15m 21s
adoption-standalone-to-crc-ceph-provider FAILURE in 1h 41m 38s

@csibbitt csibbitt force-pushed the csibbitt_OSPRH-22193_accel_drivers branch from 8bf729e to 0bc9164 Compare December 18, 2025 15:47
@openshift-ci openshift-ci bot removed the lgtm label Dec 18, 2025
@bogdando bogdando self-requested a review December 18, 2025 15:49
Contributor

@bogdando bogdando left a comment


LGTM

@openshift-ci openshift-ci bot added the lgtm label Dec 18, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bogdando, csibbitt


@openshift-merge-bot openshift-merge-bot bot merged commit 0857bb2 into openstack-k8s-operators:main Dec 18, 2025
34 checks passed