-
Notifications
You must be signed in to change notification settings - Fork 88
New role for accelerator drivers #1074
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New role for accelerator drivers #1074
Conversation
|
I'm still working on improved molecule coverage, but wanted this up for review as early as possible. In addition to the default molecule scenario, I ran some tests against an EDPM node from my install_yamls deployment, restoring it to a snapshot in between each test: |
bogdando
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but some design adjustments for download cache role
|
i have not reviewd this yet but you shoudl be rebaseign and squahsing the lint and nameing fixes into the orginal commit not building up a long chain of fix up patches. |
Most of the GH projects I work on do the squash at merge time so that the back/forth of review/changes is captured on the record, and you don't end up with missing commits (GH isn't as friendly with force-pushes as gerrit is). I see that your preferences for this repo are the opposite, so I'll squash these on the next posting. Thanks. |
8809a27 to
8029acc
Compare
| @@ -0,0 +1,20 @@ | |||
| --- | |||
|
|
|||
| - name: Check if grub2-mkconfig has --update-bls-cmdline option | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from the nova04delta DT that @bogdando linked[1].
I found that the version of grub2-mkconfig on my CS9 EDPM node has the option.
[cloud-admin@edpm-compute-0 ~]$ cat /etc/redhat-release
CentOS Stream release 9
[cloud-admin@edpm-compute-0 ~]$ grub2-mkconfig --help | grep bls
--update-bls-cmdline overwrite BLS cmdline args with default args
But the version on my CS9 vagrant node does not, and would fail without this detection code.
[vagrant@edpm-compute-0 ~]$ cat /etc/redhat-release
CentOS Stream release 9
[vagrant@edpm-compute-0 ~]$ grub2-mkconfig --help | grep bls
[vagrant@edpm-compute-0 ~]$ echo $?
1
I did a bit of reading, but haven't managed to track down exactly why or if this matters. I gather that --update-bls-cmdline is required when available, and hopefully this task is doing what we need even when it's not available. As previously mentioned[2], I've never rebuilt the initramfs while doing vGPU testing and I'm not sure why any of this is required, though I'm not arguing that it isn't, I just don't know.
[1] https://github.com/openstack-k8s-operators/architecture/blob/b8c7d29a61f91f6100b8bba6c1b5c8d26ac78b97/dt/nova/nova04delta/edpm/nodeset/nova_gpu.yaml#L75
[2] #1074 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamepark4 hi, do you recall why are we doing this --update-bls-cmdline magic for vGPU passthrough cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed up with James. He said:
"could you reach out to @sbauza, tbh in my initial testing downstream I never needed to rebuild the initramfs, I carried over his work from another repo to the dt/va so I'm not sure if rebuilding it is always necessary. I believe he got that advice from the gpu team." (emphasis mine)
In my own testing I also never needed to rebuild the initramfs. Perhaps Sylvain will have some more info.
@SeanMooney - any insight into this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the --update-bls-cmdline magic is related to the fact that this code supproted rhel 9.2 and 9.4 and between those rhel change how we use grub
that is becasue we needed to supprot mixed rhel nodes for adoption and upgrade testign starting with 17.1 rhel 9.2 adopting to 18.0 rhel 9.2 and greenfield 18.0 rhel 9.4
the reason for regernating the initramfs in general is in the event we need to prebind device to non default driver in teh initramdisk to prevent the default kernel driver form initsliating the device or where the device driver is needed in early boot.
we may or may not need this for gpus it depend on the usecase (passhtough vs vgpu/mig) and if the drivers are included in base rhel or not.
this is not something we shoudl be afraid of doing when required i just don't know of the top of my head if it is required in this instance.
|
The latest push contains most of the requested changes. The only requested change I haven't done is Sean's suggestion to "rework this to install the correct package for the running kernel". We discussed this out of band, and the situation boils down to the following:
The compromise we agreed to is that in addition to all the "(UNSUPPORTED)" labels and warning outputs in the code, I've added asserts that simply prevent these forbidden code paths from executing on RHEL nodes at all, preventing the accidental tainting of an otherwise supported kernel. |
8029acc to
0c45c11
Compare
|
I've just pushed an update with significantly enhanced molecule coverage. All molecule tests are passing locally, but there are hiccups that I'm not sure the best way to solve, or if they even matter. I locally changed the delegated scenarios to SSH to an EDPM node instead of using The file is missing because the "prepare" step didn't run. I think it won't prepare it because that instance was already prepared by the default scenario, and the "destroy" step, which would usually reset this, doesn't do anything on a delegated instance. If I use I restore a snapshot of my EDPM node between scenarios to keep them independent (reset.sh); I'm not sure how this matches up to the delegated nodes in CI - are they reused, or destroyed between scenarios? As you know, the vagrant-based scenarios have So all the tests are passing, but I wanted to document exactly how I run them, and see if perhaps I'm doing anything wrong. |
|
This change depends on a change that failed to merge. Change openstack-k8s-operators/openstack-operator#1722 is needed. |
@csibbitt you can add a CI job and see how it works for this PR exactly. For example, see then you can also speed up CI test suit by only including the role being worked out, like https://github.com/openstack-k8s-operators/edpm-ansible/pull/900/files |
| - create | ||
| - prepare | ||
| - converge | ||
| - verify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no verify.yml to verify. I guess we should verify the drivers installed and configs created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a check for this. There aren't really any configs to check for, but I can verify that one of the binaries from the package made it onto the system at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed an update that adds a simple verification to all scenarios.
| {{ '--update-bls-cmdline' | ||
| if _edpm_accel_drivers_check_update_bls_cmdline.rc == 0 | ||
| else '' }} | ||
| changed_when: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need changed_when: true here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. the original code from the arch DT was:
- name: Check if grub2-mkconfig has --update-bls-cmdline option
ansible.builtin.shell:
cmd: grub2-mkconfig --help | grep '\-\-update-bls-cmdline'
ignore_errors: true
register: check_update_bls_cmdline
changed_when: false
- name: Regenerate initramfs
become: true
ansible.builtin.command: "{{ item }}"
loop:
- 'dracut --force'
- >-
grub2-mkconfig -o /boot/grub2/grub.cfg
{{ '--update-bls-cmdline'
if check_update_bls_cmdline.rc == 0
else '' }}
when: (_blacklist_nvidia.changed or _load_vfio.changed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple answe is that the original code Bogdan is showing (with no changed_when) fails lint tests.
roles/edpm_accel_drivers/tasks/boot_config.yml:10: no-changed-when: Commands should not change things if nothing needs doing.
IMO, it's correct to assume that dracut --force will have made changes (at very least overwriting the existing initramfs file since that's what --force means), and therefor the ansible changed flag should always be true.
It's certainly possible to do deeper inspection to see if the contents of the initramfs or grub.conf actually changed as a result, but I'm not sure any added complexity is warranted here unless this changed state is pivotal for some other purpose I'm not aware of. Would you like me to add before and after comparisons of these files to more correctly set this flag?
FWIW, it's also up in the air whether we need to do any of this or not. See #1074 (comment)
fbc115b to
d8cc41a
Compare
d8cc41a to
22395c3
Compare
22395c3 to
280a7d4
Compare
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/655c07ffb56c46ff9d7f2e1621a9b2e7 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 22m 55s |
|
All three of the failed tests failed with this error: This has nothing to do with my PR, and currently that URL is returning a proper repomd.xml to me, so I'm going to assume that "the failure cause has been resolved " and try running the tests again (after another rebase) |
280a7d4 to
8bf729e
Compare
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/72e7d5c07af3414b933d6cfa86c8217b ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 22m 24s |
|
recheck |
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d509b43a2e844a84bc61cd13fd2175e8 ✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 28m 49s |
|
Current CI failures - unrelated to this code because these both passed yesterday and the only change since then has been a rebase onto the main branch. and |
|
@olliewalsh @bogdando @stuggi - Is there any work remaining before this can be merged? |
|
recheck |
bogdando
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/32e3c19b864f4a44a7a0915383fb7e44 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 20m 23s |
* Currently nvidia GPU drivers * Can install from: * DNF Repo * .rpm from custom URL * .rpm from local file (like NFS mount) * .run file for testing on CS9/ubi9 systems
8bf729e to
0bc9164
Compare
bogdando
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bogdando, csibbitt The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
0857bb2
into
openstack-k8s-operators:main
Jira: OSPRH-22193