✨ Shutdown VMs before Deletion #2835

shaardie · 2025-11-14T15:23:10Z

Tries to shut down the OpenStack VM before deleting it. This way, even Pods from DaemonSets are shut down more gracefully, and services such as license daemons on the VMs can be shut down properly.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1973

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

- squashed commits
- if necessary:
  - includes documentation
  - adds unit tests

/hold

k8s-ci-robot · 2025-11-14T15:23:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2025-11-14T15:23:18Z

✅ Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `47154ea` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-cluster-api-openstack/deploys/691ed07141ecd2000816910a |
| 😎 Deploy Preview | https://deploy-preview-2835--kubernetes-sigs-cluster-api-openstack.netlify.app |

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2025-11-14T15:23:21Z

Hi @shaardie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

lentzi90

/ok-to-test
Thinking about unnecessary API calls, should we skip trying to shut it down altogether if the timeout is 0?

lentzi90 · 2025-11-19T15:40:15Z

/retitle ✨ Shutdown VMs before Deletion

shaardie · 2025-11-20T08:09:11Z

> /ok-to-test
> Thinking about unnecessary API calls, should we skip trying to shut it down altogether if the timeout is 0?

Which timeout do you mean exactly? `timeoutInstanceDelete` seems to be hardcoded to 5 minutes.

lentzi90 · 2025-11-20T08:32:18Z

Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

shaardie · 2025-11-20T09:16:34Z

> Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

So you suggest a new configuration option via CRD?

lentzi90 · 2025-11-20T11:22:44Z

Hmm let me gather some second opinions. I want to have more than a gut feeling before we start modifying the CRDs 😄

mnaser · 2025-11-21T20:02:15Z

I'm not sure how I feel about failing if the system doesn't shut down. I feel like it would be better if it tries to shut the VM down for 5 minutes and, if the VM doesn't shut down, moves on to termination.

Anyways, OpenStack will flip from a graceful to hard shutdown after 60s by default:

https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.graceful_shutdown_timeout

So the 5 minute timeout seems overkill as well, unless something is seriously wrong (or the cloud has that config changed).

shaardie · 2025-11-25T10:24:33Z

I can also change the PR so that it continues with deleting the VM instead of failing once the timeout has passed.

For me personally, 60s would also be okay for a timeout, but I can think of situations where this is a little short, for example if there are custom mounts of NFS, CIFS, GPFS, or whatever else. Those can easily take more than 60s to shut down.

Maybe you should first decide whether you want this value to be configurable via the CRD?

lentzi90 · 2025-11-25T10:51:49Z

I have checked with my downstream and they do not have any concerns with the feature (always enabled).

However, it sounds like there are quite a few ways to do this, and people will want different things. Some do not care about the shutdown and definitely want to force it or just delete straight away. Some want to make sure everything is properly shut down and would rather get an error than force it. And some will want a different timeout.

So how should we do this? I can see it working with either a flag or CRD field(s).

Then we have one more thing to consider. We want to make use of ORC for managing the servers. See #2814 for more details. Hopefully we can get this done sooner rather than later, which means that this feature would make more sense to implement in ORC directly. Otherwise we will end up having to migrate it later.

shaardie · 2025-11-25T13:38:26Z

Honestly, I am not quite sure what you want me to do. I would be happy to change things in this PR if you tell me what you want to have.

If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

Atomsoldat · 2025-11-25T19:15:44Z

> So how should we do this?

Human interaction analogy

Let's go through the scenarios where Alice (User) wants Bob (CAPO) to delete a VM in a "regular" talk-to-your-colleague kind of interaction:

- If Alice tells Bob "Please shut down this VM for me, I need one less now" and gives no further details, Bob will have to go with best practice and perform an ordinary shutdown, giving the OS some time to terminate processes properly.
  - If the normal shutdown does not proceed as planned, a human operator would typically ask Alice to confirm whether she agrees to a forced shutdown.
  - An automated process cannot do this, so it has to do what minimises the deviation from the desired state (VM off, no improperly shut down processes causing trouble) while maximising the velocity of reaching the desired state. We know that we cannot maximise both, but (at least from my perspective) many things can go wrong when we immediately shut everything off without proper cleanup, while the repercussions of a 60-second delay before we get the big stick seem rather mild in comparison. So any delay should be better than no delay in the majority of cases.
- If Alice knows that the VM will take a long time to shut down normally, she should tell Bob, so that he is not surprised.
- If Alice wants Bob to pull the plug on the machine immediately instead of performing the usual shutdown, she should tell him beforehand, because she cannot expect this to be his regular modus operandi.

In all of the above cases, Alice should provide Bob with the information he needs to proceed in an ideal way. To me, this hints in the direction of Alice (User) providing this information to Bob (CAPO) beforehand, in a way Bob understands (CRD field). If Bob tries to have one solution that applies to all possible use cases (Configuration Flag) he might get some cases wrong, in which Alice has different requirements.

You might also have the case that you have one CAPO instance managing VMs that you want to be deleted immediately as well as VMs that you want to give time for an orderly shutdown. That would also make the CRD field approach more desirable.

Which value should the feature use by default

In my opinion, the Venn diagram representing the group of people for whom one minute of additional VM runtime would be more than even a minor inconvenience (which could then be fixed easily) and the group of people who would be caught unaware of such a change should have a very small intersection.

Whereas with the way things currently work, the Venn diagram representing the group of people for whom an immediate VM termination would be more than even a minor inconvenience (which may or may not be easily remedied) and the group of people who might be bitten by this in the future probably has a larger intersection (in my opinion).

So I think under those conditions, there is no need to treat the previous default (which is unusual and can definitely cause headaches) with a lot of reverence. I think 60 seconds before forcing termination (which may then be adjusted for individual VMs with special considerations) is a reasonable default. If people never want their VMs to be force terminated, they set it to -1, and if they want them terminated immediately, they set it to 0.

But this is just my opinion, just trying to give some input to give you a perspective on the choices you mentioned.

lentzi90 · 2025-11-28T06:33:17Z

> Honestly, I am not quite sure what you want me to do. I would be happy to change things in this PR if you tell me what you want to have.
>
> If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

I am basically saying that I think we need an option to either turn this feature off or to allow more granular configuration of it. I do not have a strong opinion on how to do that so I am leaving it up to you to propose what to do. The issue description already suggests a waitForShutdown field. That sounds reasonable and other people seem to agree also.

If you don't need this urgently, I also suggest looking into ORC first so that we can get an implementation that will work with it. Otherwise we risk breaking this feature later.

smoshiur1237 · 2025-12-01T14:14:47Z

Since we expect a graceful shutdown, should we follow these steps so that an ungraceful shutdown doesn't happen?

1. If the VM is running, issue a "stop" or "poweroff" command via the OpenStack API.
2. Wait for the VM to reach the "SHUTOFF" state.
3. Delete the instance as usual.

I can see the stop server option, but is it synchronized with the deletion of the instance?
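One way to picture the sequencing asked about here is a small decision function that a reconcile loop calls on every pass, so that deletion only happens once the stop step is finished. This is an illustrative sketch only: `nextDeleteAction` and the `action` type are hypothetical, and the status strings are assumed to be the server status values reported by the OpenStack compute API.

```go
package main

import "fmt"

// action is a hypothetical result type for one reconcile pass.
type action string

const (
	actionStop    action = "stop"
	actionRequeue action = "requeue"
	actionDelete  action = "delete"
)

// nextDeleteAction picks the next step of a stop-then-delete sequence from
// the current server status and whether a stop was already requested.
func nextDeleteAction(status string, stopRequested bool) action {
	switch status {
	case "ACTIVE":
		if stopRequested {
			// The stop is in flight; check again on a later pass.
			return actionRequeue
		}
		return actionStop
	default:
		// SHUTOFF, ERROR, and any other state: nothing left to stop gracefully.
		return actionDelete
	}
}

func main() {
	// Walk one hypothetical deletion: the server is ACTIVE, a stop is requested,
	// it is still ACTIVE on the next pass, and finally reports SHUTOFF.
	fmt.Println(nextDeleteAction("ACTIVE", false))
	fmt.Println(nextDeleteAction("ACTIVE", true))
	fmt.Println(nextDeleteAction("SHUTOFF", true))
}
```

A timeout check, as discussed earlier in the thread, would sit in the requeue branch so that a server stuck in ACTIVE still gets deleted eventually.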

github-project-automation bot added this to CAPO Roadmap Nov 14, 2025

github-project-automation bot moved this to Inbox in CAPO Roadmap Nov 14, 2025

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 14, 2025

k8s-ci-robot requested review from EmilienM and smoshiur1237 November 14, 2025 15:23

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 14, 2025

lentzi90 reviewed Nov 19, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 19, 2025

k8s-ci-robot changed the title ~~Shutdown VMs before Deletion~~ ✨ Shutdown VMs before Deletion Nov 19, 2025

Shutdown VMs before Deletion

47154ea

Tries to shut down the OpenStack VM before deleting it. This way, even Pods from DaemonSets are shut down more gracefully, and services such as license daemons on the VMs can be shut down properly. Related to kubernetes-sigs#1973

shaardie force-pushed the shutdown-before-deletion branch from 5e56463 to 47154ea Compare November 20, 2025 08:25
