@shaardie

Tries to shut down the OpenStack VM before deleting it. This way, even Pods from DaemonSets are shut down more gracefully, and services on the VMs such as license daemons can be shut down properly.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1973

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 14, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Nov 14, 2025

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 47154ea
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-cluster-api-openstack/deploys/691ed07141ecd2000816910a
😎 Deploy Preview https://deploy-preview-2835--kubernetes-sigs-cluster-api-openstack.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 14, 2025
@k8s-ci-robot
Contributor

Hi @shaardie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@lentzi90 lentzi90 left a comment

/ok-to-test
Thinking about unnecessary API calls, should we skip trying to shut it down altogether if the timeout is 0?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 19, 2025
@lentzi90
Contributor

/retitle ✨ Shutdown VMs before Deletion

@k8s-ci-robot k8s-ci-robot changed the title Shutdown VMs before Deletion ✨ Shutdown VMs before Deletion Nov 19, 2025
@shaardie
Author

> /ok-to-test
> Thinking about unnecessary API calls, should we skip trying to shut it down altogether if the timeout is 0?

Which timeout do you mean exactly? timeoutInstanceDelete seems to be hardcoded to 5 minutes.

Tries to shut down the OpenStack VM before deleting it. This way, even
Pods from DaemonSets are shut down more gracefully, and services like
license daemons on the VMs can be shut down properly.

Related to kubernetes-sigs#1973
@shaardie shaardie force-pushed the shutdown-before-deletion branch from 5e56463 to 47154ea Compare November 20, 2025 08:25
@lentzi90
Contributor

Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

@shaardie
Author

> Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

So you suggest a new configuration option via CRD?

@lentzi90
Contributor

Hmm let me gather some second opinions. I want to have more than a gut feeling before we start modifying the CRDs 😄

@mnaser
Contributor

mnaser commented Nov 21, 2025

I'm not sure how I feel about failing if the system doesn't shut down. I feel it would be better if it tries to shut the VM down for 5 minutes and, if the VM still hasn't shut down, moves on to termination.

Anyway, OpenStack will flip from a graceful to a hard shutdown after 60s by default:

https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.graceful_shutdown_timeout

So the 5-minute timeout seems overkill as well, unless something is seriously wrong (or the cloud has that config changed).
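
The "try to shut down, then move on to termination" behaviour described above could look roughly like the following Go sketch. The `serverAPI` interface, `stopThenDelete`, `fakeServer`, and `demo` are hypothetical names made up for illustration; they are not CAPO's or gophercloud's actual API, and the status strings are assumed to follow Nova's `ACTIVE`/`SHUTOFF` naming.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// serverAPI is a hypothetical, minimal view of the compute API used only in
// this sketch; it is not the interface CAPO or gophercloud exposes.
type serverAPI interface {
	Stop(ctx context.Context, id string) error
	Status(ctx context.Context, id string) (string, error)
	Delete(ctx context.Context, id string) error
}

// stopThenDelete requests a graceful stop, polls until the server reports
// SHUTOFF or the timeout expires, and then deletes the server either way.
// The boolean result reports whether SHUTOFF was observed before deletion.
func stopThenDelete(ctx context.Context, api serverAPI, id string, timeout, poll time.Duration) (bool, error) {
	graceful := false
	if err := api.Stop(ctx, id); err == nil {
		deadline := time.Now().Add(timeout)
		for time.Now().Before(deadline) {
			status, err := api.Status(ctx, id)
			if err != nil {
				break // status unknown: stop waiting and fall through to delete
			}
			if status == "SHUTOFF" {
				graceful = true
				break
			}
			select {
			case <-ctx.Done():
				return false, ctx.Err() // caller gave up; nothing is deleted
			case <-time.After(poll):
			}
		}
	}
	if err := api.Delete(ctx, id); err != nil {
		return graceful, fmt.Errorf("deleting server %s: %w", id, err)
	}
	return graceful, nil
}

// fakeServer is a test double that reports ACTIVE for the first
// pollsUntilOff status checks and SHUTOFF afterwards.
type fakeServer struct {
	pollsUntilOff int
	polls         int
	deleted       bool
}

func (f *fakeServer) Stop(context.Context, string) error { return nil }

func (f *fakeServer) Status(context.Context, string) (string, error) {
	f.polls++
	if f.polls > f.pollsUntilOff {
		return "SHUTOFF", nil
	}
	return "ACTIVE", nil
}

func (f *fakeServer) Delete(context.Context, string) error {
	f.deleted = true
	return nil
}

// demo runs stopThenDelete against a fake server with a 1ms poll interval.
func demo(pollsUntilOff, timeoutMillis int) (graceful, deleted bool) {
	f := &fakeServer{pollsUntilOff: pollsUntilOff}
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	g, err := stopThenDelete(context.Background(), f, "vm-1", timeout, time.Millisecond)
	if err != nil {
		panic(err)
	}
	return g, f.deleted
}

func main() {
	fmt.Println(demo(2, 1000))     // server shuts down quickly
	fmt.Println(demo(1000000, 20)) // server never shuts down within the timeout
}
```

In this sketch a failed or timed-out stop never blocks deletion; whether that is the right trade-off is exactly what the thread is debating.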

@shaardie
Author

shaardie commented Nov 25, 2025

I can also change the PR to continue with deleting the VM instead of failing once the timeout has passed.

For me personally, 60s would also be okay as a timeout, but I can think of situations where it is a little short, for example with custom NFS, CIFS, or GPFS mounts. These can easily take more than 60s to shut down.

Maybe you should first decide whether you want this value to be configurable via the CRD?

@lentzi90
Contributor

I have checked with my downstream and they do not have any concerns with the feature (always enabled).

However, it sounds like there are quite a few ways to do this, and people will want different things. Some do not care about the shutdown and want to force it or just delete straight away. Some want to make sure everything is properly shut down and would rather get an error than force it. And some will want a different timeout.

So how should we do this? I can see it working with either a flag or CRD field(s).

Then we have one more thing to consider. We want to make use of ORC for managing the servers. See #2814 for more details. Hopefully we can get this done sooner rather than later, which means that this feature would make more sense to implement in ORC directly. Otherwise we will end up having to migrate it later.

@shaardie
Author

I am not quite sure what you want me to do, honestly. I would be happy to change things in this PR if you tell me what you want.

If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

@Atomsoldat

Atomsoldat commented Nov 25, 2025

> So how should we do this?

Human interaction analogy

Let's go through the scenarios where Alice (User) wants Bob (CAPO) to delete a VM in a "regular" talk-to-your-colleague kind of interaction:

  • If Alice tells Bob "Please shut down this VM for me, I need one less now" and gives no further details, Bob has to go with best practice and perform an ordinary shutdown, giving the OS some time to terminate processes properly.
    • If the normal shutdown does not proceed as planned, a human operator would typically ask Alice to confirm whether she agrees to a forced shutdown.
    • An automated process cannot do this, so it has to do what minimises the deviation from the desired state (VM off, no improperly shut down processes causing trouble) while maximising the velocity of reaching that state. We know that we cannot maximise both, but (at least from my perspective) many things can go wrong when we immediately shut everything off without proper cleanup, while the repercussions of a 60-second delay before we get the big stick seem far less severe in comparison. So some delay should be better than no delay in the majority of cases.
  • If Alice knows that the VM will take a long time to shut down normally, she should tell Bob so that he is not surprised.
  • If Alice wants Bob to pull the plug on the machine immediately instead of performing the usual shutdown, she should tell him beforehand, because she cannot expect this to be his regular modus operandi.

In all of the above cases, Alice should provide Bob with the information he needs to proceed in an ideal way. To me, this hints in the direction of Alice (User) providing this information to Bob (CAPO) beforehand, in a way Bob understands (CRD field). If Bob tries to have one solution that applies to all possible use cases (Configuration Flag) he might get some cases wrong, in which Alice has different requirements.

You might also have the case that you have one CAPO instance managing VMs that you want to be deleted immediately as well as VMs that you want to give time for an orderly shutdown. That would also make the CRD field approach more desirable.

Which value should the feature use by default

In my opinion, the Venn diagram of the people for whom one minute of additional VM runtime would be more than a minor inconvenience (which could then be fixed easily) and the people who would be caught unaware by such a change should have a very small intersection.

With the way things currently work, on the other hand, the Venn diagram of the people for whom an immediate VM termination would be more than a minor inconvenience (which may or may not be easily remedied) and the people who might be bitten by this in the future probably has a larger intersection (in my opinion).

So I think that, under those conditions, there is no need to treat the previous default (which is unusual and can definitely cause headaches) with a lot of reverence. I think 60 seconds before forcing termination (which may then be adjusted for individual VMs with special considerations) is a reasonable default. If people never want their VMs to be force-terminated, they set it to -1, and if they want them terminated immediately, they set it to 0.

But this is just my opinion, just trying to give some input to give you a perspective on the choices you mentioned.
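
As a rough illustration of the proposed semantics (60 seconds by default, -1 for "never force", 0 for "terminate immediately"), a per-machine setting could be decoded along these lines. This is a sketch only: `shutdownPolicy`, the mode constants, and the idea of an optional integer field in seconds are assumptions for illustration, not an existing CAPO API or CRD field.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownMode describes how deletion should treat a running VM.
type shutdownMode int

const (
	deleteImmediately shutdownMode = iota // 0: skip the graceful shutdown
	waitThenForce                         // >0: wait up to the timeout, then delete anyway
	waitForever                           // -1: never force, keep waiting for shutdown
)

// defaultShutdownTimeout is the default suggested in the discussion.
const defaultShutdownTimeout = 60 * time.Second

// shutdownPolicy maps a hypothetical optional timeout field (in seconds) to
// a mode and a wait duration. A nil pointer means the field was left unset.
func shutdownPolicy(timeoutSeconds *int) (shutdownMode, time.Duration, error) {
	if timeoutSeconds == nil {
		return waitThenForce, defaultShutdownTimeout, nil
	}
	switch v := *timeoutSeconds; {
	case v == 0:
		return deleteImmediately, 0, nil
	case v == -1:
		return waitForever, 0, nil
	case v > 0:
		return waitThenForce, time.Duration(v) * time.Second, nil
	default:
		return 0, 0, fmt.Errorf("invalid shutdown timeout %d: want -1, 0, or a positive number of seconds", v)
	}
}

// intPtr is a small helper for building optional values.
func intPtr(v int) *int { return &v }

func main() {
	for _, in := range []*int{nil, intPtr(0), intPtr(-1), intPtr(300), intPtr(-5)} {
		mode, wait, err := shutdownPolicy(in)
		fmt.Println(mode, wait, err)
	}
}
```

Keeping the value per machine rather than per controller matches the point above that one CAPO instance may manage VMs with different shutdown requirements.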

@lentzi90
Contributor

> I am not quite sure what you want me to do, honestly. I would be happy to change things in this PR if you tell me what you want.
>
> If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

I am basically saying that I think we need an option to either turn this feature off or to allow more granular configuration of it. I do not have a strong opinion on how to do that so I am leaving it up to you to propose what to do. The issue description already suggests a waitForShutdown field. That sounds reasonable and other people seem to agree also.

If you don't need this urgently, I also suggest looking into ORC first so that we can get an implementation that will work with it. Otherwise we risk breaking this feature later.

@smoshiur1237
Contributor

smoshiur1237 commented Dec 1, 2025

Since we expect a graceful shutdown, should we follow these steps so that an ungraceful shutdown doesn't happen?

  1. If the VM is running, issue a "stop" or "poweroff" command via the OpenStack API.
  2. Wait for the VM to reach the "SHUTOFF" state.
  3. Delete the instance as usual.

I can see the stop server option, but is it synchronized with the deletion of the instance?
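
One way to keep the stop and the delete in sync inside a controller is to decide, on every reconcile pass, which single action comes next based on the observed server status. The following Go sketch assumes Nova-style status strings (`ACTIVE`, `SHUTOFF`); `nextDeleteAction`, its parameters, and the action names are hypothetical and only illustrate the ordering of the steps above.

```go
package main

import "fmt"

// action is the single step a reconcile pass should take during deletion.
type action string

const (
	actionStop   action = "stop"   // request a graceful shutdown
	actionWait   action = "wait"   // requeue and check the status again later
	actionDelete action = "delete" // delete the instance
)

// nextDeleteAction picks the next step from the observed state.
// stopRequested records whether a stop was already issued for this server;
// deadlinePassed records whether the (hypothetical) shutdown timeout expired.
func nextDeleteAction(status string, stopRequested, deadlinePassed bool) action {
	switch {
	case status != "ACTIVE":
		// Already SHUTOFF, or in a state with nothing to shut down gracefully.
		return actionDelete
	case !stopRequested:
		return actionStop
	case deadlinePassed:
		// Still running after the timeout: stop waiting and delete.
		return actionDelete
	default:
		return actionWait
	}
}

func main() {
	fmt.Println(nextDeleteAction("ACTIVE", false, false))
	fmt.Println(nextDeleteAction("ACTIVE", true, false))
	fmt.Println(nextDeleteAction("SHUTOFF", true, false))
	fmt.Println(nextDeleteAction("ACTIVE", true, true))
}
```

Because each pass takes at most one step, the delete call is only issued after the server has been observed as stopped or the timeout has expired, without blocking the controller in between.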

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

Status: Inbox

Development

Successfully merging this pull request may close these issues.

Option to Shutdown VM before deleting it

6 participants