docs: document NoExecute taint risks and add admission warning #120

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from ketanjani21:noexecute-1
Feb 24, 2026

Conversation

@ketanjani21 (Contributor) commented Feb 5, 2026

Description

Addresses the risk of using NoExecute taints with continuous enforcement mode, which can cause unintended pod evictions.

Changes:

  • Add "Choosing a Taint Effect" section to concepts.md explaining the implications of each taint effect and when to use them
  • Return admission warnings when creating/updating rules with NoExecute effect
  • Add tests for the new warning logic
  • Update taint field godoc with brief caution note
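The warning behavior described in the changes above could be sketched as follows. This is a minimal illustration, not the controller's actual code: the `NodeReadinessRuleSpec` struct and `noExecuteWarnings` function here are hypothetical simplifications, with the warning texts taken from the test transcript below.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real CRD spec fields.
type NodeReadinessRuleSpec struct {
	TaintEffect     string // "NoSchedule", "PreferNoSchedule", or "NoExecute"
	EnforcementMode string // "continuous" or "bootstrap-only"
}

// noExecuteWarnings mirrors the behavior this PR describes: any NoExecute
// rule gets a NOTE, and NoExecute combined with continuous enforcement
// gets the stronger CAUTION.
func noExecuteWarnings(spec NodeReadinessRuleSpec) []string {
	if spec.TaintEffect != "NoExecute" {
		return nil
	}
	if spec.EnforcementMode == "continuous" {
		return []string{"CAUTION: NoExecute with continuous mode evicts pods when conditions fail, " +
			"risking workload disruption. Consider NoSchedule or bootstrap-only"}
	}
	return []string{"NOTE: NoExecute will evict existing pods without tolerations. " +
		"Ensure critical system pods have appropriate tolerations"}
}

func main() {
	spec := NodeReadinessRuleSpec{TaintEffect: "NoExecute", EnforcementMode: "continuous"}
	fmt.Println(noExecuteWarnings(spec)[0])
}
```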

Type of Change

/kind documentation
/kind feature

Testing

# Test 1: NoExecute + continuous  (CAUTION warning)
⎈ kind-nrr-test  cat noexecute-continuous.yaml
  apiVersion: readiness.node.x-k8s.io/v1alpha1
  kind: NodeReadinessRule
  metadata:
    name: test-noexecute-continuous
  spec:
    conditions:
      - type: NetworkReady
        requiredStatus: "True"
    nodeSelector:
      matchLabels:
        reserved-for: worker
    taint:
      key: readiness.k8s.io/NetworkReady
      effect: NoExecute
    enforcementMode: continuous
    
⎈ kind-nrr-test  kubectl apply -f noexecute-continuous.yaml
Warning: CAUTION: NoExecute with continuous mode evicts pods when conditions fail, risking workload disruption. Consider NoSchedule or bootstrap-only
nodereadinessrule.readiness.node.x-k8s.io/test-noexecute-continuous created

# Test 2: NoExecute + bootstrap-only  (NOTE warning)
⎈ kind-nrr-test  cat noexecute-boostrap-only.yaml
  apiVersion: readiness.node.x-k8s.io/v1alpha1
  kind: NodeReadinessRule
  metadata:
    name: test-noexecute-bootstrap
  spec:
    conditions:
      - type: NetworkReady
        requiredStatus: "True"
    nodeSelector:
      matchLabels:
        reserved-for: worker
    taint:
      key: readiness.k8s.io/NetworkReady2
      effect: NoExecute
    enforcementMode: bootstrap-only
    
⎈ kind-nrr-test  kubectl apply -f noexecute-boostrap-only.yaml
Warning: NOTE: NoExecute will evict existing pods without tolerations. Ensure critical system pods have appropriate tolerations
nodereadinessrule.readiness.node.x-k8s.io/test-noexecute-bootstrap created

# Test 3: empty nodeSelector  (rejected)
⎈ kind-nrr-test  cat invalid-test.yaml
  apiVersion: readiness.node.x-k8s.io/v1alpha1
  kind: NodeReadinessRule
  metadata:
    name: test-invalid
  spec:
    conditions:
      - type: NetworkReady
        requiredStatus: "True"
    nodeSelector: {}
    taint:
      key: readiness.k8s.io/NetworkReady
      effect: NoSchedule
    enforcementMode: continuous
   
⎈ kind-nrr-test  kubectl apply -f invalid-test.yaml
Error from server (Forbidden): error when creating "invalid-test.yaml": admission webhook "vnodereadinessrule.kb.io" denied the request: validation failed: [spec.nodeSelector: Required value: nodeSelector must not be empty]


# Test 4: Conflict detection: same taint key+effect+selector as any other rule (rejected)
⎈ kind-nrr-test  cat conflict-detection.yaml
  apiVersion: readiness.node.x-k8s.io/v1alpha1
  kind: NodeReadinessRule
  metadata:
    name: test-conflict
  spec:
    conditions:
      - type: StorageReady
        requiredStatus: "True"
    nodeSelector:
      matchLabels:
        reserved-for: worker
    taint:
      key: readiness.k8s.io/NetworkReady3
      effect: NoSchedule
    enforcementMode: continuous
  
⎈ kind-nrr-test  kubectl apply -f conflict-detection.yaml
Error from server (Forbidden): error when creating "conflict-detection.yaml": admission webhook "vnodereadinessrule.kb.io" denied the request: validation failed: [spec.taint.key: Invalid value: "readiness.k8s.io/NetworkReady3": conflicts with existing rule 'test-noschedule' - same taint key 'readiness.k8s.io/NetworkReady3' and effect 'NoSchedule' with overlapping node selectors]

Checklist

  • make test passes
  • make lint passes

Does this PR introduce a user-facing change?

Add admission warning when creating NodeReadinessRule with NoExecute taint effect. Using NoExecute with continuous enforcement can cause pod evictions during transient condition failures. The webhook now returns a warning guiding users toward safer configurations.

Doc #9

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 5, 2026
@k8s-ci-robot
Contributor

Welcome @ketanjani21!

It looks like this is your first PR to kubernetes-sigs/node-readiness-controller 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/node-readiness-controller has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 5, 2026

netlify bot commented Feb 5, 2026

Deploy Preview for node-readiness-controller canceled.

Name Link
🔨 Latest commit 1209a23
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/699ac7fa57709d0008b42c41


This allows you to preview exactly which nodes would be affected and identify any potential misconfigurations (such as a typo in a label selector) before they impact your cluster.

## Choosing a Taint Effect
Contributor

IMO, the taint introduction doesn't need to be here. Maybe we could link to the official documentation?

And what do you think about creating a separate page under user-guide for creating rules? More like a how-to or getting-started page?

I'm thinking aspects like #46 and the prefix restrictions introduced by #106 could also go in there.

I imagine this is generated with some LLM help? IMO, I'm generally open to AI generating docs, but it could be more concise to give the reader more value from reading less. Just my thought.

Contributor Author

@ajaysundark Thank you for the feedback. I noticed there was a getting-started.md already under docs, so I moved it under user-guide, updated it, and re-arranged a few sections.

I also updated the information about prefix restrictions.

The maximum limits for the controller are already documented here, so I didn't duplicate the info again.

Happy to make further updates based on your feedback.

Contributor

@ajaysundark commented Feb 10, 2026

sounds fair. we could leave the limits at the reference spec.

^it looks to me that docs/spec.md is a missed clean up as it is not maintained (crd-ref-docs only updates docs/book/src/reference/api-spec.md).
Would you be able to help on this as a follow-up PR?

cc @sreeram-venkitesh

Contributor

xref #124

Contributor Author

yes, happy to do a follow-up PR to clean up docs/spec.md 👍

@ajaysundark
Contributor

/assign @ketanjani21

This is the priority class used by other critical cluster components (eg: core-dns).

**Images**: The official releases use multi-arch images (AMD64, Arm64).
**Images**: The official releases use multi-arch images (AMD64, Arm64) available in the Kubernetes staging registry:
Contributor

this is not correct. The staging registry is different from 'registry.k8s.io/xx', which is where the official images can be consumed.

Contributor Author

Thank you for catching this, it got carried over from the root-level getting-started.md. Updated 👍

**Images**: The official releases use multi-arch images (AMD64, Arm64) available in the Kubernetes staging registry:

### Option 2: Deploy Using Kustomize
```sh
Contributor

we already have installation flows captured: simple consume from official release images, and using kustomize for advanced setup.

I think documenting 3 different installation paths would just be confusing the reader. Maybe we could hold on to this for now and come back to updating the installation section for #94.

Contributor Author

Makes sense. I've simplified the installation method for now; we can revisit later per your suggestion.

Contributor

@ajaysundark left a comment

@ketanjani21 Left some suggestions, ptal.
Thanks for looking into this.

@@ -36,16 +34,10 @@ This is the priority class used by other critical cluster components (eg: core-d

**Images**: The official releases use multi-arch images (AMD64, Arm64).
Contributor

Suggested change
**Images**: The official releases use multi-arch images (AMD64, Arm64).
#### Images
The official releases use multi-arch images (AMD64, Arm64) and are available at `registry.k8s.io/node-readiness-controller/node-readiness-controller`


If you have cloned the repository and want to deploy from source, you can use Kustomize.

```sh
Contributor

I think we could skip this detail


## Deployment Options

### Option 1: Install Official Release (Recommended)
Contributor

we could keep this as well


**Images**: The official releases use multi-arch images (AMD64, Arm64).

### Option 2: Deploy Using Kustomize
Contributor

please revert to existing, let us keep the kustomize option

Comment on lines 14 to 31
```yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: storage-readiness-rule
spec:
  conditions:
    - type: "storage.kubernetes.io/CSIReady"
      requiredStatus: "True"
    - type: "storage.kubernetes.io/VolumePluginReady"
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/StorageReady"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchLabels:
Contributor

Suggested change
```yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: nfs-storage-readiness-rule
spec:
  conditions:
    - type: "csi.example.net/NodePluginRegistered"
      requiredStatus: "True"
    - type: "csi.example.net/BackendReachable"
      requiredStatus: "True"
    - type: "DiskPressure"
      requiredStatus: "False"
  taint:
    key: "readiness.k8s.io/vendor.com/nfs-unhealthy"
    effect: "NoSchedule"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchLabels:

effect: "NoSchedule"

# Rule 2 - This will be REJECTED
spec:
Contributor

Suggested change
spec:
spec:
  conditions:
    - type: "cniplugin.example.net/rdma/NetworkReady"
      requiredStatus: "True"

- Removes bootstrap taint when conditions are first satisfied
- Marks completion with node annotation
- Stops monitoring after successful removal (fail-safe)
- Ideal for one-time setup conditions (storage, installing node daemons e.g: security agent or kernel-module update)
Contributor

Suggested change
- Ideal for one-time setup conditions (storage, installing node daemons e.g: security agent or kernel-module update)
- Ideal for one-time setup conditions (installing node daemons e.g: security agent or kernel-module update)

Contributor

let us leave Cluster Autoscaler for now

Contributor

Let us leave security for now, we'll find a separate page with more details later.

Contributor

Let us also remove the Development section

@ketanjani21
Contributor Author

@ajaysundark - Thank you for reviewing the PR, I've updated it further based on your suggestions. Please take a look.


If you have cloned the repository and want to deploy from source, you can use Kustomize.
```sh
REPO="registry.k8s.io/node-readiness-controller/node-readiness-reporter"
Contributor

Suggested change
REPO="registry.k8s.io/node-readiness-controller/node-readiness-reporter"
REPO="registry.k8s.io/node-readiness-controller/node-readiness-controller"

**`NoExecute` with `continuous` enforcement mode will evict existing workloads when conditions fail.**

## Configuration
If a critical component becomes temporarily unavailable (e.g., CNI daemon restart), all pods without matching tolerations are immediately evicted from the node. Use `NoSchedule` to prevent new scheduling without disrupting running workloads.
Contributor

Suggested change
If a critical component becomes temporarily unavailable (e.g., CNI daemon restart), all pods without matching tolerations are immediately evicted from the node. Use `NoSchedule` to prevent new scheduling without disrupting running workloads.
If a readiness condition on the node is failing temporarily (eg., the component restarted), all pods without matching tolerations are immediately evicted from the node, if configured with a `NoExecute` taint. Use `NoSchedule` to prevent new scheduling without disrupting running workloads.

warnings = append(warnings,
	"CAUTION: Using NoExecute taint effect with continuous enforcement mode. "+
		"This configuration will evict existing pods when conditions fail, which may cause "+
		"significant workload disruption if conditions are unstable. Consider using NoSchedule "+
Contributor

Suggested change
"significant workload disruption if conditions are unstable. Consider using NoSchedule "+
"workload disruption if conditions are unstable. Consider using NoSchedule "+

// NoExecute with continuous mode is particularly risky
if spec.EnforcementMode == readinessv1alpha1.EnforcementModeContinuous {
	warnings = append(warnings,
		"CAUTION: Using NoExecute taint effect with continuous enforcement mode. "+
Contributor

Thanks for also handling API warnings. Though these are nice to have, there seem to be length restrictions on processing the admission warnings. ref: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#response.

Could you check if this satisfies the criteria?
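One way to make that length criterion checkable in a unit test: per the Kubernetes admission-controller docs linked above, individual warning messages longer than 256 characters may be truncated by the API server. The helper below is a sketch, not existing project code; `overLimitWarnings` and the constant are hypothetical names.

```go
package main

import "fmt"

// Per the Kubernetes admission docs, individual warnings over
// 256 characters may be truncated by the API server.
const maxWarningLen = 256

// overLimitWarnings returns any warnings at risk of truncation,
// so a unit test can flag them before they ship.
func overLimitWarnings(warnings []string) []string {
	var tooLong []string
	for _, w := range warnings {
		if len(w) > maxWarningLen {
			tooLong = append(tooLong, w)
		}
	}
	return tooLong
}

func main() {
	warnings := []string{
		"CAUTION: NoExecute with continuous mode evicts pods when conditions fail, " +
			"risking workload disruption. Consider NoSchedule or bootstrap-only",
	}
	// The warning from the test transcript fits well under the limit.
	fmt.Println(len(overLimitWarnings(warnings)))
}
```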

## Configuration
If a critical component becomes temporarily unavailable (e.g., CNI daemon restart), all pods without matching tolerations are immediately evicted from the node. Use `NoSchedule` to prevent new scheduling without disrupting running workloads.

The admission webhook warns when using `NoExecute`.
Contributor

@ajaysundark commented Feb 16, 2026

Did you get a chance to test your changes w/ warnings? You may need to follow this PR: #122 for testing the webhook.

nit / optional: If you get a chance to verify it, you could also add the warning here as response to kubectl.

@ajaysundark
Contributor

> @ajaysundark - Thank you for reviewing the PR, I've updated it further based on your suggestions. Please take a look.

@ketanjani21 Thanks for iterating on the PR. Just found some quick notes on the warnings length limitations, did you get a chance to test them in a kind cluster (docs/TEST_README.md)?

@ketanjani21
Contributor Author

> @ajaysundark - Thank you for reviewing the PR, I've updated it further based on your suggestions. Please take a look.
>
> @ketanjani21 Thanks for iterating on the PR. Just found some quick notes on the warnings length limitations, did you get a chance to test them in a kind cluster (docs/TEST_README.md)?

Yes, I've kept the warnings under the character limits and tested them in a kind cluster. Please see the Testing section of the PR description for the results.

restructure docs

update taint key naming instructions

update registry url, simplify installation methods

refactor based on PR feedback

fix repo url

minor updates

update warnings, and getting started docs
@ajaysundark
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2026
@ajaysundark
Contributor

/approve

This looks great, Thanks for handling these changes @ketanjani21!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajaysundark, ketanjani21

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2026
@k8s-ci-robot k8s-ci-robot merged commit 7215277 into kubernetes-sigs:main Feb 24, 2026
10 checks passed