
Conversation


@stefanhipfel stefanhipfel commented Mar 25, 2025

Proposed Changes

  • Adds the ServerMaintenanceReplica CRD and controller, which allow putting a list of servers into maintenance.
  • Adds a Completed state to the Maintenance CRD:
    • A Server can now be removed from maintenance either by deleting its Maintenance CRD or by setting its state to Completed
  • Adds a Maintenance list index
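The two removal paths above can be modeled as a tiny decision function. This is an illustrative sketch with hypothetical names (the real state constants live in the operator's api/v1alpha1 package), not the operator's actual API:

```go
package main

import "fmt"

// MaintenanceState mirrors the maintenance states described above;
// names are assumptions for illustration only.
type MaintenanceState string

const (
	StatePending       MaintenanceState = "Pending"
	StateInMaintenance MaintenanceState = "InMaintenance"
	StateFailed        MaintenanceState = "Failed"
	StateCompleted     MaintenanceState = "Completed"
)

// releasesServer reports whether the server leaves maintenance: either
// the maintenance object is deleted, or its state is Completed.
func releasesServer(state MaintenanceState, deleted bool) bool {
	return deleted || state == StateCompleted
}

func main() {
	fmt.Println(releasesServer(StateInMaintenance, false)) // false: still in maintenance
	fmt.Println(releasesServer(StateCompleted, false))     // true: Completed releases the server
	fmt.Println(releasesServer(StatePending, true))        // true: deletion always releases
}
```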

Summary by CodeRabbit

Release Notes

  • New Features

    • ServerMaintenanceSet: new resource for managing multiple server maintenance operations across servers matching specified label selectors
  • Updates

    • Refactored ServerMaintenance specification to use template-based configuration structure
    • Extended maintenance state machine with new Completed state
    • Added RBAC ClusterRoles (admin, editor, viewer) for granular ServerMaintenanceSet access control
  • Documentation

    • Added API reference and architectural documentation for ServerMaintenanceSet
    • Added sample manifests demonstrating ServerMaintenanceSet usage patterns



@Nuckal777 Nuckal777 left a comment


Thanks for looking into this.

SzymonSAP and others added 7 commits April 1, 2025 10:07
* Add helm charts

* Add node selector and tolerations to charts

* Add charts documentation

* Add make helm and check it in a workflow

* Add hostNetwork value

* Restrict helm workflow to read only
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.36.2 to 1.36.3.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.36.2...v1.36.3)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump golangci/golangci-lint-action from 6 to 7

Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 6 to 7.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v6...v7)

---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump golangci-lint to v2.0 and fix linting issues

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andreas Fritzler <andreas.fritzler@sap.com>
* implements maintenance controller

* fixes lint errors

* adds missing license info to maintenance controller

* minor refactoring

* uses server/serverclaim watches

* fixes review comments

* adds rbac roles; fixes test

* enhances test with multiple maintenance objects

* fixes enqueueMaintenanceByServerRefs

* fixes lint errors

* updates server reconciliation logic to handle ServerClaimRef state transitions

* rebase main; fixes linting errors

* renames createServerBootConfiguration func

---------

Co-authored-by: Stefan Hipfel <stefan@Mac.fritz.box>
@github-actions github-actions bot added the size/XL and documentation (Improvements or additions to documentation) labels and removed the size/XXL label Apr 1, 2025
@stefanhipfel stefanhipfel changed the title from "implements serverMaintenanceReplica" to "implements serverMaintenanceSet" Apr 1, 2025
@github-actions github-actions bot added size/XXL and removed size/XL labels Apr 1, 2025
@stefanhipfel stefanhipfel marked this pull request as ready for review April 1, 2025 15:54
@stefanhipfel stefanhipfel linked an issue Apr 10, 2025 that may be closed by this pull request
@afritzler afritzler changed the title from "implements serverMaintenanceSet" to "Add ServerMaintenanceSet type and controller implementation" Apr 14, 2025
@stefanhipfel stefanhipfel requested a review from afritzler April 22, 2025 14:20
@afritzler
Copy link
Member

Can we also add some docs under https://github.com/ironcore-dev/metal-operator/tree/main/docs/concepts and reference them in the overall arch chapter https://github.com/ironcore-dev/metal-operator/blob/main/docs/architecture.md? (No need to expand the graph here, as it would get too big.)


@Nuckal777 Nuckal777 left a comment


Thanks for taking another look at this. 👍

Comment on lines 192 to 204
for _, maintenance := range maintenancelist.Items {
	log.V(1).Info("Checking for orphaned maintenance", "maintenance", maintenance.Name)
	// Check if the maintenance is part of the server maintenance set
	if !slices.Contains(serverNames, maintenance.Spec.ServerRef.Name) {
		log.V(1).Info("Maintenance is orphaned, deleting", "maintenance", maintenance.Name)
		if err := r.Delete(ctx, &maintenance); err != nil {
			errs = errors.Join(errs, fmt.Errorf("failed to delete orphaned maintenance %s: %w", maintenance.Name, err))
		} else {
			log.V(1).Info("Deleted orphaned maintenance", "maintenance", maintenance.Name)
		}
		continue
	}
}
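The orphan check in the snippet above reduces to membership testing with slices.Contains. A standalone sketch, with plain strings standing in for the API objects:

```go
package main

import (
	"fmt"
	"slices"
)

// findOrphans mirrors the loop above in simplified form: a maintenance
// is orphaned when the server it references is no longer among the
// servers currently selected by the set.
func findOrphans(selectedServers, maintenanceServerRefs []string) []string {
	var orphans []string
	for _, ref := range maintenanceServerRefs {
		if !slices.Contains(selectedServers, ref) {
			orphans = append(orphans, ref)
		}
	}
	return orphans
}

func main() {
	selected := []string{"server-a", "server-b"}
	refs := []string{"server-a", "server-c"}
	fmt.Println(findOrphans(selected, refs)) // [server-c]
}
```

The linear scan is fine for small lists; for many servers a set (map[string]struct{}) lookup would avoid the O(n·m) cost.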

Shouldn't we simply check that the Servers targeted by the managed ServerMaintenances still exist in serverList (similar to what is done here)?

Contributor Author

Checking whether the server is being deleted happens in line 188.

@stefanhipfel stefanhipfel requested a review from a team as a code owner December 19, 2025 13:59
… enhance tests for server deletion and status updates

coderabbitai bot commented Dec 23, 2025

Walkthrough

This PR introduces a new ServerMaintenanceSet CRD to manage multiple ServerMaintenance resources. It refactors ServerMaintenance to use an embedded ServerMaintenanceTemplate for configuration fields, adds a dedicated controller for ServerMaintenanceSet reconciliation, and provides supporting infrastructure including RBAC rules, documentation, and tests.

Changes

Cohort / File(s) — Summary

  • API Type Definitions (PROJECT, api/v1alpha1/servermaintenance_types.go, api/v1alpha1/servermaintenanceset_types.go): Adds the ServerMaintenanceSet CRD entry to the project config. Refactors ServerMaintenanceSpec to embed ServerMaintenanceTemplate (containing Policy, ServerPower, ServerBootConfigurationTemplate) and moves ServerRef to the top level. Introduces the ServerMaintenanceStateCompleted state. Defines new ServerMaintenanceSet and ServerMaintenanceSetList types with spec (serverSelector, serverMaintenanceTemplate) and status counters.
  • Generated Deep-Copy Methods (api/v1alpha1/zz_generated.deepcopy.go): Adds DeepCopyInto, DeepCopy, and DeepCopyObject methods for ServerMaintenanceSet, ServerMaintenanceSetList, ServerMaintenanceSetSpec, ServerMaintenanceSetStatus, and ServerMaintenanceTemplate. Updates ServerMaintenanceSpec.DeepCopyInto to handle the embedded ServerMaintenanceTemplate.
  • CRD Manifest & Kustomization (config/crd/bases/metal.ironcore.dev_servermaintenancesets.yaml, config/crd/kustomization.yaml): Introduces a new CustomResourceDefinition for ServerMaintenanceSet with a full openAPIV3 schema, validation rules, and status subresource. Adds the resource entry to the kustomization.
  • RBAC Roles (config/rbac/role.yaml, config/rbac/servermaintenanceset_*.yaml): Adds servermaintenancesets, servermaintenancesets/finalizers, and servermaintenancesets/status to the main ClusterRole. Creates three new roles: servermaintenanceset-admin-role (full permissions), servermaintenanceset-editor-role (create/delete/patch/update/list/watch), and servermaintenanceset-viewer-role (read-only).
  • Sample Manifests & Configuration (config/samples/kustomization.yaml, config/samples/metal_v1alpha1_servermaintenance.yaml, config/samples/metal_v1alpha1_servermaintenanceset.yaml): Updates the ServerMaintenance sample with actual spec fields (policy, serverPower, serverRef). Adds a new ServerMaintenanceSet sample with serverSelector and serverMaintenanceTemplate. Updates the kustomization to include both samples.
  • Controller Implementation (internal/controller/servermaintenance_controller.go, internal/controller/servermaintenanceset_controller.go, internal/controller/index_fields.go, internal/controller/helper.go): Implements ServerMaintenanceSetReconciler with finalizer handling, server selection, ServerMaintenance lifecycle management, and status aggregation. Updates the ServerMaintenance controller to handle the Completed state and adds nil-server guards. Adds a ServerMaintenance indexer by spec.serverRef.name. Introduces a truncateString helper.
  • Controller Tests (internal/controller/server_controller_test.go, internal/controller/servermaintenance_controller_test.go, internal/controller/servermaintenanceset_controller_test.go, internal/controller/suite_test.go, internal/controller/test_helper.go): Updates existing tests to reflect the ServerMaintenanceTemplate nesting. Adds a comprehensive ServerMaintenanceSetReconciler test suite verifying maintenance creation, deletion, status aggregation, and selector-based reconciliation. Disables dev-mode logging. Reorders test validation.
  • Documentation (docs/api-reference/api.md, docs/architecture.md, docs/concepts/servermaintenanceset.md): Adds ServerMaintenanceSet to the API reference with field definitions and cross-references. Adds ServerMaintenanceSet to the architecture overview. Introduces new concept documentation explaining the ServerMaintenanceSet workflow, label-based server selection, and status tracking.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant K8s API
    participant SMSReconciler as ServerMaintenanceSet<br/>Reconciler
    participant Server as Server<br/>Resource
    participant SM as ServerMaintenance<br/>Resource
    
    User->>K8s API: Create ServerMaintenanceSet<br/>(with serverSelector)
    K8s API->>SMSReconciler: Reconcile triggered
    
    SMSReconciler->>SMSReconciler: Ensure finalizer attached
    
    SMSReconciler->>K8s API: Query servers matching<br/>serverSelector labels
    K8s API-->>SMSReconciler: List of matching Servers
    
    loop For each matching Server
        SMSReconciler->>SMSReconciler: Check if ServerMaintenance<br/>already exists
        alt ServerMaintenance not owned
            SMSReconciler->>K8s API: Create ServerMaintenance<br/>(from template + serverRef)
            K8s API-->>SM: ServerMaintenance created
            SMSReconciler->>SM: Set ControllerReference<br/>to ServerMaintenanceSet
        end
    end
    
    SMSReconciler->>K8s API: Query all owned<br/>ServerMaintenance resources
    K8s API-->>SMSReconciler: List of owned maintenances
    
    loop For each owned ServerMaintenance
        SMSReconciler->>SM: Get current state<br/>(Pending/InMaintenance/Failed/Completed)
        SM-->>SMSReconciler: State + condition info
        SMSReconciler->>SMSReconciler: Aggregate status counts
    end
    
    alt ServerMaintenance references<br/>deleted server
        SMSReconciler->>K8s API: Delete orphaned<br/>ServerMaintenance
    end
    
    SMSReconciler->>K8s API: Patch ServerMaintenanceSet<br/>status with aggregated counts
    K8s API-->>User: Status updated<br/>(pending, inMaintenance,<br/>failed, completed, total)
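The status-aggregation step at the end of the diagram can be sketched in plain Go. The field names here are assumptions modeled on the status counters described in the walkthrough, not the operator's real types:

```go
package main

import "fmt"

// SetStatus stands in for the ServerMaintenanceSet status counters
// (pending, inMaintenance, failed, completed, total); field names
// are illustrative assumptions.
type SetStatus struct {
	Pending, InMaintenance, Failed, Completed, Total int32
}

// aggregate tallies the states of the owned ServerMaintenance
// resources into the set's status counters.
func aggregate(states []string) SetStatus {
	var s SetStatus
	for _, st := range states {
		switch st {
		case "Pending":
			s.Pending++
		case "InMaintenance":
			s.InMaintenance++
		case "Failed":
			s.Failed++
		case "Completed":
			s.Completed++
		}
		s.Total++
	}
	return s
}

func main() {
	fmt.Println(aggregate([]string{"Pending", "Completed", "Completed"}))
}
```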

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check — ❓ Inconclusive: The description covers key changes but contains terminology mismatches and is incomplete relative to the template and actual changeset. Clarify that the CRD is ServerMaintenanceSet (not ServerMaintenanceReplica), expand on the Completed state lifecycle changes, and document the refactoring of ServerMaintenanceTemplate and the restructured ServerMaintenanceSpec.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The PR title accurately describes the primary change: adding a new ServerMaintenanceSet type and its controller implementation.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
internal/controller/helper.go (1)

156-162: Consider edge cases for truncateString.

Two potential issues:

  1. Negative maxLength: If maxLength < 0, the slice s[:maxLength] will panic. Consider adding a guard.
  2. Multi-byte UTF-8 characters: Slicing by byte index can split a multi-byte character, producing invalid UTF-8. If this is used for Kubernetes object names (which should be ASCII anyway), this may be acceptable, but worth noting.
Proposed defensive fix
 func truncateString(s string, maxLength int) string {
+	if maxLength <= 0 {
+		return ""
+	}
 	if len(s) <= maxLength {
 		return s
 	}
 	return s[:maxLength]
 }
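The defensive variant proposed in the diff above can be exercised standalone. Note the remaining caveat: slicing by byte index may still split a multi-byte UTF-8 character, which is acceptable for ASCII Kubernetes object names.

```go
package main

import "fmt"

// truncateString returns s cut to at most maxLength bytes, with a
// guard for non-positive maxLength so s[:maxLength] cannot panic.
func truncateString(s string, maxLength int) string {
	if maxLength <= 0 {
		return ""
	}
	if len(s) <= maxLength {
		return s
	}
	return s[:maxLength]
}

func main() {
	fmt.Println(truncateString("servermaintenanceset-very-long-name", 10)) // servermain
	fmt.Println(truncateString("short", 10))                               // short
	fmt.Println(truncateString("anything", -1))                            // (empty string)
}
```

If UTF-8 safety ever matters, truncating over []rune (or checking utf8.RuneStart at the cut point) would avoid emitting invalid sequences.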
api/v1alpha1/servermaintenanceset_types.go (1)

11-16: Consider adding kubebuilder validation for required fields.

The ServerSelector field appears to be required for the controller to function correctly. Consider adding a +kubebuilder:validation:Required marker or ensuring the field has validation to prevent creating a ServerMaintenanceSet without a selector.

🔎 Suggested enhancement
 // ServerMaintenanceSetSpec defines the desired state of ServerMaintenanceSet.
 type ServerMaintenanceSetSpec struct {
 	// ServerLabelSelector specifies a label selector to identify the servers that are to be maintained.
+	// +required
 	ServerSelector metav1.LabelSelector `json:"serverSelector"`
 	// Template specifies the template for the server maintenance.
 	ServerMaintenanceTemplate ServerMaintenanceTemplate `json:"serverMaintenanceTemplate"`
 }
internal/controller/servermaintenance_controller.go (1)

235-255: Redundant error check and potential nil dereference.

In handleCompletedState:

  1. Lines 239-243: The apierrors.IsNotFound(err) check is redundant because getServerRef already returns (nil, nil) for not found errors (lines 356-358).
  2. Line 245: The condition checks server != nil && server.Spec.ServerMaintenanceRef == nil, but line 250 calls r.cleanup(ctx, log, server) without rechecking if server is nil after the early return at line 248. While logically correct due to control flow, line 253 accesses server.Name which could panic if server is nil (though current logic prevents this).
🔎 Suggested cleanup
 func (r *ServerMaintenanceReconciler) handleCompletedState(ctx context.Context, log logr.Logger, serverMaintenance *metalv1alpha1.ServerMaintenance) (ctrl.Result, error) {
 	log.V(1).Info("Handling completed state for server maintenance", "ServerMaintenance", serverMaintenance.Name)
 	server, err := r.getServerRef(ctx, serverMaintenance)
 	if err != nil {
-		if apierrors.IsNotFound(err) {
-			log.V(1).Info("Server not found, nothing to cleanup", "ServerMaintenance", serverMaintenance.Name)
-			return ctrl.Result{}, nil
-		}
 		return ctrl.Result{}, err
 	}
-	if server != nil && server.Spec.ServerMaintenanceRef == nil {
+	if server == nil {
+		log.V(1).Info("Server not found, nothing to cleanup", "ServerMaintenance", serverMaintenance.Name)
+		return ctrl.Result{}, nil
+	}
+	if server.Spec.ServerMaintenanceRef == nil {
 		log.V(1).Info("Server is not in maintenance, nothing to cleanup", "ServerMaintenance", serverMaintenance.Name)
 		return ctrl.Result{}, nil
 	}
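The (nil, nil) not-found contract of getServerRef discussed above can be illustrated with a map standing in for the API client; the names here are hypothetical:

```go
package main

import "fmt"

type Server struct{ Name string }

// getServer models the getServerRef contract: (nil, nil) means
// "not found", a non-nil error is a real failure, so callers can
// branch on server == nil without re-inspecting err.
func getServer(store map[string]*Server, name string) (*Server, error) {
	s, ok := store[name]
	if !ok {
		return nil, nil // not found is not treated as an error here
	}
	return s, nil
}

func main() {
	store := map[string]*Server{"s1": {Name: "s1"}}
	s, err := getServer(store, "missing")
	fmt.Println(s == nil, err == nil) // true true
}
```

With this contract, the redundant IsNotFound check in the caller disappears, and the nil check on server must come before any field access, exactly as the suggested cleanup arranges it.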
internal/controller/servermaintenanceset_controller.go (1)

275-314: Consider optimizing enqueue logic for large clusters.

The enqueueMaintenanceSetByServers handler lists all ServerMaintenanceSet resources for every server event, which could be expensive in clusters with many sets. Consider:

  1. Adding an index on servers by label keys used in selectors
  2. Caching selector matches
  3. Using informer caches more efficiently

For now, this is acceptable for typical deployments but worth noting for future optimization.
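The per-event matching the enqueue handler performs can be approximated as follows. This simplified sketch covers only matchLabels (the real metav1.LabelSelector also supports matchExpressions):

```go
package main

import "fmt"

// matchesSelector reports whether a server's labels satisfy a
// matchLabels-style selector: every key/value pair in matchLabels
// must be present on the server. An empty selector matches everything,
// which is why listing all sets per server event can get expensive.
func matchesSelector(serverLabels, matchLabels map[string]string) bool {
	for k, v := range matchLabels {
		if serverLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	server := map[string]string{"rack": "r12", "vendor": "acme"}
	fmt.Println(matchesSelector(server, map[string]string{"rack": "r12"})) // true
	fmt.Println(matchesSelector(server, map[string]string{"rack": "r13"})) // false
}
```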

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37bd373 and 9f84cdc.

⛔ Files ignored due to path filters (5)
  • dist/chart/templates/crd/metal.ironcore.dev_servermaintenancesets.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/role.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/servermaintenanceset_admin_role.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/servermaintenanceset_editor_role.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/servermaintenanceset_viewer_role.yaml is excluded by !**/dist/**
📒 Files selected for processing (25)
  • PROJECT
  • api/v1alpha1/servermaintenance_types.go
  • api/v1alpha1/servermaintenanceset_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/metal.ironcore.dev_servermaintenancesets.yaml
  • config/crd/kustomization.yaml
  • config/rbac/role.yaml
  • config/rbac/servermaintenanceset_admin_role.yaml
  • config/rbac/servermaintenanceset_editor_role.yaml
  • config/rbac/servermaintenanceset_viewer_role.yaml
  • config/samples/kustomization.yaml
  • config/samples/metal_v1alpha1_servermaintenance.yaml
  • config/samples/metal_v1alpha1_servermaintenanceset.yaml
  • docs/api-reference/api.md
  • docs/architecture.md
  • docs/concepts/servermaintenanceset.md
  • internal/controller/helper.go
  • internal/controller/index_fields.go
  • internal/controller/server_controller_test.go
  • internal/controller/servermaintenance_controller.go
  • internal/controller/servermaintenance_controller_test.go
  • internal/controller/servermaintenanceset_controller.go
  • internal/controller/servermaintenanceset_controller_test.go
  • internal/controller/suite_test.go
  • internal/controller/test_helper.go
🧰 Additional context used
🧬 Code graph analysis (9)
internal/controller/suite_test.go (1)
internal/controller/servermaintenanceset_controller.go (1)
  • ServerMaintenanceSetReconciler (34-37)
internal/controller/servermaintenanceset_controller_test.go (4)
api/v1alpha1/servermaintenanceset_types.go (2)
  • ServerMaintenanceSet (36-42)
  • ServerMaintenanceSetSpec (11-16)
api/v1alpha1/bmcsecret_types.go (3)
  • BMCSecret (24-57)
  • BMCSecretUsernameKeyName (13-13)
  • BMCSecretPasswordKeyName (15-15)
api/v1alpha1/bmc_types.go (2)
  • Protocol (118-129)
  • ProtocolRedfishLocal (18-18)
api/v1alpha1/servermaintenance_types.go (4)
  • ServerMaintenanceReasonAnnotationKey (15-15)
  • ServerMaintenanceStateInMaintenance (77-77)
  • ServerMaintenanceStateCompleted (81-81)
  • ServerMaintenancePolicyOwnerApproval (59-59)
internal/controller/index_fields.go (1)
api/v1alpha1/servermaintenance_types.go (1)
  • ServerMaintenance (94-100)
internal/controller/server_controller_test.go (1)
api/v1alpha1/servermaintenance_types.go (2)
  • ServerMaintenanceTemplate (32-43)
  • ServerMaintenancePolicyEnforced (61-61)
internal/controller/servermaintenance_controller_test.go (3)
api/v1alpha1/servermaintenance_types.go (4)
  • ServerMaintenanceTemplate (32-43)
  • ServerMaintenancePolicyEnforced (61-61)
  • ServerBootConfigurationTemplate (21-29)
  • ServerMaintenancePolicyOwnerApproval (59-59)
api/v1alpha1/server_types.go (1)
  • ServerStateDiscovery (150-150)
api/v1alpha1/serverbootconfiguration_types.go (1)
  • ServerBootConfigurationSpec (12-26)
internal/controller/servermaintenance_controller.go (2)
api/v1alpha1/servermaintenance_types.go (4)
  • ServerMaintenanceStateFailed (79-79)
  • ServerMaintenanceStateCompleted (81-81)
  • ServerMaintenance (94-100)
  • ServerMaintenanceList (105-109)
api/v1alpha1/server_types.go (1)
  • Server (428-434)
internal/controller/test_helper.go (1)
api/v1alpha1/servermaintenance_types.go (1)
  • ServerMaintenanceList (105-109)
api/v1alpha1/zz_generated.deepcopy.go (2)
api/v1alpha1/servermaintenanceset_types.go (4)
  • ServerMaintenanceSet (36-42)
  • ServerMaintenanceSetList (47-51)
  • ServerMaintenanceSetSpec (11-16)
  • ServerMaintenanceSetStatus (19-30)
api/v1alpha1/servermaintenance_types.go (3)
  • ServerMaintenanceTemplate (32-43)
  • ServerMaintenanceSpec (46-52)
  • ServerBootConfigurationTemplate (21-29)
internal/controller/servermaintenanceset_controller.go (3)
api/v1alpha1/servermaintenanceset_types.go (3)
  • ServerMaintenanceSet (36-42)
  • ServerMaintenanceSetStatus (19-30)
  • ServerMaintenanceSetList (47-51)
api/v1alpha1/servermaintenance_types.go (9)
  • ServerMaintenanceList (105-109)
  • ServerMaintenance (94-100)
  • ServerMaintenanceReasonAnnotationKey (15-15)
  • ServerMaintenanceSpec (46-52)
  • ServerMaintenanceTemplate (32-43)
  • ServerMaintenanceStatePending (75-75)
  • ServerMaintenanceStateInMaintenance (77-77)
  • ServerMaintenanceStateFailed (79-79)
  • ServerMaintenanceStateCompleted (81-81)
api/v1alpha1/server_types.go (1)
  • ServerList (439-443)
🪛 Checkov (3.2.334)
config/rbac/servermaintenanceset_admin_role.yaml

[medium] 8-27: Minimize wildcard use in Roles and ClusterRoles

(CKV_K8S_49)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run linter
  • GitHub Check: Run tests
  • GitHub Check: Run test e2e
  • GitHub Check: Run test chart
🔇 Additional comments (37)
config/rbac/servermaintenanceset_viewer_role.yaml (1)

1-29: LGTM!

The viewer role is well-structured with appropriate read-only permissions and clear documentation. The role follows standard Kubernetes RBAC patterns.

config/samples/metal_v1alpha1_servermaintenance.yaml (1)

10-16: Sample looks good; minor trailing whitespace.

The sample correctly demonstrates the updated ServerMaintenance spec structure with policy, serverPower, and serverRef fields. There appear to be trailing blank lines at the end of the file (lines 15-16) which could be cleaned up.

PROJECT (1)

77-85: LGTM!

The new ServerMaintenanceSet entry follows the established pattern and correctly configures the CRD as namespaced with controller scaffolding.

config/crd/bases/metal.ironcore.dev_servermaintenancesets.yaml (2)

82-99: Verify serverRef semantics in the template.

The serverBootConfigurationTemplate.spec.serverRef is marked as required (line 97-98). In a ServerMaintenanceSet context where multiple servers are selected via serverSelector, having a static serverRef in the template seems unusual—typically the controller would populate this dynamically per-server.

Please verify this is the intended behavior, or if serverRef should be optional in the template and populated by the controller during ServerMaintenance creation.


1-194: CRD structure looks good overall.

The CRD is well-defined with:

  • Proper metadata and controller-gen annotations
  • Comprehensive spec schema with serverMaintenanceTemplate and serverSelector
  • Status counters for tracking maintenance lifecycle states
  • Appropriate use of x-kubernetes-map-type: atomic for label selectors
config/crd/kustomization.yaml (1)

12-12: LGTM!

The new CRD base is correctly added to the kustomization resources in the appropriate position.

config/samples/kustomization.yaml (1)

10-10: LGTM!

The new sample resource is correctly added to the kustomization.

docs/architecture.md (1)

86-86: LGTM!

The ServerMaintenanceSet CRD entry is appropriately documented and linked to its concept page. This addresses the reviewer's request to add documentation references to the architecture chapter.

internal/controller/test_helper.go (1)

46-48: LGTM!

The ServerMaintenanceList cleanup validation follows the established pattern and is correctly positioned to ensure all maintenance resources are cleaned up after tests.

internal/controller/index_fields.go (1)

59-70: LGTM!

The new indexer for ServerMaintenance follows the established pattern and correctly handles nil checks. Reusing the serverRefField constant is appropriate since both BIOSSettings and ServerMaintenance index by the same field path.

internal/controller/suite_test.go (1)

215-218: LGTM!

The ServerMaintenanceSetReconciler setup follows the established pattern of other reconcilers in the test suite.

config/samples/metal_v1alpha1_servermaintenanceset.yaml (1)

1-16: LGTM!

The sample is now complete with a meaningful example demonstrating serverSelector with label matching and serverMaintenanceTemplate configuration. This addresses the previous reviewer's request to fill in the example.

config/rbac/servermaintenanceset_editor_role.yaml (1)

1-33: LGTM!

The editor role follows least-privilege principles with explicit verb permissions and read-only access to the status subresource. The header comments appropriately explain the role's intended use.

config/rbac/role.yaml (1)

57-57: LGTM!

The RBAC permissions for servermaintenancesets are correctly added across all three rule groups (main resource, finalizers, and status), following the established pattern and maintaining alphabetical ordering.

Also applies to: 83-83, 103-103

config/rbac/servermaintenanceset_admin_role.yaml (1)

15-27: Wildcard verb usage is appropriate and consistent with all admin roles in the project.

This ClusterRole follows the standard pattern used across all admin roles in the project (9 total files), and includes clear documentation explaining that it grants full permissions for users authorized to manage roles and bindings. The wildcard '*' verb on the main resource (line 21) is intentional and properly scoped to the admin role context with appropriate status subresource restrictions.

internal/controller/server_controller_test.go (1)

716-721: LGTM!

The test correctly uses the new ServerMaintenanceTemplate structure, placing Policy inside the template while keeping ServerRef at the top level of ServerMaintenanceSpec. This aligns with the API refactoring introduced in this PR.

api/v1alpha1/servermaintenanceset_types.go (2)

32-42: LGTM!

The ServerMaintenanceSet type definition follows Kubernetes CRD conventions correctly with proper kubebuilder markers for the status subresource. The type is well-structured.


53-55: LGTM!

The type registration in init() follows the standard pattern for registering Kubernetes API types with the scheme builder.

internal/controller/servermaintenanceset_controller_test.go (2)

96-117: LGTM!

The test correctly initializes a ServerMaintenanceSet with the proper spec structure including ServerSelector and ServerMaintenanceTemplate. The annotation for maintenance reason is also correctly propagated.


330-378: Good coverage for server label change scenarios.

This test verifies that the controller correctly reacts when server labels change and no longer match the selector, properly updating the maintenance count. This addresses the past review comment requesting tests for label change scenarios.

internal/controller/servermaintenance_controller.go (4)

353-362: LGTM!

The getServerRef helper cleanly encapsulates the server lookup logic, returning (nil, nil) for not found cases. This allows callers to distinguish between "not found" and actual errors.


294-297: Good defensive check for empty ServerPower.

This addition prevents unnecessary patches when no power state is specified in the maintenance template, improving efficiency and avoiding unintended state changes.


488-493: LGTM!

The optimized list query using client.MatchingFields with the serverRefField index improves performance by filtering at the API server level rather than client-side.


523-526: Good nil check for ServerRef on claims.

This guard prevents nil pointer dereferences when processing claims without a ServerRef, which could occur during claim creation or in edge cases.

docs/api-reference/api.md (3)

1336-1390: LGTM!

The API documentation for ServerMaintenanceSet, ServerMaintenanceSetSpec, and ServerMaintenanceSetStatus is complete and follows the established documentation format. All fields are properly documented with descriptions.


1446-1463: LGTM!

The ServerMaintenanceTemplate documentation is complete with proper field descriptions and "Appears in" references to ServerMaintenanceSetSpec and ServerMaintenanceSpec.


1427-1427: LGTM!

The new Completed state is properly documented in the ServerMaintenanceState enum table.

api/v1alpha1/servermaintenance_types.go (3)

31-43: LGTM!

The ServerMaintenanceTemplate type is well-designed, encapsulating the reusable maintenance configuration fields (Policy, ServerPower, ServerBootConfigurationTemplate). This enables clean sharing between ServerMaintenance and ServerMaintenanceSet resources.


45-52: LGTM!

The refactored ServerMaintenanceSpec correctly uses Go's inline struct embedding for ServerMaintenanceTemplate while keeping ServerRef as a top-level required field. This design cleanly separates instance-specific data (ServerRef) from template data that can be shared via ServerMaintenanceSet.


80-81: LGTM!

The new ServerMaintenanceStateCompleted state enables the maintenance lifecycle to be completed without deleting the resource, providing more flexibility in maintenance workflows.

internal/controller/servermaintenanceset_controller.go (3)

119-172: LGTM!

The createMaintenances function correctly:

  • Uses a map for O(1) lookup of existing maintenances (addresses past review comment)
  • Uses GenerateName with truncation to avoid DNS name length issues
  • Propagates annotations from the set to individual maintenances
  • Sets controller references for proper GC

238-263: LGTM!

The calculateStatus function correctly aggregates maintenance states into the set's status counters. All four states (Pending, InMaintenance, Failed, Completed) are properly counted.


69-77: LGTM!

The delete handler correctly removes only the finalizer, allowing Kubernetes garbage collection to clean up owned ServerMaintenance resources via the controller references. This aligns with the discussion in past review comments.

internal/controller/servermaintenance_controller_test.go (3)

76-89: LGTM!

The test correctly uses the new ServerMaintenanceSpec structure with ServerRef at the top level and ServerMaintenanceTemplate containing Policy, ServerPower, and ServerBootConfigurationTemplate. This aligns with the API refactoring.


358-372: Good test coverage for the new Completed state.

This test verifies that setting a maintenance to Completed state properly triggers cleanup of the server's maintenance references (ServerMaintenanceRef and MaintenanceBootConfigurationRef), validating the new handleCompletedState logic in the controller.


110-112: LGTM on cleanup additions.

The explicit cleanup of server and bmcSecret at the end of each test ensures a clean state for subsequent tests, preventing resource leaks in the test environment.

api/v1alpha1/zz_generated.deepcopy.go (1)

1668-1813: Auto-generated DeepCopy methods look correct.

The generated DeepCopy implementations for the new ServerMaintenanceSet types and ServerMaintenanceTemplate follow the correct patterns and are consistent with:

  • Existing generated code in this file (e.g., BIOSSettingsSet pattern)
  • Type definitions from the relevant code snippets
  • The refactored structure where ServerMaintenanceSpec now embeds ServerMaintenanceTemplate (line 1762)

All pointer fields are properly nil-checked, nested types correctly delegate to their own DeepCopyInto methods, and the ServerMaintenanceSetStatus correctly uses simple assignment for its primitive int32 fields.
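The nil-check pattern the generator emits can be shown on a toy type; `Template` and `Config` below are illustrative, not the actual API types:

```go
package main

import "fmt"

// Config is a nested type with its own DeepCopyInto, as generated code
// would delegate to.
type Config struct{ Image string }

func (in *Config) DeepCopyInto(out *Config) { *out = *in }

// Template has one value field and one pointer field.
type Template struct {
	Name   string
	Config *Config
}

// DeepCopyInto copies value fields by assignment and allocates a fresh
// object for each non-nil pointer field, delegating to that field's own
// DeepCopyInto — the same shape the code generator produces.
func (in *Template) DeepCopyInto(out *Template) {
	*out = *in
	if in.Config != nil {
		out.Config = new(Config)
		in.Config.DeepCopyInto(out.Config)
	}
}

func main() {
	src := Template{Name: "t", Config: &Config{Image: "img"}}
	var dst Template
	src.DeepCopyInto(&dst)
	dst.Config.Image = "changed"
	fmt.Println(src.Config.Image) // the original is unaffected by the copy
}
```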


- `ServerMaintenanceSet` is namespaced and may represent various maintenance operations.
- Selection: Supports selecting target servers via label selectors.
- `ServerMaintenanceSet` shows the number of server in maintenance, pending or failed.

⚠️ Potential issue | 🟡 Minor

Minor grammar fix needed.

"server in maintenance" should be plural to match the context.

🔎 Proposed fix
-- `ServerMaintenanceSet` shows the number of server in maintenance, pending or failed.
+- `ServerMaintenanceSet` shows the number of servers in maintenance, pending, or failed.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- `ServerMaintenanceSet` shows the number of server in maintenance, pending or failed.
- `ServerMaintenanceSet` shows the number of servers in maintenance, pending, or failed.
🤖 Prompt for AI Agents
In docs/concepts/servermaintenanceset.md around line 10, the sentence uses
"server in maintenance" singular; change it to the plural "servers in
maintenance" so it reads: "`ServerMaintenanceSet` shows the number of servers in
maintenance, pending or failed." Ensure pluralization matches the rest of the
context.

Comment on lines +32 to +40
template:
policy: OwnerApproval
serverPower: On # or Off
serverBootConfigurationTemplate:
name: bios-update-config
spec:
image: "bios-update-image"
ignitionSecretRef:
name: bios-update-ignition

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the API type definition for ServerMaintenanceSet
rg -n "serverMaintenanceTemplate|template:" api/v1alpha1/servermaintenanceset_types.go -A 2 -B 2

Repository: ironcore-dev/metal-operator

Length of output: 301


🏁 Script executed:

#!/bin/bash
# Also check the sample YAML file to see what field name is used
cat config/samples/metal_v1alpha1_servermaintenanceset.yaml 2>/dev/null || echo "Sample file not found, searching for it..."
fd metal_v1alpha1_servermaintenanceset.yaml

Repository: ironcore-dev/metal-operator

Length of output: 583


🏁 Script executed:

#!/bin/bash
# Search for ServerMaintenanceSet struct definition
ast-grep --pattern 'type ServerMaintenanceSet struct {
  $$$
}'

Repository: ironcore-dev/metal-operator

Length of output: 633


🏁 Script executed:

#!/bin/bash
# Verify the documentation file contains the incorrect field name
sed -n '30,42p' docs/concepts/servermaintenanceset.md

Repository: ironcore-dev/metal-operator

Length of output: 376


Change template: to serverMaintenanceTemplate: to match the API schema.

The documentation uses template: but the actual API field is named serverMaintenanceTemplate: (line 15 of api/v1alpha1/servermaintenanceset_types.go). The sample YAML file already demonstrates the correct field name.

🤖 Prompt for AI Agents
In docs/concepts/servermaintenanceset.md around lines 32 to 40, the example uses
the incorrect top-level key "template:" whereas the API expects
"serverMaintenanceTemplate:"; update the sample YAML to replace "template:" with
"serverMaintenanceTemplate:" so it matches
api/v1alpha1/servermaintenanceset_types.go and ensure indentation and nested
fields remain unchanged.

Comment on lines +158 to +169
By("Checking the Annotations and OwnerReferences of the created maintenances")
for _, m := range maintenanceList.Items {
Expect(m.OwnerReferences).To(ContainElement(metav1.OwnerReference{
APIVersion: "metal.ironcore.dev/v1alpha1",
Kind: "ServerMaintenanceSet",
Name: servermaintenanceset.Name,
UID: servermaintenanceset.UID,
Controller: ptr.To(true),
BlockOwnerDeletion: ptr.To(true),
}))
Expect(m.Annotations[metalv1alpha1.ServerMaintenanceReasonAnnotationKey]).To(Equal("foo"))
}

⚠️ Potential issue | 🟡 Minor

Stale list reference may cause flaky tests.

The maintenanceList used for checking OwnerReferences and Annotations was fetched in the previous Eventually block (lines 140-156). By the time this assertion runs, the list reference may be stale. Consider wrapping these assertions in an Eventually block or re-fetching the list.

🔎 Suggested fix
 		By("Checking the Annotations and OwnerReferences of the created maintenances")
+		Eventually(func(g Gomega) {
+			err := k8sClient.List(ctx, maintenanceList, &client.ListOptions{Namespace: ns.Name})
+			g.Expect(err).NotTo(HaveOccurred())
 		for _, m := range maintenanceList.Items {
-			Expect(m.OwnerReferences).To(ContainElement(metav1.OwnerReference{
+			g.Expect(m.OwnerReferences).To(ContainElement(metav1.OwnerReference{
 				APIVersion:         "metal.ironcore.dev/v1alpha1",
 				Kind:               "ServerMaintenanceSet",
 				Name:               servermaintenanceset.Name,
 				UID:                servermaintenanceset.UID,
 				Controller:         ptr.To(true),
 				BlockOwnerDeletion: ptr.To(true),
 			}))
-			Expect(m.Annotations[metalv1alpha1.ServerMaintenanceReasonAnnotationKey]).To(Equal("foo"))
+			g.Expect(m.Annotations[metalv1alpha1.ServerMaintenanceReasonAnnotationKey]).To(Equal("foo"))
 		}
+		}).Should(Succeed())
📝 Committable suggestion


Suggested change
By("Checking the Annotations and OwnerReferences of the created maintenances")
for _, m := range maintenanceList.Items {
Expect(m.OwnerReferences).To(ContainElement(metav1.OwnerReference{
APIVersion: "metal.ironcore.dev/v1alpha1",
Kind: "ServerMaintenanceSet",
Name: servermaintenanceset.Name,
UID: servermaintenanceset.UID,
Controller: ptr.To(true),
BlockOwnerDeletion: ptr.To(true),
}))
Expect(m.Annotations[metalv1alpha1.ServerMaintenanceReasonAnnotationKey]).To(Equal("foo"))
}
By("Checking the Annotations and OwnerReferences of the created maintenances")
Eventually(func(g Gomega) {
err := k8sClient.List(ctx, maintenanceList, &client.ListOptions{Namespace: ns.Name})
g.Expect(err).NotTo(HaveOccurred())
for _, m := range maintenanceList.Items {
g.Expect(m.OwnerReferences).To(ContainElement(metav1.OwnerReference{
APIVersion: "metal.ironcore.dev/v1alpha1",
Kind: "ServerMaintenanceSet",
Name: servermaintenanceset.Name,
UID: servermaintenanceset.UID,
Controller: ptr.To(true),
BlockOwnerDeletion: ptr.To(true),
}))
g.Expect(m.Annotations[metalv1alpha1.ServerMaintenanceReasonAnnotationKey]).To(Equal("foo"))
}
}).Should(Succeed())
🤖 Prompt for AI Agents
In internal/controller/servermaintenanceset_controller_test.go around lines 158
to 169, the test iterates over maintenanceList fetched earlier and asserts
OwnerReferences and Annotations which can be stale and cause flakes; re-fetch
the maintenance list immediately before making these assertions or wrap the loop
in an Eventually that re-queries the API (using the same list lookup used
earlier) so the assertions run against the latest state, and keep the Expect
calls inside that retry block until they pass.

Comment on lines +270 to +279
By("Setting the maintenances to completed")
Eventually(UpdateStatus(&maintenanceList.Items[0], func() {
maintenanceList.Items[0].Status.State = metalv1alpha1.ServerMaintenanceStateCompleted
})).Should(Succeed())
Eventually(UpdateStatus(&maintenanceList.Items[1], func() {
maintenanceList.Items[1].Status.State = metalv1alpha1.ServerMaintenanceStateCompleted
})).Should(Succeed())
Eventually(UpdateStatus(&maintenanceList.Items[2], func() {
maintenanceList.Items[2].Status.State = metalv1alpha1.ServerMaintenanceStateCompleted
})).Should(Succeed())

⚠️ Potential issue | 🟡 Minor

Potential race condition when updating maintenance states.

The code updates maintenanceList.Items[0], [1], and [2] status to Completed, but these indices reference items from a list fetched earlier. If the list order changes or items are recreated (e.g., after deletion at line 180), the indices may not correspond to the expected maintenances. Consider re-fetching the list before these updates.

🤖 Prompt for AI Agents
In internal/controller/servermaintenanceset_controller_test.go around lines 270
to 279, the test updates maintenanceList.Items[0..2] statuses but reuses an
earlier-fetched list which can be reordered or changed (race) after prior
delete/recreate operations; re-fetch the current maintenance list (or retrieve
each maintenance by a stable identifier like name/UID) immediately before
performing each UpdateStatus call so you update the intended objects, then
proceed to set Status.State to Completed and assert success.
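One way to make the updates order-independent, sketched without a Kubernetes client — `Item` is a stand-in for `ServerMaintenance`, and in the real test the map would be rebuilt from a fresh `List` call:

```go
package main

import "fmt"

// Item models only the fields the sketch needs.
type Item struct {
	Name  string
	State string
}

// byName re-indexes a freshly listed slice by object name, so a status
// update targets the intended object even if the list order changed
// between fetches.
func byName(items []Item) map[string]*Item {
	m := make(map[string]*Item, len(items))
	for i := range items {
		m[items[i].Name] = &items[i]
	}
	return m
}

func main() {
	// Pretend this is the result of a fresh List call; the order
	// differs from the first fetch.
	items := []Item{{Name: "m-b"}, {Name: "m-a"}}
	idx := byName(items)
	if m, ok := idx["m-a"]; ok {
		m.State = "Completed" // update by name, not by position
	}
	fmt.Println(items[1].State)
}
```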

Comment on lines +188 to +199
for _, maintenance := range maintenancelist.Items {
log.V(1).Info("Checking for orphaned maintenance", "maintenance", maintenance.Name)
// Check if the maintenance is part of the server maintenance set
if _, exists := activeServers[maintenance.Spec.ServerRef.Name]; !exists {
log.V(1).Info("Maintenance is orphaned, deleting", "maintenance", maintenance.Name)
if err := r.Delete(ctx, &maintenance); err != nil {
errs = errors.Join(errs, fmt.Errorf("failed to delete orphaned maintenance %s: %w", maintenance.Name, err))
} else {
log.V(1).Info("Deleted orphaned maintenance", "maintenance", maintenance.Name)
}
continue
}

⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference in orphan deletion.

Line 191 accesses maintenance.Spec.ServerRef.Name without checking if ServerRef is nil. While createMaintenances always sets ServerRef, orphaned maintenances from external sources or previous versions could have nil ServerRef.

🔎 Suggested fix
 	for _, maintenance := range maintenancelist.Items {
 		log.V(1).Info("Checking for orphaned maintenance", "maintenance", maintenance.Name)
+		if maintenance.Spec.ServerRef == nil {
+			log.V(1).Info("Maintenance has no ServerRef, deleting", "maintenance", maintenance.Name)
+			if err := r.Delete(ctx, &maintenance); err != nil {
+				errs = errors.Join(errs, fmt.Errorf("failed to delete maintenance without ServerRef %s: %w", maintenance.Name, err))
+			}
+			continue
+		}
 		// Check if the maintenance is part of the server maintenance set
 		if _, exists := activeServers[maintenance.Spec.ServerRef.Name]; !exists {
🤖 Prompt for AI Agents
In internal/controller/servermaintenanceset_controller.go around lines 188 to
199, the code dereferences maintenance.Spec.ServerRef without checking for nil
which can panic for legacy/external objects; change the logic to first test if
maintenance.Spec.ServerRef == nil and treat that as an orphan (log a clear
message that ServerRef is nil), then proceed to delete the maintenance (joining
errors as currently done) or skip deletion per policy — in any case avoid
dereferencing when ServerRef is nil and only access ServerRef.Name after the nil
check, keeping existing logging/error aggregation behavior.

Comment on lines +66 to +69
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))

By("bootstrapping test environment")
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(true)))
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))

⚠️ Potential issue | 🟡 Minor

Duplicate logger setup.

The logger is configured twice with identical settings. The second call on line 69 is redundant and can be removed.

🔎 Proposed fix
 var _ = BeforeSuite(func() {
 	logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))
 
 	By("bootstrapping test environment")
-	logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))
 	testEnv = &envtest.Environment{
📝 Committable suggestion


Suggested change
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))
By("bootstrapping test environment")
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(true)))
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))
logf.SetLogger(zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)))
By("bootstrapping test environment")
🤖 Prompt for AI Agents
In internal/controller/suite_test.go around lines 66 to 69, the logger is
configured twice with identical settings; remove the redundant second call (the
duplicate zap.New(zap.WriteTo(GinkgoWriter), zap.UseDevMode(false)) invocation
on line 69) so the logger is only set once, leaving a single logf.SetLogger(...)
call.


Labels

api-change area/metal-automation documentation Improvements or additions to documentation size/XXL


Development

Successfully merging this pull request may close these issues.

Add ServerMaintenanceReplicaSet type

8 participants