Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

Copy link
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@@ -116,7 +116,6 @@ impl TermSignal {
/// Returns true if the user should abort any graceful attempt to shutdown and shutdown immediately.
pub async fn recv(&mut self) -> bool {
let change = self.0.changed().await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line causes CI to fail because it declares an unused variable 'change'. Since the tracing statement that used this variable was removed, the variable is now unused, triggering a compiler warning that's treated as an error. Change 'let change = self.0.changed().await;' to 'let _change = self.0.changed().await;' or 'let _ = self.0.changed().await;' to fix the issue.

Spotted by Graphite Agent (based on CI logs)

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

@claude
Copy link

claude bot commented Jan 29, 2026

Pull Request Review: fix: remove graceful requirement for metrics aggregator

Summary

This PR changes the metrics aggregator service to not require graceful shutdown and removes debug code from the termination signal handler. The changes primarily affect service lifecycle management during shutdown.

Code Quality & Best Practices

✅ Positive Changes

  1. Cleaner termination signal handler (engine/packages/runtime/src/term_signal.rs:117-119)

    • Removed debug logging and commented-out code
    • The cleanup is appropriate and improves readability
  2. Improved service task tracking (engine/packages/service-manager/src/lib.rs:132-136)

    • Introduction of ServiceTask struct is good for better organization
    • Replaces tuple-based approach (bool, JoinHandle) with a more descriptive struct
    • Adding the name field enables better logging (as seen at line 386)
  3. Better debug visibility (engine/packages/service-manager/src/lib.rs:386)

    • Added logging when aborting services: tracing::debug!(name=%task.name, "aborting service")
    • This improves observability during shutdown

⚠️ Concerns & Questions

  1. Removed function call without context (engine/packages/pegboard-gateway/src/lib.rs:291)

    • The line init_ns_metrics_exporter(&ctx, self.runner_id), was removed from the tokio::try_join! call
    • Question: Was this function deleted entirely? Is this related to the metrics aggregator change?
    • Risk: If this was performing important initialization, removing it could cause missing metrics or runtime issues
    • Recommendation: Verify that namespace metrics are still being properly exported elsewhere
  2. Graceful shutdown change reasoning (engine/packages/engine/src/run_config.rs:28)

    • Changed from true to false for requires_graceful_shutdown
    • Analysis: The metrics aggregator runs an infinite loop with 15-second intervals, scanning database state and updating Prometheus metrics
    • Concern: During shutdown, if aborted immediately, the service may leave metrics in an inconsistent state mid-scan
    • Question: Are there any in-flight database transactions or metric updates that should complete gracefully?
    • Looking at the implementation (packages/metrics-aggregator/src/lib.rs:26-29), the loop uses interval.tick().await and calls tick() which performs database scans with timeout protection (EARLY_TXN_TIMEOUT = 2.5s)
  3. Potential race condition during shutdown

    • The metrics aggregator performs multi-step operations (tokio::try_join! of multiple aggregation tasks)
    • If aborted mid-aggregation, some metrics may be reset but not repopulated, leading to temporary incorrect values
    • Mitigation: Since Prometheus scrapes periodically, this might be acceptable if the service restarts quickly
    • Question: What's the expected behavior for metrics during a rolling restart?

Performance Considerations

No significant performance impact:

  • The refactoring from tuples to ServiceTask struct has negligible overhead
  • The change to non-graceful shutdown for metrics aggregator will make shutdowns slightly faster (no waiting for the service to finish)
  • The removed init_ns_metrics_exporter call eliminates one async operation from the startup path (assuming it's intentional)

Security Concerns

No security issues identified:

  • Changes are limited to service lifecycle management
  • No new attack surfaces introduced
  • No credential or sensitive data handling modified

Test Coverage

⚠️ Missing information:

  • No test files were modified in this PR
  • Recommendation: Verify that:
    1. Metrics aggregator properly handles abort signals
    2. Services that don't require graceful shutdown are correctly aborted
    3. The removal of init_ns_metrics_exporter doesn't break metric collection tests

Suggested test scenarios:

# Test that metrics aggregator can be interrupted
# Test that shutdown completes faster with the new behavior
# Test that metrics are still collected properly after the init_ns_metrics_exporter removal

Architecture & Design

Service Manager Improvements:

  • The ServiceTask struct is a good architectural improvement
  • Better separation of concerns with explicit field names

Potential Issue:

  • The PR title mentions "remove graceful requirement" but doesn't explain why this is necessary
  • Understanding the motivation would help assess if this is the right approach

Questions for PR Author

  1. Why was init_ns_metrics_exporter removed from pegboard-gateway? Was this function deleted elsewhere?
  2. What problem does removing the graceful shutdown requirement solve?
  3. Have you tested that metrics are still properly exported during/after shutdown?
  4. Are there any concerns about partially-updated metrics if the aggregator is aborted mid-scan?
  5. Should the EARLY_TXN_TIMEOUT mechanism in the metrics aggregator be adjusted if graceful shutdown is no longer guaranteed?

Recommendations

  1. Add comments explaining the choice: In run_config.rs, add a comment explaining why metrics_aggregator doesn't need graceful shutdown
  2. Verify metrics behavior: Test that metrics remain consistent across restarts
  3. Document the init_ns_metrics_exporter removal: If this was intentional, document why it's no longer needed
  4. Consider adding a test: Add a test that verifies service shutdown behavior with mixed graceful/non-graceful services

Verdict

The code changes are clean and improve observability, but I need clarification on the init_ns_metrics_exporter removal before approving. The graceful shutdown change appears reasonable for a metrics aggregator that polls on an interval, but the rationale should be documented.

Suggested next steps:

  1. Clarify the removal of init_ns_metrics_exporter
  2. Add a comment in the code explaining the graceful shutdown decision
  3. Verify metrics collection still works as expected

Review generated by Claude Code

@MasterPtato MasterPtato force-pushed the 01-28-fix_remove_graceful_requirement_for_metrics_aggregator branch from 24a2036 to c6e49a4 Compare January 29, 2026 01:48
@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 29, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4060

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4060

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4060

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4060

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4060

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4060

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4060

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4060

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4060

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4060

commit: c6e49a4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants