
Conversation

@sujay-d07 (Collaborator) commented Jan 21, 2026

Related Issue

Closes #651
Part 2/3 of Issue #427

Summary

Introduces comprehensive system health monitoring and persistent alert management capabilities to the cortex daemon. The implementation provides real-time monitoring of system resources (CPU, memory, disk, services) with configurable thresholds and SQLite-based alert persistence.

AI Disclosure

  • [ ] No AI used
  • [x] AI/IDE/Agents used (please describe below)

Used Cursor to debug and make optimizations in the daemon.

Features

Alert Management

  • Persistent Storage: SQLite database for alert persistence across daemon restarts
  • Alert Lifecycle: Support for ACTIVE, ACKNOWLEDGED, and DISMISSED states
  • Severity Levels: INFO, WARNING, ERROR, CRITICAL
  • Categories: CPU, MEMORY, DISK, APT, CVE, SERVICE, SYSTEM
  • Thread-Safe Operations: Mutex-protected SQLite prepared statements
  • Performance Optimizations:
    • In-memory atomic counters for O(1) alert counts
    • Prepared statement caching
    • WAL mode for better concurrency
  • Query Filtering: Filter alerts by severity, category, status, and source
  • Bulk Operations: Acknowledge all alerts at once
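
The persistence and counter optimizations above can be sketched as follows. This is a minimal Python analogue of the C++ AlertManager (class, method, and column names here are illustrative, not the daemon's actual API), showing WAL mode, a mutex around database writes, and in-memory counters that avoid a `SELECT COUNT(*)` per query:

```python
import sqlite3
import threading
import uuid

class AlertStore:
    """Illustrative sketch of a persistent, thread-safe alert store."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        # WAL lets readers proceed while a writer appends to the log.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS alerts ("
            "uuid TEXT PRIMARY KEY, severity TEXT, category TEXT, "
            "message TEXT, status TEXT DEFAULT 'ACTIVE')"
        )
        self.lock = threading.Lock()  # stands in for the mutex around prepared statements
        self.counts = {"INFO": 0, "WARNING": 0, "ERROR": 0, "CRITICAL": 0}

    def create_alert(self, severity, category, message):
        alert_id = str(uuid.uuid4())
        with self.lock:
            self.conn.execute(
                "INSERT INTO alerts (uuid, severity, category, message) "
                "VALUES (?, ?, ?, ?)",
                (alert_id, severity, category, message),
            )
            self.conn.commit()
            self.counts[severity] += 1  # O(1) count, no table scan
        return alert_id

    def count(self, severity):
        with self.lock:
            return self.counts[severity]

store = AlertStore(":memory:")
store.create_alert("WARNING", "CPU", "CPU usage above 80%")
store.create_alert("CRITICAL", "DISK", "Disk usage above 95%")
print(store.count("WARNING"))  # 1
```

The key design point is that counters are updated under the same lock as the insert, so they can never drift from the table contents.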

System Monitoring

  • Resource Monitoring:
    • CPU usage percentage (calculated from /proc/stat)
    • Memory usage (total, used, available from /proc/meminfo)
    • Disk usage (root filesystem via statvfs)
    • System uptime (from /proc/uptime)
    • Failed systemd services count (via D-Bus)
  • Threshold-Based Alerts: Automatic alert generation when thresholds are exceeded
  • Configurable Thresholds: Warning and critical levels for CPU, memory, and disk
  • Performance Optimizations:
    • /proc file caching (1-second TTL) to reduce I/O overhead
    • Persistent systemd D-Bus connection reuse
    • Alert deduplication using hash set (O(1) lookup)
  • Service Integration: Implements Service interface for daemon lifecycle management
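
The CPU calculation from /proc/stat works by sampling the aggregate "cpu" line twice and comparing deltas; usage is the fraction of elapsed jiffies not spent idle. A small Python sketch of that arithmetic (the daemon implements this in C++):

```python
def parse_cpu_line(line):
    # /proc/stat's first line: "cpu  user nice system idle iowait irq softirq steal ..."
    fields = [int(v) for v in line.split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait both count as idle time
    total = sum(fields)
    return idle, total

def cpu_usage_percent(prev_line, curr_line):
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    dt = curr_total - prev_total
    if dt <= 0:
        return 0.0  # no elapsed jiffies between samples
    return 100.0 * (1.0 - (curr_idle - prev_idle) / dt)

prev = "cpu  100 0 100 800 0 0 0 0 0 0"
curr = "cpu  150 0 150 900 0 0 0 0 0 0"
print(cpu_usage_percent(prev, curr))  # 50.0
```

A single snapshot of /proc/stat is not enough: the counters are cumulative since boot, so usage is only meaningful as a delta between two reads.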

IPC Integration

  • New Endpoints:
    • health - Get current system health metrics and thresholds
    • alerts / alerts.get - Query alerts with optional filtering
    • alerts.ack - Acknowledge alerts (single UUID or all)
    • alerts.dismiss - Dismiss all or a specific alert
  • Response Format: JSON with structured data and error handling
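
Requests to these endpoints might look like the following; this is a hedged sketch of the request shape (the daemon's actual wire format over the Unix socket may differ):

```python
import json

def build_request(method, params=None):
    # Hypothetical JSON request envelope for the IPC endpoints listed above.
    req = {"method": method}
    if params:
        req["params"] = params
    return json.dumps(req)

print(build_request("health"))
print(build_request("alerts.get", {"severity": "WARNING", "category": "CPU"}))
print(build_request("alerts.ack", {"uuid": "all"}))
```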

Dependencies

  • SQLite3: Alert persistence
  • UUID library: Alert UUID generation

Testing

  • Unit Tests: Comprehensive test suite for AlertManager covering:
    • Alert creation and retrieval
    • Filtering by severity, category, status
    • Acknowledge and dismiss operations
    • Alert counts
    • Database persistence

Checklist

  • [x] PR title follows format: type(scope): description or [scope] description
  • [ ] Tests pass (pytest tests/)
  • [x] MVP label added if closing MVP issue
  • [x] Update "Cortex -h" (if needed)

Demo of new features

Screencast.from.2026-01-21.16-48-31.webm

Summary by CodeRabbit

  • New Features

    • System monitoring: real-time health metrics, configurable CPU/memory/disk thresholds, uptime and failed-service checks.
    • Persistent alerting and alert management: list/filter by severity/category, acknowledge-all, dismiss individual or all alerts.
    • CLI: new daemon health and daemon alerts commands with filtering and acknowledge/dismiss options.
  • Documentation

    • README and Quick Start updated with health/alerts examples and monitoring configuration.
  • Chores

    • Installer may create the daemon config file if missing.
  • Tests

    • New unit and integration tests for monitoring and alert manager.


…ment

- Added a background daemon to monitor system metrics including CPU, memory, and disk usage, with configurable thresholds for alert generation.
- Introduced new CLI commands for checking system health and managing alerts, including acknowledgment and dismissal of alerts.
- Implemented SQLite-based persistence for alerts, allowing for filtering by severity and category.
- Enhanced the README and documentation to reflect new features and usage instructions for system monitoring and alert management.
- Modified the UUID display to show the full UUID instead of truncating it, enhancing clarity in alert information.
Copilot AI review requested due to automatic review settings January 21, 2026 11:22
@coderabbitai coderabbitai bot (Contributor) commented Jan 21, 2026

Warning

Rate limit exceeded

@Anshgrover23 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 38 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a SystemMonitor service and SQLite-backed AlertManager, new IPC handlers and CLI/DaemonClient commands for health and alerts, monitoring configuration and thresholds, build/test wiring, and concurrency improvements across IPC server and daemon core.

Changes

Cohort / File(s) Summary
CLI & Client
cortex/cli.py, cortex/daemon_client.py
Added daemon health and daemon alerts CLI subcommands, IPC call helper, and DaemonClient methods (health, alerts_get, alerts_acknowledge_all, alerts_dismiss, alerts_dismiss_all).
Config & Installer
daemon/config/cortexd.yaml.example, daemon/include/cortexd/config.h, daemon/src/config/config.cpp, daemon/scripts/setup_daemon.py
Introduced monitoring config (cpu/memory/disk thresholds, check interval); config load/save/validate extended; installer ensures config via ensure_config_file().
Alert Manager API & Impl
daemon/include/cortexd/alerts/alert_manager.h, daemon/src/alerts/alert_manager.cpp
New Alert, AlertFilter, and AlertManager with SQLite persistence, JSON serialization, create/get/query, acknowledge/dismiss (single & bulk), counters, UUIDs, prepared-statement lifecycle.
System Monitor API & Impl
daemon/include/cortexd/monitor/system_monitor.h, daemon/src/monitor/system_monitor.cpp
New SystemMonitor, SystemHealth, MonitoringThresholds; background monitoring thread, procfs caching, threshold checks, alert creation with deduplication, and systemd integration for failed services.
IPC Protocol & Handlers
daemon/include/cortexd/ipc/protocol.h, daemon/include/cortexd/ipc/handlers.h, daemon/src/ipc/handlers.cpp
Removed several legacy LLM/STATUS constants; Handlers::register_all now accepts SystemMonitor* and std::shared_ptr<AlertManager>; added handlers for health and alerts (get/acknowledge/dismiss).
IPC Server & RateLimiter Concurrency
daemon/include/cortexd/ipc/server.h, daemon/src/ipc/server.cpp
Replaced mutex-based RateLimiter with atomic lock-free window; switched handler locking to std::shared_mutex (shared/exclusive) for dispatch/register.
Daemon Core Concurrency
daemon/include/cortexd/core/daemon.h, daemon/src/core/daemon.cpp
Introduced std::shared_mutex for services_; applied shared/exclusive locking and local-copy patterns to avoid holding locks during blocking start/stop.
Build, Main & Tests
daemon/CMakeLists.txt, daemon/src/main.cpp, daemon/tests/CMakeLists.txt
Swapped OpenSSL for SQLite3, added monitor/alerts sources and tests, initialized and wired AlertManager and SystemMonitor in main, and added unit/integration tests.
Tests & Integration
daemon/tests/unit/test_alert_manager.cpp, daemon/tests/integration/test_handlers.cpp, other daemon/tests/*
New unit tests for AlertManager; integration tests updated to exercise monitoring and alert flows; test harness waits for health readiness.
Docs
README.md, daemon/README.md
Documentation updated with System Monitoring and Alert Management sections, CLI/IPC examples, and config samples.

Sequence Diagram(s)

sequenceDiagram
    participant User as Client/CLI
    participant DaemonClient as DaemonClient
    participant IPCServer as IPC Server
    participant Handler as IPC Handler
    participant Monitor as SystemMonitor
    participant AlertMgr as AlertManager
    participant DB as SQLiteDB

    rect rgba(100,150,200,0.5)
        Note over User,Monitor: Health Query Flow
        User->>DaemonClient: run `daemon health`
        DaemonClient->>IPCServer: IPC request "health"
        IPCServer->>Handler: dispatch -> handle_health
        Handler->>Monitor: get_health()
        Monitor->>Monitor: read /proc, systemd, thresholds
        Monitor-->>Handler: SystemHealth + Thresholds
        Handler-->>IPCServer: JSON response
        IPCServer-->>DaemonClient: DaemonResponse
        DaemonClient-->>User: render health
    end

    rect rgba(150,100,200,0.5)
        Note over User,DB: Alerts Query Flow
        User->>DaemonClient: run `daemon alerts --severity WARNING`
        DaemonClient->>IPCServer: IPC request "alerts" + filters
        IPCServer->>Handler: dispatch -> handle_alerts_get
        Handler->>AlertMgr: get_alerts(filter)
        AlertMgr->>DB: SELECT ... WHERE ...
        DB-->>AlertMgr: rows
        AlertMgr-->>Handler: alerts + counts
        Handler-->>IPCServer: JSON response
        IPCServer-->>DaemonClient: DaemonResponse
        DaemonClient-->>User: render alerts
    end

    rect rgba(200,150,100,0.5)
        Note over Monitor,DB: Background Monitoring & Alert Creation
        Monitor->>Monitor: periodic health check
        Monitor->>Monitor: compare vs thresholds
        alt threshold exceeded
            Monitor->>Monitor: dedupe check
            alt not duplicate
                Monitor->>AlertMgr: create_alert(...)
                AlertMgr->>DB: INSERT
                DB-->>AlertMgr: success (uuid)
                AlertMgr-->>Monitor: uuid
            end
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • Suyashd999
  • Anshgrover23

Poem

🐇 I nibble logs beneath the moon,

Threads hum soft and counters tune,
Thresholds whisper, alerts take flight,
SQLite cradles each tiny byte,
A happy rabbit cheers the night.

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.17%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title 'feat(daemon): System Monitoring and Persistent Alert Management' accurately describes the primary change in the PR: introducing system monitoring and persistent alert management to the daemon.
  • Description check: ✅ Passed. The PR description covers most required template sections: Related Issue is specified (#651, Part 2/3 of #427), a detailed Summary explains the implementation, AI Disclosure is checked, and the Checklist includes title format compliance and CLI update confirmation. Only the test pass confirmation is unchecked.
  • Linked Issues check: ✅ Passed. The PR implementation aligns with #651 requirements: system monitoring (CPU, memory, disk, uptime, services), persistent SQLite alerts with ACTIVE/ACKNOWLEDGED/DISMISSED states, severities/categories, configurable thresholds, and IPC handlers (health, alerts, alerts.ack, alerts.dismiss).
  • Out of Scope Changes check: ✅ Passed. All changes are in scope: new alert manager, system monitor, IPC integration, CLI commands, configuration, and tests directly support PR #651 objectives. Lock-free optimizations to RateLimiter and shared_mutex upgrades are infrastructure improvements for the new monitoring/alert features.






@gemini-code-assist gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @sujay-d07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the daemon's capabilities by integrating comprehensive system health monitoring and a persistent alert management system. The daemon can now proactively track critical system resources and services, automatically generate alerts based on configurable thresholds, and store these alerts reliably in an SQLite database. This enhancement provides crucial observability into the daemon's operating environment and introduces new IPC and CLI interfaces for users to inspect system health and manage generated alerts effectively.

Highlights

  • System Monitoring Implementation: Introduced real-time monitoring of CPU, memory, disk usage, system uptime, and failed systemd services. This includes configurable warning and critical thresholds for resource utilization, with automatic alert generation when these thresholds are exceeded.
  • Persistent Alert Management: Developed a robust alert management system using an SQLite database for persistence across daemon restarts. Alerts have defined lifecycle states (ACTIVE, ACKNOWLEDGED, DISMISSED), severity levels (INFO, WARNING, ERROR, CRITICAL), and categories (CPU, MEMORY, DISK, SERVICE, SYSTEM, etc.).
  • IPC Integration and CLI Commands: New IPC endpoints (health, alerts, alerts.get, alerts.ack, alerts.dismiss) have been added to allow external clients to query system health metrics and manage alerts. Corresponding CLI commands (cortex daemon health, cortex daemon alerts) provide user-friendly access to these new functionalities, including filtering and bulk operations.
  • Performance and Thread Safety Optimizations: Key optimizations include in-memory atomic counters for O(1) alert counts, prepared statement caching for SQLite operations, WAL mode for improved database concurrency, /proc file caching to reduce I/O, persistent systemd D-Bus connection reuse, and O(1) alert deduplication using a hash set. Thread safety for service management and IPC handlers has been enhanced with shared mutexes and lock-free rate limiting.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
daemon/include/cortexd/core/daemon.h (1)

76-85: Unsafe raw pointer lifetime from concurrent vector access

get_service() returns a raw pointer to a service after releasing the shared lock. While the current usage pattern (services registered only during initialization, immutable during runtime) mitigates the immediate risk, this is a fragile pattern that could silently break if future changes add dynamic service registration or removal during runtime. Additionally, if the services_ vector reallocates (exceeds capacity), previously returned pointers become dangling.

Recommended fixes:

  • Store std::shared_ptr<Service> in the vector and return std::shared_ptr<T> via std::dynamic_pointer_cast to extend service lifetime beyond lock release, or
  • Document and enforce that services_ is immutable during daemon runtime with assertions and explicit lifetime guarantees in code comments

Writer-side locking is correct: register_service() and reset() use std::unique_lock as expected.

🤖 Fix all issues with AI agents
In `@cortex/cli.py`:
- Around line 2581-2613: The CPU/Memory/Disk color thresholds are hardcoded;
read daemon-provided thresholds from result.get("thresholds", {}) and use
per-metric warning/critical values (e.g. thresholds.get("cpu",
{}).get("warning", 80) and .get("critical", 95)) with those defaults if missing,
then compute color for cpu (variable cpu), memory (mem) and disk (disk) using
those warning/critical values instead of the fixed 80/95 before adding rows to
health_table; update all three blocks that currently use the hardcoded
comparisons.

In `@daemon/include/cortexd/monitor/system_monitor.h`:
- Around line 101-106: Add a mutex (preferably std::shared_mutex) protecting the
member thresholds_ and lock it for write in set_thresholds(const
MonitoringThresholds& thresholds) and for read in get_thresholds() and in
SystemMonitor::check_thresholds(); specifically, add the mutex as a member next
to thresholds_, change get_thresholds() to acquire a shared/read lock and return
the value, change set_thresholds() to acquire an exclusive/write lock before
assigning thresholds_, and update the .cpp implementation of
SystemMonitor::check_thresholds() to acquire a shared/read lock while reading
thresholds_ so no data race occurs.

In `@daemon/scripts/setup_daemon.py`:
- Around line 457-458: The call to ensure_config_file() currently ignores its
boolean return, hiding failures; update the code where ensure_config_file() is
invoked so you capture its return value and handle failures: call result =
ensure_config_file(), and if result is False, emit a clear processLogger.warn or
processLogger.error (include context like "config template missing or failed to
create") and abort installation (e.g., process.exit(1) or raise an exception) so
missing templates are audited and stop the install; keep ensure_config_file()
reference unchanged and only add the check/logging/abort around that call.
- Around line 394-442: ensure_config_file currently calls
config_path.parent.mkdir(...) which raises PermissionError for non-root users
when creating /etc/cortex; change the logic in ensure_config_file to avoid
direct mkdir on privileged paths: detect if parent exists and if not, if running
as root (os.geteuid()==0) perform Path(...).mkdir(...), otherwise use a
subprocess.run call with sudo to create the directory (e.g.,
["sudo","mkdir","-p", str(config_path.parent)]) and then proceed to copy the
template and set permissions via sudo as already done; update error
handling/logging around these subprocess calls to surface stderr when they fail.

In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 122-149: AlertManager::ensure_db_directory currently returns false
from the catch when std::filesystem::create_directories throws (e.g., permission
denied), skipping the later fallback to ~/.cortex; update
AlertManager::ensure_db_directory to detect permission-related exceptions in the
catch block and perform the same fallback logic used when access(db_dir, W_OK)
fails: obtain HOME, set db_dir = HOME/.cortex, create_directories(db_dir),
update db_path_ = (db_dir / "alerts.db").string(), and LOG_WARN accordingly;
keep existing behavior of returning false only if HOME is unavailable or the
fallback create_directories still fails, and include the original exception
message in the LOG_ERROR if you still return false.
- Around line 57-103: The deserialization currently uses std::mktime (which
treats struct tm as local time) in Alert::from_json and the two DB deserializers
(from_db, from_db_filtered) for the fields timestamp, acknowledged_at and
dismissed_at; replace those std::mktime calls with a cross-platform helper
(e.g., utc_timegm or tm_to_time_t_utc) that calls timegm on POSIX and _mkgmtime
on Windows, and use that helper at all nine call sites so parsed
"%Y-%m-%dT%H:%M:%SZ" strings are interpreted as UTC consistently.
- Around line 391-449: AlertManager::create_alert uses temporary
std::stringstream results (ack_ss.str().c_str() and dis_ss.str().c_str()) and
binds them with SQLITE_STATIC, which leads to use-after-free when sqlite3_step
runs; fix by storing the timestamp strings in persistent std::string variables
(e.g., ack_ts, dis_ts) that outlive the bind call and bind those using
SQLITE_TRANSIENT, and ensure similar change for the main timestamp_str; update
references around stmt_insert_, sqlite3_bind_text calls for parameter indices 7,
9, and 10 so SQLite gets copied data instead of pointers to temporaries.

In `@daemon/src/core/daemon.cpp`:
- Around line 210-231: The loop iterates services_ after releasing
services_mutex_, causing potential iterator invalidation; instead, while holding
services_mutex_ (unique_lock) copy the shared_ptrs from services_ into a local
std::vector<std::shared_ptr<ServiceType>> (or the concrete service pointer
type), sort that local vector by the same priority comparator, then unlock
services_mutex_ and iterate over the local copy calling service->start() and
logging service->name(); remove the find_if/remove checks against services_
inside the loop since the local copy is stable. Ensure you reference
services_mutex_, services_, the priority() comparator and calls to
start()/name() when making the change.
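
The copy-then-iterate pattern this comment asks for can be sketched in Python (the daemon code is C++; `Daemon`, `FakeService`, and the field names here are illustrative):

```python
import threading

class Daemon:
    def __init__(self):
        self._lock = threading.Lock()
        self._services = []

    def register(self, service):
        with self._lock:
            self._services.append(service)

    def start_all(self):
        # Snapshot the service list under the lock, then call the
        # (potentially blocking) start() outside it, so a concurrent
        # register/unregister cannot invalidate the iteration.
        with self._lock:
            snapshot = sorted(self._services, key=lambda s: s.priority)
        started = []
        for svc in snapshot:
            svc.start()
            started.append(svc.name)
        return started

class FakeService:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority
        self.running = False
    def start(self):
        self.running = True

d = Daemon()
d.register(FakeService("alerts", 2))
d.register(FakeService("monitor", 1))
print(d.start_all())  # ['monitor', 'alerts']
```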

In `@daemon/src/ipc/handlers.cpp`:
- Around line 144-187: The IPC alerts handler handle_alerts_get currently
ignores a "source" request parameter; update it to parse req.params["source"]
(when req.params.is_object() and contains "source") and set filter.source =
AlertManager::string_to_source(source_str) (using a std::string source_str =
req.params["source"].get<std::string>()), analogous to how
severity/category/status are handled so AlertManager::get_alerts(filter) will
perform source-based filtering.

In `@daemon/src/main.cpp`:
- Around line 105-106: The current binding uses a const reference to the result
of ConfigManager::instance().get(), which returns by value and causes a dangling
reference; change the declaration to copy the returned value (e.g., use const
auto config = ConfigManager::instance().get()) so that the local variable owns
the returned Config object and does not reference freed memory; update any
subsequent uses that assume a reference accordingly (look for the variable named
config and the methods ConfigManager::instance() and get()).

In `@daemon/src/monitor/system_monitor.cpp`:
- Around line 500-548: The active_alert_keys_ set is never cleared causing
permanent suppression; modify SystemMonitor so keys are removed when the
corresponding alert is resolved/dismissed or when the metric returns below
threshold. Keep the existing insertion in SystemMonitor::create_basic_alert
(guarded by alert_keys_mutex_), but add logic to erase the alert_key when: 1)
alert_manager_->create_alert succeeds and later the alert status transitions
from AlertStatus::ACTIVE to a non-active state (handle in whatever alert
status-change callback/handler you have), or 2) the metric check that originally
triggered this alert detects recovery (remove active_alert_keys_ under
alert_keys_mutex_), and ensure any path that fails to create the alert already
removes the key (you already handle this). Reference active_alert_keys_,
alert_keys_mutex_, SystemMonitor::create_basic_alert,
alert_manager_->create_alert and AlertStatus to locate where to add the
key-removal logic.
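
The dedup lifecycle being requested (suppress while active, re-arm on recovery or dismissal) can be sketched as a small state holder; the names are illustrative, not the daemon's C++ API:

```python
class AlertDeduper:
    def __init__(self):
        self.active_keys = set()

    def should_fire(self, key):
        # O(1) membership test; suppress duplicates while the condition holds.
        if key in self.active_keys:
            return False
        self.active_keys.add(key)
        return True

    def on_recovered(self, key):
        # Clear the key when the metric drops back below threshold (or the
        # alert is acknowledged/dismissed) so the next breach alerts again.
        self.active_keys.discard(key)

d = AlertDeduper()
print(d.should_fire("cpu:warning"))  # True  (first breach fires)
print(d.should_fire("cpu:warning"))  # False (duplicate suppressed)
d.on_recovered("cpu:warning")
print(d.should_fire("cpu:warning"))  # True  (re-armed after recovery)
```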
- Around line 118-137: The constructor SystemMonitor::SystemMonitor must
validate check_interval_seconds_ to prevent busy-spin when it's <= 0; update the
constructor to either throw std::invalid_argument for non-positive intervals or
clamp them to a safe minimum (e.g., 1) and log the adjustment, ensuring the
member check_interval_seconds_ is always > 0 before the monitor_loop uses it;
reference the constructor and the member variable check_interval_seconds_ (also
consider documenting behavior for tests that instantiate SystemMonitor
directly).

In `@daemon/tests/integration/test_handlers.cpp`:
- Around line 84-113: The test uses a fixed 200ms sleep in
start_server_with_monitoring(), which is flaky; replace it with a polling loop
that waits with a timeout for the system monitor to report health/readiness
instead of sleeping. Specifically, after calling system_monitor_->start() and
before asserting server_->start(), poll a reachability/readiness API on
SystemMonitor (e.g., a new or existing accessor such as
system_monitor_->has_health_data(), system_monitor_->last_check_time() > 0, or
system_monitor_->get_latest_health() returning non-empty) in a short-interval
loop (e.g., 50–200ms) with an overall timeout (e.g., 5s) and fail the test if
the timeout elapses; update start_server_with_monitoring() to use this polling
logic.
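
A generic sketch of the suggested polling helper (the `SystemMonitor` accessors named above are hypothetical; a plain predicate stands in here):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    # Poll instead of a fixed sleep: return as soon as the condition
    # holds, or False once the overall timeout elapses.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline

calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3  # becomes true on the third poll

print(wait_until(ready, timeout=2.0, interval=0.01))  # True
```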
🧹 Nitpick comments (4)
daemon/CMakeLists.txt (1)

44-44: Stale comment references removed dependency.

The linker warnings documentation still mentions OpenSSL (line 44: "systemd/OpenSSL"), but OpenSSL has been replaced with SQLite3 in this PR. Consider updating the comment to reflect the current dependencies.

Suggested fix
-# - Symbol versioning warnings from systemd/OpenSSL (e.g., "version node not found")
+# - Symbol versioning warnings from systemd/SQLite3 (e.g., "version node not found")
cortex/daemon_client.py (1)

4-6: Module docstring is now outdated.

The docstring mentions only "PR1 commands: ping, version, config.get, config.reload, shutdown" but this file now includes PR2 methods (health, alerts).

Suggested update
 """
 Cortex Daemon IPC Client
 
 Provides communication with the cortexd daemon via Unix socket IPC.
-Supports the PR1 commands: ping, version, config.get, config.reload, shutdown.
+Supports commands: ping, version, config.get, config.reload, shutdown,
+health, alerts, alerts.acknowledge, alerts.dismiss.
 """
daemon/tests/unit/test_protocol.cpp (1)

307-315: Good test coverage for method constants.

The test validates core method constants. For completeness, consider adding assertions for the remaining alert-related constants defined in protocol.h.

Optional: Add missing constant assertions
 TEST_F(ProtocolTest, MethodConstantsAreDefined) {
     EXPECT_STREQ(cortexd::Methods::PING, "ping");
     EXPECT_STREQ(cortexd::Methods::VERSION, "version");
     EXPECT_STREQ(cortexd::Methods::CONFIG_GET, "config.get");
     EXPECT_STREQ(cortexd::Methods::CONFIG_RELOAD, "config.reload");
     EXPECT_STREQ(cortexd::Methods::SHUTDOWN, "shutdown");
     EXPECT_STREQ(cortexd::Methods::HEALTH, "health");
     EXPECT_STREQ(cortexd::Methods::ALERTS, "alerts");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_GET, "alerts.get");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_ACK, "alerts.acknowledge");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_DISMISS, "alerts.dismiss");
 }
daemon/tests/unit/test_alert_manager.cpp (1)

221-242: Consider testing JSON round-trip with optional timestamp fields.

The JSON conversion test validates the basic round-trip but doesn't test acknowledged_at or dismissed_at optional fields. Consider adding a test that acknowledges/dismisses an alert and verifies these timestamps survive JSON serialization.

💡 Optional enhancement for timestamp round-trip
TEST_F(AlertManagerTest, AlertJsonConversionWithTimestamps) {
    Alert alert;
    alert.uuid = AlertManager::generate_uuid();
    alert.severity = AlertSeverity::WARNING;
    alert.category = AlertCategory::CPU;
    alert.source = "test_source";
    alert.message = "Test alert";
    alert.status = AlertStatus::ACKNOWLEDGED;
    alert.timestamp = std::chrono::system_clock::now();
    alert.acknowledged_at = std::chrono::system_clock::now();
    
    json j = alert.to_json();
    ASSERT_TRUE(j.contains("acknowledged_at"));
    ASSERT_FALSE(j["acknowledged_at"].is_null());
    
    Alert restored = Alert::from_json(j);
    ASSERT_TRUE(restored.acknowledged_at.has_value());
}

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality for system monitoring and alert management. The changes are extensive, adding a SystemMonitor service to track CPU, memory, disk, and systemd services, and an AlertManager service for persisting alerts to an SQLite database. New IPC endpoints and corresponding CLI commands are added to expose this functionality. The implementation includes several performance and concurrency optimizations, such as a lock-free rate limiter and caching for /proc file reads.

My review focuses on thread safety, correctness, and consistency. I've identified a potential race condition in the AlertManager due to the use of a non-thread-safe function, and a complex and potentially unsafe locking pattern in the Daemon's service startup logic. I've also noted a minor inconsistency in the CLI where UI display logic uses hardcoded values instead of data from the daemon.

Overall, this is a well-structured feature addition with good attention to performance. The suggested changes aim to improve robustness and maintainability.

sujay-d07 and others added 2 commits January 21, 2026 17:32
- Added support for configurable CPU, memory, and disk usage thresholds in the Cortex CLI, allowing for dynamic alerting based on user-defined limits.
- Updated health check logic to utilize these thresholds for determining alert colors, improving clarity in system health reporting.
- Refactored the SystemMonitor class to validate check intervals and ensure proper alert management based on system metrics.
- Enhanced error handling in the setup process for configuration files, ensuring robust directory creation and permission handling.
- Improved integration tests to verify health data readiness and system monitoring functionality.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 751-759: The counters (count_info_, count_warning_, count_error_,
count_critical_, count_total_) are reset after stmt_mutex_ is released,
which can race with concurrent inserts; move the reset logic so it executes
while holding stmt_mutex_ (i.e., perform the store(0, ...) calls before
unlocking stmt_mutex_), or alternatively recompute and set the counters while
still under the protection of stmt_mutex_, so that no new alerts are missed.
♻️ Duplicate comments (1)
daemon/include/cortexd/monitor/system_monitor.h (1)

101-106: Protect thresholds_ with a lock to prevent data races.

set_thresholds() can run on config reload while the monitor thread reads thresholds_ in check_thresholds(). This is a data race (UB). Please guard both reads/writes (and the read path in the .cpp).

🔒 Suggested fix (header-side)
+#include <shared_mutex>
 ...
-    MonitoringThresholds get_thresholds() const { return thresholds_; }
+    MonitoringThresholds get_thresholds() const {
+        std::shared_lock<std::shared_mutex> lock(thresholds_mutex_);
+        return thresholds_;
+    }
 ...
-    void set_thresholds(const MonitoringThresholds& thresholds) { thresholds_ = thresholds; }
+    void set_thresholds(const MonitoringThresholds& thresholds) {
+        std::unique_lock<std::shared_mutex> lock(thresholds_mutex_);
+        thresholds_ = thresholds;
+    }
 ...
-    MonitoringThresholds thresholds_;
+    MonitoringThresholds thresholds_;
+    mutable std::shared_mutex thresholds_mutex_;

Also take a shared/read lock around thresholds_ reads in SystemMonitor::check_thresholds() in the .cpp.

Also applies to: 113-114

- Added a shared mutex to the SystemMonitor class to ensure thread-safe access to monitoring thresholds.
- Refactored get_thresholds and set_thresholds methods to utilize shared and unique locks, respectively, enhancing concurrency control.
- Updated the AlertManager to hold a lock while updating alert counters during acknowledgment, preventing race conditions.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 396-431: In AlertManager::load_initial_counters(), count_total_ is
never reset and is cumulatively increased with fetch_add across calls; before
iterating the sqlite3_step loop (while holding stmt_mutex_), reset count_total_
to 0 (e.g., count_total_.store(0, std::memory_order_relaxed)) or accumulate into
a local total variable and store it once after the loop, ensuring consistency
with the per-severity stores and using the same memory_order_relaxed semantics;
locate this change near stmt_count_, stmt_mutex_, and where fetch_add is
currently called.
🧹 Nitpick comments (8)
daemon/src/alerts/alert_manager.cpp (3)

84-85: Unchecked enum casts from JSON integers.

Direct static_cast from JSON integers to enums without bounds validation could produce undefined enum values if the JSON is malformed or from a newer version.

♻️ Suggested validation
-    alert.severity = static_cast<AlertSeverity>(j.value("severity", 0));
-    alert.category = static_cast<AlertCategory>(j.value("category", 0));
+    int sev_val = j.value("severity", 0);
+    alert.severity = (sev_val >= 0 && sev_val <= 3) 
+        ? static_cast<AlertSeverity>(sev_val) : AlertSeverity::INFO;
+    int cat_val = j.value("category", 0);
+    alert.category = (cat_val >= 0 && cat_val <= 6) 
+        ? static_cast<AlertCategory>(cat_val) : AlertCategory::SYSTEM;

289-291: Unused prepared statement stmt_select_all_.

stmt_select_all_ is prepared but never used. get_alerts() builds dynamic SQL with filtering instead. Consider removing it to reduce resource usage and maintenance burden.


614-629: Unused param_index initialization when source filter is absent.

param_index is initialized to 1 and only used if filter.source.has_value(). This works, but the variable is unnecessarily declared when not needed.

♻️ Minor cleanup
-    int param_index = 1;
     if (filter.source.has_value()) {
         select_sql += " AND source = ?";
     }
     
     select_sql += " ORDER BY timestamp DESC";
     
     sqlite3_stmt* stmt;
     int rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);
     
     if (rc != SQLITE_OK) {
         return alerts;
     }
     
     if (filter.source.has_value()) {
-        sqlite3_bind_text(stmt, param_index++, filter.source.value().c_str(), -1, SQLITE_STATIC);
+        sqlite3_bind_text(stmt, 1, filter.source.value().c_str(), -1, SQLITE_STATIC);
     }
daemon/include/cortexd/monitor/system_monitor.h (2)

55-62: Consider adding default threshold values.

MonitoringThresholds members have no default values, which could lead to uninitialized values if the struct is default-constructed. Adding sensible defaults improves safety.

♻️ Example defaults
 struct MonitoringThresholds {
-    double cpu_warning;
-    double cpu_critical;
-    double memory_warning;
-    double memory_critical;
-    double disk_warning;
-    double disk_critical;
+    double cpu_warning = 75.0;
+    double cpu_critical = 90.0;
+    double memory_warning = 80.0;
+    double memory_critical = 95.0;
+    double disk_warning = 80.0;
+    double disk_critical = 90.0;
 };

128-131: Consider documenting systemd bus reconnection behavior.

systemd_bus_ is declared as void* with a comment about avoiding systemd headers, but the reconnection-on-failure behavior (implemented in .cpp) is not documented here. A brief note would help maintainers understand the lifecycle.

daemon/src/monitor/system_monitor.cpp (3)

195-223: Lock held during file I/O may cause unnecessary contention.

The proc_cache_mutex_ is held for the entire read operation, including file I/O. While /proc reads are typically fast, this could block other readers. Consider a double-checked locking pattern or accepting the current approach given the 1s TTL.

This is a minor optimization opportunity; the current implementation is correct and likely sufficient for the monitoring interval.


437-448: Alert key construction is duplicated.

The alert key format is constructed both in check_thresholds() (for cleanup) and in create_basic_alert() (for deduplication). If the format changes, both locations must be updated. Consider extracting a helper function.

♻️ Example helper
static std::string make_alert_key(AlertCategory category, AlertSeverity severity, 
                                  const std::string& source, const std::string& message) {
    return std::to_string(static_cast<int>(category)) + ":" + 
           std::to_string(static_cast<int>(severity)) + ":" + 
           source + ":" + message;
}

468-479: Variable name shadowing: inner lock shadows outer lock.

The inner std::lock_guard named lock shadows the outer std::shared_lock also named lock. While this works (different scopes, different mutexes), using distinct names like threshold_lock and alert_lock would improve clarity.

…unters

- Updated the load_initial_counters method to initialize and store the total alert count correctly.
- Replaced the previous fetch_add operation with a more accurate total accumulation approach, ensuring thread safety and consistency in alert status updates.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 292-296: The current acknowledge_alert and dismiss_alert flows
call get_alert() (reading status) and then run UPDATEs (using update_ack_sql and
update_dismiss_sql) without verifying the status atomically, allowing races that
double-decrement counters; modify update_ack_sql and update_dismiss_sql to
include "AND status = ?" (i.e., WHERE uuid = ? AND status = ?) and bind the
expected status read from get_alert(), then execute the UPDATE and check the
affected-rows result (sqlite3_changes/statement step result) before decrementing
any counters so you only adjust counters when the conditional UPDATE actually
changed a row; leave update_ack_all_sql as-is or ensure similar conditional
semantics when used in bulk acknowledge paths.
♻️ Duplicate comments (1)
daemon/src/alerts/alert_manager.cpp (1)

490-496: Avoid SQLITE_STATIC for cached INSERT bindings with non-static strings.

stmt_insert_ is cached; SQLite requires SQLITE_STATIC data to remain valid until the statement is reset or finalized. new_alert.* strings are destroyed when the function returns, so this can become a latent UAF if the statement isn’t reset immediately. Use SQLITE_TRANSIENT for non-static strings to match the timestamp handling.

🔧 Suggested fix
-        sqlite3_bind_text(stmt, 1, new_alert.uuid.c_str(), -1, SQLITE_STATIC);
+        sqlite3_bind_text(stmt, 1, new_alert.uuid.c_str(), -1, SQLITE_TRANSIENT);
         sqlite3_bind_int(stmt, 2, static_cast<int>(new_alert.severity));
         sqlite3_bind_int(stmt, 3, static_cast<int>(new_alert.category));
-        sqlite3_bind_text(stmt, 4, new_alert.source.c_str(), -1, SQLITE_STATIC);
-        sqlite3_bind_text(stmt, 5, new_alert.message.c_str(), -1, SQLITE_STATIC);
-        sqlite3_bind_text(stmt, 6, new_alert.description.c_str(), -1, SQLITE_STATIC);
+        sqlite3_bind_text(stmt, 4, new_alert.source.c_str(), -1, SQLITE_TRANSIENT);
+        sqlite3_bind_text(stmt, 5, new_alert.message.c_str(), -1, SQLITE_TRANSIENT);
+        sqlite3_bind_text(stmt, 6, new_alert.description.c_str(), -1, SQLITE_TRANSIENT);
🧹 Nitpick comments (1)
daemon/src/alerts/alert_manager.cpp (1)

593-683: Serialize get_alerts DB access or verify SQLite serialized mode.

This path uses sqlite3_prepare_v2/step/finalize without stmt_mutex_ while sharing a single connection. If SQLite is built in multi-thread mode (not serialized), concurrent use of the same connection is unsafe. Consider locking around the statement lifecycle or ensure serialized mode at startup.

🔧 Possible lock scope
-    sqlite3_stmt* stmt;
-    int rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);
+    sqlite3_stmt* stmt;
+    int rc;
+    std::lock_guard<std::mutex> lock(stmt_mutex_);
+    rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);

- Updated SQL queries in the AlertManager to include status checks during acknowledgment and dismissal of alerts, ensuring that only alerts with the expected status are modified.
- Improved code readability by restructuring the load_initial_counters method for clarity in total alert count calculations.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 638-643: The assignment to alert.description is dereferencing
sqlite3_column_text without checking for NULL (same risk as in get_alert);
update the code that populates alert fields (the block assigning alert.uuid,
alert.severity, alert.category, alert.source, alert.message, alert.description)
to check sqlite3_column_text(stmt, 5) for nullptr before reinterpret_cast and
assign a safe fallback (e.g., empty string or std::nullopt depending on Alert
type) when NULL is returned; apply the same NULL-check pattern used for
description in get_alert to avoid null pointer dereference.
- Around line 548-553: The assignment to alert.description uses
reinterpret_cast<const char*>(sqlite3_column_text(stmt, 5)) which can be nullptr
for NULL DB values; update the code in alert_manager.cpp where alert.description
is set (and similarly for any other nullable TEXT columns if present) to check
if sqlite3_column_text(stmt, 5) is non-null before assigning to
std::string—e.g., get const unsigned char* txt = sqlite3_column_text(stmt, 5);
then set alert.description = txt ? reinterpret_cast<const char*>(txt) :
std::string() so you never assign nullptr into std::string.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Updated the get_alert and get_alerts methods to safely handle potential null values from SQLite queries by checking for null pointers before dereferencing, ensuring robust string assignment for alert attributes.
- Enhanced code readability by using descriptive variable names for SQLite column text retrieval.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 653-662: The timestamp parsing block in alert_manager.cpp should
defensively check for a NULL column value before casting and parsing: for the
timestamp column (sqlite3_column_text(stmt, 6)) verify it is not NULL, only then
read into timestamp_str, parse with std::get_time on tm and set alert.timestamp
via utc_timegm; if the column is NULL or parsing fails, set alert.timestamp =
std::chrono::system_clock::now(); mirror the same NULL-check pattern used in
get_alert to avoid dereferencing a NULL pointer.
- Around line 559-568: The code reads the timestamp column into timestamp_str
using sqlite3_column_text(stmt, 6) without checking for NULL, which can crash if
the DB is corrupted; update the block around timestamp_str/ss/alert.timestamp to
first check that const unsigned char* raw = sqlite3_column_text(stmt, 6) is not
null (and treat an empty string as missing), only construct std::string when raw
!= nullptr and parse with std::get_time into tm and convert via utc_timegm(&tm);
if raw is null or parsing fails, set alert.timestamp =
std::chrono::system_clock::now() as the safe fallback.
🧹 Nitpick comments (1)
daemon/src/alerts/alert_manager.cpp (1)

606-627: Consider parameterizing all filter values for consistency.

The filter values are safely converted to integers via std::to_string(static_cast<int>(...)), so there's no SQL injection risk. However, using parameterized queries consistently (like for source) would enable query plan caching and align with the style used elsewhere.

- Improved the get_alert and get_alerts methods to handle potential null values from SQLite more robustly, ensuring that timestamps are correctly parsed or defaulted to the current time when necessary.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 266-271: On the error path after prepare_statements() fails,
finalize any statements that may have been prepared before closing the DB: call
finalize_statements() (the routine that finalizes cached sqlite3_stmt objects)
prior to sqlite3_close(db), then set db_handle_ = nullptr and return false; this
ensures no unfinalized statements remain which would cause sqlite3_close() to
fail with SQLITE_BUSY.

- Introduced a new command `cortex daemon alerts --dismiss-all` to dismiss all active and acknowledged alerts.
- Implemented backend support in the AlertManager and IPC handlers to process the dismiss-all request.
- Updated documentation to include the new command and its usage.
- Added integration and unit tests to ensure the dismiss-all feature works as expected.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cortex/cli.py`:
- Around line 4605-4632: The alert action flags on daemon_alerts_parser are
currently independent, allowing ambiguous combinations; create a mutually
exclusive group on daemon_alerts_parser (e.g., actions_group =
daemon_alerts_parser.add_mutually_exclusive_group()) and move the action flags
(--acknowledge-all, --dismiss-all, --dismiss) into that group so only one of the
actions can be provided at a time; keep the filter arguments (--severity,
--category) outside the group and ensure the existing metavar/name for --dismiss
(UUID) remains unchanged.

In `@daemon/src/ipc/handlers.cpp`:
- Around line 194-224: The handler handle_alerts_acknowledge currently defaults
to acknowledging all alerts when neither "all" nor "uuid" is provided; change it
to mirror handle_alerts_dismiss by validating req.params and returning an error
instead of performing a bulk acknowledge when parameters are missing or
malformed. Specifically, inside handle_alerts_acknowledge check for
req.params.is_object() and either req.params.contains("all") with true or
req.params.contains("uuid") and only then call alerts->acknowledge_all() or
alerts->acknowledge_alert(uuid); otherwise return Response::err with a clear
message (e.g. "Missing or invalid parameters: require 'uuid' or 'all'") and an
appropriate ErrorCodes value (e.g. ErrorCodes::INVALID_PARAMS) so the behavior
matches handle_alerts_dismiss.
🧹 Nitpick comments (2)
daemon/src/alerts/alert_manager.cpp (1)

619-723: Dynamic query construction is reasonable for flexible filtering.

The get_alerts method intentionally uses dynamic SQL construction rather than cached prepared statements to support flexible filtering combinations. The enum-to-int casts are safe against SQL injection, and the source parameter is properly bound via placeholder.

Consider documenting why this method doesn't use cached statements (unlike other query methods) for future maintainers.

daemon/tests/unit/test_alert_manager.cpp (1)

250-271: Consider adding edge case tests.

The test suite covers the happy paths well. Consider adding tests for edge cases in a follow-up:

  • Acknowledging an already-acknowledged alert (should return false)
  • Dismissing an already-dismissed alert (should return false)
  • Getting an alert with a non-existent UUID (should return nullopt)
  • Filtering by source

Would you like me to generate these additional edge case tests?

sujay-d07 and others added 2 commits January 21, 2026 22:48
…group

- Refactored the argument parsing for alert actions in the CLI to use a mutually exclusive group, improving command clarity and preventing conflicting options.
- Updated the handling of alert acknowledgment to ensure proper parameter validation, returning an error for missing or invalid parameters.
@github-actions

CLA Verification Passed

All contributors have signed the CLA.

| Contributor | Signed As |
| --- | --- |
| @sujay-d07 | @sujay-d07 |
| @Anshgrover23 | @Anshgrover23 |

@sonarqubecloud

Development

Successfully merging this pull request may close these issues.

[CORE] System Monitoring and Persistent Alert Management for Daemon

2 participants