feat(daemon): System Monitoring and Persistent Alert Management #661
base: main
Conversation
…ment - Added a background daemon to monitor system metrics including CPU, memory, and disk usage, with configurable thresholds for alert generation. - Introduced new CLI commands for checking system health and managing alerts, including acknowledgment and dismissal of alerts. - Implemented SQLite-based persistence for alerts, allowing for filtering by severity and category. - Enhanced the README and documentation to reflect new features and usage instructions for system monitoring and alert management.
- Modified the UUID display to show the full UUID instead of truncating it, enhancing clarity in alert information.
Warning: Rate limit exceeded. A review can be triggered once the wait time has elapsed; spacing out commits helps avoid hitting the rate limit.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Adds a SystemMonitor service and SQLite-backed AlertManager, new IPC handlers and CLI/DaemonClient commands for health and alerts, monitoring configuration and thresholds, build/test wiring, and concurrency improvements across the IPC server and daemon core.
Sequence Diagram(s)
sequenceDiagram
participant User as Client/CLI
participant DaemonClient as DaemonClient
participant IPCServer as IPC Server
participant Handler as IPC Handler
participant Monitor as SystemMonitor
participant AlertMgr as AlertManager
participant DB as SQLiteDB
rect rgba(100,150,200,0.5)
Note over User,Monitor: Health Query Flow
User->>DaemonClient: run `daemon health`
DaemonClient->>IPCServer: IPC request "health"
IPCServer->>Handler: dispatch -> handle_health
Handler->>Monitor: get_health()
Monitor->>Monitor: read /proc, systemd, thresholds
Monitor-->>Handler: SystemHealth + Thresholds
Handler-->>IPCServer: JSON response
IPCServer-->>DaemonClient: DaemonResponse
DaemonClient-->>User: render health
end
rect rgba(150,100,200,0.5)
Note over User,DB: Alerts Query Flow
User->>DaemonClient: run `daemon alerts --severity WARNING`
DaemonClient->>IPCServer: IPC request "alerts" + filters
IPCServer->>Handler: dispatch -> handle_alerts_get
Handler->>AlertMgr: get_alerts(filter)
AlertMgr->>DB: SELECT ... WHERE ...
DB-->>AlertMgr: rows
AlertMgr-->>Handler: alerts + counts
Handler-->>IPCServer: JSON response
IPCServer-->>DaemonClient: DaemonResponse
DaemonClient-->>User: render alerts
end
rect rgba(200,150,100,0.5)
Note over Monitor,DB: Background Monitoring & Alert Creation
Monitor->>Monitor: periodic health check
Monitor->>Monitor: compare vs thresholds
alt threshold exceeded
Monitor->>Monitor: dedupe check
alt not duplicate
Monitor->>AlertMgr: create_alert(...)
AlertMgr->>DB: INSERT
DB-->>AlertMgr: success (uuid)
AlertMgr-->>Monitor: uuid
end
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Summary of Changes
Hello @sujay-d07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the daemon's capabilities by integrating comprehensive system health monitoring and a persistent alert management system. The daemon can now proactively track critical system resources and services, automatically generate alerts based on configurable thresholds, and store these alerts reliably in an SQLite database. This enhancement provides crucial observability into the daemon's operating environment and introduces new IPC and CLI interfaces for users to inspect system health and manage generated alerts effectively.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
daemon/include/cortexd/core/daemon.h (1)
76-85: Unsafe raw pointer lifetime from concurrent vector access
get_service() returns a raw pointer to a service after releasing the shared lock. While the current usage pattern (services registered only during initialization, immutable during runtime) mitigates the immediate risk, this is a fragile pattern that could silently break if future changes add dynamic service registration or removal during runtime. Additionally, if the services_ vector reallocates (exceeds capacity), previously returned pointers become dangling. Recommended fixes:
- Store std::shared_ptr<Service> in the vector and return std::shared_ptr<T> via std::dynamic_pointer_cast to extend service lifetime beyond lock release, or
- Document and enforce that services_ is immutable during daemon runtime, with assertions and explicit lifetime guarantees in code comments
Writer-side locking is correct: register_service() and reset() use std::unique_lock as expected.
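A minimal sketch of the shared_ptr-based option, assuming a polymorphic Service base class; the registry class and member names below are illustrative, not the project's actual declarations:

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Illustrative stand-in for the project's Service base class.
struct Service {
    virtual ~Service() = default;
};

class ServiceRegistry {
public:
    void register_service(std::shared_ptr<Service> svc) {
        std::unique_lock lock(mutex_);
        services_.push_back(std::move(svc));
    }

    // Returns a shared_ptr so the service stays alive even if the registry
    // is cleared or the vector reallocates after the lock is released.
    template <typename T>
    std::shared_ptr<T> get_service() const {
        std::shared_lock lock(mutex_);
        for (const auto& svc : services_) {
            if (auto typed = std::dynamic_pointer_cast<T>(svc)) {
                return typed;
            }
        }
        return nullptr;
    }

private:
    mutable std::shared_mutex mutex_;
    std::vector<std::shared_ptr<Service>> services_;
};
```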
🤖 Fix all issues with AI agents
In `@cortex/cli.py`:
- Around line 2581-2613: The CPU/Memory/Disk color thresholds are hardcoded;
read daemon-provided thresholds from result.get("thresholds", {}) and use
per-metric warning/critical values (e.g. thresholds.get("cpu",
{}).get("warning", 80) and .get("critical", 95)) with those defaults if missing,
then compute color for cpu (variable cpu), memory (mem) and disk (disk) using
those warning/critical values instead of the fixed 80/95 before adding rows to
health_table; update all three blocks that currently use the hardcoded
comparisons.
In `@daemon/include/cortexd/monitor/system_monitor.h`:
- Around line 101-106: Add a mutex (preferably std::shared_mutex) protecting the
member thresholds_ and lock it for write in set_thresholds(const
MonitoringThresholds& thresholds) and for read in get_thresholds() and in
SystemMonitor::check_thresholds(); specifically, add the mutex as a member next
to thresholds_, change get_thresholds() to acquire a shared/read lock and return
the value, change set_thresholds() to acquire an exclusive/write lock before
assigning thresholds_, and update the .cpp implementation of
SystemMonitor::check_thresholds() to acquire a shared/read lock while reading
thresholds_ so no data race occurs.
In `@daemon/scripts/setup_daemon.py`:
- Around line 457-458: The call to ensure_config_file() currently ignores its
boolean return, hiding failures; update the code where ensure_config_file() is
invoked so you capture its return value and handle failures: call result =
ensure_config_file(), and if result is False, emit a clear processLogger.warn or
processLogger.error (include context like "config template missing or failed to
create") and abort installation (e.g., process.exit(1) or raise an exception) so
missing templates are audited and stop the install; keep ensure_config_file()
reference unchanged and only add the check/logging/abort around that call.
- Around line 394-442: ensure_config_file currently calls
config_path.parent.mkdir(...) which raises PermissionError for non-root users
when creating /etc/cortex; change the logic in ensure_config_file to avoid
direct mkdir on privileged paths: detect if parent exists and if not, if running
as root (os.geteuid()==0) perform Path(...).mkdir(...), otherwise use a
subprocess.run call with sudo to create the directory (e.g.,
["sudo","mkdir","-p", str(config_path.parent)]) and then proceed to copy the
template and set permissions via sudo as already done; update error
handling/logging around these subprocess calls to surface stderr when they fail.
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 122-149: AlertManager::ensure_db_directory currently returns false
from the catch when std::filesystem::create_directories throws (e.g., permission
denied), skipping the later fallback to ~/.cortex; update
AlertManager::ensure_db_directory to detect permission-related exceptions in the
catch block and perform the same fallback logic used when access(db_dir, W_OK)
fails: obtain HOME, set db_dir = HOME/.cortex, create_directories(db_dir),
update db_path_ = (db_dir / "alerts.db").string(), and LOG_WARN accordingly;
keep existing behavior of returning false only if HOME is unavailable or the
fallback create_directories still fails, and include the original exception
message in the LOG_ERROR if you still return false.
- Around line 57-103: The deserialization currently uses std::mktime (which
treats struct tm as local time) in Alert::from_json and the two DB deserializers
(from_db, from_db_filtered) for the fields timestamp, acknowledged_at and
dismissed_at; replace those std::mktime calls with a cross-platform helper
(e.g., utc_timegm or tm_to_time_t_utc) that calls timegm on POSIX and _mkgmtime
on Windows, and use that helper at all nine call sites so parsed
"%Y-%m-%dT%H:%M:%SZ" strings are interpreted as UTC consistently.
- Around line 391-449: AlertManager::create_alert uses temporary
std::stringstream results (ack_ss.str().c_str() and dis_ss.str().c_str()) and
binds them with SQLITE_STATIC, which leads to use-after-free when sqlite3_step
runs; fix by storing the timestamp strings in persistent std::string variables
(e.g., ack_ts, dis_ts) that outlive the bind call and bind those using
SQLITE_TRANSIENT, and ensure similar change for the main timestamp_str; update
references around stmt_insert_, sqlite3_bind_text calls for parameter indices 7,
9, and 10 so SQLite gets copied data instead of pointers to temporaries.
In `@daemon/src/core/daemon.cpp`:
- Around line 210-231: The loop iterates services_ after releasing
services_mutex_, causing potential iterator invalidation; instead, while holding
services_mutex_ (unique_lock) copy the shared_ptrs from services_ into a local
std::vector<std::shared_ptr<ServiceType>> (or the concrete service pointer
type), sort that local vector by the same priority comparator, then unlock
services_mutex_ and iterate over the local copy calling service->start() and
logging service->name(); remove the find_if/remove checks against services_
inside the loop since the local copy is stable. Ensure you reference
services_mutex_, services_, the priority() comparator and calls to
start()/name() when making the change.
In `@daemon/src/ipc/handlers.cpp`:
- Around line 144-187: The IPC alerts handler handle_alerts_get currently
ignores a "source" request parameter; update it to parse req.params["source"]
(when req.params.is_object() and contains "source") and set filter.source =
AlertManager::string_to_source(source_str) (using a std::string source_str =
req.params["source"].get<std::string>()), analogous to how
severity/category/status are handled so AlertManager::get_alerts(filter) will
perform source-based filtering.
In `@daemon/src/main.cpp`:
- Around line 105-106: The current binding uses a const reference to the result
of ConfigManager::instance().get(), which returns by value and causes a dangling
reference; change the declaration to copy the returned value (e.g., use const
auto config = ConfigManager::instance().get()) so that the local variable owns
the returned Config object and does not reference freed memory; update any
subsequent uses that assume a reference accordingly (look for the variable named
config and the methods ConfigManager::instance() and get()).
In `@daemon/src/monitor/system_monitor.cpp`:
- Around line 500-548: The active_alert_keys_ set is never cleared causing
permanent suppression; modify SystemMonitor so keys are removed when the
corresponding alert is resolved/dismissed or when the metric returns below
threshold. Keep the existing insertion in SystemMonitor::create_basic_alert
(guarded by alert_keys_mutex_), but add logic to erase the alert_key when: 1)
alert_manager_->create_alert succeeds and later the alert status transitions
from AlertStatus::ACTIVE to a non-active state (handle in whatever alert
status-change callback/handler you have), or 2) the metric check that originally
triggered this alert detects recovery (remove active_alert_keys_ under
alert_keys_mutex_), and ensure any path that fails to create the alert already
removes the key (you already handle this). Reference active_alert_keys_,
alert_keys_mutex_, SystemMonitor::create_basic_alert,
alert_manager_->create_alert and AlertStatus to locate where to add the
key-removal logic.
- Around line 118-137: The constructor SystemMonitor::SystemMonitor must
validate check_interval_seconds_ to prevent busy-spin when it's <= 0; update the
constructor to either throw std::invalid_argument for non-positive intervals or
clamp them to a safe minimum (e.g., 1) and log the adjustment, ensuring the
member check_interval_seconds_ is always > 0 before the monitor_loop uses it;
reference the constructor and the member variable check_interval_seconds_ (also
consider documenting behavior for tests that instantiate SystemMonitor
directly).
In `@daemon/tests/integration/test_handlers.cpp`:
- Around line 84-113: The test uses a fixed 200ms sleep in
start_server_with_monitoring(), which is flaky; replace it with a polling loop
that waits with a timeout for the system monitor to report health/readiness
instead of sleeping. Specifically, after calling system_monitor_->start() and
before asserting server_->start(), poll a reachability/readiness API on
SystemMonitor (e.g., a new or existing accessor such as
system_monitor_->has_health_data(), system_monitor_->last_check_time() > 0, or
system_monitor_->get_latest_health() returning non-empty) in a short-interval
loop (e.g., 50–200ms) with an overall timeout (e.g., 5s) and fail the test if
the timeout elapses; update start_server_with_monitoring() to use this polling
logic.
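A sketch of such a polling helper for the GoogleTest fixture; has_health_data() is one of the accessor names the comment proposes, and the fixture usage in the trailing comment is an assumption:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll `ready` every `step` until it returns true or `timeout` elapses.
inline bool wait_until_ready(const std::function<bool()>& ready,
                             std::chrono::milliseconds timeout = std::chrono::seconds(5),
                             std::chrono::milliseconds step = std::chrono::milliseconds(50)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (ready()) return true;
        std::this_thread::sleep_for(step);
    }
    return ready();  // one final check after the deadline
}

// Hypothetical usage inside start_server_with_monitoring():
//   ASSERT_TRUE(wait_until_ready([&] { return system_monitor_->has_health_data(); }));
//   ASSERT_TRUE(server_->start());
```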
🧹 Nitpick comments (4)
daemon/CMakeLists.txt (1)
44-44: Stale comment references removed dependency. The linker warnings documentation still mentions OpenSSL (line 44: "systemd/OpenSSL"), but OpenSSL has been replaced with SQLite3 in this PR. Consider updating the comment to reflect the current dependencies.
Suggested fix
-# - Symbol versioning warnings from systemd/OpenSSL (e.g., "version node not found")
+# - Symbol versioning warnings from systemd/SQLite3 (e.g., "version node not found")
cortex/daemon_client.py (1)
4-6: Module docstring is now outdated. The docstring mentions only "PR1 commands: ping, version, config.get, config.reload, shutdown" but this file now includes PR2 methods (health, alerts).
Suggested update
""" Cortex Daemon IPC Client Provides communication with the cortexd daemon via Unix socket IPC. -Supports the PR1 commands: ping, version, config.get, config.reload, shutdown. +Supports commands: ping, version, config.get, config.reload, shutdown, +health, alerts, alerts.acknowledge, alerts.dismiss. """daemon/tests/unit/test_protocol.cpp (1)
307-315: Good test coverage for method constants. The test validates the core method constants. For completeness, consider adding assertions for the remaining alert-related constants defined in protocol.h.
Optional: Add missing constant assertions
 TEST_F(ProtocolTest, MethodConstantsAreDefined) {
     EXPECT_STREQ(cortexd::Methods::PING, "ping");
     EXPECT_STREQ(cortexd::Methods::VERSION, "version");
     EXPECT_STREQ(cortexd::Methods::CONFIG_GET, "config.get");
     EXPECT_STREQ(cortexd::Methods::CONFIG_RELOAD, "config.reload");
     EXPECT_STREQ(cortexd::Methods::SHUTDOWN, "shutdown");
     EXPECT_STREQ(cortexd::Methods::HEALTH, "health");
     EXPECT_STREQ(cortexd::Methods::ALERTS, "alerts");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_GET, "alerts.get");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_ACK, "alerts.acknowledge");
+    EXPECT_STREQ(cortexd::Methods::ALERTS_DISMISS, "alerts.dismiss");
 }
daemon/tests/unit/test_alert_manager.cpp (1)
221-242: Consider testing JSON round-trip with optional timestamp fields. The JSON conversion test validates the basic round-trip but doesn't test the acknowledged_at or dismissed_at optional fields. Consider adding a test that acknowledges/dismisses an alert and verifies these timestamps survive JSON serialization.
💡 Optional enhancement for timestamp round-trip
TEST_F(AlertManagerTest, AlertJsonConversionWithTimestamps) {
    Alert alert;
    alert.uuid = AlertManager::generate_uuid();
    alert.severity = AlertSeverity::WARNING;
    alert.category = AlertCategory::CPU;
    alert.source = "test_source";
    alert.message = "Test alert";
    alert.status = AlertStatus::ACKNOWLEDGED;
    alert.timestamp = std::chrono::system_clock::now();
    alert.acknowledged_at = std::chrono::system_clock::now();
    json j = alert.to_json();
    ASSERT_TRUE(j.contains("acknowledged_at"));
    ASSERT_FALSE(j["acknowledged_at"].is_null());
    Alert restored = Alert::from_json(j);
    ASSERT_TRUE(restored.acknowledged_at.has_value());
}
Code Review
This pull request introduces significant new functionality for system monitoring and alert management. The changes are extensive, adding a SystemMonitor service to track CPU, memory, disk, and systemd services, and an AlertManager service for persisting alerts to an SQLite database. New IPC endpoints and corresponding CLI commands are added to expose this functionality. The implementation includes several performance and concurrency optimizations, such as a lock-free rate limiter and caching for /proc file reads.
My review focuses on thread safety, correctness, and consistency. I've identified a potential race condition in the AlertManager due to the use of a non-thread-safe function, and a complex and potentially unsafe locking pattern in the Daemon's service startup logic. I've also noted a minor inconsistency in the CLI where UI display logic uses hardcoded values instead of data from the daemon.
Overall, this is a well-structured feature addition with good attention to performance. The suggested changes aim to improve robustness and maintainability.
- Added support for configurable CPU, memory, and disk usage thresholds in the Cortex CLI, allowing for dynamic alerting based on user-defined limits. - Updated health check logic to utilize these thresholds for determining alert colors, improving clarity in system health reporting. - Refactored the SystemMonitor class to validate check intervals and ensure proper alert management based on system metrics. - Enhanced error handling in the setup process for configuration files, ensuring robust directory creation and permission handling. - Improved integration tests to verify health data readiness and system monitoring functionality.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 751-759: The counters (count_info_, count_warning_, count_error_,
count_critical_, count_total_) are being reset after stmt_mutex_ is released
which can race with concurrent inserts; move the reset logic so it executes
while holding stmt_mutex_ (i.e., perform the store(0, ...) calls before
unlocking stmt_mutex_) or alternatively recompute and set the counters while
still under the protection of stmt_mutex_ to ensure no new alerts are missed.
♻️ Duplicate comments (1)
daemon/include/cortexd/monitor/system_monitor.h (1)
101-106: Protect thresholds_ with a lock to prevent data races.
set_thresholds() can run on config reload while the monitor thread reads thresholds_ in check_thresholds(). This is a data race (UB). Please guard both reads/writes (and the read path in the .cpp).
🔒 Suggested fix (header-side)
+#include <shared_mutex>
 ...
-    MonitoringThresholds get_thresholds() const { return thresholds_; }
+    MonitoringThresholds get_thresholds() const {
+        std::shared_lock<std::shared_mutex> lock(thresholds_mutex_);
+        return thresholds_;
+    }
 ...
-    void set_thresholds(const MonitoringThresholds& thresholds) { thresholds_ = thresholds; }
+    void set_thresholds(const MonitoringThresholds& thresholds) {
+        std::unique_lock<std::shared_mutex> lock(thresholds_mutex_);
+        thresholds_ = thresholds;
+    }
 ...
-    MonitoringThresholds thresholds_;
+    MonitoringThresholds thresholds_;
+    mutable std::shared_mutex thresholds_mutex_;
Also take a shared/read lock around thresholds_ reads in SystemMonitor::check_thresholds() in the .cpp. Also applies to: 113-114
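For the .cpp side, a self-contained sketch of the same read/write-lock discipline; the wrapper class, field defaults, and cpu_is_critical() helper are illustrative, not the project's actual types:

```cpp
#include <shared_mutex>

// Illustrative stand-ins; the real MonitoringThresholds/SystemMonitor members
// live in the headers discussed above.
struct MonitoringThresholds {
    double cpu_warning = 75.0;
    double cpu_critical = 90.0;
};

class ThresholdedCheck {
public:
    void set_thresholds(const MonitoringThresholds& t) {
        std::unique_lock<std::shared_mutex> lock(thresholds_mutex_);
        thresholds_ = t;
    }

    // .cpp-side reader: snapshot under a shared lock, compare outside it, so a
    // concurrent set_thresholds() on config reload cannot race with the read.
    bool cpu_is_critical(double cpu_percent) const {
        MonitoringThresholds snapshot;
        {
            std::shared_lock<std::shared_mutex> lock(thresholds_mutex_);
            snapshot = thresholds_;
        }
        return cpu_percent >= snapshot.cpu_critical;
    }

private:
    mutable std::shared_mutex thresholds_mutex_;
    MonitoringThresholds thresholds_;
};
```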
- Added a shared mutex to the SystemMonitor class to ensure thread-safe access to monitoring thresholds. - Refactored get_thresholds and set_thresholds methods to utilize shared and unique locks, respectively, enhancing concurrency control. - Updated the AlertManager to hold a lock while updating alert counters during acknowledgment, preventing race conditions.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 396-431: In AlertManager::load_initial_counters(), count_total_ is
never reset and is cumulatively increased with fetch_add across calls; before
iterating the sqlite3_step loop (while holding stmt_mutex_), reset count_total_
to 0 (e.g., count_total_.store(0, std::memory_order_relaxed)) or accumulate into
a local total variable and store it once after the loop, ensuring consistency
with the per-severity stores and using the same memory_order_relaxed semantics;
locate this change near stmt_count_, stmt_mutex_, and where fetch_add is
currently called.
🧹 Nitpick comments (8)
daemon/src/alerts/alert_manager.cpp (3)
84-85: Unchecked enum casts from JSON integers. Direct static_cast from JSON integers to enums without bounds validation could produce undefined enum values if the JSON is malformed or from a newer version.
♻️ Suggested validation
-    alert.severity = static_cast<AlertSeverity>(j.value("severity", 0));
-    alert.category = static_cast<AlertCategory>(j.value("category", 0));
+    int sev_val = j.value("severity", 0);
+    alert.severity = (sev_val >= 0 && sev_val <= 3)
+        ? static_cast<AlertSeverity>(sev_val) : AlertSeverity::INFO;
+    int cat_val = j.value("category", 0);
+    alert.category = (cat_val >= 0 && cat_val <= 6)
+        ? static_cast<AlertCategory>(cat_val) : AlertCategory::SYSTEM;
289-291: Unused prepared statement stmt_select_all_.
stmt_select_all_ is prepared but never used; get_alerts() builds dynamic SQL with filtering instead. Consider removing it to reduce resource usage and maintenance burden.
614-629: Unused param_index initialization when source filter is absent.
param_index is initialized to 1 and only used if filter.source.has_value(). This works, but the variable is unnecessarily declared when not needed.
♻️ Minor cleanup
-    int param_index = 1;
     if (filter.source.has_value()) {
         select_sql += " AND source = ?";
     }
     select_sql += " ORDER BY timestamp DESC";
     sqlite3_stmt* stmt;
     int rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);
     if (rc != SQLITE_OK) {
         return alerts;
     }
     if (filter.source.has_value()) {
-        sqlite3_bind_text(stmt, param_index++, filter.source.value().c_str(), -1, SQLITE_STATIC);
+        sqlite3_bind_text(stmt, 1, filter.source.value().c_str(), -1, SQLITE_STATIC);
     }
daemon/include/cortexd/monitor/system_monitor.h (2)
55-62: Consider adding default threshold values.
MonitoringThresholds members have no default values, which could lead to uninitialized values if the struct is default-constructed. Adding sensible defaults improves safety.
♻️ Example defaults
 struct MonitoringThresholds {
-    double cpu_warning;
-    double cpu_critical;
-    double memory_warning;
-    double memory_critical;
-    double disk_warning;
-    double disk_critical;
+    double cpu_warning = 75.0;
+    double cpu_critical = 90.0;
+    double memory_warning = 80.0;
+    double memory_critical = 95.0;
+    double disk_warning = 80.0;
+    double disk_critical = 90.0;
 };
128-131: Consider documenting systemd bus reconnection behavior.
systemd_bus_ is declared as void* with a comment about avoiding systemd headers, but the reconnection-on-failure behavior (implemented in the .cpp) is not documented here. A brief note would help maintainers understand the lifecycle.
daemon/src/monitor/system_monitor.cpp (3)
195-223: Lock held during file I/O may cause unnecessary contention.
The proc_cache_mutex_ is held for the entire read operation, including file I/O. While /proc reads are typically fast, this could block other readers. Consider a double-checked locking pattern, or accept the current approach given the 1s TTL. This is a minor optimization opportunity; the current implementation is correct and likely sufficient for the monitoring interval.
437-448: Alert key construction is duplicated.
The alert key format is constructed both in check_thresholds() (for cleanup) and in create_basic_alert() (for deduplication). If the format changes, both locations must be updated. Consider extracting a helper function.
♻️ Example helper
static std::string make_alert_key(AlertCategory category, AlertSeverity severity,
                                  const std::string& source, const std::string& message) {
    return std::to_string(static_cast<int>(category)) + ":" +
           std::to_string(static_cast<int>(severity)) + ":" +
           source + ":" + message;
}
468-479: Variable name shadowing: inner lock shadows outer lock.
The inner std::lock_guard named lock shadows the outer std::shared_lock, also named lock. While this works (different scopes, different mutexes), using distinct names like threshold_lock and alert_lock would improve clarity.
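A tiny sketch of the distinct-name suggestion; the mutex declarations here are illustrative:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex thresholds_mutex_;   // illustrative globals standing in
std::mutex alert_keys_mutex_;          // for the class members discussed above

void check_and_dedupe() {
    // Distinct lock variable names make the nesting obvious at a glance.
    std::shared_lock threshold_lock(thresholds_mutex_);
    {
        std::lock_guard alert_lock(alert_keys_mutex_);
        // ... dedupe bookkeeping under alert_keys_mutex_ ...
    }
}
```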
…unters - Updated the load_initial_counters method to initialize and store the total alert count correctly. - Replaced the previous fetch_add operation with a more accurate total accumulation approach, ensuring thread safety and consistency in alert status updates.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 292-296: The current acknowledge_alert and dismiss_alert flows
call get_alert() (reading status) and then run UPDATEs (using update_ack_sql and
update_dismiss_sql) without verifying the status atomically, allowing races that
double-decrement counters; modify update_ack_sql and update_dismiss_sql to
include "AND status = ?" (i.e., WHERE uuid = ? AND status = ?) and bind the
expected status read from get_alert(), then execute the UPDATE and check the
affected-rows result (sqlite3_changes/statement step result) before decrementing
any counters so you only adjust counters when the conditional UPDATE actually
changed a row; leave update_ack_all_sql as-is or ensure similar conditional
semantics when used in bulk acknowledge paths.
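A sketch of the status-conditional UPDATE checked via sqlite3_changes(); the table and column names follow the review text and are assumptions about the actual schema:

```cpp
#include <sqlite3.h>
#include <string>

// Acknowledge only if the row is still in the expected status; report success
// only when the conditional UPDATE actually changed a row, so in-memory
// counters are adjusted at most once per alert.
bool acknowledge_if_in_status(sqlite3* db, const std::string& uuid,
                              int expected_status, int acknowledged_status) {
    const char* sql =
        "UPDATE alerts SET status = ?, acknowledged_at = datetime('now') "
        "WHERE uuid = ? AND status = ?";
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return false;
    sqlite3_bind_int(stmt, 1, acknowledged_status);
    sqlite3_bind_text(stmt, 2, uuid.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_int(stmt, 3, expected_status);
    const bool stepped_ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return stepped_ok && sqlite3_changes(db) == 1;
}
```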
♻️ Duplicate comments (1)
daemon/src/alerts/alert_manager.cpp (1)
490-496: Avoid SQLITE_STATIC for cached INSERT bindings with non-static strings.
stmt_insert_ is cached; SQLite requires SQLITE_STATIC data to remain valid until the statement is reset or finalized. The new_alert.* strings are destroyed when the function returns, so this can become a latent UAF if the statement isn't reset immediately. Use SQLITE_TRANSIENT for non-static strings to match the timestamp handling.
🔧 Suggested fix
-    sqlite3_bind_text(stmt, 1, new_alert.uuid.c_str(), -1, SQLITE_STATIC);
+    sqlite3_bind_text(stmt, 1, new_alert.uuid.c_str(), -1, SQLITE_TRANSIENT);
     sqlite3_bind_int(stmt, 2, static_cast<int>(new_alert.severity));
     sqlite3_bind_int(stmt, 3, static_cast<int>(new_alert.category));
-    sqlite3_bind_text(stmt, 4, new_alert.source.c_str(), -1, SQLITE_STATIC);
-    sqlite3_bind_text(stmt, 5, new_alert.message.c_str(), -1, SQLITE_STATIC);
-    sqlite3_bind_text(stmt, 6, new_alert.description.c_str(), -1, SQLITE_STATIC);
+    sqlite3_bind_text(stmt, 4, new_alert.source.c_str(), -1, SQLITE_TRANSIENT);
+    sqlite3_bind_text(stmt, 5, new_alert.message.c_str(), -1, SQLITE_TRANSIENT);
+    sqlite3_bind_text(stmt, 6, new_alert.description.c_str(), -1, SQLITE_TRANSIENT);
🧹 Nitpick comments (1)
daemon/src/alerts/alert_manager.cpp (1)
593-683: Serialize get_alerts DB access or verify SQLite serialized mode.
This path uses sqlite3_prepare_v2/step/finalize without stmt_mutex_ while sharing a single connection. If SQLite is built in multi-thread mode (not serialized), concurrent use of the same connection is unsafe. Consider locking around the statement lifecycle or ensuring serialized mode at startup.
🔧 Possible lock scope
-    sqlite3_stmt* stmt;
-    int rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);
+    sqlite3_stmt* stmt;
+    int rc;
+    std::lock_guard<std::mutex> lock(stmt_mutex_);
+    rc = sqlite3_prepare_v2(db, select_sql.c_str(), -1, &stmt, nullptr);
- Updated SQL queries in the AlertManager to include status checks during acknowledgment and dismissal of alerts, ensuring that only alerts with the expected status are modified. - Improved code readability by restructuring the load_initial_counters method for clarity in total alert count calculations.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 638-643: The assignment to alert.description is dereferencing
sqlite3_column_text without checking for NULL (same risk as in get_alert);
update the code that populates alert fields (the block assigning alert.uuid,
alert.severity, alert.category, alert.source, alert.message, alert.description)
to check sqlite3_column_text(stmt, 5) for nullptr before reinterpret_cast and
assign a safe fallback (e.g., empty string or std::nullopt depending on Alert
type) when NULL is returned; apply the same NULL-check pattern used for
description in get_alert to avoid null pointer dereference.
- Around line 548-553: The assignment to alert.description uses
reinterpret_cast<const char*>(sqlite3_column_text(stmt, 5)) which can be nullptr
for NULL DB values; update the code in alert_manager.cpp where alert.description
is set (and similarly for any other nullable TEXT columns if present) to check
if sqlite3_column_text(stmt, 5) is non-null before assigning to
std::string—e.g., get const unsigned char* txt = sqlite3_column_text(stmt, 5);
then set alert.description = txt ? reinterpret_cast<const char*>(txt) :
std::string() so you never assign nullptr into std::string.
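A small helper capturing the NULL-safe pattern both items ask for; the helper name and the column index in the usage note are illustrative:

```cpp
#include <sqlite3.h>
#include <string>

// NULL-safe read of a TEXT column: returns "" instead of dereferencing nullptr.
inline std::string column_text_or_empty(sqlite3_stmt* stmt, int col) {
    const unsigned char* txt = sqlite3_column_text(stmt, col);
    return txt ? reinterpret_cast<const char*>(txt) : std::string();
}

// Hypothetical usage (column index 5 for description, per the comments above):
//   alert.description = column_text_or_empty(stmt, 5);
```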
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
- Updated the get_alert and get_alerts methods to safely handle potential null values from SQLite queries by checking for null pointers before dereferencing, ensuring robust string assignment for alert attributes. - Enhanced code readability by using descriptive variable names for SQLite column text retrieval.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 653-662: The timestamp parsing block in alert_manager.cpp should
defensively check for a NULL column value before casting and parsing: for the
timestamp column (sqlite3_column_text(stmt, 6)) verify it is not NULL, only then
read into timestamp_str, parse with std::get_time on tm and set alert.timestamp
via utc_timegm; if the column is NULL or parsing fails, set alert.timestamp =
std::chrono::system_clock::now(); mirror the same NULL-check pattern used in
get_alert to avoid dereferencing a NULL pointer.
- Around line 559-568: The code reads the timestamp column into timestamp_str
using sqlite3_column_text(stmt, 6) without checking for NULL, which can crash if
the DB is corrupted; update the block around timestamp_str/ss/alert.timestamp to
first check that const unsigned char* raw = sqlite3_column_text(stmt, 6) is not
null (and treat an empty string as missing), only construct std::string when raw
!= nullptr and parse with std::get_time into tm and convert via utc_timegm(&tm);
if raw is null or parsing fails, set alert.timestamp =
std::chrono::system_clock::now() as the safe fallback.
🧹 Nitpick comments (1)
daemon/src/alerts/alert_manager.cpp (1)
606-627: Consider parameterizing all filter values for consistency.
The filter values are safely converted to integers via std::to_string(static_cast<int>(...)), so there's no SQL injection risk. However, using parameterized queries consistently (as is done for source) would enable query plan caching and align with the style used elsewhere.
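A sketch of the fully parameterized variant; the SQL text and parameter positions are assumptions, shown only to illustrate binding the enum-backed filters instead of inlining them:

```cpp
#include <sqlite3.h>

// Bind enum-backed filters as integer parameters rather than concatenating
// them into the SQL text, so the prepared statement can be cached and reused.
// Assumed SQL shape:
//   SELECT ... FROM alerts WHERE severity = ?1 AND category = ?2
void bind_severity_category(sqlite3_stmt* stmt, int severity, int category) {
    sqlite3_bind_int(stmt, 1, severity);
    sqlite3_bind_int(stmt, 2, category);
}
```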
- Improved the get_alert and get_alerts methods to handle potential null values from SQLite more robustly, ensuring that timestamps are correctly parsed or defaulted to the current time when necessary.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@daemon/src/alerts/alert_manager.cpp`:
- Around line 266-271: On the error path after prepare_statements() fails,
finalize any statements that may have been prepared before closing the DB: call
finalize_statements() (the routine that finalizes cached sqlite3_stmt objects)
prior to sqlite3_close(db), then set db_handle_ = nullptr and return false; this
ensures no unfinalized statements remain which would cause sqlite3_close() to
fail with SQLITE_BUSY.
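A sketch of the suggested error-path ordering; the function shape and the cached-statement container are illustrative rather than the project's actual members:

```cpp
#include <sqlite3.h>
#include <vector>

// Error-path cleanup: finalize every prepared statement before closing the
// connection, otherwise sqlite3_close() can fail with SQLITE_BUSY.
bool close_on_prepare_failure(sqlite3*& db_handle,
                              std::vector<sqlite3_stmt*>& cached_stmts) {
    for (sqlite3_stmt* stmt : cached_stmts) {
        if (stmt) sqlite3_finalize(stmt);
    }
    cached_stmts.clear();
    sqlite3_close(db_handle);
    db_handle = nullptr;
    return false;  // propagate the initialization failure to the caller
}
```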
- Introduced a new command `cortex daemon alerts --dismiss-all` to dismiss all active and acknowledged alerts. - Implemented backend support in the AlertManager and IPC handlers to process the dismiss-all request. - Updated documentation to include the new command and its usage. - Added integration and unit tests to ensure the dismiss-all feature works as expected.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cortex/cli.py`:
- Around line 4605-4632: The alert action flags on daemon_alerts_parser are
currently independent, allowing ambiguous combinations; create a mutually
exclusive group on daemon_alerts_parser (e.g., actions_group =
daemon_alerts_parser.add_mutually_exclusive_group()) and move the action flags
(--acknowledge-all, --dismiss-all, --dismiss) into that group so only one of the
actions can be provided at a time; keep the filter arguments (--severity,
--category) outside the group and ensure the existing metavar/name for --dismiss
(UUID) remains unchanged.
In `@daemon/src/ipc/handlers.cpp`:
- Around line 194-224: The handler handle_alerts_acknowledge currently defaults
to acknowledging all alerts when neither "all" nor "uuid" is provided; change it
to mirror handle_alerts_dismiss by validating req.params and returning an error
instead of performing a bulk acknowledge when parameters are missing or
malformed. Specifically, inside handle_alerts_acknowledge check for
req.params.is_object() and either req.params.contains("all") with true or
req.params.contains("uuid") and only then call alerts->acknowledge_all() or
alerts->acknowledge_alert(uuid); otherwise return Response::err with a clear
message (e.g. "Missing or invalid parameters: require 'uuid' or 'all'") and an
appropriate ErrorCodes value (e.g. ErrorCodes::INVALID_PARAMS) so the behavior
matches handle_alerts_dismiss.
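A sketch of the validation shape using nlohmann::json; the HandlerResult type stands in for the project's Response/ErrorCodes machinery and is an assumption:

```cpp
#include <nlohmann/json.hpp>
#include <string>

// Illustrative result type standing in for the project's Response/ErrorCodes.
struct HandlerResult {
    bool ok;
    std::string error;
};

// Require either {"all": true} or {"uuid": "..."} instead of defaulting to a
// bulk acknowledge when parameters are missing or malformed.
HandlerResult validate_ack_params(const nlohmann::json& params,
                                  bool& ack_all, std::string& uuid) {
    if (params.is_object() && params.contains("all") &&
        params["all"].is_boolean() && params["all"].get<bool>()) {
        ack_all = true;
        return {true, {}};
    }
    if (params.is_object() && params.contains("uuid") && params["uuid"].is_string()) {
        uuid = params["uuid"].get<std::string>();
        return {true, {}};
    }
    return {false, "Missing or invalid parameters: require 'uuid' or 'all'"};
}
```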
🧹 Nitpick comments (2)
daemon/src/alerts/alert_manager.cpp (1)
619-723: Dynamic query construction is reasonable for flexible filtering.
The get_alerts method intentionally uses dynamic SQL construction rather than cached prepared statements to support flexible filtering combinations. The enum-to-int casts are safe against SQL injection, and the source parameter is properly bound via placeholder. Consider documenting why this method doesn't use cached statements (unlike other query methods) for future maintainers.
daemon/tests/unit/test_alert_manager.cpp (1)
250-271: Consider adding edge case tests. The test suite covers the happy paths well. Consider adding tests for edge cases in a follow-up:
- Acknowledging an already-acknowledged alert (should return false)
- Dismissing an already-dismissed alert (should return false)
- Getting an alert with a non-existent UUID (should return nullopt)
- Filtering by source
Would you like me to generate these additional edge case tests?
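A sketch of the first edge case (double acknowledge), assuming the fixture member manager_ and the create_alert/acknowledge_alert usage implied by the discussion above; all of those names are assumptions, not the verified API:

```cpp
TEST_F(AlertManagerTest, AcknowledgeAlreadyAcknowledgedReturnsFalse) {
    // manager_, create_alert, and acknowledge_alert are assumed from the
    // surrounding discussion, not checked against the real signatures.
    Alert alert;
    alert.uuid = AlertManager::generate_uuid();
    alert.severity = AlertSeverity::WARNING;
    alert.category = AlertCategory::CPU;
    alert.source = "test_source";
    alert.message = "Duplicate acknowledge";
    manager_->create_alert(alert);

    EXPECT_TRUE(manager_->acknowledge_alert(alert.uuid));   // first ack succeeds
    EXPECT_FALSE(manager_->acknowledge_alert(alert.uuid));  // second ack is a no-op
}
```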
…group - Refactored the argument parsing for alert actions in the CLI to use a mutually exclusive group, improving command clarity and preventing conflicting options. - Updated the handling of alert acknowledgment to ensure proper parameter validation, returning an error for missing or invalid parameters.
CLA Verification Passed: All contributors have signed the CLA.
Related Issue
Closes #651
Part 2/3 of Issue #427
Summary
Introduces comprehensive system health monitoring and persistent alert management capabilities to the cortex daemon. The implementation provides real-time monitoring of system resources (CPU, memory, disk, services) with configurable thresholds and SQLite-based alert persistence.
AI Disclosure
Used Cursor to debug and make optimizations in the daemon.
Features
Alert Management
System Monitoring
- CPU usage (/proc/stat)
- Memory usage (/proc/meminfo)
- Disk usage (statvfs)
- Uptime (/proc/uptime)
- /proc file caching (1-second TTL) to reduce I/O overhead
- Service interface for daemon lifecycle management
- health - Get current system health metrics and thresholds
- alerts / alerts.get - Query alerts with optional filtering
- alerts.ack - Acknowledge alerts (single UUID or all)
- alerts.dismiss - Dismiss all or a specific alert
Dependencies
Testing
Unit tests for AlertManager covering:
Checklist
- Commit messages follow type(scope): description or [scope] description
- Tests pass (pytest tests/)
Demo of new features
Screencast.from.2026-01-21.16-48-31.webm