From 195f35ac65a3aad3963640dc69a96869cc45ad02 Mon Sep 17 00:00:00 2001 From: Lasim Date: Thu, 25 Dec 2025 09:15:30 +0100 Subject: [PATCH 1/2] docs(satellite): add comprehensive status & health tracking documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add 4 new documentation files covering the MCP Status & Health Tracking System (18 implementation phases) and update 6 existing files with cross-references to maintain modular documentation structure. New files: - status-tracking.mdx: 11-state status system, lifecycle flows, tool filtering - event-emission.mdx: Event types, payloads, batching configuration - log-capture.mdx: Server/request logging, buffering, privacy controls - recovery-system.mdx: Automatic recovery detection, retry logic, tool preservation Updated files with cross-links: - architecture.mdx: Add Status Tracking, Event System, Log Capture sections - tool-discovery.mdx: Add Status Integration, Recovery System sections - process-management.mdx: Add Status Events, Log Buffering sections - backend-communication.mdx: Add Events vs Heartbeat, Health Check sections - commands.mdx: Add health_check command documentation - hierarchical-router.mdx: Add Status-Based Tool Filtering section Navigation: - docs.json: Add "Status & Health Tracking" group to Satellite Development tab Technical details: - 11 status values (provisioning → online → offline/error/requires_reauth) - Event batching: 3-second interval, max 20 per batch - Retry logic: exponential backoff (500ms, 1s, 2s) - Log storage: 100-line limit per installation - Request logging privacy control via settings --- development/satellite/architecture.mdx | 47 +- .../satellite/backend-communication.mdx | 68 ++- development/satellite/commands.mdx | 38 ++ development/satellite/event-emission.mdx | 426 +++++++++++++++++ development/satellite/hierarchical-router.mdx | 69 +-- development/satellite/log-capture.mdx | 450 ++++++++++++++++++ development/satellite/process-management.mdx | 74 +-- development/satellite/recovery-system.mdx | 370 ++++++++++++++ development/satellite/status-tracking.mdx | 284 +++++++++++ development/satellite/tool-discovery.mdx | 64 ++- docs.json | 9 + 11 files changed, 1787 insertions(+), 112 deletions(-) create mode 100644 development/satellite/event-emission.mdx create mode 100644 development/satellite/log-capture.mdx create mode 100644 development/satellite/recovery-system.mdx create mode 100644 development/satellite/status-tracking.mdx diff --git a/development/satellite/architecture.mdx b/development/satellite/architecture.mdx index 9407649..0183d01 100644 --- a/development/satellite/architecture.mdx +++ b/development/satellite/architecture.mdx @@ -284,31 +284,36 @@ For complete implementation details, see [Backend Polling Implementation](/devel ### Real-Time Event System -**Event Emission with Batching:** -``` -Satellite Operations EventBus Backend - │ │ │ - │─── mcp.server.started ──▶│ │ - │─── mcp.tool.executed ───▶│ [Queue] │ - │─── mcp.client.connected ─▶│ │ - │ [Every 3 seconds] │ - │ │ │ - │ │─── POST /events ───▶│ - │ │◀─── 200 OK ─────────│ -``` - -**Event Features:** -- **Immediate Emission**: Events emitted when actions occur (not delayed by 30s heartbeat) -- **Automatic Batching**: Events collected for 3 seconds, then sent as single batch (max 100 events) -- **Memory Management**: In-memory queue (10,000 event limit) with overflow protection -- **Graceful Error Handling**: 429 exponential backoff, 400 drops invalid events, 500/network errors 
retry -- **10 Event Types**: Server lifecycle, client connections, tool discovery, configuration updates +The satellite emits typed events for status changes, logs, and tool metadata. Events enable real-time monitoring without polling. **Difference from Heartbeat:** - **Heartbeat** (every 30s): Aggregate metrics, system health, resource usage -- **Events** (immediate): Point-in-time occurrences, user actions, precise timestamps +- **Events** (immediate): Point-in-time status updates, precise timestamps + +See [Event Emission](/development/satellite/event-emission) for complete event types, payloads, and batching configuration. + +### Status Tracking System + +The satellite tracks MCP server installation health through an 11-state status system that drives tool availability and automatic recovery. + +**Status Values:** +- Installation lifecycle: `provisioning`, `command_received`, `connecting`, `discovering_tools`, `syncing_tools` +- Healthy state: `online` (tools available) +- Configuration changes: `restarting` +- Failure states: `offline`, `error`, `requires_reauth`, `permanently_failed` + +**Status Integration:** +- **Tool Filtering**: Tools from non-online servers hidden from discovery +- **Auto-Recovery**: Offline servers auto-recover when responsive +- **Event Emission**: Status changes emitted immediately to backend + +See [Status Tracking](/development/satellite/status-tracking) for complete status lifecycle and transitions. + +### Log Capture System + +The satellite captures and batches two types of logs for debugging and monitoring: **server logs** (stderr output) and **request logs** (tool execution with full request/response data). -For complete event system documentation, see [Event System](/development/satellite/event-system). +See [Log Capture](/development/satellite/log-capture) for buffering implementation, batching configuration, backend storage limits, and privacy controls. ## Security Architecture diff --git a/development/satellite/backend-communication.mdx b/development/satellite/backend-communication.mdx index 1f09bcf..b066638 100644 --- a/development/satellite/backend-communication.mdx +++ b/development/satellite/backend-communication.mdx @@ -157,11 +157,11 @@ For detailed event system documentation, see [Event System](/development/satelli - Performance metrics collection **Terminate Process:** -- Graceful shutdown with SIGTERM -- Force kill with SIGKILL after timeout - Resource cleanup and deallocation - Final status report to Backend +See [Process Management - Graceful Termination](/development/satellite/process-management#graceful-termination) for SIGTERM/SIGKILL shutdown details. + ## Internal Architecture ### Five Core Components @@ -375,6 +375,70 @@ server.log.info({ 4. Add comprehensive monitoring and alerting 5. End-to-end testing and performance validation +## Events vs Heartbeat + +The satellite communicates status and metrics through two distinct channels: + +**Events (Immediate):** +- Emitted when actions occur (not delayed by heartbeat interval) +- Point-in-time status updates with precise timestamps +- Batched automatically (3-second interval, max 20 per batch) +- Types: Status changes, logs, tool metadata, lifecycle events + +**Heartbeat (Periodic, every 30s):** +- Aggregate metrics and system health +- Resource usage statistics +- Overall satellite status + +See [Event Emission](/development/satellite/event-emission) for complete event types and batching strategy. 
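+
+To make the two channels concrete, here is a minimal sketch of the batching rule that events follow (3-second window, max 20 per batch). This is illustrative only — the class name and `send` callback are assumptions for the sketch, not the satellite's actual API:
+
+```typescript
+// Minimal sketch of the event batching rule: flush after 3 seconds,
+// or immediately once 20 events are queued. Illustrative only.
+class EventBatcher {
+  private queue: unknown[] = [];
+  private timer: NodeJS.Timeout | null = null;
+
+  constructor(
+    private send: (batch: unknown[]) => void,
+    private intervalMs = 3000,
+    private maxSize = 20
+  ) {}
+
+  add(event: unknown): void {
+    this.queue.push(event);
+    if (this.queue.length >= this.maxSize) {
+      this.flush(); // full batch → emit immediately
+    } else if (!this.timer) {
+      // first queued event opens the 3-second window
+      this.timer = setTimeout(() => this.flush(), this.intervalMs);
+    }
+  }
+
+  private flush(): void {
+    if (this.timer) {
+      clearTimeout(this.timer);
+      this.timer = null;
+    }
+    if (this.queue.length === 0) return;
+    this.send(this.queue);
+    this.queue = [];
+  }
+}
+```
+
+Heartbeat, by contrast, needs no queue at all: it is a fixed 30-second interval that assembles aggregate metrics regardless of activity.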
+ +## Health Check Command + +The backend sends `health_check` commands for credential validation: + +**Command Structure:** +```typescript +{ + commandType: 'health_check', + priority: 'immediate', + payload: { + check_type: 'credential_validation', + installation_id: string, + team_id: string + } +} +``` + +**Satellite Action:** +- Calls `tools/list` on MCP server with credentials +- Detects auth errors (401, 403) +- Emits `requires_reauth` status if validation fails + +See [Commands](/development/satellite/commands) for complete command reference. + +## Recovery Commands + +When offline servers recover, backend sends recovery commands: + +**Command Structure:** +```typescript +{ + commandType: 'configure', + priority: 'high', + payload: { + event: 'mcp_recovery', + installation_id: string, + team_id: string + } +} +``` + +**Satellite Action:** +- Triggers re-discovery for the recovered server +- Status progresses: `offline` → `connecting` → `discovering_tools` → `online` + +See [Recovery System](/development/satellite/recovery-system) for automatic recovery logic. + The satellite communication system is designed for enterprise deployment with complete team isolation, resource management, and audit logging while maintaining the developer experience that defines the DeployStack platform. diff --git a/development/satellite/commands.mdx b/development/satellite/commands.mdx index ff428df..cb328e1 100644 --- a/development/satellite/commands.mdx +++ b/development/satellite/commands.mdx @@ -115,6 +115,44 @@ Each satellite command contains: 4. Restart affected components 5. Verify system integrity post-update +### health_check + +**Purpose**: Validates MCP server credentials and connectivity + +**Priority**: `immediate` + +**Triggered By**: +- Backend credential validation cron (every 1 minute) +- Manual credential testing +- OAuth token expiration detection + +**Payload Structure**: +```json +{ + "check_type": "credential_validation", + "installation_id": "installation-uuid", + "team_id": "team-uuid" +} +``` + +**Satellite Actions**: +1. Find MCP server configuration by installation_id +2. Skip stdio servers (no HTTP credentials to validate) +3. Build HTTP request with configured credentials (headers, query params) +4. Call `tools/list` with 15-second timeout +5. Detect authentication errors: + - HTTP 401/403 responses + - Error messages containing "auth", "unauthorized", "forbidden" +6. Emit status event: + - On auth failure → `requires_reauth` status + - On success → credentials valid (no status change) + +**Error Detection Patterns**: +- HTTP status codes: 401, 403 +- Response body keywords: "auth", "unauthorized", "forbidden", "token", "credentials" + +See [Status Tracking](/development/satellite/status-tracking) for credential validation status flow. + ## Command Lifecycle ### Creation diff --git a/development/satellite/event-emission.mdx b/development/satellite/event-emission.mdx new file mode 100644 index 0000000..b4d96ba --- /dev/null +++ b/development/satellite/event-emission.mdx @@ -0,0 +1,426 @@ +--- +title: Event Emission +description: Events emitted by the satellite to communicate with the backend +--- + +# Event Emission + +The satellite communicates with the backend through a centralized EventBus that emits typed events. These events enable real-time status updates, log streaming, and tool metadata synchronization without polling. 
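+
+As a rough sketch of what "typed events" means here, the registry maps each event name to its payload shape, and `emit` is constrained by that map. The simplified bus below is illustrative only — the production implementation lives in `services/satellite/src/events/event-bus.ts`, and the real registry (shown under Event Registry below) covers all event types:
+
+```typescript
+// Simplified sketch of a typed EventBus. Reduced to one event type;
+// the real EventDataMap covers 13 event types.
+interface EventDataMap {
+  'mcp.server.status_changed': {
+    installation_id: string;
+    team_id: string;
+    status: string;
+    timestamp: string;
+  };
+}
+
+type Handler<T> = (data: T) => void;
+
+class EventBus {
+  private static instance: EventBus;
+  private handlers = new Map<keyof EventDataMap, Handler<never>[]>();
+
+  static getInstance(): EventBus {
+    return (EventBus.instance ??= new EventBus());
+  }
+
+  // The event name parameter constrains the payload shape at compile time
+  emit<K extends keyof EventDataMap>(type: K, data: EventDataMap[K]): void {
+    for (const handler of this.handlers.get(type) ?? []) {
+      (handler as Handler<EventDataMap[K]>)(data);
+    }
+  }
+
+  on<K extends keyof EventDataMap>(type: K, handler: Handler<EventDataMap[K]>): void {
+    const list = this.handlers.get(type) ?? [];
+    list.push(handler as Handler<never>);
+    this.handlers.set(type, list);
+  }
+}
+```
+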
+ +## Overview + +The satellite emits events for: +- **Status Changes**: Real-time installation status updates +- **Server Logs**: Batched stderr output from MCP servers +- **Request Logs**: Batched tool execution logs with request/response data +- **Tool Metadata**: Tool discovery results with token counts +- **Process Lifecycle**: Server start, crash, restart, permanent failure events + +All events are processed by the backend's event handler system and trigger database updates, SSE broadcasts to frontend, and health monitoring actions. + +## Event System Architecture + +``` +Satellite Component (ProcessManager, McpServerWrapper, DiscoveryManager) + ↓ +EventBus.emit(eventType, eventData) + ↓ +Backend Polling Service (30-second interval) + ↓ +Backend Event Handlers (process events, update database) + ↓ +Frontend SSE Streams (real-time updates to users) +``` + +## Event Types Reference + +### mcp.server.status_changed + +**Purpose:** Update installation status in real-time + +**Emitted by:** +- ProcessManager (connecting, online, crashed, permanently_failed) +- McpServerWrapper (offline, error, requires_reauth on tool execution failures) +- RemoteToolDiscoveryManager (connecting, online, offline, error, requires_reauth) + +For complete status transition triggers and lifecycle flows, see [Status Tracking](/development/satellite/status-tracking). + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + status: 'provisioning' | 'command_received' | 'connecting' | 'discovering_tools' + | 'syncing_tools' | 'online' | 'restarting' | 'offline' | 'error' + | 'requires_reauth' | 'permanently_failed'; + status_message?: string; + timestamp: string; // ISO 8601 +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.server.status_changed', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + status: 'online', + status_message: 'Server connected successfully', + timestamp: '2025-01-15T10:30:00.000Z' +}); +``` + +**Backend Action:** Updates `mcpServerInstallations.status` and broadcasts via SSE + +--- + +### mcp.server.logs + +**Purpose:** Stream server logs (stderr, connection errors, startup messages) to backend + +**Emitted by:** +- ProcessManager (batched stderr output from stdio MCP servers) + +**Batching Strategy:** +- **Interval**: 3 seconds after first log entry +- **Max Size**: 20 logs per batch (forces immediate flush) +- **Grouping**: By `installation_id + team_id` + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + logs: Array<{ + level: 'info' | 'warn' | 'error' | 'debug'; + message: string; + metadata?: Record; + timestamp: string; // ISO 8601 + }>; +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.server.logs', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + logs: [ + { + level: 'error', + message: 'Connection refused to http://localhost:3568/sse', + metadata: { error_code: 'ECONNREFUSED' }, + timestamp: '2025-01-15T10:30:00.000Z' + }, + { + level: 'info', + message: 'Retrying connection in 2 seconds...', + timestamp: '2025-01-15T10:30:02.000Z' + } + ] +}); +``` + +**Backend Action:** Inserts logs into `mcpServerLogs` table, enforces 100-line limit per installation + +--- + +### mcp.request.logs + +**Purpose:** Stream tool execution logs with full request/response data + +**Emitted by:** +- McpServerWrapper (batched tool call logs) + +**Batching Strategy:** +- **Interval**: 3 seconds after first request +- **Max Size**: 20 requests per batch +- **Grouping**: By `installation_id + team_id` + 
+**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + requests: Array<{ + user_id?: string; + tool_name: string; + tool_params: Record; + tool_response?: unknown; // Full MCP server response + response_time_ms: number; + success: boolean; + error_message?: string; + timestamp: string; // ISO 8601 + }>; +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.request.logs', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + requests: [ + { + user_id: 'user_xyz', + tool_name: 'github:list-repos', + tool_params: { owner: 'deploystackio' }, + tool_response: { repos: ['deploystack', 'mcp-server'], total: 2 }, + response_time_ms: 234, + success: true, + timestamp: '2025-01-15T10:30:00.000Z' + } + ] +}); +``` + +**Backend Action:** Inserts requests into `mcpRequestLogs` table, enforces 100-line limit + +**Privacy Note:** Only emitted if `settings.request_logging_enabled !== false` + +--- + +### mcp.tools.discovered + +**Purpose:** Synchronize discovered tools and metadata to backend + +**Emitted by:** +- UnifiedToolDiscoveryManager (after tool discovery completes) + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + tools: Array<{ + tool_path: string; // e.g., "github:list-repos" + name: string; + description?: string; + inputSchema: unknown; + token_count: number; // Estimated token usage + }>; + timestamp: string; // ISO 8601 +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.tools.discovered', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + tools: [ + { + tool_path: 'github:list-repos', + name: 'list-repos', + description: 'List all repositories for an owner', + inputSchema: { type: 'object', properties: { owner: { type: 'string' } } }, + token_count: 42 + } + ], + timestamp: '2025-01-15T10:30:00.000Z' +}); +``` + +**Backend Action:** Updates `mcpTools` table with discovered tools and metadata + +--- + +### Process Lifecycle Events + +These events track stdio MCP server process state: + +#### mcp.server.started + +**Emitted when:** Stdio process successfully spawned + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + timestamp: string; +} +``` + +#### mcp.server.crashed + +**Emitted when:** Stdio process terminates unexpectedly + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + exit_code: number | null; + signal: string | null; + crash_count: number; // Crashes within 5-minute window + timestamp: string; +} +``` + +#### mcp.server.restarted + +**Emitted when:** Stdio process automatically restarted after crash + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + restart_count: number; + timestamp: string; +} +``` + +#### mcp.server.permanently_failed + +**Emitted when:** Stdio process crashes 3 times within 5 minutes + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + crash_count: number; // Always 3 + message: string; // "Process crashed 3 times in 5 minutes" + timestamp: string; +} +``` + +**Backend Action:** Sets installation status to `permanently_failed`, requires manual restart + +--- + +## Event Batching Strategy + +### Why Batching? 
+ +Batching reduces: +- Backend API calls (20 logs = 1 API call instead of 20) +- Database transactions (bulk insert instead of individual inserts) +- Network overhead (fewer HTTP requests) +- Backend processing load (batch operations are more efficient) + +### Batching Configuration + +| Parameter | Value | Reason | +|-----------|-------|--------| +| Batch Interval | 3 seconds | Balance between real-time feel and efficiency | +| Max Batch Size | 20 entries | Prevent large payloads, force timely emission | +| Grouping Key | `installation_id + team_id` | Separate batches per installation | + +### Batching Implementation + +Log batching implementation details are in [Log Capture - Buffering Implementation](/development/satellite/log-capture#buffering-implementation) for both server logs and request logs. + +## EventBus Usage + +### Emitting Events + +```typescript +import { EventBus } from './events/event-bus'; + +// EventBus is a singleton +const eventBus = EventBus.getInstance(); + +// Emit with type safety +eventBus.emit('mcp.server.status_changed', { + installation_id: 'inst_123', + team_id: 'team_456', + status: 'online', + timestamp: new Date().toISOString() +}); +``` + +### Event Registry + +All event types are defined in the event registry: + +```typescript +// services/satellite/src/events/registry.ts + +export type EventType = + | 'mcp.server.status_changed' + | 'mcp.server.logs' + | 'mcp.request.logs' + | 'mcp.tools.discovered' + | 'mcp.server.started' + | 'mcp.server.crashed' + | 'mcp.server.restarted' + | 'mcp.server.permanently_failed' + // ... 13 total event types + ; + +export interface EventDataMap { + 'mcp.server.status_changed': { /* payload */ }; + 'mcp.server.logs': { /* payload */ }; + // ... type definitions for all events +} +``` + +## Backend Event Handlers + +Each event type has a dedicated backend handler: + +**Status Changed:** +```typescript +// services/backend/src/events/satellite/mcp-server-status-changed.ts +// Updates mcpServerInstallations.status +``` + +**Server Logs:** +```typescript +// services/backend/src/events/satellite/mcp-server-logs.ts +// Inserts into mcpServerLogs table +``` + +**Request Logs:** +```typescript +// services/backend/src/events/satellite/mcp-request-logs.ts +// Inserts into mcpRequestLogs table (if logging enabled) +``` + +**Tools Discovered:** +```typescript +// services/backend/src/events/satellite/mcp-tools-discovered.ts +// Updates mcpTools table with metadata +``` + +## Integration Points + +**Process Manager:** +- Emits server logs (stderr batching) +- Emits lifecycle events (started, crashed, restarted, permanently_failed) +- Emits status changes (connecting, online, permanently_failed) + +**MCP Server Wrapper:** +- Emits request logs (tool execution batching) +- Emits status changes (offline, error, requires_reauth on failures) +- Emits status changes (connecting, online on recovery) + +**Tool Discovery Managers:** +- Emit status changes (connecting, discovering_tools, online, offline, error) +- Trigger tool metadata emission via UnifiedToolDiscoveryManager + +**Unified Tool Discovery Manager:** +- Emits `mcp.tools.discovered` after successful discovery +- Coordinates status callbacks from discovery managers + +## Implementation References + +**Phase 3:** Backend event handler system +**Phase 4:** Satellite status event emission +**Phase 7:** Server and request log batching +**Phase 10:** Tool metadata event emission +**Phase 13:** Stdio permanently_failed event +**Phase 18:** Tool execution failure status events + +## 
Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Status values and lifecycle +- [Log Capture](/development/satellite/log-capture) - Logging system details +- [Process Management](/development/satellite/process-management) - Lifecycle events +- [Tool Discovery](/development/satellite/tool-discovery) - Tool metadata events diff --git a/development/satellite/hierarchical-router.mdx b/development/satellite/hierarchical-router.mdx index 5267e6c..9a11617 100644 --- a/development/satellite/hierarchical-router.mdx +++ b/development/satellite/hierarchical-router.mdx @@ -384,66 +384,9 @@ Satellite → Client ## Format Conversion -### External vs Internal Formats +The satellite converts between user-facing format (`serverName:toolName`) and internal routing format (`serverName-toolName`) transparently during tool discovery and execution. -The satellite uses different tool path formats for different purposes: - -**External Format (User-Facing): `serverName:toolName`** - -Used in: -- `discover_mcp_tools` responses -- `execute_mcp_tool` requests -- Any client-facing communication - -Examples: -- `github:create_issue` -- `figma:get_file` -- `postgres:query` - -Why colon? -- Standard separator in URIs and paths -- Clean, readable format -- Industry convention (npm packages, docker images) - -**Internal Format (Routing): `serverName-toolName`** - -Used in: -- Unified tool cache keys -- Tool discovery manager -- Process routing -- Internal lookups - -Examples: -- `github-create_issue` -- `figma-get_file` -- `postgres-query` - -Why dash? -- Existing codebase convention -- Backward compatibility -- All existing code uses dash format - -### Conversion Logic - -```typescript -// In handleExecuteTool() -const toolPath = "github:create_issue"; // From client - -// Parse external format -const [serverSlug, toolName] = toolPath.split(':'); - -// Convert to internal format -const namespacedToolName = `${serverSlug}-${toolName}`; -// Result: "github-create_issue" - -// Look up in cache -const cachedTool = toolDiscoveryManager.getTool(namespacedToolName); - -// Route to actual MCP server -await executeToolCall(namespacedToolName, toolArguments); -``` - -The conversion is transparent to both clients and actual MCP servers - it's purely a satellite internal concern. +See [Tool Discovery - Namespacing Strategy](/development/satellite/tool-discovery#namespacing-strategy) for complete details on naming conventions and format conversion logic. ## Search Implementation @@ -586,9 +529,17 @@ Both meta-tools are implemented and production-ready: - Fast search performance - Easy to monitor and debug +## Status-Based Tool Filtering + +The hierarchical router integrates with status tracking to hide tools from unavailable servers and provide clear error messages when unavailable tools are executed. + +See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete filtering logic, execution blocking rules, and status values. 
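+
+As a condensed sketch (the function shapes here are illustrative; the real logic lives in `UnifiedToolDiscoveryManager` and the MCP server wrapper), the router-side behavior amounts to:
+
+```typescript
+// discover_mcp_tools: list only tools whose server is currently 'online'
+function visibleTools(
+  allTools: Array<{ tool_path: string }>,
+  isServerAvailable: (serverSlug: string) => boolean
+): Array<{ tool_path: string }> {
+  return allTools.filter(tool => isServerAvailable(tool.tool_path.split(':')[0]));
+}
+
+// execute_mcp_tool: block non-recoverable servers with a clear error,
+// but let offline/error servers through so recovery can be detected
+function executionError(status: string | undefined, serverSlug: string): string | null {
+  if (status === 'requires_reauth') {
+    return `Tool cannot be executed - server '${serverSlug}' requires re-authentication`;
+  }
+  return null; // proceed, including offline/error (the server may have recovered)
+}
+```
+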
 ## Related Documentation
 
 - [Tool Discovery Implementation](/development/satellite/tool-discovery) - Internal tool caching and discovery
+- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status
+- [Recovery System](/development/satellite/recovery-system) - How offline servers auto-recover
 - [MCP Transport Protocols](/development/satellite/mcp-transport) - How clients connect
 - [Process Management](/development/satellite/process-management) - stdio server lifecycle
 - [Architecture Overview](/development/satellite/architecture) - Complete satellite design
diff --git a/development/satellite/log-capture.mdx b/development/satellite/log-capture.mdx
new file mode 100644
index 0000000..4f83bfd
--- /dev/null
+++ b/development/satellite/log-capture.mdx
@@ -0,0 +1,450 @@
+---
+title: Log Capture
+description: Server and request logging system in the satellite
+---
+
+# Log Capture
+
+The satellite captures and batches two types of logs for each MCP server installation: **server logs** (stderr output, connection errors, startup messages) and **request logs** (tool execution with full request/response data).
+
+## Overview
+
+Log capture serves three purposes:
+
+1. **Debugging**: Developers see stderr output and tool execution details
+2. **Monitoring**: Server health and tool usage are tracked in real-time
+3. **Audit Trail**: Every tool call is recorded with its parameters and responses
+
+Both log types use the same batching strategy (3-second interval, max 20 per batch) to optimize backend API calls and database writes.
+
+## Server Logs
+
+Server logs capture stderr output and connection events from MCP servers, particularly useful for debugging stdio-based servers.
+
+### What Gets Logged
+
+**Stdio Servers:**
+- stderr output from the MCP server process
+- Connection errors (handshake failures)
+- Process spawn errors
+- Crash information
+
+**HTTP/SSE Servers:**
+- Connection errors (ECONNREFUSED, ETIMEDOUT)
+- HTTP error responses (4xx, 5xx)
+- OAuth authentication failures
+- Network timeouts
+
+### Log Levels
+
+| Level | Usage |
+|-------|-------|
+| `info` | Normal operations (connection established, tool discovery started) |
+| `warn` | Non-critical issues (retry attempts, temporary failures) |
+| `error` | Critical errors (connection refused, auth failures, crashes) |
+| `debug` | Detailed diagnostic information (handshake details, raw responses) |
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/process/manager.ts
+
+interface BufferedLogEntry {
+  installation_id: string;
+  team_id: string;
+  level: 'info' | 'warn' | 'error' | 'debug';
+  message: string;
+  metadata?: Record<string, unknown>;
+  timestamp: string;
+}
+
+class ProcessManager {
+  private logBuffer: BufferedLogEntry[] = [];
+  private logFlushTimeout: NodeJS.Timeout | null = null;
+  private readonly LOG_BATCH_INTERVAL_MS = 3000;
+  private readonly LOG_BATCH_MAX_SIZE = 20;
+
+  // Called when stderr receives data
+  private handleStderrData(processInfo: ProcessInfo, data: Buffer) {
+    const message = data.toString().trim();
+
+    this.bufferLogEntry({
+      installation_id: processInfo.config.installation_id,
+      team_id: processInfo.config.team_id,
+      level: this.inferLogLevel(message), // 'error' if contains "error", etc.
+      message,
+      metadata: { process_id: processInfo.processId },
+      timestamp: new Date().toISOString()
+    });
+  }
+
+  private bufferLogEntry(entry: BufferedLogEntry) {
+    this.logBuffer.push(entry);
+
+    // Force immediate flush if buffer full
+    if (this.logBuffer.length >= this.LOG_BATCH_MAX_SIZE) {
+      this.flushLogBuffer();
+    } else {
+      this.scheduleLogFlush(); // Flush after 3 seconds
+    }
+  }
+
+  private scheduleLogFlush() {
+    if (this.logFlushTimeout) return; // Already scheduled
+
+    this.logFlushTimeout = setTimeout(() => {
+      this.flushLogBuffer();
+    }, this.LOG_BATCH_INTERVAL_MS);
+  }
+
+  private flushLogBuffer() {
+    if (this.logBuffer.length === 0) return;
+
+    // Group by installation
+    const groupedLogs = new Map<string, BufferedLogEntry[]>();
+    for (const entry of this.logBuffer) {
+      const key = `${entry.installation_id}:${entry.team_id}`;
+      if (!groupedLogs.has(key)) {
+        groupedLogs.set(key, []);
+      }
+      groupedLogs.get(key)!.push(entry);
+    }
+
+    // Emit one event per installation
+    for (const logs of groupedLogs.values()) {
+      this.eventBus?.emit('mcp.server.logs', {
+        installation_id: logs[0].installation_id,
+        team_id: logs[0].team_id,
+        logs: logs.map(log => ({
+          level: log.level,
+          message: log.message,
+          metadata: log.metadata,
+          timestamp: log.timestamp
+        }))
+      });
+    }
+
+    // Clear buffer
+    this.logBuffer = [];
+    this.logFlushTimeout = null;
+  }
+}
+```
+
+### Example Server Logs
+
+```json
+{
+  "installation_id": "inst_abc123",
+  "team_id": "team_xyz",
+  "logs": [
+    {
+      "level": "info",
+      "message": "MCP server starting on port 3568",
+      "timestamp": "2025-01-15T10:30:00.000Z"
+    },
+    {
+      "level": "error",
+      "message": "Connection refused: ECONNREFUSED",
+      "metadata": { "error_code": "ECONNREFUSED" },
+      "timestamp": "2025-01-15T10:30:05.000Z"
+    },
+    {
+      "level": "warn",
+      "message": "Retrying connection in 2 seconds...",
+      "timestamp": "2025-01-15T10:30:07.000Z"
+    }
+  ]
+}
+```
+
+## Request Logs
+
+Request logs capture tool execution with full request parameters and server responses, providing complete visibility into MCP tool usage.
+
+### What Gets Logged
+
+For each tool execution:
+- Tool name (e.g., `github:list-repos`)
+- Input parameters sent to tool
+- **Full response from MCP server** (captured in Phase 14)
+- Response time in milliseconds
+- Success/failure status
+- Error message (if failed)
+- User ID (who called the tool)
+- Timestamp
+
+### Privacy Control
+
+Request logging can be disabled per-installation via settings:
+
+```typescript
+// Installation settings
+{
+  "request_logging_enabled": false
+}
+```
+
+When disabled:
+- No request logs are buffered or emitted
+- Tool execution still works normally
+- Server logs (stderr) still captured
+- Used for privacy-sensitive tools (internal APIs, credentials, PII)
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface BufferedRequestEntry {
+  installation_id: string;
+  team_id: string;
+  user_id?: string;
+  tool_name: string;
+  tool_params: Record<string, unknown>;
+  tool_response?: unknown; // Full MCP server response
+  response_time_ms: number;
+  success: boolean;
+  error_message?: string;
+  timestamp: string;
+}
+
+class McpServerWrapper {
+  private requestLogBuffer: BufferedRequestEntry[] = [];
+  private requestLogFlushTimeout: NodeJS.Timeout | null = null;
+  private readonly REQUEST_LOG_BATCH_INTERVAL_MS = 3000;
+  private readonly REQUEST_LOG_BATCH_MAX_SIZE = 20;
+
+  async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+    const startTime = Date.now();
+    let result: unknown;
+    let success = false;
+    let errorMessage: string | undefined;
+
+    try {
+      result = await this.executeToolCall(toolPath, toolArguments);
+      success = true;
+    } catch (error) {
+      errorMessage = error instanceof Error ? error.message : 'Unknown error';
+      throw error; // Propagate after the finally block records the failed request
+    } finally {
+      const responseTimeMs = Date.now() - startTime;
+
+      // Check if logging is enabled (default: true)
+      // config: the installation context resolved for this tool's server
+      const loggingEnabled = config?.settings?.request_logging_enabled !== false;
+
+      // Buffer request log if installation context exists and logging enabled
+      if ((config?.installation_id && config?.team_id) && loggingEnabled) {
+        this.bufferRequestLogEntry({
+          installation_id: config.installation_id,
+          team_id: config.team_id,
+          user_id: config.user_id,
+          tool_name: toolPath,
+          tool_params: toolArguments as Record<string, unknown>,
+          tool_response: result, // Captured response
+          response_time_ms: responseTimeMs,
+          success,
+          error_message: errorMessage,
+          timestamp: new Date().toISOString()
+        });
+      }
+    }
+
+    return result;
+  }
+
+  private bufferRequestLogEntry(entry: BufferedRequestEntry) {
+    this.requestLogBuffer.push(entry);
+
+    // Force flush if buffer full
+    if (this.requestLogBuffer.length >= this.REQUEST_LOG_BATCH_MAX_SIZE) {
+      this.flushRequestLogBuffer();
+    } else {
+      this.scheduleRequestLogFlush(); // mirrors ProcessManager.scheduleLogFlush()
+    }
+  }
+
+  private flushRequestLogBuffer() {
+    if (this.requestLogBuffer.length === 0) return;
+
+    // Group by installation
+    const grouped = this.groupRequestsByInstallation(this.requestLogBuffer);
+
+    // Emit one event per installation
+    for (const requests of grouped.values()) {
+      this.eventBus?.emit('mcp.request.logs', {
+        installation_id: requests[0].installation_id,
+        team_id: requests[0].team_id,
+        requests: requests.map(req => ({
+          user_id: req.user_id,
+          tool_name: req.tool_name,
+          tool_params: req.tool_params,
+          tool_response: req.tool_response, // Include response
+          response_time_ms: req.response_time_ms,
+          success: req.success,
+          error_message: req.error_message,
+          timestamp: req.timestamp
+        }))
+      });
+    }
+
+    // Clear buffer
+    this.requestLogBuffer = [];
+    this.requestLogFlushTimeout = null;
+  }
+}
+```
+
+### Example Request Logs
+
+```json
+{
+  "installation_id": "inst_abc123",
+  "team_id": "team_xyz",
+  "requests": [
+    {
+      "user_id": "user_xyz",
+      "tool_name": "github:list-repos",
+      "tool_params": {
+        "owner": "deploystackio"
+      },
+      "tool_response": {
+        "repos": ["deploystack", "mcp-server"],
+        "total": 2
+      },
+      "response_time_ms": 234,
+      "success": true,
+      "timestamp": "2025-01-15T10:30:00.000Z"
+    },
+    {
+      "user_id": "user_xyz",
+      "tool_name": "slack:send-message",
+      "tool_params": {
+        "channel": "#general",
+        "text": "Deploy complete"
+      },
+      "response_time_ms": 456,
+      "success": false,
+      "error_message": "Channel not found",
+      "timestamp": "2025-01-15T10:30:05.000Z"
+    }
+  ]
+}
+```
+
+## Batching Configuration
+
+Both server logs and request logs use the same batching strategy. See [Event Emission - Batching Configuration](/development/satellite/event-emission#batching-configuration) for configuration parameters and rationale.
+
+### Batching Flow
+
+```
+Log/Request occurs
+  ↓
+Buffer entry in memory
+  ↓
+  ├─ Buffer size < 20?
+  │    ↓
+  │  Schedule flush after 3 seconds
+  │
+  └─ Buffer size >= 20?
+       ↓
+     Flush immediately (force)
+  ↓
+Group entries by installation
+  ↓
+Emit one event per installation
+  ↓
+Backend receives batched logs
+  ↓
+Bulk insert into database
+```
+
+## Backend Storage
+
+### Server Logs Table
+
+```sql
+CREATE TABLE mcpServerLogs (
+  id TEXT PRIMARY KEY,
+  installation_id TEXT NOT NULL,
+  level TEXT NOT NULL, -- 'info'|'warn'|'error'|'debug'
+  message TEXT NOT NULL,
+  metadata JSONB,
+  created_at TIMESTAMP NOT NULL,
+  FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id)
+);
+```
+
+### Request Logs Table
+
+```sql
+CREATE TABLE mcpRequestLogs (
+  id TEXT PRIMARY KEY,
+  installation_id TEXT NOT NULL,
+  user_id TEXT,
+  tool_name TEXT NOT NULL,
+  tool_params JSONB NOT NULL,
+  tool_response JSONB, -- Full response from MCP server
+  response_time_ms INTEGER NOT NULL,
+  success BOOLEAN NOT NULL,
+  error_message TEXT,
+  created_at TIMESTAMP NOT NULL,
+  FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id),
+  FOREIGN KEY (user_id) REFERENCES authUser(id)
+);
+```
+
+### Cleanup Job
+
+A backend cron job enforces a 100-line limit per installation for both tables:
+
+```typescript
+// Runs every 10 minutes
+// For each installation with > 100 logs:
+//   1. Find oldest logs to delete (keep most recent 100)
+//   2. DELETE FROM table WHERE id NOT IN (recent 100)
+```
+
+This prevents unbounded table growth while maintaining recent debugging history.
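+
+A pruning statement in the spirit of the pseudocode above might look like the following. This is an illustrative sketch — the query shape and the `db.run` runner are assumptions, not the backend's actual cron implementation:
+
+```typescript
+// Keep only the 100 most recent rows per installation.
+// The same shape applies to mcpRequestLogs.
+const PRUNE_SERVER_LOGS_SQL = `
+  DELETE FROM mcpServerLogs
+  WHERE installation_id = ?
+    AND id NOT IN (
+      SELECT id FROM mcpServerLogs
+      WHERE installation_id = ?
+      ORDER BY created_at DESC
+      LIMIT 100
+    )
+`;
+
+async function pruneServerLogs(
+  db: { run(sql: string, params: unknown[]): Promise<void> }, // placeholder query runner
+  installationId: string
+): Promise<void> {
+  await db.run(PRUNE_SERVER_LOGS_SQL, [installationId, installationId]);
+}
+```
+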
+ +## Buffer Management + +### Memory Usage + +**Server Logs:** +- Maximum ~20 entries in buffer before flush +- Each entry: ~200 bytes average (message + metadata) +- Max buffer size: ~4 KB per ProcessManager instance + +**Request Logs:** +- Maximum ~20 entries in buffer before flush +- Each entry: Variable (depends on params/response size) +- Typically: 500 bytes - 5 KB per entry +- Max buffer size: ~10-100 KB per McpServerWrapper instance + +### Cleanup on Shutdown + +Both buffer managers flush remaining logs on cleanup: + +```typescript +// ProcessManager cleanup +cleanup() { + this.flushLogBuffer(); // Flush any buffered logs + clearTimeout(this.logFlushTimeout); +} + +// McpServerWrapper cleanup +cleanup() { + this.flushRequestLogBuffer(); // Flush any buffered requests + clearTimeout(this.requestLogFlushTimeout); +} +``` + +## Implementation References + +**Phase 7:** Server and request log batching implementation +**Phase 14:** Request logging toggle and tool response capture +**Phase 5:** Backend log tables and event handlers +**Phase 6:** 100-line cleanup job + +## Related Documentation + +- [Event Emission](/development/satellite/event-emission) - Log event types and payloads +- [Process Management](/development/satellite/process-management) - Server log buffering +- [Status Tracking](/development/satellite/status-tracking) - How logs relate to status diff --git a/development/satellite/process-management.mdx b/development/satellite/process-management.mdx index 34c974e..5c8ceb3 100644 --- a/development/satellite/process-management.mdx +++ b/development/satellite/process-management.mdx @@ -408,36 +408,11 @@ The ProcessManager emits events for monitoring and integration: ## Event Emission -The ProcessManager emits real-time events to the Backend for operational visibility and audit trails. These events are batched every 3 seconds and sent via the Event System. +The ProcessManager emits real-time lifecycle events (started, crashed, restarted, permanently_failed) to the Backend for operational visibility and audit trails. -### Lifecycle Events +ProcessManager internal events (processSpawned, processTerminated) are for satellite-internal coordination. Event System events (mcp.server.started, etc.) are sent to Backend for external visibility. -**mcp.server.started** -- Emitted after successful spawn and handshake completion -- Includes: server_id, process_id, spawn_duration_ms, tool_count -- Provides immediate visibility into new MCP server availability - -**mcp.server.crashed** -- Emitted on unexpected process exit with non-zero code -- Includes: exit_code, signal, uptime_seconds, crash_count, will_restart -- Enables real-time alerting for process failures - -**mcp.server.restarted** -- Emitted after successful automatic restart -- Includes: old_process_id, new_process_id, restart_reason, attempt_number -- Tracks restart attempts for reliability monitoring - -**mcp.server.permanently_failed** -- Emitted when restart limit (3 attempts) is exceeded -- Includes: total_crashes, last_error, failed_at timestamp -- Critical alert requiring manual intervention - -**Event vs Internal Events:** -- ProcessManager internal events (processSpawned, processTerminated, etc.) are for satellite-internal coordination -- Event System events (mcp.server.started, etc.) 
are sent to Backend for external visibility -- Both work together: Internal events trigger state changes, Event System events provide audit trail - -For complete event system documentation and all event types, see [Event System](/development/satellite/event-system). +See [Event Emission - Process Lifecycle Events](/development/satellite/event-emission#event-types-reference) for complete event types, payloads, and batching configuration. ## Team Isolation @@ -531,10 +506,51 @@ LOG_LEVEL=debug npm run dev - Enabled by default (MCP servers need external connectivity) - Can be disabled for higher security requirements +## Status Events + +Process lifecycle emits status events to backend for real-time monitoring: + +**Status Event Emission:** +- `connecting` - When process spawn starts +- `online` - After successful handshake and tool discovery +- `permanently_failed` - When process crashes 3 times in 5 minutes + +See [Event Emission](/development/satellite/event-emission) for complete event types and payloads. + +## Log Buffering + +Process stderr output is buffered and batched before emission: + +**Buffering Strategy:** +- Batch interval: 3 seconds after first log +- Max batch size: 20 logs (forces immediate flush) +- Grouping: By installation_id + team_id + +**Log Levels:** +- Inferred from message content (`error` if contains "error", etc.) +- Metadata includes process_id for debugging + +See [Log Capture](/development/satellite/log-capture) for buffer management details. + +## Configuration Restart Flow + +When configuration is updated (env vars, args, headers, query params): + +1. Backend sets installation status to `restarting` +2. Backend sends `configure` command to satellite +3. Satellite receives command and stops old process +4. Satellite clears tool cache for installation +5. Satellite spawns new process with updated configuration +6. Status progresses: `restarting` → `connecting` → `discovering_tools` → `online` + +See [Status Tracking](/development/satellite/status-tracking) for configuration update status transitions. + ## Related Documentation - [Satellite Architecture Design](/development/satellite/architecture) - Overall system architecture - [Idle Process Management](/development/satellite/idle-process-management) - Automatic termination and respawning of idle processes - [Tool Discovery Implementation](/development/satellite/tool-discovery) - How tools are discovered from processes -- [Team Isolation Implementation](/development/satellite/team-isolation) - Team-based access control +- [Event Emission](/development/satellite/event-emission) - Process lifecycle events +- [Log Capture](/development/satellite/log-capture) - stderr log buffering +- [Status Tracking](/development/satellite/status-tracking) - Process status management - [Backend Communication](/development/satellite/backend-communication) - Integration with Backend commands diff --git a/development/satellite/recovery-system.mdx b/development/satellite/recovery-system.mdx new file mode 100644 index 0000000..84527a5 --- /dev/null +++ b/development/satellite/recovery-system.mdx @@ -0,0 +1,370 @@ +--- +title: Recovery System +description: Automatic recovery and failure handling for MCP servers +--- + +# Recovery System + +The satellite automatically detects and recovers from MCP server failures without manual intervention. Recovery works for HTTP/SSE servers (network failures) and stdio servers (process crashes). 
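+
+At a glance, the distinction that drives the rest of this page can be summarized as follows (an illustrative sketch, not actual satellite code):
+
+```typescript
+// Which failure states the satellite recovers from on its own
+type FailureStatus = 'offline' | 'error' | 'requires_reauth' | 'permanently_failed';
+
+const AUTO_RECOVERABLE: Record<FailureStatus, boolean> = {
+  offline: true,              // unreachable; recovers when the server responds again
+  error: true,                // general error; recovers on the next successful call
+  requires_reauth: false,     // user must re-authenticate in the dashboard
+  permanently_failed: false,  // 3+ crashes in 5 minutes; manual restart required
+};
+```
+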
+## Overview
+
+The recovery system handles **HTTP/SSE Servers** (network failures, server downtime, connection timeouts) and **Stdio Servers** (process crashes up to 3 times in 5 minutes).
+
+Recovery is fully automatic for recoverable failures. Permanent failures (3+ crashes, OAuth token expired) require manual action.
+
+## Recovery Detection
+
+### Tool Execution Recovery
+
+When a tool is executed on a server that was previously offline/error, recovery is detected automatically:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+  const serverSlug = toolPath.split(':')[0];
+  const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug);
+  const wasOfflineOrError = statusEntry && ['offline', 'error'].includes(statusEntry.status);
+
+  // Execute tool with retry logic
+  const result = await this.executeHttpToolCallWithRetry(...);
+
+  // If execution succeeded but server was offline/error → RECOVERY DETECTED
+  if (wasOfflineOrError) {
+    this.handleServerRecovery(serverSlug, config); // config: this server's installation context
+  }
+
+  return result;
+}
+```
+
+### Health Check Recovery
+
+Backend health checks periodically test offline servers. When they respond again:
+
+```
+Backend health check runs (every 3 minutes)
+    ↓
+Offline template now responds
+    ↓
+Backend sets installations to 'connecting'
+    ↓
+Backend sends 'configure' command with event='mcp_recovery'
+    ↓
+Satellite receives command and triggers re-discovery
+    ↓
+Status progresses: connecting → discovering_tools → online
+```
+
+## Retry Logic (HTTP/SSE)
+
+Before marking a server as offline, the satellite retries tool execution with exponential backoff:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface RetryConfig {
+  maxRetries: 3;
+  backoffMs: [500, 1000, 2000]; // Exponential: 500ms, 1s, 2s
+}
+
+async executeHttpToolCallWithRetry(
+  serverConfig: McpServerConfig,
+  toolName: string,
+  args: unknown
+): Promise<unknown> {
+  let lastError: Error | undefined;
+
+  for (let attempt = 1; attempt <= 3; attempt++) {
+    try {
+      const response = await this.executeHttpToolCall(serverConfig, toolName, args);
+      return response; // Success - no retry needed
+    } catch (error) {
+      lastError = error as Error;
+
+      // Non-retryable errors (auth failures) → fail immediately
+      if (this.isNonRetryableError(lastError)) {
+        throw error;
+      }
+
+      // Retryable errors (connection refused) → backoff and retry
+      if (attempt < 3) {
+        const backoffMs = [500, 1000, 2000][attempt - 1];
+        await new Promise(resolve => setTimeout(resolve, backoffMs));
+      }
+    }
+  }
+
+  // All retries exhausted → throw last error
+  throw lastError;
+}
+
+private isNonRetryableError(error: Error): boolean {
+  const msg = error.message.toLowerCase();
+  return msg.includes('401') || msg.includes('403') ||
+         msg.includes('unauthorized') || msg.includes('forbidden') ||
+         msg.includes('oauth') || msg.includes('authorization required');
+}
+```
+
+### Retryable vs Non-Retryable Errors
+
+| Error Type | Action | Reason |
+|------------|--------|--------|
+| ECONNREFUSED | **Retry** | Server may be restarting |
+| ETIMEDOUT | **Retry** | Network hiccup, may recover |
+| ENOTFOUND | **Retry** | DNS issue, may be temporary |
+| fetch failed | **Retry** | Network error, transient |
+| 401 Unauthorized | **No retry** | Token expired, retrying won't help |
+| 403 Forbidden | **No retry** | Access denied, retrying won't help |
+| OAuth errors | **No retry** | Auth issue, needs user action |
+
+## Recovery Flow
+
+When servers
recover from failure, the satellite updates status and triggers re-discovery asynchronously without blocking tool execution responses. + +See [Status Tracking - Status Lifecycle](/development/satellite/status-tracking#status-lifecycle) for complete recovery flow diagrams including successful recovery, failed recovery, and status transitions. + +## Automatic Re-Discovery + +When recovery is detected, tools are refreshed from the server without blocking the user: + +```typescript +// services/satellite/src/core/mcp-server-wrapper.ts + +private async handleServerRecovery( + serverSlug: string, + config: McpServerConfig +): Promise { + // Prevent duplicate recovery attempts + if (this.recoveryInProgress.has(serverSlug)) { + return; // Already recovering + } + + this.recoveryInProgress.add(serverSlug); + + try { + this.logger.info({ serverSlug }, 'Server recovered - triggering re-discovery'); + + // Emit status change to backend + this.eventBus?.emit('mcp.server.status_changed', { + installation_id: config.installation_id, + team_id: config.team_id, + status: 'connecting', + status_message: 'Server recovered, re-discovering tools', + timestamp: new Date().toISOString() + }); + + // Trigger re-discovery asynchronously (doesn't block tool response) + await this.toolDiscoveryManager?.remoteToolManager?.discoverServerTools(serverSlug); + + this.logger.info({ serverSlug }, 'Tool re-discovery successful after recovery'); + } catch (error) { + // Re-discovery failed (non-fatal, tool response still returned) + this.logger.error({ serverSlug, error }, 'Tool re-discovery failed after recovery'); + } finally { + this.recoveryInProgress.delete(serverSlug); + } +} +``` + +### Why Asynchronous Re-Discovery? + +**User Experience:** +- Tool execution result returned immediately +- User doesn't wait for tool discovery (can take 1-5 seconds) +- If re-discovery fails, user already got their result + +**Reliability:** +- Tool response isn't blocked by discovery errors +- Discovery failure doesn't affect user's current request +- Recovery can be retried later + +## Tool Preservation + +When re-discovery fails, tools are NOT removed from cache: + +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +async rediscoverServerTools(serverSlug: string): Promise { + try { + // Attempt discovery + const newTools = await this.fetchToolsFromServer(serverSlug); + + // Discovery succeeded → remove old tools and add new ones + this.removeToolsForServer(serverSlug); + this.addTools(newTools); + + this.statusCallback?.(serverSlug, 'online'); + } catch (error) { + // Discovery failed → keep old tools in cache + // Tools remain available for future attempts + this.statusCallback?.(serverSlug, 'error', error.message); + } +} +``` + +**Why preserve tools on failure?** +- User can still see what tools are available +- Tools may work if server recovers later +- Better UX than empty tool list +- Discovery can be retried without losing tool metadata + +## Stdio Process Recovery + +Stdio servers auto-restart after crashes (up to 3 times in 5 minutes): + +```typescript +// services/satellite/src/process/manager.ts + +async handleProcessExit(processInfo: ProcessInfo, exitCode: number) { + const now = Date.now(); + const fiveMinutesAgo = now - 5 * 60 * 1000; + + // Track crashes in 5-minute window + processInfo.crashHistory = processInfo.crashHistory.filter(t => t > fiveMinutesAgo); + processInfo.crashHistory.push(now); + + const crashCount = processInfo.crashHistory.length; + + if (crashCount >= 3) { + // 
Permanent failure - emit status event + this.eventBus?.emit('mcp.server.permanently_failed', { + installation_id: processInfo.config.installation_id, + team_id: processInfo.config.team_id, + process_id: processInfo.processId, + crash_count: crashCount, + message: `Process crashed ${crashCount} times in 5 minutes`, + timestamp: new Date().toISOString() + }); + + // Also emit status_changed for database update + this.eventBus?.emit('mcp.server.status_changed', { + installation_id: processInfo.config.installation_id, + team_id: processInfo.config.team_id, + status: 'permanently_failed', + status_message: `Process crashed ${crashCount} times in 5 minutes. Manual restart required.`, + timestamp: new Date().toISOString() + }); + + return; // No auto-restart + } + + // Auto-restart (crash count < 3) + this.logger.info({ processId: processInfo.processId, crashCount }, 'Auto-restarting crashed process'); + await this.startProcess(processInfo.config); +} +``` + +### Stdio Recovery Timeline + +``` +Process crashes (crash #1) + ↓ +Auto-restart immediately + ↓ +Process crashes again (crash #2, within 5 min) + ↓ +Auto-restart immediately + ↓ +Process crashes again (crash #3, within 5 min) + ↓ +Status → 'permanently_failed' + ↓ +No auto-restart (manual action required) +``` + +## Failure Status Mapping + +When tool execution fails after all retries, error messages are mapped to appropriate status values: + +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +static getStatusFromError(error: Error): { status: string; message: string } { + const msg = error.message.toLowerCase(); + + // Auth errors → requires_reauth + if (msg.includes('401') || msg.includes('unauthorized')) { + return { status: 'requires_reauth', message: 'Authentication failed (HTTP 401)' }; + } + if (msg.includes('403') || msg.includes('forbidden')) { + return { status: 'requires_reauth', message: 'Access forbidden (HTTP 403)' }; + } + + // Connection errors → offline + if (msg.includes('econnrefused') || msg.includes('etimedout') || + msg.includes('enotfound') || msg.includes('fetch failed')) { + return { status: 'offline', message: 'Server unreachable' }; + } + + // Other errors → error + return { status: 'error', message: error.message }; +} +``` + +## Debouncing Concurrent Recovery + +Multiple tool executions may detect recovery simultaneously. Debouncing prevents duplicate re-discoveries: + +```typescript +class McpServerWrapper { + private recoveryInProgress: Set = new Set(); + + private async handleServerRecovery(serverSlug: string, config: McpServerConfig) { + // Check if already recovering + if (this.recoveryInProgress.has(serverSlug)) { + return; // Skip duplicate recovery + } + + this.recoveryInProgress.add(serverSlug); + + try { + await this.performRecovery(serverSlug, config); + } finally { + this.recoveryInProgress.delete(serverSlug); + } + } +} +``` + +**Scenario:** +- LLM executes 3 tools from same server concurrently +- All 3 detect recovery (server was offline) +- Only first execution triggers re-discovery +- Other 2 skip (already in progress) + +## Recovery Timing + +| Recovery Type | Detection Time | Re-Discovery Time | Total | +|---------------|----------------|-------------------|-------| +| **Tool Execution** | Immediate (on next tool call) | 1-5 seconds | ~1-5s | +| **Health Check** | Up to 3 minutes (polling interval) | 1-5 seconds | ~3-8 min | + +**Recommendation:** Tool execution recovery is faster and more responsive than health check recovery. 
+ +## Manual Recovery (Requires User Action) + +Some failures cannot auto-recover: + +| Status | Reason | User Action | +|--------|--------|-------------| +| `requires_reauth` | OAuth token expired/revoked | Re-authenticate in dashboard | +| `permanently_failed` | 3+ crashes in 5 minutes (stdio) | Check logs, fix issue, manual restart | + +See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays). + +## Implementation References + +**Phase 13:** Stdio auto-recovery and permanently_failed status +**Phase 18:** Tool execution retry logic and recovery detection +**Phase 8:** Health check recovery via backend + +## Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Status values and transitions +- [Event Emission](/development/satellite/event-emission) - Recovery status events +- [Tool Discovery](/development/satellite/tool-discovery) - Re-discovery after recovery +- [Process Management](/development/satellite/process-management) - Stdio crash recovery diff --git a/development/satellite/status-tracking.mdx b/development/satellite/status-tracking.mdx new file mode 100644 index 0000000..a530c55 --- /dev/null +++ b/development/satellite/status-tracking.mdx @@ -0,0 +1,284 @@ +--- +title: Status Tracking +description: MCP server installation status tracking system in the satellite +--- + +# Status Tracking + +The satellite tracks the health and availability of each MCP server installation through an 11-state status system. This enables real-time monitoring, automatic recovery, and tool availability filtering. + +## Overview + +Status tracking serves three primary purposes: + +1. **User Visibility**: Users see current server state in real-time via the frontend +2. **Tool Availability**: Tools from unavailable servers are filtered from discovery +3. **Automatic Recovery**: System detects and recovers from failures automatically + +The status system is managed by `UnifiedToolDiscoveryManager` and updated through: +- Installation lifecycle events (provisioning → online) +- Health check results (online → offline) +- Tool execution failures (online → offline/error/requires_reauth) +- Configuration changes (online → restarting) +- Recovery detection (offline → connecting → online) + +## Status Values + +| Status | Description | Tools Available? 
| User Action Required | +|--------|-------------|------------------|---------------------| +| `provisioning` | Initial state after installation created | No | Wait | +| `command_received` | Satellite received configuration command | No | Wait | +| `connecting` | Connecting to MCP server | No | Wait | +| `discovering_tools` | Running tool discovery | No | Wait | +| `syncing_tools` | Syncing tools to backend | No | Wait | +| `online` | Server healthy and responding | **Yes** | None | +| `restarting` | Configuration updated, server restarting | No | Wait | +| `offline` | Server unreachable (auto-recovers) | No | Wait or check server | +| `error` | General error state (auto-recovers) | No | Check logs | +| `requires_reauth` | OAuth token expired/revoked | No | Re-authenticate | +| `permanently_failed` | 3+ crashes in 5 minutes (stdio only) | No | Manual restart required | + +## Status Lifecycle + +### Initial Installation Flow + +``` +provisioning + ↓ +command_received (satellite received configure command) + ↓ +connecting (spawning MCP server process or connecting to HTTP/SSE) + ↓ +discovering_tools (calling tools/list) + ↓ +syncing_tools (sending tools to backend) + ↓ +online (ready for use) +``` + +### Configuration Update Flow + +``` +online + ↓ +restarting (user updated config, backend sets status immediately) + ↓ +connecting (satellite receives command, restarts server) + ↓ +discovering_tools + ↓ +online +``` + +### Failure and Recovery Flow + +``` +online + ↓ +offline/error (server unreachable or error response) + ↓ +[automatic recovery when server comes back] + ↓ +connecting + ↓ +discovering_tools + ↓ +online +``` + +### OAuth Failure Flow + +``` +online + ↓ +requires_reauth (401/403 response or token refresh failed) + ↓ +[user re-authenticates via dashboard] + ↓ +connecting + ↓ +discovering_tools + ↓ +online +``` + +### Stdio Crash Flow (Permanent Failure) + +``` +online + ↓ +(stdio process crashes) + ↓ +connecting (auto-restart attempt 1) + ↓ +(crashes again within 5 minutes) + ↓ +connecting (auto-restart attempt 2) + ↓ +(crashes again within 5 minutes) + ↓ +permanently_failed (manual intervention required) +``` + +## Status Tracking Implementation + +### UnifiedToolDiscoveryManager + +The status system is implemented in `UnifiedToolDiscoveryManager`: + +```typescript +// services/satellite/src/services/unified-tool-discovery-manager.ts + +export type ServerAvailabilityStatus = + | 'online' + | 'offline' + | 'error' + | 'requires_reauth' + | 'permanently_failed' + | 'connecting' + | 'discovering_tools'; + +export interface ServerStatusEntry { + status: ServerAvailabilityStatus; + lastUpdated: Date; + message?: string; +} + +class UnifiedToolDiscoveryManager { + private serverStatus: Map = new Map(); + + // Set server status (called by discovery managers and MCP wrapper) + setServerStatus(serverSlug: string, status: ServerAvailabilityStatus, message?: string): void { + this.serverStatus.set(serverSlug, { + status, + lastUpdated: new Date(), + message + }); + } + + // Check if server is available for tool execution + isServerAvailable(serverSlug: string): boolean { + const statusEntry = this.serverStatus.get(serverSlug); + if (!statusEntry) return true; // Unknown = available (safe default) + return statusEntry.status === 'online'; + } + + // Get all tools, filtered by server status + getAllTools(): ToolMetadata[] { + const allTools = this.getAllToolsUnfiltered(); + return allTools.filter(tool => { + const serverSlug = tool.tool_path.split(':')[0]; + return 
this.isServerAvailable(serverSlug); + }); + } +} +``` + +### Status Callbacks + +Discovery managers call status callbacks when discovery succeeds or fails: + +**HTTP/SSE Discovery:** +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +// On successful discovery +this.statusCallback?.(serverSlug, 'online'); + +// On connection error +const { status, message } = RemoteToolDiscoveryManager.getStatusFromError(error); +this.statusCallback?.(serverSlug, status, message); +``` + +**Stdio Discovery:** +```typescript +// services/satellite/src/services/stdio-tool-discovery-manager.ts + +// On successful discovery +this.statusCallback?.(processId, 'online'); + +// On discovery error +this.statusCallback?.(processId, 'error', errorMessage); +``` + +## Tool Filtering by Status + +### Discovery Filtering + +When LLMs call `discover_mcp_tools`, only tools from available servers are returned: + +```typescript +// UnifiedToolDiscoveryManager.getAllTools() filters by status +const tools = toolDiscoveryManager.getAllTools(); // Only 'online' servers + +// Tools from offline/error/requires_reauth servers are hidden +``` + +### Execution Blocking + +When LLMs attempt to execute tools from unavailable servers: + +```typescript +// services/satellite/src/core/mcp-server-wrapper.ts + +const serverSlug = toolPath.split(':')[0]; +const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug); + +// Block execution for non-recoverable states +if (statusEntry?.status === 'requires_reauth') { + return { + error: `Tool cannot be executed - server requires re-authentication. + +Status: ${statusEntry.status} +The server requires re-authentication. Please re-authorize in the dashboard. + +Unavailable server: ${serverSlug}` + }; +} + +// Allow execution for offline/error (enables recovery detection) +``` + +## Status Transition Triggers + +### Backend-Triggered (Database Updates) + +**Source:** Backend API routes + +| Trigger | New Status | When | +|---------|-----------|------| +| Installation created | `provisioning` | User installs MCP server | +| Config updated | `restarting` | User modifies environment vars/args/headers | +| OAuth callback success | `connecting` | User re-authenticates | +| Health check fails | `offline` | Server unreachable (3-min interval) | +| Credential validation fails | `requires_reauth` | OAuth token invalid | + +### Satellite-Triggered (Event Emission) + +**Source:** Satellite emits `mcp.server.status_changed` events to backend + +| Trigger | New Status | When | +|---------|-----------|------| +| Configure command received | `command_received` | Satellite polls backend | +| Server connection starts | `connecting` | Spawning process or HTTP connect | +| Tool discovery starts | `discovering_tools` | Calling tools/list | +| Tool discovery succeeds | `online` | Discovery completed successfully | +| Tool execution fails (3 retries) | `offline`/`error`/`requires_reauth` | Tool call failed after retries | +| Server recovery detected | `connecting` | Previously offline server responds | +| Stdio crashes 3 times | `permanently_failed` | 3 crashes within 5 minutes | + +## Implementation References + +**Phase 1:** Database schema for status field +**Phase 3:** Backend event handler for status updates +**Phase 4:** Satellite status event emission +**Phase 10:** Tool availability filtering by status +**Phase 17:** Configuration update status transitions +**Phase 18:** Tool execution status updates + auto-recovery + +## Related Documentation + +- [Event 
Emission](/development/satellite/event-emission) - Status change event details +- [Recovery System](/development/satellite/recovery-system) - Automatic recovery logic +- [Tool Discovery](/development/satellite/tool-discovery) - How status affects tool discovery +- [Hierarchical Router](/development/satellite/hierarchical-router) - Status-based tool filtering diff --git a/development/satellite/tool-discovery.mdx b/development/satellite/tool-discovery.mdx index b4cdba1..00abbbe 100644 --- a/development/satellite/tool-discovery.mdx +++ b/development/satellite/tool-discovery.mdx @@ -281,7 +281,7 @@ The satellite uses `estimateMcpServerTokens()` from `token-counter.ts` to calcul - Enable frontend tool catalog display with token consumption metrics - Provide analytics on MCP server complexity and context window usage -See [Event System](/development/satellite/event-system) for event batching and delivery details. +For event payload structure and event batching details, see [Event Emission - mcp.tools.discovered](/development/satellite/event-emission#mcp-tools-discovered). ## Development Considerations @@ -349,4 +349,66 @@ curl http://localhost:3001/api/status/debug - Detailed usage and performance analytics - Cache persistence for faster startup (HTTP only) +## Status Integration + +Tool discovery integrates with the status tracking system to filter tools and enable automatic recovery. Discovery managers call status callbacks on success/failure to update installation status in real-time. + +See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete details on status-based tool filtering and execution blocking. + +## Recovery System + +When offline servers recover, tool discovery is automatically triggered. The satellite preserves existing tools during re-discovery attempts to prevent tool loss on failure. + +See [Recovery System - Recovery Detection](/development/satellite/recovery-system#recovery-detection) for complete recovery logic, retry strategy, and tool preservation implementation. + +## Tool Metadata Events + +Discovered tools are emitted to backend with token count estimates. + +**Event Structure:** +```typescript +eventBus.emit('mcp.tools.discovered', { + installation_id: string, + team_id: string, + tools: [{ + tool_path: string, + name: string, + description?: string, + inputSchema: unknown, + token_count: number // Estimated token usage + }] +}); +``` + +**Token Calculation:** +- Name + description + input schema serialized +- Estimated using character count / 4 (approximate tokens) +- Used for analytics and optimization + +See [Event Emission](/development/satellite/event-emission) for complete event types. + +## Request Logging + +Tool execution is logged with full request/response data for debugging. + +**Logged Information:** +- Tool name and input parameters +- Full MCP server response (captured) +- Response time in milliseconds +- Success/failure status and error messages +- User attribution (who called the tool) + +**Privacy Control:** +Request logging can be disabled per-installation via `settings.request_logging_enabled = false`. + +See [Log Capture](/development/satellite/log-capture) for buffering and storage details. 
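+
+To make the captured fields concrete, here is a minimal sketch of a request log entry wrapped around a tool call. It is illustrative only — the entry shape mirrors the fields listed above, while `executeWithRequestLog` and the commented-out buffering hook are hypothetical names, not the satellite's actual API:
+
+```typescript
+// Sketch: field names mirror the documented log contents; helper names are hypothetical.
+interface RequestLogEntry {
+  tool_name: string;          // e.g., 'github:list-repos'
+  request_params: unknown;    // input parameters sent to the tool
+  tool_response?: unknown;    // only captured when request logging is enabled
+  duration_ms: number;        // response time in milliseconds
+  success: boolean;
+  error_message?: string;
+  user_id?: string;           // attribution: who called the tool
+}
+
+async function executeWithRequestLog(
+  toolName: string,
+  params: unknown,
+  requestLoggingEnabled: boolean,
+  execute: () => Promise<unknown>
+): Promise<unknown> {
+  const start = Date.now();
+  const entry: RequestLogEntry = {
+    tool_name: toolName,
+    request_params: params,
+    duration_ms: 0,
+    success: false,
+  };
+  try {
+    const response = await execute();
+    entry.success = true;
+    // Privacy control: omit the full response unless the installation opted in
+    if (requestLoggingEnabled) {
+      entry.tool_response = response;
+    }
+    return response;
+  } catch (err) {
+    entry.error_message = err instanceof Error ? err.message : String(err);
+    throw err;
+  } finally {
+    entry.duration_ms = Date.now() - start;
+    // queueRequestLog(entry); // hypothetical hook into the satellite's batch buffer
+  }
+}
+```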
+ +## Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status +- [Recovery System](/development/satellite/recovery-system) - Automatic re-discovery on recovery +- [Event Emission](/development/satellite/event-emission) - Tool metadata events +- [Log Capture](/development/satellite/log-capture) - Request logging system +- [Hierarchical Router](/development/satellite/hierarchical-router) - How tools are exposed to MCP clients + The unified tool discovery implementation provides a solid foundation for multi-transport MCP server integration while maintaining simplicity and reliability for development and production use. diff --git a/docs.json b/docs.json index 47479f6..25fd8ef 100644 --- a/docs.json +++ b/docs.json @@ -207,6 +207,15 @@ "/development/satellite/mcp-server-token-injection" ] }, + { + "group": "Status & Health Tracking", + "pages": [ + "/development/satellite/status-tracking", + "/development/satellite/event-emission", + "/development/satellite/log-capture", + "/development/satellite/recovery-system" + ] + }, { "group": "Backend Communication", "pages": [ From 080b90af4e8383d02324e26f2c58068547709256 Mon Sep 17 00:00:00 2001 From: Lasim Date: Thu, 25 Dec 2025 09:49:40 +0100 Subject: [PATCH 2/2] docs(satellite): update documentation for status tracking, health checks, and OAuth token handling --- development/backend/plugins.mdx | 8 +- development/backend/satellite/commands.mdx | 53 ++++- .../backend/satellite/communication.mdx | 186 +++++++++++++++++- development/backend/satellite/events.mdx | 52 ++++- development/satellite/architecture.mdx | 6 +- .../satellite/backend-communication.mdx | 4 +- development/satellite/event-emission.mdx | 17 +- development/satellite/index.mdx | 8 +- development/satellite/log-capture.mdx | 13 +- .../satellite/mcp-server-token-injection.mdx | 4 +- development/satellite/process-management.mdx | 8 +- development/satellite/recovery-system.mdx | 9 +- development/satellite/status-tracking.mdx | 17 +- 13 files changed, 331 insertions(+), 54 deletions(-) diff --git a/development/backend/plugins.mdx b/development/backend/plugins.mdx index e1dbdf4..58e76d1 100644 --- a/development/backend/plugins.mdx +++ b/development/backend/plugins.mdx @@ -313,8 +313,8 @@ The `databaseExtension` property allows your plugin to: #### How Plugin Database Tables Work **Security Architecture:** -- **Phase 1 (Trusted)**: Core migrations run first (static, secure) -- **Phase 2 (Untrusted)**: Plugin tables created dynamically (sandboxed) +- **Stage 1 (Trusted)**: Core migrations run first (static, secure) +- **Stage 2 (Untrusted)**: Plugin tables created dynamically (sandboxed) - **Clear Separation**: Plugin tables cannot interfere with core database structure **Dynamic Table Creation:** @@ -421,7 +421,7 @@ The database initialization follows a strict security-first approach: ``` ┌─────────────────────────────────────────┐ -│ Phase 1: Core System (Trusted) │ +│ Stage 1: Core System (Trusted) │ ├─────────────────────────────────────────┤ │ 1. Apply core migrations │ │ 2. Create core tables │ @@ -430,7 +430,7 @@ The database initialization follows a strict security-first approach: │ ▼ Security Boundary ┌─────────────────────────────────────────┐ -│ Phase 2: Plugin System (Sandboxed) │ +│ Stage 2: Plugin System (Sandboxed) │ ├─────────────────────────────────────────┤ │ 1. Generate CREATE TABLE SQL │ │ 2. 
Drop existing plugin tables │ diff --git a/development/backend/satellite/commands.mdx b/development/backend/satellite/commands.mdx index 0608d69..c8cdeaf 100644 --- a/development/backend/satellite/commands.mdx +++ b/development/backend/satellite/commands.mdx @@ -32,7 +32,7 @@ The system supports 5 command types defined in the `command_type` enum: | `spawn` | Start MCP server process | Launch HTTP proxy or stdio process | | `kill` | Stop MCP server process | Terminate process gracefully | | `restart` | Restart MCP server | Stop and start process | -| `health_check` | Verify server health | Call tools/list to check connectivity | +| `health_check` | Verify server health and validate credentials | Check connectivity or validate OAuth tokens | ### Configure Commands @@ -74,6 +74,30 @@ interface CommandPayload { } ``` +## Status Changes Triggered by Commands + +Commands trigger installation status changes through satellite event emission: + +| Command | Status Before | Status After | When | +|---------|--------------|--------------|------| +| `configure` (install) | N/A | `provisioning` → `command_received` → `connecting` | Installation creation flow | +| `configure` (update) | `online` | `restarting` → `online` | Configuration change applied | +| `configure` (delete) | Any | Process terminated | Installation removal | +| `health_check` (credential) | `online` | `requires_reauth` | OAuth token invalid | +| `restart` | `online` | `restarting` → `online` | Manual restart requested | + +**Status Lifecycle on Installation**: +1. Backend creates installation → status=`provisioning` +2. Backend sends `configure` command → status=`command_received` +3. Satellite connects to server → status=`connecting` +4. Satellite discovers tools → status=`discovering_tools` +5. Satellite syncs tools to backend → status=`syncing_tools` +6. Process complete → status=`online` + +For complete status transition documentation, see [Backend Events - Status Values](/development/backend/satellite/events#mcp-server-status_changed). + +--- + ## Command Event Types All `configure` commands include an `event` field in the payload for tracking and logging: @@ -168,6 +192,14 @@ await satelliteCommandService.notifyMcpRecovery( **Payload**: `event: 'mcp_recovery'` +**Status Flow**: +- Triggered by health check detecting offline installation +- Sets status to `connecting` +- Satellite rediscovers tools +- Status progresses: offline → connecting → discovering_tools → online + +For complete recovery system documentation, see [Backend Communication - Auto-Recovery](/development/backend/satellite/communication#auto-recovery-system). + ## Critical Pattern **ALWAYS use the correct convenience method**: @@ -247,9 +279,22 @@ When satellites receive commands: 3. Execute spawn sequence **For `health_check` commands**: -1. Call tools/list on target server -2. Verify response -3. Report health status +1. Check `payload.check_type` field: + - `connectivity` (default): Call tools/list to verify server responds + - `credential_validation`: Validate OAuth tokens for installation +2. Execute appropriate validation +3. 
Report health status via `mcp.server.status_changed` event: + - `online` - Health check passed + - `requires_reauth` - OAuth token expired/revoked + - `error` - Validation failed with error + +**Credential Validation Flow**: +- Backend cron job sends `health_check` command with `check_type: 'credential_validation'` +- Satellite validates OAuth token (performs token refresh test) +- Emits status event based on validation result +- Backend updates `mcpServerInstallations.status` and `last_credential_check_at` + +For satellite-side credential validation implementation, see [Satellite OAuth Authentication](/development/satellite/oauth-authentication). ## Example Usage diff --git a/development/backend/satellite/communication.mdx b/development/backend/satellite/communication.mdx index 3eefcce..a6a9dde 100644 --- a/development/backend/satellite/communication.mdx +++ b/development/backend/satellite/communication.mdx @@ -106,20 +106,20 @@ The system uses three distinct communication patterns: ### Security Architecture -The satellite pairing process implements a secure **two-phase JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication). +The satellite pairing process implements a secure **two-step JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication). -**Phase 1: Token Generation** +**Step 1: Token Generation** - Administrators generate temporary registration tokens through admin APIs - Scope-specific tokens (global vs team) with cryptographic signatures - Token management endpoints for generation, listing, and revocation -**Phase 2: Satellite Registration** +**Step 2: Satellite Registration** - Satellites authenticate using `Authorization: Bearer deploystack_satellite_*` headers - Backend validates JWT tokens with single-use consumption - Permanent API keys issued after successful token validation - Token consumed to prevent replay attacks -**Breaking Change**: As of Phase 3 implementation, all new satellite registrations require valid registration tokens. The open registration system has been secured. +**Note**: All new satellite registrations require valid registration tokens. The open registration system has been secured. ### Registration Middleware @@ -261,6 +261,153 @@ Configuration respects team boundaries and isolation: - Team-defined security policies - Internal resource access settings +## Frontend API Endpoints + +The backend provides REST and SSE endpoints for frontend access to installation status, logs, and requests. 
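+
+As a quick illustration of how a client combines the two channels, the sketch below loads an initial status snapshot via REST and then subscribes to the log stream via SSE. The routes come from the subsections that follow; the response field names are assumed, not guaranteed:
+
+```typescript
+// Sketch only — endpoint paths are documented below; the response shape is assumed.
+async function monitorInstallation(teamId: string, installationId: string): Promise<EventSource> {
+  const base = `/api/teams/${teamId}/mcp/installations/${installationId}`;
+
+  // Initial snapshot via REST (pull)
+  const res = await fetch(`${base}/status`);
+  const snapshot = await res.json();
+  console.log('status:', snapshot.status, snapshot.status_message);
+
+  // Live updates via SSE (push); EventSource reconnects automatically on connection loss
+  const stream = new EventSource(`${base}/logs/stream`);
+  stream.onmessage = (event) => console.log('log entry:', JSON.parse(event.data));
+  return stream;
+}
+```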
+ +### Status & Monitoring Endpoints + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/status`** +- Returns current installation status, status message, and last update timestamp +- Used by frontend for real-time status badges and progress indicators + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs`** +- Returns paginated server logs (stderr output, connection errors) +- Query params: `limit`, `offset` for pagination +- Limited to 100 lines per installation (enforced by cleanup cron job) + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests`** +- Returns paginated request logs (tool execution history) +- Includes request params, duration, success status +- Response data included if `request_logging_enabled=true` + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/{requestId}`** +- Returns detailed request log for specific execution +- Includes full request/response payloads when available + +### Settings Management + +**PATCH `/api/teams/{teamId}/mcp/installations/{installationId}/settings`** +- Updates installation settings (stored in `mcpServerInstallations.settings` jsonb column) +- Settings distributed to satellites via config endpoint +- Current settings: + - `request_logging_enabled` (boolean) - Controls capture of tool responses + +### Real-Time Streaming (SSE) + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs/stream`** +- Server-Sent Events endpoint for real-time log streaming +- Frontend subscribes for live stderr output +- Auto-reconnects on connection loss + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/stream`** +- Server-Sent Events endpoint for real-time request log streaming +- Frontend subscribes for live tool execution updates +- Includes duration, status, and optionally response data + +**SSE vs REST Comparison**: +| Feature | REST Endpoints | SSE Endpoints | +|---------|---------------|---------------| +| Use Case | Historical data, pagination | Real-time updates | +| Connection | Request/response | Persistent connection | +| Data Flow | Pull (client requests) | Push (server sends) | +| Frontend Usage | Initial load, manual refresh | Live monitoring | + +**SSE Controller Implementation**: `services/backend/src/controllers/mcp/sse.controller.ts` + +**Routes Implementation**: `services/backend/src/routes/api/teams/mcp/installations.routes.ts` + +--- + +## Health Check & Recovery Systems + +### Cumulative Health Check System + +**Purpose**: Template-level health aggregation across all installations of an MCP server. + +**McpHealthCheckService** (`services/backend/src/services/mcp-health-check.service.ts`): +- Aggregates health status from all installations of each MCP server template +- Updates `mcpServers.health_status` based on installation health +- Provides template-level health visibility in admin dashboard + +**Cron Job**: `mcp-health-check` runs every 3 minutes +- Implementation: `services/backend/src/jobs/mcp-health-check.job.ts` +- Checks all MCP server templates +- Updates template health status for admin visibility + +### Credential Validation System + +**Purpose**: Per-installation OAuth token validation to detect expired/revoked credentials. 
+ +**McpCredentialValidationWorker** (`services/backend/src/workers/mcp-credential-validation.worker.ts`): +- Validates OAuth tokens for each installation +- Sends `health_check` command to satellite with `check_type: 'credential_validation'` +- Satellite performs OAuth validation and reports status + +**Cron Job**: `mcp-credential-validation` runs every 1 minute +- Implementation: `services/backend/src/jobs/mcp-credential-validation.job.ts` +- Validates installations on 15-minute rotation +- Triggers `requires_reauth` status on validation failure + +**Health Check Command Payload**: +```json +{ + "commandType": "health_check", + "priority": "immediate", + "payload": { + "check_type": "credential_validation", + "installation_id": "inst_123", + "team_id": "team_xyz" + } +} +``` + +Satellite validates credentials and emits `mcp.server.status_changed` with status: +- `online` - Credentials valid +- `requires_reauth` - OAuth token expired/revoked +- `error` - Validation failed with error + +### Auto-Recovery System + +**Recovery Trigger**: +- Health check system detects offline installations +- Backend calls `notifyMcpRecovery(installation_id, team_id)` +- Sends command to satellite: Set status=`connecting`, rediscover tools +- Status progression: offline → connecting → discovering_tools → online + +**Tool Execution Recovery**: +- Satellite detects recovery during tool execution (offline server responds) +- Emits immediate status change event (doesn't wait for health check) +- Triggers asynchronous re-discovery + +For satellite-side recovery implementation, see [Satellite Recovery System](/development/satellite/recovery-system). + +--- + +## Background Cron Jobs + +The backend runs three MCP-related cron jobs for maintenance and monitoring: + +**cleanup-mcp-server-logs**: +- **Schedule**: Every 10 minutes +- **Purpose**: Enforce 100-line limit per installation in `mcpServerLogs` table +- **Action**: Deletes oldest logs beyond 100-line limit +- **Implementation**: `services/backend/src/jobs/cleanup-mcp-server-logs.job.ts` + +**mcp-health-check**: +- **Schedule**: Every 3 minutes +- **Purpose**: Template-level health aggregation +- **Action**: Updates `mcpServers.health_status` column +- **Implementation**: `services/backend/src/jobs/mcp-health-check.job.ts` + +**mcp-credential-validation**: +- **Schedule**: Every 1 minute +- **Purpose**: Detect expired/revoked OAuth tokens +- **Action**: Sends `health_check` commands to satellites +- **Implementation**: `services/backend/src/jobs/mcp-credential-validation.job.ts` + +--- + ## Database Schema Integration ### Core Table Structure @@ -298,6 +445,37 @@ The satellite system integrates with existing DeployStack schema through 5 speci - Alert generation and notification triggers - Historical health trend analysis +### New Columns Added (Status & Health Tracking System) + +**mcpServerInstallations** table: +- `status` (text) - Current installation status (11 possible values) +- `status_message` (text, nullable) - Human-readable status context or error details +- `status_updated_at` (timestamp) - Last status change timestamp +- `last_health_check_at` (timestamp, nullable) - Last health check execution time +- `last_credential_check_at` (timestamp, nullable) - Last credential validation time +- `settings` (jsonb, nullable) - Generic settings object (e.g., `request_logging_enabled`) + +**mcpServers** table: +- `health_status` (text, nullable) - Template-level aggregated health status +- `last_health_check_at` (timestamp, nullable) - Last template health 
check time +- `health_check_error` (text, nullable) - Last health check error message + +**mcpServerLogs** table: +- Stores batched stderr logs from satellites +- 100-line limit per installation (enforced by cleanup cron job) +- Fields: `installation_id`, `team_id`, `log_level`, `message`, `timestamp` + +**mcpRequestLogs** table: +- Stores batched tool execution logs +- `tool_response` (jsonb, nullable) - MCP server response data +- Privacy control: Only captured when `request_logging_enabled=true` +- Fields: `installation_id`, `team_id`, `tool_name`, `request_params`, `tool_response`, `duration_ms`, `success`, `error_message`, `timestamp` + +**mcpToolMetadata** table: +- Stores discovered tools with token counts +- Used for hierarchical router token savings calculations +- Fields: `installation_id`, `server_slug`, `tool_name`, `description`, `input_schema`, `token_count`, `discovered_at` + ### Team Isolation in Data Model All satellite data respects team boundaries: diff --git a/development/backend/satellite/events.mdx b/development/backend/satellite/events.mdx index 7524a68..d77b0e4 100644 --- a/development/backend/satellite/events.mdx +++ b/development/backend/satellite/events.mdx @@ -197,18 +197,23 @@ Updates `mcpServerInstallations` table when server status changes during install **Optional Fields**: `status_message` (string, human-readable context or error details) -**Status Values**: +**Status Values** (11 total): - `provisioning` - Installation created, waiting for satellite - `command_received` - Satellite acknowledged install command - `connecting` - Satellite connecting to MCP server - `discovering_tools` - Tool discovery in progress - `syncing_tools` - Sending discovered tools to backend - `online` - Server healthy and responding +- `restarting` - Configuration changed, server restarting - `offline` - Server unreachable - `error` - Connection failed with specific error - `requires_reauth` - OAuth token expired/revoked - `permanently_failed` - Process crashed 3+ times in 5 minutes +**Handler Implementation**: `services/backend/src/events/handlers/mcp/status-changed.handler.ts` + +For satellite-side status detection logic and lifecycle flows, see [Satellite Status Tracking](/development/satellite/status-tracking). + **Emission Points**: - Success path: After successful tool discovery → status='online' - Failure path: On connection errors → status='offline', 'error', or 'requires_reauth' @@ -225,6 +230,48 @@ Inserts record into `satelliteUsageLogs` for analytics and audit trails. **Optional Fields**: `error_message` (string, only present when success=false) +### Logging Events + +#### mcp.server.logs + +Inserts batched stderr output from MCP servers into `mcpServerLogs` table for debugging and monitoring. + +**Business Logic**: Captures stderr output, connection errors, and process lifecycle events. Limited to 100 lines per installation via cleanup cron job. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `logs` (array of log entries) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/server-logs.handler.ts` + +Event batching strategy (3-second interval, max 20 per batch) is documented in [Satellite Event Emission](/development/satellite/event-emission). + +#### mcp.request.logs + +Inserts batched tool execution logs into `mcpRequestLogs` table with full request/response data for audit trails. + +**Business Logic**: Captures tool execution with request parameters, response data, duration, and success status. 
Privacy controlled via `mcpServerInstallations.settings.request_logging_enabled`. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `tool_name`, `request_params`, `duration_ms`, `success` + +**Optional Fields**: `tool_response` (jsonb), `error_message` (string) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/request-logs.handler.ts` + +**Database Storage**: `mcpRequestLogs.tool_response` column stores MCP server responses when request logging is enabled. + +### Tool Discovery Events + +#### mcp.tools.discovered + +Updates `mcpToolMetadata` table with discovered tools, token counts, and tool schemas from MCP servers. + +**Business Logic**: Stores tool metadata for team visibility, hierarchical router token savings calculations, and frontend tool catalog display. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `server_slug`, `tool_count`, `total_tokens`, `tools` (array) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/tools-discovered.handler.ts` + +For satellite-side tool discovery implementation, see [Satellite Tool Discovery](/development/satellite/tool-discovery). + ## Creating New Event Handlers ### Handler Template @@ -339,6 +386,9 @@ Events route to existing business tables based on their purpose: | `mcp.server.crashed` | `satelliteProcesses` | Update status='failed', log error details | | `mcp.server.status_changed` | `mcpServerInstallations` | Update status, status_message, status_updated_at | | `mcp.tool.executed` | `satelliteUsageLogs` | Insert usage record with metrics | +| `mcp.server.logs` | `mcpServerLogs` | Insert batched stderr logs (100-line limit) | +| `mcp.request.logs` | `mcpRequestLogs` | Insert tool execution logs with request/response | +| `mcp.tools.discovered` | `mcpToolMetadata` | Update tool metadata with token counts | ### Transaction Strategy diff --git a/development/satellite/architecture.mdx b/development/satellite/architecture.mdx index 0183d01..7df1c60 100644 --- a/development/satellite/architecture.mdx +++ b/development/satellite/architecture.mdx @@ -442,14 +442,14 @@ For testing the hierarchical router (tool discovery and execution), see [Hierarc ## Implementation Status -The satellite service has completed **Phase 1: MCP Transport Implementation** and **Phase 4: Backend Integration**. Current implementation provides: +The satellite service has completed MCP Transport Implementation and Backend Integration. Current implementation provides: -**Phase 1 - MCP Transport Layer:** +**MCP Transport Layer:** - **Complete MCP Transport Layer**: SSE, SSE Messaging, Streamable HTTP - **Session Management**: Cryptographically secure with automatic cleanup - **JSON-RPC 2.0 Compliance**: Full protocol support with error handling -**Phase 4 - Backend Integration:** +**Backend Integration:** - **Command Polling Service**: Adaptive polling with three modes (normal/immediate/error) - **Dynamic Configuration Management**: Replaces hardcoded MCP server configurations - **Command Processing**: HTTP MCP server management (spawn/kill/restart/health_check) diff --git a/development/satellite/backend-communication.mdx b/development/satellite/backend-communication.mdx index b066638..f2d9018 100644 --- a/development/satellite/backend-communication.mdx +++ b/development/satellite/backend-communication.mdx @@ -286,7 +286,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions. ### Authentication Flow -**Registration Phase:** +**Registration:** 1. 
Admin generates JWT registration token via backend API 2. Satellite includes token in Authorization header during registration 3. Backend validates token signature, scope, and expiration @@ -295,7 +295,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions. For detailed token validation process, see [Registration Security](/development/backend/satellite-communication#satellite-pairing-process). -**Operational Phase:** +**Ongoing Operations:** 1. All requests include `Authorization: Bearer {api_key}` 2. Backend validates API key and satellite scope 3. Team context extracted from satellite registration diff --git a/development/satellite/event-emission.mdx b/development/satellite/event-emission.mdx index b4d96ba..46ae8e9 100644 --- a/development/satellite/event-emission.mdx +++ b/development/satellite/event-emission.mdx @@ -409,14 +409,15 @@ Each event type has a dedicated backend handler: - Emits `mcp.tools.discovered` after successful discovery - Coordinates status callbacks from discovery managers -## Implementation References - -**Phase 3:** Backend event handler system -**Phase 4:** Satellite status event emission -**Phase 7:** Server and request log batching -**Phase 10:** Tool metadata event emission -**Phase 13:** Stdio permanently_failed event -**Phase 18:** Tool execution failure status events +## Implementation Components + +The event emission system consists of several integrated components: +- Backend event handler system +- Satellite status event emission +- Server and request log batching +- Tool metadata event emission +- Stdio permanently_failed event +- Tool execution failure status events ## Related Documentation diff --git a/development/satellite/index.mdx b/development/satellite/index.mdx index f99b41e..1315a32 100644 --- a/development/satellite/index.mdx +++ b/development/satellite/index.mdx @@ -214,7 +214,7 @@ npm run release # Release management ## Implemented Features -### Phase 2: MCP Server Process Management +### MCP Server Process Management - **Process Lifecycle**: Spawn, monitor, auto-restart (max 3), and terminate MCP servers - **stdio Communication**: Full JSON-RPC 2.0 protocol over stdin/stdout - **HTTP Proxy**: Reverse proxy for external MCP server endpoints working @@ -223,20 +223,20 @@ npm run release # Release management - **Tool Discovery**: Automatic tool caching from both HTTP and stdio servers - **Team-Grouped Heartbeat**: processes_by_team reporting every 30 seconds -### Phase 3: Team Isolation +### Team Isolation - **nsjail Sandboxing**: Complete process isolation with built-in resource limits - **Namespace Isolation**: PID, mount, UTS, IPC namespaces per team - **Filesystem Isolation**: Team-specific read-only and writable directories - **Credential Management**: Secure environment injection via nsjail -### Phase 4: Backend Integration +### Backend Integration - **HTTP Polling**: Outbound communication with DeployStack Backend - **Configuration Sync**: Dynamic configuration updates from Backend - **Status Reporting**: Real-time satellite health and usage metrics - **Command Processing**: Execute Backend commands with acknowledgment - **Event System**: Real-time event emission with automatic batching (10 event types) -### Phase 5: Enterprise Features +### Enterprise Features - **OAuth 2.1 Authentication**: Resource server with token introspection - **Audit Logging**: Complete audit trails for compliance - **Multi-Region Support**: Global satellite deployment diff --git a/development/satellite/log-capture.mdx 
b/development/satellite/log-capture.mdx
index 4f83bfd..6a01d4a 100644
--- a/development/satellite/log-capture.mdx
+++ b/development/satellite/log-capture.mdx
@@ -163,7 +163,7 @@ Request logs capture tool execution with full request parameters and server resp
 For each tool execution:
 - Tool name (e.g., `github:list-repos`)
 - Input parameters sent to tool
-- **Full response from MCP server** (captured in Phase 14)
+- **Full response from MCP server** (when request logging is enabled)
 - Response time in milliseconds
 - Success/failure status
 - Error message (if failed)
@@ -436,12 +436,13 @@ cleanup() {
 }
 ```
 
-## Implementation References
+## Implementation Components
 
-**Phase 7:** Server and request log batching implementation
-**Phase 14:** Request logging toggle and tool response capture
-**Phase 5:** Backend log tables and event handlers
-**Phase 6:** 100-line cleanup job
+The log capture system consists of several integrated components:
+- Server and request log batching implementation
+- Request logging toggle and tool response capture
+- Backend log tables and event handlers
+- 100-line cleanup job
 
 ## Related Documentation
 
diff --git a/development/satellite/mcp-server-token-injection.mdx b/development/satellite/mcp-server-token-injection.mdx
index f1cb108..75cfd8a 100644
--- a/development/satellite/mcp-server-token-injection.mdx
+++ b/development/satellite/mcp-server-token-injection.mdx
@@ -341,7 +341,7 @@ private isCacheValid(cachedAt: number, expiresAt: string | null): boolean {
 async handleHttpToolCall(serverName: string, originalToolName: string, args: unknown) {
   const config = this.serverConfigs.get(serverName);
 
-  // Phase 10: OAuth token injection for HTTP/SSE MCP servers
+  // OAuth token injection for HTTP/SSE MCP servers
   let headers: Record<string, string> = {};
 
   // Add regular headers from config (API keys, custom headers, etc.)
@@ -447,7 +447,7 @@ async handleHttpToolCall(serverName: string, originalToolName: string, args: unk
 ```typescript
 // From remote-tool-discovery-manager.ts:376-440
 async discoverServerTools(serverName: string, config: ServerConfig) {
-  // Phase 10: OAuth token injection for tool discovery
+  // OAuth token injection for tool discovery
   let headers: Record<string, string> = {};
 
   // Add regular headers from config (API keys, custom headers, etc.)
diff --git a/development/satellite/process-management.mdx b/development/satellite/process-management.mdx
index 5c8ceb3..6a10fc3 100644
--- a/development/satellite/process-management.mdx
+++ b/development/satellite/process-management.mdx
@@ -147,17 +147,17 @@ All communication uses newline-delimited JSON following JSON-RPC 2.0 specificati
 
 ### Graceful Termination
 
-Process termination follows a two-phase graceful shutdown approach to ensure clean process exit and proper resource cleanup.
+Process termination follows a two-step graceful shutdown approach to ensure clean process exit and proper resource cleanup. 
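+
+A minimal sketch of the two-step pattern, assuming a Node.js child process (the satellite's actual implementation and its defaults are described in the steps below):
+
+```typescript
+import type { ChildProcess } from 'node:child_process';
+
+// Sketch: SIGTERM first, SIGKILL only if the process outlives the timeout.
+function terminateGracefully(proc: ChildProcess, timeoutMs = 10_000): Promise<void> {
+  return new Promise((resolve) => {
+    const killTimer = setTimeout(() => {
+      proc.kill('SIGKILL'); // Step 2: force termination — cannot be caught or ignored
+    }, timeoutMs);
+
+    proc.once('exit', () => {
+      clearTimeout(killTimer);
+      resolve();
+    });
+
+    proc.kill('SIGTERM'); // Step 1: request graceful shutdown
+  });
+}
+```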
-#### Termination Phases +#### Termination Steps -**Phase 1: SIGTERM (Graceful Shutdown)** +**Step 1: SIGTERM (Graceful Shutdown)** - Send SIGTERM signal to the process - Process has 10 seconds (default timeout) to shut down gracefully - Process can complete in-flight operations and cleanup resources - Wait for process to exit voluntarily -**Phase 2: SIGKILL (Force Termination)** +**Step 2: SIGKILL (Force Termination)** - If process doesn't exit within timeout period - Send SIGKILL signal to force immediate termination - Guaranteed process termination (cannot be caught or ignored) diff --git a/development/satellite/recovery-system.mdx b/development/satellite/recovery-system.mdx index 84527a5..bf8123d 100644 --- a/development/satellite/recovery-system.mdx +++ b/development/satellite/recovery-system.mdx @@ -356,11 +356,12 @@ Some failures cannot auto-recover: See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays). -## Implementation References +## Implementation Components -**Phase 13:** Stdio auto-recovery and permanently_failed status -**Phase 18:** Tool execution retry logic and recovery detection -**Phase 8:** Health check recovery via backend +The recovery system consists of several integrated components: +- Stdio auto-recovery and permanently_failed status +- Tool execution retry logic and recovery detection +- Health check recovery via backend ## Related Documentation diff --git a/development/satellite/status-tracking.mdx b/development/satellite/status-tracking.mdx index a530c55..710408d 100644 --- a/development/satellite/status-tracking.mdx +++ b/development/satellite/status-tracking.mdx @@ -267,14 +267,15 @@ Unavailable server: ${serverSlug}` | Server recovery detected | `connecting` | Previously offline server responds | | Stdio crashes 3 times | `permanently_failed` | 3 crashes within 5 minutes | -## Implementation References - -**Phase 1:** Database schema for status field -**Phase 3:** Backend event handler for status updates -**Phase 4:** Satellite status event emission -**Phase 10:** Tool availability filtering by status -**Phase 17:** Configuration update status transitions -**Phase 18:** Tool execution status updates + auto-recovery +## Implementation Components + +The status tracking system consists of several integrated components: +- Database schema for status field +- Backend event handler for status updates +- Satellite status event emission +- Tool availability filtering by status +- Configuration update status transitions +- Tool execution status updates with auto-recovery ## Related Documentation