From 195f35ac65a3aad3963640dc69a96869cc45ad02 Mon Sep 17 00:00:00 2001 From: Lasim Date: Thu, 25 Dec 2025 09:15:30 +0100 Subject: [PATCH 1/2] docs(satellite): add comprehensive status & health tracking documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add 4 new documentation files covering the MCP Status & Health Tracking System (18 implementation phases) and update 6 existing files with cross-references to maintain modular documentation structure. New files: - status-tracking.mdx: 11-state status system, lifecycle flows, tool filtering - event-emission.mdx: Event types, payloads, batching configuration - log-capture.mdx: Server/request logging, buffering, privacy controls - recovery-system.mdx: Automatic recovery detection, retry logic, tool preservation Updated files with cross-links: - architecture.mdx: Add Status Tracking, Event System, Log Capture sections - tool-discovery.mdx: Add Status Integration, Recovery System sections - process-management.mdx: Add Status Events, Log Buffering sections - backend-communication.mdx: Add Events vs Heartbeat, Health Check sections - commands.mdx: Add health_check command documentation - hierarchical-router.mdx: Add Status-Based Tool Filtering section Navigation: - docs.json: Add "Status & Health Tracking" group to Satellite Development tab Technical details: - 11 status values (provisioning → online → offline/error/requires_reauth) - Event batching: 3-second interval, max 20 per batch - Retry logic: exponential backoff (500ms, 1s, 2s) - Log storage: 100-line limit per installation - Request logging privacy control via settings --- development/satellite/architecture.mdx | 47 +- .../satellite/backend-communication.mdx | 68 ++- development/satellite/commands.mdx | 38 ++ development/satellite/event-emission.mdx | 426 +++++++++++++++++ development/satellite/hierarchical-router.mdx | 69 +-- development/satellite/log-capture.mdx | 450 ++++++++++++++++++ development/satellite/process-management.mdx | 74 +-- development/satellite/recovery-system.mdx | 370 ++++++++++++++ development/satellite/status-tracking.mdx | 284 +++++++++++ development/satellite/tool-discovery.mdx | 64 ++- docs.json | 9 + 11 files changed, 1787 insertions(+), 112 deletions(-) create mode 100644 development/satellite/event-emission.mdx create mode 100644 development/satellite/log-capture.mdx create mode 100644 development/satellite/recovery-system.mdx create mode 100644 development/satellite/status-tracking.mdx diff --git a/development/satellite/architecture.mdx b/development/satellite/architecture.mdx index 9407649..0183d01 100644 --- a/development/satellite/architecture.mdx +++ b/development/satellite/architecture.mdx @@ -284,31 +284,36 @@ For complete implementation details, see [Backend Polling Implementation](/devel ### Real-Time Event System -**Event Emission with Batching:** -``` -Satellite Operations EventBus Backend - │ │ │ - │─── mcp.server.started ──▶│ │ - │─── mcp.tool.executed ───▶│ [Queue] │ - │─── mcp.client.connected ─▶│ │ - │ [Every 3 seconds] │ - │ │ │ - │ │─── POST /events ───▶│ - │ │◀─── 200 OK ─────────│ -``` - -**Event Features:** -- **Immediate Emission**: Events emitted when actions occur (not delayed by 30s heartbeat) -- **Automatic Batching**: Events collected for 3 seconds, then sent as single batch (max 100 events) -- **Memory Management**: In-memory queue (10,000 event limit) with overflow protection -- **Graceful Error Handling**: 429 exponential backoff, 400 drops invalid events, 500/network errors 
retry -- **10 Event Types**: Server lifecycle, client connections, tool discovery, configuration updates +The satellite emits typed events for status changes, logs, and tool metadata. Events enable real-time monitoring without polling. **Difference from Heartbeat:** - **Heartbeat** (every 30s): Aggregate metrics, system health, resource usage -- **Events** (immediate): Point-in-time occurrences, user actions, precise timestamps +- **Events** (immediate): Point-in-time status updates, precise timestamps + +See [Event Emission](/development/satellite/event-emission) for complete event types, payloads, and batching configuration. + +### Status Tracking System + +The satellite tracks MCP server installation health through an 11-state status system that drives tool availability and automatic recovery. + +**Status Values:** +- Installation lifecycle: `provisioning`, `command_received`, `connecting`, `discovering_tools`, `syncing_tools` +- Healthy state: `online` (tools available) +- Configuration changes: `restarting` +- Failure states: `offline`, `error`, `requires_reauth`, `permanently_failed` + +**Status Integration:** +- **Tool Filtering**: Tools from non-online servers hidden from discovery +- **Auto-Recovery**: Offline servers auto-recover when responsive +- **Event Emission**: Status changes emitted immediately to backend + +See [Status Tracking](/development/satellite/status-tracking) for complete status lifecycle and transitions. + +### Log Capture System + +The satellite captures and batches two types of logs for debugging and monitoring: **server logs** (stderr output) and **request logs** (tool execution with full request/response data). -For complete event system documentation, see [Event System](/development/satellite/event-system). +See [Log Capture](/development/satellite/log-capture) for buffering implementation, batching configuration, backend storage limits, and privacy controls. ## Security Architecture diff --git a/development/satellite/backend-communication.mdx b/development/satellite/backend-communication.mdx index 1f09bcf..b066638 100644 --- a/development/satellite/backend-communication.mdx +++ b/development/satellite/backend-communication.mdx @@ -157,11 +157,11 @@ For detailed event system documentation, see [Event System](/development/satelli - Performance metrics collection **Terminate Process:** -- Graceful shutdown with SIGTERM -- Force kill with SIGKILL after timeout - Resource cleanup and deallocation - Final status report to Backend +See [Process Management - Graceful Termination](/development/satellite/process-management#graceful-termination) for SIGTERM/SIGKILL shutdown details. + ## Internal Architecture ### Five Core Components @@ -375,6 +375,70 @@ server.log.info({ 4. Add comprehensive monitoring and alerting 5. End-to-end testing and performance validation +## Events vs Heartbeat + +The satellite communicates status and metrics through two distinct channels: + +**Events (Immediate):** +- Emitted when actions occur (not delayed by heartbeat interval) +- Point-in-time status updates with precise timestamps +- Batched automatically (3-second interval, max 20 per batch) +- Types: Status changes, logs, tool metadata, lifecycle events + +**Heartbeat (Periodic, every 30s):** +- Aggregate metrics and system health +- Resource usage statistics +- Overall satellite status + +See [Event Emission](/development/satellite/event-emission) for complete event types and batching strategy. 
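+
+To make the two channels concrete, here is a minimal sketch of the batching rule that events follow (3-second window, max 20 per batch). This is illustrative only — the class name and `send` callback are assumptions for the sketch, not the satellite's actual API:
+
+```typescript
+// Minimal sketch of the event batching rule: flush after 3 seconds,
+// or immediately once 20 events are queued. Illustrative only.
+class EventBatcher {
+  private queue: unknown[] = [];
+  private timer: NodeJS.Timeout | null = null;
+
+  constructor(
+    private send: (batch: unknown[]) => void,
+    private intervalMs = 3000,
+    private maxSize = 20
+  ) {}
+
+  add(event: unknown): void {
+    this.queue.push(event);
+    if (this.queue.length >= this.maxSize) {
+      this.flush(); // full batch → emit immediately
+    } else if (!this.timer) {
+      // first queued event opens the 3-second window
+      this.timer = setTimeout(() => this.flush(), this.intervalMs);
+    }
+  }
+
+  private flush(): void {
+    if (this.timer) {
+      clearTimeout(this.timer);
+      this.timer = null;
+    }
+    if (this.queue.length === 0) return;
+    this.send(this.queue);
+    this.queue = [];
+  }
+}
+```
+
+Heartbeat, by contrast, needs no queue at all: it is a fixed 30-second interval that assembles aggregate metrics regardless of activity.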
+ +## Health Check Command + +The backend sends `health_check` commands for credential validation: + +**Command Structure:** +```typescript +{ + commandType: 'health_check', + priority: 'immediate', + payload: { + check_type: 'credential_validation', + installation_id: string, + team_id: string + } +} +``` + +**Satellite Action:** +- Calls `tools/list` on MCP server with credentials +- Detects auth errors (401, 403) +- Emits `requires_reauth` status if validation fails + +See [Commands](/development/satellite/commands) for complete command reference. + +## Recovery Commands + +When offline servers recover, backend sends recovery commands: + +**Command Structure:** +```typescript +{ + commandType: 'configure', + priority: 'high', + payload: { + event: 'mcp_recovery', + installation_id: string, + team_id: string + } +} +``` + +**Satellite Action:** +- Triggers re-discovery for the recovered server +- Status progresses: `offline` → `connecting` → `discovering_tools` → `online` + +See [Recovery System](/development/satellite/recovery-system) for automatic recovery logic. + The satellite communication system is designed for enterprise deployment with complete team isolation, resource management, and audit logging while maintaining the developer experience that defines the DeployStack platform. diff --git a/development/satellite/commands.mdx b/development/satellite/commands.mdx index ff428df..cb328e1 100644 --- a/development/satellite/commands.mdx +++ b/development/satellite/commands.mdx @@ -115,6 +115,44 @@ Each satellite command contains: 4. Restart affected components 5. Verify system integrity post-update +### health_check + +**Purpose**: Validates MCP server credentials and connectivity + +**Priority**: `immediate` + +**Triggered By**: +- Backend credential validation cron (every 1 minute) +- Manual credential testing +- OAuth token expiration detection + +**Payload Structure**: +```json +{ + "check_type": "credential_validation", + "installation_id": "installation-uuid", + "team_id": "team-uuid" +} +``` + +**Satellite Actions**: +1. Find MCP server configuration by installation_id +2. Skip stdio servers (no HTTP credentials to validate) +3. Build HTTP request with configured credentials (headers, query params) +4. Call `tools/list` with 15-second timeout +5. Detect authentication errors: + - HTTP 401/403 responses + - Error messages containing "auth", "unauthorized", "forbidden" +6. Emit status event: + - On auth failure → `requires_reauth` status + - On success → credentials valid (no status change) + +**Error Detection Patterns**: +- HTTP status codes: 401, 403 +- Response body keywords: "auth", "unauthorized", "forbidden", "token", "credentials" + +See [Status Tracking](/development/satellite/status-tracking) for credential validation status flow. + ## Command Lifecycle ### Creation diff --git a/development/satellite/event-emission.mdx b/development/satellite/event-emission.mdx new file mode 100644 index 0000000..b4d96ba --- /dev/null +++ b/development/satellite/event-emission.mdx @@ -0,0 +1,426 @@ +--- +title: Event Emission +description: Events emitted by the satellite to communicate with the backend +--- + +# Event Emission + +The satellite communicates with the backend through a centralized EventBus that emits typed events. These events enable real-time status updates, log streaming, and tool metadata synchronization without polling. 
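+
+As a rough sketch of what "typed events" means here, the registry maps each event name to its payload shape, and `emit` is constrained by that map. The simplified bus below is illustrative only — the production implementation lives in `services/satellite/src/events/event-bus.ts`, and the real registry (shown under Event Registry below) covers all event types:
+
+```typescript
+// Simplified sketch of a typed EventBus. Reduced to one event type;
+// the real EventDataMap covers 13 event types.
+interface EventDataMap {
+  'mcp.server.status_changed': {
+    installation_id: string;
+    team_id: string;
+    status: string;
+    timestamp: string;
+  };
+}
+
+type Handler<T> = (data: T) => void;
+
+class EventBus {
+  private static instance: EventBus;
+  private handlers = new Map<keyof EventDataMap, Handler<never>[]>();
+
+  static getInstance(): EventBus {
+    return (EventBus.instance ??= new EventBus());
+  }
+
+  // The event name parameter constrains the payload shape at compile time
+  emit<K extends keyof EventDataMap>(type: K, data: EventDataMap[K]): void {
+    for (const handler of this.handlers.get(type) ?? []) {
+      (handler as Handler<EventDataMap[K]>)(data);
+    }
+  }
+
+  on<K extends keyof EventDataMap>(type: K, handler: Handler<EventDataMap[K]>): void {
+    const list = this.handlers.get(type) ?? [];
+    list.push(handler as Handler<never>);
+    this.handlers.set(type, list);
+  }
+}
+```
+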
+ +## Overview + +The satellite emits events for: +- **Status Changes**: Real-time installation status updates +- **Server Logs**: Batched stderr output from MCP servers +- **Request Logs**: Batched tool execution logs with request/response data +- **Tool Metadata**: Tool discovery results with token counts +- **Process Lifecycle**: Server start, crash, restart, permanent failure events + +All events are processed by the backend's event handler system and trigger database updates, SSE broadcasts to frontend, and health monitoring actions. + +## Event System Architecture + +``` +Satellite Component (ProcessManager, McpServerWrapper, DiscoveryManager) + ↓ +EventBus.emit(eventType, eventData) + ↓ +Backend Polling Service (30-second interval) + ↓ +Backend Event Handlers (process events, update database) + ↓ +Frontend SSE Streams (real-time updates to users) +``` + +## Event Types Reference + +### mcp.server.status_changed + +**Purpose:** Update installation status in real-time + +**Emitted by:** +- ProcessManager (connecting, online, crashed, permanently_failed) +- McpServerWrapper (offline, error, requires_reauth on tool execution failures) +- RemoteToolDiscoveryManager (connecting, online, offline, error, requires_reauth) + +For complete status transition triggers and lifecycle flows, see [Status Tracking](/development/satellite/status-tracking). + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + status: 'provisioning' | 'command_received' | 'connecting' | 'discovering_tools' + | 'syncing_tools' | 'online' | 'restarting' | 'offline' | 'error' + | 'requires_reauth' | 'permanently_failed'; + status_message?: string; + timestamp: string; // ISO 8601 +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.server.status_changed', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + status: 'online', + status_message: 'Server connected successfully', + timestamp: '2025-01-15T10:30:00.000Z' +}); +``` + +**Backend Action:** Updates `mcpServerInstallations.status` and broadcasts via SSE + +--- + +### mcp.server.logs + +**Purpose:** Stream server logs (stderr, connection errors, startup messages) to backend + +**Emitted by:** +- ProcessManager (batched stderr output from stdio MCP servers) + +**Batching Strategy:** +- **Interval**: 3 seconds after first log entry +- **Max Size**: 20 logs per batch (forces immediate flush) +- **Grouping**: By `installation_id + team_id` + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + logs: Array<{ + level: 'info' | 'warn' | 'error' | 'debug'; + message: string; + metadata?: Record; + timestamp: string; // ISO 8601 + }>; +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.server.logs', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + logs: [ + { + level: 'error', + message: 'Connection refused to http://localhost:3568/sse', + metadata: { error_code: 'ECONNREFUSED' }, + timestamp: '2025-01-15T10:30:00.000Z' + }, + { + level: 'info', + message: 'Retrying connection in 2 seconds...', + timestamp: '2025-01-15T10:30:02.000Z' + } + ] +}); +``` + +**Backend Action:** Inserts logs into `mcpServerLogs` table, enforces 100-line limit per installation + +--- + +### mcp.request.logs + +**Purpose:** Stream tool execution logs with full request/response data + +**Emitted by:** +- McpServerWrapper (batched tool call logs) + +**Batching Strategy:** +- **Interval**: 3 seconds after first request +- **Max Size**: 20 requests per batch +- **Grouping**: By `installation_id + team_id` + 
+**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + requests: Array<{ + user_id?: string; + tool_name: string; + tool_params: Record; + tool_response?: unknown; // Full MCP server response + response_time_ms: number; + success: boolean; + error_message?: string; + timestamp: string; // ISO 8601 + }>; +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.request.logs', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + requests: [ + { + user_id: 'user_xyz', + tool_name: 'github:list-repos', + tool_params: { owner: 'deploystackio' }, + tool_response: { repos: ['deploystack', 'mcp-server'], total: 2 }, + response_time_ms: 234, + success: true, + timestamp: '2025-01-15T10:30:00.000Z' + } + ] +}); +``` + +**Backend Action:** Inserts requests into `mcpRequestLogs` table, enforces 100-line limit + +**Privacy Note:** Only emitted if `settings.request_logging_enabled !== false` + +--- + +### mcp.tools.discovered + +**Purpose:** Synchronize discovered tools and metadata to backend + +**Emitted by:** +- UnifiedToolDiscoveryManager (after tool discovery completes) + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + tools: Array<{ + tool_path: string; // e.g., "github:list-repos" + name: string; + description?: string; + inputSchema: unknown; + token_count: number; // Estimated token usage + }>; + timestamp: string; // ISO 8601 +} +``` + +**Example:** +```typescript +eventBus.emit('mcp.tools.discovered', { + installation_id: 'inst_abc123', + team_id: 'team_xyz', + tools: [ + { + tool_path: 'github:list-repos', + name: 'list-repos', + description: 'List all repositories for an owner', + inputSchema: { type: 'object', properties: { owner: { type: 'string' } } }, + token_count: 42 + } + ], + timestamp: '2025-01-15T10:30:00.000Z' +}); +``` + +**Backend Action:** Updates `mcpTools` table with discovered tools and metadata + +--- + +### Process Lifecycle Events + +These events track stdio MCP server process state: + +#### mcp.server.started + +**Emitted when:** Stdio process successfully spawned + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + timestamp: string; +} +``` + +#### mcp.server.crashed + +**Emitted when:** Stdio process terminates unexpectedly + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + exit_code: number | null; + signal: string | null; + crash_count: number; // Crashes within 5-minute window + timestamp: string; +} +``` + +#### mcp.server.restarted + +**Emitted when:** Stdio process automatically restarted after crash + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + restart_count: number; + timestamp: string; +} +``` + +#### mcp.server.permanently_failed + +**Emitted when:** Stdio process crashes 3 times within 5 minutes + +**Payload:** +```typescript +{ + installation_id: string; + team_id: string; + process_id: string; + crash_count: number; // Always 3 + message: string; // "Process crashed 3 times in 5 minutes" + timestamp: string; +} +``` + +**Backend Action:** Sets installation status to `permanently_failed`, requires manual restart + +--- + +## Event Batching Strategy + +### Why Batching? 
+ +Batching reduces: +- Backend API calls (20 logs = 1 API call instead of 20) +- Database transactions (bulk insert instead of individual inserts) +- Network overhead (fewer HTTP requests) +- Backend processing load (batch operations are more efficient) + +### Batching Configuration + +| Parameter | Value | Reason | +|-----------|-------|--------| +| Batch Interval | 3 seconds | Balance between real-time feel and efficiency | +| Max Batch Size | 20 entries | Prevent large payloads, force timely emission | +| Grouping Key | `installation_id + team_id` | Separate batches per installation | + +### Batching Implementation + +Log batching implementation details are in [Log Capture - Buffering Implementation](/development/satellite/log-capture#buffering-implementation) for both server logs and request logs. + +## EventBus Usage + +### Emitting Events + +```typescript +import { EventBus } from './events/event-bus'; + +// EventBus is a singleton +const eventBus = EventBus.getInstance(); + +// Emit with type safety +eventBus.emit('mcp.server.status_changed', { + installation_id: 'inst_123', + team_id: 'team_456', + status: 'online', + timestamp: new Date().toISOString() +}); +``` + +### Event Registry + +All event types are defined in the event registry: + +```typescript +// services/satellite/src/events/registry.ts + +export type EventType = + | 'mcp.server.status_changed' + | 'mcp.server.logs' + | 'mcp.request.logs' + | 'mcp.tools.discovered' + | 'mcp.server.started' + | 'mcp.server.crashed' + | 'mcp.server.restarted' + | 'mcp.server.permanently_failed' + // ... 13 total event types + ; + +export interface EventDataMap { + 'mcp.server.status_changed': { /* payload */ }; + 'mcp.server.logs': { /* payload */ }; + // ... type definitions for all events +} +``` + +## Backend Event Handlers + +Each event type has a dedicated backend handler: + +**Status Changed:** +```typescript +// services/backend/src/events/satellite/mcp-server-status-changed.ts +// Updates mcpServerInstallations.status +``` + +**Server Logs:** +```typescript +// services/backend/src/events/satellite/mcp-server-logs.ts +// Inserts into mcpServerLogs table +``` + +**Request Logs:** +```typescript +// services/backend/src/events/satellite/mcp-request-logs.ts +// Inserts into mcpRequestLogs table (if logging enabled) +``` + +**Tools Discovered:** +```typescript +// services/backend/src/events/satellite/mcp-tools-discovered.ts +// Updates mcpTools table with metadata +``` + +## Integration Points + +**Process Manager:** +- Emits server logs (stderr batching) +- Emits lifecycle events (started, crashed, restarted, permanently_failed) +- Emits status changes (connecting, online, permanently_failed) + +**MCP Server Wrapper:** +- Emits request logs (tool execution batching) +- Emits status changes (offline, error, requires_reauth on failures) +- Emits status changes (connecting, online on recovery) + +**Tool Discovery Managers:** +- Emit status changes (connecting, discovering_tools, online, offline, error) +- Trigger tool metadata emission via UnifiedToolDiscoveryManager + +**Unified Tool Discovery Manager:** +- Emits `mcp.tools.discovered` after successful discovery +- Coordinates status callbacks from discovery managers + +## Implementation References + +**Phase 3:** Backend event handler system +**Phase 4:** Satellite status event emission +**Phase 7:** Server and request log batching +**Phase 10:** Tool metadata event emission +**Phase 13:** Stdio permanently_failed event +**Phase 18:** Tool execution failure status events + +## 
Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Status values and lifecycle +- [Log Capture](/development/satellite/log-capture) - Logging system details +- [Process Management](/development/satellite/process-management) - Lifecycle events +- [Tool Discovery](/development/satellite/tool-discovery) - Tool metadata events diff --git a/development/satellite/hierarchical-router.mdx b/development/satellite/hierarchical-router.mdx index 5267e6c..9a11617 100644 --- a/development/satellite/hierarchical-router.mdx +++ b/development/satellite/hierarchical-router.mdx @@ -384,66 +384,9 @@ Satellite → Client ## Format Conversion -### External vs Internal Formats +The satellite converts between user-facing format (`serverName:toolName`) and internal routing format (`serverName-toolName`) transparently during tool discovery and execution. -The satellite uses different tool path formats for different purposes: - -**External Format (User-Facing): `serverName:toolName`** - -Used in: -- `discover_mcp_tools` responses -- `execute_mcp_tool` requests -- Any client-facing communication - -Examples: -- `github:create_issue` -- `figma:get_file` -- `postgres:query` - -Why colon? -- Standard separator in URIs and paths -- Clean, readable format -- Industry convention (npm packages, docker images) - -**Internal Format (Routing): `serverName-toolName`** - -Used in: -- Unified tool cache keys -- Tool discovery manager -- Process routing -- Internal lookups - -Examples: -- `github-create_issue` -- `figma-get_file` -- `postgres-query` - -Why dash? -- Existing codebase convention -- Backward compatibility -- All existing code uses dash format - -### Conversion Logic - -```typescript -// In handleExecuteTool() -const toolPath = "github:create_issue"; // From client - -// Parse external format -const [serverSlug, toolName] = toolPath.split(':'); - -// Convert to internal format -const namespacedToolName = `${serverSlug}-${toolName}`; -// Result: "github-create_issue" - -// Look up in cache -const cachedTool = toolDiscoveryManager.getTool(namespacedToolName); - -// Route to actual MCP server -await executeToolCall(namespacedToolName, toolArguments); -``` - -The conversion is transparent to both clients and actual MCP servers - it's purely a satellite internal concern. +See [Tool Discovery - Namespacing Strategy](/development/satellite/tool-discovery#namespacing-strategy) for complete details on naming conventions and format conversion logic. ## Search Implementation @@ -586,9 +529,17 @@ Both meta-tools are implemented and production-ready: - Fast search performance - Easy to monitor and debug +## Status-Based Tool Filtering + +The hierarchical router integrates with status tracking to hide tools from unavailable servers and provide clear error messages when unavailable tools are executed. + +See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete filtering logic, execution blocking rules, and status values. 
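+
+As a condensed sketch (the function shapes here are illustrative; the real logic lives in `UnifiedToolDiscoveryManager` and the MCP server wrapper), the router-side behavior amounts to:
+
+```typescript
+// discover_mcp_tools: list only tools whose server is currently 'online'
+function visibleTools(
+  allTools: Array<{ tool_path: string }>,
+  isServerAvailable: (serverSlug: string) => boolean
+): Array<{ tool_path: string }> {
+  return allTools.filter(tool => isServerAvailable(tool.tool_path.split(':')[0]));
+}
+
+// execute_mcp_tool: block non-recoverable servers with a clear error,
+// but let offline/error servers through so recovery can be detected
+function executionError(status: string | undefined, serverSlug: string): string | null {
+  if (status === 'requires_reauth') {
+    return `Tool cannot be executed - server '${serverSlug}' requires re-authentication`;
+  }
+  return null; // proceed, including offline/error (the server may have recovered)
+}
+```
+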
 ## Related Documentation
 
 - [Tool Discovery Implementation](/development/satellite/tool-discovery) - Internal tool caching and discovery
+- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status
+- [Recovery System](/development/satellite/recovery-system) - How offline servers auto-recover
 - [MCP Transport Protocols](/development/satellite/mcp-transport) - How clients connect
 - [Process Management](/development/satellite/process-management) - stdio server lifecycle
 - [Architecture Overview](/development/satellite/architecture) - Complete satellite design
diff --git a/development/satellite/log-capture.mdx b/development/satellite/log-capture.mdx
new file mode 100644
index 0000000..4f83bfd
--- /dev/null
+++ b/development/satellite/log-capture.mdx
@@ -0,0 +1,450 @@
+---
+title: Log Capture
+description: Server and request logging system in the satellite
+---
+
+# Log Capture
+
+The satellite captures and batches two types of logs for each MCP server installation: **server logs** (stderr output, connection errors, startup messages) and **request logs** (tool execution with full request/response data).
+
+## Overview
+
+Log capture serves three purposes:
+
+1. **Debugging**: Developers see stderr output and tool execution details
+2. **Monitoring**: Server health and tool usage are tracked in real-time
+3. **Audit Trail**: Every tool call is recorded with its parameters and responses
+
+Both log types use the same batching strategy (3-second interval, max 20 per batch) to optimize backend API calls and database writes.
+
+## Server Logs
+
+Server logs capture stderr output and connection events from MCP servers, particularly useful for debugging stdio-based servers.
+
+### What Gets Logged
+
+**Stdio Servers:**
+- stderr output from the MCP server process
+- Connection errors (handshake failures)
+- Process spawn errors
+- Crash information
+
+**HTTP/SSE Servers:**
+- Connection errors (ECONNREFUSED, ETIMEDOUT)
+- HTTP error responses (4xx, 5xx)
+- OAuth authentication failures
+- Network timeouts
+
+### Log Levels
+
+| Level | Usage |
+|-------|-------|
+| `info` | Normal operations (connection established, tool discovery started) |
+| `warn` | Non-critical issues (retry attempts, temporary failures) |
+| `error` | Critical errors (connection refused, auth failures, crashes) |
+| `debug` | Detailed diagnostic information (handshake details, raw responses) |
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/process/manager.ts
+
+interface BufferedLogEntry {
+  installation_id: string;
+  team_id: string;
+  level: 'info' | 'warn' | 'error' | 'debug';
+  message: string;
+  metadata?: Record<string, unknown>;
+  timestamp: string;
+}
+
+class ProcessManager {
+  private logBuffer: BufferedLogEntry[] = [];
+  private logFlushTimeout: NodeJS.Timeout | null = null;
+  private readonly LOG_BATCH_INTERVAL_MS = 3000;
+  private readonly LOG_BATCH_MAX_SIZE = 20;
+
+  // Called when stderr receives data
+  private handleStderrData(processInfo: ProcessInfo, data: Buffer) {
+    const message = data.toString().trim();
+
+    this.bufferLogEntry({
+      installation_id: processInfo.config.installation_id,
+      team_id: processInfo.config.team_id,
+      level: this.inferLogLevel(message), // 'error' if contains "error", etc.
+      message,
+      metadata: { process_id: processInfo.processId },
+      timestamp: new Date().toISOString()
+    });
+  }
+
+  private bufferLogEntry(entry: BufferedLogEntry) {
+    this.logBuffer.push(entry);
+
+    // Force immediate flush if buffer full
+    if (this.logBuffer.length >= this.LOG_BATCH_MAX_SIZE) {
+      this.flushLogBuffer();
+    } else {
+      this.scheduleLogFlush(); // Flush after 3 seconds
+    }
+  }
+
+  private scheduleLogFlush() {
+    if (this.logFlushTimeout) return; // Already scheduled
+
+    this.logFlushTimeout = setTimeout(() => {
+      this.flushLogBuffer();
+    }, this.LOG_BATCH_INTERVAL_MS);
+  }
+
+  private flushLogBuffer() {
+    if (this.logBuffer.length === 0) return;
+
+    // Group by installation
+    const groupedLogs = new Map<string, BufferedLogEntry[]>();
+    for (const entry of this.logBuffer) {
+      const key = `${entry.installation_id}:${entry.team_id}`;
+      if (!groupedLogs.has(key)) {
+        groupedLogs.set(key, []);
+      }
+      groupedLogs.get(key)!.push(entry);
+    }
+
+    // Emit one event per installation
+    for (const logs of groupedLogs.values()) {
+      this.eventBus?.emit('mcp.server.logs', {
+        installation_id: logs[0].installation_id,
+        team_id: logs[0].team_id,
+        logs: logs.map(log => ({
+          level: log.level,
+          message: log.message,
+          metadata: log.metadata,
+          timestamp: log.timestamp
+        }))
+      });
+    }
+
+    // Clear buffer
+    this.logBuffer = [];
+    this.logFlushTimeout = null;
+  }
+}
+```
+
+### Example Server Logs
+
+```json
+{
+  "installation_id": "inst_abc123",
+  "team_id": "team_xyz",
+  "logs": [
+    {
+      "level": "info",
+      "message": "MCP server starting on port 3568",
+      "timestamp": "2025-01-15T10:30:00.000Z"
+    },
+    {
+      "level": "error",
+      "message": "Connection refused: ECONNREFUSED",
+      "metadata": { "error_code": "ECONNREFUSED" },
+      "timestamp": "2025-01-15T10:30:05.000Z"
+    },
+    {
+      "level": "warn",
+      "message": "Retrying connection in 2 seconds...",
+      "timestamp": "2025-01-15T10:30:07.000Z"
+    }
+  ]
+}
+```
+
+## Request Logs
+
+Request logs capture tool execution with full request parameters and server responses, providing complete visibility into MCP tool usage.
+
+### What Gets Logged
+
+For each tool execution:
+- Tool name (e.g., `github:list-repos`)
+- Input parameters sent to tool
+- **Full response from MCP server** (captured in Phase 14)
+- Response time in milliseconds
+- Success/failure status
+- Error message (if failed)
+- User ID (who called the tool)
+- Timestamp
+
+### Privacy Control
+
+Request logging can be disabled per-installation via settings:
+
+```typescript
+// Installation settings
+{
+  "request_logging_enabled": false
+}
+```
+
+When disabled:
+- No request logs are buffered or emitted
+- Tool execution still works normally
+- Server logs (stderr) still captured
+- Used for privacy-sensitive tools (internal APIs, credentials, PII)
+
+### Buffering Implementation
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface BufferedRequestEntry {
+  installation_id: string;
+  team_id: string;
+  user_id?: string;
+  tool_name: string;
+  tool_params: Record<string, unknown>;
+  tool_response?: unknown; // Full MCP server response
+  response_time_ms: number;
+  success: boolean;
+  error_message?: string;
+  timestamp: string;
+}
+
+class McpServerWrapper {
+  private requestLogBuffer: BufferedRequestEntry[] = [];
+  private requestLogFlushTimeout: NodeJS.Timeout | null = null;
+  private readonly REQUEST_LOG_BATCH_INTERVAL_MS = 3000;
+  private readonly REQUEST_LOG_BATCH_MAX_SIZE = 20;
+
+  async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+    const startTime = Date.now();
+    let result: unknown;
+    let success = false;
+    let errorMessage: string | undefined;
+
+    try {
+      result = await this.executeToolCall(toolPath, toolArguments);
+      success = true;
+    } catch (error) {
+      errorMessage = error instanceof Error ? error.message : 'Unknown error';
+      throw error; // Propagate after the finally block records the failed request
+    } finally {
+      const responseTimeMs = Date.now() - startTime;
+
+      // Check if logging is enabled (default: true)
+      // config: the installation context resolved for this tool's server
+      const loggingEnabled = config?.settings?.request_logging_enabled !== false;
+
+      // Buffer request log if installation context exists and logging enabled
+      if ((config?.installation_id && config?.team_id) && loggingEnabled) {
+        this.bufferRequestLogEntry({
+          installation_id: config.installation_id,
+          team_id: config.team_id,
+          user_id: config.user_id,
+          tool_name: toolPath,
+          tool_params: toolArguments as Record<string, unknown>,
+          tool_response: result, // Captured response
+          response_time_ms: responseTimeMs,
+          success,
+          error_message: errorMessage,
+          timestamp: new Date().toISOString()
+        });
+      }
+    }
+
+    return result;
+  }
+
+  private bufferRequestLogEntry(entry: BufferedRequestEntry) {
+    this.requestLogBuffer.push(entry);
+
+    // Force flush if buffer full
+    if (this.requestLogBuffer.length >= this.REQUEST_LOG_BATCH_MAX_SIZE) {
+      this.flushRequestLogBuffer();
+    } else {
+      this.scheduleRequestLogFlush(); // mirrors ProcessManager.scheduleLogFlush()
+    }
+  }
+
+  private flushRequestLogBuffer() {
+    if (this.requestLogBuffer.length === 0) return;
+
+    // Group by installation
+    const grouped = this.groupRequestsByInstallation(this.requestLogBuffer);
+
+    // Emit one event per installation
+    for (const requests of grouped.values()) {
+      this.eventBus?.emit('mcp.request.logs', {
+        installation_id: requests[0].installation_id,
+        team_id: requests[0].team_id,
+        requests: requests.map(req => ({
+          user_id: req.user_id,
+          tool_name: req.tool_name,
+          tool_params: req.tool_params,
+          tool_response: req.tool_response, // Include response
+          response_time_ms: req.response_time_ms,
+          success: req.success,
+          error_message: req.error_message,
+          timestamp: req.timestamp
+        }))
+      });
+    }
+
+    // Clear buffer
+    this.requestLogBuffer = [];
+    this.requestLogFlushTimeout = null;
+  }
+}
+```
+
+### Example Request Logs
+
+```json
+{
+  "installation_id": "inst_abc123",
+  "team_id": "team_xyz",
+  "requests": [
+    {
+      "user_id": "user_xyz",
+      "tool_name": "github:list-repos",
+      "tool_params": {
+        "owner": "deploystackio"
+      },
+      "tool_response": {
+        "repos": ["deploystack", "mcp-server"],
+        "total": 2
+      },
+      "response_time_ms": 234,
+      "success": true,
+      "timestamp": "2025-01-15T10:30:00.000Z"
+    },
+    {
+      "user_id": "user_xyz",
+      "tool_name": "slack:send-message",
+      "tool_params": {
+        "channel": "#general",
+        "text": "Deploy complete"
+      },
+      "response_time_ms": 456,
+      "success": false,
+      "error_message": "Channel not found",
+      "timestamp": "2025-01-15T10:30:05.000Z"
+    }
+  ]
+}
+```
+
+## Batching Configuration
+
+Both server logs and request logs use the same batching strategy. See [Event Emission - Batching Configuration](/development/satellite/event-emission#batching-configuration) for configuration parameters and rationale.
+
+### Batching Flow
+
+```
+Log/Request occurs
+  ↓
+Buffer entry in memory
+  ↓
+  ├─ Buffer size < 20?
+  │    ↓
+  │  Schedule flush after 3 seconds
+  │
+  └─ Buffer size >= 20?
+       ↓
+     Flush immediately (force)
+  ↓
+Group entries by installation
+  ↓
+Emit one event per installation
+  ↓
+Backend receives batched logs
+  ↓
+Bulk insert into database
+```
+
+## Backend Storage
+
+### Server Logs Table
+
+```sql
+CREATE TABLE mcpServerLogs (
+  id TEXT PRIMARY KEY,
+  installation_id TEXT NOT NULL,
+  level TEXT NOT NULL, -- 'info'|'warn'|'error'|'debug'
+  message TEXT NOT NULL,
+  metadata JSONB,
+  created_at TIMESTAMP NOT NULL,
+  FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id)
+);
+```
+
+### Request Logs Table
+
+```sql
+CREATE TABLE mcpRequestLogs (
+  id TEXT PRIMARY KEY,
+  installation_id TEXT NOT NULL,
+  user_id TEXT,
+  tool_name TEXT NOT NULL,
+  tool_params JSONB NOT NULL,
+  tool_response JSONB, -- Full response from MCP server
+  response_time_ms INTEGER NOT NULL,
+  success BOOLEAN NOT NULL,
+  error_message TEXT,
+  created_at TIMESTAMP NOT NULL,
+  FOREIGN KEY (installation_id) REFERENCES mcpServerInstallations(id),
+  FOREIGN KEY (user_id) REFERENCES authUser(id)
+);
+```
+
+### Cleanup Job
+
+A backend cron job enforces a 100-line limit per installation for both tables:
+
+```typescript
+// Runs every 10 minutes
+// For each installation with > 100 logs:
+//   1. Find oldest logs to delete (keep most recent 100)
+//   2. DELETE FROM table WHERE id NOT IN (recent 100)
+```
+
+This prevents unbounded table growth while maintaining recent debugging history.
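+
+A pruning statement in the spirit of the pseudocode above might look like the following. This is an illustrative sketch — the query shape and the `db.run` runner are assumptions, not the backend's actual cron implementation:
+
+```typescript
+// Keep only the 100 most recent rows per installation.
+// The same shape applies to mcpRequestLogs.
+const PRUNE_SERVER_LOGS_SQL = `
+  DELETE FROM mcpServerLogs
+  WHERE installation_id = ?
+    AND id NOT IN (
+      SELECT id FROM mcpServerLogs
+      WHERE installation_id = ?
+      ORDER BY created_at DESC
+      LIMIT 100
+    )
+`;
+
+async function pruneServerLogs(
+  db: { run(sql: string, params: unknown[]): Promise<void> }, // placeholder query runner
+  installationId: string
+): Promise<void> {
+  await db.run(PRUNE_SERVER_LOGS_SQL, [installationId, installationId]);
+}
+```
+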
+ +## Buffer Management + +### Memory Usage + +**Server Logs:** +- Maximum ~20 entries in buffer before flush +- Each entry: ~200 bytes average (message + metadata) +- Max buffer size: ~4 KB per ProcessManager instance + +**Request Logs:** +- Maximum ~20 entries in buffer before flush +- Each entry: Variable (depends on params/response size) +- Typically: 500 bytes - 5 KB per entry +- Max buffer size: ~10-100 KB per McpServerWrapper instance + +### Cleanup on Shutdown + +Both buffer managers flush remaining logs on cleanup: + +```typescript +// ProcessManager cleanup +cleanup() { + this.flushLogBuffer(); // Flush any buffered logs + clearTimeout(this.logFlushTimeout); +} + +// McpServerWrapper cleanup +cleanup() { + this.flushRequestLogBuffer(); // Flush any buffered requests + clearTimeout(this.requestLogFlushTimeout); +} +``` + +## Implementation References + +**Phase 7:** Server and request log batching implementation +**Phase 14:** Request logging toggle and tool response capture +**Phase 5:** Backend log tables and event handlers +**Phase 6:** 100-line cleanup job + +## Related Documentation + +- [Event Emission](/development/satellite/event-emission) - Log event types and payloads +- [Process Management](/development/satellite/process-management) - Server log buffering +- [Status Tracking](/development/satellite/status-tracking) - How logs relate to status diff --git a/development/satellite/process-management.mdx b/development/satellite/process-management.mdx index 34c974e..5c8ceb3 100644 --- a/development/satellite/process-management.mdx +++ b/development/satellite/process-management.mdx @@ -408,36 +408,11 @@ The ProcessManager emits events for monitoring and integration: ## Event Emission -The ProcessManager emits real-time events to the Backend for operational visibility and audit trails. These events are batched every 3 seconds and sent via the Event System. +The ProcessManager emits real-time lifecycle events (started, crashed, restarted, permanently_failed) to the Backend for operational visibility and audit trails. -### Lifecycle Events +ProcessManager internal events (processSpawned, processTerminated) are for satellite-internal coordination. Event System events (mcp.server.started, etc.) are sent to Backend for external visibility. -**mcp.server.started** -- Emitted after successful spawn and handshake completion -- Includes: server_id, process_id, spawn_duration_ms, tool_count -- Provides immediate visibility into new MCP server availability - -**mcp.server.crashed** -- Emitted on unexpected process exit with non-zero code -- Includes: exit_code, signal, uptime_seconds, crash_count, will_restart -- Enables real-time alerting for process failures - -**mcp.server.restarted** -- Emitted after successful automatic restart -- Includes: old_process_id, new_process_id, restart_reason, attempt_number -- Tracks restart attempts for reliability monitoring - -**mcp.server.permanently_failed** -- Emitted when restart limit (3 attempts) is exceeded -- Includes: total_crashes, last_error, failed_at timestamp -- Critical alert requiring manual intervention - -**Event vs Internal Events:** -- ProcessManager internal events (processSpawned, processTerminated, etc.) are for satellite-internal coordination -- Event System events (mcp.server.started, etc.) 
are sent to Backend for external visibility -- Both work together: Internal events trigger state changes, Event System events provide audit trail - -For complete event system documentation and all event types, see [Event System](/development/satellite/event-system). +See [Event Emission - Process Lifecycle Events](/development/satellite/event-emission#event-types-reference) for complete event types, payloads, and batching configuration. ## Team Isolation @@ -531,10 +506,51 @@ LOG_LEVEL=debug npm run dev - Enabled by default (MCP servers need external connectivity) - Can be disabled for higher security requirements +## Status Events + +Process lifecycle emits status events to backend for real-time monitoring: + +**Status Event Emission:** +- `connecting` - When process spawn starts +- `online` - After successful handshake and tool discovery +- `permanently_failed` - When process crashes 3 times in 5 minutes + +See [Event Emission](/development/satellite/event-emission) for complete event types and payloads. + +## Log Buffering + +Process stderr output is buffered and batched before emission: + +**Buffering Strategy:** +- Batch interval: 3 seconds after first log +- Max batch size: 20 logs (forces immediate flush) +- Grouping: By installation_id + team_id + +**Log Levels:** +- Inferred from message content (`error` if contains "error", etc.) +- Metadata includes process_id for debugging + +See [Log Capture](/development/satellite/log-capture) for buffer management details. + +## Configuration Restart Flow + +When configuration is updated (env vars, args, headers, query params): + +1. Backend sets installation status to `restarting` +2. Backend sends `configure` command to satellite +3. Satellite receives command and stops old process +4. Satellite clears tool cache for installation +5. Satellite spawns new process with updated configuration +6. Status progresses: `restarting` → `connecting` → `discovering_tools` → `online` + +See [Status Tracking](/development/satellite/status-tracking) for configuration update status transitions. + ## Related Documentation - [Satellite Architecture Design](/development/satellite/architecture) - Overall system architecture - [Idle Process Management](/development/satellite/idle-process-management) - Automatic termination and respawning of idle processes - [Tool Discovery Implementation](/development/satellite/tool-discovery) - How tools are discovered from processes -- [Team Isolation Implementation](/development/satellite/team-isolation) - Team-based access control +- [Event Emission](/development/satellite/event-emission) - Process lifecycle events +- [Log Capture](/development/satellite/log-capture) - stderr log buffering +- [Status Tracking](/development/satellite/status-tracking) - Process status management - [Backend Communication](/development/satellite/backend-communication) - Integration with Backend commands diff --git a/development/satellite/recovery-system.mdx b/development/satellite/recovery-system.mdx new file mode 100644 index 0000000..84527a5 --- /dev/null +++ b/development/satellite/recovery-system.mdx @@ -0,0 +1,370 @@ +--- +title: Recovery System +description: Automatic recovery and failure handling for MCP servers +--- + +# Recovery System + +The satellite automatically detects and recovers from MCP server failures without manual intervention. Recovery works for HTTP/SSE servers (network failures) and stdio servers (process crashes). 
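+
+At a glance, the distinction that drives the rest of this page can be summarized as follows (an illustrative sketch, not actual satellite code):
+
+```typescript
+// Which failure states the satellite recovers from on its own
+type FailureStatus = 'offline' | 'error' | 'requires_reauth' | 'permanently_failed';
+
+const AUTO_RECOVERABLE: Record<FailureStatus, boolean> = {
+  offline: true,              // unreachable; recovers when the server responds again
+  error: true,                // general error; recovers on the next successful call
+  requires_reauth: false,     // user must re-authenticate in the dashboard
+  permanently_failed: false,  // 3+ crashes in 5 minutes; manual restart required
+};
+```
+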
+## Overview
+
+The recovery system handles **HTTP/SSE Servers** (network failures, server downtime, connection timeouts) and **Stdio Servers** (process crashes up to 3 times in 5 minutes).
+
+Recovery is fully automatic for recoverable failures. Permanent failures (3+ crashes, OAuth token expired) require manual action.
+
+## Recovery Detection
+
+### Tool Execution Recovery
+
+When a tool is executed on a server that was previously offline/error, recovery is detected automatically:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+async handleExecuteTool(toolPath: string, toolArguments: unknown) {
+  const serverSlug = toolPath.split(':')[0];
+  const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug);
+  const wasOfflineOrError = statusEntry && ['offline', 'error'].includes(statusEntry.status);
+
+  // Execute tool with retry logic
+  const result = await this.executeHttpToolCallWithRetry(...);
+
+  // If execution succeeded but server was offline/error → RECOVERY DETECTED
+  if (wasOfflineOrError) {
+    this.handleServerRecovery(serverSlug, config); // config: this server's installation context
+  }
+
+  return result;
+}
+```
+
+### Health Check Recovery
+
+Backend health checks periodically test offline servers. When they respond again:
+
+```
+Backend health check runs (every 3 minutes)
+    ↓
+Offline template now responds
+    ↓
+Backend sets installations to 'connecting'
+    ↓
+Backend sends 'configure' command with event='mcp_recovery'
+    ↓
+Satellite receives command and triggers re-discovery
+    ↓
+Status progresses: connecting → discovering_tools → online
+```
+
+## Retry Logic (HTTP/SSE)
+
+Before marking a server as offline, the satellite retries tool execution with exponential backoff:
+
+```typescript
+// services/satellite/src/core/mcp-server-wrapper.ts
+
+interface RetryConfig {
+  maxRetries: 3;
+  backoffMs: [500, 1000, 2000]; // Exponential: 500ms, 1s, 2s
+}
+
+async executeHttpToolCallWithRetry(
+  serverConfig: McpServerConfig,
+  toolName: string,
+  args: unknown
+): Promise<unknown> {
+  let lastError: Error | undefined;
+
+  for (let attempt = 1; attempt <= 3; attempt++) {
+    try {
+      const response = await this.executeHttpToolCall(serverConfig, toolName, args);
+      return response; // Success - no retry needed
+    } catch (error) {
+      lastError = error as Error;
+
+      // Non-retryable errors (auth failures) → fail immediately
+      if (this.isNonRetryableError(lastError)) {
+        throw error;
+      }
+
+      // Retryable errors (connection refused) → backoff and retry
+      if (attempt < 3) {
+        const backoffMs = [500, 1000, 2000][attempt - 1];
+        await new Promise(resolve => setTimeout(resolve, backoffMs));
+      }
+    }
+  }
+
+  // All retries exhausted → throw last error
+  throw lastError;
+}
+
+private isNonRetryableError(error: Error): boolean {
+  const msg = error.message.toLowerCase();
+  return msg.includes('401') || msg.includes('403') ||
+         msg.includes('unauthorized') || msg.includes('forbidden') ||
+         msg.includes('oauth') || msg.includes('authorization required');
+}
+```
+
+### Retryable vs Non-Retryable Errors
+
+| Error Type | Action | Reason |
+|------------|--------|--------|
+| ECONNREFUSED | **Retry** | Server may be restarting |
+| ETIMEDOUT | **Retry** | Network hiccup, may recover |
+| ENOTFOUND | **Retry** | DNS issue, may be temporary |
+| fetch failed | **Retry** | Network error, transient |
+| 401 Unauthorized | **No retry** | Token expired, retrying won't help |
+| 403 Forbidden | **No retry** | Access denied, retrying won't help |
+| OAuth errors | **No retry** | Auth issue, needs user action |
+
+## Recovery Flow
+
+When servers
recover from failure, the satellite updates status and triggers re-discovery asynchronously without blocking tool execution responses. + +See [Status Tracking - Status Lifecycle](/development/satellite/status-tracking#status-lifecycle) for complete recovery flow diagrams including successful recovery, failed recovery, and status transitions. + +## Automatic Re-Discovery + +When recovery is detected, tools are refreshed from the server without blocking the user: + +```typescript +// services/satellite/src/core/mcp-server-wrapper.ts + +private async handleServerRecovery( + serverSlug: string, + config: McpServerConfig +): Promise { + // Prevent duplicate recovery attempts + if (this.recoveryInProgress.has(serverSlug)) { + return; // Already recovering + } + + this.recoveryInProgress.add(serverSlug); + + try { + this.logger.info({ serverSlug }, 'Server recovered - triggering re-discovery'); + + // Emit status change to backend + this.eventBus?.emit('mcp.server.status_changed', { + installation_id: config.installation_id, + team_id: config.team_id, + status: 'connecting', + status_message: 'Server recovered, re-discovering tools', + timestamp: new Date().toISOString() + }); + + // Trigger re-discovery asynchronously (doesn't block tool response) + await this.toolDiscoveryManager?.remoteToolManager?.discoverServerTools(serverSlug); + + this.logger.info({ serverSlug }, 'Tool re-discovery successful after recovery'); + } catch (error) { + // Re-discovery failed (non-fatal, tool response still returned) + this.logger.error({ serverSlug, error }, 'Tool re-discovery failed after recovery'); + } finally { + this.recoveryInProgress.delete(serverSlug); + } +} +``` + +### Why Asynchronous Re-Discovery? + +**User Experience:** +- Tool execution result returned immediately +- User doesn't wait for tool discovery (can take 1-5 seconds) +- If re-discovery fails, user already got their result + +**Reliability:** +- Tool response isn't blocked by discovery errors +- Discovery failure doesn't affect user's current request +- Recovery can be retried later + +## Tool Preservation + +When re-discovery fails, tools are NOT removed from cache: + +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +async rediscoverServerTools(serverSlug: string): Promise { + try { + // Attempt discovery + const newTools = await this.fetchToolsFromServer(serverSlug); + + // Discovery succeeded → remove old tools and add new ones + this.removeToolsForServer(serverSlug); + this.addTools(newTools); + + this.statusCallback?.(serverSlug, 'online'); + } catch (error) { + // Discovery failed → keep old tools in cache + // Tools remain available for future attempts + this.statusCallback?.(serverSlug, 'error', error.message); + } +} +``` + +**Why preserve tools on failure?** +- User can still see what tools are available +- Tools may work if server recovers later +- Better UX than empty tool list +- Discovery can be retried without losing tool metadata + +## Stdio Process Recovery + +Stdio servers auto-restart after crashes (up to 3 times in 5 minutes): + +```typescript +// services/satellite/src/process/manager.ts + +async handleProcessExit(processInfo: ProcessInfo, exitCode: number) { + const now = Date.now(); + const fiveMinutesAgo = now - 5 * 60 * 1000; + + // Track crashes in 5-minute window + processInfo.crashHistory = processInfo.crashHistory.filter(t => t > fiveMinutesAgo); + processInfo.crashHistory.push(now); + + const crashCount = processInfo.crashHistory.length; + + if (crashCount >= 3) { + // 
Permanent failure - emit status event + this.eventBus?.emit('mcp.server.permanently_failed', { + installation_id: processInfo.config.installation_id, + team_id: processInfo.config.team_id, + process_id: processInfo.processId, + crash_count: crashCount, + message: `Process crashed ${crashCount} times in 5 minutes`, + timestamp: new Date().toISOString() + }); + + // Also emit status_changed for database update + this.eventBus?.emit('mcp.server.status_changed', { + installation_id: processInfo.config.installation_id, + team_id: processInfo.config.team_id, + status: 'permanently_failed', + status_message: `Process crashed ${crashCount} times in 5 minutes. Manual restart required.`, + timestamp: new Date().toISOString() + }); + + return; // No auto-restart + } + + // Auto-restart (crash count < 3) + this.logger.info({ processId: processInfo.processId, crashCount }, 'Auto-restarting crashed process'); + await this.startProcess(processInfo.config); +} +``` + +### Stdio Recovery Timeline + +``` +Process crashes (crash #1) + ↓ +Auto-restart immediately + ↓ +Process crashes again (crash #2, within 5 min) + ↓ +Auto-restart immediately + ↓ +Process crashes again (crash #3, within 5 min) + ↓ +Status → 'permanently_failed' + ↓ +No auto-restart (manual action required) +``` + +## Failure Status Mapping + +When tool execution fails after all retries, error messages are mapped to appropriate status values: + +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +static getStatusFromError(error: Error): { status: string; message: string } { + const msg = error.message.toLowerCase(); + + // Auth errors → requires_reauth + if (msg.includes('401') || msg.includes('unauthorized')) { + return { status: 'requires_reauth', message: 'Authentication failed (HTTP 401)' }; + } + if (msg.includes('403') || msg.includes('forbidden')) { + return { status: 'requires_reauth', message: 'Access forbidden (HTTP 403)' }; + } + + // Connection errors → offline + if (msg.includes('econnrefused') || msg.includes('etimedout') || + msg.includes('enotfound') || msg.includes('fetch failed')) { + return { status: 'offline', message: 'Server unreachable' }; + } + + // Other errors → error + return { status: 'error', message: error.message }; +} +``` + +## Debouncing Concurrent Recovery + +Multiple tool executions may detect recovery simultaneously. Debouncing prevents duplicate re-discoveries: + +```typescript +class McpServerWrapper { + private recoveryInProgress: Set = new Set(); + + private async handleServerRecovery(serverSlug: string, config: McpServerConfig) { + // Check if already recovering + if (this.recoveryInProgress.has(serverSlug)) { + return; // Skip duplicate recovery + } + + this.recoveryInProgress.add(serverSlug); + + try { + await this.performRecovery(serverSlug, config); + } finally { + this.recoveryInProgress.delete(serverSlug); + } + } +} +``` + +**Scenario:** +- LLM executes 3 tools from same server concurrently +- All 3 detect recovery (server was offline) +- Only first execution triggers re-discovery +- Other 2 skip (already in progress) + +## Recovery Timing + +| Recovery Type | Detection Time | Re-Discovery Time | Total | +|---------------|----------------|-------------------|-------| +| **Tool Execution** | Immediate (on next tool call) | 1-5 seconds | ~1-5s | +| **Health Check** | Up to 3 minutes (polling interval) | 1-5 seconds | ~3-8 min | + +**Recommendation:** Tool execution recovery is faster and more responsive than health check recovery. 
+ +## Manual Recovery (Requires User Action) + +Some failures cannot auto-recover: + +| Status | Reason | User Action | +|--------|--------|-------------| +| `requires_reauth` | OAuth token expired/revoked | Re-authenticate in dashboard | +| `permanently_failed` | 3+ crashes in 5 minutes (stdio) | Check logs, fix issue, manual restart | + +See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays). + +## Implementation References + +**Phase 13:** Stdio auto-recovery and permanently_failed status +**Phase 18:** Tool execution retry logic and recovery detection +**Phase 8:** Health check recovery via backend + +## Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Status values and transitions +- [Event Emission](/development/satellite/event-emission) - Recovery status events +- [Tool Discovery](/development/satellite/tool-discovery) - Re-discovery after recovery +- [Process Management](/development/satellite/process-management) - Stdio crash recovery diff --git a/development/satellite/status-tracking.mdx b/development/satellite/status-tracking.mdx new file mode 100644 index 0000000..a530c55 --- /dev/null +++ b/development/satellite/status-tracking.mdx @@ -0,0 +1,284 @@ +--- +title: Status Tracking +description: MCP server installation status tracking system in the satellite +--- + +# Status Tracking + +The satellite tracks the health and availability of each MCP server installation through an 11-state status system. This enables real-time monitoring, automatic recovery, and tool availability filtering. + +## Overview + +Status tracking serves three primary purposes: + +1. **User Visibility**: Users see current server state in real-time via the frontend +2. **Tool Availability**: Tools from unavailable servers are filtered from discovery +3. **Automatic Recovery**: System detects and recovers from failures automatically + +The status system is managed by `UnifiedToolDiscoveryManager` and updated through: +- Installation lifecycle events (provisioning → online) +- Health check results (online → offline) +- Tool execution failures (online → offline/error/requires_reauth) +- Configuration changes (online → restarting) +- Recovery detection (offline → connecting → online) + +## Status Values + +| Status | Description | Tools Available? 
| User Action Required | +|--------|-------------|------------------|---------------------| +| `provisioning` | Initial state after installation created | No | Wait | +| `command_received` | Satellite received configuration command | No | Wait | +| `connecting` | Connecting to MCP server | No | Wait | +| `discovering_tools` | Running tool discovery | No | Wait | +| `syncing_tools` | Syncing tools to backend | No | Wait | +| `online` | Server healthy and responding | **Yes** | None | +| `restarting` | Configuration updated, server restarting | No | Wait | +| `offline` | Server unreachable (auto-recovers) | No | Wait or check server | +| `error` | General error state (auto-recovers) | No | Check logs | +| `requires_reauth` | OAuth token expired/revoked | No | Re-authenticate | +| `permanently_failed` | 3+ crashes in 5 minutes (stdio only) | No | Manual restart required | + +## Status Lifecycle + +### Initial Installation Flow + +``` +provisioning + ↓ +command_received (satellite received configure command) + ↓ +connecting (spawning MCP server process or connecting to HTTP/SSE) + ↓ +discovering_tools (calling tools/list) + ↓ +syncing_tools (sending tools to backend) + ↓ +online (ready for use) +``` + +### Configuration Update Flow + +``` +online + ↓ +restarting (user updated config, backend sets status immediately) + ↓ +connecting (satellite receives command, restarts server) + ↓ +discovering_tools + ↓ +online +``` + +### Failure and Recovery Flow + +``` +online + ↓ +offline/error (server unreachable or error response) + ↓ +[automatic recovery when server comes back] + ↓ +connecting + ↓ +discovering_tools + ↓ +online +``` + +### OAuth Failure Flow + +``` +online + ↓ +requires_reauth (401/403 response or token refresh failed) + ↓ +[user re-authenticates via dashboard] + ↓ +connecting + ↓ +discovering_tools + ↓ +online +``` + +### Stdio Crash Flow (Permanent Failure) + +``` +online + ↓ +(stdio process crashes) + ↓ +connecting (auto-restart attempt 1) + ↓ +(crashes again within 5 minutes) + ↓ +connecting (auto-restart attempt 2) + ↓ +(crashes again within 5 minutes) + ↓ +permanently_failed (manual intervention required) +``` + +## Status Tracking Implementation + +### UnifiedToolDiscoveryManager + +The status system is implemented in `UnifiedToolDiscoveryManager`: + +```typescript +// services/satellite/src/services/unified-tool-discovery-manager.ts + +export type ServerAvailabilityStatus = + | 'online' + | 'offline' + | 'error' + | 'requires_reauth' + | 'permanently_failed' + | 'connecting' + | 'discovering_tools'; + +export interface ServerStatusEntry { + status: ServerAvailabilityStatus; + lastUpdated: Date; + message?: string; +} + +class UnifiedToolDiscoveryManager { + private serverStatus: Map = new Map(); + + // Set server status (called by discovery managers and MCP wrapper) + setServerStatus(serverSlug: string, status: ServerAvailabilityStatus, message?: string): void { + this.serverStatus.set(serverSlug, { + status, + lastUpdated: new Date(), + message + }); + } + + // Check if server is available for tool execution + isServerAvailable(serverSlug: string): boolean { + const statusEntry = this.serverStatus.get(serverSlug); + if (!statusEntry) return true; // Unknown = available (safe default) + return statusEntry.status === 'online'; + } + + // Get all tools, filtered by server status + getAllTools(): ToolMetadata[] { + const allTools = this.getAllToolsUnfiltered(); + return allTools.filter(tool => { + const serverSlug = tool.tool_path.split(':')[0]; + return 
this.isServerAvailable(serverSlug); + }); + } +} +``` + +### Status Callbacks + +Discovery managers call status callbacks when discovery succeeds or fails: + +**HTTP/SSE Discovery:** +```typescript +// services/satellite/src/services/remote-tool-discovery-manager.ts + +// On successful discovery +this.statusCallback?.(serverSlug, 'online'); + +// On connection error +const { status, message } = RemoteToolDiscoveryManager.getStatusFromError(error); +this.statusCallback?.(serverSlug, status, message); +``` + +**Stdio Discovery:** +```typescript +// services/satellite/src/services/stdio-tool-discovery-manager.ts + +// On successful discovery +this.statusCallback?.(processId, 'online'); + +// On discovery error +this.statusCallback?.(processId, 'error', errorMessage); +``` + +## Tool Filtering by Status + +### Discovery Filtering + +When LLMs call `discover_mcp_tools`, only tools from available servers are returned: + +```typescript +// UnifiedToolDiscoveryManager.getAllTools() filters by status +const tools = toolDiscoveryManager.getAllTools(); // Only 'online' servers + +// Tools from offline/error/requires_reauth servers are hidden +``` + +### Execution Blocking + +When LLMs attempt to execute tools from unavailable servers: + +```typescript +// services/satellite/src/core/mcp-server-wrapper.ts + +const serverSlug = toolPath.split(':')[0]; +const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug); + +// Block execution for non-recoverable states +if (statusEntry?.status === 'requires_reauth') { + return { + error: `Tool cannot be executed - server requires re-authentication. + +Status: ${statusEntry.status} +The server requires re-authentication. Please re-authorize in the dashboard. + +Unavailable server: ${serverSlug}` + }; +} + +// Allow execution for offline/error (enables recovery detection) +``` + +## Status Transition Triggers + +### Backend-Triggered (Database Updates) + +**Source:** Backend API routes + +| Trigger | New Status | When | +|---------|-----------|------| +| Installation created | `provisioning` | User installs MCP server | +| Config updated | `restarting` | User modifies environment vars/args/headers | +| OAuth callback success | `connecting` | User re-authenticates | +| Health check fails | `offline` | Server unreachable (3-min interval) | +| Credential validation fails | `requires_reauth` | OAuth token invalid | + +### Satellite-Triggered (Event Emission) + +**Source:** Satellite emits `mcp.server.status_changed` events to backend + +| Trigger | New Status | When | +|---------|-----------|------| +| Configure command received | `command_received` | Satellite polls backend | +| Server connection starts | `connecting` | Spawning process or HTTP connect | +| Tool discovery starts | `discovering_tools` | Calling tools/list | +| Tool discovery succeeds | `online` | Discovery completed successfully | +| Tool execution fails (3 retries) | `offline`/`error`/`requires_reauth` | Tool call failed after retries | +| Server recovery detected | `connecting` | Previously offline server responds | +| Stdio crashes 3 times | `permanently_failed` | 3 crashes within 5 minutes | + +## Implementation References + +**Phase 1:** Database schema for status field +**Phase 3:** Backend event handler for status updates +**Phase 4:** Satellite status event emission +**Phase 10:** Tool availability filtering by status +**Phase 17:** Configuration update status transitions +**Phase 18:** Tool execution status updates + auto-recovery + +## Related Documentation + +- [Event 
Emission](/development/satellite/event-emission) - Status change event details +- [Recovery System](/development/satellite/recovery-system) - Automatic recovery logic +- [Tool Discovery](/development/satellite/tool-discovery) - How status affects tool discovery +- [Hierarchical Router](/development/satellite/hierarchical-router) - Status-based tool filtering diff --git a/development/satellite/tool-discovery.mdx b/development/satellite/tool-discovery.mdx index b4cdba1..00abbbe 100644 --- a/development/satellite/tool-discovery.mdx +++ b/development/satellite/tool-discovery.mdx @@ -281,7 +281,7 @@ The satellite uses `estimateMcpServerTokens()` from `token-counter.ts` to calcul - Enable frontend tool catalog display with token consumption metrics - Provide analytics on MCP server complexity and context window usage -See [Event System](/development/satellite/event-system) for event batching and delivery details. +For event payload structure and event batching details, see [Event Emission - mcp.tools.discovered](/development/satellite/event-emission#mcp-tools-discovered). ## Development Considerations @@ -349,4 +349,66 @@ curl http://localhost:3001/api/status/debug - Detailed usage and performance analytics - Cache persistence for faster startup (HTTP only) +## Status Integration + +Tool discovery integrates with the status tracking system to filter tools and enable automatic recovery. Discovery managers call status callbacks on success/failure to update installation status in real-time. + +See [Status Tracking - Tool Filtering](/development/satellite/status-tracking#tool-filtering-by-status) for complete details on status-based tool filtering and execution blocking. + +## Recovery System + +When offline servers recover, tool discovery is automatically triggered. The satellite preserves existing tools during re-discovery attempts to prevent tool loss on failure. + +See [Recovery System - Recovery Detection](/development/satellite/recovery-system#recovery-detection) for complete recovery logic, retry strategy, and tool preservation implementation. + +## Tool Metadata Events + +Discovered tools are emitted to backend with token count estimates. + +**Event Structure:** +```typescript +eventBus.emit('mcp.tools.discovered', { + installation_id: string, + team_id: string, + tools: [{ + tool_path: string, + name: string, + description?: string, + inputSchema: unknown, + token_count: number // Estimated token usage + }] +}); +``` + +**Token Calculation:** +- Name + description + input schema serialized +- Estimated using character count / 4 (approximate tokens) +- Used for analytics and optimization + +See [Event Emission](/development/satellite/event-emission) for complete event types. + +## Request Logging + +Tool execution is logged with full request/response data for debugging. + +**Logged Information:** +- Tool name and input parameters +- Full MCP server response (captured) +- Response time in milliseconds +- Success/failure status and error messages +- User attribution (who called the tool) + +**Privacy Control:** +Request logging can be disabled per-installation via `settings.request_logging_enabled = false`. + +See [Log Capture](/development/satellite/log-capture) for buffering and storage details. 
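+
+To make the captured fields concrete, here is a minimal sketch of a request log entry wrapped around a tool call. It is illustrative only — the entry shape mirrors the fields listed above, while `executeWithRequestLog` and the commented-out buffering hook are hypothetical names, not the satellite's actual API:
+
+```typescript
+// Sketch: field names mirror the documented log contents; helper names are hypothetical.
+interface RequestLogEntry {
+  tool_name: string;          // e.g., 'github:list-repos'
+  request_params: unknown;    // input parameters sent to the tool
+  tool_response?: unknown;    // only captured when request logging is enabled
+  duration_ms: number;        // response time in milliseconds
+  success: boolean;
+  error_message?: string;
+  user_id?: string;           // attribution: who called the tool
+}
+
+async function executeWithRequestLog(
+  toolName: string,
+  params: unknown,
+  requestLoggingEnabled: boolean,
+  execute: () => Promise<unknown>
+): Promise<unknown> {
+  const start = Date.now();
+  const entry: RequestLogEntry = {
+    tool_name: toolName,
+    request_params: params,
+    duration_ms: 0,
+    success: false,
+  };
+  try {
+    const response = await execute();
+    entry.success = true;
+    // Privacy control: omit the full response unless the installation opted in
+    if (requestLoggingEnabled) {
+      entry.tool_response = response;
+    }
+    return response;
+  } catch (err) {
+    entry.error_message = err instanceof Error ? err.message : String(err);
+    throw err;
+  } finally {
+    entry.duration_ms = Date.now() - start;
+    // queueRequestLog(entry); // hypothetical hook into the satellite's batch buffer
+  }
+}
+```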
+ +## Related Documentation + +- [Status Tracking](/development/satellite/status-tracking) - Tool filtering by server status +- [Recovery System](/development/satellite/recovery-system) - Automatic re-discovery on recovery +- [Event Emission](/development/satellite/event-emission) - Tool metadata events +- [Log Capture](/development/satellite/log-capture) - Request logging system +- [Hierarchical Router](/development/satellite/hierarchical-router) - How tools are exposed to MCP clients + The unified tool discovery implementation provides a solid foundation for multi-transport MCP server integration while maintaining simplicity and reliability for development and production use. diff --git a/docs.json b/docs.json index 47479f6..25fd8ef 100644 --- a/docs.json +++ b/docs.json @@ -207,6 +207,15 @@ "/development/satellite/mcp-server-token-injection" ] }, + { + "group": "Status & Health Tracking", + "pages": [ + "/development/satellite/status-tracking", + "/development/satellite/event-emission", + "/development/satellite/log-capture", + "/development/satellite/recovery-system" + ] + }, { "group": "Backend Communication", "pages": [ From 080b90af4e8383d02324e26f2c58068547709256 Mon Sep 17 00:00:00 2001 From: Lasim Date: Thu, 25 Dec 2025 09:49:40 +0100 Subject: [PATCH 2/2] docs(satellite): update documentation for status tracking, health checks, and OAuth token handling --- development/backend/plugins.mdx | 8 +- development/backend/satellite/commands.mdx | 53 ++++- .../backend/satellite/communication.mdx | 186 +++++++++++++++++- development/backend/satellite/events.mdx | 52 ++++- development/satellite/architecture.mdx | 6 +- .../satellite/backend-communication.mdx | 4 +- development/satellite/event-emission.mdx | 17 +- development/satellite/index.mdx | 8 +- development/satellite/log-capture.mdx | 13 +- .../satellite/mcp-server-token-injection.mdx | 4 +- development/satellite/process-management.mdx | 8 +- development/satellite/recovery-system.mdx | 9 +- development/satellite/status-tracking.mdx | 17 +- 13 files changed, 331 insertions(+), 54 deletions(-) diff --git a/development/backend/plugins.mdx b/development/backend/plugins.mdx index e1dbdf4..58e76d1 100644 --- a/development/backend/plugins.mdx +++ b/development/backend/plugins.mdx @@ -313,8 +313,8 @@ The `databaseExtension` property allows your plugin to: #### How Plugin Database Tables Work **Security Architecture:** -- **Phase 1 (Trusted)**: Core migrations run first (static, secure) -- **Phase 2 (Untrusted)**: Plugin tables created dynamically (sandboxed) +- **Stage 1 (Trusted)**: Core migrations run first (static, secure) +- **Stage 2 (Untrusted)**: Plugin tables created dynamically (sandboxed) - **Clear Separation**: Plugin tables cannot interfere with core database structure **Dynamic Table Creation:** @@ -421,7 +421,7 @@ The database initialization follows a strict security-first approach: ``` ┌─────────────────────────────────────────┐ -│ Phase 1: Core System (Trusted) │ +│ Stage 1: Core System (Trusted) │ ├─────────────────────────────────────────┤ │ 1. Apply core migrations │ │ 2. Create core tables │ @@ -430,7 +430,7 @@ The database initialization follows a strict security-first approach: │ ▼ Security Boundary ┌─────────────────────────────────────────┐ -│ Phase 2: Plugin System (Sandboxed) │ +│ Stage 2: Plugin System (Sandboxed) │ ├─────────────────────────────────────────┤ │ 1. Generate CREATE TABLE SQL │ │ 2. 
Drop existing plugin tables │ diff --git a/development/backend/satellite/commands.mdx b/development/backend/satellite/commands.mdx index 0608d69..c8cdeaf 100644 --- a/development/backend/satellite/commands.mdx +++ b/development/backend/satellite/commands.mdx @@ -32,7 +32,7 @@ The system supports 5 command types defined in the `command_type` enum: | `spawn` | Start MCP server process | Launch HTTP proxy or stdio process | | `kill` | Stop MCP server process | Terminate process gracefully | | `restart` | Restart MCP server | Stop and start process | -| `health_check` | Verify server health | Call tools/list to check connectivity | +| `health_check` | Verify server health and validate credentials | Check connectivity or validate OAuth tokens | ### Configure Commands @@ -74,6 +74,30 @@ interface CommandPayload { } ``` +## Status Changes Triggered by Commands + +Commands trigger installation status changes through satellite event emission: + +| Command | Status Before | Status After | When | +|---------|--------------|--------------|------| +| `configure` (install) | N/A | `provisioning` → `command_received` → `connecting` | Installation creation flow | +| `configure` (update) | `online` | `restarting` → `online` | Configuration change applied | +| `configure` (delete) | Any | Process terminated | Installation removal | +| `health_check` (credential) | `online` | `requires_reauth` | OAuth token invalid | +| `restart` | `online` | `restarting` → `online` | Manual restart requested | + +**Status Lifecycle on Installation**: +1. Backend creates installation → status=`provisioning` +2. Backend sends `configure` command → status=`command_received` +3. Satellite connects to server → status=`connecting` +4. Satellite discovers tools → status=`discovering_tools` +5. Satellite syncs tools to backend → status=`syncing_tools` +6. Process complete → status=`online` + +For complete status transition documentation, see [Backend Events - Status Values](/development/backend/satellite/events#mcp-server-status_changed). + +--- + ## Command Event Types All `configure` commands include an `event` field in the payload for tracking and logging: @@ -168,6 +192,14 @@ await satelliteCommandService.notifyMcpRecovery( **Payload**: `event: 'mcp_recovery'` +**Status Flow**: +- Triggered by health check detecting offline installation +- Sets status to `connecting` +- Satellite rediscovers tools +- Status progresses: offline → connecting → discovering_tools → online + +For complete recovery system documentation, see [Backend Communication - Auto-Recovery](/development/backend/satellite/communication#auto-recovery-system). + ## Critical Pattern **ALWAYS use the correct convenience method**: @@ -247,9 +279,22 @@ When satellites receive commands: 3. Execute spawn sequence **For `health_check` commands**: -1. Call tools/list on target server -2. Verify response -3. Report health status +1. Check `payload.check_type` field: + - `connectivity` (default): Call tools/list to verify server responds + - `credential_validation`: Validate OAuth tokens for installation +2. Execute appropriate validation +3. 
Report health status via `mcp.server.status_changed` event: + - `online` - Health check passed + - `requires_reauth` - OAuth token expired/revoked + - `error` - Validation failed with error + +**Credential Validation Flow**: +- Backend cron job sends `health_check` command with `check_type: 'credential_validation'` +- Satellite validates OAuth token (performs token refresh test) +- Emits status event based on validation result +- Backend updates `mcpServerInstallations.status` and `last_credential_check_at` + +For satellite-side credential validation implementation, see [Satellite OAuth Authentication](/development/satellite/oauth-authentication). ## Example Usage diff --git a/development/backend/satellite/communication.mdx b/development/backend/satellite/communication.mdx index 3eefcce..a6a9dde 100644 --- a/development/backend/satellite/communication.mdx +++ b/development/backend/satellite/communication.mdx @@ -106,20 +106,20 @@ The system uses three distinct communication patterns: ### Security Architecture -The satellite pairing process implements a secure **two-phase JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication). +The satellite pairing process implements a secure **two-step JWT-based authentication system** that prevents unauthorized satellite connections. For complete implementation details, see [API Security - Registration Token Authentication](/development/backend/api/security#registration-token-authentication). -**Phase 1: Token Generation** +**Step 1: Token Generation** - Administrators generate temporary registration tokens through admin APIs - Scope-specific tokens (global vs team) with cryptographic signatures - Token management endpoints for generation, listing, and revocation -**Phase 2: Satellite Registration** +**Step 2: Satellite Registration** - Satellites authenticate using `Authorization: Bearer deploystack_satellite_*` headers - Backend validates JWT tokens with single-use consumption - Permanent API keys issued after successful token validation - Token consumed to prevent replay attacks -**Breaking Change**: As of Phase 3 implementation, all new satellite registrations require valid registration tokens. The open registration system has been secured. +**Note**: All new satellite registrations require valid registration tokens. The open registration system has been secured. ### Registration Middleware @@ -261,6 +261,153 @@ Configuration respects team boundaries and isolation: - Team-defined security policies - Internal resource access settings +## Frontend API Endpoints + +The backend provides REST and SSE endpoints for frontend access to installation status, logs, and requests. 
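+
+As a quick illustration of how a client combines the two channels, the sketch below loads an initial status snapshot via REST and then subscribes to the log stream via SSE. The routes come from the subsections that follow; the response field names are assumed, not guaranteed:
+
+```typescript
+// Sketch only — endpoint paths are documented below; the response shape is assumed.
+async function monitorInstallation(teamId: string, installationId: string): Promise<EventSource> {
+  const base = `/api/teams/${teamId}/mcp/installations/${installationId}`;
+
+  // Initial snapshot via REST (pull)
+  const res = await fetch(`${base}/status`);
+  const snapshot = await res.json();
+  console.log('status:', snapshot.status, snapshot.status_message);
+
+  // Live updates via SSE (push); EventSource reconnects automatically on connection loss
+  const stream = new EventSource(`${base}/logs/stream`);
+  stream.onmessage = (event) => console.log('log entry:', JSON.parse(event.data));
+  return stream;
+}
+```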
+ +### Status & Monitoring Endpoints + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/status`** +- Returns current installation status, status message, and last update timestamp +- Used by frontend for real-time status badges and progress indicators + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs`** +- Returns paginated server logs (stderr output, connection errors) +- Query params: `limit`, `offset` for pagination +- Limited to 100 lines per installation (enforced by cleanup cron job) + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests`** +- Returns paginated request logs (tool execution history) +- Includes request params, duration, success status +- Response data included if `request_logging_enabled=true` + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/{requestId}`** +- Returns detailed request log for specific execution +- Includes full request/response payloads when available + +### Settings Management + +**PATCH `/api/teams/{teamId}/mcp/installations/{installationId}/settings`** +- Updates installation settings (stored in `mcpServerInstallations.settings` jsonb column) +- Settings distributed to satellites via config endpoint +- Current settings: + - `request_logging_enabled` (boolean) - Controls capture of tool responses + +### Real-Time Streaming (SSE) + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/logs/stream`** +- Server-Sent Events endpoint for real-time log streaming +- Frontend subscribes for live stderr output +- Auto-reconnects on connection loss + +**GET `/api/teams/{teamId}/mcp/installations/{installationId}/requests/stream`** +- Server-Sent Events endpoint for real-time request log streaming +- Frontend subscribes for live tool execution updates +- Includes duration, status, and optionally response data + +**SSE vs REST Comparison**: +| Feature | REST Endpoints | SSE Endpoints | +|---------|---------------|---------------| +| Use Case | Historical data, pagination | Real-time updates | +| Connection | Request/response | Persistent connection | +| Data Flow | Pull (client requests) | Push (server sends) | +| Frontend Usage | Initial load, manual refresh | Live monitoring | + +**SSE Controller Implementation**: `services/backend/src/controllers/mcp/sse.controller.ts` + +**Routes Implementation**: `services/backend/src/routes/api/teams/mcp/installations.routes.ts` + +--- + +## Health Check & Recovery Systems + +### Cumulative Health Check System + +**Purpose**: Template-level health aggregation across all installations of an MCP server. + +**McpHealthCheckService** (`services/backend/src/services/mcp-health-check.service.ts`): +- Aggregates health status from all installations of each MCP server template +- Updates `mcpServers.health_status` based on installation health +- Provides template-level health visibility in admin dashboard + +**Cron Job**: `mcp-health-check` runs every 3 minutes +- Implementation: `services/backend/src/jobs/mcp-health-check.job.ts` +- Checks all MCP server templates +- Updates template health status for admin visibility + +### Credential Validation System + +**Purpose**: Per-installation OAuth token validation to detect expired/revoked credentials. 
+ +**McpCredentialValidationWorker** (`services/backend/src/workers/mcp-credential-validation.worker.ts`): +- Validates OAuth tokens for each installation +- Sends `health_check` command to satellite with `check_type: 'credential_validation'` +- Satellite performs OAuth validation and reports status + +**Cron Job**: `mcp-credential-validation` runs every 1 minute +- Implementation: `services/backend/src/jobs/mcp-credential-validation.job.ts` +- Validates installations on 15-minute rotation +- Triggers `requires_reauth` status on validation failure + +**Health Check Command Payload**: +```json +{ + "commandType": "health_check", + "priority": "immediate", + "payload": { + "check_type": "credential_validation", + "installation_id": "inst_123", + "team_id": "team_xyz" + } +} +``` + +Satellite validates credentials and emits `mcp.server.status_changed` with status: +- `online` - Credentials valid +- `requires_reauth` - OAuth token expired/revoked +- `error` - Validation failed with error + +### Auto-Recovery System + +**Recovery Trigger**: +- Health check system detects offline installations +- Backend calls `notifyMcpRecovery(installation_id, team_id)` +- Sends command to satellite: Set status=`connecting`, rediscover tools +- Status progression: offline → connecting → discovering_tools → online + +**Tool Execution Recovery**: +- Satellite detects recovery during tool execution (offline server responds) +- Emits immediate status change event (doesn't wait for health check) +- Triggers asynchronous re-discovery + +For satellite-side recovery implementation, see [Satellite Recovery System](/development/satellite/recovery-system). + +--- + +## Background Cron Jobs + +The backend runs three MCP-related cron jobs for maintenance and monitoring: + +**cleanup-mcp-server-logs**: +- **Schedule**: Every 10 minutes +- **Purpose**: Enforce 100-line limit per installation in `mcpServerLogs` table +- **Action**: Deletes oldest logs beyond 100-line limit +- **Implementation**: `services/backend/src/jobs/cleanup-mcp-server-logs.job.ts` + +**mcp-health-check**: +- **Schedule**: Every 3 minutes +- **Purpose**: Template-level health aggregation +- **Action**: Updates `mcpServers.health_status` column +- **Implementation**: `services/backend/src/jobs/mcp-health-check.job.ts` + +**mcp-credential-validation**: +- **Schedule**: Every 1 minute +- **Purpose**: Detect expired/revoked OAuth tokens +- **Action**: Sends `health_check` commands to satellites +- **Implementation**: `services/backend/src/jobs/mcp-credential-validation.job.ts` + +--- + ## Database Schema Integration ### Core Table Structure @@ -298,6 +445,37 @@ The satellite system integrates with existing DeployStack schema through 5 speci - Alert generation and notification triggers - Historical health trend analysis +### New Columns Added (Status & Health Tracking System) + +**mcpServerInstallations** table: +- `status` (text) - Current installation status (11 possible values) +- `status_message` (text, nullable) - Human-readable status context or error details +- `status_updated_at` (timestamp) - Last status change timestamp +- `last_health_check_at` (timestamp, nullable) - Last health check execution time +- `last_credential_check_at` (timestamp, nullable) - Last credential validation time +- `settings` (jsonb, nullable) - Generic settings object (e.g., `request_logging_enabled`) + +**mcpServers** table: +- `health_status` (text, nullable) - Template-level aggregated health status +- `last_health_check_at` (timestamp, nullable) - Last template health 
check time +- `health_check_error` (text, nullable) - Last health check error message + +**mcpServerLogs** table: +- Stores batched stderr logs from satellites +- 100-line limit per installation (enforced by cleanup cron job) +- Fields: `installation_id`, `team_id`, `log_level`, `message`, `timestamp` + +**mcpRequestLogs** table: +- Stores batched tool execution logs +- `tool_response` (jsonb, nullable) - MCP server response data +- Privacy control: Only captured when `request_logging_enabled=true` +- Fields: `installation_id`, `team_id`, `tool_name`, `request_params`, `tool_response`, `duration_ms`, `success`, `error_message`, `timestamp` + +**mcpToolMetadata** table: +- Stores discovered tools with token counts +- Used for hierarchical router token savings calculations +- Fields: `installation_id`, `server_slug`, `tool_name`, `description`, `input_schema`, `token_count`, `discovered_at` + ### Team Isolation in Data Model All satellite data respects team boundaries: diff --git a/development/backend/satellite/events.mdx b/development/backend/satellite/events.mdx index 7524a68..d77b0e4 100644 --- a/development/backend/satellite/events.mdx +++ b/development/backend/satellite/events.mdx @@ -197,18 +197,23 @@ Updates `mcpServerInstallations` table when server status changes during install **Optional Fields**: `status_message` (string, human-readable context or error details) -**Status Values**: +**Status Values** (11 total): - `provisioning` - Installation created, waiting for satellite - `command_received` - Satellite acknowledged install command - `connecting` - Satellite connecting to MCP server - `discovering_tools` - Tool discovery in progress - `syncing_tools` - Sending discovered tools to backend - `online` - Server healthy and responding +- `restarting` - Configuration changed, server restarting - `offline` - Server unreachable - `error` - Connection failed with specific error - `requires_reauth` - OAuth token expired/revoked - `permanently_failed` - Process crashed 3+ times in 5 minutes +**Handler Implementation**: `services/backend/src/events/handlers/mcp/status-changed.handler.ts` + +For satellite-side status detection logic and lifecycle flows, see [Satellite Status Tracking](/development/satellite/status-tracking). + **Emission Points**: - Success path: After successful tool discovery → status='online' - Failure path: On connection errors → status='offline', 'error', or 'requires_reauth' @@ -225,6 +230,48 @@ Inserts record into `satelliteUsageLogs` for analytics and audit trails. **Optional Fields**: `error_message` (string, only present when success=false) +### Logging Events + +#### mcp.server.logs + +Inserts batched stderr output from MCP servers into `mcpServerLogs` table for debugging and monitoring. + +**Business Logic**: Captures stderr output, connection errors, and process lifecycle events. Limited to 100 lines per installation via cleanup cron job. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `logs` (array of log entries) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/server-logs.handler.ts` + +Event batching strategy (3-second interval, max 20 per batch) is documented in [Satellite Event Emission](/development/satellite/event-emission). + +#### mcp.request.logs + +Inserts batched tool execution logs into `mcpRequestLogs` table with full request/response data for audit trails. + +**Business Logic**: Captures tool execution with request parameters, response data, duration, and success status. 
Privacy controlled via `mcpServerInstallations.settings.request_logging_enabled`. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `tool_name`, `request_params`, `duration_ms`, `success` + +**Optional Fields**: `tool_response` (jsonb), `error_message` (string) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/request-logs.handler.ts` + +**Database Storage**: `mcpRequestLogs.tool_response` column stores MCP server responses when request logging is enabled. + +### Tool Discovery Events + +#### mcp.tools.discovered + +Updates `mcpToolMetadata` table with discovered tools, token counts, and tool schemas from MCP servers. + +**Business Logic**: Stores tool metadata for team visibility, hierarchical router token savings calculations, and frontend tool catalog display. + +**Required Fields** (snake_case): `installation_id`, `team_id`, `server_slug`, `tool_count`, `total_tokens`, `tools` (array) + +**Handler Implementation**: `services/backend/src/events/handlers/mcp/tools-discovered.handler.ts` + +For satellite-side tool discovery implementation, see [Satellite Tool Discovery](/development/satellite/tool-discovery). + ## Creating New Event Handlers ### Handler Template @@ -339,6 +386,9 @@ Events route to existing business tables based on their purpose: | `mcp.server.crashed` | `satelliteProcesses` | Update status='failed', log error details | | `mcp.server.status_changed` | `mcpServerInstallations` | Update status, status_message, status_updated_at | | `mcp.tool.executed` | `satelliteUsageLogs` | Insert usage record with metrics | +| `mcp.server.logs` | `mcpServerLogs` | Insert batched stderr logs (100-line limit) | +| `mcp.request.logs` | `mcpRequestLogs` | Insert tool execution logs with request/response | +| `mcp.tools.discovered` | `mcpToolMetadata` | Update tool metadata with token counts | ### Transaction Strategy diff --git a/development/satellite/architecture.mdx b/development/satellite/architecture.mdx index 0183d01..7df1c60 100644 --- a/development/satellite/architecture.mdx +++ b/development/satellite/architecture.mdx @@ -442,14 +442,14 @@ For testing the hierarchical router (tool discovery and execution), see [Hierarc ## Implementation Status -The satellite service has completed **Phase 1: MCP Transport Implementation** and **Phase 4: Backend Integration**. Current implementation provides: +The satellite service has completed MCP Transport Implementation and Backend Integration. Current implementation provides: -**Phase 1 - MCP Transport Layer:** +**MCP Transport Layer:** - **Complete MCP Transport Layer**: SSE, SSE Messaging, Streamable HTTP - **Session Management**: Cryptographically secure with automatic cleanup - **JSON-RPC 2.0 Compliance**: Full protocol support with error handling -**Phase 4 - Backend Integration:** +**Backend Integration:** - **Command Polling Service**: Adaptive polling with three modes (normal/immediate/error) - **Dynamic Configuration Management**: Replaces hardcoded MCP server configurations - **Command Processing**: HTTP MCP server management (spawn/kill/restart/health_check) diff --git a/development/satellite/backend-communication.mdx b/development/satellite/backend-communication.mdx index b066638..f2d9018 100644 --- a/development/satellite/backend-communication.mdx +++ b/development/satellite/backend-communication.mdx @@ -286,7 +286,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions. ### Authentication Flow -**Registration Phase:** +**Registration:** 1. 
Admin generates JWT registration token via backend API 2. Satellite includes token in Authorization header during registration 3. Backend validates token signature, scope, and expiration @@ -295,7 +295,7 @@ See `services/backend/src/db/schema.ts` for complete schema definitions. For detailed token validation process, see [Registration Security](/development/backend/satellite-communication#satellite-pairing-process). -**Operational Phase:** +**Ongoing Operations:** 1. All requests include `Authorization: Bearer {api_key}` 2. Backend validates API key and satellite scope 3. Team context extracted from satellite registration diff --git a/development/satellite/event-emission.mdx b/development/satellite/event-emission.mdx index b4d96ba..46ae8e9 100644 --- a/development/satellite/event-emission.mdx +++ b/development/satellite/event-emission.mdx @@ -409,14 +409,15 @@ Each event type has a dedicated backend handler: - Emits `mcp.tools.discovered` after successful discovery - Coordinates status callbacks from discovery managers -## Implementation References - -**Phase 3:** Backend event handler system -**Phase 4:** Satellite status event emission -**Phase 7:** Server and request log batching -**Phase 10:** Tool metadata event emission -**Phase 13:** Stdio permanently_failed event -**Phase 18:** Tool execution failure status events +## Implementation Components + +The event emission system consists of several integrated components: +- Backend event handler system +- Satellite status event emission +- Server and request log batching +- Tool metadata event emission +- Stdio permanently_failed event +- Tool execution failure status events ## Related Documentation diff --git a/development/satellite/index.mdx b/development/satellite/index.mdx index f99b41e..1315a32 100644 --- a/development/satellite/index.mdx +++ b/development/satellite/index.mdx @@ -214,7 +214,7 @@ npm run release # Release management ## Implemented Features -### Phase 2: MCP Server Process Management +### MCP Server Process Management - **Process Lifecycle**: Spawn, monitor, auto-restart (max 3), and terminate MCP servers - **stdio Communication**: Full JSON-RPC 2.0 protocol over stdin/stdout - **HTTP Proxy**: Reverse proxy for external MCP server endpoints working @@ -223,20 +223,20 @@ npm run release # Release management - **Tool Discovery**: Automatic tool caching from both HTTP and stdio servers - **Team-Grouped Heartbeat**: processes_by_team reporting every 30 seconds -### Phase 3: Team Isolation +### Team Isolation - **nsjail Sandboxing**: Complete process isolation with built-in resource limits - **Namespace Isolation**: PID, mount, UTS, IPC namespaces per team - **Filesystem Isolation**: Team-specific read-only and writable directories - **Credential Management**: Secure environment injection via nsjail -### Phase 4: Backend Integration +### Backend Integration - **HTTP Polling**: Outbound communication with DeployStack Backend - **Configuration Sync**: Dynamic configuration updates from Backend - **Status Reporting**: Real-time satellite health and usage metrics - **Command Processing**: Execute Backend commands with acknowledgment - **Event System**: Real-time event emission with automatic batching (10 event types) -### Phase 5: Enterprise Features +### Enterprise Features - **OAuth 2.1 Authentication**: Resource server with token introspection - **Audit Logging**: Complete audit trails for compliance - **Multi-Region Support**: Global satellite deployment diff --git a/development/satellite/log-capture.mdx 
b/development/satellite/log-capture.mdx
index 4f83bfd..6a01d4a 100644
--- a/development/satellite/log-capture.mdx
+++ b/development/satellite/log-capture.mdx
@@ -163,7 +163,7 @@ Request logs capture tool execution with full request parameters and server resp
 For each tool execution:
 - Tool name (e.g., `github:list-repos`)
 - Input parameters sent to tool
-- **Full response from MCP server** (captured in Phase 14)
+- **Full response from MCP server** (when request logging is enabled)
 - Response time in milliseconds
 - Success/failure status
 - Error message (if failed)
@@ -436,12 +436,13 @@ cleanup() {
 }
 ```
 
-## Implementation References
+## Implementation Components
 
-**Phase 7:** Server and request log batching implementation
-**Phase 14:** Request logging toggle and tool response capture
-**Phase 5:** Backend log tables and event handlers
-**Phase 6:** 100-line cleanup job
+The log capture system consists of several integrated components:
+- Server and request log batching implementation
+- Request logging toggle and tool response capture
+- Backend log tables and event handlers
+- 100-line cleanup job
 
 ## Related Documentation
 
diff --git a/development/satellite/mcp-server-token-injection.mdx b/development/satellite/mcp-server-token-injection.mdx
index f1cb108..75cfd8a 100644
--- a/development/satellite/mcp-server-token-injection.mdx
+++ b/development/satellite/mcp-server-token-injection.mdx
@@ -341,7 +341,7 @@ private isCacheValid(cachedAt: number, expiresAt: string | null): boolean {
 async handleHttpToolCall(serverName: string, originalToolName: string, args: unknown) {
   const config = this.serverConfigs.get(serverName);
 
-  // Phase 10: OAuth token injection for HTTP/SSE MCP servers
+  // OAuth token injection for HTTP/SSE MCP servers
   let headers: Record<string, string> = {};
 
   // Add regular headers from config (API keys, custom headers, etc.)
@@ -447,7 +447,7 @@ async handleHttpToolCall(serverName: string, originalToolName: string, args: unk
 ```typescript
 // From remote-tool-discovery-manager.ts:376-440
 async discoverServerTools(serverName: string, config: ServerConfig) {
-  // Phase 10: OAuth token injection for tool discovery
+  // OAuth token injection for tool discovery
   let headers: Record<string, string> = {};
 
   // Add regular headers from config (API keys, custom headers, etc.)
diff --git a/development/satellite/process-management.mdx b/development/satellite/process-management.mdx
index 5c8ceb3..6a10fc3 100644
--- a/development/satellite/process-management.mdx
+++ b/development/satellite/process-management.mdx
@@ -147,17 +147,17 @@ All communication uses newline-delimited JSON following JSON-RPC 2.0 specificati
 
 ### Graceful Termination
 
-Process termination follows a two-phase graceful shutdown approach to ensure clean process exit and proper resource cleanup.
+Process termination follows a two-step graceful shutdown approach to ensure clean process exit and proper resource cleanup. 
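+
+A minimal sketch of the two-step pattern, assuming a Node.js child process (the satellite's actual implementation and its defaults are described in the steps below):
+
+```typescript
+import type { ChildProcess } from 'node:child_process';
+
+// Sketch: SIGTERM first, SIGKILL only if the process outlives the timeout.
+function terminateGracefully(proc: ChildProcess, timeoutMs = 10_000): Promise<void> {
+  return new Promise((resolve) => {
+    const killTimer = setTimeout(() => {
+      proc.kill('SIGKILL'); // Step 2: force termination — cannot be caught or ignored
+    }, timeoutMs);
+
+    proc.once('exit', () => {
+      clearTimeout(killTimer);
+      resolve();
+    });
+
+    proc.kill('SIGTERM'); // Step 1: request graceful shutdown
+  });
+}
+```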
-#### Termination Phases +#### Termination Steps -**Phase 1: SIGTERM (Graceful Shutdown)** +**Step 1: SIGTERM (Graceful Shutdown)** - Send SIGTERM signal to the process - Process has 10 seconds (default timeout) to shut down gracefully - Process can complete in-flight operations and cleanup resources - Wait for process to exit voluntarily -**Phase 2: SIGKILL (Force Termination)** +**Step 2: SIGKILL (Force Termination)** - If process doesn't exit within timeout period - Send SIGKILL signal to force immediate termination - Guaranteed process termination (cannot be caught or ignored) diff --git a/development/satellite/recovery-system.mdx b/development/satellite/recovery-system.mdx index 84527a5..bf8123d 100644 --- a/development/satellite/recovery-system.mdx +++ b/development/satellite/recovery-system.mdx @@ -356,11 +356,12 @@ Some failures cannot auto-recover: See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays). -## Implementation References +## Implementation Components -**Phase 13:** Stdio auto-recovery and permanently_failed status -**Phase 18:** Tool execution retry logic and recovery detection -**Phase 8:** Health check recovery via backend +The recovery system consists of several integrated components: +- Stdio auto-recovery and permanently_failed status +- Tool execution retry logic and recovery detection +- Health check recovery via backend ## Related Documentation diff --git a/development/satellite/status-tracking.mdx b/development/satellite/status-tracking.mdx index a530c55..710408d 100644 --- a/development/satellite/status-tracking.mdx +++ b/development/satellite/status-tracking.mdx @@ -267,14 +267,15 @@ Unavailable server: ${serverSlug}` | Server recovery detected | `connecting` | Previously offline server responds | | Stdio crashes 3 times | `permanently_failed` | 3 crashes within 5 minutes | -## Implementation References - -**Phase 1:** Database schema for status field -**Phase 3:** Backend event handler for status updates -**Phase 4:** Satellite status event emission -**Phase 10:** Tool availability filtering by status -**Phase 17:** Configuration update status transitions -**Phase 18:** Tool execution status updates + auto-recovery +## Implementation Components + +The status tracking system consists of several integrated components: +- Database schema for status field +- Backend event handler for status updates +- Satellite status event emission +- Tool availability filtering by status +- Configuration update status transitions +- Tool execution status updates with auto-recovery ## Related Documentation