diff --git a/docs/IndexerRefactorStrategy.md b/docs/IndexerRefactorStrategy.md new file mode 100644 index 000000000..641500183 --- /dev/null +++ b/docs/IndexerRefactorStrategy.md @@ -0,0 +1,328 @@ +# Ocean Node Indexer - Event Monitoring & Error Handling + +## Table of Contents + +1. [Overview](#overview) +2. [🔴 PROPOSED IMPROVEMENTS (Post-Meeting Changes)](#-proposed-improvements-post-meeting-changes) + +--- + +## Overview + +### Current Indexer Architecture + +The Ocean Node Indexer is built with the following design principles (see [Architecture.md](./Arhitecture.md) for details): + +- **Single-threaded, non-blocking design**: Uses Node.js async/await for concurrent execution across multiple chains +- **ChainIndexer instances**: Each blockchain network is monitored by a dedicated ChainIndexer instance running concurrently via the event loop +- **Event-driven communication**: Components communicate through EventEmitter for clean separation of concerns +- **Efficient I/O handling**: All RPC calls, database operations, and network requests are non-blocking, allowing high concurrency without worker threads + +### Proposed Architecture Evolution + +The refactoring strategy below maintains the core single-threaded, non-blocking architecture while introducing key improvements: + +1. **EventEmitter → Persistent Queues**: Replace synchronous EventEmitter with persistent queue system for better reliability and observability +2. **Event-level retry**: Move retry logic from embedded operations to event-level processing +3. **Enhanced error tracking**: Introduce comprehensive logging via `ddo_logs` index +4. **Per-chain resilience**: Isolate chain failures to prevent cascading issues + +These changes preserve the efficient I/O model and concurrent ChainIndexer execution while adding production-grade error handling and monitoring. + +--- + +## 🔴 PROPOSED IMPROVEMENTS (Post-Meeting Changes) + +> **Status:** Draft proposals from Jan 27, 2026 meeting +> **Goal:** Improve reliability, decoupling, and error handling + +### 1. 🎯 EVENT-LEVEL RETRY MECHANISM WITH QUEUES + +**Current Issue:** Retry logic is deeply embedded in event processing steps (e.g., inside DDO decryption) + +**Proposed Change:** + +- **Move retry logic to event level** (not deep inside processing steps) +- **Implement queue-based retry system** for all 12 event types +- **Decouple retry from specific operations** (e.g., decrypt, p2p, HTTP) + +**Implementation:** + +``` +┌─────────────────────────────────────────────────────────────┐ +│ EVENT ERROR PROCESSING QUEUE │ +├─────────────────────────────────────────────────────────────┤ +│ │ +│ Event Detected │ +│ ↓ │ +│ Send to Processor │ +│ ↓ │ +│ Process Event │ +│ ├─ Success → Mark complete, update DB │ +│ └─ Failure → Add to Retry Queue with backoff │ +│ │ +│ Retry Queue (exponential backoff): │ +│ - Retry 1: ~10 seconds │ +│ - Retry 2: ~1 minute │ +│ - Retry 3: ~10 minutes │ +│ - Retry 4: ~1 hour │ +│ - Retry 5: ~1 week (final attempt) │ +│ │ +│ Benefits: │ +│ ✓ Non-blocking (doesn't halt chain indexing) │ +│ ✓ Works for ALL error types (HTTP, P2P, RPC, DB) │ +│ ✓ Visible retry state in monitoring │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Applies to:** All event processors, especially METADATA_CREATED/UPDATED + +--- + +### 2. 🗄️ NEW DATABASE INDEX: `ddo_logs` + +**Current Issue:** + +- `ddoState` only tracks metadata events +- Order and pricing events have no error tracking +- No unified view of all DDO-related events + +**Proposed Change:** + +- Create new DB index: **`ddo_logs`** +- Store **all events** related to a DID (metadata, orders, pricing) +- Similar structure to `ddoState` but broader scope +- **Add handler and routes (HTTP + P2P)** to query all information about a DID, transaction, or event + - Similar to existing `ddo-state` handler but for comprehensive logs + - Enable querying by: `did`, `txHash`, `blockNumber`, `eventType` + - Support both HTTP API endpoints and P2P protocol for distributed querying + +**Schema:** + +```typescript +interface DdoLog { + did: string // Indexed + chainId: number // Indexed + eventType: string // METADATA_CREATED, ORDER_STARTED, etc. + eventHash: string // Event signature hash + txHash: string // Transaction hash + blockNumber: number // Block number + timestamp: number // Event timestamp + status: 'success' | 'failed' | 'retrying' + error?: string // Error message if failed + retryCount: number // Number of retry attempts default 0 + lastRetry?: number // Timestamp of last retry + metadata?: Record // Event-specific data +} +``` + +**Benefits:** + +- Single source of truth for all DDO events +- Easier debugging (see all events for a DID) +- Track pricing/order event errors (not just metadata) +- Audit trail for compliance + +--- + +### 3. 🔄 REPLACE EventEmitter WITH QUEUES + +**Current Issue:** + +- Using `EventEmitter` for communication +- Synchronous, blocking behavior +- No retry/replay capability +- Difficult to test + +**Proposed Change:** + +- Replace `EventEmitter` with **persistent queue system** +- Use queue for: + - ✓ Newly indexed assets (instead of `eventEmitter.emit()`) + - ✓ Reindex requests (block & transaction level) + - ✓ Admin commands + +**Queue Types:** + +``` +1. EVENT_PROCESSING_QUEUE (primary) + - New events from blockchain + - Priority: FIFO with retry backoff + +2. REINDEX_QUEUE (existing, enhance) + - Block-level reindex + - Transaction-level reindex + - Priority: Admin requests > Auto-retry + +3. ORDER_QUEUE (new) + - Store orders even if DDO not found + - Process when DDO becomes available +``` + +**Benefits:** + +- Testable (can inject mock queue) +- Observable (queue depth, retry counts) +- Resilient (survives crashes) +- Decoupled (no tight coupling between components) + +--- + +### 4. 📦 HANDLE MISSING DDO IN ORDER/PRICING EVENTS + +**Current Issue:** + +- If DDO not found → skip order/pricing event +- Lost data if DDO indexed later + +**Proposed Change:** + +**Unified Queue-Based Approach for Both Orders and Pricing Events:** + +``` +IF DDO not found: + 1. Check if DDO exists in database + 2. If not found → add event to pending queue + 3. Store event in ddo_logs with status: 'pending_ddo' + 4. Link event to DID for future reconciliation + 5. When DDO is successfully indexed: + → Process all pending events for that DID (orders + pricing) + → Update event status from 'pending_ddo' to 'success' or 'failed' + → Maintain event order based on blockNumber and logIndex +``` + +**Queue Structure:** + +```typescript +interface PendingEvent { + did: string + eventType: string // ORDER_STARTED, ORDER_REUSED, DISPENSER_*, EXCHANGE_* + chainId: number + txHash: string + blockNumber: number + timestamp: number + retryCount: number + queuedAt: number +} +``` + +**Benefits:** + +- **Consistent approach** for all event types (orders + pricing) +- **No data loss** - all events queued and processed eventually +- **Maintains event order** using blockNumber and logIndex +- **Automatic reconciliation** when DDO becomes available +- **Better observability** - track pending events per DID +- **Prevents orphaned records** - only create records when DDO exists + +--- + +### 5. 🚫 MOVE RETRY LOGIC TO ChainIndexer (Block Only That Chain) + +**Current Issue:** + +- Crawler startup retry in `OceanIndexer` +- Failure blocks **entire node** (all chains) + +**Proposed Change:** + +- Move `retryCrawlerWithDelay()` → **ChainIndexer** +- Each chain fails independently +- Other chains continue indexing + +**Benefits:** + +- Resilient multi-chain indexing +- One bad RPC doesn't kill everything +- Easier debugging (per-chain logs) + +--- + +### 6. 📍 BLOCK RETRY QUEUE IMPROVEMENTS + +**Current Issue:** + +- Failed block retried, but `lastIndexedBlock` not updated +- Same block retried indefinitely +- No expiry/max retry limit + +**Proposed Change:** + +``` +When block added to retry queue: + 1. Update lastIndexedBlock (move forward) + 2. Add block to retry queue with metadata: + - blockNumber + - retryCount (starts at 0) + - maxRetries (default: 5) + - lastError + - expiryDate (when to give up) + 3. Process retry queue separately (exponential backoff) + 4. If maxRetries exceeded → log to failed_blocks table +``` + +**Retry Queue Schema:** + +```typescript +interface BlockRetryTask { + chainId: number + blockNumber: number + retryCount: number + maxRetries: number + lastError: string + lastRetryAt: number +} +``` + +**Benefits:** + +- Indexer moves forward (doesn't get stuck) +- Failed blocks retried in background +- Clear failure tracking + +--- + +### 7. 🌐 REMOVE ECONNREFUSED-ONLY CONDITION + +**Current Issue:** + +- Retry only on `ECONNREFUSED` error +- Other errors (timeout, 500, p2p failures) not retried + +**Proposed Change:** + +- With event-level retry, **retry ALL error types**: + - ✓ RPC errors (timeout, 500, 429 rate limit) + - ✓ HTTP errors (decrypt service down) + - ✓ P2P errors (peer unreachable) + - ✓ Database errors (temp unavailable) + - ✓ Validation errors (maybe retryable) + +--- + +### 8. ✅ UPDATE TESTS + +**Required Test Updates:** + +- Remove tests checking `EventEmitter` behavior +- Add tests for queue-based processing +- Add tests for retry with exponential backoff +- Add tests for orphaned orders +- Add tests for per-chain failure isolation +- Add tests for `ddo_logs` index +- Add tests for block retry with expiry + +--- + +### Summary Table + +| # | Change | Current Pain | Benefit | Effort | Priority | +| --- | --------------------------------------------- | --------------------------------- | ------------------------------------ | ------ | ----------- | +| 1 | Event-level retry + queues | Retry logic scattered, blocking | Unified, non-blocking, testable | High | 🔴 Critical | +| 2 | `ddo_logs` DB index | No order/pricing error tracking | Full audit trail, debugging | Medium | 🟡 High | +| 3 | Replace EventEmitter with queues | Blocking, not testable, no replay | Observable, resilient, testable | High | 🔴 Critical | +| 4 | Handle missing DDO (orphaned orders) | Lost orders/pricing data | No data loss, reconciliation | Medium | 🟡 High | +| 5 | Per-chain startup retry (ChainIndexer) | One failure kills entire node | Isolated failures, resilient | Low | 🔴 Critical | +| 6 | Block retry queue with expiry | Indexer stuck on bad blocks | Progress continues, background retry | Medium | 🟡 High | +| 7 | Retry ALL error types (not just ECONNREFUSED) | P2P/timeout/429 not retried | Comprehensive error handling | Low | 🟡 High | +| 8 | Update tests | Tests assume old architecture | Tests match new architecture | Medium | 🟢 Medium |