
Conversation

@NathanFlurry
Member

No description provided.

Member Author

NathanFlurry commented Jan 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite.

@pkg-pr-new

pkg-pr-new bot commented Jan 26, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4037

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4037

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4037

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4037

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4037

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4037

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4037

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4037

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4037

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4037

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4037

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4037

commit: 7e0a15d

Comment on lines +1 to +55
import { AsyncLocalStorage } from "node:async_hooks";
import { Buffer } from "node:buffer";
import { randomBytes } from "node:crypto";
import { performance } from "node:perf_hooks";
import { decode as decodeCbor, encode as encodeCbor } from "cbor-x";
import { pack, unpack } from "fdb-tuple";
import {
    CHUNK_VERSIONED,
    CURRENT_VERSION,
    READ_RANGE_VERSIONED,
    encodeRecord,
    type ActiveSpanRef,
    type Attributes,
    type Chunk,
    type KeyValue,
    type ReadRangeWire,
    type Record as TraceRecord,
    type RecordBody,
    type SpanEnd,
    type SpanEvent,
    type SpanId,
    type SpanLink,
    type SpanRecordKey,
    type SpanSnapshot,
    type SpanStart,
    type SpanStatus,
    SpanStatusCode,
    type SpanUpdate,
    type StringId,
    type TraceId,
} from "../schemas/versioned.js";
import {
    anyValueFromJs,
    hexFromBytes,
    type OtlpExportTraceServiceRequestJson,
    type OtlpKeyValue,
    type OtlpResource,
    type OtlpSpan,
    type OtlpSpanEvent,
    type OtlpSpanLink,
    type OtlpSpanStatus,
} from "./otlp.js";
import type {
    EndSpanOptions,
    EventOptions,
    ReadRangeOptions,
    ReadRangeResult,
    SpanHandle,
    SpanStatusInput,
    StartSpanOptions,
    Traces,
    TracesDriver,
    TracesOptions,
    UpdateSpanOptions,
} from "./types.js";
Contributor


The imports need to be sorted alphabetically; the Biome linter expects imports in a specific order. Run `biome check --write .` to fix this automatically.

Spotted by Graphite Agent (based on CI logs)


@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from 2bdce1c to 9524546 Compare January 28, 2026 01:12
@jog1t jog1t force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from 9524546 to e9cb986 Compare January 28, 2026 19:55
@claude

claude bot commented Jan 28, 2026

PR Review: feat(rivetkit): traces

This PR introduces a new traces storage and API package (@rivetkit/traces) for RivetKit, implementing an efficient local tracing system with OTLP v1 JSON export capabilities. Overall, this is a well-architected implementation with comprehensive test coverage.


🎯 Architecture & Design

Strengths

  • Excellent design documentation: Both TRACES_SPEC.md and TRACES_DESIGN_NOTES.md provide clear rationale and implementation guidelines
  • Smart storage strategy: Using append-only records with periodic snapshots for long-running spans is the right approach for heavy write loads
  • Efficient encoding: VBARE + CBOR + string table interning provides good compression without sacrificing performance
  • Hybrid read path: Merging in-memory and on-disk data seamlessly handles unflushed records

Concerns

  1. Active span memory management (traces.ts:580-602): The depth-based eviction strategy is good, but consider adding memory pressure monitoring. If you have 10,000 shallow spans with large attributes, you could still exhaust memory before hitting maxActiveSpans.

  2. String table memory leak potential (traces.ts:259-268): The string table is per-chunk and never deduplicated across chunks. For very long-running traces with many unique string values, this could grow unbounded. Consider adding a max string table size check.


🐛 Potential Bugs

Critical

1. Race condition in write chain (traces.ts:559-567)

function enqueueWrite(pending: PendingChunk): void {
    writeChain = writeChain.then(async () => {
        await driver.set(pending.key, pending.bytes);
        const index = pendingChunks.indexOf(pending);
        if (index !== -1) {
            pendingChunks.splice(index, 1);
        }
    });
}

If flush() is called multiple times rapidly, pendingChunks could have duplicate entries added before any writes complete. The array splice only removes one entry. Consider using a Set or Map keyed by chunk identifier.

2. Time anchor drift (traces.ts:205-208, 220-225)
The time anchor is set once on initialization, but performance.now() can drift from wall clock time over long periods. For Rivet Actors that run for months, this could cause significant timestamp errors. Consider periodically re-anchoring (e.g., every hour) or using process.hrtime.bigint() for more stable relative timing.
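A sketch of the periodic re-anchoring idea, assuming the wall-clock-plus-performance.now() anchor pattern described above (the interval and field names are illustrative):

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical re-anchoring: refresh the (Date.now, performance.now) pair
// periodically so monotonic-clock drift stays bounded over long uptimes.
interface TimeAnchor {
    epochMs: number; // Date.now() at anchor time
    perfMs: number; // performance.now() at anchor time
}

const REANCHOR_INTERVAL_MS = 60 * 60 * 1000; // illustrative: once per hour
let anchor: TimeAnchor = { epochMs: Date.now(), perfMs: performance.now() };

function nowNanos(): bigint {
    const perfNow = performance.now();
    if (perfNow - anchor.perfMs > REANCHOR_INTERVAL_MS) {
        anchor = { epochMs: Date.now(), perfMs: performance.now() };
    }
    const ms = anchor.epochMs + (perfNow - anchor.perfMs);
    // Millisecond precision is enough for this sketch; convert to ns last
    // so the bigint math does not lose precision.
    return BigInt(Math.round(ms)) * 1_000_000n;
}
```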

3. Unsafe bigint to number conversion (traces.ts:1147-1153)

function toNumber(value: bigint): number {
    const asNumber = Number(value);
    if (!Number.isSafeInteger(asNumber)) {
        throw new Error("Value exceeds safe integer range");
    }
    return asNumber;
}

This is used for bucketStartSec, which is Unix seconds; a seconds value stays within Number's safe-integer range (2^53) for hundreds of millions of years, so the practical risk is this helper being reused for finer-grained timestamps (2^53 microseconds since the epoch is reached around year 2255, and nanoseconds since the epoch already exceed 2^53 today). Consider using bigint throughout or documenting this limitation.

Medium Priority

4. Missing flush error handling (traces.ts:777-784)

async function flush(): Promise<boolean> {
    const didFlush = flushChunk();
    if (didFlush) {
        resetChunkState(currentChunk.bucketStartSec);
    }
    await writeChain; // No error handling
    return didFlush;
}

If a driver.set() fails in the write chain, the error is uncaught. Should either propagate the error or log it and continue.
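One hedged way to address this: keep the chain alive past a failed write, but surface the first error the next time flush() awaits it. The names below are illustrative stand-ins, not the PR's API:

```typescript
// Hypothetical error-carrying write chain: each link swallows its own
// failure so later writes still run, and flush() rethrows the first error.
let writeChain: Promise<void> = Promise.resolve();
let firstWriteError: unknown = undefined;

function enqueueWrite(write: () => Promise<void>): void {
    writeChain = writeChain.then(async () => {
        try {
            await write();
        } catch (err) {
            // Record the error but keep the chain usable for later chunks.
            firstWriteError ??= err;
        }
    });
}

async function flush(): Promise<void> {
    await writeChain;
    if (firstWriteError !== undefined) {
        const err = firstWriteError;
        firstWriteError = undefined;
        throw err instanceof Error ? err : new Error(String(err));
    }
}
```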

5. Attribute sanitization inconsistency (traces.ts:294-320)
Maps are sanitized recursively, but the function returns undefined for non-string Map keys, which causes the entire Map to be dropped. Consider logging a warning or counting dropped entries separately.

6. listRange implementation in ActorTracesDriver (traces-driver.ts:85-114)
The implementation loads ALL entries with the prefix, then filters in memory. This is inefficient and could cause OOM if there are millions of trace chunks. Should stream/paginate or leverage the underlying driver's range capabilities more efficiently.
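The pagination idea can be sketched with an async generator; `listPage` and its cursor shape are hypothetical stand-ins for whatever paging primitive the underlying driver exposes:

```typescript
// Generic streaming range scan: fetch one page at a time and filter
// per-page instead of materializing every key in memory.
interface Entry {
    key: string;
    value: Uint8Array;
}

async function* listRangeStreaming(
    listPage: (
        cursor: string | undefined,
        limit: number,
    ) => Promise<{ entries: Entry[]; next?: string }>,
    start: string,
    end: string,
    pageSize = 256,
): AsyncGenerator<Entry> {
    let cursor: string | undefined = undefined;
    for (;;) {
        const page = await listPage(cursor, pageSize);
        for (const entry of page.entries) {
            // Half-open range [start, end), checked page by page.
            if (entry.key >= start && entry.key < end) yield entry;
        }
        if (!page.next) return;
        cursor = page.next;
    }
}
```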


⚡ Performance Considerations

Good Practices

  • String interning reduces redundant storage ✅
  • Batch writes minimize driver round-trips ✅
  • Snapshot intervals prevent unbounded replay ✅

Optimization Opportunities

1. CBOR encoding overhead (traces.ts:285-287, 437-441)
Every attribute is CBOR-encoded individually. For hot paths with many small attributes, batch encoding could reduce overhead:

// Instead of encoding each attribute separately
const encoded = encodeCbor({ attr1, attr2, attr3 });

2. Chunk size estimation (traces.ts:506-507, 520-523)
encodeRecord(record) is called twice when a chunk is flushed due to size - once to check size, then again after resetting. Consider caching the encoded result:

const encodedRecord = encodeRecord(record);
if (currentChunk.sizeBytes + encodedRecord.length > targetChunkBytes) {
    flushChunk();
    resetChunkState(recordBucketStart);
    // Reuse encodedRecord here instead of rebuilding body and re-encoding
}

3. Active span snapshot linear search (traces.ts:760-771)
activeSpanRefs.get() is O(1), but you're reconstructing the entire object on each snapshot update. Consider using a mutable structure.


🔒 Security Concerns

1. Unbounded attribute values (traces.ts:294-320)
There's no size limit on individual attribute values. A malicious/buggy user could pass huge strings or deeply nested objects, causing memory exhaustion or extremely large chunks that exceed maxChunkBytes. Add validation:

if (typeof value === 'string' && value.length > MAX_ATTRIBUTE_STRING_LENGTH) {
    return undefined; // or truncate
}

2. Denial of service via span creation (traces.ts:620-679)
While maxActiveSpans limits active spans, there's no rate limit on span creation. An attacker could create millions of short-lived spans, filling up disk storage. Consider adding:

  • Rate limiting per time window
  • Total span count limit per actor
  • Disk space quota checks
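A rate limit per time window could be as simple as a token bucket in front of span creation; the class below is a generic sketch, not tied to the PR's API:

```typescript
// Minimal token bucket: refills continuously, denies when empty. A caller
// would check tryTake() before creating a span and drop (or count) denials.
class TokenBucket {
    #tokens: number;
    #last: number;

    constructor(
        private capacity: number,
        private refillPerSec: number,
    ) {
        this.#tokens = capacity;
        this.#last = Date.now();
    }

    tryTake(): boolean {
        const now = Date.now();
        // Refill proportionally to elapsed time, capped at capacity.
        this.#tokens = Math.min(
            this.capacity,
            this.#tokens + ((now - this.#last) / 1000) * this.refillPerSec,
        );
        this.#last = now;
        if (this.#tokens >= 1) {
            this.#tokens -= 1;
            return true;
        }
        return false;
    }
}
```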

✅ Test Coverage

Excellent Coverage

The test file (traces.test.ts) has 750 lines of comprehensive tests covering:

  • Encoding/decoding round-trips ✅
  • Chunking and flushing logic ✅
  • Snapshot creation and recovery ✅
  • Time bucketing edge cases ✅
  • Active span eviction ✅

Missing Test Cases

  1. Concurrent flush scenarios: Test rapid flush() calls to verify write chain ordering
  2. Driver failure modes: Test behavior when driver.set() throws
  3. Memory pressure: Test with maxActiveSpans filled and new spans created
  4. Time anchor drift: Mock performance.now() drift over long durations
  5. Large attribute values: Test behavior with extremely large/deep attributes
  6. listRange pagination: Test with thousands of chunks

📝 Code Quality & Style

Adherence to CLAUDE.md

  • ✅ No glob imports from anyhow (N/A - TypeScript)
  • ✅ Structured logging with tracing (though this is a library, not an application)
  • ✅ Use of bigint for nanosecond timestamps
  • ⚠️ Comments could be improved - several complex functions lack doc comments (e.g., buildSpansFromRecords, findPreviousChunk)

Suggestions

  1. Add JSDoc comments for public API surface (createTraces, SpanHandle, etc.)
  2. Extract magic numbers - Several hardcoded constants aren't in the DEFAULT_* constants (e.g., line 77: SPAN_ID_BYTES = 8)
  3. Type safety - Consider making SpanId and TraceId branded types instead of plain ArrayBuffer to prevent mixing them up
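Suggestion 3 can be done with compile-time brands. The sketch below uses Uint8Array for illustration (the PR stores IDs as ArrayBuffer) and assumes 8-byte span IDs and 16-byte trace IDs, matching the OTLP convention:

```typescript
// Branded types: the __brand field exists only at the type level, so
// SpanId and TraceId cannot be mixed up even though both are byte arrays.
type SpanId = Uint8Array & { readonly __brand: "SpanId" };
type TraceId = Uint8Array & { readonly __brand: "TraceId" };

function asSpanId(bytes: Uint8Array): SpanId {
    if (bytes.length !== 8) throw new Error("SpanId must be 8 bytes");
    return bytes as SpanId;
}

function asTraceId(bytes: Uint8Array): TraceId {
    if (bytes.length !== 16) throw new Error("TraceId must be 16 bytes");
    return bytes as TraceId;
}

// A function taking SpanId now rejects a TraceId at compile time:
function formatSpanId(id: SpanId): string {
    return Buffer.from(id).toString("hex");
}
```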

🎨 API Design

Well Designed

  • withSpan for context propagation is idiomatic ✅
  • SpanHandle.isActive() prevents use-after-end ✅
  • Explicit parent span option ✅

Suggestions

  1. Error messages: Consider adding more context to errors. For example, "Span handle is not active" (line 612) could include the spanId for debugging.

  2. Trace context injection: The spec notes this is "not supported in v1" (line 352 of TRACES_SPEC.md). Consider adding a placeholder API or note in types for future extensibility:

interface StartSpanOptions {
    // ... existing options
    // TODO(v2): Support continuing remote traces
    // injectTraceContext?: { traceId: Uint8Array; parentSpanId: Uint8Array };
}

📋 Minor Issues

  1. Typo in TRACES_DESIGN_NOTES.md:2: "thi sin code" should be "this in code"
  2. Inconsistent terminology: "Rivet Actors" vs "Rivet Actor" vs "actor" - the spec says to capitalize (per CLAUDE.md), but it's inconsistent in traces-driver.ts comments
  3. Unused imports: otlp.ts imports decode as decodeCbor (line 2) but it's unused in that file

✨ Recommendations

High Priority

  1. Fix the write chain race condition
  2. Add error handling to flush() and write chain
  3. Optimize ActorTracesDriver.listRange() to avoid loading all entries
  4. Add attribute size limits to prevent DoS

Medium Priority

  1. Periodically re-anchor time or document drift limitations
  2. Add missing test cases (concurrent flush, driver failures, memory pressure)
  3. Cache encoded records when checking chunk size to avoid double encoding
  4. Add JSDoc comments to public APIs

Low Priority

  1. Extract magic numbers to named constants
  2. Consider branded types for SpanId/TraceId
  3. Fix typos and terminology inconsistencies

🎉 Overall Assessment

This is a strong implementation with thoughtful design decisions. The architecture is well-suited for long-running Rivet Actors with heavy write loads. The main concerns are around error handling, potential race conditions, and DoS protection. With the critical issues addressed, this will be a solid foundation for tracing in RivetKit.

Recommendation: Request changes for critical issues, then approve.

@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from e9cb986 to 7ff09e8 Compare January 30, 2026 08:19
@claude

claude bot commented Jan 30, 2026

PR Review: feat(rivetkit): traces

Overview

This PR introduces a comprehensive distributed tracing system for RivetKit actors. The implementation adds a new @rivetkit/traces package with a custom, compact storage format optimized for long-running spans and efficient persistence.

Positive Aspects

Architecture & Design

  • Well-documented specification: The TRACES_SPEC.md provides excellent documentation of the storage model, API surface, and design decisions
  • Compact storage format: Using VBARE binary encoding with string tables and CBOR for attributes is a smart optimization
  • OTLP v1 compatibility: Export to standard OTLP JSON format ensures interoperability
  • Long-running span support: The snapshot mechanism (via SpanSnapshot records) is well-designed for actors that can run for months
  • Single-writer assumption: Simplifies the implementation appropriately for the actor model

Implementation Quality

  • Clean separation of concerns: The traces library is properly isolated from RivetKit core
  • Good test coverage: The test file shows comprehensive coverage of edge cases
  • Proper use of bigint for timestamps: Correctly handles nanosecond precision without overflow
  • Time anchoring: Smart use of performance.now() anchored to Date.now() for precise relative timing

Issues & Concerns

🔴 Critical Issues

1. Actor Instance Integration - Missing Error Handling

// rivetkit/src/actor/instance/mod.ts:886-889
#initializeTraces() {
    this.#traces = createTraces({
        driver: new ActorTracesDriver(this.driver, this.#actorId),
    });
}

Issue: No error handling for trace initialization. If trace initialization fails, the actor will crash without proper cleanup.

Recommendation: Wrap in try-catch and add resource configuration.

2. Trace Span Memory Leak Risk

The action execution at rivetkit/src/actor/instance/mod.ts:620-707 creates a span but doesn't guarantee cleanup in all error paths. If an exception occurs between span creation and the try block, the span won't be ended.

Recommendation: Move span creation inside the try block or use a more robust pattern.
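A pattern that makes the leak impossible, assuming a SpanHandle-like object with an end() method (the wrapper and its names are illustrative, not the PR's API):

```typescript
// Leak-proof span wrapper: the span is created immediately before the
// try block (nothing can throw in between) and ended in finally, so it
// is closed on every exit path, including thrown exceptions.
interface Span {
    end(): void;
}

async function withEndedSpan<T>(
    startSpan: () => Span,
    fn: (span: Span) => Promise<T>,
): Promise<T> {
    const span = startSpan();
    try {
        return await fn(span);
    } finally {
        span.end();
    }
}
```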

3. Race Condition in Flush Chain

// traces/src/traces.ts:559-567
function enqueueWrite(pending: PendingChunk): void {
    writeChain = writeChain.then(async () => {
        await driver.set(pending.key, pending.bytes);
        const index = pendingChunks.indexOf(pending);
        if (index !== -1) {
            pendingChunks.splice(index, 1);
        }
    });
}

Issue: If multiple flushes happen rapidly, the pendingChunks array manipulation could be racy. Errors in driver.set will break the chain for all subsequent writes.

Recommendation: Add error handling to prevent breaking the write chain.

🟡 High Priority Issues

4. Unclosed Spans on Actor Shutdown

The onStop method at rivetkit/src/actor/instance/mod.ts:435-489 doesn't explicitly flush traces before shutdown. Active trace spans and unflushed chunks may be lost.

Recommendation: Add await this.#traces.flush() before final state save.

5. Missing Resource Configuration

Traces are initialized without resource metadata. According to the spec, resource data should be attached at export time for proper span identification.

Recommendation: Add resource configuration with actor.id, actor.name, actor.region attributes.

6. Inefficient Driver Implementation

// rivetkit/src/actor/instance/traces-driver.ts:85-114
async listRange(...) {
    const entries = await this.#driver.kvListPrefix(this.#actorId, this.#prefix);
    const filtered = entries.filter(...)
}

Issue: Fetches ALL trace keys then filters in memory. Very inefficient for large datasets.

Recommendation: Check if underlying driver supports native range queries or document performance implications.

🟢 Medium Priority Issues

7. Missing Trace Context Validation

emitTraceEvent silently returns if no span exists at rivetkit/src/actor/instance/mod.ts:267-279. This could hide bugs where events are emitted outside span contexts.

8. Logger Patching Side Effects

The logger monkey-patching at rivetkit/src/actor/instance/mod.ts:904-917 could cause unexpected behavior. The type casting as any[] is unsafe.

9. Snapshot Threshold Tuning

Fixed thresholds (256 KB snapshot size, 300 s interval) may not be optimal for all use cases. Consider making them configurable.
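Making the thresholds configurable could look like the sketch below, using the fixed values mentioned above as defaults (the option names are hypothetical):

```typescript
// Hypothetical options object with defaults matching the current
// hardcoded thresholds; callers override only what they need.
interface SnapshotOptions {
    snapshotBytesThreshold?: number;
    snapshotIntervalSec?: number;
}

const SNAPSHOT_DEFAULTS = {
    snapshotBytesThreshold: 256 * 1024, // 256 KB
    snapshotIntervalSec: 300, // 5 minutes
};

function resolveSnapshotOptions(
    opts: SnapshotOptions = {},
): Required<SnapshotOptions> {
    // Spread order: explicit options win over defaults.
    return { ...SNAPSHOT_DEFAULTS, ...opts };
}
```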

🔵 Low Priority / Style Issues

10. Inconsistent Error Messages

Error at traces/src/traces.ts:507-518 should include actual size for debugging.

11. Magic Numbers

Add comments explaining constants like MAX_CHUNK_ID = 0xffff_ffff.

12. Silent CBOR Failures

Silent catch blocks when encoding attributes could make debugging difficult.


Security Considerations

No major security issues identified

  • Input sanitization for attributes looks appropriate
  • No obvious injection vulnerabilities
  • Proper handling of binary data

Performance Considerations

Strengths:

  • Efficient binary encoding (VBARE + CBOR)
  • String table deduplication
  • Batched writes with flush chain
  • Snapshot mechanism for long-running spans

Concerns:

  1. Memory usage: The maxActiveSpans limit (10,000) could still consume significant memory
  2. KV list operations: Driver implementation fetches all keys which could be slow
  3. Synchronous encoding: Record encoding happens synchronously which could block event loop

Test Coverage Assessment

Good coverage of basic functionality. Missing tests for:

  • ❌ Concurrent span creation/ending
  • ❌ Error scenarios (encoding failures, driver failures)
  • ❌ Memory limits (maxActiveSpans enforcement)
  • ❌ Actor shutdown with active spans
  • ❌ Large span hierarchies (depth testing)

Recommendations Summary

Must Fix Before Merge:

  1. Add error handling to trace initialization
  2. Fix potential race condition in flush chain
  3. Add trace flush to actor shutdown sequence
  4. Add resource configuration to trace initialization

Should Fix Soon:

  1. Improve driver listRange implementation or document performance implications
  2. Add validation/logging for trace context issues
  3. Review logger patching approach
  4. Add tests for error scenarios and actor lifecycle integration

Nice to Have:

  1. Make snapshot thresholds configurable
  2. Improve error messages with more context
  3. Add debug logging for dropped attributes
  4. Expand test coverage

Overall Assessment

This is a well-designed and implemented feature with good documentation. The core tracing library is solid, but the integration with ActorInstance needs some refinement around error handling and lifecycle management.

Recommendation: Approve with required changes (items 1-4 above).

The design choices (custom binary format, snapshot mechanism, single-writer model) are appropriate for the use case of long-running actor spans. The implementation quality is generally high, with most issues being in the integration layer rather than the core traces library.

Great work on the comprehensive specification and test coverage! 🎉

