
Conversation

@NathanFlurry
Member

No description provided.

Member Author

NathanFlurry commented Jan 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite.

@pkg-pr-new

pkg-pr-new bot commented Jan 26, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4037

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4037

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4037

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4037

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4037

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4037

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4037

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4037

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4037

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4037

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4037

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4037

commit: 7e0a15d

Comment on lines +1 to +55
import { AsyncLocalStorage } from "node:async_hooks";
import { Buffer } from "node:buffer";
import { randomBytes } from "node:crypto";
import { performance } from "node:perf_hooks";
import { decode as decodeCbor, encode as encodeCbor } from "cbor-x";
import { pack, unpack } from "fdb-tuple";
import {
    CHUNK_VERSIONED,
    CURRENT_VERSION,
    READ_RANGE_VERSIONED,
    encodeRecord,
    type ActiveSpanRef,
    type Attributes,
    type Chunk,
    type KeyValue,
    type ReadRangeWire,
    type Record as TraceRecord,
    type RecordBody,
    type SpanEnd,
    type SpanEvent,
    type SpanId,
    type SpanLink,
    type SpanRecordKey,
    type SpanSnapshot,
    type SpanStart,
    type SpanStatus,
    SpanStatusCode,
    type SpanUpdate,
    type StringId,
    type TraceId,
} from "../schemas/versioned.js";
import {
    anyValueFromJs,
    hexFromBytes,
    type OtlpExportTraceServiceRequestJson,
    type OtlpKeyValue,
    type OtlpResource,
    type OtlpSpan,
    type OtlpSpanEvent,
    type OtlpSpanLink,
    type OtlpSpanStatus,
} from "./otlp.js";
import type {
    EndSpanOptions,
    EventOptions,
    ReadRangeOptions,
    ReadRangeResult,
    SpanHandle,
    SpanStatusInput,
    StartSpanOptions,
    Traces,
    TracesDriver,
    TracesOptions,
    UpdateSpanOptions,
} from "./types.js";
Contributor


The imports need to be sorted alphabetically; the Biome linter expects imports in a specific order. Run `biome check --write .` to fix this automatically.

Spotted by Graphite Agent (based on CI logs)


@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from 2bdce1c to 9524546 Compare January 28, 2026 01:12
@jog1t jog1t force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from 9524546 to e9cb986 Compare January 28, 2026 19:55
@claude

claude bot commented Jan 28, 2026

PR Review: feat(rivetkit): traces

This PR introduces a new traces storage and API package (@rivetkit/traces) for RivetKit, implementing an efficient local tracing system with OTLP v1 JSON export capabilities. Overall, this is a well-architected implementation with comprehensive test coverage.


🎯 Architecture & Design

Strengths

  • Excellent design documentation: Both TRACES_SPEC.md and TRACES_DESIGN_NOTES.md provide clear rationale and implementation guidelines
  • Smart storage strategy: Using append-only records with periodic snapshots for long-running spans is the right approach for heavy write loads
  • Efficient encoding: VBARE + CBOR + string table interning provides good compression without sacrificing performance
  • Hybrid read path: Merging in-memory and on-disk data seamlessly handles unflushed records

Concerns

  1. Active span memory management (traces.ts:580-602): The depth-based eviction strategy is good, but consider adding memory pressure monitoring. If you have 10,000 shallow spans with large attributes, you could still exhaust memory before hitting maxActiveSpans.

  2. String table memory leak potential (traces.ts:259-268): The string table is per-chunk and never deduplicated across chunks. For very long-running traces with many unique string values, this could grow unbounded. Consider adding a max string table size check.


🐛 Potential Bugs

Critical

1. Race condition in write chain (traces.ts:559-567)

function enqueueWrite(pending: PendingChunk): void {
    writeChain = writeChain.then(async () => {
        await driver.set(pending.key, pending.bytes);
        const index = pendingChunks.indexOf(pending);
        if (index !== -1) {
            pendingChunks.splice(index, 1);
        }
    });
}

If flush() is called multiple times rapidly, pendingChunks could have duplicate entries added before any writes complete. The array splice only removes one entry. Consider using a Set or Map keyed by chunk identifier.

2. Time anchor drift (traces.ts:205-208, 220-225)
The time anchor is set once on initialization, but performance.now() can drift from wall clock time over long periods. For Rivet Actors that run for months, this could cause significant timestamp errors. Consider periodically re-anchoring (e.g., every hour) or using process.hrtime.bigint() for more stable relative timing.
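A sketch of the periodic re-anchoring idea, assuming the wall-clock-plus-performance.now() anchor pattern described above (the interval and field names are illustrative):

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical re-anchoring: refresh the (Date.now, performance.now) pair
// periodically so monotonic-clock drift stays bounded over long uptimes.
interface TimeAnchor {
    epochMs: number; // Date.now() at anchor time
    perfMs: number; // performance.now() at anchor time
}

const REANCHOR_INTERVAL_MS = 60 * 60 * 1000; // illustrative: once per hour
let anchor: TimeAnchor = { epochMs: Date.now(), perfMs: performance.now() };

function nowNanos(): bigint {
    const perfNow = performance.now();
    if (perfNow - anchor.perfMs > REANCHOR_INTERVAL_MS) {
        anchor = { epochMs: Date.now(), perfMs: performance.now() };
    }
    const ms = anchor.epochMs + (perfNow - anchor.perfMs);
    // Millisecond precision is enough for this sketch; convert to ns last
    // so the bigint math does not lose precision.
    return BigInt(Math.round(ms)) * 1_000_000n;
}
```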

3. Unsafe bigint to number conversion (traces.ts:1147-1153)

function toNumber(value: bigint): number {
    const asNumber = Number(value);
    if (!Number.isSafeInteger(asNumber)) {
        throw new Error("Value exceeds safe integer range");
    }
    return asNumber;
}

This is used for bucketStartSec, which is Unix seconds; a seconds value stays within Number's safe-integer range (2^53) for hundreds of millions of years, so the practical risk is this helper being reused for finer-grained timestamps (2^53 microseconds since the epoch is reached around year 2255, and nanoseconds since the epoch already exceed 2^53 today). Consider using bigint throughout or documenting this limitation.

Medium Priority

4. Missing flush error handling (traces.ts:777-784)

async function flush(): Promise<boolean> {
    const didFlush = flushChunk();
    if (didFlush) {
        resetChunkState(currentChunk.bucketStartSec);
    }
    await writeChain; // No error handling
    return didFlush;
}

If a driver.set() fails in the write chain, the error is uncaught. Should either propagate the error or log it and continue.
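One hedged way to address this: keep the chain alive past a failed write, but surface the first error the next time flush() awaits it. The names below are illustrative stand-ins, not the PR's API:

```typescript
// Hypothetical error-carrying write chain: each link swallows its own
// failure so later writes still run, and flush() rethrows the first error.
let writeChain: Promise<void> = Promise.resolve();
let firstWriteError: unknown = undefined;

function enqueueWrite(write: () => Promise<void>): void {
    writeChain = writeChain.then(async () => {
        try {
            await write();
        } catch (err) {
            // Record the error but keep the chain usable for later chunks.
            firstWriteError ??= err;
        }
    });
}

async function flush(): Promise<void> {
    await writeChain;
    if (firstWriteError !== undefined) {
        const err = firstWriteError;
        firstWriteError = undefined;
        throw err instanceof Error ? err : new Error(String(err));
    }
}
```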

5. Attribute sanitization inconsistency (traces.ts:294-320)
Maps are sanitized recursively, but the function returns undefined for non-string Map keys, which causes the entire Map to be dropped. Consider logging a warning or counting dropped entries separately.

6. listRange implementation in ActorTracesDriver (traces-driver.ts:85-114)
The implementation loads ALL entries with the prefix, then filters in memory. This is inefficient and could cause OOM if there are millions of trace chunks. Should stream/paginate or leverage the underlying driver's range capabilities more efficiently.
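The pagination idea can be sketched with an async generator; `listPage` and its cursor shape are hypothetical stand-ins for whatever paging primitive the underlying driver exposes:

```typescript
// Generic streaming range scan: fetch one page at a time and filter
// per-page instead of materializing every key in memory.
interface Entry {
    key: string;
    value: Uint8Array;
}

async function* listRangeStreaming(
    listPage: (
        cursor: string | undefined,
        limit: number,
    ) => Promise<{ entries: Entry[]; next?: string }>,
    start: string,
    end: string,
    pageSize = 256,
): AsyncGenerator<Entry> {
    let cursor: string | undefined = undefined;
    for (;;) {
        const page = await listPage(cursor, pageSize);
        for (const entry of page.entries) {
            // Half-open range [start, end), checked page by page.
            if (entry.key >= start && entry.key < end) yield entry;
        }
        if (!page.next) return;
        cursor = page.next;
    }
}
```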


⚡ Performance Considerations

Good Practices

  • String interning reduces redundant storage ✅
  • Batch writes minimize driver round-trips ✅
  • Snapshot intervals prevent unbounded replay ✅

Optimization Opportunities

1. CBOR encoding overhead (traces.ts:285-287, 437-441)
Every attribute is CBOR-encoded individually. For hot paths with many small attributes, batch encoding could reduce overhead:

// Instead of encoding each attribute separately
const encoded = encodeCbor({ attr1, attr2, attr3 });

2. Chunk size estimation (traces.ts:506-507, 520-523)
encodeRecord(record) is called twice when a chunk is flushed due to size - once to check size, then again after resetting. Consider caching the encoded result:

const encodedRecord = encodeRecord(record);
if (currentChunk.sizeBytes + encodedRecord.length > targetChunkBytes) {
    flushChunk();
    resetChunkState(recordBucketStart);
    // Reuse encodedRecord here instead of rebuilding body and re-encoding
}

3. Active span snapshot linear search (traces.ts:760-771)
activeSpanRefs.get() is O(1), but you're reconstructing the entire object on each snapshot update. Consider using a mutable structure.


🔒 Security Concerns

1. Unbounded attribute values (traces.ts:294-320)
There's no size limit on individual attribute values. A malicious/buggy user could pass huge strings or deeply nested objects, causing memory exhaustion or extremely large chunks that exceed maxChunkBytes. Add validation:

if (typeof value === 'string' && value.length > MAX_ATTRIBUTE_STRING_LENGTH) {
    return undefined; // or truncate
}

2. Denial of service via span creation (traces.ts:620-679)
While maxActiveSpans limits active spans, there's no rate limit on span creation. An attacker could create millions of short-lived spans, filling up disk storage. Consider adding:

  • Rate limiting per time window
  • Total span count limit per actor
  • Disk space quota checks
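A rate limit per time window could be as simple as a token bucket in front of span creation; the class below is a generic sketch, not tied to the PR's API:

```typescript
// Minimal token bucket: refills continuously, denies when empty. A caller
// would check tryTake() before creating a span and drop (or count) denials.
class TokenBucket {
    #tokens: number;
    #last: number;

    constructor(
        private capacity: number,
        private refillPerSec: number,
    ) {
        this.#tokens = capacity;
        this.#last = Date.now();
    }

    tryTake(): boolean {
        const now = Date.now();
        // Refill proportionally to elapsed time, capped at capacity.
        this.#tokens = Math.min(
            this.capacity,
            this.#tokens + ((now - this.#last) / 1000) * this.refillPerSec,
        );
        this.#last = now;
        if (this.#tokens >= 1) {
            this.#tokens -= 1;
            return true;
        }
        return false;
    }
}
```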

✅ Test Coverage

Excellent Coverage

The test file (traces.test.ts) has 750 lines of comprehensive tests covering:

  • Encoding/decoding round-trips ✅
  • Chunking and flushing logic ✅
  • Snapshot creation and recovery ✅
  • Time bucketing edge cases ✅
  • Active span eviction ✅

Missing Test Cases

  1. Concurrent flush scenarios: Test rapid flush() calls to verify write chain ordering
  2. Driver failure modes: Test behavior when driver.set() throws
  3. Memory pressure: Test with maxActiveSpans filled and new spans created
  4. Time anchor drift: Mock performance.now() drift over long durations
  5. Large attribute values: Test behavior with extremely large/deep attributes
  6. listRange pagination: Test with thousands of chunks

📝 Code Quality & Style

Adherence to CLAUDE.md

  • ✅ No glob imports from anyhow (N/A - TypeScript)
  • ✅ Structured logging with tracing (though this is a library, not an application)
  • ✅ Use of bigint for nanosecond timestamps
  • ⚠️ Comments could be improved - several complex functions lack doc comments (e.g., buildSpansFromRecords, findPreviousChunk)

Suggestions

  1. Add JSDoc comments for public API surface (createTraces, SpanHandle, etc.)
  2. Extract magic numbers - Several hardcoded constants aren't in the DEFAULT_* constants (e.g., line 77: SPAN_ID_BYTES = 8)
  3. Type safety - Consider making SpanId and TraceId branded types instead of plain ArrayBuffer to prevent mixing them up
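Suggestion 3 can be done with compile-time brands. The sketch below uses Uint8Array for illustration (the PR stores IDs as ArrayBuffer) and assumes 8-byte span IDs and 16-byte trace IDs, matching the OTLP convention:

```typescript
// Branded types: the __brand field exists only at the type level, so
// SpanId and TraceId cannot be mixed up even though both are byte arrays.
type SpanId = Uint8Array & { readonly __brand: "SpanId" };
type TraceId = Uint8Array & { readonly __brand: "TraceId" };

function asSpanId(bytes: Uint8Array): SpanId {
    if (bytes.length !== 8) throw new Error("SpanId must be 8 bytes");
    return bytes as SpanId;
}

function asTraceId(bytes: Uint8Array): TraceId {
    if (bytes.length !== 16) throw new Error("TraceId must be 16 bytes");
    return bytes as TraceId;
}

// A function taking SpanId now rejects a TraceId at compile time:
function formatSpanId(id: SpanId): string {
    return Buffer.from(id).toString("hex");
}
```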

🎨 API Design

Well Designed

  • withSpan for context propagation is idiomatic ✅
  • SpanHandle.isActive() prevents use-after-end ✅
  • Explicit parent span option ✅

Suggestions

  1. Error messages: Consider adding more context to errors. For example, "Span handle is not active" (line 612) could include the spanId for debugging.

  2. Trace context injection: The spec notes this is "not supported in v1" (line 352 of TRACES_SPEC.md). Consider adding a placeholder API or note in types for future extensibility:

interface StartSpanOptions {
    // ... existing options
    // TODO(v2): Support continuing remote traces
    // injectTraceContext?: { traceId: Uint8Array; parentSpanId: Uint8Array };
}

📋 Minor Issues

  1. Typo in TRACES_DESIGN_NOTES.md:2: "thi sin code" should be "this in code"
  2. Inconsistent terminology: "Rivet Actors" vs "Rivet Actor" vs "actor" - the spec says to capitalize (per CLAUDE.md), but it's inconsistent in traces-driver.ts comments
  3. Unused imports: otlp.ts imports decode as decodeCbor (line 2) but it's unused in that file

✨ Recommendations

High Priority

  1. Fix the write chain race condition
  2. Add error handling to flush() and write chain
  3. Optimize ActorTracesDriver.listRange() to avoid loading all entries
  4. Add attribute size limits to prevent DoS

Medium Priority

  1. Periodically re-anchor time or document drift limitations
  2. Add missing test cases (concurrent flush, driver failures, memory pressure)
  3. Cache encoded records when checking chunk size to avoid double encoding
  4. Add JSDoc comments to public APIs

Low Priority

  1. Extract magic numbers to named constants
  2. Consider branded types for SpanId/TraceId
  3. Fix typos and terminology inconsistencies

🎉 Overall Assessment

This is a strong implementation with thoughtful design decisions. The architecture is well-suited for long-running Rivet Actors with heavy write loads. The main concerns are around error handling, potential race conditions, and DoS protection. With the critical issues addressed, this will be a solid foundation for tracing in RivetKit.

Recommendation: Request changes for critical issues, then approve.

@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_integrate_workflows_in_to_actors branch from e9cb986 to 7ff09e8 Compare January 30, 2026 08:19
@claude

claude bot commented Jan 30, 2026

PR Review: feat(rivetkit): traces

Overview

This PR introduces a comprehensive distributed tracing system for RivetKit actors. The implementation adds a new @rivetkit/traces package with a custom, compact storage format optimized for long-running spans and efficient persistence.

Positive Aspects

Architecture & Design

  • Well-documented specification: The TRACES_SPEC.md provides excellent documentation of the storage model, API surface, and design decisions
  • Compact storage format: Using VBARE binary encoding with string tables and CBOR for attributes is a smart optimization
  • OTLP v1 compatibility: Export to standard OTLP JSON format ensures interoperability
  • Long-running span support: The snapshot mechanism (via SpanSnapshot records) is well-designed for actors that can run for months
  • Single-writer assumption: Simplifies the implementation appropriately for the actor model

Implementation Quality

  • Clean separation of concerns: The traces library is properly isolated from RivetKit core
  • Good test coverage: The test file shows comprehensive coverage of edge cases
  • Proper use of bigint for timestamps: Correctly handles nanosecond precision without overflow
  • Time anchoring: Smart use of performance.now() anchored to Date.now() for precise relative timing

Issues & Concerns

🔴 Critical Issues

1. Actor Instance Integration - Missing Error Handling

// rivetkit/src/actor/instance/mod.ts:886-889
#initializeTraces() {
    this.#traces = createTraces({
        driver: new ActorTracesDriver(this.driver, this.#actorId),
    });
}

Issue: No error handling for trace initialization. If trace initialization fails, the actor will crash without proper cleanup.

Recommendation: Wrap in try-catch and add resource configuration.

2. Trace Span Memory Leak Risk

The action execution at rivetkit/src/actor/instance/mod.ts:620-707 creates a span but doesn't guarantee cleanup in all error paths. If an exception occurs between span creation and the try block, the span won't be ended.

Recommendation: Move span creation inside the try block or use a more robust pattern.
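A pattern that makes the leak impossible, assuming a SpanHandle-like object with an end() method (the wrapper and its names are illustrative, not the PR's API):

```typescript
// Leak-proof span wrapper: the span is created immediately before the
// try block (nothing can throw in between) and ended in finally, so it
// is closed on every exit path, including thrown exceptions.
interface Span {
    end(): void;
}

async function withEndedSpan<T>(
    startSpan: () => Span,
    fn: (span: Span) => Promise<T>,
): Promise<T> {
    const span = startSpan();
    try {
        return await fn(span);
    } finally {
        span.end();
    }
}
```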

3. Race Condition in Flush Chain

// traces/src/traces.ts:559-567
function enqueueWrite(pending: PendingChunk): void {
    writeChain = writeChain.then(async () => {
        await driver.set(pending.key, pending.bytes);
        const index = pendingChunks.indexOf(pending);
        if (index !== -1) {
            pendingChunks.splice(index, 1);
        }
    });
}

Issue: If multiple flushes happen rapidly, the pendingChunks array manipulation could be racy. Errors in driver.set will break the chain for all subsequent writes.

Recommendation: Add error handling to prevent breaking the write chain.

🟡 High Priority Issues

4. Unclosed Spans on Actor Shutdown

The onStop method at rivetkit/src/actor/instance/mod.ts:435-489 doesn't explicitly flush traces before shutdown. Active trace spans and unflushed chunks may be lost.

Recommendation: Add await this.#traces.flush() before final state save.

5. Missing Resource Configuration

Traces are initialized without resource metadata. According to the spec, resource data should be attached at export time for proper span identification.

Recommendation: Add resource configuration with actor.id, actor.name, actor.region attributes.

6. Inefficient Driver Implementation

// rivetkit/src/actor/instance/traces-driver.ts:85-114
async listRange(...) {
    const entries = await this.#driver.kvListPrefix(this.#actorId, this.#prefix);
    const filtered = entries.filter(...)
}

Issue: Fetches ALL trace keys then filters in memory. Very inefficient for large datasets.

Recommendation: Check if underlying driver supports native range queries or document performance implications.

🟢 Medium Priority Issues

7. Missing Trace Context Validation

emitTraceEvent silently returns if no span exists at rivetkit/src/actor/instance/mod.ts:267-279. This could hide bugs where events are emitted outside span contexts.

8. Logger Patching Side Effects

The logger monkey-patching at rivetkit/src/actor/instance/mod.ts:904-917 could cause unexpected behavior. The type casting as any[] is unsafe.

9. Snapshot Threshold Tuning

Fixed thresholds (256 KB snapshot size, 300 s interval) may not be optimal for all use cases. Consider making them configurable.
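Making the thresholds configurable could look like the sketch below, using the fixed values mentioned above as defaults (the option names are hypothetical):

```typescript
// Hypothetical options object with defaults matching the current
// hardcoded thresholds; callers override only what they need.
interface SnapshotOptions {
    snapshotBytesThreshold?: number;
    snapshotIntervalSec?: number;
}

const SNAPSHOT_DEFAULTS = {
    snapshotBytesThreshold: 256 * 1024, // 256 KB
    snapshotIntervalSec: 300, // 5 minutes
};

function resolveSnapshotOptions(
    opts: SnapshotOptions = {},
): Required<SnapshotOptions> {
    // Spread order: explicit options win over defaults.
    return { ...SNAPSHOT_DEFAULTS, ...opts };
}
```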

🔵 Low Priority / Style Issues

10. Inconsistent Error Messages

Error at traces/src/traces.ts:507-518 should include actual size for debugging.

11. Magic Numbers

Add comments explaining constants like MAX_CHUNK_ID = 0xffff_ffff.

12. Silent CBOR Failures

Silent catch blocks when encoding attributes could make debugging difficult.


Security Considerations

No major security issues identified

  • Input sanitization for attributes looks appropriate
  • No obvious injection vulnerabilities
  • Proper handling of binary data

Performance Considerations

Strengths:

  • Efficient binary encoding (VBARE + CBOR)
  • String table deduplication
  • Batched writes with flush chain
  • Snapshot mechanism for long-running spans

Concerns:

  1. Memory usage: The maxActiveSpans limit (10,000) could still consume significant memory
  2. KV list operations: Driver implementation fetches all keys which could be slow
  3. Synchronous encoding: Record encoding happens synchronously which could block event loop

Test Coverage Assessment

Good coverage of basic functionality. Missing tests for:

  • ❌ Concurrent span creation/ending
  • ❌ Error scenarios (encoding failures, driver failures)
  • ❌ Memory limits (maxActiveSpans enforcement)
  • ❌ Actor shutdown with active spans
  • ❌ Large span hierarchies (depth testing)

Recommendations Summary

Must Fix Before Merge:

  1. Add error handling to trace initialization
  2. Fix potential race condition in flush chain
  3. Add trace flush to actor shutdown sequence
  4. Add resource configuration to trace initialization

Should Fix Soon:

  1. Improve driver listRange implementation or document performance implications
  2. Add validation/logging for trace context issues
  3. Review logger patching approach
  4. Add tests for error scenarios and actor lifecycle integration

Nice to Have:

  1. Make snapshot thresholds configurable
  2. Improve error messages with more context
  3. Add debug logging for dropped attributes
  4. Expand test coverage

Overall Assessment

This is a well-designed and implemented feature with good documentation. The core tracing library is solid, but the integration with ActorInstance needs some refinement around error handling and lifecycle management.

Recommendation: Approve with required changes (items 1-4 above).

The design choices (custom binary format, snapshot mechanism, single-writer model) are appropriate for the use case of long-running actor spans. The implementation quality is generally high, with most issues being in the integration layer rather than the core traces library.

Great work on the comprehensive specification and test coverage! 🎉

