Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Jan 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 2, 2026 8:07pm
rivet-inspector Ignored Ignored Preview Jan 2, 2026 8:07pm
rivet-site Ignored Ignored Preview Jan 2, 2026 8:07pm
rivetkit-serverless Skipped Skipped Jan 2, 2026 8:07pm

Copy link
Member Author

NathanFlurry commented Jan 2, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from a1796cf to 4ca7516 Compare January 2, 2026 20:07
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless January 2, 2026 20:07 Inactive
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 4ca7516 to cff619f Compare January 15, 2026 18:21
@NathanFlurry NathanFlurry marked this pull request as ready for review January 15, 2026 18:21
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from cff619f to 3700d45 Compare January 15, 2026 18:54
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/3734 January 22, 2026 04:11
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 3700d45 to fbdbcd3 Compare January 22, 2026 04:11
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to actor-messages January 22, 2026 04:11
@NathanFlurry NathanFlurry changed the base branch from actor-messages to graphite-base/3734 January 22, 2026 05:36
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from fbdbcd3 to 246d677 Compare January 22, 2026 06:31
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to 01-21-feat_rivetkit_add_actor_run_handler January 22, 2026 06:31
@NathanFlurry NathanFlurry changed the base branch from 01-21-feat_rivetkit_add_actor_run_handler to graphite-base/3734 January 22, 2026 07:14
@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from 246d677 to a3fbdd7 Compare January 22, 2026 17:45
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3734 to 01-21-feat_rivetkit_add_actor_run_handler January 22, 2026 17:45
@claude
Copy link

claude bot commented Jan 22, 2026

PR Review: Workflow Engine for RivetKit

Overview

This PR introduces a comprehensive durable workflow engine for RivetKit TypeScript. This is a major addition (~9,257 lines) that implements a reentrant execution framework for building fault-tolerant, long-running workflows.

Summary Assessment

Overall: Strong implementation with production-ready patterns

This is high-quality code that demonstrates deep understanding of durable execution semantics. The implementation includes sophisticated features like:

  • Deterministic replay with history divergence detection
  • Automatic retry with exponential backoff
  • Rollback/compensation handlers
  • Race and join primitives for parallel execution
  • Message passing for external coordination
  • Loop state management with history trimming

Code Quality Analysis

Strengths ✅

  1. Excellent Error Handling Strategy

    • Custom error hierarchy is well-designed (CriticalError, RollbackError, EvictedError, etc.)
    • Clear distinction between retriable vs non-retriable errors
    • Proper error propagation through the execution stack
  2. Strong Type Safety

    • Comprehensive TypeScript types with good use of discriminated unions
    • Type-safe branch outputs in join/race operations
    • Well-defined interfaces for extensibility
  3. Deterministic Execution

    • History-based replay prevents non-deterministic behavior
    • Name registry pattern ensures consistent location keys
    • Input persistence for deterministic replays (lines 690-703 in index.ts)
  4. Memory Management

    • Loop history trimming to prevent unbounded growth (context.ts:718-763)
    • Lazy loading of entry metadata (storage.ts:165-189)
    • Dirty tracking for efficient flushes
  5. Testing Coverage

    • Tests cover both "yield" and "live" modes comprehensively
    • Edge cases like retry exhaustion, timeouts, and eviction are tested
    • Good use of test fixtures (CountingDriver)
  6. Documentation

    • Excellent inline comments explaining non-obvious behavior
    • Comprehensive QUICKSTART.md and architecture.md
    • Clear JSDoc annotations

Issues & Concerns 🔍

1. Critical: Race Condition in Message Consumption (High Priority)

Location: storage.ts:307-328 (consumeMessage)

export async function consumeMessage(
	storage: Storage,
	driver: EngineDriver,
	messageName: string,
): Promise<Message | null> {
	const index = storage.messages.findIndex(
		(message) => message.name === messageName,
	);
	if (index === -1) {
		return null;
	}

	const message = storage.messages[index];

	// Delete from driver first - if this fails, memory is unchanged
	await driver.delete(buildMessageKey(message.id));

	// Only remove from memory after successful driver deletion
	storage.messages.splice(index, 1);

	return message;
}

Problem: If the workflow crashes between deleting from KV and removing from memory, the message will be lost on next load (since it's deleted from KV but storage.messages still has it in memory).

Impact: Message loss in crash scenarios.

Recommendation: Consider a two-phase approach:

  • Mark message as "consumed" in KV first
  • Remove from memory
  • Delete from KV in background or during flush

Alternatively, accept this as "at-most-once" delivery and document it clearly.

2. Performance: Excessive Metadata Loading (Medium Priority)

Location: context.ts:355-359

Every retry attempt loads metadata from driver. If metadata is already loaded in storage.entryMetadata, this becomes redundant network calls. The loadMetadata function does cache, but consider loading all entry metadata upfront during loadStorage() for frequently-retried workflows.

3. Timeout Behavior May Be Surprising (Medium Priority)

Location: context.ts:464-480

The comment correctly notes that timeouts don't cancel underlying operations. This can lead to resource leaks if step operations don't respect ctx.abortSignal. Consider:

  • Documenting this more prominently in user-facing docs
  • Providing a helper to wrap operations with proper cancellation
  • Warning in logs when timeout occurs

4. Missing Input Validation (Low Priority)

Several functions lack input validation:

  • loop(): commitInterval must be > 0 (note in code but not enforced at context.ts:630)
  • listenN(): limit should be > 0
  • step(): timeout values could be validated

5. Inconsistent Error Messages (Low Priority)

Some errors are user-facing while others are internal. Consider standardizing error messages with actionable guidance.

Security Considerations 🔒

  1. No Sensitive Data Leakage: ✅ Good - error metadata filtering (index.ts:833-849) only includes serializable primitives

  2. Input Validation: ⚠️ Missing validation could allow malformed data into KV

  3. Resource Limits: ⚠️ No built-in limits on:

    • Number of loop iterations (could grow unbounded)
    • Message queue size
    • History entry count before trimming

    Consider adding configurable limits to prevent abuse.

Performance Considerations ⚡

  1. Batch Operations: ✅ Good use of batching in flush() (storage.ts:194-278)
  2. Ephemeral Steps: ✅ Excellent optimization (context.ts:417-424)
  3. Name Registry: ✅ Efficient deduplication
  4. Concerns:
    • list() operations load all entries into memory (storage.ts:122-128)
    • For workflows with thousands of entries, this could be expensive
    • Consider pagination or streaming for large histories

Test Coverage Analysis 🧪

Strengths:

  • Both "yield" and "live" modes tested
  • Good coverage of edge cases (timeouts, retries, eviction)
  • Parameterized tests reduce duplication

Gaps:

  • No tests for concurrent workflow execution
  • Limited testing of rollback scenarios
  • No chaos/fault injection tests (simulate KV failures)
  • Missing tests for history divergence recovery

Recommendation: Add tests for KV failures, partial batch writes, concurrent messages, and loop iteration limits.

Architecture Review 🏗️

Isolation Model

The driver design assumes external isolation (Durable Objects, actors). This is well-documented and appropriate.

Consideration: What happens if isolation is violated? (e.g., two processes run same workflow ID)

  • Could lead to history corruption
  • Consider adding optimistic locking or version checks

Serialization Strategy

Uses BARE schema for binary serialization. Good choice for compact storage, schema evolution, and type safety.

Note: Ensure BARE schema changes are backward compatible or migrations are handled.

Recommendations Summary

Must Fix (Before Merge)

  1. Document message delivery semantics - Is it at-most-once? exactly-once? Clarify in docs and potentially in error handling.

Should Fix (High Priority)

  1. Add resource limits (configurable max iterations, message queue size)
  2. Improve error messages with actionable guidance
  3. Add tests for concurrent scenarios and KV failures

Nice to Have (Medium Priority)

  1. Optimize metadata loading for retry-heavy workflows
  2. Add warnings when timeouts occur without cancellation support
  3. Add integration guide for implementing custom drivers
  4. Consider adding metrics/observability hooks

Documentation Improvements

  1. Add example of implementing a custom driver
  2. Document recovery scenarios (what happens if KV is corrupted?)
  3. Add migration guide for workflow code changes
  4. Document performance characteristics and scaling limits

Conclusion

This is excellent work and demonstrates a mature understanding of durable execution patterns. The code quality is high, error handling is sophisticated, and the API is well-designed.

Recommendation: APPROVE with minor fixes

The identified issues are mostly documentation gaps and potential edge cases. The core implementation is solid and ready for production use with proper driver implementations.

Key Action Items:

  1. Clarify message delivery semantics (documentation)
  2. Add resource limits to prevent unbounded growth
  3. Expand test coverage for failure scenarios
  4. Document driver implementation guide

Great job on this implementation! 🎉


Reviewed using Rivet CLAUDE.md guidelines and TypeScript best practices.

@NathanFlurry NathanFlurry force-pushed the 01-02-feat_rivetkit_workflows branch from a3fbdd7 to b4b061f Compare January 24, 2026 23:35
@NathanFlurry NathanFlurry force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 2fe5b7d to 329640c Compare January 24, 2026 23:35
@jog1t jog1t force-pushed the 01-21-feat_rivetkit_add_actor_run_handler branch from 4e8ca02 to f1e4e30 Compare January 28, 2026 19:55
@jog1t jog1t force-pushed the 01-02-feat_rivetkit_workflows branch from 8a67460 to a5391dd Compare January 28, 2026 19:55
@claude
Copy link

claude bot commented Jan 28, 2026

Pull Request Review: Workflow Engine for RivetKit

This PR introduces a comprehensive durable workflow engine to RivetKit. The implementation is well-architected with strong TypeScript types, comprehensive test coverage, and excellent documentation.

Strengths

Architecture & Design

  • Clean separation of concerns with isolated KV namespaces per workflow instance
  • Well-designed driver abstraction that allows multiple backend implementations
  • Deterministic replay mechanism through history tracking
  • Strong error handling with specialized error types for different failure modes

Code Quality

  • Comprehensive TypeScript typing throughout
  • Clear separation between public API and internal implementation
  • Good use of comments explaining complex logic
  • Consistent code style and patterns

Test Coverage

  • Extensive test suite covering multiple scenarios
  • Tests run in both yield and live modes
  • Good edge case coverage

Documentation

  • Excellent QUICKSTART.md with clear examples
  • Detailed architecture.md explaining design decisions
  • Multiple focused docs on specific features

Issues & Recommendations

High Priority

  1. No Upper Bound Validation for Loop Config (context.ts:630) - Add validation for commitInterval and historyKeep parameters

  2. Timeout Does Not Cancel Underlying Operations (context.ts:464-480) - Document this limitation prominently in public API docs

  3. Missing Input Validation - Add validation for sleep duration, listenN limit, and step timeout parameters

Medium Priority

  1. Console Warning in Production Code (context.ts:1747-1750) - Consider adding optional logger to driver interface

  2. Error Serialization Limited (index.ts:815-862) - Consider supporting arrays and nested objects for richer error context

  3. Potential Performance Issue in validateComplete (context.ts:245-267) - Consider optimizing for workflows with many entries

Low Priority

  1. Magic Numbers - Use named constants for hardcoded values like retry delays

Security & Performance

  • Uses crypto.randomUUID() for secure ID generation
  • Batch writes for efficiency
  • Lazy metadata loading
  • Loop history pruning prevents unbounded growth

Final Verdict

This is high-quality code with excellent architecture. The implementation is production-ready with minor improvements recommended.

Recommendation: APPROVE

Should Fix Soon:

  • Add validation for loop config parameters
  • Add input validation for negative durations/limits
  • Consider replacing console.warn with configurable logging

Estimated Effort: ~4-6 hours for all improvements

Great work! The workflow engine is a solid foundation for building reliable distributed systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants