Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Dec 5, 2025

Problem

When a client on one node subscribes to a contract stored on a different node, UPDATE notifications fail to reach the subscriber. The root cause is in the subscription propagation logic: when intermediate nodes forward SeekNode messages, they incorrectly replace the original subscriber's pub_key with their own identity.

This causes the final contract holder to register the wrong peer as the subscriber, so when updates are broadcast, they're sent to intermediate nodes instead of the actual subscriber.

Example Flow (Before Fix)

  1. Client on node-b sends RequestSub with its pub_key
  2. node-a receives and forwards as SeekNode but uses its own pub_key
  3. Contract holder registers node-a as subscriber (wrong!)
  4. UPDATE broadcasts go to node-a, never reaching node-b

Solution

Preserve the original subscriber's pub_key when forwarding SeekNode messages:

  1. In SeekNode forwarding (line ~750): Use subscriber.pub_key() instead of this_peer.pub_key() when the current node doesn't have the contract and forwards to the next hop.

  2. In ReturnSub retry logic (line ~860): When retrying after a ReturnSub { subscribed: false }, use the upstream_subscriber's pub_key if we're an intermediate node.

The subscriber's address is still set to Unknown (to be filled from transport layer), but the identity is preserved through the entire chain.

Testing

Re-enabled test_multiple_clients_subscription which validates cross-node subscription propagation:

  • Client 1 on node-a PUTs a contract and subscribes
  • Client 2 on node-a subscribes
  • Client 3 on node-b (different node) subscribes
  • An UPDATE is sent and all 3 clients should receive notifications

The test passes when network setup succeeds. It has some timing-related flakiness due to multi-node startup coordination, but when connections are established correctly, the cross-node notification now works.

Fixes

Closes #2220

🤖 Generated with Claude Code

When intermediate nodes forward SeekNode messages during subscription propagation,
they were incorrectly replacing the original subscriber's pub_key with their own.
This caused the final contract holder to register the wrong peer as the subscriber,
preventing UPDATE notifications from reaching the actual subscriber.

Changes:
- In SeekNode forwarding (line ~750): Use subscriber.pub_key() instead of this_peer.pub_key()
- In ReturnSub retry logic (line ~860): Use upstream_subscriber's pub_key when available

The fix ensures the original subscriber's identity is preserved through the entire
chain of intermediate peers until the contract holder registers the subscription.

Fixes #2220

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity sanity requested review from Copilot and iduartgomez December 5, 2025 17:01
Copilot finished reviewing on behalf of sanity December 5, 2025 17:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug in cross-node subscription propagation where intermediate nodes were incorrectly replacing the original subscriber's public key with their own when forwarding SeekNode messages, causing UPDATE notifications to be sent to the wrong peer.

Key Changes:

  • Preserves original subscriber identity by using subscriber.pub_key() instead of this_peer.pub_key() when forwarding SeekNode messages
  • Implements proper retry logic that checks for upstream_subscriber to distinguish between intermediate and originating nodes
  • Re-enables test_multiple_clients_subscription test that validates cross-node subscription propagation

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
crates/core/src/operations/subscribe.rs Fixed subscriber identity preservation in SeekNode forwarding logic (line 754) and ReturnSub retry logic (lines 860-872); removed unused this_peer variable; added clarifying comments
crates/core/tests/operations.rs Updated test comments to document the fix for issue #2220 and explain known timing-related flakiness; removed #[ignore] attribute to re-enable test

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sanity and others added 2 commits December 5, 2025 11:15
Three fixes for subscription reliability:

1. Push-before-send in handle_op_result (operations/mod.rs)
   - Push operation state BEFORE sending message to prevent race
   - Response can arrive before send() returns on fast networks

2. State transition fix in subscribe.rs
   - When forwarding RequestSub as SeekNode, transition to AwaitingResponse
   - Previously kept ReceivedRequest state, causing invalid_transition errors

3. try_send to send().await in handler.rs
   - try_send can fail silently under backpressure
   - Use blocking send to ensure registration completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
When a BroadcastTo update arrives from another node and modifies the
contract state, local WebSocket clients that subscribed to this contract
were not being notified. The upsert_contract_state delta path was missing
the send_update_notification() call.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity
Copy link
Collaborator Author

sanity commented Dec 5, 2025

@iduartgomez I've been investigating the cross-node subscription flakiness in test_multiple_clients_subscription and found what I believe is a fundamental architectural issue.

Current Status

The fixes in this PR address several race conditions:

  • Preserving subscriber identity when forwarding SeekNode
  • State transition bugs (RequestSub forward should set AwaitingResponse)
  • Missing send_update_notification in delta path

These help but the test still fails ~70% of the time.

Root Cause Analysis

The subscription tree doesn't reliably align with the UPDATE propagation path. Here's what happens:

  1. Client 3 on node-b subscribes to a contract PUT on node-a
  2. node-bRequestSubgateway
  3. gateway uses k_closest_potentially_caching() to find a contract holder
  4. gateway may forward SeekNode to node-b (because node-b cached contract during GET)
  5. node-b handles SeekNode locally, registers subscription on itself, returns ReturnSub
  6. Client 3 gets success response

When UPDATE happens:

  • Client 1 on node-a sends UPDATE through gateway
  • gateway calls get_broadcast_targets_update()subscribers_of(key)
  • If gateway didn't register node-b as subscriber, UPDATE doesn't propagate

The race condition timing (from logs):

19:05:31.987306Z - ReturnSub sent (subscription completing)
19:05:31.988725Z - UPDATE request sent (1.4ms later!)

Even with a 5s delay before UPDATE, the test only improves to ~33% pass rate, suggesting the issue isn't just timing but the subscription tree structure itself.

Questions for You

  1. Is my understanding correct? Should the subscription tree guarantee that all intermediate nodes (gateway) have the full subscriber list for UPDATE propagation?

  2. Proposed fixes - which approach makes sense for Freenet's architecture?

    a) Confirmation-based subscription: Don't return SubscribeResponse until all intermediate nodes confirm registration via a new SubscribeConfirm message

    b) Broader UPDATE broadcast: Change get_broadcast_targets_update() to also query k_closest_potentially_caching() instead of only subscribers_of()

    c) Deterministic subscription location: Route subscriptions to the contract's ideal location (based on hash), not just any cached copy holder

  3. Should this PR be merged as-is? The fixes are correct (they address real bugs) even though they don't fully solve Re-enable ignored tests: cross-node subscription and multi-node operations #2220. The test would remain #[ignore] on main.

I've documented detailed findings in issue #2220 comment.

[AI-assisted - Claude]

@iduartgomez
Copy link
Collaborator

I havent read this throughtfully, but shouldnt peers be subscribed to the inmediate upstream becauer they dont hold a direct connection with the other peers? I dont know if I am understanding the proposal here.

@sanity
Copy link
Collaborator Author

sanity commented Dec 5, 2025

Update: Root Cause Identified

After discussion with @sanity, we've identified the fundamental issue:

The invariant "caching a contract = being subscribed to it" is violated.

When a peer receives a PUT and caches a contract via seed_contract(), it:

  1. Adds the contract to seeding_contract (marks it as cached)
  2. Does not establish an upstream subscription to receive updates

This means a peer can have a cached contract but no path to receive UPDATEs. When a client on that peer later subscribes, the subscription is handled locally (because has_contract() returns true), creating a disconnected subtree.

The fix in PR #2002 (proximity-based update forwarding) is the proper solution - it ensures neighbors who cache a contract receive updates regardless of subscription tree topology.

Recommendation:

[AI-assisted - Claude]

@sanity
Copy link
Collaborator Author

sanity commented Dec 5, 2025

@iduartgomez To answer your question directly (this is Claude, not Ian):

You're correct that peers subscribe to their immediate upstream - they don't have direct connections to all peers. The subscription tree design is sound.

The issue I identified is that caching ≠ subscribing in the current implementation:

When a peer receives a PUT and caches via seed_contract(), it only adds to seeding_contract but does NOT establish an upstream subscription. So a peer can have has_contract() = true but no path to receive UPDATEs.

When a client on that peer later calls subscribe, the peer handles it locally (line 466 in subscribe.rs checks has_contract()), creating a disconnected subtree.

Ian clarified that caching and subscribing should be synonymous - if you cache a contract you should receive updates. The current code violates this invariant.

PR #2002 (proximity-based forwarding) addresses this by ensuring neighbors who cache a contract receive updates regardless of subscription tree topology. We think that's the right fix rather than trying to patch the subscription routing.

[AI-assisted - Claude]

@sanity
Copy link
Collaborator Author

sanity commented Dec 6, 2025

Closing in favor of PR #2228

After systematic investigation, I've identified why the test_multiple_clients_subscription test fails:

Root Cause

When Client 3 on node-b subscribes to a contract, the subscription succeeds locally (node-b sends ReturnSub to Client 3), but node-b is never registered as a subscriber at the UPDATE source (gateway/node-a). When the gateway broadcasts UPDATE messages, node-b isn't in the broadcast targets, so the UPDATE never reaches node-b.

This is the fundamental issue we discussed earlier: the subscription tree doesn't extend from node-b back to the UPDATE origin. Caching a contract doesn't automatically make a peer part of the subscription tree.

Evidence from logs

  • 8 UPDATE messages total, 0 on node-b
  • All UPDATE messages only reach the gateway
  • Client 3's subscription succeeds but UPDATE never arrives

Why this PR's fixes are insufficient

This PR addressed:

  1. SeekNode identity preservation
  2. Subscribe state transitions
  3. send_update_notification calls

But none of these fix the core problem: node-b needs to receive UPDATE messages in the first place.

The proper fix: PR #2228

PR #2228 (proximity-based update forwarding) addresses this by:

  1. Tracking which neighbors cache which contracts (ProximityCacheManager)
  2. Broadcasting UPDATEs to cached neighbors, not just explicit subscribers

This ensures peers who cache a contract receive updates regardless of subscription tree topology.

Recommendation

Merge PR #2228 and enable the test there. The fixes in this PR (SeekNode identity, state transitions) may still be valuable but should be evaluated in the context of PR #2228's architecture.

[AI-assisted - Claude]

@sanity sanity closed this Dec 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-enable ignored tests: cross-node subscription and multi-node operations

3 participants