-
Notifications
You must be signed in to change notification settings - Fork 2.2k
htlcswitch: add InitAttempt for idempotent external dispatch #10049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
htlcswitch: add InitAttempt for idempotent external dispatch #10049
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @calvinrzachman, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the reliability of HTLC payments by preventing duplicate attempts, particularly in scenarios involving remote clients and the switchrpc sub-server. It introduces an AttemptStore interface and modifies the SendHTLC method to ensure that only one HTLC is forwarded for a given attempt ID. The changes include persisting the intent to dispatch an HTLC before sending it to the network and introducing a PendingNetworkResult message type.
Highlights
- Duplicate Payment Prevention: This PR introduces a mechanism to prevent duplicate payment attempts when using the
switchrpcgRPC sub-server andSendOnion/TrackOnionRPCs. It addresses potential fund losses due to unintentional retries. - Attempt Store Interface: An
AttemptStoreinterface is defined to manage HTLC payment attempt results, supporting both local and remote lifecycle controllers. - Durable Persistence of Intent: The Switch now durably persists the intent to dispatch an HTLC before sending it, adding a checkpoint to prevent duplicates. This involves writing to disk prior to sending the HTLC.
- SendHTLC() Modification: The
SendHTLC()method is updated to be safe for multiple calls with the same attempt ID. The Switch'snetworkResultStoreis modified to enable this. InitAttemptMethod: A newInitAttempt(attemptID)method is added to theAttemptStoreinterface. This method updates the attempt result store to durably mark the existence of an attempt before sending it to the network.- Pending Network Result: Introduces a
PendingNetworkResultmessage type to mark an attempt as initiated but not yet settled/failed.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
An excellent pull request that introduces a robust mechanism to prevent duplicate HTLC dispatches. The approach of using a placeholder in the networkResultStore to mark an attempt as "in-flight" is a clever way to achieve this without significant changes to the on-disk format.
My review has identified one critical issue in the implementation of the duplicate check. Addressing this point will help ensure the stability and correctness of this important feature.
149221d to
d46a710
Compare
bitromortac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for exploring this @calvinrzachman! The direction looks good.
I think something is still missing, answering the question: who deletes results and when (including how long would we like duplicate protection to last)? At the moment it is not safe to restart the client or server if we want to allow simultaneous use of SendOnion (external) and SendPaymentV2 (internal), because there are two routers that call CleanStore at restart. So for example, a restart of the server would delete all the in-flight attempts initiated by another client. A minimally working version of this would require a flag to turn off the server's router to prevent its calls to CleanStore, right?
To solve this we probably need to extend the switch rpc to also implement NextID
Line 1163 in fb68f36
| NextPaymentID: sequencer.NextID, |
NextID such that the client's control tower has unique attempt ids).
In order to solve the multi client CleanStore problem, we would need to extend the switchrpc.SendOnion/TrackOnion call with a client id (you mentioned this idea elsewhere). This way we can set the first byte of the attempt id used as a key for the server's network results store as the client id, to avoid initial database changes (we could later extend the key and really prefix the client id to have a larger space and to enable more clients) and leave the attempt id used in the switch untouched (no prefixing). That way we can have CleanStore(clientID, keepIDs) to only clean up the results concerned with the client. Those are just some initial thoughts and they probably require a second thorough analysis, but hopefully gives some more directions.
|
@bitromortac yes, we can certainly expand our switchrpc server to support some method of attempt information cleanup. Two potential behaviors for deletion I have considered thus far are:
There are some differences to the two approaches, but for now we have an implementation of a CleanStore RPC which follows the approach lnd has taken historically for deletion from the Switch's attempt store here: calvinrzachman#19 You're quite right that it is not safe to have multiple HTLC dispatching entities independently calling CleanStore without coordination. The CleanStore approach to deletion requires ID space separation. It cannot work with multiple clients sharing the same flat attempt ID space. There needs to be an identity-aware mapping of attempt IDs - the Switch must know:
Though I don't have this entirely flushed out, I think it might be possible for multiple routers to share an ID space if:
Yes, exactly! We can disable lnd's typical cleanup behavior which occurs on Router startup via |
d46a710 to
cd958d3
Compare
cd958d3 to
79619b1
Compare
38d07d9 to
4d22856
Compare
|
|
||
| return nil | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that CleanStore doesn't use the mutexes, probably because of the assumption that it's only called at startup. I'm not sure if we should add them and cancel any subscriptions. We want to have continuous maintenance of result clean up in the future, so maybe it's worth to at least add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds like a good explanation for why it might not have originally been included. I can make a comment. We could update CleanStore to use the mutexes when we add the switchrpc.CleanStore method in a future PR as then it would be subject to being called concurrently with other operations of the result store?
htlcswitch/payment_result.go
Outdated
| // ErrPaymentIDNotFound is returned to signal that the result is not yet | ||
| // available. | ||
| // | ||
| // NOTE: This method does not currently acquire the result subscription mutex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying we should add it (the attemptIDMtx)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon looking a little closer. I think we only need to use the mutex when modifying the database or the subscribers map.
// We get a mutex for this attempt ID. This is needed to ensure
// consistency between the database state and the subscribers in case
// of concurrent calls.
I think your point about CleanStore is the true place where the mutex will eventually be necessary.
4d22856 to
12a31fb
Compare
|
@ellemouton: review reminder |
c7bfccb to
ad468ea
Compare
|
I think it would be nice to mention what SendHTLC is currently only safe to run and be called from one Router and more that that (attemptID collision) |
3e42cbb to
8e1879a
Compare
|
Almost, two final comments left:
|
bitromortac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks close, we need to check that errors originating from InitAttempt are handled correctly (see comment).
htlcswitch/payment_result.go
Outdated
| return pending, nil | ||
| } | ||
|
|
||
| // FailAttempt transitions an initialized attempt from PENDING to FAILED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make it more self-documenting by calling it FailPendingAttempt.
htlcswitch/switch.go
Outdated
| log.Errorf("unable to initialize attempt id=%d: %v", | ||
| attemptID, err) | ||
|
|
||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to wrap this error with ErrAmbiguousInitCheck (or similar), to be able to handle this with retries in the (external) router. Otherwise we would fail the attempt in the tower db when getting a result db error, but could still have an inflight HTLC, to lead to a duplicate send? We probably just need to wrap the database error within InitAttempts read operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch 🎉 We have considered many an edge case, but I neglected to consider this last case where the durable write of intent to dispatch, which offers the duplicate protection, itself fails due to serialization or DB error.
Updated to include a new error to mark this scenario. Left as a separate commit for now. If you think it looks fine, then I'll squash into the existing commits before merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave the separate commits to explicitly document the new error handling, just place those commits after the idempotency commit.
htlcswitch/switch.go
Outdated
| // specified attempt ID. This method is the core dispatch logic and does NOT | ||
| // perform InitAttempt or FailAttempt. It is intended to be called by | ||
| // idempotent wrappers. | ||
| func (s *Switch) DispatchHTLC(firstHop lnwire.ShortChannelID, attemptID uint64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I agree with the strategy of exporting DispatchHTLC, using it in SendOnion.
8e1879a to
d0fcb14
Compare
|
@ziggie1984 I updated the error name to We do have unit tests for the |
Preperatory refactor to allow for future alteration of the store backing the Switch.
d0fcb14 to
d5b02a8
Compare
bitromortac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @calvinrzachman, I didn't find any more edge cases. It could make sense to split one commit and I have a few other small remarks and questions.
| t.Fatalf("expected ErrPaymentIDNotFound, got %v", err) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should split this commit up as it contains many changes (can see it from the bullet points in the commit message as well). You separated the tests, which shows you how you could have split this commit into smaller pieces, the first one that touches methods from InitAttempt duplicate prevention and a second one that calls the other methods in TestNetworkResultStoreFailAndFetch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split up the commit! Let me know if it looks better.
htlcswitch/switch.go
Outdated
| // First, we initialize the attempt in our persistent store. This serves | ||
| // as a durable record of our intent to send and gates the attempt id | ||
| // for concurrent callers, allowing them to safely retry. | ||
| err := s.attemptStore.InitAttempt(attemptID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also mention that this must always be the first call in this method to not break idempotency (no calls that can return before this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in mention of this. let me know if you like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we have opted to keep SendHTLC flow unchanged, I will update the SendOnion implementation to include a note in this same spirit 👀
htlcswitch/switch_test.go
Outdated
| // is back from the network. | ||
| err = s.SendHTLC(aliceLink.ShortChanID(), attemptID, update) | ||
| require.ErrorIs(t, err, ErrDuplicateAdd, "expected reuse after result "+ | ||
| "to fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also document that after a call to switch.CleanStore it's possible to send the attempt like in the attempt store tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added that same verification here that the duplicate prevention is lifted after the CleanStore removes the attempt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we have opted to keep SendHTLC flow unchanged, I will include this verification that the ID is freed for re-use in an analogous test with SendOnion and CleanStore when we expand switchrpc to contain a CleanStore (deletion) endpoint.
routing/router.go
Outdated
| // forward a fully encoded payment to the first hop in the route | ||
| // denoted by its public key. A non-nil error is to be returned if the | ||
| // payment was unsuccessful. | ||
| // payment was unsuccessful. A return value of ErrDuplicateAdd signals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add here that callers make sure to not collide with their attempt ids, that they must be unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated ✅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we have opted to keep SendHTLC flow unchanged, I will update the SendOnion implementation to include a note in this same spirit 👀
htlcswitch/payment_result.go
Outdated
| return pending, nil | ||
| } | ||
|
|
||
| // FailAttempt transitions an initialized attempt from PENDING to FAILED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do the rename unless y'all oppose it, I think it makes the usage clearer.
htlcswitch/switch.go
Outdated
| log.Errorf("unable to initialize attempt id=%d: %v", | ||
| attemptID, err) | ||
|
|
||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave the separate commits to explicitly document the new error handling, just place those commits after the idempotency commit.
This commit adds the `InitAttempt` method to the `AttemptStore` interface. This persists the intent to dispatch an HTLC before it is sent to the network, creating a durable record that acts as an idempotency key. This allows for safe, idempotent retries from clients, as subsequent calls for the same attempt ID will be rejected. A new `pending` message type to represent an HTLC that has been initialized but not yet dispatched serves as a placeholder in the durable store. A new error, `ErrAttemptResultPending`, is also introduced to more accurately communicate the state of an attempt. This allows callers to distinguish between an unknown attempt and one that is pending but has not yet received a final result.
d5b02a8 to
d79ffd0
Compare
This commit adds two key methods for managing the lifecycle of payment
attempts:
- `FailPendingAttempt`: Provides a synchronous mechanism to roll back an
initialized attempt to a terminal `FAILED` state. This is crucial for
preventing orphaned attempts when pre-dispatch validation fails within
the `SendHTLC` logic.
- `FetchPendingAttempts`: Allows the switch to query for any attempts
that were initialized but not actually delivered to the network. This
enables a robust cleanup routine on startup to resolve orphaned
payments that could result from a node crash.
This allows the Switch to communicate to a caller whether the status of an HTLC dispatch for a given attempt ID is ambiguous. This can occur when an attempt to initialize an attempt (via InitAttempt) itself fails. This is the key we use to provide idempotence or protection against duplicate processing so that we can offer "at most once" semantics when processing htlc dispatch requests. If this InitAttempt call itself fails, then we cannot definitively know if the attempt is not in-flight, or if it was successfully processed by a prior request. Callers may need to be updated to explicitly handle this new ambiguous scenario.
We update to use FailPendingAttempts and perform all rollback on error from within SendOnion itself. We rely on the core "unsafe" SendHTLC dispatch logic which remains unchanged.
If the Switch crashes or restarts, an HTLC attempt can be left in an inconsistent state, causing callers awaiting a final result via GetAttemptResult to hang indefinitely. This commit expands the startup cleanup routine to handle two types of these "orphaned" attempts: 1. Initialized attempts that exist in the result store but have no corresponding circuit in the CircuitMap. 2. Attempts with a "half-open" circuit, where a circuit was committed but the HTLC was never handed off to the outgoing link. These orphaned attempts are now proactively failed on startup. This ensures that all payments reach a definitive outcome, preventing hangs and allowing local or remote clients to safely retry.
d79ffd0 to
300e16f
Compare
bitromortac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉🎉
|
|
||
| ## Functional Enhancements | ||
|
|
||
| * Introduced a new `AttemptStore` interface within `htlcswitch`, and expanded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to include a link to the PR such that CI can pass, see the other items
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to include the PR link 🔗
| - The `ChannelRouter`'s payment life-cycle management may subscribe for a result | ||
| of an attempt that was never actually dispatched if we crash after | ||
| `CommitCircuits` records the Switch's intent to forward but before the HTLC is | ||
| included in the commitment transaction of the outgoing link. This change also | ||
| introduces a new `cleanupOrphanedAttempts` procedure *within the Switch* that | ||
| runs on startup. This routine addresses this inconsistent or dangling circuit | ||
| state by finding any attempts that were left in a pending state after a node | ||
| crash and safely fails them, fixing a long-standing latent bug where local | ||
| payments could become permanently stuck. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is relevant here, I'd remove it. If there was no pending result in the result store, the cleanup routine won't unblock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
300e16f to
03fa01c
Compare
ziggie1984
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉, great work. Guess we thought of all edge cases.
| const ( | ||
| // pendingHtlcMsgType is a custom message type used to represent a | ||
| // pending HTLC in the network result store. | ||
| pendingHtlcMsgType lnwire.MessageType = 32768 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to guard against this, only msg of a certain type make it to the result store, so I think we are fine.
248001d
into
lightningnetwork:elle-base-branch-payment-service
Background
As part of a larger effort to support offloading path-finding and payment life-cycle management to an external entity, we are introducing a new
switchrpcgRPC sub-server #9489. This will allow a remotely instantiatedChannelRouteror other RPC client to orchestrate payments across a set of remotelndinstances.To safely support this, we need to prevent duplicate payment attempts and resulting unintentional loss of funds when using the forthcoming switchrpc server and
SendOnionRPC. The two primary categories of duplicate payment risk are:htlcswitchfails to enforce "at-most-once" dispatch for a givenattemptID.ChannelRouteror RPC client misinterprets an ambiguous error, incorrectly assumes an attempt has failed, and launches a new, duplicative attempt via another attempt ID.Communication over a network is unreliable; a remote client can time out and lack certainty on whether its request was processed, lost, or whether the server's acknowledgement of the request was lost. To navigate this, a remote client will need to employ principles of reliable communication in the form of: client supplied request IDs, retries, acknowledgements, idempotent APIs and sound error handling.
Change Description
This PR is a first foundational step. It does not introduce the RPC itself, but instead adds the underlying storage mechanism that makes idempotence possible. We introduce a new
AttemptStoreinterface and expand itskvdb-backed implementation, thenetworkResultStore, to include two new methods. This allows us to durably persist a record of intent to dispatch an HTLC prior to that HTLC being forwarded.The core primitives this new store provides are:
s.attemptStore.InitAttempt(attemptID): This serves as a "durable write of intent" or "write-ahead log," checkpointing the existence of an attempt prior to dispatch. It creates a record in a newPENDINGstate, guaranteeing that any subsequent call for the same ID will be rejected as a duplicate.s.attemptStore.FailPendingAttempt(attemptID): This provides a mechanism to transition an initialized attempt to a terminally failed state. This prevents accrual of initialized attempts which are not actually dispatched. Such orphanedPENDINGattempts would otherwise causeGetAttemptResultsubscribers to hang waiting on a result which will never arrive.NOTE: An effort is made in this branch to "preserve the core" htlc dispatch logic. To minimize the risk of regressions for existing
lndusers, this work will leave the critical-path logic ofhtlcswitch.SendHTLCand therouting.paymentLifecyclecompletely untouched. All new logic will be implemented at theswitchrpcboundary. It also appears possible to achieve our goal without modifying the on-disk structure of thenetworkResulttype. The idea is to overload the use of thelnwire.Msgfield to encode a third state besidesSettle/Fail—namely,PENDING.With the introduction of the
PENDINGstate, we update the primary user of the attempt store,GetAttemptResult. Previously, a tracking request that raced ahead of a full HTLC dispatch could receive a misleadingErrPaymentIDNotFound. Now, whenGetAttemptResultencounters an attempt in thePENDINGstate, the store returnshtlcswitch.ErrAttemptResultPending. This signal is used to correctly subscribe to the future outcome rather than returning an error, ensuring the tracker safely waits for the definitive result. This makes the entire tracking lifecycle robust against dispatch request race conditions which external routers may otherwise encounter.Future Work
switchrpcsub-server with aSendOnionRPC that uses theAttemptStoreto provide an idempotent API.Steps to Test
htlcswitchandroutingunit tests pass.networkResultStorehave been added to verify theInitAttempt,FailPendingAttempt, andcleanupOrphanedAttemptslogic.