-
Notifications
You must be signed in to change notification settings - Fork 26
Fix deadlock and backfill funding amounts #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Thanks for assigning @joostjager as a reviewer! |
c0b5607 to
e57ab25
Compare
joostjager
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this PR a response to a user report?
| funding_amount_sats, \ | ||
| announcement_signed \ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO NOTHING", &[ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO UPDATE SET funding_amount_sats = $2", &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this guarantee that everything will be backfilled and the described unwrap cannot fail anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not fully, but it should make it much more robust. Surprisingly I'd never seen it, but other folks have 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not fully, but it is safe still? I was looking for the mentioned unwrap, but wasn't completely sure which one it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its here -
rapid-gossip-sync-server/src/lookup.rs
Line 163 in b4e8434
| let funding_sats = current_announcement_row.get::<_, i64>("funding_amount_sats") as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a safe get then for the final 1% and maybe just skip the row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kinda confused why its coming up now to begin with - the commit was landed months ago and I assume most users have updated, so this should just get the stragglers working again. We can revisit it if there's still issues.
The DB persistence loop is responsible for taking new gossip off of the persistence queue and writing it to Postgres. If it gets stalled, the logic pushing gossip onto the event queue might also stall. In the next commit, we'll move the gossip persistence logic into a separate tokio runtime to avoid deadlocks, but in order for that to work it has to be the case that the gossip persistence task can never block waiting on a task on the main runtime, which would be possible via the `NetworkGraph` mutexes. Thus, here, we move the graph writing to its own task in the main runtime.
When we have many peers but relatively few threads, and postgres writes fall behind enough to block the async stream of gossip, we can end up deadlocking on the `PeerManager` `peers` thread. Specifically, because `PeerManager::process_events` holds a `peers` (read) lock while fetching messages (during which time we `block_in_place` trying to push gossip to the async stream), any other tasks which try to touch the `peers` lock in write mode (e.g. connecting to a peer or processing a disconnection) we'll hang until some postgres writes finish. If we get enough threads blocked or we block the tokio reactor and the postgres writes will stall, fully deadlocking. To avoid this, we here move the postgres writes to their own tokio runtime, ensuring that that reactor cannot be blocked.
In c02f2ec we added `funding_amount_sats` to `channel_announcements`, making it `NOT NULL` for new DBs, but `NULL` for old ones. When querying, however, we (indirectly) `unwrap`ed the values. Here, we try to backfill by looking at the local network graph on startup and updating the existing entries, rather than ignoring conflicting writes.
Yes, see discord. |
736dc32 to
406cf7f
Compare
5c6eb95 to
b9204d3
Compare
joostjager
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left few small non-blocking remarks.
| GossipPersister::new(self.logger.clone()).await; | ||
| log_info!(self.logger, "Starting gossip db persistence listener"); | ||
| tokio::spawn(async move { persister.persist_gossip().await; }); | ||
| let runtime = Builder::new_multi_thread() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It disappointing that quite a few issues around async seem to require another runtime 😞 Maybe copy the explanation in the commit message to a code comment here. Don't think it'll be obvious otherwise why this is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, async is really such a strongly-colored function :(
| funding_amount_sats, \ | ||
| announcement_signed \ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO NOTHING", &[ | ||
| ) VALUES ($1, $2, $3) ON CONFLICT (short_channel_id) DO UPDATE SET funding_amount_sats = $2", &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a safe get then for the final 1% and maybe just skip the row?
|
Pushed an additional comment. |
No description provided.