-
Notifications
You must be signed in to change notification settings - Fork 400
http2: Move all socket operations to the session thread #4368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
No diff with the --ignore-all-space option.
The connection may be closed before either stream 1 or 3 is done sending a CONTINUATION frame.
This is a minimal API not attempting to keep track of the eventfd counter, and only caring about threads signalling progress on one end, and another thread being able to notice that progress can be made. This is an alternative to condvars when waiting for progress will content with polling concurrently waiting for progress elsewhere. Since locks/condvars don't compose well with file descriptors, the VEFD API offers an alternative that can integrate an existing poll on other file descriptors.
The caller is responsible for passing a req for stream 0 depending on a prior knowledge setup of the session or an opportunistic upgrade from an HTTP/1 session. The signature reflects that h2_sess takes ownership of req.
The signature of h2_new_req() changed to reflect that.
The h2_ou_rel() function is fail-safe, and passing a phony return value further down is misleading.
This clarifies what is being released without looking at the signature.
It becomes unusable at the call site, so the signature should reflect that.
The signature is changed to reflect that the h2_req is no longer usable after calling the function.
This is to be better prepared for handling errors at the call site.
Waiting for the PRISM request immediately after sending the 101 (switch protocols) is a prerequisite to ensure a proper order of events. A subsequent change will ensure that the first frame sent by Varnish is the initial SETTINGS frame.
The code calling h2_rxframe() assumes that the workspace will have been released upon its return. This was not done in the case of early return due to goaway having been sent.
The H/2 RFC requires that the server sends the settings frame during an oppurtunistic upgrade before any stream frames. But we scheduled the stream==1 (H/1 received upgraded request that will get the response over H/2) thread before sending the settings. This technically (though unlikely) means that it could win the race to get the send mutex, and start sending frames before the settings. To clean this up, this patch delays the scheduling of the stream==1 request thread until after the settings have been sent. Also it changes things so that failures to schedule the stream==1 will result in a ENHANCE_YOUR_CALM error being sent to the client. A test case is updated to expect the error message.
The H2 spec uses the same bit for frame flags of the same meaning, regardless of the frame they are used in. Simplify and reduce number of headers by defining each only once.
At this point, we alrerady have a complete frame.
It can be used anywhere we expect short writes using writev(). It's hosted by cache_session.c until there is enough momentum to give VIOV its own C file.
This makes cache_http2_proto.c smaller.
The t_window field in particular could be confused with the naming convention for timestamps. Using rx_ and tx_ prefixes removes the ambiguity.
This is not just any rx thread, and in the future the h2_sess thread will be in charge of sending frames too. While at it, properly check the h2 argument, properly compare threads, and add an h2_req counterpart since any thread working with h2 that is not the session thread, must be a request stream thread.
And make it operate on the r2_req only, from which the h2_sess can be reached.
This event FD will be used to signal from client threads to the session thread when sending data.
This large commit switches H2 sockets to be using non-blocking sockets instead of blocking operations. It also moves all IO handling into the session thread (where previously write IO was performed both by the session thread and request threads, using a mutex to give exclusive access to the writing). The non-blocking IO is handled as reentrant functions that will write what they can when called. A poll() is used in order to monitor the socket state and notice when further progress can be made. Request threads will queue their payload writes for to the session thread, which will pick them up and execute both the framing and the writes. The request thread is then signalled when the payload has been transmitted. Because the change to non-blocking on the socket throws the rug from beneath most of the core H2 functionality, this is one large commit that changes everything in one go. A send buffer for control frames is allocated at the beginning of the H2 workspace for stream 0. HTC Pipe-lining can no longer afford a complete rollback, and was changed to figure how to reset or roll back the HTC workspace at HTC_RxInit() time, or leave it alone. The HTC owner is in charge of providing a workspace snapshot when needed.
We have been failing to enforce a MUST requirement of the H2 spec that states that a PRISM MUST be followed by a SETTINGS frame (which may be empty). Note that this also goes for H1->H2 upgrades, even though the settings were also transferred as an H1 header. Relevant RFC references: rfc7540,l,579,637 rfc7540,l,482,485 This change comes with a need to adjust several of our test cases that failed to adhere to this requirement. The test case changes mostly consist of not doing the preface manually, allowing varnishtest to do the needful.
This centralizes the logic around h2 stream state changes to one function.
When we are sending a GOAWAY frame during session shutdown, we have to make sure not to cancel the current large frame if any. Fix this by changing the send framework shutdown into a two stage operation. Prior to sending the GOAWAY we cancel any queued large frames, then we attempt to get the GOAWAY out the door, and finally cancel the current large frame if still present. Better diff with the --ignore-all-space option.
An assertion was wrongly being made that when killing any request, the hpack lock would have been released. This of course only is true when we are killing the request holding the hpack lock. Better diff with the --ignore-all-space option.
If we end up killing a req that is holding the hpack lock, we will have lost the compression state. This change makes sure to set a connection error if this happens, which in turn will ensure that the connection is closed.
The per-frame handler functions are not allowed to call h2_kill_req() from within its handlers. This is because the caller will be holding a pointer to it, and h2_kill_req() would free the req if it happens to not currently be scheduled. This patch fixes h2_rx_rst_stream() so that it adheres to this rule.
This unsets the `struct h2_req` magic value as held in the req's workspace allocation when releasing the req. This ensures that a dangling `struct h2_req` pointer would fail its magic check after the req has been released.
The window update handling code would interpret (r2==NULL) as meaning the connection window always. This is faulty logic, as r2 will be NULL also for a stream that has been closed. The RFC clearly states that a window update for a closed stream must be ignored. Fix this by testing on the received stream number being zero rather than (r2==NULL) to determine connection window updates. One way this issue can become a handling problem is that multiple faulty increases of the connection window can make us overflow the connection window maximum size, causing us to send a flow control error to the client.
It is no longer needed to hold the lock when manipulating the open streams list. This list is after the redesign only ever accessed by the session thread.
Avoid the data race on `r2->scheduled` by holding the session mutex when testing it.
|
Hello from a bus in the middle of nowhere. It was brought to my attention that someone was "not terribly happy about" me doing a "and don't do anything else, while I'm away on vacation" salute. Apologies for giving this impression, I'll try to clarify what I wrote on a late Friday night since I was not sure I would get a reliable-enough internet access today to submit it.
What I mean here is not "don't do anything else" but rather, if you can help it, "do anything else" after merging. In other words, if you register review comments that could be applied afterwards, let's apply them afterwards.
If you must insist on making changes before merging, "do anything else" as new commits at the end of the patch series.
I could have submitted a smaller patch series, but that would make coordination harder for us, so in order to facilitate that kind of contribution (already gone to production) from us, it was easier to retain the same list of applicable patches. Please do "anything else" while I'm on vacation, except introducing structural changes in the middle of the patch series. 🫡 |
|
|
||
| #include "cache_varnishd.h" | ||
|
|
||
| uintptr_t const ws_pipeline_rollback = (uintptr_t)&ws_pipeline_rollback; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be const uintptr_t ws_pipeline_rollback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tripped on not-quite-constant symbols in this form:
// .h
extern const char *sentinel_value;
// .c
const char *sentinel_value = "literal";So to avoid this, I took the habit of adding the const modifier to the symbol when it is meant to never change, instead of its type. While I agree that in practice it does not make a difference here for this non-pointer type, it better conveys (to me) the intent of making ws_pipeline_rollback read-only.
See 0cfb214 for a recent example.
mbgrydeland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have gone through all the patches, and I believe this is a faithful port of the code. Only a couple of minor comments made.
| /*--------------------------------------------------------------------*/ | ||
|
|
||
| void | ||
| HTC_RxInit(struct http_conn *htc, struct ws *ws) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a hard time groking the logic wrt to the pipeline_snap usage. Wouldn't it be cleaner to add a snapshot argument to HTC_RxInit(), and have each call site give the right argument. This argument looks to me like it could be done unconditionally on each of the call sites, making the code easier to follow and cleaner IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, and that would avoid increasing the size of struct http_conn.
| r2->magic = 0; | ||
| req->transport_priv = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these two lines need to be moved after Req_Cleanup(). This due to Req_Cleanup() calling VDP_Fini(), which I guess may for some VDP call on code expecting to see the complete structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I forgot about the increased scope of Req_Cleanup() in this branch.
|
I tried to apply this patch and then rebase master (trunk) and this seems to solve the issue we had with H2 where we had to set |
|
Thank you for your feedback. In order to move this forward, I plan to revert the latest VSV patches at the beginning of the patch series, and add the patches we originally wrote for this new architecture at the end. But I'm not starting without a sign that this will receive attention. |
|
Not sure if its related to this patch but it could be .. after running it for an extended period we got a crash |
|
There is something confusing here:
These two items seem to be contradicting each other. |
Nevermind: #4396 |
Bugwash: @nigoroll Thinks that he will be able to look at it this month. |
This is a port of a change initially made by @mbgrydeland first in Varnish Enterprise for planning reasons.
It's a major change of the architecture:
h2_sessthreadh2_reqthreads is done with a file descriptorI was the main reviewer of the change, and I personally extracted several commits (@mbgrydeland did too based on my reviews) in order to reduce the size of the main commit as much as possible:
After porting the change to this branch, I wish I had more aggressively extracted small changes into self-contained commits because the amount of conflicts to resolve made the port of this commit alone several days of work.
There were a lot of explicit (line) conflicts, and a fair amount of implicit commits because at this point trunk and Varnish Enterprise (based on 6.0) diverged in many small ways:
varnishscoreboardutilityBetween these major changes (and probably others I'm forgetting) and lots of small core changes accumulated over the years, there was pretty much always a merge conflict for all the commits I cherry-picked. I snuck two original commits at the beginning, to reduce noise when conflicts happened during the port.
I caught two minor bugs in this port, and besides the workspace conflict, things went rather smoothly. It was just tedious to replay, so tedious that I frequently switched to other (important) tasks, which added delays. But I ran the test suite for each commit, stressing them with load, the workspace emulator paired with sanitizers.
The original change went to production, facing real world h2 traffic, that's why the patch series ends with a bunch of bug fixes. For similar workloads it's putting overall a lower load on the system, and it's more reactive. If you wish to use OpenSSL, you can't perform reads and writes concurrently, so moving all socket operations to the session thread also removes a good deal of contention for HTTPS workloads.
I'm requesting @mbgrydeland's review so he can make sure that the port is faithful to our original work. I'm also requesting @walid-git's review because he rebased his trailer implementation (Varnish Enterprise counterpart of #4125) on this architectural change (and eventually he should do the same for #4125). It was reportedly easier to implement with this new architecture.
I have one demand.
Leave structural changes aside if they can wait until after merging this. If they must really happen as part of this change, it will require new commits. I didn't apply the bug fixes in the commit introducing the bugs because keeping the same list of commits will help us track what has or has not been ported either way (among commits relevant in both projects).
I'm otherwise open to squashing whitespace improvements or general polish in the right commits, but I will be away for a couple weeks.