-
Notifications
You must be signed in to change notification settings - Fork 400
backend: add VSCs for backend errors #4337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
nigoroll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides specific comments, I am not sure if adding global counters makes much sense. If users see these increase, won't they immediately ask "yeah, but which backend?". So unless there is some specific background which might need explanations, I would suggest to skip one iteration and go straight to per-backend counters.
include/tbl/sess_close.h
Outdated
| SESS_CLOSE(VCL_FAILURE, vcl_failure, 1, "VCL failure") | ||
| SESS_CLOSE(RAPID_RESET, rapid_reset, 1, "HTTP2 rapid reset") | ||
| SESS_CLOSE(BANKRUPT, bankrupt, 1, "HTTP2 credit bankruptcy") | ||
| #ifndef SESS_CLOSE_BACKEND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is the best option.
Should we maybe rather have SESS_CLOSE_C() for the client-only closure reasons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be a little bit nicer.
It would mimic something we already do in include/tbl/obj_attr.h.
Let me have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: 07f58bb
bin/varnishd/cache/cache_backend.c
Outdated
| */ | ||
|
|
||
| static void | ||
| vbe_close_acct(const struct pfd *pfd, stream_close_t reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this patch only half baked? This commit only adds one call to vbe_close_acct and for this call, incrementing bc_tx_proxy looks wrong, because reason is not SC_NULL. Also I wonder why bc_tx_proxy should need special casing, why do we suddenly care about the call site, when otherwise the reason is just that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made vbe_close_acct into a function to mimic ses_close_acct. It being called from one call-site was intended.
reason can be SC_NULL because this if statement is an OR, see https://github.com/varnishcache/varnish-cache/pull/4337/files#diff-fe1fb2709016ca15daea3e926b7cec272966ee397c5bafe50d6967023dd9f4a7R401
If you remove this handling, the test tests/o00002.vtc will run into an assert:
*** v2 debug|Error: Child (1249377) Panic at: Thu, 22 May 2025 09:20:05 GMT
14 *** v2 debug|Wrong turn at cache/cache_backend.c:381:
15 *** v2 debug|Wrong event in vbe_close_acct
16 *** v2 debug|version = varnish-trunk revision f15259b053765f675a10b40e0e6d91a0847ac065, vrt api = 21.0
17 *** v2 debug|ident = Linux,6.1.0-34-amd64,x86_64,-jnone,-sdefault,-sdefault,-hcritbit,epoll
18 *** v2 debug|now = 392362.874551 (mono), 1747905603.630655 (real)
19 *** v2 debug|Backtrace:
20 *** v2 debug| ip=0x564002279265 sp=0x7f60b036a390 <VBT_format+0x35>
21 *** v2 debug| ip=0x5640021acdb3 sp=0x7f60b036a3b0 <pan_backtrace+0x33>
22 *** v2 debug| ip=0x5640021acaf7 sp=0x7f60b036a3d0 <pan_ic+0x377>
23 *** v2 debug| ip=0x5640022782db sp=0x7f60b036a550 <VAS_Fail+0x4b>
24 *** v2 debug| ip=0x56400216dd91 sp=0x7f60b036a590 <vbe_close_acct+0x3a1>
25 *** v2 debug| ip=0x56400216b5ab sp=0x7f60b036a5b0 <vbe_dir_finish+0x3cb>
26 *** v2 debug| ip=0x564002181419 sp=0x7f60b036a610 <VDI_Finish+0x249>
27 *** v2 debug| ip=0x56400218e318 sp=0x7f60b036a6c0 <vbf_cleanup+0x158>
28 *** v2 debug| ip=0x5640021932f2 sp=0x7f60b036a6e0 <vbf_stp_fetchend+0x1f2>
29 *** v2 debug| ip=0x56400218f364 sp=0x7f60b036a730 <vbf_fetch_thread+0x5f4>
30 *** v2 debug| ip=0x5640021edb47 sp=0x7f60b036a810 <Pool_Work_Thread+0x7c7>
31 *** v2 debug| ip=0x5640021ed243 sp=0x7f60b036a8a0 <WRK_Thread+0x333>
32 *** v2 debug| ip=0x5640021ececb sp=0x7f60b036b430 <pool_thread+0xcb>
33 *** v2 debug| ip=0x7f60b92a81f5 sp=0x7f60b036b460 <pthread_condattr_setpshared+0x515>
34 *** v2 debug| ip=0x7f60b932889c sp=0x7f60b036b500 <__xmknodat+0x23c>
GDB:
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007fdbc20a9f4f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x00007fdbc205afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007fdbc2045472 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000558df4ab5d0c in pan_ic (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at cache/cache_panic.c:791
#5 0x0000558df4b812db in VAS_Fail (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at vas.c:107
#6 0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381
#7 0x0000558df4a745ab in vbe_dir_finish (ctx=0x7fdbbf5a6610, d=0x7fdbc181d7b0) at cache/cache_backend.c:403
#8 0x0000558df4a8a419 in VDI_Finish (bo=0x7fdbb667d460) at cache/cache_director.c:195
#9 0x0000558df4a97318 in vbf_cleanup (bo=0x7fdbb667d460) at cache/cache_fetch.c:134
#10 0x0000558df4a9c2f2 in vbf_stp_fetchend (wrk=0x7fdbbf5a7340, bo=0x7fdbb667d460) at cache/cache_fetch.c:748
#11 0x0000558df4a98364 in vbf_fetch_thread (wrk=0x7fdbbf5a7340, priv=0x7fdbb667d460) at cache/cache_fetch.c:1123
#12 0x0000558df4af6b47 in Pool_Work_Thread (pp=0x7fdbbf5fd000, wrk=0x7fdbbf5a7340) at cache/cache_wrk.c:496
#13 0x0000558df4af6243 in WRK_Thread (qp=0x7fdbbf5fd000, stacksize=81920, thread_workspace=2048) at cache/cache_wrk.c:157
#14 0x0000558df4af5ecb in pool_thread (priv=0x7fdbbe0000e0) at cache/cache_wrk.c:529
#15 0x00007fdbc20a81f5 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007fdbc212889c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) frame 6
#6 0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381
381 WRONG("Wrong event in vbe_close_acct");
(gdb) p reason
$1 = (stream_close_t) 0x558df4c25600 <SC_NULL>
(gdb)As you can see reason is SC_NULL.
The reason for why SC_NULL require special handling is because that value is not handled from the table (include/tbl/sess_close.h).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes, you are right, I overlooked || bp->proxy_header != 0.
So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.
So I can follow now the argument that this is in fact a special close reason, but I would still think it should get its own entry in sess_close.h, be adjusted in vbe_dir_finish() and handled like any other in vbe_close_acct().
So why is vbe_close_acct() not called from vbe_dir_getfd()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes, you are right, I overlooked
|| bp->proxy_header != 0.So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.
Correct, no worries it caught me off-guard as well when first producing the patch.
So I can follow now the argument that this is in fact a special close reason, but I would still think it should get its own entry in
sess_close.h, be adjusted invbe_dir_finish()and handled like any other invbe_close_acct().
I agree that we should not let reason be SC_NULL here, I wanted to fix it but also did not want to introduce too many changes in this PR. I can include a commit that change this behavior if you like.
So why is
vbe_close_acct()not called fromvbe_dir_getfd()?
So, whenever a pfd is returned by vbe_dir_getfd() the caller is expected to call vbe_dir_finish() once they are finished using it.
For all the other case where we return NULL (and not pfd), we do not set a valid htc->doclose attribute.
Instead, vbe_dir_getfd() sometimes do not hold pfd yet, or it got one but closes it down before returning NULL.
Then, the return value of vbe_dir_getfd() ensure that we fail the fetch (or do not start the pipe).
I could be convinced to change how vbe_dir_getfd() operates but I guess others have an opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to fix it but also did not want to introduce too many changes in this PR.
We have PRs with a lot of changes. So this is more abut structure. For example, I think you could introduce the new SC to the old code, then add vbe_close_acct() and it would all fall into place.
I could be convinced to change how
vbe_dir_getfd()
I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I will make an attempt at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to fix it but also did not want to introduce too many changes in this PR.
We have PRs with a lot of changes. So this is more abut structure. For example, I think you could introduce the new SC to the old code, then add
vbe_close_acct()and it would all fall into place.
So, I had an attempt at this. I am starting to understand why things are laid out as they are today.
The reason why we can not use a SC for PROXY is due to a couple of reasons:
- First,
SC_NULLis used as a sink multiple places. Ifbo->htc->docloseisSC_NULLwe know that the fetch has not failed yet. Changing this would mean that we would always have to take PROXY into account. No big deal, but are we sure we want to go this route? - Another reason is that we can not latch another reason on top of
bo->htc->doclose = SC_TX_PROXY;. Example, how do we report a fetch has failed due to a timeout (SC_RX_TIMEOUT) when we are dealing with a PROXY transaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be convinced to change how
vbe_dir_getfd()I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in
vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.
| VSLb(bo->vsl, SLT_FetchError, "Received junk"); | ||
| htc->doclose = SC_RX_JUNK; | ||
| break; | ||
| case HTC_S_CLOSE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this particular change lgtm
Good point, this was actually mentioned to me. I forgot to take that into account, but I think that make sense. |
I am not entirely convinced that the double accounting pays off, but I can see that the argument has some relevance for practical purposes. |
See: a3fdba2
|
nigoroll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this is OK, but I would like to ask @asadsa92 to make life easier for reviewers. Could you please
- rebase the PR
- sqash the to-be-squased commit
- separate commits for moving code from changing code?
I would then like to have a final look, but I hope that will be a piece of cake then
bin/varnishd/cache/cache_backend.c
Outdated
| return (NULL); | ||
| if (bp->proxy_header != 0) { | ||
| err = VPX_Send_Proxy(*fdp, bp->proxy_header, bo->sp); | ||
| if (err < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moving the block a level down is worse c-style.
We should avoid nesting if possible and there is no performance to gain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I have reverted the nesting, see 725e716
bin/varnishd/cache/cache_backend.c
Outdated
| */ | ||
|
|
||
| static void | ||
| vbe_close_acct(const struct pfd *pfd, stream_close_t reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please go back to the commit where you added vbe_close_acct and move it to this place rather than adding it in the wrong place and then moving it again this additional commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
| WRONG("Wrong event in vbe_close_acct"); | ||
| } | ||
|
|
||
| static void v_matchproto_(vdi_finish_f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a commit before where you move vbe_dir_finish as-is without modification such that we can see clearly in this commit what the changes are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I found that I had misplaced the opening curly for the function. See 268b182
bin/varnishd/cache/cache_backend.c
Outdated
|
|
||
| static void | ||
| vbe_close_acct(const struct pfd *pfd, stream_close_t reason) | ||
| vbe_close_acct(const struct pfd *pfd, struct backend *bp, stream_close_t reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will likely trigger a "could be const" flexelint notice.
We should pass the vsc pointer, not bp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your 2nd review, we are getting there 😄 |
For backends not all stream_close_t values make sense. Therefore, do not allow these to be specified as `doclose` attribute for failed backend transaction. No change is needed for existing errors. We should ideally split them up, but this is not possible as some functions are used by both sessions and fetches. One example of such function: - stream_close_t http_DoConnection(struct http *hp, stream_close_t sc_close); This patch is best viewed with --ignore-all-space Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
Abstract vbe_close_acct, this is supposed to mimic the ses_close_acct for backends. We have up until now not exposed `htc->doclose` (backend errors) to any counters as we do for `req->doclose` (session errors). That is unfortunate as these errors code are useful when trubleshooting fetches, so expose them as `bc_*` counters which is a subset of the `sc_*` counters. Also add counters for backend closes: `backend_closed` and `backend_closed_err`. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
HTC_S_EOF is returned when the backend close the connection on us. Since RxStuff() no longer reports `HTC_S_CLOSE` we can reuse it for EOF. To be consistent with sessions (client disconnect) use the `SC_REM_CLOSE`. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
This new debug bit enable us to test proxy send error for VRT backends. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
No modifications are done to the vbe_dir_finish function. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
There is no need to special case VPX_Send_Proxy with its own error handling. We can instead reuse vbe_dir_finish. We will now report BackendClosed tag in VSL as well (from vbe_dir_finish). To accomplish this we need to initialize the HTC struct a little bit earlier. Finally, we change the doclose reason to be an error transmission (SC_TX_ERROR). Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
Change vbe_close_acct to update counters for the specific backend as well as the existing global counters. This addition allows a user to easily pinpoint the problematic backend. As some backends can come and go (dynamic backends), we should maintain the global counters as is. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
a3fdba2 to
8a872b0
Compare
|
@nigoroll Thanks once again for your review, this should be ready now. $ git range-diff 8a219c644..a3fdba2c9 720b939c2.. |
Abstract vbe_close_acct, this is supposed to mimic the ses_close_acct for backends.
We have up until now not exposed
htc->doclose(backend errors) to any counters as we do forreq->doclose(session errors).That is unfortunate as these errors code are useful when trubleshooting fetches, so expose them as
bc_*counters which is a subset of thesc_*counters.Also add counters for backend closes:
backend_closedandbackend_closed_err.