backend: add VSCs for backend errors #4337

asadsa92 · 2025-05-22T08:51:07Z

Abstract vbe_close_acct, this is supposed to mimic the ses_close_acct for backends.

We have up until now not exposed htc->doclose (backend errors) to any counters as we do for req->doclose (session errors).

That is unfortunate as these errors code are useful when trubleshooting fetches, so expose them as bc_* counters which is a subset of the sc_* counters.

Also add counters for backend closes: backend_closed and backend_closed_err.

nigoroll

Besides specific comments, I am not sure if adding global counters makes much sense. If users see these increase, won't they immediately ask "yeah, but which backend?". So unless there is some specific background which might need explanations, I would suggest to skip one iteration and go straight to per-backend counters.

include/tbl/sess_close.h

nigoroll · 2025-05-22T09:00:11Z

include/tbl/sess_close.h

-SESS_CLOSE(VCL_FAILURE,	  vcl_failure,	1,	"VCL failure")
-SESS_CLOSE(RAPID_RESET,	  rapid_reset,  1,      "HTTP2 rapid reset")
-SESS_CLOSE(BANKRUPT,	  bankrupt,	1,      "HTTP2 credit bankruptcy")
+#ifndef SESS_CLOSE_BACKEND


I am not sure if this is the best option.

Should we maybe rather have SESS_CLOSE_C() for the client-only closure reasons?

Yes, I think that would be a little bit nicer.
It would mimic something we already do in include/tbl/obj_attr.h.
Let me have a look.

See: 07f58bb

nigoroll · 2025-05-22T09:05:23Z

bin/varnishd/cache/cache_backend.c

+ */
+
+static void
+vbe_close_acct(const struct pfd *pfd, stream_close_t reason)


Is this patch only half baked? This commit only adds one call to vbe_close_acct and for this call, incrementing bc_tx_proxy looks wrong, because reason is not SC_NULL. Also I wonder why bc_tx_proxy should need special casing, why do we suddenly care about the call site, when otherwise the reason is just that?

I made vbe_close_acct into a function to mimic ses_close_acct. It being called from one call-site was intended.

reason can be SC_NULL because this if statement is an OR, see https://github.com/varnishcache/varnish-cache/pull/4337/files#diff-fe1fb2709016ca15daea3e926b7cec272966ee397c5bafe50d6967023dd9f4a7R401

If you remove this handling, the test tests/o00002.vtc will run into an assert:

*** v2 debug|Error: Child (1249377) Panic at: Thu, 22 May 2025 09:20:05 GMT 14 *** v2 debug|Wrong turn at cache/cache_backend.c:381: 15 *** v2 debug|Wrong event in vbe_close_acct 16 *** v2 debug|version = varnish-trunk revision f15259b053765f675a10b40e0e6d91a0847ac065, vrt api = 21.0 17 *** v2 debug|ident = Linux,6.1.0-34-amd64,x86_64,-jnone,-sdefault,-sdefault,-hcritbit,epoll 18 *** v2 debug|now = 392362.874551 (mono), 1747905603.630655 (real) 19 *** v2 debug|Backtrace: 20 *** v2 debug| ip=0x564002279265 sp=0x7f60b036a390 <VBT_format+0x35> 21 *** v2 debug| ip=0x5640021acdb3 sp=0x7f60b036a3b0 <pan_backtrace+0x33> 22 *** v2 debug| ip=0x5640021acaf7 sp=0x7f60b036a3d0 <pan_ic+0x377> 23 *** v2 debug| ip=0x5640022782db sp=0x7f60b036a550 <VAS_Fail+0x4b> 24 *** v2 debug| ip=0x56400216dd91 sp=0x7f60b036a590 <vbe_close_acct+0x3a1> 25 *** v2 debug| ip=0x56400216b5ab sp=0x7f60b036a5b0 <vbe_dir_finish+0x3cb> 26 *** v2 debug| ip=0x564002181419 sp=0x7f60b036a610 <VDI_Finish+0x249> 27 *** v2 debug| ip=0x56400218e318 sp=0x7f60b036a6c0 <vbf_cleanup+0x158> 28 *** v2 debug| ip=0x5640021932f2 sp=0x7f60b036a6e0 <vbf_stp_fetchend+0x1f2> 29 *** v2 debug| ip=0x56400218f364 sp=0x7f60b036a730 <vbf_fetch_thread+0x5f4> 30 *** v2 debug| ip=0x5640021edb47 sp=0x7f60b036a810 <Pool_Work_Thread+0x7c7> 31 *** v2 debug| ip=0x5640021ed243 sp=0x7f60b036a8a0 <WRK_Thread+0x333> 32 *** v2 debug| ip=0x5640021ececb sp=0x7f60b036b430 <pool_thread+0xcb> 33 *** v2 debug| ip=0x7f60b92a81f5 sp=0x7f60b036b460 <pthread_condattr_setpshared+0x515> 34 *** v2 debug| ip=0x7f60b932889c sp=0x7f60b036b500 <__xmknodat+0x23c>

GDB:

(gdb) bt #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 #1 0x00007fdbc20a9f4f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #2 0x00007fdbc205afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #3 0x00007fdbc2045472 in __GI_abort () at ./stdlib/abort.c:79 #4 0x0000558df4ab5d0c in pan_ic (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at cache/cache_panic.c:791 #5 0x0000558df4b812db in VAS_Fail (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at vas.c:107 #6 0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381 #7 0x0000558df4a745ab in vbe_dir_finish (ctx=0x7fdbbf5a6610, d=0x7fdbc181d7b0) at cache/cache_backend.c:403 #8 0x0000558df4a8a419 in VDI_Finish (bo=0x7fdbb667d460) at cache/cache_director.c:195 #9 0x0000558df4a97318 in vbf_cleanup (bo=0x7fdbb667d460) at cache/cache_fetch.c:134 #10 0x0000558df4a9c2f2 in vbf_stp_fetchend (wrk=0x7fdbbf5a7340, bo=0x7fdbb667d460) at cache/cache_fetch.c:748 #11 0x0000558df4a98364 in vbf_fetch_thread (wrk=0x7fdbbf5a7340, priv=0x7fdbb667d460) at cache/cache_fetch.c:1123 #12 0x0000558df4af6b47 in Pool_Work_Thread (pp=0x7fdbbf5fd000, wrk=0x7fdbbf5a7340) at cache/cache_wrk.c:496 #13 0x0000558df4af6243 in WRK_Thread (qp=0x7fdbbf5fd000, stacksize=81920, thread_workspace=2048) at cache/cache_wrk.c:157 #14 0x0000558df4af5ecb in pool_thread (priv=0x7fdbbe0000e0) at cache/cache_wrk.c:529 #15 0x00007fdbc20a81f5 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #16 0x00007fdbc212889c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) frame 6 #6 0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381 381 WRONG("Wrong event in vbe_close_acct"); (gdb) p reason $1 = (stream_close_t) 0x558df4c25600 <SC_NULL> (gdb)

As you can see reason is SC_NULL.

The reason for why SC_NULL require special handling is because that value is not handled from the table (include/tbl/sess_close.h).

Sorry, yes, you are right, I overlooked || bp->proxy_header != 0.

So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.

So I can follow now the argument that this is in fact a special close reason, but I would still think it should get its own entry in sess_close.h, be adjusted in vbe_dir_finish() and handled like any other in vbe_close_acct().

So why is vbe_close_acct() not called from vbe_dir_getfd()?

Sorry, yes, you are right, I overlooked || bp->proxy_header != 0.

So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.

Correct, no worries it caught me off-guard as well when first producing the patch.

So I can follow now the argument that this is in fact a special close reason, but I would still think it should get its own entry in sess_close.h, be adjusted in vbe_dir_finish() and handled like any other in vbe_close_acct().

I agree that we should not let reason be SC_NULL here, I wanted to fix it but also did not want to introduce too many changes in this PR. I can include a commit that change this behavior if you like.

So why is vbe_close_acct() not called from vbe_dir_getfd()?

So, whenever a pfd is returned by vbe_dir_getfd() the caller is expected to call vbe_dir_finish() once they are finished using it.

For all the other case where we return NULL (and not pfd), we do not set a valid htc->doclose attribute.
Instead, vbe_dir_getfd() sometimes do not hold pfd yet, or it got one but closes it down before returning NULL.
Then, the return value of vbe_dir_getfd() ensure that we fail the fetch (or do not start the pipe).

I could be convinced to change how vbe_dir_getfd() operates but I guess others have an opinion on this.

I wanted to fix it but also did not want to introduce too many changes in this PR.

We have PRs with a lot of changes. So this is more abut structure. For example, I think you could introduce the new SC to the old code, then add vbe_close_acct() and it would all fall into place.

I could be convinced to change how vbe_dir_getfd()

I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.

Great, I will make an attempt at this.

I wanted to fix it but also did not want to introduce too many changes in this PR.

We have PRs with a lot of changes. So this is more abut structure. For example, I think you could introduce the new SC to the old code, then add vbe_close_acct() and it would all fall into place.

So, I had an attempt at this. I am starting to understand why things are laid out as they are today.
The reason why we can not use a SC for PROXY is due to a couple of reasons:

First, SC_NULL is used as a sink multiple places. If bo->htc->doclose is SC_NULL we know that the fetch has not failed yet. Changing this would mean that we would always have to take PROXY into account. No big deal, but are we sure we want to go this route?

Another reason is that we can not latch another reason on top of bo->htc->doclose = SC_TX_PROXY;. Example, how do we report a fetch has failed due to a timeout (SC_RX_TIMEOUT) when we are dealing with a PROXY transaction?

I could be convinced to change how vbe_dir_getfd()

I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.

See: 59c15f0 and 02ae608

nigoroll · 2025-05-22T09:07:18Z

bin/varnishd/http1/cache_http1_fetch.c

 			VSLb(bo->vsl, SLT_FetchError, "Received junk");
 			htc->doclose = SC_RX_JUNK;
 			break;
-		case HTC_S_CLOSE:


this particular change lgtm

asadsa92 · 2025-05-22T11:32:39Z

Besides specific comments, I am not sure if adding global counters makes much sense. If users see these increase, won't they immediately ask "yeah, but which backend?". So unless there is some specific background which might need explanations, I would suggest to skip one iteration and go straight to per-backend counters.

Good point, this was actually mentioned to me. I forgot to take that into account, but I think that make sense.
However, I think we should keep the global counters such that further breakdown is available for the interested users.
The global counter is a quick way to tell if looking into backends make sense.

nigoroll · 2025-05-22T12:17:10Z

The global counter is a quick way to tell if looking into backends make sense.

I am not entirely convinced that the double accounting pays off, but I can see that the argument has some relevance for practical purposes.

asadsa92 · 2025-06-10T12:49:07Z

The global counter is a quick way to tell if looking into backends make sense.

I am not entirely convinced that the double accounting pays off, but I can see that the argument has some relevance for practical purposes.

See: a3fdba2
A snip from the commit message:

As some backends can come and go (dynamic backends), we should maintain the global counters as is.

nigoroll

So I think this is OK, but I would like to ask @asadsa92 to make life easier for reviewers. Could you please

rebase the PR
sqash the to-be-squased commit
separate commits for moving code from changing code?
I would then like to have a final look, but I hope that will be a piece of cake then

nigoroll · 2025-10-01T13:11:54Z

bin/varnishd/cache/cache_backend.c

-		return (NULL);
+	if (bp->proxy_header != 0) {
+		err = VPX_Send_Proxy(*fdp, bp->proxy_header, bo->sp);
+		if (err < 0) {


moving the block a level down is worse c-style.
We should avoid nesting if possible and there is no performance to gain

I agree, I have reverted the nesting, see 725e716

include/tbl/sess_close.h

nigoroll · 2025-10-01T13:22:53Z

bin/varnishd/cache/cache_backend.c

+ */
+
+static void
+vbe_close_acct(const struct pfd *pfd, stream_close_t reason)


please go back to the commit where you added vbe_close_acct and move it to this place rather than adding it in the wrong place and then moving it again this additional commit

Thanks, done.

nigoroll · 2025-10-01T13:23:37Z

bin/varnishd/cache/cache_backend.c

+	WRONG("Wrong event in vbe_close_acct");
+}
+
+static void v_matchproto_(vdi_finish_f)


please add a commit before where you move vbe_dir_finish as-is without modification such that we can see clearly in this commit what the changes are.

Good idea, I found that I had misplaced the opening curly for the function. See 268b182

nigoroll · 2025-10-01T13:24:28Z

bin/varnishd/cache/cache_backend.c


 static void
-vbe_close_acct(const struct pfd *pfd, stream_close_t reason)
+vbe_close_acct(const struct pfd *pfd, struct backend *bp, stream_close_t reason)


this will likely trigger a "could be const" flexelint notice.
We should pass the vsc pointer, not bp

this will likely trigger a "could be const" flexelint notice.

This was no longer relevant because I ended up passing the vsc pointer as you suggested.
But, I made stream_close_t reason const: 836454a

We should pass the vsc pointer, not bp

Agree, see 8a872b0

asadsa92 · 2025-10-06T11:30:33Z

So I think this is OK, but I would like to ask @asadsa92 to make life easier for reviewers. Could you please

Thanks for your 2nd review, we are getting there 😄
Yes, I will come back to you with a polished PR.

For backends not all stream_close_t values make sense. Therefore, do not allow these to be specified as `doclose` attribute for failed backend transaction. No change is needed for existing errors. We should ideally split them up, but this is not possible as some functions are used by both sessions and fetches. One example of such function: - stream_close_t http_DoConnection(struct http *hp, stream_close_t sc_close); This patch is best viewed with --ignore-all-space Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

Abstract vbe_close_acct, this is supposed to mimic the ses_close_acct for backends. We have up until now not exposed `htc->doclose` (backend errors) to any counters as we do for `req->doclose` (session errors). That is unfortunate as these errors code are useful when trubleshooting fetches, so expose them as `bc_*` counters which is a subset of the `sc_*` counters. Also add counters for backend closes: `backend_closed` and `backend_closed_err`. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

HTC_S_EOF is returned when the backend close the connection on us. Since RxStuff() no longer reports `HTC_S_CLOSE` we can reuse it for EOF. To be consistent with sessions (client disconnect) use the `SC_REM_CLOSE`. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

This new debug bit enable us to test proxy send error for VRT backends. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

No modifications are done to the vbe_dir_finish function. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

There is no need to special case VPX_Send_Proxy with its own error handling. We can instead reuse vbe_dir_finish. We will now report BackendClosed tag in VSL as well (from vbe_dir_finish). To accomplish this we need to initialize the HTC struct a little bit earlier. Finally, we change the doclose reason to be an error transmission (SC_TX_ERROR). Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

Change vbe_close_acct to update counters for the specific backend as well as the existing global counters. This addition allows a user to easily pinpoint the problematic backend. As some backends can come and go (dynamic backends), we should maintain the global counters as is. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

asadsa92 · 2025-10-07T13:06:29Z

@nigoroll Thanks once again for your review, this should be ready now.
To view the full diff compared to the previous version:

$ git range-diff 8a219c644..a3fdba2c9 720b939c2..

asadsa92 self-assigned this May 22, 2025

nigoroll requested changes May 22, 2025

View reviewed changes

nigoroll self-requested a review June 30, 2025 13:31

nigoroll requested changes Oct 1, 2025

View reviewed changes

asadsa92 added 7 commits October 7, 2025 11:21

varnishd: introduce debug bit for PROXY_ERROR

cc612f0

This new debug bit enable us to test proxy send error for VRT backends. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

backend: move vbe_dir_finish for what comes next

268b182

No modifications are done to the vbe_dir_finish function. Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>

asadsa92 force-pushed the vbe_close_acct branch from a3fdba2 to 8a872b0 Compare October 7, 2025 12:52

asadsa92 requested a review from nigoroll October 7, 2025 15:10

backend: add VSCs for backend errors #4337

Are you sure you want to change the base?

backend: add VSCs for backend errors #4337

Conversation

asadsa92 commented May 22, 2025

Uh oh!

nigoroll left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

asadsa92 May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nigoroll May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

asadsa92 commented May 22, 2025

Uh oh!

nigoroll commented May 22, 2025

Uh oh!

asadsa92 commented Jun 10, 2025

Uh oh!

nigoroll left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

asadsa92 commented Oct 6, 2025

Uh oh!

asadsa92 commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

asadsa92 May 22, 2025 •

edited

Loading

nigoroll May 22, 2025 •

edited

Loading

asadsa92 commented Oct 7, 2025 •

edited

Loading