
Conversation

@asadsa92 (Contributor)

Abstract vbe_close_acct(); this is supposed to mimic ses_close_acct() for backends.

We have up until now not exposed htc->doclose (backend errors) through any counters, as we do for req->doclose (session errors).

That is unfortunate, as these error codes are useful when troubleshooting fetches, so expose them as bc_* counters, which are a subset of the sc_* counters.

Also add counters for backend closes: backend_closed and backend_closed_err.

@asadsa92 asadsa92 self-assigned this May 22, 2025
@nigoroll (Member) left a comment:

Besides specific comments, I am not sure if adding global counters makes much sense. If users see these increase, won't they immediately ask "yeah, but which backend?" So unless there is some specific background which might need explaining, I would suggest skipping one iteration and going straight to per-backend counters.

SESS_CLOSE(VCL_FAILURE, vcl_failure, 1, "VCL failure")
SESS_CLOSE(RAPID_RESET, rapid_reset, 1, "HTTP2 rapid reset")
SESS_CLOSE(BANKRUPT, bankrupt, 1, "HTTP2 credit bankruptcy")
#ifndef SESS_CLOSE_BACKEND
@nigoroll (Member):

I am not sure if this is the best option.

Should we maybe rather have SESS_CLOSE_C() for the client-only closure reasons?

@asadsa92 (Contributor Author):

Yes, I think that would be a little bit nicer.
It would mimic something we already do in include/tbl/obj_attr.h.
Let me have a look.
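
For illustration, a minimal sketch of how a client-only macro could look in include/tbl/sess_close.h. It assumes the same argument order as the existing SESS_CLOSE() entries; the SESS_CLOSE_C name and the fallback definition are illustrative, not the actual patch.

/* Sketch only: client-only close reasons get their own macro, and
 * consumers that do not care about the distinction map it back to
 * SESS_CLOSE(). */
#ifndef SESS_CLOSE_C
#  define SESS_CLOSE_C(U, l, err, desc) SESS_CLOSE(U, l, err, desc)
#endif

SESS_CLOSE(  VCL_FAILURE, vcl_failure, 1, "VCL failure")
SESS_CLOSE_C(RAPID_RESET, rapid_reset, 1, "HTTP2 rapid reset")
SESS_CLOSE_C(BANKRUPT,    bankrupt,    1, "HTTP2 credit bankruptcy")

#undef SESS_CLOSE_C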

@asadsa92 (Contributor Author):

See: 07f58bb

*/

static void
vbe_close_acct(const struct pfd *pfd, stream_close_t reason)
@nigoroll (Member):

Is this patch only half-baked? This commit only adds one call to vbe_close_acct, and for this call, incrementing bc_tx_proxy looks wrong, because reason is not SC_NULL. I also wonder why bc_tx_proxy should need special casing: why do we suddenly care about the call site, when otherwise the reason is just that?

@asadsa92 (Contributor Author):

I made vbe_close_acct into a function to mimic ses_close_acct. Its being called from a single call site was intended.

reason can be SC_NULL because this if statement is an OR, see https://github.com/varnishcache/varnish-cache/pull/4337/files#diff-fe1fb2709016ca15daea3e926b7cec272966ee397c5bafe50d6967023dd9f4a7R401

If you remove this handling, the test tests/o00002.vtc runs into an assert:

***  v2    debug|Error: Child (1249377) Panic at: Thu, 22 May 2025 09:20:05 GMT 
  14 ***  v2    debug|Wrong turn at cache/cache_backend.c:381:                       
  15 ***  v2    debug|Wrong event in vbe_close_acct                                  
  16 ***  v2    debug|version = varnish-trunk revision f15259b053765f675a10b40e0e6d91a0847ac065, vrt api = 21.0
  17 ***  v2    debug|ident = Linux,6.1.0-34-amd64,x86_64,-jnone,-sdefault,-sdefault,-hcritbit,epoll
  18 ***  v2    debug|now = 392362.874551 (mono), 1747905603.630655 (real)           
  19 ***  v2    debug|Backtrace:                                                     
  20 ***  v2    debug|  ip=0x564002279265 sp=0x7f60b036a390 <VBT_format+0x35>        
  21 ***  v2    debug|  ip=0x5640021acdb3 sp=0x7f60b036a3b0 <pan_backtrace+0x33>     
  22 ***  v2    debug|  ip=0x5640021acaf7 sp=0x7f60b036a3d0 <pan_ic+0x377>           
  23 ***  v2    debug|  ip=0x5640022782db sp=0x7f60b036a550 <VAS_Fail+0x4b>          
  24 ***  v2    debug|  ip=0x56400216dd91 sp=0x7f60b036a590 <vbe_close_acct+0x3a1>   
  25 ***  v2    debug|  ip=0x56400216b5ab sp=0x7f60b036a5b0 <vbe_dir_finish+0x3cb>   
  26 ***  v2    debug|  ip=0x564002181419 sp=0x7f60b036a610 <VDI_Finish+0x249>       
  27 ***  v2    debug|  ip=0x56400218e318 sp=0x7f60b036a6c0 <vbf_cleanup+0x158>      
  28 ***  v2    debug|  ip=0x5640021932f2 sp=0x7f60b036a6e0 <vbf_stp_fetchend+0x1f2> 
  29 ***  v2    debug|  ip=0x56400218f364 sp=0x7f60b036a730 <vbf_fetch_thread+0x5f4> 
  30 ***  v2    debug|  ip=0x5640021edb47 sp=0x7f60b036a810 <Pool_Work_Thread+0x7c7> 
  31 ***  v2    debug|  ip=0x5640021ed243 sp=0x7f60b036a8a0 <WRK_Thread+0x333>       
  32 ***  v2    debug|  ip=0x5640021ececb sp=0x7f60b036b430 <pool_thread+0xcb>       
  33 ***  v2    debug|  ip=0x7f60b92a81f5 sp=0x7f60b036b460 <pthread_condattr_setpshared+0x515>
  34 ***  v2    debug|  ip=0x7f60b932889c sp=0x7f60b036b500 <__xmknodat+0x23c>       

GDB:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fdbc20a9f4f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fdbc205afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fdbc2045472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x0000558df4ab5d0c in pan_ic (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at cache/cache_panic.c:791
#5  0x0000558df4b812db in VAS_Fail (func=0x558df4badbe0 "vbe_close_acct", file=0x558df4bad312 "cache/cache_backend.c", line=381, cond=0x558df4badbd1 "Wrong event in vbe_close_acct", kind=VAS_WRONG) at vas.c:107
#6  0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381
#7  0x0000558df4a745ab in vbe_dir_finish (ctx=0x7fdbbf5a6610, d=0x7fdbc181d7b0) at cache/cache_backend.c:403
#8  0x0000558df4a8a419 in VDI_Finish (bo=0x7fdbb667d460) at cache/cache_director.c:195
#9  0x0000558df4a97318 in vbf_cleanup (bo=0x7fdbb667d460) at cache/cache_fetch.c:134
#10 0x0000558df4a9c2f2 in vbf_stp_fetchend (wrk=0x7fdbbf5a7340, bo=0x7fdbb667d460) at cache/cache_fetch.c:748
#11 0x0000558df4a98364 in vbf_fetch_thread (wrk=0x7fdbbf5a7340, priv=0x7fdbb667d460) at cache/cache_fetch.c:1123
#12 0x0000558df4af6b47 in Pool_Work_Thread (pp=0x7fdbbf5fd000, wrk=0x7fdbbf5a7340) at cache/cache_wrk.c:496
#13 0x0000558df4af6243 in WRK_Thread (qp=0x7fdbbf5fd000, stacksize=81920, thread_workspace=2048) at cache/cache_wrk.c:157
#14 0x0000558df4af5ecb in pool_thread (priv=0x7fdbbe0000e0) at cache/cache_wrk.c:529
#15 0x00007fdbc20a81f5 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007fdbc212889c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) frame 6
#6  0x0000558df4a76d91 in vbe_close_acct (pfd=0x7fdbb4a00000, reason=0x558df4c25600 <SC_NULL>) at cache/cache_backend.c:381
381		WRONG("Wrong event in vbe_close_acct");
(gdb) p reason
$1 = (stream_close_t) 0x558df4c25600 <SC_NULL>
(gdb)

As you can see, reason is SC_NULL.

The reason SC_NULL requires special handling is that this value is not covered by the table (include/tbl/sess_close.h).
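
For context, a condensed sketch of the shape under discussion, assuming a table-driven switch like ses_close_acct() and the bc_* counter names from this PR; the early SC_NULL branch is the special case in question.

/* Condensed sketch, not the actual diff. */
static void
vbe_close_acct(const struct pfd *pfd, stream_close_t reason)
{
	CHECK_OBJ_NOTNULL(reason, STREAM_CLOSE_MAGIC);
	(void)pfd;

	if (reason == SC_NULL) {
		/* Not covered by tbl/sess_close.h: the connection is
		 * only torn down because it was opened with a PROXY
		 * header and must not be recycled. */
		VSC_C_main->bc_tx_proxy++;
		return;
	}

	switch (reason->idx) {
#define SESS_CLOSE(U, l, err, desc)		\
	case SCE_ ## U:				\
		VSC_C_main->bc_ ## l++;		\
		break;
#include "tbl/sess_close.h"
	default:
		WRONG("Wrong event in vbe_close_acct");
	}
}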

@nigoroll (Member):

Sorry, yes, you are right, I overlooked || bp->proxy_header != 0.

So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.

So I can now follow the argument that this is in fact a special close reason, but I would still think it should get its own entry in sess_close.h, be adjusted in vbe_dir_finish(), and be handled like any other in vbe_close_acct().

So why is vbe_close_acct() not called from vbe_dir_getfd()?

@asadsa92 (Contributor Author), May 22, 2025:

> Sorry, yes, you are right, I overlooked || bp->proxy_header != 0.
>
> So what is happening here is that we do not want to recycle a connection if it was opened with a proxy header because otherwise a request could be sent "under" a wrong proxy header.

Correct; no worries, it caught me off guard as well when I first produced the patch.

> So I can now follow the argument that this is in fact a special close reason, but I would still think it should get its own entry in sess_close.h, be adjusted in vbe_dir_finish(), and be handled like any other in vbe_close_acct().

I agree that we should not let reason be SC_NULL here; I wanted to fix it but also did not want to introduce too many changes in this PR. I can include a commit that changes this behavior if you like.

> So why is vbe_close_acct() not called from vbe_dir_getfd()?

So, whenever a pfd is returned by vbe_dir_getfd(), the caller is expected to call vbe_dir_finish() once it is done using it.

For all the other cases, where we return NULL (and not a pfd), we do not set a valid htc->doclose attribute.
In those cases, vbe_dir_getfd() either does not hold a pfd yet, or it got one but closes it before returning NULL.
The NULL return value of vbe_dir_getfd() then ensures that we fail the fetch (or do not start the pipe).

I could be convinced to change how vbe_dir_getfd() operates, but I guess others have an opinion on this.
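
In other words, the contract is roughly this (caller-side sketch; argument lists deliberately elided, not the real prototypes):

/* Caller-side sketch of the contract described above. */
pfd = vbe_dir_getfd(/* ... */);
if (pfd == NULL) {
	/* vbe_dir_getfd() never handed out a connection, or already
	 * closed the one it had; the fetch/pipe fails here and
	 * vbe_dir_finish() is not involved for this attempt. */
	return (-1);
}
/* ... run the fetch or pipe over the connection ... */
vbe_dir_finish(/* ... */);	/* accounting + close-or-recycle decision */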

@nigoroll (Member), May 22, 2025:

> I wanted to fix it but also did not want to introduce too many changes in this PR.

We have PRs with a lot of changes, so this is more about structure. For example, I think you could introduce the new SC to the old code, then add vbe_close_acct(), and it would all fall into place.

> I could be convinced to change how vbe_dir_getfd()

I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.

@asadsa92 (Contributor Author):

Great, I will make an attempt at this.

@asadsa92 (Contributor Author):

> I wanted to fix it but also did not want to introduce too many changes in this PR.
>
> We have PRs with a lot of changes, so this is more about structure. For example, I think you could introduce the new SC to the old code, then add vbe_close_acct(), and it would all fall into place.

So, I had an attempt at this, and I am starting to understand why things are laid out the way they are today.
We cannot use an SC for PROXY for a couple of reasons:

  • First, SC_NULL is used as a sink in multiple places. If bo->htc->doclose is SC_NULL, we know that the fetch has not failed yet. Changing this would mean that we always have to take PROXY into account. No big deal, but are we sure we want to go this route?
  • Second, we cannot latch another reason on top of bo->htc->doclose = SC_TX_PROXY;. For example, how do we report that a fetch failed due to a timeout (SC_RX_TIMEOUT) when we are dealing with a PROXY transaction? (See the sketch below.)
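
A sketch of the conflict described in the second point above (illustrative only):

/* Illustrative only: two close reasons competing for one field. */
bo->htc->doclose = SC_TX_PROXY;		/* "never recycle: PROXY header" */
/* ... later the fetch itself fails ... */
bo->htc->doclose = SC_RX_TIMEOUT;	/* the PROXY marker is now lost */
/* Keeping SC_TX_PROXY instead would lose the real fetch error, so one
 * of the two pieces of information is always dropped.  That is why the
 * SC_NULL + proxy_header combination is treated as its own case. */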

@asadsa92 (Contributor Author):

> I could be convinced to change how vbe_dir_getfd()
>
> I do not know what else you might have in mind, but I do think that we should account for all the cases where we close connections, including the "proxy write failure" in vbe_dir_getfd(). I have not thought about the details, but I do not think we can leave out a case.

See: 59c15f0 and 02ae608

VSLb(bo->vsl, SLT_FetchError, "Received junk");
htc->doclose = SC_RX_JUNK;
break;
case HTC_S_CLOSE:
@nigoroll (Member):

this particular change lgtm

@asadsa92 (Contributor Author)

> Besides specific comments, I am not sure if adding global counters makes much sense. If users see these increase, won't they immediately ask "yeah, but which backend?" So unless there is some specific background which might need explaining, I would suggest skipping one iteration and going straight to per-backend counters.

Good point, this was actually mentioned to me. I forgot to take that into account, but I think it makes sense.
However, I think we should keep the global counters, with the per-backend counters serving as a further breakdown for interested users.
The global counters are a quick way to tell whether looking into individual backends makes sense.

@nigoroll (Member)

> The global counters are a quick way to tell whether looking into individual backends makes sense.

I am not entirely convinced that the double accounting pays off, but I can see that the argument has some relevance for practical purposes.

@asadsa92 (Contributor Author)

> The global counters are a quick way to tell whether looking into individual backends makes sense.
>
> I am not entirely convinced that the double accounting pays off, but I can see that the argument has some relevance for practical purposes.

See: a3fdba2
A snip from the commit message:

> As some backends can come and go (dynamic backends), we should maintain the global counters as is.

@nigoroll nigoroll self-requested a review June 30, 2025 13:31
@nigoroll (Member) left a comment:

So I think this is OK, but I would like to ask @asadsa92 to make life easier for reviewers. Could you please:

  • rebase the PR
  • squash the to-be-squashed commit
  • separate the commits that move code from the ones that change it?

I would then like to have a final look, but I hope that will be a piece of cake by then.

return (NULL);
if (bp->proxy_header != 0) {
err = VPX_Send_Proxy(*fdp, bp->proxy_header, bo->sp);
if (err < 0) {
@nigoroll (Member):

Moving the block a level down is worse C style.
We should avoid nesting where possible, and there is no performance to be gained here.

@asadsa92 (Contributor Author):

I agree; I have reverted the nesting, see 725e716

*/

static void
vbe_close_acct(const struct pfd *pfd, stream_close_t reason)
@nigoroll (Member):

Please go back to the commit where you added vbe_close_acct and move it to this place there, rather than adding it in the wrong place and then moving it again in this additional commit.

@asadsa92 (Contributor Author):

Thanks, done.

WRONG("Wrong event in vbe_close_acct");
}

static void v_matchproto_(vdi_finish_f)
@nigoroll (Member):

Please add a commit before this one that moves vbe_dir_finish as-is, without modification, so that we can clearly see in this commit what the changes are.

@asadsa92 (Contributor Author):

Good idea; I found that I had misplaced the opening curly brace of the function. See 268b182


static void
-vbe_close_acct(const struct pfd *pfd, stream_close_t reason)
+vbe_close_acct(const struct pfd *pfd, struct backend *bp, stream_close_t reason)
@nigoroll (Member):

This will likely trigger a "could be const" Flexelint notice.
We should pass the vsc pointer, not bp.

@asadsa92 (Contributor Author):

> This will likely trigger a "could be const" Flexelint notice.

This is no longer relevant because I ended up passing the vsc pointer, as you suggested.
But I did make stream_close_t reason const: 836454a

> We should pass the vsc pointer, not bp.

Agreed, see 8a872b0
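
A sketch of the agreed direction, passing the per-backend counter struct instead of the whole backend; the struct VSC_vbe type name and the bc_* fields are assumptions here, not taken from the final commit.

/* Sketch: account on the per-backend vsc handed in by the caller. */
static void
vbe_close_acct(const struct pfd *pfd, struct VSC_vbe *vsc,
    const stream_close_t reason)
{
	CHECK_OBJ_NOTNULL(reason, STREAM_CLOSE_MAGIC);
	AN(vsc);
	(void)pfd;

	/* ... bump vsc->bc_* alongside the global VSC_C_main->bc_* ... */
}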

@asadsa92 (Contributor Author) commented Oct 6, 2025:

> So I think this is OK, but I would like to ask @asadsa92 to make life easier for reviewers. Could you please:

Thanks for your second review; we are getting there 😄
Yes, I will come back to you with a polished PR.

For backends, not all stream_close_t values make sense.
Therefore, do not allow these to be specified as the `doclose` attribute
for failed backend transactions. No change is needed for existing errors.

We should ideally split them up, but this is not possible, as some
functions are used by both sessions and fetches.

One example of such a function:
- stream_close_t http_DoConnection(struct http *hp, stream_close_t sc_close);

This patch is best viewed with --ignore-all-space

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
Abstract vbe_close_acct; this is supposed to mimic ses_close_acct for
backends.

We have up until now not exposed `htc->doclose` (backend errors) through
any counters, as we do for `req->doclose` (session errors).

That is unfortunate, as these error codes are useful when troubleshooting
fetches, so expose them as `bc_*` counters, which are a subset of the
`sc_*` counters.

Also add counters for backend closes: `backend_closed` and `backend_closed_err`.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
HTC_S_EOF is returned when the backend closes the connection on us.
Since RxStuff() no longer reports `HTC_S_CLOSE`, we can reuse it for EOF.
To be consistent with sessions (client disconnect), use `SC_REM_CLOSE`.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
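
A sketch of what that mapping might look like, mirroring the HTC_S_* handling quoted earlier in this thread; the exact FetchError text is illustrative.

	case HTC_S_EOF:
		VSLb(bo->vsl, SLT_FetchError, "Backend closed connection");
		htc->doclose = SC_REM_CLOSE;
		break;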
This new debug bit enables us to test proxy send errors for VRT backends.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
No modifications are made to the vbe_dir_finish function.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
There is no need to special-case VPX_Send_Proxy with its own error
handling. We can instead reuse vbe_dir_finish. We will now report the
BackendClosed tag in VSL as well (from vbe_dir_finish). To accomplish
this, we need to initialize the HTC struct a little bit earlier. Finally,
we change the doclose reason to an error transmission (SC_TX_ERROR).

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
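
A simplified sketch of the described error path in vbe_dir_getfd(); control flow only, with local names simplified, not the exact diff.

/* Sketch: let vbe_dir_finish() do the accounting and VSL logging. */
if (bp->proxy_header != 0 &&
    VPX_Send_Proxy(*fdp, bp->proxy_header, bo->sp) < 0) {
	bo->htc->doclose = SC_TX_ERROR;	/* error transmission */
	vbe_dir_finish(ctx, dir);	/* emits BackendClosed, accounts */
	return (NULL);
}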
Change vbe_close_acct to update counters for the specific backend
as well as the existing global counters. This addition allows a user to
easily pinpoint the problematic backend. As some backends can come and
go (dynamic backends), we should maintain the global counters as is.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
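
A sketch of the double accounting described above (counter names per this PR; the per-backend vsc field layout and the reason->is_err usage are assumptions):

/* Sketch: bump the per-backend counters next to the global ones. */
VSC_C_main->backend_closed++;
vsc->backend_closed++;
if (reason->is_err) {
	VSC_C_main->backend_closed_err++;
	vsc->backend_closed_err++;
}

With both in place, a user can watch the global backend_closed_err for a quick signal and then drill into the per-backend bc_* counters to pinpoint the backend.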
@asadsa92 (Contributor Author) commented Oct 7, 2025:

@nigoroll Thanks once again for your review; this should be ready now.
To view the full diff compared to the previous version:

$ git range-diff 8a219c644..a3fdba2c9 720b939c2..

@asadsa92 asadsa92 requested a review from nigoroll October 7, 2025 15:10