
Conversation

@nigoroll
Member

TL;DR: This patch series solves scalability issues of the single threaded object expiry by offloading object core de-referencing to worker threads.

The test scenario

To test the SLASH/ storage engines, I regularly run "burn in" tests on two reasonably big machines (500 GB RAM, 48 cores, TBs of NVMe). The most important test constantly stresses the most problematic cases: objects are either very short-lived or very long-lived, have random size and random ttl, there are basically only cache misses, and "way too many" passes, bans and purges. It has evolved quite a bit and has been an invaluable tool for finding all kinds of issues, mostly races, before users do.

The issue

The test has exposed an issue where the number of objects goes through the roof as soon as LRU nuking starts, ultimately leading to an out-of-memory situation (there have also been other relevant issues, most prominently one memory leak, but the issue addressed here was nevertheless real and root-caused in Varnish-Cache; see here for more details).

The problem is illustrated by this graph of VSCs gathered during a test run (NB: the y axis has log scale!):

[graph: exp_nuke]

Plotted here are various VSCs, either directly or as a derived rate rps(x) (requests per second). FELLOW.fellow.g_mem_obj is a counter which gets atomically updated by the storage engine, giving a reliable number of objects actually alive in the memory cache. The other counters are all Varnish-Cache standard.

We would expect MAIN.n_object to more or less match FELLOW.fellow.g_mem_obj, but as we can see, n_object goes from roughly 2^20 (~1M) to roughly 2^24 (~16M) with no indication of decreasing. It only stopped growing when the process died.

We can also see that the n_expired rate is pretty stable at first (and, due to the random but uniform nature of the test, we can assume that the number of objects expiring actually is stable in relation to the number of objects added), but some time after nuking kicks in (black n_lru_nuked graph), not only does the number of objects go through the roof, the n_expired rate also almost comes to a complete stop (again, this is log scale; 2^5 is, as we all know, 32).

A suspicion

The immediate suspicion was that this issue could be related to the fact that the EXP thread is single threaded, but in retrospect, I would like to first give some ...

Background on the EXP thread

The exp thread uses a binheap (a very interesting background read for those who do not know it) to keep a logical list of object core references ordered by expiry time. Consequently, all cacheable objects need to go through the EXP thread by design: when they enter the cache, EXP needs to register them on the binheap, where they stay until they either get removed externally or expire. So the expire thread is a natural bottleneck. While we could support multiple expire threads, it is not the actual maintenance of the binheap which is expensive, but most of all object deletion as a consequence of expiry or removal.
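For readers unfamiliar with the concept, here is a minimal, self-contained sketch (plain C with made-up names, deliberately not Varnish's binheap implementation) of a heap keyed by expiry time. It only illustrates the two operations the EXP thread fundamentally depends on: registering an object when it enters the cache and popping whatever is due.

```c
/* Toy min-heap keyed by expiry time. Hypothetical names, not Varnish code. */
#include <stdio.h>
#include <stdlib.h>

struct entry {
	double	when;		/* absolute expiry time */
	int	obj_id;		/* stand-in for an objcore pointer */
};

struct heap {
	struct entry	*e;
	size_t		n, cap;
};

static void
swap(struct entry *a, struct entry *b)
{
	struct entry t = *a; *a = *b; *b = t;
}

static void
heap_insert(struct heap *h, double when, int obj_id)
{
	size_t i;

	if (h->n == h->cap) {
		h->cap = h->cap ? h->cap * 2 : 16;
		h->e = realloc(h->e, h->cap * sizeof *h->e);
		if (h->e == NULL)
			abort();
	}
	i = h->n++;
	h->e[i].when = when;
	h->e[i].obj_id = obj_id;
	while (i > 0 && h->e[(i - 1) / 2].when > h->e[i].when) {	/* sift up */
		swap(&h->e[i], &h->e[(i - 1) / 2]);
		i = (i - 1) / 2;
	}
}

static int
heap_pop_due(struct heap *h, double now, struct entry *out)
{
	size_t i, c;

	if (h->n == 0 || h->e[0].when > now)
		return (0);			/* nothing due yet */
	*out = h->e[0];
	h->e[0] = h->e[--h->n];
	i = 0;
	for (;;) {				/* sift down */
		c = 2 * i + 1;
		if (c >= h->n)
			break;
		if (c + 1 < h->n && h->e[c + 1].when < h->e[c].when)
			c++;
		if (h->e[i].when <= h->e[c].when)
			break;
		swap(&h->e[i], &h->e[c]);
		i = c;
	}
	return (1);
}

int
main(void)
{
	struct heap h = {0};
	struct entry e;

	heap_insert(&h, 30.0, 1);
	heap_insert(&h, 10.0, 2);
	heap_insert(&h, 20.0, 3);
	while (heap_pop_due(&h, 25.0, &e))	/* "now" = 25: objects 2 and 3 are due */
		printf("expire obj %d (due %.0f)\n", e.obj_id, e.when);
	free(h.e);
	return (0);
}
```

The point of the changes discussed below is not these heap operations themselves, which are cheap, but the dereferencing work triggered once an entry comes off the heap or off the inbox.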

This PR

Herein I suggest a number of changes to make a single EXP thread scale up significantly:

Restructure the code to batch operations on the inbox and expired objects

Rather than handling one inbox / expired object at a time, we collect them in an array and in lists, and then work them outside the expiry lock (see the toy example after this list).

Make sure that both removal and expiry are always handled

Before, removal had precedence over expiry, which could lead to the object pileup described above.

Offload de-referencing of object cores

We offload the actual de-referencing (which, most of the time, implies object deletion) to worker threads, which implicitly provides backpressure as worker threads and other system resources are used up.
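To make the batching point concrete, here is a toy, stand-alone comparison (hypothetical code, not taken from this PR) of the two locking patterns: one lock/unlock cycle per inbox item versus draining up to a whole batch under a single lock hold and working the items outside the lock.

```c
/* Toy illustration of batched inbox draining. Compile with: cc -pthread */
#include <pthread.h>
#include <stdio.h>

#define NITEMS	10000
#define BATCH	1024

struct item {
	int		payload;
	struct item	*next;
};

static struct item items[NITEMS];
static struct item *inbox;
static pthread_mutex_t inbox_mtx = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_cycles;

static void
inbox_fill(void)
{
	int i;

	inbox = NULL;
	for (i = 0; i < NITEMS; i++) {
		items[i].payload = i;
		items[i].next = inbox;
		inbox = &items[i];
	}
}

/* one lock/unlock cycle per item: the old pattern */
static unsigned long
drain_one_at_a_time(void)
{
	struct item *it;
	unsigned long sum = 0;

	for (;;) {
		pthread_mutex_lock(&inbox_mtx);
		lock_cycles++;
		it = inbox;
		if (it != NULL)
			inbox = it->next;
		pthread_mutex_unlock(&inbox_mtx);
		if (it == NULL)
			break;
		sum += it->payload;	/* "work" done outside the lock */
	}
	return (sum);
}

/* up to BATCH items per lock hold: the new pattern */
static unsigned long
drain_batched(void)
{
	struct item *batch[BATCH];
	unsigned long sum = 0;
	unsigned n, i;

	do {
		pthread_mutex_lock(&inbox_mtx);
		lock_cycles++;
		for (n = 0; n < BATCH && inbox != NULL; n++) {
			batch[n] = inbox;
			inbox = inbox->next;
		}
		pthread_mutex_unlock(&inbox_mtx);
		for (i = 0; i < n; i++)
			sum += batch[i]->payload;	/* work outside the lock */
	} while (n == BATCH);
	return (sum);
}

int
main(void)
{
	unsigned long sum;

	inbox_fill();
	lock_cycles = 0;
	sum = drain_one_at_a_time();
	printf("one at a time: sum=%lu, lock cycles=%lu\n", sum, lock_cycles);

	inbox_fill();
	lock_cycles = 0;
	sum = drain_batched();
	printf("batched:       sum=%lu, lock cycles=%lu\n", sum, lock_cycles);
	return (0);
}
```

With 10000 items and a batch size of 1024, the second variant takes the lock 10 times instead of 10001; the PR applies the same idea to the expiry inbox and to the expired objects coming off the binheap.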

All details are documented in the individual commit messages.

Potential follow-up: hcb_cleaner has the same problem in principle, but I am not yet sure how relevant it is; I will continue testing.

@bsdphk
Contributor

bsdphk commented Jun 18, 2025

My first thought, and this is not a replacement for this PR, is that it was a mistake to have a central EXP, and that it really should be STV methods.

I think all the places which call EXP() also know the stevedore, so that shouldn't be too much havoc, and it would give STVs the option of implementing EXP integrated with their storage layout policy.

@nigoroll
Member Author

I agree with the previous comment, multi-EXP would be helpful and would allow for specific optimizations. I see this as largely orthogonal to this PR, as we still need an EXP implementation for SML, to which this PR implements improvements, and also other stevedore implementations will likely start by copying the default implementation.

@nigoroll nigoroll added the a=RunByUPLEX (This PR is being run by UPLEX in production) label on Jul 1, 2025
@nigoroll nigoroll force-pushed the exp_inbox_batch branch 2 times, most recently from 505cfd3 to d9691f5 on July 2, 2025 08:56
Before this change, we would strictly work the expire thread's inbox one item at
a time: lock, take it off the list, update the exp flags, unlock, work it. This
requires one lock/unlock cycle per object core in the exp inbox.

We now create a batch array of items: Under the lock, we do all the work as
before, but remember up to 1024 objcore pointers and their original exp flags in
an array, which we then work outside the lock, increasing overall efficiency
through less lock contention and better cache locality.

Why an array of items, why not repurpose the inbox (exp_list)? In fact, we will do that in another commit, but at this point, the list is not the right data structure: the inbox semantics require updating the exp_flags (atomically, under a lock) to signify an oc being removed or the commands being processed. Yet to process the oc later, we need the original flags. So we use an array of pointers to the oc together with the original flags. (One alternative would be a list head for every possible flag combination, but that is probably overly complex.)

This changes the order of events slightly: exp_mail_it() inserts OC_EF_REMOVE jobs at the head of the list, because they will immediately free resources and are thus more important than additions to the expire binheap; also, mailed object cores have a chance of getting removed before even being added while waiting in the inbox.

With batching, we no longer follow this priority strictly: By taking the first
(up to) 1024 items from the inbox before re-checking it, we might process inbox
commands which we would not have before. This is not (yet) a change in behavior
for the case when the expire thread is under pressure, because the inbox will
eventually contain more than 1024 OC_EF_REMOVE items at the start, and hence the
expire thread will only work these.

The choice of array size 1024 is arbitrary: It is deemed large enough for
batching to have an effect, yet small enough to matter wrt available stack space
(on 64bit, the array of this size needs 16KB).
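As a rough sketch of the batch item described above (hypothetical names and schematic flag handling, not the actual Varnish structures), each entry pairs the objcore pointer with the exp_flags as they were when the item was taken off the inbox, because the flags on the objcore itself are rewritten under the lock:

```c
/* Sketch of a batch item: pointer plus the original exp_flags. Not Varnish code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define OC_EF_INSERT	(1U << 0)
#define OC_EF_MOVE	(1U << 1)
#define OC_EF_REMOVE	(1U << 2)

struct objcore_sketch {
	uint8_t	exp_flags;
	double	ttl;
};

struct exp_batch_item {
	struct objcore_sketch	*oc;
	uint8_t			flags;	/* exp_flags as seen under the lock */
};

#define EXP_BATCH	1024	/* roughly 16 KB of stack on 64 bit */

/* Under the lock: remember the original flags, then mark the oc as processed. */
static void
take_off_inbox(struct exp_batch_item *item, struct objcore_sketch *oc)
{
	item->oc = oc;
	item->flags = oc->exp_flags;
	if (oc->exp_flags & OC_EF_REMOVE)
		oc->exp_flags = 0;	/* removal: signals mailing code to keep out */
	else
		oc->exp_flags &= ~(OC_EF_INSERT | OC_EF_MOVE);	/* commands taken over */
}

int
main(void)
{
	struct objcore_sketch oc_new = { OC_EF_INSERT, 0. };
	struct objcore_sketch oc_gone = { OC_EF_REMOVE, 0. };
	struct exp_batch_item batch[EXP_BATCH];

	take_off_inbox(&batch[0], &oc_new);
	take_off_inbox(&batch[1], &oc_gone);

	assert(batch[0].flags & OC_EF_INSERT);	/* original command preserved */
	assert(batch[1].flags & OC_EF_REMOVE);
	assert(oc_gone.exp_flags == 0);		/* oc marked: do not mail again */
	printf("sizeof(struct exp_batch_item) = %zu bytes\n",
	    sizeof(struct exp_batch_item));
	return (0);
}
```

With an 8-byte pointer and the flags padded to the next alignment boundary, each entry comes out at 16 bytes on a typical 64-bit platform, which matches the 16KB figure for 1024 entries given above.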

Also, we now unconditionally call exp_expire(), because either we have processed
the inbox and need a new tnext for potentially blocking in the next iteration,
or we did not, waited for the next due time and do need to process the binheap
anyway.
This is in preparation of offloading exp_remove_oc():

The inbox list (exp_list linkage in struct oc) is free to use for OC_EF_REMOVE commands once we have processed the inbox, because the respective object core will be removed from the cache, and the exp_flags set to zero signal to exp_mail_it() to keep out.

So we now use the list to collect ocs which need to be dereferenced. Object cores with OC_EF_REMOVE and no OC_EF_INSERT set (that is, those on the binheap) will also continue to be collected on the batch array to have them deleted from the binheap.

We also add a bit of abstraction around the list head, which will be extended in
follow-up commits. For now, it serves for tracking the number of elements on the
list and asserting on it.

So once the batch is processed, we end up with a neat list of objects to
de-reference.
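The list-head abstraction can be pictured roughly like this (a stand-alone sketch with invented names; the real code uses Varnish's VTAILQ macros and struct objcore): a plain tail queue head plus an element count that the code can assert on when the collected batch is handed over.

```c
/* Sketch of a counted list head for objcores awaiting dereference. Not Varnish code. */
#include <assert.h>
#include <stdio.h>
#include <sys/queue.h>

struct oc_sketch {
	int			id;
	TAILQ_ENTRY(oc_sketch)	exp_list;	/* linkage, free once off the inbox */
};

struct deref_list {
	TAILQ_HEAD(, oc_sketch)	head;
	unsigned		n;		/* number of elements, for asserts */
};

static void
deref_list_init(struct deref_list *dl)
{
	TAILQ_INIT(&dl->head);
	dl->n = 0;
}

static void
deref_list_add(struct deref_list *dl, struct oc_sketch *oc)
{
	TAILQ_INSERT_TAIL(&dl->head, oc, exp_list);
	dl->n++;
}

static void
deref_list_work(struct deref_list *dl)
{
	struct oc_sketch *oc;
	unsigned done = 0;

	while ((oc = TAILQ_FIRST(&dl->head)) != NULL) {
		TAILQ_REMOVE(&dl->head, oc, exp_list);
		done++;
		printf("deref oc %d\n", oc->id);	/* stands in for the objcore deref */
	}
	assert(done == dl->n);
	dl->n = 0;
}

int
main(void)
{
	struct deref_list dl;
	struct oc_sketch ocs[3];
	int i;

	deref_list_init(&dl);
	for (i = 0; i < 3; i++) {
		ocs[i].id = i + 1;
		deref_list_add(&dl, &ocs[i]);
	}
	deref_list_work(&dl);
	return (0);
}
```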
Otherwise, for big batches (think: 100Ks of objects), the n_object counter can get out of sync at human scales.
Yes, I know this is against our cstyle, but this is very useful to avoid
confusion with item->flags.
Now that we have a collection of objcores in a list, we can easily hand it off
to another worker thread to run all the de-references. This step is the first
building block towards exp scalability beyond the capacity of a single thread.

We now hand the list of objects to be removed to a worker thread once an (arbitrary) threshold of 128 objects is reached. If the expire inbox keeps growing, we keep tasking more and more worker threads, which applies back pressure by allocating more system resources to expiry, and ultimately by no threads being available for entry of new objects into the cache.
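The offload step itself follows a familiar pattern; here is a self-contained approximation (hypothetical code: the real implementation hands the list to a Varnish worker thread from the pool rather than creating raw pthreads, and the objcore list is replaced by a trivial array, but the threshold-and-detach shape is the one described above).

```c
/* Toy threshold-based offload of deref batches. Compile with: cc -pthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define OFFLOAD_THRESHOLD 128

struct deref_job {
	int		*ids;	/* stand-in for the detached list of objcores */
	unsigned	n;
};

static void *
deref_worker(void *priv)
{
	struct deref_job *job = priv;
	unsigned long sum = 0;
	unsigned i;

	for (i = 0; i < job->n; i++)
		sum += (unsigned long)job->ids[i];	/* one "deref" per element */
	printf("worker dereferenced %u objects (checksum %lu)\n", job->n, sum);
	free(job->ids);
	free(job);
	return (NULL);
}

static struct deref_job *
new_job(void)
{
	struct deref_job *job = malloc(sizeof *job);

	if (job == NULL)
		abort();
	job->ids = malloc(OFFLOAD_THRESHOLD * sizeof *job->ids);
	if (job->ids == NULL)
		abort();
	job->n = 0;
	return (job);
}

int
main(void)
{
	pthread_t thr[8];
	struct deref_job *job = new_job();
	unsigned queued = 0, i;

	/* the expiry loop keeps collecting objects to dereference ... */
	for (i = 0; i < 300; i++) {
		job->ids[job->n++] = (int)i;
		if (job->n < OFFLOAD_THRESHOLD)
			continue;
		/* ... and hands off each full batch instead of working it itself */
		if (pthread_create(&thr[queued++], NULL, deref_worker, job) != 0)
			abort();
		job = new_job();
	}
	printf("offloaded %u batches, %u derefs left for the expiry thread\n",
	    queued, job->n);
	deref_worker(job);	/* leftovers worked inline, as before */
	for (i = 0; i < queued; i++)
		pthread_join(thr[i], NULL);
	return (0);
}
```

Because each handed-off batch occupies a worker, a continuously growing inbox translates into more and more busy workers, which is exactly the backpressure effect described above.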
This moves the HSH_kill() to after binheap removal, which makes no difference other than preparing for batching, where it will be run outside the critical section.
Similar to the changes for cache removal batching, we now gather all object
cores to be expired in another head of the deref struct in one go within a
single critical region (as before, exp_list is free when the respective object
is not in the inbox) and run the derefs in the existing exp_deref_work()
function, which optionally gets offloaded to a separate worker thread.

For efficiency, we also change the handling of expired object cores with OC_EF_POSTED: previously, they would just be ignored, only to be handled later by exp_inbox(). But now more than ever it does not make sense to keep them around, so we just remove them from the inbox and add them to our list to expire.

(diff best viewed ignoring whitespace (-b option))
Previously, when exp_expire() found an expired object that was also in the inbox, it would still count it as expired. We now avoid the double count by excluding objects with OC_EF_REMOVE set.

And we also reduce traffic on the global counter by using a local variable for
the loop.
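Schematically (hypothetical names, with a C11 atomic standing in for the shared VSC counter), the counter change looks like this:

```c
/* Accumulate locally, update the shared counter once per batch. Not Varnish code. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long n_expired;	/* stand-in for the global VSC counter */

static void
expire_batch(unsigned batch_size)
{
	unsigned long expired = 0;
	unsigned i;

	for (i = 0; i < batch_size; i++) {
		/* ... expire one object ... */
		expired++;			/* local: no shared-memory traffic */
	}
	atomic_fetch_add(&n_expired, expired);	/* one global update per batch */
}

int
main(void)
{
	expire_batch(1000);
	expire_batch(234);
	printf("n_expired = %lu\n", atomic_load(&n_expired));
	return (0);
}
```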