
Conversation

@nigoroll
Member

TL;DR: This patch series solves scalability issues of the single threaded object expiry by offloading object core de-referencing to worker threads.

The test scenario

To test the SLASH/ storage engines, I regularly run "burn in" tests on two reasonably big machines (500 GB RAM, 48 cores, TBs of NVMe). The most important test constantly stresses the most problematic cases: objects are either very short-lived or very long-lived, have random size and random ttl, there are basically only cache misses, and "way too many" passes, bans and purges. It has evolved quite a bit and has been an invaluable tool for finding all kinds of issues, mostly races, before users do.

The issue

The test has exposed an issue where the number of objects goes through the roof as soon as LRU nuking starts, ultimately leading to an out-of-memory situation (there have also been other relevant issues, most prominently one memory leak, but the issue addressed here was nevertheless real and root-caused in Varnish-Cache; see here for more details).

The problem is illustrated by this graph of VSCs gathered during a test run (NB: the y axis has log scale!):

[graph: exp_nuke]

Plotted here are various VSCs, either directly or as a derived rate rps(x) (requests per second). FELLOW.fellow.g_mem_obj is a counter which gets atomically updated by the storage engine, giving a reliable number of objects actually alive in the memory cache. The other counters are all Varnish-Cache standard.

We would expect MAIN.n_object to more or less match FELLOW.fellow.g_mem_obj, but as we can see, n_object goes from roughly 2^20 (~1M) to roughly 2^24 (~16M) with no indication of decreasing. It only stopped growing when the process died.

We can also see that the n_expired rate is pretty stable at first (and, due to the random but uniform nature of the test, we can assume that the number of objects expiring actually is stable in relation to the number of objects added), but some time after nuking kicks in (black n_lru_nuked graph), not only does the number of objects go through the roof, the n_expired rate also almost comes to a complete stop (again, this is log scale; 2^5 is, as we all know, 32).

A suspicion

The immediate suspicion was that this issue could be related to the fact that the EXP thread is single threaded, but in retrospect, I would like to first give some ...

Background on the EXP thread

The exp thread uses a binheap (a very interesting background read for those who do not know it) to keep a logical list of object core references ordered by expiry time. Consequently, all cacheable objects need to go through the EXP thread by design: when they enter the cache, EXP needs to register them on the binheap, where they stay until they either get removed externally or expire. So the expire thread is a natural bottleneck. While we could support multiple expire threads, it is not the actual maintenance of the binheap which is expensive, but most of all object deletion as a consequence of expiry or removal.
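For readers unfamiliar with the concept, here is a minimal, self-contained sketch (plain C with made-up names, deliberately not Varnish's binheap implementation) of a heap keyed by expiry time. It only illustrates the two operations the EXP thread fundamentally depends on: registering an object when it enters the cache and popping whatever is due.

```c
/* Toy min-heap keyed by expiry time. Hypothetical names, not Varnish code. */
#include <stdio.h>
#include <stdlib.h>

struct entry {
	double	when;		/* absolute expiry time */
	int	obj_id;		/* stand-in for an objcore pointer */
};

struct heap {
	struct entry	*e;
	size_t		n, cap;
};

static void
swap(struct entry *a, struct entry *b)
{
	struct entry t = *a; *a = *b; *b = t;
}

static void
heap_insert(struct heap *h, double when, int obj_id)
{
	size_t i;

	if (h->n == h->cap) {
		h->cap = h->cap ? h->cap * 2 : 16;
		h->e = realloc(h->e, h->cap * sizeof *h->e);
		if (h->e == NULL)
			abort();
	}
	i = h->n++;
	h->e[i].when = when;
	h->e[i].obj_id = obj_id;
	while (i > 0 && h->e[(i - 1) / 2].when > h->e[i].when) {	/* sift up */
		swap(&h->e[i], &h->e[(i - 1) / 2]);
		i = (i - 1) / 2;
	}
}

static int
heap_pop_due(struct heap *h, double now, struct entry *out)
{
	size_t i, c;

	if (h->n == 0 || h->e[0].when > now)
		return (0);			/* nothing due yet */
	*out = h->e[0];
	h->e[0] = h->e[--h->n];
	i = 0;
	for (;;) {				/* sift down */
		c = 2 * i + 1;
		if (c >= h->n)
			break;
		if (c + 1 < h->n && h->e[c + 1].when < h->e[c].when)
			c++;
		if (h->e[i].when <= h->e[c].when)
			break;
		swap(&h->e[i], &h->e[c]);
		i = c;
	}
	return (1);
}

int
main(void)
{
	struct heap h = {0};
	struct entry e;

	heap_insert(&h, 30.0, 1);
	heap_insert(&h, 10.0, 2);
	heap_insert(&h, 20.0, 3);
	while (heap_pop_due(&h, 25.0, &e))	/* "now" = 25: objects 2 and 3 are due */
		printf("expire obj %d (due %.0f)\n", e.obj_id, e.when);
	free(h.e);
	return (0);
}
```

The point of the changes discussed below is not these heap operations themselves, which are cheap, but the dereferencing work triggered once an entry comes off the heap or off the inbox.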

This PR

Herein I suggest a number of changes to make a single EXP thread scale up significantly:

Restructure the code to batch operations on the inbox and expired objects

Rather than handling one inbox / expired object at a time, we collect them in an array and in lists, and then work them outside the expiry lock (see the toy example after this list).

Make sure that both removal and expiry are always handled

Before, removal had precedence over expiry, which could lead to the object pileup described above.

Offload de-referencing of object cores

We offload the actual de-referencing (which, most of the time, implies object deletion) to worker threads, which implicitly provides backpressure as worker threads and other system resources are used up.
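To make the batching point concrete, here is a toy, stand-alone comparison (hypothetical code, not taken from this PR) of the two locking patterns: one lock/unlock cycle per inbox item versus draining up to a whole batch under a single lock hold and working the items outside the lock.

```c
/* Toy illustration of batched inbox draining. Compile with: cc -pthread */
#include <pthread.h>
#include <stdio.h>

#define NITEMS	10000
#define BATCH	1024

struct item {
	int		payload;
	struct item	*next;
};

static struct item items[NITEMS];
static struct item *inbox;
static pthread_mutex_t inbox_mtx = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_cycles;

static void
inbox_fill(void)
{
	int i;

	inbox = NULL;
	for (i = 0; i < NITEMS; i++) {
		items[i].payload = i;
		items[i].next = inbox;
		inbox = &items[i];
	}
}

/* one lock/unlock cycle per item: the old pattern */
static unsigned long
drain_one_at_a_time(void)
{
	struct item *it;
	unsigned long sum = 0;

	for (;;) {
		pthread_mutex_lock(&inbox_mtx);
		lock_cycles++;
		it = inbox;
		if (it != NULL)
			inbox = it->next;
		pthread_mutex_unlock(&inbox_mtx);
		if (it == NULL)
			break;
		sum += it->payload;	/* "work" done outside the lock */
	}
	return (sum);
}

/* up to BATCH items per lock hold: the new pattern */
static unsigned long
drain_batched(void)
{
	struct item *batch[BATCH];
	unsigned long sum = 0;
	unsigned n, i;

	do {
		pthread_mutex_lock(&inbox_mtx);
		lock_cycles++;
		for (n = 0; n < BATCH && inbox != NULL; n++) {
			batch[n] = inbox;
			inbox = inbox->next;
		}
		pthread_mutex_unlock(&inbox_mtx);
		for (i = 0; i < n; i++)
			sum += batch[i]->payload;	/* work outside the lock */
	} while (n == BATCH);
	return (sum);
}

int
main(void)
{
	unsigned long sum;

	inbox_fill();
	lock_cycles = 0;
	sum = drain_one_at_a_time();
	printf("one at a time: sum=%lu, lock cycles=%lu\n", sum, lock_cycles);

	inbox_fill();
	lock_cycles = 0;
	sum = drain_batched();
	printf("batched:       sum=%lu, lock cycles=%lu\n", sum, lock_cycles);
	return (0);
}
```

With 10000 items and a batch size of 1024, the second variant takes the lock 10 times instead of 10001; the PR applies the same idea to the expiry inbox and to the expired objects coming off the binheap.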

All details are documented in the individual commit messages.

Potential follow-up: hcb_cleaner has the same problem in principle, but I am not yet sure how relevant it is; I will continue testing.

@bsdphk
Contributor

bsdphk commented Jun 18, 2025

My first thought, and this is not a replacement for this PR, is that it was a mistake to have a central EXP, and that it really should be STV methods.

I think all the places which call EXP() also know the stevedore, so that shouldn't be too much havoc, and it would give STVs the option of implementing EXP integrated with their storage layout policy.

@nigoroll
Member Author

I agree with the previous comment, multi-EXP would be helpful and would allow for specific optimizations. I see this as largely orthogonal to this PR, as we still need an EXP implementation for SML, to which this PR implements improvements, and also other stevedore implementations will likely start by copying the default implementation.

@nigoroll nigoroll added the a=RunByUPLEX (This PR is being run by UPLEX in production) label on Jul 1, 2025
@nigoroll nigoroll force-pushed the exp_inbox_batch branch 2 times, most recently from 505cfd3 to d9691f5 on July 2, 2025 08:56
Before this change, we would strictly work the expire thread's inbox one item at
a time: lock, take it off the list, update the exp flags, unlock, work it. This
requires one lock/unlock cycle per object core in the exp inbox.

We now create a batch array of items: Under the lock, we do all the work as
before, but remember up to 1024 objcore pointers and their original exp flags in
an array, which we then work outside the lock, increasing overall efficiency
through less lock contention and better cache locality.

Why an array of items, why not repurpose the inbox (exp_list)? In fact, we will do that in another commit, but at this point, the list is not the right data structure: the inbox semantics require updating the exp_flags (atomically, under a lock) to signify an oc being removed or the commands being processed. Yet to process the oc later, we need the original flags. So we use an array of pointers to the oc together with the original flags. (One alternative would be a list head for every possible flag combination, but that is probably overly complex.)

This changes the order of events slightly: exp_mail_it() inserts OC_EF_REMOVE jobs at the head of the list, because they will immediately free resources and are thus more important than additions to the expire binheap; also, mailed object cores have a chance of getting removed before even being added while waiting in the inbox.

With batching, we no longer follow this priority strictly: By taking the first
(up to) 1024 items from the inbox before re-checking it, we might process inbox
commands which we would not have before. This is not (yet) a change in behavior
for the case when the expire thread is under pressure, because the inbox will
eventually contain more than 1024 OC_EF_REMOVE items at the start, and hence the
expire thread will only work these.

The choice of array size 1024 is arbitrary: It is deemed large enough for
batching to have an effect, yet small enough to matter wrt available stack space
(on 64bit, the array of this size needs 16KB).
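As a rough sketch of the batch item described above (hypothetical names and schematic flag handling, not the actual Varnish structures), each entry pairs the objcore pointer with the exp_flags as they were when the item was taken off the inbox, because the flags on the objcore itself are rewritten under the lock:

```c
/* Sketch of a batch item: pointer plus the original exp_flags. Not Varnish code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define OC_EF_INSERT	(1U << 0)
#define OC_EF_MOVE	(1U << 1)
#define OC_EF_REMOVE	(1U << 2)

struct objcore_sketch {
	uint8_t	exp_flags;
	double	ttl;
};

struct exp_batch_item {
	struct objcore_sketch	*oc;
	uint8_t			flags;	/* exp_flags as seen under the lock */
};

#define EXP_BATCH	1024	/* roughly 16 KB of stack on 64 bit */

/* Under the lock: remember the original flags, then mark the oc as processed. */
static void
take_off_inbox(struct exp_batch_item *item, struct objcore_sketch *oc)
{
	item->oc = oc;
	item->flags = oc->exp_flags;
	if (oc->exp_flags & OC_EF_REMOVE)
		oc->exp_flags = 0;	/* removal: signals mailing code to keep out */
	else
		oc->exp_flags &= ~(OC_EF_INSERT | OC_EF_MOVE);	/* commands taken over */
}

int
main(void)
{
	struct objcore_sketch oc_new = { OC_EF_INSERT, 0. };
	struct objcore_sketch oc_gone = { OC_EF_REMOVE, 0. };
	struct exp_batch_item batch[EXP_BATCH];

	take_off_inbox(&batch[0], &oc_new);
	take_off_inbox(&batch[1], &oc_gone);

	assert(batch[0].flags & OC_EF_INSERT);	/* original command preserved */
	assert(batch[1].flags & OC_EF_REMOVE);
	assert(oc_gone.exp_flags == 0);		/* oc marked: do not mail again */
	printf("sizeof(struct exp_batch_item) = %zu bytes\n",
	    sizeof(struct exp_batch_item));
	return (0);
}
```

With an 8-byte pointer and the flags padded to the next alignment boundary, each entry comes out at 16 bytes on a typical 64-bit platform, which matches the 16KB figure for 1024 entries given above.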

Also, we now unconditionally call exp_expire(), because either we have processed
the inbox and need a new tnext for potentially blocking in the next iteration,
or we did not, waited for the next due time and do need to process the binheap
anyway.
This is in preparation of offloading exp_remove_oc():

The inbox list (exp_list linkage in struct oc) is free to use for OC_EF_REMOVE commands once we have processed the inbox, because the respective object core will be removed from the cache, and the exp_flags set to zero signal to exp_mail_it() to keep out.

So we now use the list to collect ocs which need to be dereferenced. Object cores with OC_EF_REMOVE and no OC_EF_INSERT set (that is, those on the binheap) will also continue to be collected on the batch array to have them deleted from the binheap.

We also add a bit of abstraction around the list head, which will be extended in
follow-up commits. For now, it serves for tracking the number of elements on the
list and asserting on it.

So once the batch is processed, we end up with a neat list of objects to
de-reference.
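The list-head abstraction can be pictured roughly like this (a stand-alone sketch with invented names; the real code uses Varnish's VTAILQ macros and struct objcore): a plain tail queue head plus an element count that the code can assert on when the collected batch is handed over.

```c
/* Sketch of a counted list head for objcores awaiting dereference. Not Varnish code. */
#include <assert.h>
#include <stdio.h>
#include <sys/queue.h>

struct oc_sketch {
	int			id;
	TAILQ_ENTRY(oc_sketch)	exp_list;	/* linkage, free once off the inbox */
};

struct deref_list {
	TAILQ_HEAD(, oc_sketch)	head;
	unsigned		n;		/* number of elements, for asserts */
};

static void
deref_list_init(struct deref_list *dl)
{
	TAILQ_INIT(&dl->head);
	dl->n = 0;
}

static void
deref_list_add(struct deref_list *dl, struct oc_sketch *oc)
{
	TAILQ_INSERT_TAIL(&dl->head, oc, exp_list);
	dl->n++;
}

static void
deref_list_work(struct deref_list *dl)
{
	struct oc_sketch *oc;
	unsigned done = 0;

	while ((oc = TAILQ_FIRST(&dl->head)) != NULL) {
		TAILQ_REMOVE(&dl->head, oc, exp_list);
		done++;
		printf("deref oc %d\n", oc->id);	/* stands in for the objcore deref */
	}
	assert(done == dl->n);
	dl->n = 0;
}

int
main(void)
{
	struct deref_list dl;
	struct oc_sketch ocs[3];
	int i;

	deref_list_init(&dl);
	for (i = 0; i < 3; i++) {
		ocs[i].id = i + 1;
		deref_list_add(&dl, &ocs[i]);
	}
	deref_list_work(&dl);
	return (0);
}
```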
Otherwise, for big batches (think: 100Ks of objects), the n_object counter can get out of sync at human scales.
Yes, I know this is against our cstyle, but this is very useful to avoid
confusion with item->flags.
Now that we have a collection of objcores in a list, we can easily hand it off
to another worker thread to run all the de-references. This step is the first
building block towards exp scalability beyond the capacity of a single thread.

We now hand the list of objects to be removed to a worker thread once an (arbitrary) threshold of 128 objects is reached. If the expire inbox keeps growing, we keep tasking more and more worker threads, which applies back pressure by allocating more system resources to expiry, and ultimately by no threads being available for entry of new objects into the cache.
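The offload step itself follows a familiar pattern; here is a self-contained approximation (hypothetical code: the real implementation hands the list to a Varnish worker thread from the pool rather than creating raw pthreads, and the objcore list is replaced by a trivial array, but the threshold-and-detach shape is the one described above).

```c
/* Toy threshold-based offload of deref batches. Compile with: cc -pthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define OFFLOAD_THRESHOLD 128

struct deref_job {
	int		*ids;	/* stand-in for the detached list of objcores */
	unsigned	n;
};

static void *
deref_worker(void *priv)
{
	struct deref_job *job = priv;
	unsigned long sum = 0;
	unsigned i;

	for (i = 0; i < job->n; i++)
		sum += (unsigned long)job->ids[i];	/* one "deref" per element */
	printf("worker dereferenced %u objects (checksum %lu)\n", job->n, sum);
	free(job->ids);
	free(job);
	return (NULL);
}

static struct deref_job *
new_job(void)
{
	struct deref_job *job = malloc(sizeof *job);

	if (job == NULL)
		abort();
	job->ids = malloc(OFFLOAD_THRESHOLD * sizeof *job->ids);
	if (job->ids == NULL)
		abort();
	job->n = 0;
	return (job);
}

int
main(void)
{
	pthread_t thr[8];
	struct deref_job *job = new_job();
	unsigned queued = 0, i;

	/* the expiry loop keeps collecting objects to dereference ... */
	for (i = 0; i < 300; i++) {
		job->ids[job->n++] = (int)i;
		if (job->n < OFFLOAD_THRESHOLD)
			continue;
		/* ... and hands off each full batch instead of working it itself */
		if (pthread_create(&thr[queued++], NULL, deref_worker, job) != 0)
			abort();
		job = new_job();
	}
	printf("offloaded %u batches, %u derefs left for the expiry thread\n",
	    queued, job->n);
	deref_worker(job);	/* leftovers worked inline, as before */
	for (i = 0; i < queued; i++)
		pthread_join(thr[i], NULL);
	return (0);
}
```

Because each handed-off batch occupies a worker, a continuously growing inbox translates into more and more busy workers, which is exactly the backpressure effect described above.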
This moves the HSH_kill() to after binheap removal, which makes no difference other than preparing for batching, where it will be run outside the critical section.
Similar to the changes for cache removal batching, we now gather all object
cores to be expired in another head of the deref struct in one go within a
single critical region (as before, exp_list is free when the respective object
is not in the inbox) and run the derefs in the existing exp_deref_work()
function, which optionally gets offloaded to a separate worker thread.

For efficiency, we also change the handling of expired object cores with OC_EF_POSTED: previously, they would just be ignored, only to be handled later by exp_inbox(). But now more than ever it does not make sense to keep them around, so we just remove them from the inbox and add them to our list to expire.

(diff best viewed ignoring whitespace (-b option))
Previously, when exp_expire() found an expired object that was also in the inbox, it would still count it as expired. We now avoid the double count by excluding objects with OC_EF_REMOVE set.

And we also reduce traffic on the global counter by using a local variable for
the loop.
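Schematically (hypothetical names, with a C11 atomic standing in for the shared VSC counter), the counter change looks like this:

```c
/* Accumulate locally, update the shared counter once per batch. Not Varnish code. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long n_expired;	/* stand-in for the global VSC counter */

static void
expire_batch(unsigned batch_size)
{
	unsigned long expired = 0;
	unsigned i;

	for (i = 0; i < batch_size; i++) {
		/* ... expire one object ... */
		expired++;			/* local: no shared-memory traffic */
	}
	atomic_fetch_add(&n_expired, expired);	/* one global update per batch */
}

int
main(void)
{
	expire_batch(1000);
	expire_batch(234);
	printf("n_expired = %lu\n", atomic_load(&n_expired));
	return (0);
}
```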