
@codeclymb

PR Draft: ConcurrentLruCache2

What / Why

  • JMH summary: ConcurrentLruCache2Benchmark (Throughput, Threads=8, capacity=100, missRate=0.1):
    • ConcurrentLruCache: 136,826 ops/s
    • ConcurrentLruCache2: 1,237,818 ops/s
    • ≈ 9.05× throughput improvement
  • Introduce ConcurrentLruCache2: a performance-oriented LRU alternative that reduces read/write contention via
    wider striping (next power-of-two of availableProcessors), padded per-stripe counters, and a pending-based drain strategy.

Key changes

  • Read path (ReadOperations)
    • Use wider striping (buffers sized to the next power-of-two of availableProcessors, removing the previous max=4 cap).
    • Replace AtomicLongArray-based counters with per-stripe counter objects, and apply padding
      (PaddedAtomicLong / PaddedLong) to mitigate false sharing on hot counter-update paths.
  • Write path (WriteOperations)
    • Use a pending counter to defer/trigger drains, reducing drain attempts and lock contention during write bursts.
  • API / behavior
    • get returns null on a miss (no automatic loader). Callers populate via put.
    • Provide setEvictionListener receiving Entry(key, value) on eviction (default: no-op).
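To make the contract above concrete, here is a hypothetical usage sketch. `ManualLruSketch` is a stand-in built on a synchronized `LinkedHashMap`, purely to illustrate the API shape described in this PR (null on miss, explicit `put`, `Entry`-based eviction listener); it is not the striped implementation, and the class/method names outside those mentioned in the PR are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal stand-in illustrating the manual-population contract:
// get() returns null on a miss, callers populate via put(), and an
// optional listener observes evicted Entry(key, value) pairs.
class ManualLruSketch<K, V> {
    record Entry<K, V>(K key, V value) {}

    private final int capacity;
    private Consumer<Entry<K, V>> evictionListener = e -> {}; // default: no-op
    private final Map<K, V> map;

    ManualLruSketch(int capacity) {
        this.capacity = capacity;
        // accessOrder=true gives LRU iteration order; removeEldestEntry
        // evicts once size exceeds capacity and notifies the listener.
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > ManualLruSketch.this.capacity) {
                    evictionListener.accept(new Entry<>(eldest.getKey(), eldest.getValue()));
                    return true;
                }
                return false;
            }
        };
    }

    synchronized V get(K key) { return map.get(key); }             // null on miss
    synchronized void put(K key, V value) { map.put(key, value); } // caller populates
    synchronized void setEvictionListener(Consumer<Entry<K, V>> l) { this.evictionListener = l; }
}
```

Note that a capacity of 0 behaves as described below: every `put` immediately evicts the entry it inserted, so `get` always returns null.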

Performance bottlenecks (existing) and improvements (this PR)

Bottleneck: AtomicLongArray-based counter updates

  • The existing implementation uses AtomicLongArray for recordedCount/processedCount.
    Because it is backed by a contiguous primitive array, hot per-stripe counter updates may pay extra overhead and can be more
    sensitive to cache-line interactions.

Improvement: AtomicLongArray → per-stripe counter objects

  • ConcurrentLruCache2 uses per-stripe counter objects (AtomicLong[]) instead of AtomicLongArray,
    removing the dependence on a contiguous primitive array layout and aiming to lower overhead on contended updates.
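The two layouts can be sketched side by side. This is an illustrative reduction, not the PR's actual `ReadOperations` code; the field name `recordedCount` is taken from the description above, the rest is assumed.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

class StripeCounters {
    // Existing layout: all stripe slots live in one contiguous long[] behind
    // AtomicLongArray, so neighbouring slots can land on the same cache line.
    static final class Contiguous {
        final AtomicLongArray recordedCount;
        Contiguous(int stripes) { this.recordedCount = new AtomicLongArray(stripes); }
        void record(int stripe) { recordedCount.incrementAndGet(stripe); }
        long total() {
            long sum = 0;
            for (int i = 0; i < recordedCount.length(); i++) sum += recordedCount.get(i);
            return sum;
        }
    }

    // PR layout: one AtomicLong object per stripe; each counter is its own
    // heap object, so the JVM is free to place them apart in memory.
    static final class PerStripe {
        final AtomicLong[] recordedCount;
        PerStripe(int stripes) {
            recordedCount = new AtomicLong[stripes];
            for (int i = 0; i < stripes; i++) recordedCount[i] = new AtomicLong();
        }
        void record(int stripe) { recordedCount[stripe].incrementAndGet(); }
        long total() {
            long sum = 0;
            for (AtomicLong c : recordedCount) sum += c.get();
            return sum;
        }
    }
}
```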

Bottleneck: false sharing from contiguous counter layout

  • In the existing ConcurrentLruCache, ReadOperations tracks per-stripe progress via
    recordedCount/processedCount/readCount. When these are stored in contiguous primitive arrays,
    adjacent stripes can share cache lines.
  • Even when threads update different stripe indices, cache-line invalidation/ownership transfers can cause
    cache-line bouncing (false sharing), commonly reflected as higher backend stalls and CPI.

Validation: macOS CPU Performance Counters

  • We compared metrics before vs after applying padding using macOS CPU PMU counters:
| Metric | Without padding | With padding | Delta |
| --- | --- | --- | --- |
| Cycles | 241,653,360,488 | 235,880,915,249 | −2.39% |
| Instructions | 93,512,843,345 | 205,698,798,904 | +119.9% |
| CPI | 2.5842 | 1.1467 | −55.6% |
| ARM_STALL_BACKEND | 211,074,700,742 | 182,085,792,527 | −13.7% |
| ARM_STALL_BACKEND / Cycles | 0.8735 | 0.7719 | −11.6% |
| ARM_L1D_CACHE_REFILL | 865,475,980 | 1,237,841,014 | +43.0% |
| ARM_L1D_CACHE_REFILL / Instructions | 0.009255 | 0.0060178 | −35.0% |

Metrics (short notes)
  • CPI (Cycles per Instruction): average cycles per retired instruction; tends to increase with waiting/coordination overhead.

  • ARM_STALL_BACKEND: cycles where the pipeline backend is stalled; can increase with coherence/ownership waits.

  • ARM_STALL_BACKEND / Cycles: fraction of total cycles spent stalled in the backend.

  • ARM_L1D_CACHE_REFILL: number of L1D cache refills; churn can increase with invalidation/refill activity.

  • Observation: after padding, CPI, ARM_STALL_BACKEND/Cycles, and ARM_L1D_CACHE_REFILL/Instructions decreased, which is
    consistent with reduced cache-line interference on the hot path.

Improvement: padded counters to mitigate false sharing

  • Switch from contiguous AtomicLongArray/long[] usage to per-stripe padded objects
    (PaddedAtomicLong, PaddedLong) to reduce cache-line collisions between frequently-updated counters, targeting
    lower stalls on the recordRead and drain-check paths.
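The padding idea can be sketched as below. This is a hypothetical reduction of `PaddedAtomicLong`, not necessarily the PR's implementation: it appends enough `long` fields after the inherited value to push a neighbouring counter onto a different 64-byte cache line. (Inside the JDK itself, `@jdk.internal.vm.annotation.Contended` serves the same purpose.)

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical padded counter: 7 trailing longs = 56 bytes of padding, which
// together with the object header and inherited value keeps two adjacent
// counters from sharing one 64-byte cache line.
class PaddedAtomicLong extends AtomicLong {
    volatile long p1, p2, p3, p4, p5, p6, p7;

    PaddedAtomicLong(long initial) { super(initial); }

    // Reference the padding fields so an optimizing JVM is less likely to
    // treat them as dead and elide them.
    long padSum() { return p1 + p2 + p3 + p4 + p5 + p6 + p7; }
}
```

The counter keeps the full `AtomicLong` API, so call sites only change the constructor they use.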

Bottleneck: limited striping in ReadOperations

  • The existing ReadOperations uses min(4, nextPowerOfTwo(availableProcessors)) (i.e., at most 4 stripes),
    increasing the chance of multiple threads sharing the same buffers/counters under higher thread counts.

Improvement: expand ReadOperations striping

  • ConcurrentLruCache2 sets the number of buffers to the next power-of-two of availableProcessors
    (removing the max=4 cap), spreading threads across more stripes and reducing contention in record/drain-check paths.
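The sizing change reduces to one expression. A sketch of the old vs new stripe counts, assuming a `nextPowerOfTwo` helper with the usual bit-smearing implementation (the helper name follows the description above; the exact code in the PR may differ):

```java
final class Striping {
    // Smallest power of two >= x, for x >= 1, via leading-zero count.
    static int nextPowerOfTwo(int x) {
        return 1 << (32 - Integer.numberOfLeadingZeros(x - 1));
    }

    // Existing ReadOperations: at most 4 stripes regardless of core count.
    static int oldStripes(int cpus) { return Math.min(4, nextPowerOfTwo(cpus)); }

    // This PR: one power-of-two sized stripe set scaled to the core count.
    static int newStripes(int cpus) { return nextPowerOfTwo(cpus); }
}
```

On a 16-core machine the old scheme yields 4 stripes and the new one 16, so a fully loaded 8-thread benchmark spreads across four times as many buffers.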

Bottleneck: drains attempted on every write

  • The existing ConcurrentLruCache sets drainStatus = REQUIRED and attempts a drain on each write (e.g. put),
    which can lead to frequent drain attempts and lock contention during write bursts.

Improvement: pending-based drain (WriteOperations)

  • ConcurrentLruCache2 tracks pending write tasks; when pending is below a threshold, drains can be deferred to avoid
    unnecessary drain attempts.
  • Each drain processes a bounded amount of work, aiming to reduce drain/lock contention during bursts.
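The pending-based policy can be sketched as follows. The threshold and batch size here are illustrative, not the PR's actual constants, and the field names are assumptions; the point is that writes below the threshold skip the drain attempt entirely, and each drain does bounded work under a `tryLock`.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

class PendingDrain {
    static final int DRAIN_THRESHOLD = 16; // defer drains below this backlog
    static final int DRAIN_BATCH = 32;     // bounded work per drain pass

    final ConcurrentLinkedQueue<Runnable> writeBuffer = new ConcurrentLinkedQueue<>();
    final AtomicLong pending = new AtomicLong();
    final ReentrantLock evictionLock = new ReentrantLock();

    void afterWrite(Runnable task) {
        writeBuffer.add(task);
        // Unlike setting drainStatus = REQUIRED on every write, only attempt
        // a drain once enough writes have accumulated.
        if (pending.incrementAndGet() >= DRAIN_THRESHOLD) {
            tryDrain();
        }
    }

    void tryDrain() {
        if (evictionLock.tryLock()) { // skip if another thread is draining
            try {
                for (int i = 0; i < DRAIN_BATCH; i++) {
                    Runnable task = writeBuffer.poll();
                    if (task == null) break;
                    task.run();
                    pending.decrementAndGet();
                }
            } finally {
                evictionLock.unlock();
            }
        }
    }
}
```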

Compatibility and migration

ConcurrentLruCache2 is an additional implementation with a different operational model; it does not replace ConcurrentLruCache.

  • Operational model:
    • ConcurrentLruCache: generator-based miss → automatic generate + populate
    • ConcurrentLruCache2: manual population (get miss → null; caller decides whether/when to put)
  • API (population): ConcurrentLruCache2 exposes put as public so callers can control population.
  • No automatic generation: generator-based flows should keep using ConcurrentLruCache, or call an external loader and then put.
  • Capacity 0: if created with capacity 0, get always returns null and entries inserted via put are immediately evicted
    (effectively disabling caching).
  • Null-handling: callers must handle null from get; stored values must be non-null.
  • Eviction listener: default is no-op; when configured, the listener receives Entry(key, value) on eviction/removal/clear.
  • Choosing between implementations:
    • keep ConcurrentLruCache for auto-loader needs
    • choose ConcurrentLruCache2 for manual population + eviction hook + lower contention
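For callers migrating off the generator-based API, the manual-population flow reduces to "get, null-check, load, put". A minimal sketch, with a plain `ConcurrentHashMap` standing in for ConcurrentLruCache2's get/put surface; `getOrLoad` is a hypothetical caller-side helper, not part of this PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ManualPopulation {
    static <K, V> V getOrLoad(Map<K, V> cache, K key, Function<K, V> loader) {
        V value = cache.get(key);       // null on miss, as with ConcurrentLruCache2
        if (value == null) {
            // Two threads may race and load twice; the second put simply wins.
            // Callers needing single-flight loading must coordinate externally.
            value = loader.apply(key);  // external load, then explicit put
            cache.put(key, value);      // stored values must be non-null
        }
        return value;
    }
}
```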

Tests

  • ./gradlew :spring-core:test (JDK 25)
  • JMH:
    • JAVA_HOME=/path/to/jdk25 ./gradlew :spring-core:jmhJar
    • $JAVA_HOME/bin/java -jar spring-core/build/libs/*-jmh.jar "org.springframework.util.ConcurrentLruCache2Benchmark.*"

Signed-off-by: seungjong bae <bcj0114@gmail.com>
@spring-projects-issues added the label status: waiting-for-triage (an issue we've not yet triaged or decided on) on Dec 24, 2025.