
@codeclymb

PR Draft: ConcurrentLruCache2

What / Why

  • JMH summary: ConcurrentLruCache2Benchmark (Throughput, Threads=8, capacity=100, missRate=0.1):
    • ConcurrentLruCache: 136,826 ops/s
    • ConcurrentLruCache2: 1,237,818 ops/s
    • ≈ 9.05× throughput improvement
  • Introduce ConcurrentLruCache2: a performance-oriented LRU alternative that reduces read/write contention via
    wider striping (next power-of-two of availableProcessors), padded per-stripe counters, and a pending-based drain strategy.

Key changes

  • Read path (ReadOperations)
    • Use wider striping (buffers sized to the next power-of-two of availableProcessors, removing the previous max=4 cap).
    • Replace AtomicLongArray-based counters with per-stripe counter objects, and apply padding
      (PaddedAtomicLong / PaddedLong) to mitigate false sharing on hot counter-update paths.
  • Write path (WriteOperations)
    • Use a pending counter to defer/trigger drains, reducing drain attempts and lock contention during write bursts.
  • API / behavior
    • get returns null on a miss (no automatic loader). Callers populate via put.
    • Provide setEvictionListener receiving Entry(key, value) on eviction (default: no-op).
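To make the contract above concrete, here is a hypothetical usage sketch. `ManualLruSketch` is a stand-in built on a synchronized `LinkedHashMap`, purely to illustrate the API shape described in this PR (null on miss, explicit `put`, `Entry`-based eviction listener); it is not the striped implementation, and the class/method names outside those mentioned in the PR are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal stand-in illustrating the manual-population contract:
// get() returns null on a miss, callers populate via put(), and an
// optional listener observes evicted Entry(key, value) pairs.
class ManualLruSketch<K, V> {
    record Entry<K, V>(K key, V value) {}

    private final int capacity;
    private Consumer<Entry<K, V>> evictionListener = e -> {}; // default: no-op
    private final Map<K, V> map;

    ManualLruSketch(int capacity) {
        this.capacity = capacity;
        // accessOrder=true gives LRU iteration order; removeEldestEntry
        // evicts once size exceeds capacity and notifies the listener.
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > ManualLruSketch.this.capacity) {
                    evictionListener.accept(new Entry<>(eldest.getKey(), eldest.getValue()));
                    return true;
                }
                return false;
            }
        };
    }

    synchronized V get(K key) { return map.get(key); }             // null on miss
    synchronized void put(K key, V value) { map.put(key, value); } // caller populates
    synchronized void setEvictionListener(Consumer<Entry<K, V>> l) { this.evictionListener = l; }
}
```

Note that a capacity of 0 behaves as described below: every `put` immediately evicts the entry it inserted, so `get` always returns null.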

Performance bottlenecks (existing) and improvements (this PR)

Bottleneck: AtomicLongArray-based counter updates

  • The existing implementation uses AtomicLongArray for recordedCount/processedCount.
    Because it is backed by a contiguous primitive array, hot per-stripe counter updates may pay extra overhead and can be more
    sensitive to cache-line interactions.

Improvement: AtomicLongArray → per-stripe counter objects

  • ConcurrentLruCache2 uses per-stripe counter objects (AtomicLong[]) instead of AtomicLongArray,
    removing the dependence on a contiguous primitive array layout and aiming to lower overhead on contended updates.
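The two layouts can be sketched side by side. This is an illustrative reduction, not the PR's actual `ReadOperations` code; the field name `recordedCount` is taken from the description above, the rest is assumed.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

class StripeCounters {
    // Existing layout: all stripe slots live in one contiguous long[] behind
    // AtomicLongArray, so neighbouring slots can land on the same cache line.
    static final class Contiguous {
        final AtomicLongArray recordedCount;
        Contiguous(int stripes) { this.recordedCount = new AtomicLongArray(stripes); }
        void record(int stripe) { recordedCount.incrementAndGet(stripe); }
        long total() {
            long sum = 0;
            for (int i = 0; i < recordedCount.length(); i++) sum += recordedCount.get(i);
            return sum;
        }
    }

    // PR layout: one AtomicLong object per stripe; each counter is its own
    // heap object, so the JVM is free to place them apart in memory.
    static final class PerStripe {
        final AtomicLong[] recordedCount;
        PerStripe(int stripes) {
            recordedCount = new AtomicLong[stripes];
            for (int i = 0; i < stripes; i++) recordedCount[i] = new AtomicLong();
        }
        void record(int stripe) { recordedCount[stripe].incrementAndGet(); }
        long total() {
            long sum = 0;
            for (AtomicLong c : recordedCount) sum += c.get();
            return sum;
        }
    }
}
```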

Bottleneck: false sharing from contiguous counter layout

  • In the existing ConcurrentLruCache, ReadOperations tracks per-stripe progress via
    recordedCount/processedCount/readCount. When these are stored in contiguous primitive arrays,
    adjacent stripes can share cache lines.
  • Even when threads update different stripe indices, cache-line invalidation/ownership transfers can cause
    cache-line bouncing (false sharing), commonly reflected as higher backend stalls and CPI.

Validation: macOS CPU Performance Counters

  • We compared metrics before vs after applying padding using macOS CPU PMU counters:
| Metric | Without padding | With padding | Delta |
| --- | --- | --- | --- |
| Cycles | 241,653,360,488 | 235,880,915,249 | −2.39% |
| Instructions | 93,512,843,345 | 205,698,798,904 | +119.9% |
| CPI | 2.5842 | 1.1467 | −55.6% |
| ARM_STALL_BACKEND | 211,074,700,742 | 182,085,792,527 | −13.7% |
| ARM_STALL_BACKEND / Cycles | 0.8735 | 0.7719 | −11.6% |
| ARM_L1D_CACHE_REFILL | 865,475,980 | 1,237,841,014 | +43.0% |
| ARM_L1D_CACHE_REFILL / Instructions | 0.009255 | 0.0060178 | −35.0% |

Metrics (short notes)
  • CPI (Cycles per Instruction): average cycles per retired instruction; tends to increase with waiting/coordination overhead.

  • ARM_STALL_BACKEND: cycles where the pipeline backend is stalled; can increase with coherence/ownership waits.

  • ARM_STALL_BACKEND / Cycles: fraction of total cycles spent stalled in the backend.

  • ARM_L1D_CACHE_REFILL: number of L1D cache refills; churn can increase with invalidation/refill activity.

  • Observation: after padding, CPI, ARM_STALL_BACKEND/Cycles, and ARM_L1D_CACHE_REFILL/Instructions decreased, which is
    consistent with reduced cache-line interference on the hot path.

Improvement: padded counters to mitigate false sharing

  • Switch from contiguous AtomicLongArray/long[] usage to per-stripe padded objects
    (PaddedAtomicLong, PaddedLong) to reduce cache-line collisions between frequently-updated counters, targeting
    lower stalls on the recordRead and drain-check paths.
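The padding idea can be sketched as below. This is a hypothetical reduction of `PaddedAtomicLong`, not necessarily the PR's implementation: it appends enough `long` fields after the inherited value to push a neighbouring counter onto a different 64-byte cache line. (Inside the JDK itself, `@jdk.internal.vm.annotation.Contended` serves the same purpose.)

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical padded counter: 7 trailing longs = 56 bytes of padding, which
// together with the object header and inherited value keeps two adjacent
// counters from sharing one 64-byte cache line.
class PaddedAtomicLong extends AtomicLong {
    volatile long p1, p2, p3, p4, p5, p6, p7;

    PaddedAtomicLong(long initial) { super(initial); }

    // Reference the padding fields so an optimizing JVM is less likely to
    // treat them as dead and elide them.
    long padSum() { return p1 + p2 + p3 + p4 + p5 + p6 + p7; }
}
```

The counter keeps the full `AtomicLong` API, so call sites only change the constructor they use.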

Bottleneck: limited striping in ReadOperations

  • The existing ReadOperations uses min(4, nextPowerOfTwo(availableProcessors)) (i.e., at most 4 stripes),
    increasing the chance of multiple threads sharing the same buffers/counters under higher thread counts.

Improvement: expand ReadOperations striping

  • ConcurrentLruCache2 sets the number of buffers to the next power-of-two of availableProcessors
    (removing the max=4 cap), spreading threads across more stripes and reducing contention in record/drain-check paths.
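The sizing change reduces to one expression. A sketch of the old vs new stripe counts, assuming a `nextPowerOfTwo` helper with the usual bit-smearing implementation (the helper name follows the description above; the exact code in the PR may differ):

```java
final class Striping {
    // Smallest power of two >= x, for x >= 1, via leading-zero count.
    static int nextPowerOfTwo(int x) {
        return 1 << (32 - Integer.numberOfLeadingZeros(x - 1));
    }

    // Existing ReadOperations: at most 4 stripes regardless of core count.
    static int oldStripes(int cpus) { return Math.min(4, nextPowerOfTwo(cpus)); }

    // This PR: one power-of-two sized stripe set scaled to the core count.
    static int newStripes(int cpus) { return nextPowerOfTwo(cpus); }
}
```

On a 16-core machine the old scheme yields 4 stripes and the new one 16, so a fully loaded 8-thread benchmark spreads across four times as many buffers.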

Bottleneck: drains attempted on every write

  • The existing ConcurrentLruCache sets drainStatus = REQUIRED and attempts a drain on each write (e.g. put),
    which can lead to frequent drain attempts and lock contention during write bursts.

Improvement: pending-based drain (WriteOperations)

  • ConcurrentLruCache2 tracks pending write tasks; when pending is below a threshold, drains can be deferred to avoid
    unnecessary drain attempts.
  • Each drain processes a bounded amount of work, aiming to reduce drain/lock contention during bursts.
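The pending-based policy can be sketched as follows. The threshold and batch size here are illustrative, not the PR's actual constants, and the field names are assumptions; the point is that writes below the threshold skip the drain attempt entirely, and each drain does bounded work under a `tryLock`.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

class PendingDrain {
    static final int DRAIN_THRESHOLD = 16; // defer drains below this backlog
    static final int DRAIN_BATCH = 32;     // bounded work per drain pass

    final ConcurrentLinkedQueue<Runnable> writeBuffer = new ConcurrentLinkedQueue<>();
    final AtomicLong pending = new AtomicLong();
    final ReentrantLock evictionLock = new ReentrantLock();

    void afterWrite(Runnable task) {
        writeBuffer.add(task);
        // Unlike setting drainStatus = REQUIRED on every write, only attempt
        // a drain once enough writes have accumulated.
        if (pending.incrementAndGet() >= DRAIN_THRESHOLD) {
            tryDrain();
        }
    }

    void tryDrain() {
        if (evictionLock.tryLock()) { // skip if another thread is draining
            try {
                for (int i = 0; i < DRAIN_BATCH; i++) {
                    Runnable task = writeBuffer.poll();
                    if (task == null) break;
                    task.run();
                    pending.decrementAndGet();
                }
            } finally {
                evictionLock.unlock();
            }
        }
    }
}
```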

Compatibility and migration

ConcurrentLruCache2 is an additional implementation with a different operational model; it does not replace ConcurrentLruCache.

  • Operational model:
    • ConcurrentLruCache: generator-based miss → automatic generate + populate
    • ConcurrentLruCache2: manual population (get miss → null; caller decides whether/when to put)
  • API (population): ConcurrentLruCache2 exposes put as public so callers can control population.
  • No automatic generation: generator-based flows should keep using ConcurrentLruCache, or call an external loader and then put.
  • Capacity 0: if created with capacity 0, get always returns null and entries inserted via put are immediately evicted
    (effectively disabling caching).
  • Null-handling: callers must handle null from get; stored values must be non-null.
  • Eviction listener: default is no-op; when configured, the listener receives Entry(key, value) on eviction/removal/clear.
  • Choosing between implementations:
    • keep ConcurrentLruCache for auto-loader needs
    • choose ConcurrentLruCache2 for manual population + eviction hook + lower contention
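For callers migrating off the generator-based API, the manual-population flow reduces to "get, null-check, load, put". A minimal sketch, with a plain `ConcurrentHashMap` standing in for ConcurrentLruCache2's get/put surface; `getOrLoad` is a hypothetical caller-side helper, not part of this PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ManualPopulation {
    static <K, V> V getOrLoad(Map<K, V> cache, K key, Function<K, V> loader) {
        V value = cache.get(key);       // null on miss, as with ConcurrentLruCache2
        if (value == null) {
            // Two threads may race and load twice; the second put simply wins.
            // Callers needing single-flight loading must coordinate externally.
            value = loader.apply(key);  // external load, then explicit put
            cache.put(key, value);      // stored values must be non-null
        }
        return value;
    }
}
```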

Tests

  • ./gradlew :spring-core:test (JDK 25)
  • JMH:
    • JAVA_HOME=/path/to/jdk25 ./gradlew :spring-core:jmhJar
    • $JAVA_HOME/bin/java -jar spring-core/build/libs/*-jmh.jar "org.springframework.util.ConcurrentLruCache2Benchmark.*"

Signed-off-by: seungjong bae <bcj0114@gmail.com>
@spring-projects-issues added the label status: waiting-for-triage (an issue we've not yet triaged or decided on) on Dec 24, 2025.