Improve histogram, summary performance under contention by striping observationCount #1794
jack-berg wants to merge 1 commit into prometheus:main from …
Conversation
Force-pushed from 8cd7e50 to 4c2146c
…bservationCount Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
Force-pushed from 4c2146c to ad1e7cb
```java
long count = observationCount.incrementAndGet();
AtomicLong observationCountForThread =
    stripedObservationCounts[
        ((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
```
One thing to consider is whether there is a better algorithm for accessing one of these striped instances. With this algorithm based on thread id, it's conceivable that the thread ids could be arranged in such a way that there end up being "hot spots" (i.e. AtomicLong instances that end up receiving a disproportionate amount of the updates). LongAdder solves this by using a random number generator to select which "cell" to access.
An alternative to using the thread id to stripe is to use `ThreadLocalRandom.current().nextInt() % stripedObservationCounts.length`. Is there any issue with using ThreadLocalRandom here?
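For illustration, here is a minimal sketch of what that ThreadLocalRandom-based selection could look like. The class and method names are hypothetical, and using the bounded `nextInt(bound)` overload instead of `nextInt() % length` is an assumption added here, since it always returns a value in [0, bound) and sidesteps the negative-result question:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical standalone class; the real field lives in the Buffer class.
class RandomStripeSelection {
  private final AtomicLong[] stripedObservationCounts;

  RandomStripeSelection() {
    // Sized like the PR does: one stripe per available processor.
    stripedObservationCounts = new AtomicLong[Runtime.getRuntime().availableProcessors()];
    for (int i = 0; i < stripedObservationCounts.length; i++) {
      stripedObservationCounts[i] = new AtomicLong();
    }
  }

  long increment() {
    // nextInt(bound) returns a value in [0, bound), so no sign handling is needed,
    // and ThreadLocalRandom avoids contention on a shared Random instance.
    int stripeIndex = ThreadLocalRandom.current().nextInt(stripedObservationCounts.length);
    return stripedObservationCounts[stripeIndex].incrementAndGet();
  }
}
```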
Pull request overview
This pull request improves the performance of Histogram and Summary metrics under high contention by implementing striping for the observationCount field in the Buffer class. The optimization replaces a single AtomicLong with an array of AtomicLong instances, distributing contention across multiple counters based on thread ID.
Changes:
- Introduced `stripedObservationCounts` array sized by available processors to distribute contention
- Modified `append()` method to select a stripe based on thread ID for lock-free counting
- Updated `run()` method to aggregate counts across all stripes during collection
```java
AtomicLong observationCountForThread =
    stripedObservationCounts[
        ((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
```
Casting the thread ID to int before taking the modulo can cause issues. Thread.getId() returns a long, and casting it to int can yield a negative value due to overflow; because Java's % keeps the sign of a negative dividend, the resulting index can be negative and cause an ArrayIndexOutOfBoundsException. Use Math.abs() on the casted int value, use a different approach like (int) (threadId & Integer.MAX_VALUE) to ensure a non-negative index, or better yet, use Long.hashCode(Thread.currentThread().getId()), which properly handles the conversion.
Suggested change:

```diff
-AtomicLong observationCountForThread =
-    stripedObservationCounts[
-        ((int) Thread.currentThread().getId()) % stripedObservationCounts.length];
+int stripeIndex =
+    (Long.hashCode(Thread.currentThread().getId()) & Integer.MAX_VALUE)
+        % stripedObservationCounts.length;
+AtomicLong observationCountForThread = stripedObservationCounts[stripeIndex];
```
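As a quick standalone illustration of the failure mode described above (the thread id value here is hypothetical; real ids are assigned by the JVM):

```java
public class NegativeStripeIndexDemo {
  public static void main(String[] args) {
    long threadId = 0x80000001L; // assumed value whose low 32 bits have the sign bit set
    int stripes = 8;

    // Java's % keeps the sign of the dividend, so cast-then-modulo can go negative.
    int badIndex = ((int) threadId) % stripes; // (int) 0x80000001L == -2147483647, so badIndex == -7
    // Masking with Integer.MAX_VALUE clears the sign bit before the modulo.
    int safeIndex = (Long.hashCode(threadId) & Integer.MAX_VALUE) % stripes; // safeIndex == 1

    System.out.println("bad index:  " + badIndex);  // negative -> ArrayIndexOutOfBoundsException if used
    System.out.println("safe index: " + safeIndex); // always in [0, stripes)
  }
}
```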
Yeah, I should and was planning to fix this before merging.
```java
expectedBufferSize =
    (int) ((observationCount.getAndSet(0) & ~bufferActiveBit) - expectedCount);
for (AtomicLong observationCount : stripedObservationCounts) {
  expectedBufferSize += (int) (observationCount.getAndSet(0) & ~bufferActiveBit);
```
Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness. The masking operation (~bufferActiveBit) already produces the correct count value as a long.
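A sketch of the aggregation the comment is asking for, assuming the field names from the diff above and a placeholder bufferActiveBit value; the running total stays a long so no bits are truncated:

```java
import java.util.concurrent.atomic.AtomicLong;

class StripeDrainSketch {
  // Placeholder flag value for illustration; the real bufferActiveBit is defined in Buffer.
  private static final long bufferActiveBit = 1L << 62;

  // Sums and resets every stripe, accumulating into a long without a narrowing cast.
  static long drainObservationCounts(AtomicLong[] stripedObservationCounts) {
    long expectedBufferSize = 0;
    for (AtomicLong observationCount : stripedObservationCounts) {
      expectedBufferSize += observationCount.getAndSet(0) & ~bufferActiveBit;
    }
    return expectedBufferSize;
  }
}
```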
Existing but will fix
```java
} else {
  expectedBufferSize = (int) (observationCount.addAndGet(bufferActiveBit) - expectedCount);
  for (AtomicLong observationCount : stripedObservationCounts) {
    expectedBufferSize += (int) observationCount.addAndGet(bufferActiveBit);
```
Unnecessary and potentially lossy cast to int. The observationCount values are being cast to int before being added to expectedBufferSize (which is a long). This cast truncates the value to 32 bits, which could cause incorrect calculations if any individual stripe's count exceeds Integer.MAX_VALUE. Although unlikely in practice due to striping, the cast should be removed to maintain correctness.
```java
try {
  // Signal that the buffer is active.
  Long expectedCount = observationCount.getAndAdd(bufferActiveBit);
  Long expectedCount = 0L;
```
The variable 'expectedCount' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Long'.
Suggested change:

```diff
-Long expectedCount = 0L;
+long expectedCount = 0L;
```
Existing but will fix
Was working on improving the performance of opentelemetry-java metrics under high contention, and realized that the same strategy I identified to help over there helps for the prometheus implementation as well!
The idea here is recognizing that `Buffer.observationCount` is the bottleneck under contention. In contrast to the other histogram / summary `LongAdder` fields, `Buffer.observationCount` is an `AtomicLong`, which performs much worse than `LongAdder` under high contention. It's necessary that the type is `AtomicLong` because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and all records have successfully completed (preventing partial writes).

However, we can "have our cake and eat it too" by striping `Buffer.observationCount` into many instances, such that the contention on any one instance is reduced. This is actually what `LongAdder` does under the covers. This implementation stripes it into `Runtime.getRuntime().availableProcessors()` instances, and uses `Thread.currentThread().getId() % stripedObservationCounts.length` to select which instance any particular record thread should use.
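Here is a minimal, self-contained sketch of the striping idea described above. It is a hypothetical standalone class, not the actual Buffer change: it omits the bufferActiveBit signaling between the record and collect paths, and the index computation uses the masked Long.hashCode variant discussed in the review comments rather than the raw (int) cast:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: spread increments across several AtomicLongs so no single counter
// becomes a CAS hot spot, then sum the stripes at collection time.
class StripedObservationCount {
  private final AtomicLong[] stripes;

  StripedObservationCount() {
    // One stripe per available processor, as in this pull request.
    stripes = new AtomicLong[Runtime.getRuntime().availableProcessors()];
    for (int i = 0; i < stripes.length; i++) {
      stripes[i] = new AtomicLong();
    }
  }

  // Record path: each thread derives its stripe from its thread id, so
  // concurrently recording threads mostly touch different AtomicLong instances.
  long increment() {
    int index =
        (Long.hashCode(Thread.currentThread().getId()) & Integer.MAX_VALUE) % stripes.length;
    return stripes[index].incrementAndGet();
  }

  // Collect path: the total observation count is the sum over all stripes.
  long sum() {
    long total = 0;
    for (AtomicLong stripe : stripes) {
      total += stripe.get();
    }
    return total;
  }
}
```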
The performance increase is substantial. Here's the before and after of `HistogramBenchmark` on my machine (Apple M4 Mac Pro w/ 48 GB RAM):

Before:

After: