Conversation

@keith-turner
Contributor

Added per-scan tracing statistics for the following:

  • The count of compressed bytes read from DFS (when all data is read from cache this will be zero).
  • The count of uncompressed bytes read from DFS or cache
  • The count of bytes returned (after iterators filter)
  • The count of key/values read before any filtering
  • The count of key/values returned after filtering
  • The count of seeks done
  • Statistics for cache hits, misses, and bypasses. A bypass is when an rfile block is read w/o using the cache.

The statistics are included in a tracing span that wraps reading each batch of key values in a tablet server or scan server. So if a scan reads 10 batches of key values, then 10 spans will be emitted for tracing data. Each span will include the statistics for that batch.
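
The one-span-per-batch model can be sketched with a toy tracer. This is purely illustrative — `ToySpan`, `read_scan`, and the even-number "filter" are hypothetical stand-ins, not Accumulo's implementation (which uses OpenTelemetry spans):

```python
from dataclasses import dataclass, field

@dataclass
class ToySpan:
    """Stand-in for a tracing span carrying key/value attributes."""
    name: str
    attributes: dict = field(default_factory=dict)

def read_scan(batches, emitted_spans):
    """Read each batch inside its own span, attaching that batch's stats only."""
    results = []
    for batch in batches:
        span = ToySpan(name="scan.read.batch")
        kept = [kv for kv in batch if kv % 2 == 0]  # pretend iterator filtering
        # Stats are scoped to this batch, not the whole scan.
        span.attributes["accumulo.entries.read"] = len(batch)
        span.attributes["accumulo.entries.returned"] = len(kept)
        emitted_spans.append(span)
        results.extend(kept)
    return results

spans = []
read_scan([[1, 2, 3], [4, 5], [6]], spans)
print(len(spans))  # one span per batch -> 3
```

A scan that reads 10 batches would emit 10 such spans, matching the description above.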

@keith-turner keith-turner added this to the 2.1.5 milestone Dec 9, 2025
@keith-turner
Contributor Author

I have been testing this using Jaeger. The following json was pulled from Jaeger for a single span; it shows some of the new tags in this PR. It shows a scan batch read 560,364 compressed bytes from DFS, which turned into 1,109,757 uncompressed bytes read, and finally returned 54 bytes to the user after filtering. It shows that 78,343 key/values were read before any filtering was done by the scan, and after filtering only 3 key/values were returned. The data cache was turned off for the table; the stats show that 12 rfile data blocks were read bypassing the cache. The index cache was turned on; the 1 rfile index block read shows up as a cache hit.

                   "tags": [
                        {
                            "key": "accumulo.bytes.read",
                            "type": "int64",
                            "value": 1109757
                        },
                        {
                            "key": "accumulo.bytes.read.file",
                            "type": "int64",
                            "value": 560364
                        },
                        {
                            "key": "accumulo.bytes.returned",
                            "type": "int64",
                            "value": 54
                        },
                        {
                            "key": "accumulo.cache.data.bypasses",
                            "type": "int64",
                            "value": 12
                        },
                        {
                            "key": "accumulo.cache.data.hits",
                            "type": "int64",
                            "value": 0
                        },
                        {
                            "key": "accumulo.cache.data.misses",
                            "type": "int64",
                            "value": 0
                        },
                        {
                            "key": "accumulo.cache.index.bypasses",
                            "type": "int64",
                            "value": 0
                        },
                        {
                            "key": "accumulo.cache.index.hits",
                            "type": "int64",
                            "value": 1
                        },
                        {
                            "key": "accumulo.cache.index.misses",
                            "type": "int64",
                            "value": 0
                        },
                        {
                            "key": "accumulo.entries.read",
                            "type": "int64",
                            "value": 78343
                        },
                        {
                            "key": "accumulo.entries.returned",
                            "type": "int64",
                            "value": 3
                        },
                        {
                            "key": "accumulo.executor",
                            "type": "string",
                            "value": "default"
                        },
                        {
                            "key": "accumulo.extent",
                            "type": "string",
                            "value": "1;04fe;04de"
                        },
                        {
                            "key": "accumulo.seeks",
                            "type": "int64",
                            "value": 1
                        },
                        {
                            "key": "accumulo.server",
                            "type": "string",
                            "value": "localhost:9997"
                        },
                        {
                            "key": "accumulo.table.id",
                            "type": "string",
                            "value": "1"
                        },
                        {
                            "key": "otel.scope.name",
                            "type": "string",
                            "value": "org.apache.accumulo"
                        },
                        {
                            "key": "otel.scope.version",
                            "type": "string",
                            "value": "2.1.5-SNAPSHOT"
                        },
                        {
                            "key": "span.kind",
                            "type": "string",
                            "value": "internal"
                        },
                        {
                            "key": "thread.id",
                            "type": "int64",
                            "value": 1215
                        },
                        {
                            "key": "thread.name",
                            "type": "string",
                            "value": "Client: localhost:58832 User: root Start: 1765238204805 Tablet: 1;04fe;04de"
                        }
                    ],

I wanted to see if it's possible to pull the data for all spans from Jaeger and sum them up. This allows seeing the stats for a single scan as it hits many servers across the cluster. I wrote the following script to do this.

#!/bin/bash

YSERVER=$1
TRACE_ID=$2

trace_tmp="$(mktemp -t trace.XXXXXXXX.json)"

curl -s "http://$YSERVER:16686/api/traces/$TRACE_ID?prettyPrint=false" > "$trace_tmp"

echo "Sum of all traces"
jq -r '.data[0].spans[] | .tags | .[] | [.key,.type,.value] | @csv' "$trace_tmp" | grep int64 | grep -v thread.id | awk -F ',' '{a[$1] += $3} END{for (i in a) print "\t" i, a[i]}' | sort
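
For environments without jq, the same rollup can be sketched in Python. This assumes the JSON shape that the jq path `.data[0].spans[].tags` implies (fetching from Jaeger is left out; the sample document below is hand-made for illustration):

```python
import json
from collections import defaultdict

def sum_int64_tags(trace_json, skip=("thread.id",)):
    """Sum every int64 tag across all spans of the first trace,
    mirroring the jq/grep/awk pipeline in the script above."""
    totals = defaultdict(int)
    for span in trace_json["data"][0]["spans"]:
        for tag in span["tags"]:
            if tag["type"] == "int64" and tag["key"] not in skip:
                totals[tag["key"]] += tag["value"]
    return dict(totals)

# Tiny hand-made example with two spans:
doc = {"data": [{"spans": [
    {"tags": [{"key": "accumulo.seeks", "type": "int64", "value": 1},
              {"key": "thread.id", "type": "int64", "value": 99}]},
    {"tags": [{"key": "accumulo.seeks", "type": "int64", "value": 2}]},
]}]}
print(sum_int64_tags(doc))  # {'accumulo.seeks': 3}
```

Note that `thread.id` is excluded from the sum just as the script's `grep -v thread.id` does, since summing thread ids is meaningless.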

I ran the script against a trace of a scan that filtered data on 128 tablets across 4 tservers. That worked well: the 333 entries returned matches how many key/values the scan actually returned, so that was a good test. The json from Jaeger had at least 128 spans, like the example above, that needed to be rolled up.

$ ./sum_scan_traces.sh localhost c22f939eb5dc1ba294b6aeae7941ebb5
Sum of all traces
	"accumulo.bytes.read" 162803236
	"accumulo.bytes.read.file" 82245238
	"accumulo.bytes.returned" 5972
	"accumulo.cache.data.bypasses" 1694
	"accumulo.cache.data.hits" 0
	"accumulo.cache.data.misses" 0
	"accumulo.cache.index.bypasses" 0
	"accumulo.cache.index.hits" 191
	"accumulo.cache.index.misses" 0
	"accumulo.entries.read" 9988264
	"accumulo.entries.returned" 333
	"accumulo.seeks" 128

@keith-turner
Contributor Author

There are a few TODOs still in the PR. Some of them I need to open issues for and remove. Some of them are questions I had while working on this.

@dlmarion
Contributor

How is the traceId created? Is it something created by OpenTelemetry, or is it something that we create? I created #2956 a while back in an effort to help users correlate scan information across the servers. Allowing the client to set the id, and then using that id in the log messages and traces could help users single out issues or performance information for a single scan.

@dlmarion
Contributor

"accumulo.bytes.read" 162803236
"accumulo.bytes.read.file" 82245238
"accumulo.bytes.returned" 5972
"accumulo.cache.data.bypasses" 1694
"accumulo.cache.data.hits" 0
"accumulo.cache.data.misses" 0
"accumulo.cache.index.bypasses" 0
"accumulo.cache.index.hits" 191
"accumulo.cache.index.misses" 0
"accumulo.entries.read" 9988264
"accumulo.entries.returned" 333
"accumulo.seeks" 128

Suggest adding scan to the names where it makes sense. For example, accumulo.entries.read, could be relevant for compactions as well. Maybe it doesn't matter in the tracing because it's per thread of execution? It would certainly matter for any metrics emitted as those will be aggregated.

@keith-turner
Contributor Author

How is the traceId created? Is it something created by OpenTelemetry, or is it something that we create? I created #2956 a while back in an effort to help users correlate scan information across the servers. Allowing the client to set the id, and then using that id in the log messages and traces could help users single out issues or performance information for a single scan.

It seems like when a new span is created and there is no active trace on the thread, OpenTelemetry will create a new random trace id. That trace id can be retrieved from the span. If a user creates a span on the client side that wraps their scan, then that trace id will be available to anything server side that creates a span and asks for the trace id.
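
The propagation mechanics can be illustrated with the W3C `traceparent` header format that OpenTelemetry uses to carry the trace id across RPC boundaries. This is a minimal pure-Python sketch of the format only, not Accumulo's actual propagation code:

```python
import re
import secrets

def new_traceparent():
    """Build a header like OpenTelemetry would when no trace is active:
    version 00, random 128-bit trace id, random 64-bit span id, sampled flag."""
    trace_id = secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)    # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def trace_id_of(traceparent):
    """Pull the trace id back out on the receiving side, as a server-side
    span could via its span context."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", traceparent)
    if not m:
        raise ValueError("malformed traceparent")
    return m.group(1)

hdr = new_traceparent()
print(trace_id_of(hdr) == hdr.split("-")[1])  # True
```

The client-side span fixes the trace id; every server-side span created under it reports the same id, which is what lets the rollup script above correlate spans across servers.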

@dlmarion
Contributor

How is the traceId created? Is it something created by OpenTelemetry, or is it something that we create? I created #2956 a while back in an effort to help users correlate scan information across the servers. Allowing the client to set the id, and then using that id in the log messages and traces could help users single out issues or performance information for a single scan.

It seems like when a new span is created and there is no active trace on the thread, OpenTelemetry will create a new random trace id. That trace id can be retrieved from the span. If a user creates a span on the client side that wraps their scan, then that trace id will be available to anything server side that creates a span and asks for the trace id.

Ok, so a client application that also uses OpenTelemetry could start a span before calling the Accumulo client. The span will have a randomly generated trace id, but they could set an attribute on the span that they create with information from their application that will help them identify the thread of execution on the server side. I looked at one resource that says Jaeger "Tags" in their UI map to the OpenTelemetry attributes.

@keith-turner
Contributor Author

keith-turner commented Dec 15, 2025

Ok, so a client application that also uses OpenTelemetry could start a span before calling the Accumulo client.

I added an IT in 2b79fd9 that spins up an external process that traces two scans (in the class ScanTraceClient). When the test runs it sees the two trace ids for each scan.

@keith-turner
Contributor Author

Suggest adding scan to the names where it makes sense. For example, accumulo.entries.read, could be relevant for compactions as well. Maybe it doesn't matter in the tracing because it's per thread of execution? It would certainly matter for any metrics emitted as those will be aggregated.

Made that change in ae2abb7. These attributes are only used in spans related to scans, so they are scoped to scans in that way. It's probably good to have good attribute names across all span types though.

Contributor

Not sure what you modified here exactly, but commons-io also has a counting input stream. See https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BoundedInputStream.html.

Contributor Author

I can look into that. I added two methods, one to reset the count to zero and another to get the wrapped input stream.
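
The behavior of those two methods can be sketched as follows. This is in Python purely to illustrate the idea — the actual change is to a Java counting input stream, and the names here are mine:

```python
import io

class CountingReader:
    """Wraps a stream and counts bytes read. Supports resetting the count
    to zero and retrieving the wrapped stream -- the two methods described
    above, which BoundedInputStream does not appear to offer."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.count = 0

    def read(self, n=-1):
        data = self._wrapped.read(n)
        self.count += len(data)  # track bytes actually returned
        return data

    def reset_count(self):
        """Reset the byte count to zero without disturbing the stream."""
        self.count = 0

    def get_wrapped(self):
        """Expose the underlying stream."""
        return self._wrapped

r = CountingReader(io.BytesIO(b"abcdef"))
r.read(4)
print(r.count)   # 4
r.reset_count()
print(r.count)   # 0
```

Resetting the count is what lets the same wrapped stream report per-batch byte counts rather than a running total.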

Contributor Author

I do not see a way to reset the count w/ BoundedInputStream.

Comment on lines +348 to +352
<dependency>
<groupId>io.opentelemetry.javaagent</groupId>
<artifactId>opentelemetry-javaagent</artifactId>
<!-- This version was selected because it aligns with the version of open telemetry that Accumulo 2.1 is using. -->
<version>2.14.0</version>
Contributor

I wonder if we should add this to the assemble pom and distribute it (assuming it's allowed). That would alleviate having to try and keep the Terraform code in accumulo-testing up to date (apache/accumulo-testing#293 for example).

Contributor Author
@keith-turner keith-turner Dec 15, 2025

I was not sure if we should use the latest version of the agent. I looked at these release notes and found a version that corresponded to the version of OpenTelemetry that Accumulo 2.1 is currently using. However, I am not sure if that matters. There are many later versions of the agent.

Comment on lines 321 to 323
private static final AttributeKey<Long> ENTRIES_READ_KEY =
AttributeKey.longKey("accumulo.scan.entries.read");
private static final AttributeKey<Long> SEEKS_KEY = AttributeKey.longKey("accumulo.scan.seeks");
Contributor

I wonder if we should place these in one location, like we do with the Metrics class.

Contributor

It might be good to make them public too; I saw that the same strings are used in an IT, which could reference the variables instead.

Contributor Author

Made that change in 7ae602a

Contributor
@dlmarion dlmarion left a comment

Looks good

keith-turner added a commit to keith-turner/accumulo that referenced this pull request Dec 16, 2025
CachableBlockFile.Reader.getBCFile() always read the RFile root block
from the index cache even if it would not use it. Modified the method to
only read from the index cache when the data is actually needed.
Noticed this while working on apache#6010, saw that for every data block
read from an rfile it would also read from the index cache.

Ran mvn verify -Psunny w/ these changes.
@keith-turner keith-turner merged commit f844dc0 into apache:2.1 Dec 16, 2025
8 checks passed
@keith-turner keith-turner deleted the scan-tracing2 branch December 16, 2025 19:53
keith-turner added a commit to keith-turner/accumulo that referenced this pull request Dec 16, 2025
CachableBlockFile.Reader.getBCFile() always read the RFile root block
from the index cache even if it would not use it. Modified the method to
only read from the index cache when the data is actually needed.
Noticed this while working on apache#6010, saw that for every data block
read from an rfile it would also read from the index cache.

Ran mvn verify -Psunny w/ these changes.

# Conflicts:
#	core/src/main/java/org/apache/accumulo/core/file/blockfile/impl/CachableBlockFile.java
keith-turner added a commit to keith-turner/accumulo that referenced this pull request Dec 16, 2025
CachableBlockFile.Reader.getBCFile() always read the RFile root block
from the index cache even if it would not use it. Modified the method to
only read from the index cache when the data is actually needed.
Noticed this while working on apache#6010, saw that for every data block
read from an rfile it would also read from the index cache.

Adjusted ScanTracingIT because before this change there was an rfile
index cache read per rfile data block read.  After this change the
number of rfile index reads align w/ the number of rfiles opened, which
in this case aligns with the number of tablets in the test tables.

The existing behavior was likely not a performance problem if the data
was in cache, but could potentially cause thread contention and extra
CPU usage. So it's probably good to fix; it is also nice to make the
tracing stats align better w/ expectations.
keith-turner added a commit that referenced this pull request Dec 18, 2025
CachableBlockFile.Reader.getBCFile() always read the RFile root block
from the index cache even if it would not use it. Modified the method to
only read from the index cache when the data is actually needed.
Noticed this while working on #6010, saw that for every data block
read from an rfile it would also read from the index cache.

Adjusted ScanTracingIT because before this change there was an rfile
index cache read per rfile data block read.  After this change the
number of rfile index reads align w/ the number of rfiles opened, which
in this case aligns with the number of tablets in the test tables.

The existing behavior was likely not a performance problem if the data
was in cache, but could potentially cause thread contention and extra
CPU usage. So it's probably good to fix; it is also nice to make the
tracing stats align better w/ expectations.