Conversation

@deejgregor (Contributor) commented Oct 28, 2025

What Does This Do

This does two main things:

  1. Adds long running traces to the flare report.
  2. Allows flare dumps and individual files from flares to be downloaded via JMX.

There are some other small additions as well, each in its own commit. If some of this isn't desirable and should be rebased out or split into a separate PR, I'm happy to do so--just let me know. I would really like to at least get the long running traces added to the flare report.

Motivation

While adding custom instrumentation to a complex, asynchronous application, we found it challenging to validate whether all spans were end()ed during tests. dd.trace.debug=true and dd.trace.experimental.long-running.enabled=true could be used with some post-processing of debug logs, but that didn't work for our needs because the application breaks with that level of logging. When dd.trace.experimental.long-running.enabled=true is used, the long running traces are sent to Datadog's backend, but they are not searchable until they finish, so we didn't have a good way to find them. This change gives us two ways to access the long running traces list: a flare report or JMX.

I initially started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report for parity with pending traces, I realized that a more generic mechanism for getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route, so I removed the code I had added for the pending/long-running trace MBeans.

Additional Notes

An easy way to enable this for testing is to add these arguments to a JVM with the APM tracer:

    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

You can use this with jmxterm as shown in the examples below.

Example output:

$ echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" |  \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
         -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
[{"service":"pending-traces-test","name":"step-3","resource":"step-3","trace_id":1110088093037488208,"span_id":3740396906142869284,"parent_id":6982939151275616389,"start":1761670337688000209,"duration":0,"error":0,"metrics":{"step.number":3,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-2","resource":"step-2","trace_id":1110088093037488208,"span_id":6468860803773086654,"parent_id":6982939151275616389,"start":1761670337582715042,"duration":0,"error":0,"metrics":{"step.number":2,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-1","resource":"step-1","trace_id":1110088093037488208,"span_id":1210573307183346962,"parent_id":6982939151275616389,"start":1761670337477268167,"duration":0,"error":0,"metrics":{"step.number":1,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}}]
$ echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
Archive:  /tmp/flare.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      71  Defl:N       46  35% 10-28-2025 09:54 8963e853  flare_info.txt
      26  Defl:N       26   0% 10-28-2025 09:54 39f97d4e  tracer_version.txt
    9229  Defl:N     3316  64% 10-28-2025 09:54 f4c7920b  initial_config.txt
     487  Defl:N      231  53% 10-28-2025 09:54 f0284361  jvm_args.txt
      75  Defl:N       66  12% 10-28-2025 09:54 886a98a0  classpath.txt
     144  Defl:N       73  49% 10-28-2025 09:54 433c143d  library_path.txt
     307  Defl:N      170  45% 10-28-2025 09:54 773992bb  dynamic_config.txt
    1196  Defl:N      374  69% 10-28-2025 09:54 7396b38c  tracer_health.txt
      47  Defl:N       42  11% 10-28-2025 09:54 700f06af  span_metrics.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  pending_traces.txt
    2448  Defl:N      500  80% 10-28-2025 09:54 8b69071d  instrumenter_state.txt
      71  Defl:N       70   1% 10-28-2025 09:54 c84166ad  instrumenter_metrics.txt
     923  Defl:N      272  71% 10-28-2025 09:54 1f7f39aa  long_running_traces.txt
     213  Defl:N      130  39% 10-28-2025 09:54 eed91e78  dynamic_instrumentation.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  tracer.log
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  jmxfetch.txt
--------          -------  ---                            -------
   15237             5322  65%                            16 files

Outstanding items

  • Add an integration test that exercises JMX functionality when dd.telemetry.jmx.enabled=true.
  • Limit the number of long running traces added to the flare report, as is already done for the pending trace buffer (MAX_DUMPED_TRACES = 50); see the sketch after this list.
  • Other updates from the list below?
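
For the limit item above, a minimal sketch of how the cap could be applied, reusing the pending-trace buffer's MAX_DUMPED_TRACES = 50 value; the class, method, and type names below are placeholders, not this PR's code:

    import java.util.List;

    // Illustrative sketch only: mirrors the pending-trace buffer's MAX_DUMPED_TRACES = 50 cap.
    // Class, method, and type names are placeholders, not the PR's actual code.
    final class LongRunningDumpLimit {
      static final int MAX_DUMPED_TRACES = 50;

      // Only the first MAX_DUMPED_TRACES traces would be serialized into long_running_traces.txt.
      static <T> List<T> capForFlare(List<T> traces) {
        return traces.size() <= MAX_DUMPED_TRACES
            ? traces
            : traces.subList(0, MAX_DUMPED_TRACES);
      }
    }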

This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:

Note: I had a few fixups that I've merged into the above commits.

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor requested a review from a team as a code owner October 28, 2025 17:00
@aw-dd commented Oct 29, 2025

Jira card for context: APMS-17557

@deejgregor force-pushed the pending-traces-jmx-dump branch from b4bfd1e to e17ca75 on November 12, 2025 20:43
@deejgregor requested a review from a team as a code owner November 12, 2025 20:43
@deejgregor requested review from sarahchen6 and removed request for a team November 12, 2025 20:43
@mcculls added the "tag: community" (Community contribution) label Nov 18, 2025
for (int i = 0; i < limit; i++) {
  writer.write(traces.get(i).getSpans());
}
return writer.getDumpJson();

Contributor

nit: WDYT about this to match other getTracesAsJson() methods that return "[]" rather than an empty string when the json is empty?

Suggested change:
- return writer.getDumpJson();
+ String json = writer.getDumpJson();
+ return json.isEmpty() ? "[]" : json;

(along with a corresponding change in the "getTracesAsJson with no traces" test)

Contributor Author

Oh, good catch! I was meaning to change this, but actually going the other way. I was thinking it would be best to just return an empty string when there are no records for a few reasons:

  1. The existing implementation for pending traces serializes each pending trace as its own JSON record, with a newline between records (JSON Lines style). In this case, it's fine to have an empty string when there are no records.
  2. I think it's slightly more correct to have an empty string when there are no pending/long-running traces instead of []. [] suggests a single pending/long-running trace with no pending spans (uncommon, but it can happen, particularly with a pending trace once all of its spans have finished but before it is processed in the queue).
  3. Doesn't change the existing functionality.

I think [] is a relic of my early days working on this before I understood the existing functionality--I had one heck of a time trying to actually see anything in the pending buffer.

Contributor

Ah that makes sense. Sounds good!

Contributor Author

Actually, this code wasn't even used, so I just removed it. :) This was from the earlier pending traces/long running traces MBeans I had that have since been removed.

when:
healthMetrics.onLongRunningUpdate(3,10,1)
healthMetrics.onLongRunningUpdate(3,10,1,5)
latch.await(10, TimeUnit.SECONDS)

Contributor

nit: add the following line to test the dropped sampling rate?

1 * statsD.count("long-running.dropped_sampling", 5, _)

@sarahchen6 (Contributor)

Hi DJ 👋 Thanks for your patience! Your notes and commit organization were really great for understanding this PR - I found them especially useful. I left two nit comments, but otherwise it looks good. Since this PR introduces some changes (e.g. keeping long running traces tracked in memory), I've brought it up for more sets of eyes ;). I'm out all of next week but will get back to you after if others don't beat me to it. Thanks again for the contribution!

}

- private void addTrace(PendingTrace trace) {
+ private synchronized void addTrace(PendingTrace trace) {
@manuel-alvarez-alvarez (Member) commented Nov 21, 2025

Is synchronization really needed? AFAIK all access to the tracker is done from the single thread at PendingTraceBuffer#Worker

My bad, it's synchronized because it's also used as a reporter.

@manuel-alvarez-alvarez (Member) left a comment:

Overall LGTM, I just added a small comment around syncing, but it still requires a final approval from APM.

@deejgregor (Contributor Author)

Thanks, @sarahchen6 and @manuel-alvarez-alvarez! I'll address the few tweaks suggested. Updates coming shortly.

Synchronized accesses to traceArray in LongRunningTracesTracker,
since the flare reporter can now access the array. Blocking should
not be a concern: addTrace and flushAndCompact are the existing
calls from PendingTraceBuffer's run() loop, and getTracesAsJson is
called by the reporter thread and completes fairly quickly.
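
A minimal sketch of the locking pattern this commit message describes, with simplified fields and method bodies (the real LongRunningTracesTracker differs in detail):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the locking pattern only; fields and method bodies are simplified assumptions.
    class LongRunningTracesTrackerSketch {
      private final List<Object> traceArray = new ArrayList<>();

      // Called from PendingTraceBuffer's worker thread.
      synchronized void addTrace(Object trace) {
        traceArray.add(trace);
      }

      // Also called from the worker thread's run() loop.
      synchronized void flushAndCompact(long nowMillis) {
        // prune finished/expired traces here
      }

      // Called from the flare reporter thread; it holds the same lock only briefly,
      // so the worker thread is blocked for no longer than the snapshot takes.
      synchronized String getTracesAsJson() {
        return traceArray.toString(); // the real code serializes spans to JSON
      }
    }
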
…ature

This allows dumping long running traces when not connected to a
Datadog Agent using the new JMX flare feature. A warning message
will be logged in this case to indicate that long running traces
will not be sent upstream but are available in a flare.

Previously the long running traces buffer would always be empty,
even though the feature was enabled with
dd.trace.experimental.long-running.enabled=true. This led to a
good amount of confusion when I was initially developing a feature
to dump long running traces without a local Datadog Agent running.
The JMX telemetry feature is controlled by dd.telemetry.jmx.enabled
and is disabled by default. It enables JMXFetch telemetry (if
JMXFetch is enabled, which it is by default) and also enables a
new tracer flare MBean at datadog.flare:type=TracerFlare. This new
MBean exposes three operations:

java.lang.String listFlareFiles()
- Returns a list of sources and files available from each source.

java.lang.String getFlareFile(java.lang.String p1,java.lang.String p2)
- Returns a single file from a specific reporter (or flare source).
- If the file name ends in ".txt", it is returned as-is; otherwise
  it is base64 encoded.

java.lang.String generateFullFlareZip()
- Returns a full flare dump, base64 encoded.
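
For orientation, a minimal sketch of what a standard MBean exposing these three operations could look like; the interface/class names and stub bodies are assumptions for illustration, and only the operation names and the datadog.flare:type=TracerFlare ObjectName come from this description:

    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    // Sketch only: names and bodies here are illustrative assumptions, not the PR's code.
    public interface TracerFlareMBean {
      // Lists the flare sources and the files each source can provide.
      String listFlareFiles();

      // Returns one file from one source; ".txt" content as-is, anything else base64 encoded.
      String getFlareFile(String source, String fileName);

      // Returns the full flare zip, base64 encoded so it fits in a JMX String return value.
      String generateFullFlareZip();
    }

    class TracerFlare implements TracerFlareMBean {
      public String listFlareFiles() { return "[]"; }
      public String getFlareFile(String source, String fileName) { return ""; }
      public String generateFullFlareZip() { return ""; }

      // Registers the bean under the ObjectName used in the jmxterm examples.
      static void register() throws Exception {
        ManagementFactory.getPlatformMBeanServer()
            .registerMBean(new TracerFlare(), new ObjectName("datadog.flare:type=TracerFlare"));
      }
    }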

An easy way to enable this for testing is to add these arguments:
    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

To test, you can use jmxterm (https://github.com/jiaqi/jmxterm) like
this:

echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent

echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    jq .

echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
…ng priority

This likely isn't an important metric to track, but I noticed these
traces were the only ones not reflected in existing LongRunningTraces
metrics, so I thought it might be good to add for completeness.
@deejgregor force-pushed the pending-traces-jmx-dump branch from e17ca75 to a95fbb0 on November 21, 2025 22:03
@deejgregor (Contributor Author)

Fixups made, rebased on latest master, and force pushed.

@mcculls (Contributor) commented Jan 7, 2026

Hi @deejgregor - first thanks for the contributions and Happy New Year!

Is it possible to separate out the JMX feature from the long-running traces addition? Having these as separate PRs would make it easier to complete the review and avoid one blocking the other.

@deejgregor (Contributor Author) commented Jan 7, 2026

Is it possible to separate out the JMX feature from the long-running traces addition? Having these as separate PRs would make it easier to complete the review and avoid one blocking the other.

Absolutely! I'll work on that. In the process, I noticed there are a few other commits in this PR, and I'd like your thoughts on where they should go, @mcculls.

First, I'll break out the two big pieces this way:

  1. Add long_running_traces.json to flare report - 4efe118 and the related refactoring in Trace dump refactor in preparation for adding long running traces - 4d1369d
  2. Add JMX MBean for getting tracer flare files - c4850cf

The additional commits are:

  1. Track long running traces when agent does not support long running feature -- 488dc52
  2. LongRunningTracesTracker: add metric for traces dropped due to sampling priority -- f10c146
  3. PendingTraceBuffer: Keep track of how often we write around the buffer -- a95fbb0

I think 1 and 3 are useful; 2 was more of a completeness item I noticed, so I'm fine if you want to take it or leave it. 1 is a somewhat more substantial change, and I want to do a little more testing because I now remember that I've seen the warning when I didn't expect it.

My proposal: add 2 and 3 (or just 3) from the additional list to the long running traces PR, and move 1 to its own PR (or drop it if you don't want that piece).

@deejgregor closed this Jan 8, 2026