Conversation

@deejgregor (Contributor) commented Oct 28, 2025

What Does This Do

This does two main things:

  1. Adds long running traces to the flare report.
  2. Allows flare dumps and individual files from flares to be downloaded via JMX.

There are some other small additions as well, each in its own commit. If some of this isn't desirable and should be rebased out or split into a separate PR, I'm happy to do so--just let me know. I would really like to at least get the long running traces added to the flare report.

Motivation

While adding custom instrumentation to a complex, asynchronous application, we found it challenging to validate whether all spans were end()ed during tests. dd.trace.debug=true and dd.trace.experimental.long-running.enabled=true could be used with some post-processing of debug logs, but that didn't work for our needs because the application breaks with that level of logging. When dd.trace.experimental.long-running.enabled=true is used, the long running traces are sent to Datadog's backend, but they are not searchable until they finish, so we didn't have a good way to find them. This change gives us two ways to access the long running traces list: a flare report or JMX.

I initially started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report for parity with pending traces, I realized that a more generic mechanism for getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route, so I removed the code I had added for the pending/long-running trace MBeans.

Additional Notes

An easy way to enable this for testing is to add these arguments to a JVM with the APM tracer:

    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

You can use this with jmxterm as shown in the examples below.

Example output:

$ echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" |  \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
         -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
[{"service":"pending-traces-test","name":"step-3","resource":"step-3","trace_id":1110088093037488208,"span_id":3740396906142869284,"parent_id":6982939151275616389,"start":1761670337688000209,"duration":0,"error":0,"metrics":{"step.number":3,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-2","resource":"step-2","trace_id":1110088093037488208,"span_id":6468860803773086654,"parent_id":6982939151275616389,"start":1761670337582715042,"duration":0,"error":0,"metrics":{"step.number":2,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-1","resource":"step-1","trace_id":1110088093037488208,"span_id":1210573307183346962,"parent_id":6982939151275616389,"start":1761670337477268167,"duration":0,"error":0,"metrics":{"step.number":1,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}}]
$ echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
Archive:  /tmp/flare.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      71  Defl:N       46  35% 10-28-2025 09:54 8963e853  flare_info.txt
      26  Defl:N       26   0% 10-28-2025 09:54 39f97d4e  tracer_version.txt
    9229  Defl:N     3316  64% 10-28-2025 09:54 f4c7920b  initial_config.txt
     487  Defl:N      231  53% 10-28-2025 09:54 f0284361  jvm_args.txt
      75  Defl:N       66  12% 10-28-2025 09:54 886a98a0  classpath.txt
     144  Defl:N       73  49% 10-28-2025 09:54 433c143d  library_path.txt
     307  Defl:N      170  45% 10-28-2025 09:54 773992bb  dynamic_config.txt
    1196  Defl:N      374  69% 10-28-2025 09:54 7396b38c  tracer_health.txt
      47  Defl:N       42  11% 10-28-2025 09:54 700f06af  span_metrics.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  pending_traces.txt
    2448  Defl:N      500  80% 10-28-2025 09:54 8b69071d  instrumenter_state.txt
      71  Defl:N       70   1% 10-28-2025 09:54 c84166ad  instrumenter_metrics.txt
     923  Defl:N      272  71% 10-28-2025 09:54 1f7f39aa  long_running_traces.txt
     213  Defl:N      130  39% 10-28-2025 09:54 eed91e78  dynamic_instrumentation.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  tracer.log
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  jmxfetch.txt
--------          -------  ---                            -------
   15237             5322  65%                            16 files

Outstanding items

  • Add an integration test that exercises JMX functionality when dd.telemetry.jmx.enabled=true.
  • Limit the number of long running traces added to the flare report, as is already done for the pending trace buffer (MAX_DUMPED_TRACES = 50); see the sketch after this list.
  • Other updates from the list below?
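
For the limit item above, a minimal sketch of how the cap could be applied, reusing the pending-trace buffer's MAX_DUMPED_TRACES = 50 value; the class, method, and type names below are placeholders, not this PR's code:

    import java.util.List;

    // Illustrative sketch only: mirrors the pending-trace buffer's MAX_DUMPED_TRACES = 50 cap.
    // Class, method, and type names are placeholders, not the PR's actual code.
    final class LongRunningDumpLimit {
      static final int MAX_DUMPED_TRACES = 50;

      // Only the first MAX_DUMPED_TRACES traces would be serialized into long_running_traces.txt.
      static <T> List<T> capForFlare(List<T> traces) {
        return traces.size() <= MAX_DUMPED_TRACES
            ? traces
            : traces.subList(0, MAX_DUMPED_TRACES);
      }
    }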

This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:

Note: I had a few fixups that I've merged into the above commits.

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor requested a review from a team as a code owner October 28, 2025 17:00
@aw-dd commented Oct 29, 2025

Jira card for context: APMS-17557

@deejgregor force-pushed the pending-traces-jmx-dump branch from b4bfd1e to e17ca75 on November 12, 2025 20:43
@deejgregor requested a review from a team as a code owner November 12, 2025 20:43
@deejgregor requested review from sarahchen6 and removed request for a team November 12, 2025 20:43
@mcculls added the "tag: community" (Community contribution) label Nov 18, 2025
for (int i = 0; i < limit; i++) {
  writer.write(traces.get(i).getSpans());
}
return writer.getDumpJson();

Contributor

nit: WDYT about this to match other getTracesAsJson() methods that return "[]" rather than an empty string when the json is empty?

Suggested change:
- return writer.getDumpJson();
+ String json = writer.getDumpJson();
+ return json.isEmpty() ? "[]" : json;

(along with a corresponding change in the "getTracesAsJson with no traces" test)

Contributor Author

Oh, good catch! I was meaning to change this, but actually going the other way. I was thinking it would be best to just return an empty string when there are no records for a few reasons:

  1. The existing implementation for pending traces serializes each pending trace as its own JSON record, with a newline between records (JSON Lines style). In this case, it's fine to have an empty string when there are no records.
  2. I think it's slightly more correct to have an empty string when there are no pending/long-running traces instead of []. [] suggests a single pending/long-running trace with no pending spans (uncommon, but it can happen, particularly with a pending trace once all of its spans have finished but before it is processed in the queue).
  3. Doesn't change the existing functionality.

I think [] is a relic of my early days working on this before I understood the existing functionality--I had one heck of a time trying to actually see anything in the pending buffer.

Contributor

Ah that makes sense. Sounds good!

Contributor Author

Actually, this code wasn't even used, so I just removed it. :) This was from the earlier pending traces/long running traces MBeans I had that have since been removed.

when:
healthMetrics.onLongRunningUpdate(3,10,1)
healthMetrics.onLongRunningUpdate(3,10,1,5)
latch.await(10, TimeUnit.SECONDS)

Contributor

nit: add the following line to test the dropped sampling rate?

1 * statsD.count("long-running.dropped_sampling", 5, _)

@sarahchen6 (Contributor)

Hi DJ 👋 Thanks for your patience! Your notes and commit organization were really great for understanding this PR - I found them especially useful. I left two nit comments, but otherwise it looks good. Since this PR introduces some changes (e.g. keeping long running traces tracked in memory), I've brought it up for more sets of eyes ;). I'm out all of next week but will get back to you after if others don't beat me to it. Thanks again for the contribution!

}

- private void addTrace(PendingTrace trace) {
+ private synchronized void addTrace(PendingTrace trace) {
@manuel-alvarez-alvarez (Member) commented Nov 21, 2025

Is synchronization really needed? AFAIK all access to the tracker is done from the single thread at PendingTraceBuffer#Worker

My bad, it's synchronized because it's also used as a reporter.

@manuel-alvarez-alvarez (Member) left a comment:

Overall LGTM, I just added a small comment around syncing, but it still requires a final approval from APM.

@deejgregor (Contributor Author)

Thanks, @sarahchen6 and @manuel-alvarez-alvarez! I'll address the few tweaks suggested. Updates coming shortly.

Synchronized accesses to traceArray in LongRunningTracesTracker,
since the flare reporter can now access the array. Blocking should
not be a concern: addTrace and flushAndCompact are the existing
calls from PendingTraceBuffer's run() loop, and getTracesAsJson is
called by the reporter thread and completes fairly quickly.
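
A minimal sketch of the locking pattern this commit message describes, with simplified fields and method bodies (the real LongRunningTracesTracker differs in detail):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the locking pattern only; fields and method bodies are simplified assumptions.
    class LongRunningTracesTrackerSketch {
      private final List<Object> traceArray = new ArrayList<>();

      // Called from PendingTraceBuffer's worker thread.
      synchronized void addTrace(Object trace) {
        traceArray.add(trace);
      }

      // Also called from the worker thread's run() loop.
      synchronized void flushAndCompact(long nowMillis) {
        // prune finished/expired traces here
      }

      // Called from the flare reporter thread; it holds the same lock only briefly,
      // so the worker thread is blocked for no longer than the snapshot takes.
      synchronized String getTracesAsJson() {
        return traceArray.toString(); // the real code serializes spans to JSON
      }
    }
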
…ature

This allows dumping long running traces when not connected to a
Datadog Agent using the new JMX flare feature. A warning message
will be logged in this case to indicate that long running traces
will not be sent upstream but are available in a flare.

Previously the long running traces buffer would always be empty,
even though the feature was enabled with
dd.trace.experimental.long-running.enabled=true. This led to a
good amount of confusion when I was initially developing a feature
to dump long running traces without a local Datadog Agent running.
The JMX telemetry feature is controlled by dd.telemetry.jmx.enabled
and is disabled by default. It enables JMXFetch telemetry (if
JMXFetch is enabled, which it is by default) and also enables a
new tracer flare MBean at datadog.flare:type=TracerFlare. This new
MBean exposes three operations:

java.lang.String listFlareFiles()
- Returns a list of sources and files available from each source.

java.lang.String getFlareFile(java.lang.String p1,java.lang.String p2)
- Returns a single file from a specific reporter (or flare source).
- If the file name ends in ".txt", it is returned as-is; otherwise
  it is base64 encoded.

java.lang.String generateFullFlareZip()
- Returns a full flare dump, base64 encoded.
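
For orientation, a minimal sketch of what a standard MBean exposing these three operations could look like; the interface/class names and stub bodies are assumptions for illustration, and only the operation names and the datadog.flare:type=TracerFlare ObjectName come from this description:

    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    // Sketch only: names and bodies here are illustrative assumptions, not the PR's code.
    public interface TracerFlareMBean {
      // Lists the flare sources and the files each source can provide.
      String listFlareFiles();

      // Returns one file from one source; ".txt" content as-is, anything else base64 encoded.
      String getFlareFile(String source, String fileName);

      // Returns the full flare zip, base64 encoded so it fits in a JMX String return value.
      String generateFullFlareZip();
    }

    class TracerFlare implements TracerFlareMBean {
      public String listFlareFiles() { return "[]"; }
      public String getFlareFile(String source, String fileName) { return ""; }
      public String generateFullFlareZip() { return ""; }

      // Registers the bean under the ObjectName used in the jmxterm examples.
      static void register() throws Exception {
        ManagementFactory.getPlatformMBeanServer()
            .registerMBean(new TracerFlare(), new ObjectName("datadog.flare:type=TracerFlare"));
      }
    }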

An easy way to enable this for testing is to add these arguments:
    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

To test, you can use jmxterm (https://github.com/jiaqi/jmxterm) like
this:

echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent

echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    jq .

echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
…ng priority

This likely isn't an important metric to track, but I noticed these
traces were the only ones not reflected in existing LongRunningTraces
metrics, so I thought it might be good to add for completeness.
@deejgregor force-pushed the pending-traces-jmx-dump branch from e17ca75 to a95fbb0 on November 21, 2025 22:03
@deejgregor (Contributor Author)

Fixups made, rebased on latest master, and force pushed.

@mcculls (Contributor) commented Jan 7, 2026

Hi @deejgregor - first thanks for the contributions and Happy New Year!

Is it possible to separate out the JMX feature from the long-running traces addition? Having these as separate PRs would make it easier to complete the review and avoid one blocking the other.

@deejgregor (Contributor Author) commented Jan 7, 2026

Is it possible to separate out the JMX feature from the long-running traces addition? Having these as separate PRs would make it easier to complete the review and avoid one blocking the other.

Absolutely! I'll work on that. In the process, I noticed there are a few other commits in this PR, and I'd like your thoughts on where they should go, @mcculls.

First, I'll break out the two big pieces this way:

  1. Add long_running_traces.json to flare report - 4efe118 and the related refactoring in Trace dump refactor in preparation for adding long running traces - 4d1369d
  2. Add JMX MBean for getting tracer flare files - c4850cf

The additional commits are:

  1. Track long running traces when agent does not support long running feature -- 488dc52
  2. LongRunningTracesTracker: add metric for traces dropped due to sampling priority -- f10c146
  3. PendingTraceBuffer: Keep track of how often we write around the buffer -- a95fbb0

I think 1 and 3 are useful; 2 was more of a completeness item I noticed, so I'm fine if you want to take it or leave it. 1 is a somewhat more substantial change, and I want to do a little more testing because I now remember that I've seen the warning when I didn't expect it.

My proposal: add 2 and 3 (or just 3) from the additional list to the long running traces PR, and move 1 to its own PR (or drop it if you don't want that piece).

@deejgregor closed this Jan 8, 2026