Add long running traces to flare report, allow flare files to be downloaded with JMX #9874
Conversation
Jira card for context: APMS-17557
b4bfd1e to e17ca75 (force-push)
    for (int i = 0; i < limit; i++) {
      writer.write(traces.get(i).getSpans());
    }
    return writer.getDumpJson();
nit: WDYT about this to match other getTracesAsJson() methods that return "[]" rather than an empty string when the json is empty?
    - return writer.getDumpJson();
    + String json = writer.getDumpJson();
    + return json.isEmpty() ? "[]" : json;
(along with a corresponding change in the "getTracesAsJson with no traces" test)
Oh, good catch! I was meaning to change this, but actually going the other way. I was thinking it would be best to just return an empty string when there are no records for a few reasons:
- The existing implementation for pending traces serializes each pending trace as its own JSON record, with a newline between records (JSON Lines style). In this case, it's fine to have an empty string when there are no records.
- I think it's slightly more correct to have an empty string when there are no pending/long-running traces instead of using `[]`. `[]` suggests a single pending/long-running trace with no pending spans (uncommon, but it can happen, particularly with pending traces once all the spans are finished but before it's processed in the queue).
- Doesn't change the existing functionality.
I think [] is a relic of my early days working on this before I understood the existing functionality--I had one heck of a time trying to actually see anything in the pending buffer.
Ah that makes sense. Sounds good!
Actually, this code wasn't even used, so I just removed it. :) This was from the earlier pending traces/long running traces MBeans I had that have since been removed.
      when:
    - healthMetrics.onLongRunningUpdate(3,10,1)
    + healthMetrics.onLongRunningUpdate(3,10,1,5)
      latch.await(10, TimeUnit.SECONDS)
nit: add the following line to test the dropped sampling rate?
1 * statsD.count("long-running.dropped_sampling", 5, _)
Hi DJ 👋 Thanks for your patience! Your notes and commit organization were really great for understanding this PR - I found them especially useful. I left two nit comments, but otherwise it looks good. Since this PR introduces some changes (e.g. keeping long running traces tracked in memory), I've brought it up for more sets of eyes ;). I'm out all of next week but will get back to you after if others don't beat me to it. Thanks again for the contribution!
      }

    - private void addTrace(PendingTrace trace) {
    + private synchronized void addTrace(PendingTrace trace) {
Is synchronization really needed? AFAIK all accesses to the tracker are done from a single thread at PendingTraceBuffer#Worker.
My bad, it's synchronized as it's used as a reporter.
dd-trace-core/src/main/java/datadog/trace/core/LongRunningTracesTracker.java (outdated comment, resolved)
manuel-alvarez-alvarez left a comment:
Overall LGTM, I just added a small comment around syncing, but it still requires a final approval from APM.
Thanks, @sarahchen6 and @manuel-alvarez-alvarez! I'll address the few tweaks suggested. Updates coming shortly.
Synchronized accesses to traceArray in LongRunningTracesTracker since the flare reporter can now access the array. This shouldn't be a concern for blocking because addTrace and flushAndCompact are the existing calls from PendingTraceBuffer's run() loop, and getTracesAsJson is called by the reporter thread and will complete fairly quickly.
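To illustrate the threading model described here, a minimal sketch, assuming simplified names (the real LongRunningTracesTracker has more state and logic):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the locking pattern described above. Names are simplified
// stand-ins; this is not the actual LongRunningTracesTracker implementation.
class LongRunningTracesTrackerSketch {
  // Hypothetical stand-in for datadog.trace.core.PendingTrace.
  static class PendingTrace {}

  private final List<PendingTrace> traceArray = new ArrayList<>();

  // Called from PendingTraceBuffer's single worker thread.
  synchronized void addTrace(PendingTrace trace) {
    traceArray.add(trace);
  }

  // Also called from the worker thread's run() loop.
  synchronized void flushAndCompact(long nowMillis) {
    // write out finished traces, drop expired ones, compact the array ...
  }

  // Called from the flare reporter thread; holds the lock only briefly.
  synchronized String getTracesAsJson() {
    // serialize traceArray ...
    return "";
  }
}
```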
…ature

This allows dumping long running traces via the new JMX flare feature when not connected to a Datadog Agent. A warning message will be logged in this case to indicate that long running traces will not be sent upstream but are available in a flare. Previously the long running traces buffer would always be empty, even though the feature was enabled with dd.trace.experimental.long-running.enabled=true. This led to a good amount of confusion when I was initially developing a feature to dump long running traces without a local Datadog Agent running.
The JMX telemetry feature is controlled by dd.telemetry.jmx.enabled and is disabled by default. It enables JMXFetch telemetry (if JMXFetch is enabled, which it is by default) and also enables a new tracer flare MBean at datadog.flare:type=TracerFlare. This new MBean exposes three operations:
java.lang.String listFlareFiles()
- Returns a list of sources and files available from each source.
java.lang.String getFlareFile(java.lang.String p1,java.lang.String p2)
- Returns a single file from a specific reporter (or flare source).
- If the file ends in ".txt", it is returned as-is, otherwise it is
base64 encoded.
java.lang.String generateFullFlareZip()
- Returns a full flare dump, base64 encoded.
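As a rough illustration of how these operations could be invoked programmatically rather than via jmxterm, here is a plain-Java JMX client sketch. The MBean name and operation names come from the description above; the service URL, host/port, and output path are assumptions matching the example flags below:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative client for the datadog.flare:type=TracerFlare MBean.
public class FlareClient {
  public static void main(String[] args) throws Exception {
    // Assumes the target JVM was started with the remote JMX flags shown below.
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      ObjectName flare = new ObjectName("datadog.flare:type=TracerFlare");

      // List the available flare sources and files.
      String listing =
          (String) conn.invoke(flare, "listFlareFiles", new Object[0], new String[0]);
      System.out.println(listing);

      // Fetch the full flare zip (returned base64 encoded) and write it to disk.
      String zipBase64 =
          (String) conn.invoke(flare, "generateFullFlareZip", new Object[0], new String[0]);
      Files.write(Paths.get("/tmp/flare.zip"), Base64.getDecoder().decode(zipBase64));
    } finally {
      connector.close();
    }
  }
}
```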
An easy way to enable this for testing is to add these arguments:
-Ddd.telemetry.jmx.enabled=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.host=127.0.0.1
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
To test, you can use jmxterm (https://github.com/jiaqi/jmxterm) like
this:
echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
-jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
-jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
jq .
echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
-jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
base64 -d > /tmp/flare.zip && \
unzip -v /tmp/flare.zip
…ng priority

This likely isn't an important metric to track, but I noticed these traces were the only ones not reflected in existing LongRunningTraces metrics, so I thought it might be good to add for completeness.
e17ca75 to a95fbb0 (force-push)
Fixups made, rebased on latest master, and force pushed.
Hi @deejgregor - first, thanks for the contributions and Happy New Year! Is it possible to separate out the JMX feature from the long-running traces addition? Having these as separate PRs would make it easier to complete the review and avoid one blocking the other.
Absolutely! I'll work on that. In the process, I noticed there are a few other commits in this PR that I'd like your thoughts on regarding where they should go, @mcculls. First, I'll break out the two big pieces this way:
The additional commits are:
I think 1 and 3 are useful, and 2 was more of a completeness item I noticed, so I'm fine if you want to take or leave it. 1 is a bit more of a substantial change, and I realize I want to do a little more testing because I now remember that I've seen the warning when I didn't expect it. My proposal: add 2 and 3 (or just 3) from the additional list to the long running traces PR, and move 1 to its own PR (or drop it if you don't want that piece).
Closing this in favor of three individual PRs:
What Does This Do
This does two main things:
1. Adds long running traces to the flare report.
2. Allows flare files to be downloaded over JMX.
There are some other small additions, as well, each in its own commit. If some of this isn't desirable and should be rebased out or should be split into a separate PR, I'm happy to do so--just let me know. I would really like to at least get the long running traces added to the flare report.
Motivation
While adding custom instrumentation to a complex, asynchronous application we found it was challenging to validate whether all spans were `end()`ed during tests. `dd.trace.debug=true` and `dd.trace.experimental.long-running.enabled=true` could be used with some post-processing of debug logs, however this didn't work for our needs because the application breaks with that level of logging. When `dd.trace.experimental.long-running.enabled=true` is used, the long running traces are sent to Datadog's backend, however they are not searchable until they are finished, so we didn't have a good way to find them. This change gives us two ways to access the long running traces list: either a flare report or JMX.

I initially started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report for parity with pending traces, I realized that a more generic mechanism to allow getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route and I removed the code I had added for pending/long running trace MBeans.
Additional Notes
An easy way to enable this for testing is to add these arguments to a JVM with the APM tracer:
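(These are the same flags listed in the commit message above.)

    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false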
You can use this with jmxterm as shown in the examples below.
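For example, to list the available flare files (the same jmxterm invocation shown in the commit message above):

    echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
      java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent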
Example output:
Outstanding items
- … `dd.telemetry.jmx.enabled=true`.
- … (`MAX_DUMPED_TRACES = 50`).

This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:
- Adds `synchronized` to a few methods (see commit comment for details).
- When `features.supportsLongRunning()` is false, the traces are kept in the `TRACKED` state, compared to the `NOT_TRACKED` state previously.
- Keeps the existing `add*` methods as-is, but this could be simplified by refactoring the add* methods into Reporter instances (with a new signature that passes a few more arguments to `addReportToFlare`); see the sketch below. I think this refactoring would be a good change to make--let me know and I'll happily do that. I also considered not making the zip file an intermediary, and if you like, I could look at what that change might be, as well.

Note: I had a few fixups that I've merged into the above commits.
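A rough, illustrative-only sketch of that refactoring idea, assuming a hypothetical reporter interface (the tracer's real flare reporter API may use different names and signatures):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical reporter interface; a stand-in for the tracer's flare reporter API.
interface FlareReporterSketch {
  void addReportToFlare(ZipOutputStream zip) throws IOException;
}

// Sketch of moving the long running traces dump behind a reporter that writes
// its file directly into the flare zip.
class LongRunningTracesReporterSketch implements FlareReporterSketch {
  @Override
  public void addReportToFlare(ZipOutputStream zip) throws IOException {
    zip.putNextEntry(new ZipEntry("long_running_traces.txt"));
    zip.write(getTracesAsJson().getBytes(StandardCharsets.UTF_8));
    zip.closeEntry();
  }

  private String getTracesAsJson() {
    return ""; // serialize the tracked long running traces here
  }
}
```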
Contributor Checklist
- Add the `type:` and (`comp:` or `inst:`) labels in addition to any useful labels
- Don't use `close`, `fix` or any linking keywords when referencing an issue. Use `solves` instead, and assign the PR milestone to the issue

Jira ticket: [PROJ-IDENT]