@deejgregor deejgregor commented Jan 8, 2026

What Does This Do

Currently, the tracer silently drops long-running traces, even if the feature is enabled, when it cannot connect to an agent or when the agent does not support the long-running traces feature.

Motivation

Previously, the long-running traces buffer would always be empty, even though the feature was enabled with dd.trace.experimental.long-running.enabled=true. This caused considerable confusion when I was initially developing a feature to dump long-running traces without a local Datadog Agent running.

Additional Notes

This was originally part of #9874, which is being broken out into a few individual PRs.

Commit notes:

Assuming #10309 is merged, the only remaining case where a metric isn't tracked when an entry is removed from the tracker is when the trace list is empty:

      if (trace.empty()) {
        trace.compareAndSetLongRunningState(WRITE_RUNNING_SPANS, NOT_TRACKED);
        cleanSlot(i);
        continue;
      }

If a metric is desired for this case, let me know and I'll add it (long-running.completed?). That would cover every case where a trace is removed from the tracker.
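If such a metric were added, the branch above could look roughly like this standalone sketch. The `State` enum names follow the PR; the `completedCount` counter and the hypothetical "long-running.completed" name are stand-ins for the tracer's real metrics plumbing, not the actual API:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the tracker's empty-trace removal branch.
class CompletedMetricSketch {
  enum State { TRACKED, WRITE_RUNNING_SPANS, NOT_TRACKED }

  // Stand-in for a health-metrics counter in the real tracer.
  static final AtomicInteger completedCount = new AtomicInteger();

  // Mirrors the empty-trace branch: CAS the state out of
  // WRITE_RUNNING_SPANS, then count the removal before the
  // caller frees the slot (cleanSlot(i) in the real code).
  static void removeEmptyTrace(AtomicReference<State> state) {
    state.compareAndSet(State.WRITE_RUNNING_SPANS, State.NOT_TRACKED);
    completedCount.incrementAndGet(); // hypothetical "long-running.completed"
  }

  public static void main(String[] args) {
    AtomicReference<State> s = new AtomicReference<>(State.WRITE_RUNNING_SPANS);
    removeEmptyTrace(s);
    System.out.println(s.get() + " completed=" + completedCount.get());
    // NOT_TRACKED completed=1
  }
}
```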

Contributor Checklist

Jira ticket: [PROJ-IDENT]

Log a warning message when the long running traces feature is
enabled but the tracer is not connected to a Datadog Agent that
supports receiving long running traces.

Previously the long running traces buffer would always be empty,
even though the feature was enabled with
dd.trace.experimental.long-running.enabled=true. This led to a
good amount of confusion when I was initially developing a feature
to dump long running traces without a local Datadog Agent running.
This allows dumping long-running traces via the new JMX flare
feature when not connected to a Datadog Agent.

This introduces a change to the state handling for long-running traces.
Previously, if features.supportsLongRunning() was false, the trace's
slot was cleaned (but note that the state would never transition; see
discussion below). With this commit, these traces stay in their slot in
the TRACKED state until another condition removes them, which is the
same as what would happen if features.supportsLongRunning() returned
true.

A note about state transitions:

When the "if (trace.empty() || !features.supportsLongRunning())" block
was entered previously, the trace's state would be transitioned to
NOT_TRACKED, but only if the state was WRITE_RUNNING_SPANS. This would
only be true when traces were empty AND they had passed through a flush
cycle (which would transition them from TRACKED to WRITE_RUNNING_SPANS).
Previously, when features.supportsLongRunning() was false, traces never
made that transition, so they would always be in the TRACKED state when
cleanSlot was called. The only other consumer of the state is in
PendingTrace, and it only checks for the WRITE_RUNNING_SPANS state, so
I think this is not a problem. There might be a similar state
transition issue if the sampling priority is ever reduced after a flush
cycle, but it also looks innocuous.
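The argument above can be checked with a small standalone model. The enum names come from the PR; the `writeRunningSpans` method is a simplified stand-in for the PendingTrace-side check, not the real implementation:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal model of the state machine discussed above.
class StateTransitionSketch {
  enum State { TRACKED, WRITE_RUNNING_SPANS, NOT_TRACKED }

  // Simplified stand-in for the only other consumer (PendingTrace),
  // which only cares whether the state is WRITE_RUNNING_SPANS.
  static boolean writeRunningSpans(AtomicReference<State> state) {
    return state.get() == State.WRITE_RUNNING_SPANS;
  }

  public static void main(String[] args) {
    // supportsLongRunning() == false: the trace never left TRACKED,
    // so the CAS guarded on WRITE_RUNNING_SPANS is a no-op.
    AtomicReference<State> s = new AtomicReference<>(State.TRACKED);
    s.compareAndSet(State.WRITE_RUNNING_SPANS, State.NOT_TRACKED);
    System.out.println("state=" + s.get() + " writes=" + writeRunningSpans(s));
    // state=TRACKED writes=false: what the consumer sees is unchanged.
  }
}
```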

This likely isn't an important metric to track, but I noticed that
these long-running traces were the only ones not reflected in existing
metrics when dropped from the tracker, so I thought it might be good to
add this metric for completeness.

This change introduces a new metric, "long-running.dropped_sampling",
to count traces that are dropped when negativeOrNullPriority(trace)
is true.

There is an existing metric, "long-running.dropped", for long-running
traces that are dropped on input to the tracker when there are no
slots free. That metric name was kept as-is to not disturb any
existing consumers downstream. If that is not a concern, it might
be good to rename the existing metric to clarify that it captures
traces dropped on input.
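To make the distinction concrete, here is a minimal standalone sketch of the two drop counters. The metric names come from the PR; the map-based counter and the two helper methods are hypothetical stand-ins for the tracer's real metrics plumbing:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the two drop metrics discussed above.
class DropMetricsSketch {
  static final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

  static void count(String metric) {
    counters.computeIfAbsent(metric, k -> new AtomicLong()).incrementAndGet();
  }

  // Existing metric: trace dropped on input, no free slot in the tracker.
  static void onAddFailedNoSlot() { count("long-running.dropped"); }

  // New metric: trace dropped later, negativeOrNullPriority(trace) was true.
  static void onDroppedForSampling() { count("long-running.dropped_sampling"); }

  public static void main(String[] args) {
    onAddFailedNoSlot();
    onDroppedForSampling();
    onDroppedForSampling();
    System.out.println("dropped=" + counters.get("long-running.dropped"));
    System.out.println("dropped_sampling=" + counters.get("long-running.dropped_sampling"));
    // dropped=1
    // dropped_sampling=2
  }
}
```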
@deejgregor (Contributor, Author) commented:

Marking this as a draft to prevent merging for now, as I remember seeing the warning generated in one case when the tracer could talk to a modern agent, so I want to look more into that.

Otherwise, feel free to review.

@PerfectSlayer PerfectSlayer added the tag: community Community contribution label Jan 9, 2026