Fix Netty HTTP span lifecycle for chunked/streaming responses#10656

Open
gtukmachev wants to merge 1 commit into DataDog:master from gtukmachev:gsc-577-fix-ktor-streaming-instrumentation

Conversation

@gtukmachev

Fix Netty HTTP span lifecycle for chunked/streaming responses

Problem

Applications using chunked HTTP responses (e.g. Ktor's respondOutputStream) report near-zero latency in APM — the span closes when response headers are sent, not when the stream finishes.

// Ktor — respondOutputStream sends headers immediately, streams body async
call.respondOutputStream {
    repeat(5) { writeSomeChunk(); delay(1000) }  // 5s total — DD reports ~0ms
}

This affects any Netty-backed server that sends HttpResponse + multiple HttpContent + LastHttpContent separately (i.e. chunked transfer encoding).


Root Cause

Bug 1 — span always closed on HttpResponse, no LastHttpContent handling

The original handler had a single dispatch branch. Every HttpResponse (headers-only or full) immediately finished the span. LastHttpContent — the actual end of a chunked stream — was never inspected; it fell through to ctx.write(msg, prm) silently.

// Before: only one branch, span always closed at header-send time
if (span == null || !(msg instanceof HttpResponse)) {
    ctx.write(msg, prm);   // LastHttpContent passes through here — span untouched
    return;
}
// ... span.finish() always called here, even for chunked headers

The fix adds explicit routing for all four Netty message types. FullHttpResponse must be checked first because it extends both LastHttpContent and HttpResponse; without that ordering it would be caught by the wrong branch.

// After: three explicit branches; everything else (chunks, unrelated messages) falls through
if (msg instanceof FullHttpResponse) { handleFullHttpResponse(...); return; }  // finish immediately
if (msg instanceof HttpResponse)     { handleHttpResponse(...);     return; }  // headers only — don't finish
if (msg instanceof LastHttpContent)  { handleLastHttpContent(...);  return; }  // finish here
ctx.write(msg, prm);  // intermediate HttpContent chunks + unrelated messages — pass through
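The ordering constraint can be checked in isolation with a stand-in hierarchy (simplified interfaces that mirror Netty's type relationships; these are not the real Netty classes):

```java
// Stand-in hierarchy: in Netty 4.1, FullHttpResponse extends both
// HttpResponse and LastHttpContent, so it matches either instanceof check.
interface HttpObject {}
interface HttpResponse extends HttpObject {}
interface LastHttpContent extends HttpObject {}
interface FullHttpResponse extends HttpResponse, LastHttpContent {}

final class RoutingDemo {
    // Returns the branch a message takes under the fixed ordering.
    static String route(HttpObject msg) {
        if (msg instanceof FullHttpResponse) return "finish-immediately";
        if (msg instanceof HttpResponse)     return "headers-only";   // don't finish
        if (msg instanceof LastHttpContent)  return "finish-here";
        return "pass-through";                                        // chunks, etc.
    }

    public static void main(String[] args) {
        // FullHttpResponse is also an HttpResponse; only the check order
        // keeps it out of the headers-only branch.
        System.out.println(route(new FullHttpResponse() {})); // finish-immediately
        System.out.println(route(new HttpResponse() {}));     // headers-only
        System.out.println(route(new LastHttpContent() {}));  // finish-here
    }
}
```

Swapping the first two checks would route every FullHttpResponse into the headers-only branch and the span would never finish.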

Bug 2 — keep-alive race condition

Under HTTP keep-alive, Netty's event loop can process channelRead for the next request (overwriting CONTEXT_ATTRIBUTE_KEY with the new span) before the pending write task for the previous response's LastHttpContent runs. Result: handleLastHttpContent finishes the new request's span with ~1-chunk duration.

Fix: a dedicated STREAMING_CONTEXT_KEY channel attribute, set when chunked headers are sent and read (then cleared) by LastHttpContent — immune to overwrite by the next request.

// AttributeKeys.java
public static final AttributeKey<Context> STREAMING_CONTEXT_KEY =
    attributeKey("datadog.server.streaming.context");

// Set when chunked headers go out
ctx.channel().attr(STREAMING_CONTEXT_KEY).set(storedContext);

// Read in LastHttpContent — safe from keep-alive overwrite
Context streamingCtx = ctx.channel().attr(STREAMING_CONTEXT_KEY).getAndRemove();
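A minimal sketch of the race and why the dedicated key survives it, using plain `AtomicReference`s in place of Netty channel attributes (the names and `String` spans are illustrative only; `getAndSet(null)` stands in for `Attribute#getAndRemove`):

```java
import java.util.concurrent.atomic.AtomicReference;

// Model of the two per-channel attributes under HTTP keep-alive.
final class KeepAliveRaceDemo {
    static final AtomicReference<String> CONTEXT_ATTRIBUTE_KEY = new AtomicReference<>();
    static final AtomicReference<String> STREAMING_CONTEXT_KEY = new AtomicReference<>();

    public static void main(String[] args) {
        // Request 1: chunked headers go out -> context stored in both keys.
        CONTEXT_ATTRIBUTE_KEY.set("span-1");
        STREAMING_CONTEXT_KEY.set("span-1");

        // Keep-alive: channelRead for request 2 runs before request 1's
        // pending LastHttpContent write task, overwriting the shared key.
        CONTEXT_ATTRIBUTE_KEY.set("span-2");

        // LastHttpContent for request 1: the shared key now holds the wrong span...
        System.out.println(CONTEXT_ATTRIBUTE_KEY.get());           // span-2 (bug)
        // ...while the streaming key still holds the right one, consumed once.
        System.out.println(STREAMING_CONTEXT_KEY.getAndSet(null)); // span-1 (fix)
    }
}
```

Consuming the value (rather than just reading it) also guarantees the next request on the same channel never sees a stale streaming context.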

Files Changed

  • netty-4.1/.../HttpServerResponseTracingHandler.java — routing by message type, streaming context key
  • netty-common/.../AttributeKeys.java — added STREAMING_CONTEXT_KEY

Verification

Concurrent load test (48 requests, 8 threads): streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.


Error capturing for streaming responses

The agent cannot change the HTTP status code once headers are sent. However, application code can mark the span as an error before LastHttpContent closes it, and the agent will preserve those tags.

The window is: after the exception is thrown inside the streaming lambda, before the framework closes the OutputStream (which writes LastHttpContent and triggers span.finish()). Mark the span during that catch block and it will be captured correctly.

Required pattern (application side)

Add io.opentracing:opentracing-api and io.opentracing:opentracing-util as dependencies (the DD agent auto-registers as the GlobalTracer). Then wrap the streaming body in a try/catch:

// build.gradle / pom.xml
implementation("io.opentracing:opentracing-api:0.33.0")
implementation("io.opentracing:opentracing-util:0.33.0")

// in the Kotlin source file
import io.opentracing.util.GlobalTracer

call.respondOutputStream {
    try {
        // ... write chunks ...
        if (someErrorCondition) {
            throw RuntimeException("stream failed mid-way")
        }
    } catch (e: Exception) {
        // Mark the active DD span as error BEFORE LastHttpContent closes it.
        // HTTP status code cannot be changed (headers already sent), but the
        // span will be tagged as an error and appear correctly in APM.
        GlobalTracer.get().activeSpan()?.let { span ->
            span.setTag("error", true)
            span.setTag("error.type",    e::class.java.name)
            span.setTag("error.message", e.message ?: "unknown")
        }
        throw e   // re-throw so the framework closes the stream
    }
}

Why this works

The DD agent instruments Kotlin coroutines, so GlobalTracer.get().activeSpan() returns the HTTP server span even inside respondOutputStream's lambda on Dispatchers.IO. The catch block runs before Ktor closes the OutputStream and emits LastHttpContent, giving the agent time to record the error before span.finish() is called.

Verified result (DataDog APM)

| Endpoint | Duration | Status | error.type | error.message |
|----------|----------|--------|------------|---------------|
| POST /testing/streaming/slow (error) | ~5025ms | error | java.lang.RuntimeException | Streaming slow: error after all chunks sent |
| POST /testing/streaming/slow (success) | ~5024ms | ok | | |
| POST /testing/streaming/fast (error) | ~67ms | error | java.lang.RuntimeException | Streaming fast: error after all chunks sent |
| POST /testing/streaming/fast (success) | ~64ms | ok | | |

Full streaming duration is preserved for both success and error cases.

The original HttpServerResponseTracingHandler had a single dispatch branch
that always closed the span when HttpResponse (headers) was written.
LastHttpContent — the actual end of a chunked stream — was silently passed
through with no span involvement, causing near-zero latency in APM for any
streaming/chunked response.

Two bugs fixed:

Bug 1 — span always closed on HttpResponse, no LastHttpContent handling.
Added explicit routing for all four Netty message types:
  - FullHttpResponse  → finish span immediately (checked FIRST: extends both
                        LastHttpContent and HttpResponse)
  - HttpResponse      → chunked headers only; save context, don't finish span
  - LastHttpContent   → finish span here to capture full streaming duration
  - HttpContent       → intermediate chunk; pass through

Bug 2 — keep-alive race condition.
Under HTTP keep-alive, channelRead for the next request can overwrite
CONTEXT_ATTRIBUTE_KEY before the pending LastHttpContent write task runs,
causing handleLastHttpContent to finish the wrong span (~1-chunk duration).
Fix: STREAMING_CONTEXT_KEY, a separate channel attribute set at chunked
header time and consumed (getAndRemove) by LastHttpContent — immune to
overwrite by the next request's span.

Verified: concurrent load test (48 requests, 8 threads):
  streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gtukmachev gtukmachev marked this pull request as ready for review February 22, 2026 17:55
@gtukmachev gtukmachev requested a review from a team as a code owner February 22, 2026 17:55
@gtukmachev
Author

gtukmachev commented Feb 22, 2026

If needed, I can attach an archive with a test Kotlin Gradle Ktor project and a script.

The fix was done by AI Agent "Claude Code" with my careful guidance.

@gtukmachev gtukmachev closed this Feb 22, 2026
@gtukmachev gtukmachev reopened this Feb 23, 2026
