Fix Netty HTTP span lifecycle for chunked/streaming responses#10656

Open
gtukmachev wants to merge 1 commit into DataDog:master from gtukmachev:gsc-577-fix-ktor-streaming-instrumentation

Conversation

@gtukmachev

Fix Netty HTTP span lifecycle for chunked/streaming responses

Problem

Applications using chunked HTTP responses (e.g. Ktor's respondOutputStream) report near-zero latency in APM — the span closes when response headers are sent, not when the stream finishes.

// Ktor — respondOutputStream sends headers immediately, streams body async
call.respondOutputStream {
    repeat(5) { writeSomeChunk(); delay(1000) }  // 5s total — DD reports ~0ms
}

This affects any Netty-backed server that sends HttpResponse + multiple HttpContent + LastHttpContent separately (i.e. chunked transfer encoding).


Root Cause

Bug 1 — span always closed on HttpResponse, no LastHttpContent handling

The original handler had a single dispatch branch. Every HttpResponse (headers-only or full) immediately finished the span. LastHttpContent — the actual end of a chunked stream — was never inspected; it fell through to ctx.write(msg, prm) silently.

// Before: only one branch, span always closed at header-send time
if (span == null || !(msg instanceof HttpResponse)) {
    ctx.write(msg, prm);   // LastHttpContent passes through here — span untouched
    return;
}
// ... span.finish() always called here, even for chunked headers

The fix adds explicit routing for all four Netty message types. FullHttpResponse must be checked first because it extends both LastHttpContent and HttpResponse; without that ordering it would be caught by the wrong branch.

// After: three explicit branches; everything else (chunks, unrelated messages) falls through
if (msg instanceof FullHttpResponse) { handleFullHttpResponse(...); return; }  // finish immediately
if (msg instanceof HttpResponse)     { handleHttpResponse(...);     return; }  // headers only — don't finish
if (msg instanceof LastHttpContent)  { handleLastHttpContent(...);  return; }  // finish here
ctx.write(msg, prm);  // intermediate HttpContent chunks + unrelated messages — pass through
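The ordering constraint can be checked in isolation with a stand-in hierarchy (simplified interfaces that mirror Netty's type relationships; these are not the real Netty classes):

```java
// Stand-in hierarchy: in Netty 4.1, FullHttpResponse extends both
// HttpResponse and LastHttpContent, so it matches either instanceof check.
interface HttpObject {}
interface HttpResponse extends HttpObject {}
interface LastHttpContent extends HttpObject {}
interface FullHttpResponse extends HttpResponse, LastHttpContent {}

final class RoutingDemo {
    // Returns the branch a message takes under the fixed ordering.
    static String route(HttpObject msg) {
        if (msg instanceof FullHttpResponse) return "finish-immediately";
        if (msg instanceof HttpResponse)     return "headers-only";   // don't finish
        if (msg instanceof LastHttpContent)  return "finish-here";
        return "pass-through";                                        // chunks, etc.
    }

    public static void main(String[] args) {
        // FullHttpResponse is also an HttpResponse; only the check order
        // keeps it out of the headers-only branch.
        System.out.println(route(new FullHttpResponse() {})); // finish-immediately
        System.out.println(route(new HttpResponse() {}));     // headers-only
        System.out.println(route(new LastHttpContent() {}));  // finish-here
    }
}
```

Swapping the first two checks would route every FullHttpResponse into the headers-only branch and the span would never finish.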

Bug 2 — keep-alive race condition

Under HTTP keep-alive, Netty's event loop can process channelRead for the next request (overwriting CONTEXT_ATTRIBUTE_KEY with the new span) before the pending write task for the previous response's LastHttpContent runs. Result: handleLastHttpContent finishes the new request's span with ~1-chunk duration.

Fix: a dedicated STREAMING_CONTEXT_KEY channel attribute, set when chunked headers are sent and read (then cleared) by LastHttpContent — immune to overwrite by the next request.

// AttributeKeys.java
public static final AttributeKey<Context> STREAMING_CONTEXT_KEY =
    attributeKey("datadog.server.streaming.context");

// Set when chunked headers go out
ctx.channel().attr(STREAMING_CONTEXT_KEY).set(storedContext);

// Read in LastHttpContent — safe from keep-alive overwrite
Context streamingCtx = ctx.channel().attr(STREAMING_CONTEXT_KEY).getAndRemove();
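A minimal sketch of the race and why the dedicated key survives it, using plain `AtomicReference`s in place of Netty channel attributes (the names and `String` spans are illustrative only; `getAndSet(null)` stands in for `Attribute#getAndRemove`):

```java
import java.util.concurrent.atomic.AtomicReference;

// Model of the two per-channel attributes under HTTP keep-alive.
final class KeepAliveRaceDemo {
    static final AtomicReference<String> CONTEXT_ATTRIBUTE_KEY = new AtomicReference<>();
    static final AtomicReference<String> STREAMING_CONTEXT_KEY = new AtomicReference<>();

    public static void main(String[] args) {
        // Request 1: chunked headers go out -> context stored in both keys.
        CONTEXT_ATTRIBUTE_KEY.set("span-1");
        STREAMING_CONTEXT_KEY.set("span-1");

        // Keep-alive: channelRead for request 2 runs before request 1's
        // pending LastHttpContent write task, overwriting the shared key.
        CONTEXT_ATTRIBUTE_KEY.set("span-2");

        // LastHttpContent for request 1: the shared key now holds the wrong span...
        System.out.println(CONTEXT_ATTRIBUTE_KEY.get());           // span-2 (bug)
        // ...while the streaming key still holds the right one, consumed once.
        System.out.println(STREAMING_CONTEXT_KEY.getAndSet(null)); // span-1 (fix)
    }
}
```

Consuming the value (rather than just reading it) also guarantees the next request on the same channel never sees a stale streaming context.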

Files Changed

  • netty-4.1/.../HttpServerResponseTracingHandler.java — routing by message type, streaming context key
  • netty-common/.../AttributeKeys.java — added STREAMING_CONTEXT_KEY

Verification

Concurrent load test (48 requests, 8 threads): streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.


Error capturing for streaming responses

The agent cannot change the HTTP status code once headers are sent. However, application code can mark the span as an error before LastHttpContent closes it, and the agent will preserve those tags.

The window is: after the exception is thrown inside the streaming lambda, before the framework closes the OutputStream (which writes LastHttpContent and triggers span.finish()). Mark the span during that catch block and it will be captured correctly.

Required pattern (application side)

Add io.opentracing:opentracing-api and io.opentracing:opentracing-util as dependencies (the DD agent auto-registers as the GlobalTracer). Then wrap the streaming body in a try/catch:

// build.gradle / pom.xml
implementation("io.opentracing:opentracing-api:0.33.0")
implementation("io.opentracing:opentracing-util:0.33.0")

// in the Kotlin source file
import io.opentracing.util.GlobalTracer

call.respondOutputStream {
    try {
        // ... write chunks ...
        if (someErrorCondition) {
            throw RuntimeException("stream failed mid-way")
        }
    } catch (e: Exception) {
        // Mark the active DD span as error BEFORE LastHttpContent closes it.
        // HTTP status code cannot be changed (headers already sent), but the
        // span will be tagged as an error and appear correctly in APM.
        GlobalTracer.get().activeSpan()?.let { span ->
            span.setTag("error", true)
            span.setTag("error.type",    e::class.java.name)
            span.setTag("error.message", e.message ?: "unknown")
        }
        throw e   // re-throw so the framework closes the stream
    }
}

Why this works

The DD agent instruments Kotlin coroutines, so GlobalTracer.get().activeSpan() returns the HTTP server span even inside respondOutputStream's lambda on Dispatchers.IO. The catch block runs before Ktor closes the OutputStream and emits LastHttpContent, giving the agent time to record the error before span.finish() is called.

Verified result (DataDog APM)

| Endpoint | Duration | Status | error.type | error.message |
|----------|----------|--------|------------|---------------|
| POST /testing/streaming/slow (error) | ~5025ms | error | java.lang.RuntimeException | Streaming slow: error after all chunks sent |
| POST /testing/streaming/slow (success) | ~5024ms | ok | | |
| POST /testing/streaming/fast (error) | ~67ms | error | java.lang.RuntimeException | Streaming fast: error after all chunks sent |
| POST /testing/streaming/fast (success) | ~64ms | ok | | |

Full streaming duration is preserved for both success and error cases.

The original HttpServerResponseTracingHandler had a single dispatch branch
that always closed the span when HttpResponse (headers) was written.
LastHttpContent — the actual end of a chunked stream — was silently passed
through with no span involvement, causing near-zero latency in APM for any
streaming/chunked response.

Two bugs fixed:

Bug 1 — span always closed on HttpResponse, no LastHttpContent handling.
Added explicit routing for all four Netty message types:
  - FullHttpResponse  → finish span immediately (checked FIRST: extends both
                        LastHttpContent and HttpResponse)
  - HttpResponse      → chunked headers only; save context, don't finish span
  - LastHttpContent   → finish span here to capture full streaming duration
  - HttpContent       → intermediate chunk; pass through

Bug 2 — keep-alive race condition.
Under HTTP keep-alive, channelRead for the next request can overwrite
CONTEXT_ATTRIBUTE_KEY before the pending LastHttpContent write task runs,
causing handleLastHttpContent to finish the wrong span (~1-chunk duration).
Fix: STREAMING_CONTEXT_KEY, a separate channel attribute set at chunked
header time and consumed (getAndRemove) by LastHttpContent — immune to
overwrite by the next request's span.

Verified: concurrent load test (48 requests, 8 threads):
  streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gtukmachev gtukmachev marked this pull request as ready for review February 22, 2026 17:55
@gtukmachev gtukmachev requested a review from a team as a code owner February 22, 2026 17:55
@gtukmachev
Author

gtukmachev commented Feb 22, 2026

If needed, I can attach an archive with a test Kotlin Gradle Ktor project and a script.

The fix was done by AI Agent "Claude Code" with my careful guidance.

@gtukmachev gtukmachev closed this Feb 22, 2026
@gtukmachev gtukmachev reopened this Feb 23, 2026
