Fix Netty HTTP span lifecycle for chunked/streaming responses #10656
Open
gtukmachev wants to merge 1 commit into DataDog:master from
Conversation
The original HttpServerResponseTracingHandler had a single dispatch branch
that always closed the span when HttpResponse (headers) was written.
LastHttpContent — the actual end of a chunked stream — was silently passed
through with no span involvement, causing near-zero latency in APM for any
streaming/chunked response.
Two bugs fixed:
Bug 1 — span always closed on HttpResponse, no LastHttpContent handling.
Added explicit routing for all four Netty message types:
- FullHttpResponse → finish span immediately (checked FIRST: extends both
LastHttpContent and HttpResponse)
- HttpResponse → chunked headers only; save context, don't finish span
- LastHttpContent → finish span here to capture full streaming duration
- HttpContent → intermediate chunk; pass through
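The four-way routing above can be sketched as follows. The interfaces here are simplified stand-ins for Netty's type hierarchy (the real types live in `io.netty.handler.codec.http`), defined locally so the sketch is self-contained; this is an illustration of the branch ordering, not the actual handler code.

```java
// Simplified stand-ins for Netty's HTTP message hierarchy (hypothetical,
// for illustration; mirrors the fact that FullHttpResponse extends both
// HttpResponse and LastHttpContent).
interface HttpObject {}
interface HttpResponse extends HttpObject {}
interface HttpContent extends HttpObject {}
interface LastHttpContent extends HttpContent {}
interface FullHttpResponse extends HttpResponse, LastHttpContent {}

public class SpanRouting {
    // Decide what the tracing handler should do with each outbound message.
    static String route(HttpObject msg) {
        if (msg instanceof FullHttpResponse) {
            // Complete response in one message: finish the span immediately.
            // Must be checked FIRST, or the HttpResponse branch would match.
            return "finish-now";
        } else if (msg instanceof HttpResponse) {
            // Chunked headers only: save context, keep the span open.
            return "save-context";
        } else if (msg instanceof LastHttpContent) {
            // End of the chunked stream: finish the span here.
            return "finish-streaming";
        } else if (msg instanceof HttpContent) {
            // Intermediate chunk: pass through untouched.
            return "pass-through";
        }
        return "pass-through";
    }

    public static void main(String[] args) {
        System.out.println(route(new FullHttpResponse() {})); // finish-now
        System.out.println(route(new HttpResponse() {}));     // save-context
        System.out.println(route(new LastHttpContent() {}));  // finish-streaming
        System.out.println(route(new HttpContent() {}));      // pass-through
    }
}
```

If the `FullHttpResponse` check came after the `HttpResponse` check, a full response would take the "save-context" branch and the span would never close, which is why the ordering is called out explicitly.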
Bug 2 — keep-alive race condition.
Under HTTP keep-alive, channelRead for the next request can overwrite
CONTEXT_ATTRIBUTE_KEY before the pending LastHttpContent write task runs,
causing handleLastHttpContent to finish the wrong span (~1-chunk duration).
Fix: STREAMING_CONTEXT_KEY, a separate channel attribute set at chunked
header time and consumed (getAndRemove) by LastHttpContent — immune to
overwrite by the next request's span.
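A minimal model of that fix, with hypothetical names: Netty's channel `Attribute`s are approximated here by `AtomicReference`s, and `getAndSet(null)` stands in for `Attribute.getAndRemove()`. This only demonstrates why the dedicated key survives the keep-alive overwrite; it is not the agent's code.

```java
import java.util.concurrent.atomic.AtomicReference;

public class StreamingContextDemo {
    // Stand-ins for the two channel attributes (AtomicReference models
    // Netty's Attribute<T> closely enough for this demonstration).
    static final AtomicReference<String> CONTEXT_ATTRIBUTE_KEY = new AtomicReference<>();
    static final AtomicReference<String> STREAMING_CONTEXT_KEY = new AtomicReference<>();

    public static void main(String[] args) {
        // Request A: chunked headers are written; the span is saved in the
        // shared context key AND in the dedicated streaming key.
        CONTEXT_ATTRIBUTE_KEY.set("span-A");
        STREAMING_CONTEXT_KEY.set("span-A");

        // Keep-alive race: channelRead for request B runs before request A's
        // pending LastHttpContent write task, overwriting the shared key.
        CONTEXT_ATTRIBUTE_KEY.set("span-B");

        // Request A's LastHttpContent: consume the streaming key.
        // getAndSet(null) models attr.getAndRemove().
        String toFinish = STREAMING_CONTEXT_KEY.getAndSet(null);

        // Without the dedicated key, span-B would have been finished here
        // with ~1-chunk duration.
        System.out.println(toFinish);                    // span-A
        System.out.println(CONTEXT_ATTRIBUTE_KEY.get()); // span-B
    }
}
```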
Verified: concurrent load test (48 requests, 8 threads):
streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
If needed, I can attach an archive with a Kotlin Gradle Ktor test project and a script. The fix was produced by the AI agent "Claude Code" under my careful guidance.
Fix Netty HTTP span lifecycle for chunked/streaming responses
Problem

Applications using chunked HTTP responses (e.g. Ktor's `respondOutputStream`) report near-zero latency in APM: the span closes when response headers are sent, not when the stream finishes.

This affects any Netty-backed server that sends `HttpResponse` + multiple `HttpContent` + `LastHttpContent` separately (i.e. chunked transfer encoding).

Root Cause
Bug 1 — span always closed on `HttpResponse`, no `LastHttpContent` handling

The original handler had a single dispatch branch: every `HttpResponse` (headers-only or full) immediately finished the span. `LastHttpContent`, the actual end of a chunked stream, was never inspected; it fell through to `ctx.write(msg, prm)` silently.

The fix adds explicit routing for all four Netty message types. `FullHttpResponse` must be checked first because it extends both `LastHttpContent` and `HttpResponse`; without that ordering it would be caught by the wrong branch.

Bug 2 — keep-alive race condition
Under HTTP keep-alive, Netty's event loop can process `channelRead` for the next request (overwriting `CONTEXT_ATTRIBUTE_KEY` with the new span) before the pending write task for the previous response's `LastHttpContent` runs. As a result, `handleLastHttpContent` finishes the new request's span with a duration of roughly one chunk.

Fix: a dedicated `STREAMING_CONTEXT_KEY` channel attribute, set when chunked headers are sent and read (then cleared) by `LastHttpContent`, making it immune to overwrite by the next request.

Files Changed

- `netty-4.1/.../HttpServerResponseTracingHandler.java` — routing by message type, streaming context key
- `netty-common/.../AttributeKeys.java` — added `STREAMING_CONTEXT_KEY`

Verification
Concurrent load test (48 requests, 8 threads): `streaming/slow` → ~5020ms, `streaming/fast` → ~61ms in DataDog APM. No outliers.

Error capturing for streaming responses

The agent cannot change the HTTP status code once headers are sent. However, application code can mark the span as an error before `LastHttpContent` closes it, and the agent will preserve those tags.

The window is: after the exception is thrown inside the streaming lambda, and before the framework closes the `OutputStream` (which writes `LastHttpContent` and triggers `span.finish()`). Mark the span during that catch block and it will be captured correctly.

Required pattern (application side)

Add `io.opentracing:opentracing-api` and `io.opentracing:opentracing-util` as dependencies (the DD agent auto-registers as the `GlobalTracer`). Then wrap the streaming body in a `try`/`catch`:
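A sketch of that pattern in Java. The `Span` interface and `RecordingSpan` below are minimal stand-ins for the OpenTracing API so the example runs standalone; in a real application you would obtain the span via `io.opentracing.util.GlobalTracer.get().activeSpan()` inside the streaming body's catch block.

```java
import java.util.Map;

public class StreamingErrorCapture {
    // Minimal stand-in for io.opentracing.Span (illustration only; the
    // real interface has these setTag/log overloads, among others).
    interface Span {
        Span setTag(String key, boolean value);
        Span setTag(String key, String value);
        Span log(Map<String, ?> fields);
    }

    // Mark the active span as an error. This must run in the streaming
    // body's catch block, i.e. BEFORE the framework closes the OutputStream,
    // which writes LastHttpContent and lets the agent call span.finish().
    static void markError(Span activeSpan, Throwable t) {
        if (activeSpan == null) return; // no active trace; nothing to mark
        activeSpan.setTag("error", true);
        activeSpan.setTag("error.msg", String.valueOf(t.getMessage()));
        activeSpan.log(Map.of("event", "error", "error.object", t));
    }

    public static void main(String[] args) {
        Span span = new RecordingSpan(); // stands in for the active HTTP span
        try {
            // out.write(chunk); ... streaming writes would happen here,
            // then the body fails mid-stream:
            throw new RuntimeException("error after all chunks sent");
        } catch (RuntimeException e) {
            markError(span, e); // window: before LastHttpContent / span.finish()
            // rethrow or swallow according to application policy
        }
    }

    // Tiny recording implementation so the sketch is runnable.
    static class RecordingSpan implements Span {
        public Span setTag(String k, boolean v) { System.out.println(k + "=" + v); return this; }
        public Span setTag(String k, String v) { System.out.println(k + "=" + v); return this; }
        public Span log(Map<String, ?> f) { System.out.println("log:" + f.size() + " fields"); return this; }
    }
}
```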
Why this works

The DD agent instruments Kotlin coroutines, so `GlobalTracer.get().activeSpan()` returns the HTTP server span even inside `respondOutputStream`'s lambda on `Dispatchers.IO`. The catch block runs before Ktor closes the `OutputStream` and emits `LastHttpContent`, giving the agent time to record the error before `span.finish()` is called.

Verified result (DataDog APM)
- `POST /testing/streaming/slow` (error): `java.lang.RuntimeException`, "Streaming slow: error after all chunks sent"
- `POST /testing/streaming/slow` (success)
- `POST /testing/streaming/fast` (error): `java.lang.RuntimeException`, "Streaming fast: error after all chunks sent"
- `POST /testing/streaming/fast` (success)

Full streaming duration is preserved for both success and error cases.