Fix zombie process when proxy stops due to health check failure#3543
Fix zombie process when proxy stops due to health check failure#3543dmjb merged 8 commits intostacklok:mainfrom
Conversation
When health checks fail for a remote MCP server, the TransparentProxy calls Stop() which closes its shutdownCh. However, HTTPTransport.IsRunning() only checked its own shutdownCh, not the proxy's. This caused the runner's status monitor to keep returning true, preventing the process from exiting. Changes: - Add IsRunning() method to the Proxy interface - Update HTTPTransport.IsRunning() to also check proxy.IsRunning() - Add IsRunning() implementations to HTTPSSEProxy and HTTPProxy (streamable) - Add unit tests for the IsRunning behavior - Add e2e test for health check zombie prevention The TransparentProxy already had IsRunning() - this change ensures the HTTPTransport properly delegates to it when checking running status. Signed-off-by: Greg Katz <gkatz@indeed.com>
|
@gkatz2 Thanks for this change. If you can address the pieces of feedback I have here, I'm happy to merge. |
|
Looks like you may also need to run |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3543 +/- ##
==========================================
+ Coverage 65.51% 65.59% +0.07%
==========================================
Files 406 406
Lines 39996 40019 +23
==========================================
+ Hits 26202 26249 +47
+ Misses 11762 11736 -26
- Partials 2032 2034 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The previous commit generated this mock with full paths in the header comments, but main uses relative paths. Regenerated to match main and fix CI's "Verify Code Generation" check.
|
Thanks, @dmjb. Just pushed a commit for the mock generation issue. Looking at the code suggestion now. |
|
Re: Context parameter in You're right - the context isn't used by any of the implementations. I added it for consistency with Re: Code restructure I think the suggestion has an issue - using Here are two corrected versions: Option A - declare at top, single return: proxyRunning := true
var err error
if t.proxy != nil {
proxyRunning, err = t.proxy.IsRunning(ctx)
if err != nil {
return false, err
}
}
return proxyRunning, nilOption B - early return: if t.proxy != nil {
proxyRunning, err := t.proxy.IsRunning(ctx)
if err != nil {
return false, err
}
return proxyRunning, nil
}
return true, nilLet me know which you prefer and I'll update! |
The test was incorrectly expecting the process to exit after health check failures. The actual expected behavior is that the transport detects "not running" and triggers a restart attempt. Changed the test to verify: - "restart needed" appears in logs (proves transport detected failure) - Server is still listed (restart in progress) This matches the fix's intent: prevent zombie state by detecting when the proxy stops and triggering the restart mechanism.
IsRunning() checks a channel's state - an instant, non-blocking operation. Context is useful for operations that might block or need cancellation, which doesn't apply here. The parameter was originally added for consistency with Start/Stop, but that consistency adds no value since the semantics are different. Addresses PR review feedback from dmjb.
- Rewrite test to use status API instead of fragile log file parsing - Fix mock server to return 500 on all paths when unhealthy - Add TOOLHIVE_HEALTH_CHECK_INTERVAL env var for configurable intervals (allows test to complete in ~12s instead of 30+ seconds) Signed-off-by: Greg Katz <gkatz@indeed.com>
|
@dmjb - I wanted to get your thoughts on one aspect of this PR. The latest commit adds a I realize this expands the scope a bit - adding production code to support faster E2E testing. A few options:
Happy to go either direction based on your preference. What do you think? |
|
@gkatz2 I think the interval env var is a good idea, let's keep it 👍 |
Address code review feedback by restructuring the proxy check to declare variables upfront and use assignment instead of short declaration, avoiding variable shadowing.
|
@gkatz2 Thanks for addressing the feedback. I'll merge when the build goes green. |
|
@dmjb Implemented the code restructure per your suggestion. I added proxyRunning := true
var err error
if t.proxy != nil {
proxyRunning, err = t.proxy.IsRunning()
if err != nil {
return false, err
}
}
return proxyRunning, nilCI is passing. Update: Oops, didn't see you already merged it! Thanks. |
Summary
IsRunning()method to theProxyinterfaceHTTPTransport.IsRunning()to check proxy state in addition to its own shutdown channelProblem
When health checks fail 3 consecutive times,
TransparentProxy.Stop()is called which closes the proxy's shutdown channel. However,HTTPTransport.IsRunning()only checks its own shutdown channel, not the proxy's. This causes the runner's status monitor to keep returningtrue, preventing the process from exiting and blocking the self-healing restart mechanism.Solution
HTTPTransport.IsRunning()now also callsproxy.IsRunning()to detect when the proxy has stopped itself due to health check failure. This enables the runner to detect the stopped state and trigger automatic restart.Test plan
IsRunning()behavior (pkg/transport/http_test.go)test/e2e/health_check_zombie_test.go)Fixes #3542