Skip to content

Fix zombie process when proxy stops due to health check failure#3543

Merged
dmjb merged 8 commits intostacklok:mainfrom
gkatz2:fix/health-check-zombie-process
Feb 4, 2026
Merged

Fix zombie process when proxy stops due to health check failure#3543
dmjb merged 8 commits intostacklok:mainfrom
gkatz2:fix/health-check-zombie-process

Conversation

@gkatz2
Copy link
Contributor

@gkatz2 gkatz2 commented Feb 1, 2026

Summary

  • Fix zombie process issue where the ToolHive proxy process continues running after health check failure triggers proxy shutdown
  • Add IsRunning() method to the Proxy interface
  • Update HTTPTransport.IsRunning() to check proxy state in addition to its own shutdown channel
  • Add unit tests and e2e test for the fix

Problem

When health checks fail 3 consecutive times, TransparentProxy.Stop() is called which closes the proxy's shutdown channel. However, HTTPTransport.IsRunning() only checks its own shutdown channel, not the proxy's. This causes the runner's status monitor to keep returning true, preventing the process from exiting and blocking the self-healing restart mechanism.

Solution

HTTPTransport.IsRunning() now also calls proxy.IsRunning() to detect when the proxy has stopped itself due to health check failure. This enables the runner to detect the stopped state and trigger automatic restart.

Test plan

  • Unit tests for IsRunning() behavior (pkg/transport/http_test.go)
  • E2e test verifying runner detects proxy shutdown (test/e2e/health_check_zombie_test.go)
  • Manual testing confirmed fix enables self-healing restart

Fixes #3542

When health checks fail for a remote MCP server, the TransparentProxy
calls Stop() which closes its shutdownCh. However, HTTPTransport.IsRunning()
only checked its own shutdownCh, not the proxy's. This caused the runner's
status monitor to keep returning true, preventing the process from exiting.

Changes:
- Add IsRunning() method to the Proxy interface
- Update HTTPTransport.IsRunning() to also check proxy.IsRunning()
- Add IsRunning() implementations to HTTPSSEProxy and HTTPProxy (streamable)
- Add unit tests for the IsRunning behavior
- Add e2e test for health check zombie prevention

The TransparentProxy already had IsRunning() - this change ensures the
HTTPTransport properly delegates to it when checking running status.

Signed-off-by: Greg Katz <gkatz@indeed.com>
@dmjb
Copy link
Contributor

dmjb commented Feb 3, 2026

@gkatz2 Thanks for this change. If you can address the pieces of feedback I have here, I'm happy to merge.

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Feb 3, 2026
@dmjb
Copy link
Contributor

dmjb commented Feb 3, 2026

Looks like you may also need to run task gen locally to re-update some mocks.

@codecov
Copy link

codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 41.37931% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.59%. Comparing base (ac82979) to head (48c4dff).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/transport/proxy/httpsse/http_proxy.go 0.00% 6 Missing ⚠️
pkg/transport/proxy/streamable/streamable_proxy.go 0.00% 6 Missing ⚠️
...g/transport/proxy/transparent/transparent_proxy.go 57.14% 2 Missing and 1 partial ⚠️
pkg/runner/runner.go 0.00% 1 Missing ⚠️
pkg/transport/stdio.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3543      +/-   ##
==========================================
+ Coverage   65.51%   65.59%   +0.07%     
==========================================
  Files         406      406              
  Lines       39996    40019      +23     
==========================================
+ Hits        26202    26249      +47     
+ Misses      11762    11736      -26     
- Partials     2032     2034       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The previous commit generated this mock with full paths in the header
comments, but main uses relative paths. Regenerated to match main and
fix CI's "Verify Code Generation" check.
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 3, 2026
@gkatz2
Copy link
Contributor Author

gkatz2 commented Feb 3, 2026

Thanks, @dmjb. Just pushed a commit for the mock generation issue. Looking at the code suggestion now.

@gkatz2
Copy link
Contributor Author

gkatz2 commented Feb 3, 2026

Re: Context parameter in IsRunning

You're right - the context isn't used by any of the implementations. I added it for consistency with Start/Stop, but since IsRunning is just checking a channel state (instant, non-blocking), context serves no purpose here. Will remove it.

Re: Code restructure

I think the suggestion has an issue - using := inside the if block creates a new proxyRunning that shadows the outer one, meaning that although the code appears to be using the same instance of the proxyRunning variable throughout, there are actually two. The bug manifests when the proxy has stopped (e.g., due to health check failure) - IsRunning() returns false, nil, the inner proxyRunning is set to false, but when the if block exits, the outer proxyRunning (still true) is returned.

Here are two corrected versions:

Option A - declare at top, single return:

proxyRunning := true
var err error
if t.proxy != nil {
    proxyRunning, err = t.proxy.IsRunning(ctx)
    if err != nil {
        return false, err
    }
}
return proxyRunning, nil

Option B - early return:

if t.proxy != nil {
    proxyRunning, err := t.proxy.IsRunning(ctx)
    if err != nil {
        return false, err
    }
    return proxyRunning, nil
}
return true, nil

Let me know which you prefer and I'll update!

The test was incorrectly expecting the process to exit after health
check failures. The actual expected behavior is that the transport
detects "not running" and triggers a restart attempt.

Changed the test to verify:
- "restart needed" appears in logs (proves transport detected failure)
- Server is still listed (restart in progress)

This matches the fix's intent: prevent zombie state by detecting
when the proxy stops and triggering the restart mechanism.
IsRunning() checks a channel's state - an instant, non-blocking operation.
Context is useful for operations that might block or need cancellation,
which doesn't apply here. The parameter was originally added for
consistency with Start/Stop, but that consistency adds no value since
the semantics are different.

Addresses PR review feedback from dmjb.
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 3, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 3, 2026
- Rewrite test to use status API instead of fragile log file parsing
- Fix mock server to return 500 on all paths when unhealthy
- Add TOOLHIVE_HEALTH_CHECK_INTERVAL env var for configurable intervals
  (allows test to complete in ~12s instead of 30+ seconds)

Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot removed the size/M Medium PR: 300-599 lines changed label Feb 3, 2026
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Feb 3, 2026
@gkatz2
Copy link
Contributor Author

gkatz2 commented Feb 3, 2026

@dmjb - I wanted to get your thoughts on one aspect of this PR.

The latest commit adds a TOOLHIVE_HEALTH_CHECK_INTERVAL environment variable to the transparent proxy. This allows the E2E test to use a 1-second health check interval instead of the default 10 seconds, which reduces test runtime from 30+ seconds to ~12 seconds.

I realize this expands the scope a bit - adding production code to support faster E2E testing. A few options:

  1. Keep it as-is: Being able to control the health check interval could be valuable for production use cases (e.g., adjusting for slow/unreliable backends, or tightening intervals for faster failure detection)
  2. Remove it: I can revert that part and let the E2E test run slower (~30-35s). The test would still be reliable with the other fixes (status-based detection instead of log parsing)

Happy to go either direction based on your preference. What do you think?

@dmjb
Copy link
Contributor

dmjb commented Feb 4, 2026

@gkatz2 I think the interval env var is a good idea, let's keep it 👍

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 4, 2026
Address code review feedback by restructuring the proxy check
to declare variables upfront and use assignment instead of short
declaration, avoiding variable shadowing.
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 4, 2026
@dmjb
Copy link
Contributor

dmjb commented Feb 4, 2026

@gkatz2 Thanks for addressing the feedback. I'll merge when the build goes green.

@dmjb dmjb merged commit f33de13 into stacklok:main Feb 4, 2026
33 checks passed
@gkatz2
Copy link
Contributor Author

gkatz2 commented Feb 4, 2026

@dmjb Implemented the code restructure per your suggestion. I added var err error since using = instead of := requires the variable to be pre-declared.

proxyRunning := true
var err error
if t.proxy != nil {
    proxyRunning, err = t.proxy.IsRunning()
    if err != nil {
        return false, err
    }
}
return proxyRunning, nil

CI is passing.

Update: Oops, didn't see you already merged it! Thanks.

@gkatz2 gkatz2 deleted the fix/health-check-zombie-process branch February 4, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zombie process when TransparentProxy stops due to health check failure

2 participants