Fix/2229 webhook service down failures #2477
Open
paulnegz wants to merge 2 commits into replicate:main
This change replaces pip with uv for Python package installation in container builds.

Key changes:
- Update StandardGenerator to use uv for package installation
- Add proper uv caching configuration
- Update tests to expect uv-based commands
- Update documentation to reflect uv usage

Fixes replicate#2167

Signed-off-by: Paul Negedu <paul.negedu@yahoo.com>
…cancellation

- Add webhook timeout (10s default, configurable via COG_WEBHOOK_TIMEOUT)
- Use ThreadPoolExecutor for webhook calls to prevent blocking main thread
- Reduce max retries from 12 to 6 to avoid blocking too long (~60s vs 320s)
- Add comprehensive tests for timeout, retry behavior, and background execution
- Fix GOARCH assignment bug in dockerfile generation

This fixes issue replicate#2229 where webhook service being down would:
1. Block async /predictions requests indefinitely
2. Prevent cancellation of stuck requests
3. Leave health check stuck in 'BUSY' state

The fix ensures webhook failures are handled gracefully in background threads without blocking the main prediction workflow.

Signed-off-by: Paul Negedu <paul.negedu@yahoo.com>
Author
@zeke I'd appreciate a review and feedback on this. I'm open to making adjustments to move this forward.
Fix: Webhook Service Down No Longer Blocks Async Predictions or Cancellation
Webhooks now run in background threads using a ThreadPoolExecutor, so failures or timeouts do not block the main prediction flow.
Timeouts and retries improved: Webhook calls have a default 10s timeout (configurable via COG_WEBHOOK_TIMEOUT), and terminal status webhooks now retry up to 6 times (down from 12), reducing worst-case wait from ~320s to ~60s.
Graceful error handling: Connection errors, timeouts, and HTTP errors are logged but do not block or crash the worker.
Comprehensive tests added: New and improved tests simulate webhook timeouts, connection failures, retry logic, and verify that cancellation and health checks are never blocked by webhook issues.
Bonus: Fixed a bug in Dockerfile generation where GOARCH was incorrectly set to runtime.GOOS instead of runtime.GOARCH.
Closes #2229.
Async predictions and cancellation are now robust to webhook service outages.