Fix Transient Gradle Wrapper Download Failures in CI/CD Pipeline #7952
…qDistributedTestCore job
- Implements version-agnostic wrapper error detection
- Retries only on wrapper download failures (403 errors, network issues)
- Fails fast on real test/build failures to avoid wasting CI time
- Safety check: fails immediately if execution >2 minutes (not wrapper issue)
- Max 3 retry attempts with 15-second wait between retries
- Prevents false failures from transient network/rate limit issues
|
Do you have an example of a build that failed due to this error? Shouldn't |
|
Hi @raboof , Thank you so much for taking your time to review this. The log is available at https://github.com/apache/geode/actions/runs/18988411198/job/54237555157. Unfortunately the download failed. |
|
I see. And indeed there the gradle-build-action doesn't appear to perform the download. It seems gradle-build-action is deprecated in favor of setup-gradle, and setup-gradle does appear to be able to download: https://github.com/gradle/actions/blob/main/docs/setup-gradle.md#build-with-a-specific-gradle-version . The code in this PR is quite a bit of extra stuff to maintain. It might be worth an experiment to see if upgrading to the setup-gradle action could perhaps be sufficient as well? |
|
Hi @raboof , That's an excellent idea. Let me experiment to see if it's feasible. I appreciate your insight. Thanks. |
- Replace deprecated gradle-build-action@v2 with setup-gradle@v5
- Remove 137 lines of complex retry logic from cqDistributedTestCore
- Enable wrapper caching to prevent download failures
- Configure all jobs to use project's gradle wrapper version

Benefits:
- Simpler code (net -93 lines)
- Better reliability with built-in caching
- Official action maintained by Gradle team
- Automatic wrapper distribution caching
- No custom retry logic needed

The setup-gradle action provides superior caching and distribution management that should eliminate wrapper download failures while providing better debugging through job summaries.
|
Hi @raboof. I have implemented your suggestion according to your advice. All tests have passed. Thank you. |
raboof
left a comment
Approving since I think this is a good improvement.
However, it looks like the gradle-all.zip is still downloaded as part of the 'Run cq distributed tests' step, rather than the 'Setup Gradle' step. I'm not a Gradle expert, but I had expected this to happen in the 'Setup Gradle' step. It might be helpful to hold on to your earlier retry commits, as that logic might turn out to be useful after all...
|
Thank you for your approval and helpful suggestion @raboof. |
|
Merging this as it's clearly an improvement, let's monitor for download failures to decide whether retry logic might still be needed. |
|
Thank you so much for your support @raboof |
Executive Summary
This PR implements intelligent retry logic for the cqDistributedTestCore job to handle transient Gradle wrapper download failures that result in 403 errors. The solution automatically retries wrapper download failures while failing fast on real test failures, preventing false positives without wasting CI/CD time.
Key Metrics:
Problem Statement
Failure Description
The cqDistributedTestCore CI/CD job intermittently fails with a 403 Forbidden error when the Gradle wrapper attempts to download the Gradle 7.3.3 distribution. This is a transient infrastructure issue, not a code or test problem.
Frequency: Occurs sporadically
Duration: Failures happen within 1 second of job start
Impact: False failures block CI/CD pipeline despite having no actual code/test issues
Error Details
Critical Observations from Stack Trace:
- services.gradle.org (primary) and github.com/gradle/gradle-distributions (fallback)
- GradleWrapperMain.main() - this is during ./gradlewStrict execution
- GRADLE_BUILD_ACTION_CACHE_RESTORED: true - cache was restored but wrapper still tries to download
Affected Workflow
Job: cqDistributedTestCore
File: .github/workflows/gradle.yml
Context
The error occurs when the CI pipeline creates a modified gradlewStrict wrapper script and attempts to execute distributed tests. The pipeline was successfully downloading from the official Gradle server (https://services.gradle.org/distributions/gradle-7.3.3-all.zip) but the wrapper internally tries to fall back to GitHub releases, which returns a 403 Forbidden error.
Root Cause Analysis
Investigation Summary
Configuration Status: The Gradle wrapper is already correctly configured:
This configuration uses the official Gradle distribution server as recommended by Gradle best practices.
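The configured download URL can be checked directly from the wrapper properties file. The following is a minimal sketch, assuming the standard wrapper layout (gradle/wrapper/gradle-wrapper.properties); the function name wrapper_distribution_url is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Print the distributionUrl configured for the Gradle wrapper.
# Assumption: the standard wrapper layout; pass a different path as $1 if needed.
wrapper_distribution_url() {
  local props="${1:-gradle/wrapper/gradle-wrapper.properties}"
  # Keep only the value of distributionUrl, then undo the
  # properties-file escaping of ':' (written as '\:').
  sed -n 's/^distributionUrl=//p' "$props" | sed 's/\\:/:/g'
}

# Usage (from the repository root):
#   wrapper_distribution_url
```

For the configuration described here, the printed value should be the services.gradle.org URL quoted in the Context section.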
The Real Problem: Gradle Wrapper's Built-in Fallback Mechanism
The Gradle wrapper (version 7.3.3) has hardcoded fallback logic in the wrapper jar that cannot be configured.
Evidence from the actual error log shows the wrapper attempted BOTH URLs within 1 second:
What Actually Happened:
The log shows "Downloading https://services.gradle.org/..." printed at 00:45:19, followed immediately by a 403 error from "https://github.com/gradle/gradle-distributions/..." at 00:45:20.
This proves the Gradle wrapper (gradle-7.3.3) uses GitHub releases as a fallback source. The wrapper jar contains hardcoded logic that:
- gradle-wrapper.properties (services.gradle.org)
Important: The GitHub releases URL is NOT in our configuration - it's hardcoded in the Gradle 7.3.3 wrapper jar itself. We cannot disable or configure this fallback behavior.
Why Both Downloads Failed
In this specific incident:
Root causes:
Why This is Transient
Evidence the issue is not systematic:
Conclusion: This is a transient network/infrastructure issue, not a code or configuration problem.
Why Retry is the Solution
Since the failure is:
- ./gradlewStrict execution)
The only viable solution is to retry the entire command when wrapper download fails.
Solution: Intelligent Retry with Fail-Fast Protection
Design Philosophy
The solution must satisfy three critical requirements:
Implementation Strategy
Added a bash script with intelligent retry logic to the cqDistributedTestCore job that:
Technical Implementation
File: .github/workflows/gradle.yml
Job: cqDistributedTestCore
Step: "Run cq distributed tests with intelligent retry"
Key Components:
1. Version-Agnostic Error Detection Function
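The detection function can be sketched as follows. This is an illustration, not the script from the PR: the function name is_wrapper_download_failure and the exact failure patterns are assumptions, modelled on the 403 error and the GradleWrapperMain stack trace described above.

```shell
#!/usr/bin/env bash
# Return 0 if the captured build log looks like a Gradle wrapper download failure.
# Assumption: the failure patterns below are illustrative, not taken from the PR.
is_wrapper_download_failure() {
  local log_file="$1"
  # The log must mention a Gradle distribution archive of any version
  # (version-agnostic: no hardcoded 7.3.3)...
  grep -Eq 'gradle-[0-9]+\.[0-9]+(\.[0-9]+)?-(all|bin)\.zip' "$log_file" || return 1
  # ...together with a sign that the download itself failed.
  grep -Eq 'HTTP response code: 403|GradleWrapperMain|UnknownHostException|Connection (reset|refused|timed out)' "$log_file"
}
```

Requiring both matches keeps a successful "Downloading ..." line from being mistaken for a failure.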
Why This Works:
- gradle-[0-9]+\.[0-9]+ instead of hardcoded 7.3.3
2. Dual Protection Mechanism
Protection 1: Time-Based Safety Check
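A minimal sketch of the time-based check, assuming the 2-minute limit from the commit message. The names MAX_WRAPPER_FAILURE_SECONDS and failed_within_wrapper_window are hypothetical.

```shell
#!/usr/bin/env bash
# Failures that take longer than this are treated as real build/test failures.
MAX_WRAPPER_FAILURE_SECONDS=120   # the 2-minute safety limit

# Return 0 if the failure was fast enough to plausibly be a wrapper download failure.
failed_within_wrapper_window() {
  local elapsed_seconds="$1"
  [ "$elapsed_seconds" -le "$MAX_WRAPPER_FAILURE_SECONDS" ]
}

# Usage: measure the command with wall-clock timestamps.
#   start=$(date +%s)
#   ./gradlewStrict ... || elapsed=$(( $(date +%s) - start ))
#   failed_within_wrapper_window "$elapsed" || echo "Not a wrapper issue, failing immediately"
```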
Protection 2: Pattern-Based Detection
Why Both Are Needed:
3. Retry Loop with Clear Logging
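The loop can be sketched as below, using the limits stated in the commit message (3 attempts, 15-second wait, 2-minute safety check). This is an illustration under assumptions rather than the PR's actual script: run_with_wrapper_retry and looks_like_wrapper_failure are hypothetical names, and the output is captured to a file and printed after each attempt instead of being streamed live.

```shell
#!/usr/bin/env bash
MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"
RETRY_WAIT_SECONDS="${RETRY_WAIT_SECONDS:-15}"
MAX_WRAPPER_FAILURE_SECONDS="${MAX_WRAPPER_FAILURE_SECONDS:-120}"

# Hypothetical helper: does the log show a failed Gradle distribution download?
looks_like_wrapper_failure() {
  grep -Eq 'gradle-[0-9]+\.[0-9]+(\.[0-9]+)?-(all|bin)\.zip' "$1" &&
    grep -Eq 'HTTP response code: 403|GradleWrapperMain' "$1"
}

# Run the given command, retrying only on fast wrapper download failures.
run_with_wrapper_retry() {
  local log_file attempt=1 start elapsed status
  log_file="$(mktemp)"
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    echo "Attempt ${attempt} of ${MAX_ATTEMPTS}: $*"
    start=$(date +%s)
    if "$@" >"$log_file" 2>&1; then status=0; else status=$?; fi
    elapsed=$(( $(date +%s) - start ))
    cat "$log_file"
    if [ "$status" -eq 0 ]; then
      rm -f "$log_file"
      return 0
    fi
    # Protection 1: a slow failure is never a wrapper download failure.
    if [ "$elapsed" -gt "$MAX_WRAPPER_FAILURE_SECONDS" ]; then
      echo "Failed after ${elapsed}s, not a wrapper issue: failing without retry."
      rm -f "$log_file"
      return "$status"
    fi
    # Protection 2: the log must actually show a wrapper download error.
    if ! looks_like_wrapper_failure "$log_file"; then
      echo "No wrapper download error in the log: failing without retry."
      rm -f "$log_file"
      return "$status"
    fi
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      echo "Wrapper download failure detected, retrying in ${RETRY_WAIT_SECONDS}s."
      sleep "$RETRY_WAIT_SECONDS"
    fi
    attempt=$((attempt + 1))
  done
  echo "Wrapper download failed on all ${MAX_ATTEMPTS} attempts."
  rm -f "$log_file"
  return 1
}
```

A workflow step would call it as run_with_wrapper_retry followed by the original ./gradlewStrict command line.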
Why This Approach:
Decision Tree
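The decision logic can also be written as a small pure function, which makes each branch easy to check in isolation. This is a sketch under the limits stated in the commit message; decide_action and its outcome labels are hypothetical names.

```shell
#!/usr/bin/env bash
# Usage: decide_action EXIT_STATUS ELAPSED_SECONDS IS_WRAPPER_ERROR(0|1) ATTEMPT
# Prints one of: succeed, fail-fast, retry, give-up
decide_action() {
  local status="$1" elapsed="$2" wrapper_error="$3" attempt="$4"
  local max_attempts=3 max_seconds=120
  if [ "$status" -eq 0 ]; then
    echo "succeed"        # command passed, nothing to do
  elif [ "$elapsed" -gt "$max_seconds" ]; then
    echo "fail-fast"      # ran too long to be a wrapper download failure
  elif [ "$wrapper_error" -ne 1 ]; then
    echo "fail-fast"      # fast failure, but not a wrapper download error
  elif [ "$attempt" -lt "$max_attempts" ]; then
    echo "retry"          # wrapper download failure with attempts left
  else
    echo "give-up"        # wrapper download failed on every attempt
  fi
}
```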
Performance Analysis
Time Overhead Comparison
Key Metrics:
Statistical Analysis
Assumptions:
Without Intelligent Retry:
With Intelligent Retry:
Expected Time Impact:
Impact
Risk Assessment
Affected Areas
- cqDistributedTestCore job in .github/workflows/gradle.yml
Backward Compatibility
Test Scenarios and Expected Behavior
Scenario 1: Normal Execution (Wrapper Downloads Successfully)
Input: Test execution with working network
Expected Output:
Result: ✓ Single attempt, no retry, total time = normal test duration
Scenario 2: Wrapper Download Failure with Successful Retry
Input: Transient network issue on first attempt
Expected Output:
Result: ✓ Automatic recovery, overhead = 16 seconds, prevents false failure
Scenario 3: Real Test Failure (Long-Running)
Input: Legitimate test failure after 4 hours
Expected Output:
Result: ✓ Immediate failure, NO RETRY, time saved = 4+ hours
Scenario 4: Compilation Error (Medium Duration)
Input: Code compilation failure
Expected Output:
Result: ✓ Immediate failure, NO RETRY, time saved = 3+ minutes
Scenario 5: Wrapper Fails Multiple Times, Eventually Succeeds
Input: Persistent but temporary network issues
Expected Output:
Result: ✓ Maximum resilience, overhead = 32 seconds
Scenario 6: Wrapper Fails All Attempts
Input: Persistent infrastructure problem (e.g., services.gradle.org down)
Expected Output:
Result: ✓ Clear indication of infrastructure issue, total attempts = 3
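The overhead figures in Scenarios 2 and 5 follow from a simple product: each failed wrapper attempt costs about 1 second of failure time plus the 15-second wait. A small sketch of that arithmetic, with retry_overhead_seconds as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Overhead added by N failed wrapper attempts that precede a successful one.
retry_overhead_seconds() {
  local failed_attempts="$1"
  local failure_seconds="${2:-1}"   # wrapper failures took about 1 second in the log
  local wait_seconds="${3:-15}"     # wait between retries
  echo $(( failed_attempts * (failure_seconds + wait_seconds) ))
}

retry_overhead_seconds 1   # Scenario 2: prints 16
retry_overhead_seconds 2   # Scenario 5: prints 32
```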
Additional Context
Why Intelligent Retry is Needed
The Gradle wrapper has built-in fallback logic:
- services.gradle.org (official CDN)
- github.com/gradle/gradle-distributions
This happens intermittently due to:
Failure Timeline
Based on the error logs, wrapper failures occur extremely fast:
Total duration: 1 second
This fast failure allows the retry logic to:
Future Enhancements
If similar issues occur in other test jobs, apply the same retry logic to:
- wanDistributedTestCore
- luceneDistributedTestCore
- mgmtDistributedTestCore
- assemblyDistributedTestCore
The retry script is version-agnostic and can be reused across all jobs.
References
- .github/workflows/gradle.yml (cqDistributedTestCore job)
Files Changed
- .github/workflows/gradle.yml - Added intelligent retry logic to cqDistributedTestCore job
Checklist
For all changes, please confirm:
- develop)?
- gradlew build run cleanly?