Description
CI Run Link: https://github.com/coder/coder/actions/runs/20274781010
Branch: main
Commit: 089b67761ad7b8f66404a0b3ac61f62b9cec0b74 (author: Mathias Fredriksson) — coder/coder@089b677
Timing: Failures occurred within minutes of the Slack alert (same run/day).
Failure summary
- Workflow job: gen
- Step: Install Protoc
- Root cause classification: Infrastructure (external artifact download from GitHub Releases)
Key evidence (from job logs)
mkdir -p /tmp/proto
pushd /tmp/proto
curl -L -o protoc.zip https://github.com/protocolbuffers/protobuf/releases/download/v23.4/protoc-23.4-linux-x86_64.zip
unzip protoc.zip
...
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 92 100 92 0 0 2885 0 --:--:-- --:--:-- --:--:-- 2967
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive.
Archive: protoc.zip
unzip: cannot find zipfile directory in one of protoc.zip or protoc.zip.zip, and cannot find protoc.zip.ZIP, period.
##[error]Process completed with exit code 9.
Related evidence from the same run (corroborating infra flake)
- test-go-pg (windows-2022) failed to download Arsenal Image Mounter driver files due to repeated 504s from GitHub:
curl: (22) The requested URL returned error: 504
Warning: Problem : HTTP error. Will retry in 5 seconds. 5 retries left.
...
##[error]Process completed with exit code 22.
- test-go-race-pg failed Terraform provider installation with:
Error while installing coder/coder v2.13.1: ... please try again later: 504 Gateway Timeout returned from github.com
(Tracked in existing issue below.)
Root cause
- External artifact fetches (GitHub Releases) intermittently returned either a body that was not a valid zip archive (curl reported only 92 bytes for protoc.zip) or 504 Gateway Timeout.
- Not a product code/test flake; no panic/OOM/data race signatures.
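To make the protoc evidence concrete: curl exited successfully after writing 92 bytes, and unzip then found no end-of-central-directory record, which is consistent with a short non-zip body having been saved under the name protoc.zip. The logs do not show what those 92 bytes were, so the file contents below are a stand-in. This is a minimal sketch of a cheap signature check; `looks_like_zip` is a hypothetical helper, not something in the repository.

```shell
# looks_like_zip FILE: succeed if FILE starts with the zip local-file-header
# signature "PK\003\004" (hex 50 4b 03 04). Hypothetical helper for triage.
looks_like_zip() {
  magic="$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "504b0304" ]
}

# Simulate the failure mode: a short text body saved under a .zip name.
# The body is an assumption; the job log does not show the real content.
tmpdir="$(mktemp -d)"
printf 'placeholder for a short error response\n' > "$tmpdir/protoc.zip"

if looks_like_zip "$tmpdir/protoc.zip"; then
  echo "zip signature present"
else
  echo "not a zip archive; first bytes: $(head -c 40 "$tmpdir/protoc.zip")"
fi
rm -rf "$tmpdir"
```

Printing the first bytes of the bad file in the CI log would show whether the response was an error page, a redirect stub, or a truncated archive the next time this happens.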
Related issues
- Similar closed: #1144, "flake: Build/offlinedocs external download failures (protoc unzip; rcodesign 503)". Same failure mode (protoc unzip failure), different job (offlinedocs/build).
- Related open: #1201, "flake: gen - Terraform provider download failure (context deadline exceeded)". Same underlying infra, different artifact/job.
Ownership / assignment analysis
- The step lives in .github/workflows/ci.yaml under the gen job (Install Protoc). This is CI infra ownership.
- Recent substantive CI maintenance has been by @kacpersaw and @ethanndickson.
- Assigning to @kacpersaw for triage of CI download reliability in gen.
Mitigations to consider
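- Add retries and validation to the protoc download step: curl --fail --retry 5 --retry-delay 2 --retry-all-errors (--fail makes curl exit non-zero on an HTTP error instead of saving the error body; --retry-all-errors needs curl 7.71.0 or newer), then unzip -t before install, then verify a pinned checksum, and retry the whole download if any check fails.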
- Consider mirroring protoc or using a package manager/cache.
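The first mitigation could look roughly like the following. This is a sketch under assumptions, not the current contents of ci.yaml: `fetch_zip` and `verify_sha256` are hypothetical helper names, the attempt count and delays are arbitrary, `sha256sum` is assumed to be available on the runner, and the expected checksum would have to be pinned by whoever maintains the step.

```shell
# verify_sha256 FILE EXPECTED: succeed only if FILE's SHA-256 digest equals EXPECTED.
verify_sha256() {
  vs_actual="$(sha256sum "$1" | cut -d' ' -f1)"
  [ "$vs_actual" = "$2" ]
}

# fetch_zip URL OUT EXPECTED_SHA256 [ATTEMPTS]
# Download URL to OUT, then require that the archive passes `unzip -t` and
# matches the pinned checksum. Any failure discards OUT and starts over.
fetch_zip() {
  fz_url="$1"; fz_out="$2"; fz_sha="$3"; fz_max="${4:-3}"
  fz_try=1
  while [ "$fz_try" -le "$fz_max" ]; do
    # --fail: exit non-zero on HTTP errors instead of saving the error body.
    # --retry-all-errors: requires curl 7.71.0 or newer.
    if curl --fail --location --silent --show-error \
         --retry 5 --retry-delay 2 --retry-all-errors \
         --output "$fz_out" "$fz_url" \
       && unzip -tq "$fz_out" >/dev/null \
       && verify_sha256 "$fz_out" "$fz_sha"; then
      return 0
    fi
    echo "attempt $fz_try/$fz_max failed for $fz_url" >&2
    rm -f "$fz_out"
    fz_try=$((fz_try + 1))
    if [ "$fz_try" -le "$fz_max" ]; then
      sleep 2
    fi
  done
  return 1
}
```

A call would take the form `fetch_zip "$PROTOC_URL" protoc.zip "$PROTOC_SHA256"`, where both variables are placeholders for the release URL already used in the step and a checksum the maintainers record. The outer loop matters because curl's own retries only cover transfer failures; a response that downloads cleanly but is not the expected archive is only caught by the unzip and checksum checks.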
Reproduction / next steps
- Re-running the workflow typically succeeds, since the failure is transient.
- Update the Install Protoc step in ci.yaml to include retries and validation as above.
Quality Checklist
- Downloaded complete logs for failing jobs
- Verified timing alignment with Slack alert
- Searched coder/internal (open + last 30 days closed) for duplicates
- Classified root cause (Infrastructure) and included concrete evidence
- Not a duplicate of an existing issue (#1144 is similar but covers a different job; #1201 tracks Terraform provider fetches)
- Assignment based on CI component ownership, not commit/PR author