diff --git a/.claude/skills/analyze-toxgen/SKILL.md b/.claude/skills/analyze-toxgen/SKILL.md
new file mode 100644
index 0000000000..c6026afd78
--- /dev/null
+++ b/.claude/skills/analyze-toxgen/SKILL.md
@@ -0,0 +1,90 @@
+---
+name: analyze-toxgen-failures
+description: Analyze toxgen failures
+---
+
+# Analyze Toxgen
+
+## Instructions
+
+The purpose of this skill is to analyze and resolve test failures introduced by
+updating our test matrix.
+
+
+### Step 1: Fetch CI status
+
+Find the newest PR from the toxgen/update branch in the getsentry/sentry-python
+repo on GitHub.
+
+Check the results of the CI runs on the PR. If not all workflows have
+finished, wait for them to finish, checking periodically every ~10 seconds.
+Look for failures in workflows whose names start with "Test" (for example,
+"Test Agents" or "Test Web 1"). If there are no failures, inform the user and
+don't continue with the next steps.
+
+Each job runs the test suites of several integrations on one Python version.
+For instance, "DBs (3.12, ubuntu-22.04)" runs the tests for the database
+integrations on Python 3.12. A single job may also cover multiple versions of
+one integration; for example, the Redis step in the "DBs (3.12, ubuntu-22.04)"
+job might run the test suite against Redis 5.3.1 as well as 6.4.0. The test
+matrix that determines which package versions are run on which Python versions
+is stored in tox.ini.
+
+Make a list of all failing tox targets. A tox target has the pattern
+"py{python-version}-{integration}-{package-version}". A failing tox target
+looks like this: "py3.14t-openai_agents-v0.9.1: FAIL", while a passing one
+looks like this: "py3.14t-openai_agents-v0.9.1: OK".
+
+Compile a text summary that contains the following:
+  - A list of all failing integrations.
+  - For each integration:
+    * The specific tox targets that are failing
+    * The test failure message or error output from the failing tests
+    * The command used in CI to run the test suite, to reproduce the
+      failure -- it should use tox; check the job output for the exact command
+
+Show the summary to the user.
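+
+As a minimal sketch, this step can be done with the GitHub CLI, assuming
+`gh` is installed and authenticated (the PR number below is a placeholder):
+
+```bash
+# Find the newest PR from the toxgen/update branch
+gh pr list --repo getsentry/sentry-python --head toxgen/update --limit 1
+
+# Watch the PR's checks until they finish, polling every 10 seconds
+gh pr checks 1234 --repo getsentry/sentry-python --watch --interval 10
+
+# List the workflow runs for the branch to spot failing "Test ..." workflows
+gh run list --repo getsentry/sentry-python --branch toxgen/update
+```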
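+
+To extract the failing tox targets from a failed run's logs, one option is to
+filter for the ": FAIL" pattern described above (the run ID is a placeholder):
+
+```bash
+# Failed-step logs of one workflow run, reduced to the failing tox targets
+gh run view 9876543210 --repo getsentry/sentry-python --log-failed \
+  | grep -oE 'py[0-9.]+t?-[^ :]+: FAIL' \
+  | sort -u
+```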
+
+
+### Step 2: Analyze failures
+
+Do the following for each failing integration.
+
+#### Determine if the failure is consistent
+
+First determine whether the failure is a flake, for example caused by a
+temporary connection problem, or whether it persists because it's related to
+a change introduced in a new version.
+- Check if the failing package version was newly added to tox.ini via the
+  PR. If not, the failure is likely a flake.
+- Check whether the same version of the same package is also failing in jobs
+  on other Python versions. If it is failing across multiple Python versions,
+  the failure is unlikely to be a flake.
+
+If it looks like a flake, offer to rerun the failing test suite. If the user
+accepts, wait for the result, polling periodically. If the rerun is
+successful, there's nothing else to do; move on to the next failing
+integration, if there is one.
+
+#### Analyze non-flake failures
+
+Run the test suite for the failing tox target locally via
+`tox -e {tox_target}`.
+
+Analyze the error message from the local run, then start localizing the
+source of the breakage:
+
+1. Retrieve the repo code. Use the checkout-project-code skill for that.
+2. Using git, retrieve the diff between the last working version of the
+   package (the original max version in tox.ini before the PR) and the newly
+   introduced, failing version (see the sketch at the end of this skill).
+3. Analyze the diff, looking for parts that could be related to the failing
+   tests. Remember the specific code parts that are relevant so that you can
+   show them to the user.
+
+Present the user with the results of your investigation. Make sure to link or
+point to the specific code snippets for double-checking.
+
+#### Propose a fix
+
+Ask the user if you should propose a fix. If yes, do it.
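+
+For reference, the version diff in step 2 above might be retrieved like this;
+the repo URL, tags, and path are hypothetical placeholders, and tag naming
+varies by project:
+
+```bash
+# Clone the failing package's source (URL is a placeholder)
+git clone https://github.com/example/openai-agents-python /tmp/pkg
+cd /tmp/pkg
+
+# List tags to find the exact names of the last working and failing versions
+git tag --sort=-creatordate | head -n 20
+
+# Diff the last working version against the newly failing one
+git diff v0.8.0..v0.9.1 -- src/
+```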