Extend existing failure notifications for tracking flaky CI issues by G-D-Petrov · Pull Request #2974 · man-group/ArcticDB

G-D-Petrov · 2026-03-17T09:01:41Z

Replaces the inline bash in failure_notification.yaml with standalone Python
scripts that parse CI failures, track them as GitHub issues, and send Slack
notifications. The new implementation adds three capabilities that the bash
version lacked:

Timeout detection — test-runner steps that fail without producing test
names are recognised as timeouts and tracked under a dedicated CI timeout
issue instead of the generic "unparseable" bucket.
PR comment trigger — developers can comment /analyse-failures on a PR
to analyse its most recent failed CI run and get results posted back as a
comment.
Correct JSON escaping — Slack payloads are built with json.dumps
instead of bash string interpolation, eliminating broken messages from
special characters in test names.

How failures are classified

                  ┌─────────────────────────┐
                  │ Are there parsed test    │
                  │ names (FAILED/ERROR)?    │
                  └────────┬────────────────┘
                     yes   │          no
                           ▼          │
                  ┌──────────────┐    ▼
                  │ test_failure │  ┌──────────────────────────┐
                  │              │  │ Are there infra steps    │
                  │ Track each   │  │ (non-test-runner)?       │
                  │ test as a    │  └────────┬─────────────────┘
                  │ flaky-test   │     yes   │         no
                  │ issue        │           ▼         │
                  └──────────────┘  ┌──────────────┐   ▼
                                    │ infra_failure │ ┌────────────────────────┐
                                    │              │ │ Were test-runner steps  │
                                    │ Track each   │ │ stripped out?           │
                                    │ step as a    │ └────────┬───────────────┘
                                    │ flaky-step   │    yes   │        no
                                    │ issue        │          ▼        │
                                    └──────────────┘ ┌─────────┐       ▼
                                                     │ timeout │  ┌─────────┐
                                                     │         │  │ unknown │
                                                     │ Single  │  │         │
                                                     │ CI      │  │ Unparse-│
                                                     │ timeout │  │ able    │
                                                     │ issue   │  │ issue   │
                                                     └─────────┘  └─────────┘

--conclusion timed_out from GitHub forces the timeout path regardless.

Real-world examples

Example 1: Test failure — run 24664309055

Context: Building master on push, conclusion failure.
One failed job: 3.9 Linux / integration-DefaultCache, failed step: Run test.

Logs contain:

FAILED tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data

Pipeline output:

failure_kind.txt → test_failure
failing_tests.txt → tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data
failed_steps.txt → empty (Run test filtered as test-runner step)
Issue created: Flaky test: tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data with flaky-test label
Slack: :rotating_light: *New* — \tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data``

Example 2: Infrastructure failure — run 24807645574

Context: Building master on schedule, conclusion failure.
One failed job: 3.11 Windows / compile (...), failed step: Install Required MSVC.

No test logs (compile job, not a test job).

Pipeline output:

failure_kind.txt → infra_failure
failing_tests.txt → empty
failed_steps.txt → Install Required MSVC
Issue created: Flaky step: Install Required MSVC with flaky-step label
Slack: :rotating_light: *New* — \Install Required MSVC``

Example 3: Timeout — run 24841138872

Context: Building master on workflow_dispatch, conclusion failure.
Two failed jobs:

3.11 Windows / unit-DefaultCache — step Run test timed out after 120 min
3.10 Windows / stress-DefaultCache — step Run test timed out

Logs show tests running (at 95% progress) but no FAILED/ERROR lines — the
runner was killed before completion.

Pipeline output:

failure_kind.txt → timeout (only test-runner steps, no parsed test names)
failing_tests.txt → empty
failed_steps.txt → empty (Run test filtered)
Issue created: CI timeout with ci-failure label
Slack: :hourglass: *Timeout*

claude · 2026-03-17T09:03:13Z

+    types: [completed]
+
+jobs:
+  track-failures:


The workflow triggers on any branch failure (workflow_run fires regardless of the branch the triggering run ran on). The PR description says "on the master branch" but the if: condition only checks conclusion == 'failure' — it does not filter on branch. This means flaky-test issues will be auto-created for failures on feature branches, release branches, forks, etc., generating noise.

Consider adding a branch filter:

Suggested change

track-failures:

if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.head_branch == 'master'

claude · 2026-03-17T09:03:21Z

+          for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do
+            job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name")
+            step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \
+              --jq '.steps[] | select(.conclusion == "failure") | .name')


The log download silently succeeds even when it fails (|| true). If gh run view --log-failed fails (e.g. rate-limit, token permission issue), /tmp/failed_logs.txt may be empty or contain an error message, causing the grep patterns to produce zero results. The workflow would then fall through to the "unparseable failures" path and create a generic issue for every CI failure — this is the highest-noise fallback and should be avoided.

Consider logging a warning when the log download fails so at least the issue body contains diagnostic context.

claude · 2026-03-17T09:03:31Z

+          echo "$FAILING_TESTS" | while IFS= read -r test_name; do
+            [ -z "$test_name" ] && continue
+
+            echo "Processing test: $test_name"


Injection risk via ${{ steps.parse.outputs.tests }}.

FAILING_TESTS is populated from parsed log output (test names) and then passed through ${{ steps.parse.outputs.tests }}. GitHub Actions interpolates ${{ ... }} expressions directly into the YAML before the runner executes the step. If a crafted test name contains shell metacharacters or YAML-breaking sequences, this could alter step behaviour.

The correct mitigation is to reference the output only via the environment variable ($FAILING_TESTS), which is already being done inside the run: script — the environment variable assignment itself is the risk. Prefer writing step outputs to a file (e.g. /tmp/failing_tests.txt) and reading that file in the subsequent step, rather than passing through ${{ steps.parse.outputs.tests }} as an env value.

The same pattern applies to FAILED_STEPS at line 197.

claude · 2026-03-17T09:03:38Z

+              body="### Another failure observed"
+              body="$body"$'\n\n'"- **Run:** $RUN_URL"
+              body="$body"$'\n'"- **Commit:** \`${COMMIT_SHA:0:10}\`"
+              body="$body"$'\n'"- **Date:** $run_date"


The GitHub Search API (used by gh issue list --search) can return stale results with a lag of a few minutes. If two CI runs fail in quick succession for the same test, the deduplication search may find no existing issue for both runs, causing two issues to be created with the same title.

A safer deduplication approach would be to search by exact title match using gh issue list --search "is:issue is:open label:flaky-test \"$issue_title\"" and then verify the title server-side with --jq, which is what the code already does — but the underlying search index latency is unavoidable. This is worth noting in the workflow comments so maintainers understand the duplicates-on-simultaneous-failures edge case.

claude · 2026-03-17T09:03:49Z

+          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
+          REPO: ${{ github.repository }}
+          COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
+        run: |


The "unparseable failures" step creates a new issue per run with a unique title (CI failure (unparseable): run #$RUN_ID). If the log download frequently fails or the grep patterns never match (e.g. due to a log format change), this will create an unbounded number of issues — one for every CI failure.

Unlike the flaky-test and flaky-step steps, there is no deduplication here. Consider either deduplicating by searching for a generic title (e.g. CI failure (unparseable)) or enforcing a rate-limit / daily cap, or simply skipping issue creation and only writing to $GITHUB_STEP_SUMMARY for this fallback case.

claude · 2026-03-17T09:03:55Z

+              body="$body"$'\n\n'"---"
+              body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
+              body="$body"$'\n'"*Add further failure occurrences as comments below.*"
+              gh issue create --repo "$REPO" \


The issue body hard-codes "on the master branch" but the workflow is not restricted to master (see the branch-filter comment on line 9). This text will be misleading when issues are filed from non-master runs.

Suggested change

gh issue create --repo "$REPO" \

body="$body"$'\n\n'"This test has been detected as failing in CI."

claude · 2026-03-17T09:04:02Z

+          # Get failed jobs with their names and IDs
+          failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
+            --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]')
+


The API call fetches up to 100 jobs (per_page=100) but does not handle pagination. If a workflow run has more than 100 jobs (unlikely but possible as the matrix grows), failures in jobs beyond position 100 will be silently missed. Consider adding --paginate if the gh api CLI supports it here, or at least document this limitation.

claude · 2026-03-17T09:04:38Z

ArcticDB Code Review Summary

Changes in this PR: Extracts inline bash logic from failure_notification.yaml into standalone Python scripts (parse_ci_failures.py, track_ci_issues.py, send_slack_notification.py) with unit tests. Adds a /analyse-failures slash command, a new resolve_pr_run.py script, failure_kind.txt failure classification, a dedicated handle_timeout method, and the autogenerated label on all auto-created issues.

Previously Flagged — Status After Latest Push

✅ FIXED: GitHub Search special-char escaping — _sanitise_search_query strips [, ], :, /, ., " before the search query, and an exact title match loop filters results accurately.
✅ FIXED: create_grouped_issue deduplication — now uses stable titles (Grouped CI test failures / Grouped CI step failures) with find_existing_issue, so repeated grouped failures comment on the same issue rather than creating a new one per run.
✅ FIXED: status=failure query misses timed_out runs — find_latest_failed_run now loops over both failure and timed_out status values.
✅ FIXED: Branch name not URL-encoded — resolve_pr_run.py now uses urllib.parse.quote(branch, safe="").
✅ FIXED: /analyse-failures collaborator check — any-user trigger risk resolved.
✅ FIXED: import re inside _sanitise_search_query function body — moved to module top-level in track_ci_issues.py.
✅ FIXED: open(...).close() without context manager — those lines were removed entirely.
✅ FIXED: run_date used current wall-clock time instead of actual CI run time — IssueTracker now accepts a run_date argument populated from created_at in the GitHub API response.
✅ FIXED: Duplicated run_gh helper in every script — extracted to utils.py; all scripts import from it.
✅ FIXED: except Exception: in download_failed_logs (parse_ci_failures.py) — now correctly catches subprocess.CalledProcessError only, eliminating duplicate error messages.
✅ FIXED: Slack <url|text> links not converted to GitHub Markdown in notify-pr PR comment — notify-pr now reads github_summary output which is produced by SummaryEntry.to_github(), generating proper [text](url) hyperlinks natively. No sed conversion needed.
PARTIAL: Expression injection (head_branch) — WR_BRANCH is now passed as an environment variable rather than interpolated directly into a shell command, eliminating the practical injection risk. GitHub Advanced Security may still flag the ${{ github.event.workflow_run.head_branch }} expression in the env: block; the practical risk is negligible but the scanner flag will likely remain.

Open Issues

OPEN: head_branch GHAS flag (failure_notification.yaml, notify-slack job) — scanner will likely continue to flag ${{ github.event.workflow_run.head_branch }} even though it is now passed as an env var. Can be silenced with a comment-level suppression or by documenting why it is safe.

Minor / Informational

utils.py run_gh truncates the logged command to args[:3] for readability — if a failure is caused by an argument beyond position 3 (e.g. a bad --jq expression), the error message won't show it. Very minor.
test_ci_scripts.yml push trigger omits ".github/workflows/test_ci_scripts.yml" from the paths filter (unlike the pull_request trigger), so a direct push to master that only modifies the workflow file itself won't run the tests. Low-impact since changes to this repo flow through PRs.
find_latest_failed_run in resolve_pr_run.py interpolates workflow_name directly into the JQ expression string (f-string). The values are hardcoded in TRACKED_WORKFLOWS so there is no injection risk in practice, but it would be cleaner to use --arg.
handle_unparseable inlines the issue footer text rather than using _issue_footer(), producing slightly different wording. Not a bug, just minor inconsistency.
Note: The notify-pr heredoc <<EOF works correctly — YAML literal block scalars strip the base indentation before passing the script to the shell, so the EOF terminator lands at column 0. $SUMMARY expansion in the unquoted heredoc is safe: bash does not recursively scan the output of variable expansion for backtick command substitutions.

Checklist

Security:

✅ FIXED: /analyse-failures collaborator check — any-user trigger risk resolved
GHAS still flags head_branch expression (failure_notification.yaml, notify-slack job) — low practical risk but needs resolution or documented suppression

Correctness:

✅ FIXED: Slack <url|text> links now render correctly in PR comments via to_github() format

Testing:

DONE: Unit tests added and cover parse_ci_failures, track_ci_issues, resolve_pr_run, send_slack_notification, utils
DONE: New test_ci_scripts.yml workflow runs Python unit tests on PRs touching .github/scripts/

claude · 2026-03-17T09:46:36Z

 jobs:
-  on-failure:
+  track-failures:
    runs-on: ubuntu-latest


The track-failures job has no branch restriction — it fires on failures from any branch (feature branches, forks' PRs, etc.). This will create GitHub issues for every flaky test run on any contributor's branch, generating significant noise.

The original workflow explicitly filtered to master or '' (scheduled runs). That filter should be preserved here, or at minimum added as a comment explaining the intentional choice to track all branches.

Suggested change

runs-on: ubuntu-latest

if: >-

(github.event.workflow_run.conclusion == 'failure' ||

github.event.workflow_run.conclusion == 'timed_out') &&

(github.event.workflow_run.head_branch == 'master' ||

github.event.workflow_run.head_branch == '')

I think we should make the track-failures run on PR branches only after a human presses a button. Otherwise we will fill up github with false positives (e.g. I pushed a small change to my PR but broke all my tests because of dumb bug)

claude · 2026-03-17T09:46:46Z

+
+          echo "Fetching failed jobs for run $RUN_ID..."
+
+          failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \


The jobs API is called with per_page=100 but without pagination. A large matrix workflow can exceed 100 jobs (ArcticDB's build matrix is already sizeable). Any failures in jobs beyond position 100 will be silently missed, causing the issue tracker to never see those test failures.

Consider using gh api --paginate or at least adding a comment noting the limit. Note that --paginate with --jq requires using jq -s to merge pages, e.g.:

failed_jobs_json=$(gh api --paginate "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \ --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]' | jq -s 'add // []')

I agree with claude's comment

claude · 2026-03-17T09:46:57Z

+          echo "$failed_steps"
+
+          # --- Download logs and parse test failures ---
+          gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || true


The || true here silently swallows failures from gh run view --log-failed. If this command fails — due to a rate limit, insufficient token permissions (actions: read is granted, but log access can be restricted on forks), or a transient API error — the file /tmp/failed_logs.txt will either be empty or contain an error message like HTTP 403. The subsequent grep calls will match nothing, both FAILING_TESTS and FAILED_STEPS will be empty, and the workflow will fall through to the unparseable fallback, creating a new ci-failure issue for every such run.

At minimum, log the exit code or file content so failures are visible in the step logs:

gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || { echo "WARNING: gh run view --log-failed failed (exit $?); test name parsing will be skipped" }

claude · 2026-03-17T09:47:08Z

+          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
+          REPO: ${{ github.repository }}
+          COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
+          FAILING_TESTS: ${{ steps.parse.outputs.tests }}


Security: step output injection via ${{ ... }} expression interpolation.

FAILING_TESTS and FAILED_STEPS are populated from parsed CI log content — test names and step names that ultimately come from source code, commit messages, and workflow definitions. GitHub Actions evaluates ${{ steps.parse.outputs.tests }} by substituting the value directly into the YAML before the runner executes the step.

If a test name or step name contains characters that break the YAML/shell context (e.g. a test name with \n, backticks, or $(...) sequences), this can alter step behaviour. While the workflow_run trigger somewhat limits exposure (only code on this repo can influence test names), it is still best practice to avoid interpolating step outputs directly into env: blocks.

The safe pattern is to write the parsed outputs to a file in the parse step and read that file in the issues step:

# In parse step: echo "$failing_tests" > /tmp/failing_tests.txt echo "$infra_steps" > /tmp/failed_steps.txt # In issues step (no ${{ }} interpolation of untrusted content): FAILING_TESTS=$(cat /tmp/failing_tests.txt) FAILED_STEPS=$(cat /tmp/failed_steps.txt)

This avoids the ${{ }} interpolation path entirely for untrusted content.

This is probably fine because of the somewhat strict regex pattern matching, but still worth considering this claude's suggestion. I guess if we extract the above to bash script we can pass --output-dir to it where it will put all the text files needed for next steps.

claude · 2026-03-17T09:47:19Z

+
+          # --- Fallback for unparseable failures ---
+          if [ -z "$FAILING_TESTS" ] && [ -z "$FAILED_STEPS" ]; then
+            issue_title="CI failure (unparseable): run #$RUN_ID"


The unparseable-failure fallback creates a new issue for every CI run using a unique title (CI failure (unparseable): run #$RUN_ID). There is no deduplication here.

The primary scenario that triggers this path is when the log download fails (see the || true concern on line 68). If that happens consistently — e.g. due to a log permission change or gh CLI upgrade breaking the --log-failed flag — every CI failure on master will produce a new issue indefinitely.

Consider one of:

Writing to $GITHUB_STEP_SUMMARY only, instead of opening an issue, so humans can investigate without issue spam.

Searching for a generic open issue titled CI failures (unparseable) and adding a comment to it rather than creating a new issue per run.

Capping at a daily or weekly issue (use a date-stamped title like CI failure (unparseable): YYYY-MM-DD).

claude · 2026-03-17T09:47:30Z

+      (github.event.workflow_run.head_branch == 'master' ||
+       github.event.workflow_run.head_branch == '') &&
+      always() &&
+      needs.track-failures.result != 'skipped'


The notify-slack condition has a subtle logic gap. The combination of always() and needs.track-failures.result != 'skipped' means this job will also fire when track-failures itself fails (e.g. due to a permissions error, API rate limit, or script bug). In that case needs.track-failures.outputs.slack_summary will be empty, and the Slack message will be sent but with no failure details — potentially confusing on-call engineers.

Consider adding needs.track-failures.result != 'failure' to the condition, or at minimum handle the case where SUMMARY is empty by adding a fallback message indicating that issue tracking itself failed:

if: >- (github.event.workflow_run.head_branch == 'master' || github.event.workflow_run.head_branch == '') && always() && needs.track-failures.result != 'skipped' && needs.track-failures.result != 'cancelled'

I think we should indeed notify in case of track-failures failure (otherwise we'll be losing coverage of failures silently), but we should probably have a friendly error message in that case.

claude · 2026-03-17T09:47:40Z

+                "type": "section",
+                "text": {
+                  "type": "mrkdwn",
+                  "text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`"


Shell variables injected unescaped into a JSON string literal inside a heredoc.

$WORKFLOW_NAME, $CONCLUSION, and $BRANCH are substituted directly into the JSON payload string. If any of these contain double quotes, backslashes, or newlines (e.g. a workflow name like Build "with" quotes), the resulting JSON will be malformed and curl will send an invalid payload, silently failing with -f.

The escaped_summary is already being JSON-escaped via python3. Apply the same treatment to the other variables before inserting them:

esc_workflow=$(printf '%s' "$WORKFLOW_NAME" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])") esc_conclusion=$(printf '%s' "$CONCLUSION" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])") esc_branch=$(printf '%s' "$BRANCH" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")

Then use $esc_workflow, $esc_conclusion, $esc_branch in the heredoc.

IvoDD · 2026-04-24T06:46:32Z

 jobs:
-  on-failure:
+  track-failures:
    runs-on: ubuntu-latest


I think we should make the track-failures run on PR branches only after a human presses a button. Otherwise we will fill up github with false positives (e.g. I pushed a small change to my PR but broke all my tests because of dumb bug)

IvoDD · 2026-04-24T06:49:13Z

+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          REPO: ${{ github.repository }}
+        run: |
+          gh label create "flaky-test" --repo "$REPO" --color "FBCA04" --description "Flaky test auto-tracked by CI" 2>/dev/null || true


Should we use an autogenerated label for all of these? (So that we can filter and remove all in case of noise or bugs)

Or prefix all these labels with autogenerated e.g. autogenerated-flaky-test

IvoDD · 2026-04-24T06:50:26Z

Nit: Maybe rename the workflow

IvoDD · 2026-04-24T06:52:13Z

+
+          echo "Fetching failed jobs for run $RUN_ID..."
+
+          failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \


I agree with claude's comment

IvoDD · 2026-04-24T07:03:37Z

+          for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do
+            job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name")
+            step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \
+              --jq '.steps[] | select(.conclusion == "failure") | .name')


I don't think large bash scripts should live standalone in github workflows.
What do you think about extracting that to a separate file and calling into it with interface like:

track_failures --run-id something

and here we can just parse the stdout.

This will also make manual testing easier. E.g. you can run the track_failures script locally against a few example failures and see the output is correct. I think including such evidence in PR description would also be helpful.

IvoDD · 2026-04-24T07:12:39Z

+          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
+          REPO: ${{ github.repository }}
+          COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
+          FAILING_TESTS: ${{ steps.parse.outputs.tests }}


This is probably fine because of the somewhat strict regex pattern matching, but still worth considering this claude's suggestion. I guess if we extract the above to bash script we can pass --output-dir to it where it will put all the text files needed for next steps.

IvoDD · 2026-04-24T07:22:15Z

+              body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
+              body="$body"$'\n'"*Add further failure occurrences as comments below.*"
+
+              track_item "$test_name" "Flaky test" "flaky-test" "$body"


I think opening a separate issue for each test can be problematic.

Most probably if >10 tests fail together their failures are correlated. This can happen in cases:

Someone accidentally breaks 1000 tests on their PR and fat fingers the button to track failure

Some fixture or shared setup breaks and results in dozens of failed tests

I guess we can try to produce a single item if >10 failures.

IvoDD · 2026-04-24T07:23:15Z

+              body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
+              body="$body"$'\n'"*Add further failure occurrences as comments below.*"
+
+              track_item "$step_entry" "Flaky step" "flaky-step" "$body"


Some general gh action outage can cause a lot of steps to fail together. Grouping them if many makes sense.

IvoDD · 2026-04-24T07:31:50Z

+      (github.event.workflow_run.head_branch == 'master' ||
+       github.event.workflow_run.head_branch == '') &&
+      always() &&
+      needs.track-failures.result != 'skipped'


I think we should indeed notify in case of track-failures failure (otherwise we'll be losing coverage of failures silently), but we should probably have a friendly error message in that case.

claude · 2026-04-24T11:56:03Z

+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: >
+          python3 .github/scripts/parse_ci_failures.py
+          --run-id "${{ steps.meta.outputs.run_id }}"


The run_id from a workflow_dispatch is user-supplied and still reaches the shell via ${{ steps.meta.outputs.run_id }} expression interpolation (GitHub Actions evaluates ${{ }} before the runner executes the step). If a user inputs a value containing ", `, or $(), it may break the quoted argument or execute shell code.

The meta step already follows the safe pattern of routing context values through env: variables. Apply the same here:

Suggested change

--run-id "${{ steps.meta.outputs.run_id }}"

- name: Parse CI failures

env:

GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

META_RUN_ID: ${{ steps.meta.outputs.run_id }}

META_REPO: ${{ github.repository }}

run: |

python3 .github/scripts/parse_ci_failures.py \

--run-id "$META_RUN_ID" \

--repo "$META_REPO" \

--output-dir /tmp/ci_failures

Same fix applies to the identical pattern in the "Create or update issues" step (line 85).

claude · 2026-04-24T11:56:07Z

+        "--json", "number,title",
+    )
+    if not raw:
+        return None


GitHub Search treats [, ], :, /, and . as special syntax. Pytest parametrized test names like tests/test_foo.py::test_bar[param1-param2] contain all of these. The search query '"tests/test_foo.py::test_bar[param1-param2]" in:title' may return zero results even when an exact-title issue exists, causing find_existing_issue to return None and create a duplicate issue on every subsequent run.

The exact-title match on line 63 (if issue["title"] == title) is correct dedup logic, but it only runs if the search returns any results. Consider broadening the search (e.g. drop brackets/colons from the query term) and relying solely on the exact-match filter, or escaping search-special characters before passing the title to --search.

claude · 2026-04-24T11:56:11Z

+
+        if item_type == "test":
+            label = "flaky-test"
+            heading = f"Grouped CI test failures ({count} tests)"


The grouped-issue title embeds run #{self.run_id}, making it unique per run. This means grouped failures have no deduplication: if 15+ tests fail consistently across many CI runs, a new issue is created for every single run — the same unbounded issue growth the handle_unparseable fix resolved.

Consider using a stable title (e.g. f"Grouped CI {item_type} failures") and applying the same find-or-create pattern as track_item / handle_unparseable: search for an open issue with that title, add a comment if found, create only if not found.

claude · 2026-04-24T12:25:16Z

+        raw = run_gh(
+            "api", "--paginate",
+            f"repos/{repo}/actions/runs"
+            f"?branch={branch}&status=failure&per_page=5",


Two issues on this line:

1. status=failure misses timed_out runs

The GitHub Actions Runs API treats failure and timed_out as separate conclusion (and status) values. Querying status=failure will not return runs that concluded as timed_out. The code then checks run["conclusion"] in ("failure", "timed_out") on the results, but timed_out runs are never returned by this query, so they will always be missed.

Fix: query with status=completed and filter locally, or make a second query with status=timed_out:

for status in ("failure", "timed_out"): raw = run_gh( "api", "--paginate", f"repos/{repo}/actions/runs?branch={branch_encoded}&status={status}&per_page=10", ... )

2. Branch name is not URL-encoded

branch comes from .head.ref via get_pr_head(). A branch named feat/foo&status=success would inject extra query parameters into the URL, returning wrong results (or all runs regardless of status). Apply URL encoding:

from urllib.parse import quote branch_encoded = quote(branch, safe='')

claude · 2026-04-24T12:59:02Z

+    if: >-
+      github.event_name == 'issue_comment' &&
+      github.event.issue.pull_request &&
+      contains(github.event.comment.body, '/analyse-failures')


The /analyse-failures command has no collaborator check. Any GitHub user can comment /analyse-failures on any PR, causing the workflow to consume actions: read quota, call gh issue list/create/comment (using issues: write), and post a PR comment (using pull-requests: write). On a public repo this is open to the world.

Consider gating on author_association:

Suggested change

contains(github.event.comment.body, '/analyse-failures')

contains(github.event.comment.body, '/analyse-failures') &&

(github.event.comment.author_association == 'MEMBER' ||

github.event.comment.author_association == 'OWNER' ||

github.event.comment.author_association == 'COLLABORATOR')

This still allows all repo members/maintainers while blocking external actors.

IvoDD · 2026-04-27T08:56:55Z


 jobs:
-  on-failure:
+  resolve-pr-run:


I'm ok with triggering via PR comments but I think it has a couple downsides:

Pollutes the PR description page with things not related to the PR (e.g. flaky master tests)

Can't easily trigger for a failure before the last

I guess I was thinking of adding a button to the job summary of a failed "Build and Test" or "Build with conda" or other like:

failure-analysis-link: needs: [cpp-test-linux, cpp-test-windows, cpp-test-macos, build-python-wheels-linux, build-python-wheels-windows, build-python-wheels-macos, persistent_storage_verify_linux, persistent_storage_verify_windows] if: always() && failure() runs-on: ubuntu-latest steps: - name: Write failure analysis link to summary run: | url="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/workflows/failure_notification.yaml" cat >> "$GITHUB_STEP_SUMMARY" <<'EOF' ## Failure Analysis EOF echo "[**Analyse failures →**]($url)" >> "$GITHUB_STEP_SUMMARY" echo "" >> "$GITHUB_STEP_SUMMARY" echo "Run ID: \`${GITHUB_RUN_ID}\`" >> "$GITHUB_STEP_SUMMARY"

I don't feel strongly about this, but using a link like above would allow us to delete reselov_pr_run.py and simplify logic here

I guess using both activation mechanisms is also an option

I think that I prefer the implemented approach a bit more because:

It doesn't require changes to "Build and Test", "Build with conda", etc

Makes flaky tests a bit more clear in the PR itself, thus easier to see for reviewers

IvoDD · 2026-04-27T13:28:07Z

+  notify-slack:
+    runs-on: ubuntu-latest
+    needs: [track-failures]
+    # Only send Slack notifications for automatic master triggers, not manual dispatch


Could be argued that manual dispatch calls on PRs are still flakyness present on master.

Still probably better to keep the notifications just for master to avoid excessive noise.

IvoDD · 2026-04-27T13:35:21Z

+def download_failed_logs(repo: str, run_id: str, output_path: str) -> bool:
+    """Download logs for the failed run. Returns True on success."""
+    result = subprocess.run(
+        ["gh", "run", "view", run_id, "--repo", repo, "--log-failed"],


Nit: Can use run_gh method

IvoDD · 2026-04-27T13:37:28Z

+    )
+    with open(output_path, "w") as f:
+        f.write(result.stdout)
+    if result.returncode != 0:


Maybe we can add some error handling in the general run_gh instead of just this one gh run view command

IvoDD · 2026-04-27T14:10:22Z

Are these tests ever ran on the CI?
I think it makes sense to enforce they pass before merging a PR.

added a flow to test them

IvoDD · 2026-04-27T14:30:31Z

+    req = urllib.request.Request(
+        webhook_url,
+        data=data,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )


How does authentication against the slack webhook work? I see we in gh workflow we have the url itself as a secret.

Any chance urllib exception will leak the url in logs or something similar? It seems like in exception handling we at least don't log the url which is good.

Added extra handling for the URLError.

But in any case, since SLACK_WEBHOOK_URL is a repository secret, GitHub Actions automatically masks its value in all log output. Any occurrence of the secret's value gets replaced with ***. So even if it did leak through a traceback, it would be masked.

IvoDD · 2026-04-27T14:32:51Z

+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Send Slack notification


We were previously using an existing action ravsamhq/notify-slack-action@be814b201e233b2dc673608aa46e5447c8ab13f2.
Any reason to write our own slack notification code instead of building the message and using the existing action?

The ravsamhq/notify-slack-action doesn't support what we need. It uses Slack's legacy attachments format with a fixed schema — no Block Kit support and no way to pass an arbitrary payload. We'd lose the structured summary (per-test
lines, grouped failure messages, timeout indicators) and the tracker-failure warning.

IvoDD · 2026-04-27T14:33:54Z

+GROUPING_THRESHOLD = 10
+
+
+def run_gh(*args: str, check: bool = True) -> str:


This method is defined in more than one place, maybe we can add a common method in utils.py?

IvoDD · 2026-04-27T14:36:48Z

+        self.run_id = run_id
+        self.run_url = run_url
+        self.short_sha = commit_sha[:10]
+        self.run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


The run might be e.g. a week old, so the time of report might be misleading. Would be useful to use the time of the failure so we can cross reference against e.g. github outages.

claude · 2026-04-29T07:27:57Z

+    """Download logs for the failed run. Returns True on success."""
+    try:
+        output = run_gh("run", "view", run_id, "--repo", repo, "--log-failed")
+    except Exception:


Two issues here:

1. Overly broad exception type. run_gh raises subprocess.CalledProcessError on failure — catching Exception also swallows unexpected errors such as MemoryError or a programming bug inside run_gh itself. Prefer except subprocess.CalledProcessError: to be explicit about what you're tolerating.

2. Double error messages. When gh run view --log-failed fails, run_gh already prints "gh command failed (exit N): ..." to stderr before raising. Then this except block prints a second "WARNING: gh run view --log-failed failed; ..." message. Both appear in the same log for the same failure event.

To avoid the duplicate you could call run_gh with check=False and check whether the returned string is empty, but that conflates "command failed" with "no failed logs". The simplest targeted fix is:

Suggested change

except Exception:

except subprocess.CalledProcessError:

and add import subprocess at the top of the file (or rely on it being transitively available via utils). This keeps the warning message but narrows the catch to the expected failure mode.

claude · 2026-04-29T07:50:13Z

+          body=$(printf '%s\n' "$SUMMARY" \
+            | sed 's/:rotating_light:/🚨/g' \
+            | sed 's/:warning:/⚠️/g' \
+            | sed 's/:question:/❓/g' \
+            | sed 's/:hourglass:/⏳/g' \
+            | sed 's/\*\([^*]*\)\*/**\1**/g')


The sed pipeline converts Slack emoji codes and *bold* to GitHub Markdown equivalents, but leaves Slack's <url|text> link syntax unconverted. The track_ci_issues.py output produces lines like:

🚨 **New** — `TestSuite.TestName` (<https://github.com/.../issues/1|issue>)

The (<url|text>) part won't render as a hyperlink in GitHub Markdown — it will appear as literal angle-bracket text. Add a sed substitution to convert Slack links to Markdown:

Suggested change

body=$(printf '%s\n' "$SUMMARY" \

| sed 's/:rotating_light:/🚨/g' \

| sed 's/:warning:/⚠️/g' \

| sed 's/:question:/❓/g' \

| sed 's/:hourglass:/⏳/g' \

| sed 's/\*$[^*]*$\*/**\1**/g')

body=$(printf '%s\n' "$SUMMARY" \

| sed 's/:rotating_light:/🚨/g' \

| sed 's/:warning:/⚠️/g' \

| sed 's/:question:/❓/g' \

| sed 's/:hourglass:/⏳/g' \

| sed 's/\*$[^*]*$\*/**\1**/g' \

| sed 's/<$[^|>]*$|$[^>]*$>/[\2](\1)/g')

IvoDD · 2026-04-29T14:40:27Z

+            run_gh("api", "repos/owner/repo")
+        captured = capsys.readouterr()
+        assert "gh command failed" in captured.err
+        assert "Bad credentials" in captured.err


I feel like these tests are a bit too much for a simple utility to run a github command.

ognyanstoimenov

Once this is up and running and it's verified to be working as expected we should probably get rid of the old flaky test tracking. I.e. revert #2862 and #2571

ognyanstoimenov · 2026-05-04T10:49:13Z

+            run_gh("api", "repos/owner/repo")
+        captured = capsys.readouterr()
+        assert "gh command failed" in captured.err
+        assert "Bad credentials" in captured.err


Add a flow for tracking flaky CI issues

f7473db

claude Bot reviewed Mar 17, 2026

View reviewed changes

Change existing failure flow

261b39b

github-advanced-security AI found potential problems Mar 17, 2026

View reviewed changes

Comment thread .github/workflows/failure_notification.yaml Fixed

G-D-Petrov changed the title ~~Add a flow for tracking flaky CI issues~~ Extend existing failure notifications for tracking flaky CI issues Mar 17, 2026

claude Bot reviewed Mar 17, 2026

View reviewed changes

IvoDD reviewed Apr 24, 2026

View reviewed changes

Refactor CI failure notification workflow

46fb47e

github-advanced-security AI found potential problems Apr 24, 2026

View reviewed changes

Comment thread .github/workflows/failure_notification.yaml Fixed

claude Bot reviewed Apr 24, 2026

View reviewed changes

Georgi Petrov added 2 commits April 24, 2026 13:14

Support manual calls from PR

9ced6fe

Handle test timeouts

d075f88

claude Bot reviewed Apr 24, 2026

View reviewed changes

G-D-Petrov added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 24, 2026

Fix issues

797b20f

claude Bot reviewed Apr 24, 2026

View reviewed changes

Improve security on analyse-failure trigger

436b1a4

IvoDD reviewed Apr 27, 2026

View reviewed changes

Address PR comments

2c409f6

claude Bot reviewed Apr 29, 2026

View reviewed changes

Improve error handling

e2dc407

claude Bot reviewed Apr 29, 2026

View reviewed changes

Unify the slack/GH summary outputs

dd3c346

IvoDD approved these changes Apr 29, 2026

View reviewed changes

ognyanstoimenov approved these changes May 4, 2026

View reviewed changes

G-D-Petrov merged commit 2ef9463 into master May 4, 2026
222 of 225 checks passed

G-D-Petrov deleted the gpetrov/flaky_ci_issues branch May 4, 2026 11:26

	track-failures:
	if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.head_branch == 'master'

	gh issue create --repo "$REPO" \
	body="$body"$'\n\n'"This test has been detected as failing in CI."


		echo "Fetching failed jobs for run $RUN_ID..."

		failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \

-          --run-id "${{ steps.meta.outputs.run_id }}"
+      - name: Parse CI failures
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          META_RUN_ID: ${{ steps.meta.outputs.run_id }}
+          META_REPO: ${{ github.repository }}
+        run: |
+          python3 .github/scripts/parse_ci_failures.py \
+            --run-id "$META_RUN_ID" \
+            --repo "$META_REPO" \
+            --output-dir /tmp/ci_failures

-      contains(github.event.comment.body, '/analyse-failures')
+      contains(github.event.comment.body, '/analyse-failures') &&
+      (github.event.comment.author_association == 'MEMBER' ||
+       github.event.comment.author_association == 'OWNER' ||
+       github.event.comment.author_association == 'COLLABORATOR')

		GROUPING_THRESHOLD = 10


		def run_gh(*args: str, check: bool = True) -> str:

Conversation

G-D-Petrov commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How failures are classified

Real-world examples

Example 1: Test failure — run 24664309055

Example 2: Infrastructure failure — run 24807645574

Example 3: Timeout — run 24841138872

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ArcticDB Code Review Summary

Uh oh!

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

G-D-Petrov commented Mar 17, 2026 •

edited

Loading

claude Bot commented Mar 17, 2026 •

edited

Loading

G-D-Petrov Apr 29, 2026 •

edited

Loading