Skip to content

Extend existing failure notifications for tracking flaky CI issues#2974

Merged
G-D-Petrov merged 10 commits into
masterfrom
gpetrov/flaky_ci_issues
May 4, 2026
Merged

Extend existing failure notifications for tracking flaky CI issues#2974
G-D-Petrov merged 10 commits into
masterfrom
gpetrov/flaky_ci_issues

Conversation

@G-D-Petrov

@G-D-Petrov G-D-Petrov commented Mar 17, 2026

Copy link
Copy Markdown
Collaborator

Replaces the inline bash in failure_notification.yaml with standalone Python
scripts that parse CI failures, track them as GitHub issues, and send Slack
notifications. The new implementation adds three capabilities that the bash
version lacked:

  • Timeout detection — test-runner steps that fail without producing test
    names are recognised as timeouts and tracked under a dedicated CI timeout
    issue instead of the generic "unparseable" bucket.
  • PR comment trigger — developers can comment /analyse-failures on a PR
    to analyse its most recent failed CI run and get results posted back as a
    comment.
  • Correct JSON escaping — Slack payloads are built with json.dumps
    instead of bash string interpolation, eliminating broken messages from
    special characters in test names.

How failures are classified

                  ┌─────────────────────────┐
                  │ Are there parsed test    │
                  │ names (FAILED/ERROR)?    │
                  └────────┬────────────────┘
                     yes   │          no
                           ▼          │
                  ┌──────────────┐    ▼
                  │ test_failure │  ┌──────────────────────────┐
                  │              │  │ Are there infra steps    │
                  │ Track each   │  │ (non-test-runner)?       │
                  │ test as a    │  └────────┬─────────────────┘
                  │ flaky-test   │     yes   │         no
                  │ issue        │           ▼         │
                  └──────────────┘  ┌──────────────┐   ▼
                                    │ infra_failure │ ┌────────────────────────┐
                                    │              │ │ Were test-runner steps  │
                                    │ Track each   │ │ stripped out?           │
                                    │ step as a    │ └────────┬───────────────┘
                                    │ flaky-step   │    yes   │        no
                                    │ issue        │          ▼        │
                                    └──────────────┘ ┌─────────┐       ▼
                                                     │ timeout │  ┌─────────┐
                                                     │         │  │ unknown │
                                                     │ Single  │  │         │
                                                     │ CI      │  │ Unparse-│
                                                     │ timeout │  │ able    │
                                                     │ issue   │  │ issue   │
                                                     └─────────┘  └─────────┘

--conclusion timed_out from GitHub forces the timeout path regardless.

Real-world examples

Example 1: Test failure — run 24664309055

Context: Building master on push, conclusion failure.
One failed job: 3.9 Linux / integration-DefaultCache, failed step: Run test.

Logs contain:

FAILED tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data

Pipeline output:

  • failure_kind.txttest_failure
  • failing_tests.txttests/integration/toolbox/test_library_tool.py::test_overwrite_append_data
  • failed_steps.txt → empty (Run test filtered as test-runner step)
  • Issue created: Flaky test: tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data with flaky-test label
  • Slack: :rotating_light: *New* — \tests/integration/toolbox/test_library_tool.py::test_overwrite_append_data``

Example 2: Infrastructure failure — run 24807645574

Context: Building master on schedule, conclusion failure.
One failed job: 3.11 Windows / compile (...), failed step: Install Required MSVC.

No test logs (compile job, not a test job).

Pipeline output:

  • failure_kind.txtinfra_failure
  • failing_tests.txt → empty
  • failed_steps.txtInstall Required MSVC
  • Issue created: Flaky step: Install Required MSVC with flaky-step label
  • Slack: :rotating_light: *New* — \Install Required MSVC``

Example 3: Timeout — run 24841138872

Context: Building master on workflow_dispatch, conclusion failure.
Two failed jobs:

  • 3.11 Windows / unit-DefaultCache — step Run test timed out after 120 min
  • 3.10 Windows / stress-DefaultCache — step Run test timed out

Logs show tests running (at 95% progress) but no FAILED/ERROR lines — the
runner was killed before completion.

Pipeline output:

  • failure_kind.txttimeout (only test-runner steps, no parsed test names)
  • failing_tests.txt → empty
  • failed_steps.txt → empty (Run test filtered)
  • Issue created: CI timeout with ci-failure label
  • Slack: :hourglass: *Timeout*

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
types: [completed]

jobs:
track-failures:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow triggers on any branch failure (workflow_run fires regardless of the branch the triggering run ran on). The PR description says "on the master branch" but the if: condition only checks conclusion == 'failure' — it does not filter on branch. This means flaky-test issues will be auto-created for failures on feature branches, release branches, forks, etc., generating noise.

Consider adding a branch filter:

Suggested change
track-failures:
if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.head_branch == 'master'

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do
job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name")
step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \
--jq '.steps[] | select(.conclusion == "failure") | .name')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log download silently succeeds even when it fails (|| true). If gh run view --log-failed fails (e.g. rate-limit, token permission issue), /tmp/failed_logs.txt may be empty or contain an error message, causing the grep patterns to produce zero results. The workflow would then fall through to the "unparseable failures" path and create a generic issue for every CI failure — this is the highest-noise fallback and should be avoided.

Consider logging a warning when the log download fails so at least the issue body contains diagnostic context.

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
echo "$FAILING_TESTS" | while IFS= read -r test_name; do
[ -z "$test_name" ] && continue

echo "Processing test: $test_name"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Injection risk via ${{ steps.parse.outputs.tests }}.

FAILING_TESTS is populated from parsed log output (test names) and then passed through ${{ steps.parse.outputs.tests }}. GitHub Actions interpolates ${{ ... }} expressions directly into the YAML before the runner executes the step. If a crafted test name contains shell metacharacters or YAML-breaking sequences, this could alter step behaviour.

The correct mitigation is to reference the output only via the environment variable ($FAILING_TESTS), which is already being done inside the run: script — the environment variable assignment itself is the risk. Prefer writing step outputs to a file (e.g. /tmp/failing_tests.txt) and reading that file in the subsequent step, rather than passing through ${{ steps.parse.outputs.tests }} as an env value.

The same pattern applies to FAILED_STEPS at line 197.

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
body="### Another failure observed"
body="$body"$'\n\n'"- **Run:** $RUN_URL"
body="$body"$'\n'"- **Commit:** \`${COMMIT_SHA:0:10}\`"
body="$body"$'\n'"- **Date:** $run_date"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub Search API (used by gh issue list --search) can return stale results with a lag of a few minutes. If two CI runs fail in quick succession for the same test, the deduplication search may find no existing issue for both runs, causing two issues to be created with the same title.

A safer deduplication approach would be to search by exact title match using gh issue list --search "is:issue is:open label:flaky-test \"$issue_title\"" and then verify the title server-side with --jq, which is what the code already does — but the underlying search index latency is unavoidable. This is worth noting in the workflow comments so maintainers understand the duplicates-on-simultaneous-failures edge case.

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
REPO: ${{ github.repository }}
COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
run: |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "unparseable failures" step creates a new issue per run with a unique title (CI failure (unparseable): run #$RUN_ID). If the log download frequently fails or the grep patterns never match (e.g. due to a log format change), this will create an unbounded number of issues — one for every CI failure.

Unlike the flaky-test and flaky-step steps, there is no deduplication here. Consider either deduplicating by searching for a generic title (e.g. CI failure (unparseable)) or enforcing a rate-limit / daily cap, or simply skipping issue creation and only writing to $GITHUB_STEP_SUMMARY for this fallback case.

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
body="$body"$'\n\n'"---"
body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
body="$body"$'\n'"*Add further failure occurrences as comments below.*"
gh issue create --repo "$REPO" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue body hard-codes "on the master branch" but the workflow is not restricted to master (see the branch-filter comment on line 9). This text will be misleading when issues are filed from non-master runs.

Suggested change
gh issue create --repo "$REPO" \
body="$body"$'\n\n'"This test has been detected as failing in CI."

Comment thread .github/workflows/flaky_CI_issue.yml Outdated
# Get failed jobs with their names and IDs
failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
--jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API call fetches up to 100 jobs (per_page=100) but does not handle pagination. If a workflow run has more than 100 jobs (unlikely but possible as the matrix grows), failures in jobs beyond position 100 will be silently missed. Consider adding --paginate if the gh api CLI supports it here, or at least document this limitation.

@claude

claude Bot commented Mar 17, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

Changes in this PR: Extracts inline bash logic from failure_notification.yaml into standalone Python scripts (parse_ci_failures.py, track_ci_issues.py, send_slack_notification.py) with unit tests. Adds a /analyse-failures slash command, a new resolve_pr_run.py script, failure_kind.txt failure classification, a dedicated handle_timeout method, and the autogenerated label on all auto-created issues.


Previously Flagged — Status After Latest Push

  • ✅ FIXED: GitHub Search special-char escaping — _sanitise_search_query strips [, ], :, /, ., " before the search query, and an exact title match loop filters results accurately.
  • ✅ FIXED: create_grouped_issue deduplication — now uses stable titles (Grouped CI test failures / Grouped CI step failures) with find_existing_issue, so repeated grouped failures comment on the same issue rather than creating a new one per run.
  • ✅ FIXED: status=failure query misses timed_out runs — find_latest_failed_run now loops over both failure and timed_out status values.
  • ✅ FIXED: Branch name not URL-encoded — resolve_pr_run.py now uses urllib.parse.quote(branch, safe="").
  • ✅ FIXED: /analyse-failures collaborator check — any-user trigger risk resolved.
  • ✅ FIXED: import re inside _sanitise_search_query function body — moved to module top-level in track_ci_issues.py.
  • ✅ FIXED: open(...).close() without context manager — those lines were removed entirely.
  • ✅ FIXED: run_date used current wall-clock time instead of actual CI run time — IssueTracker now accepts a run_date argument populated from created_at in the GitHub API response.
  • ✅ FIXED: Duplicated run_gh helper in every script — extracted to utils.py; all scripts import from it.
  • ✅ FIXED: except Exception: in download_failed_logs (parse_ci_failures.py) — now correctly catches subprocess.CalledProcessError only, eliminating duplicate error messages.
  • ✅ FIXED: Slack <url|text> links not converted to GitHub Markdown in notify-pr PR comment — notify-pr now reads github_summary output which is produced by SummaryEntry.to_github(), generating proper [text](url) hyperlinks natively. No sed conversion needed.
  • PARTIAL: Expression injection (head_branch) — WR_BRANCH is now passed as an environment variable rather than interpolated directly into a shell command, eliminating the practical injection risk. GitHub Advanced Security may still flag the ${{ github.event.workflow_run.head_branch }} expression in the env: block; the practical risk is negligible but the scanner flag will likely remain.

Open Issues

  • OPEN: head_branch GHAS flag (failure_notification.yaml, notify-slack job) — scanner will likely continue to flag ${{ github.event.workflow_run.head_branch }} even though it is now passed as an env var. Can be silenced with a comment-level suppression or by documenting why it is safe.

Minor / Informational

  • utils.py run_gh truncates the logged command to args[:3] for readability — if a failure is caused by an argument beyond position 3 (e.g. a bad --jq expression), the error message won't show it. Very minor.
  • test_ci_scripts.yml push trigger omits ".github/workflows/test_ci_scripts.yml" from the paths filter (unlike the pull_request trigger), so a direct push to master that only modifies the workflow file itself won't run the tests. Low-impact since changes to this repo flow through PRs.
  • find_latest_failed_run in resolve_pr_run.py interpolates workflow_name directly into the JQ expression string (f-string). The values are hardcoded in TRACKED_WORKFLOWS so there is no injection risk in practice, but it would be cleaner to use --arg.
  • handle_unparseable inlines the issue footer text rather than using _issue_footer(), producing slightly different wording. Not a bug, just minor inconsistency.
  • Note: The notify-pr heredoc <<EOF works correctly — YAML literal block scalars strip the base indentation before passing the script to the shell, so the EOF terminator lands at column 0. $SUMMARY expansion in the unquoted heredoc is safe: bash does not recursively scan the output of variable expansion for backtick command substitutions.

Checklist

Security:

  • ✅ FIXED: /analyse-failures collaborator check — any-user trigger risk resolved
  • GHAS still flags head_branch expression (failure_notification.yaml, notify-slack job) — low practical risk but needs resolution or documented suppression

Correctness:

  • ✅ FIXED: Slack <url|text> links now render correctly in PR comments via to_github() format

Testing:

  • DONE: Unit tests added and cover parse_ci_failures, track_ci_issues, resolve_pr_run, send_slack_notification, utils
  • DONE: New test_ci_scripts.yml workflow runs Python unit tests on PRs touching .github/scripts/

Comment thread .github/workflows/failure_notification.yaml Fixed
@G-D-Petrov G-D-Petrov changed the title Add a flow for tracking flaky CI issues Extend existing failure notifications for tracking flaky CI issues Mar 17, 2026
jobs:
on-failure:
track-failures:
runs-on: ubuntu-latest

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The track-failures job has no branch restriction — it fires on failures from any branch (feature branches, forks' PRs, etc.). This will create GitHub issues for every flaky test run on any contributor's branch, generating significant noise.

The original workflow explicitly filtered to master or '' (scheduled runs). That filter should be preserved here, or at minimum added as a comment explaining the intentional choice to track all branches.

Suggested change
runs-on: ubuntu-latest
if: >-
(github.event.workflow_run.conclusion == 'failure' ||
github.event.workflow_run.conclusion == 'timed_out') &&
(github.event.workflow_run.head_branch == 'master' ||
github.event.workflow_run.head_branch == '')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the track-failures run on PR branches only after a human presses a button. Otherwise we will fill up github with false positives (e.g. I pushed a small change to my PR but broke all my tests because of dumb bug)


echo "Fetching failed jobs for run $RUN_ID..."

failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jobs API is called with per_page=100 but without pagination. A large matrix workflow can exceed 100 jobs (ArcticDB's build matrix is already sizeable). Any failures in jobs beyond position 100 will be silently missed, causing the issue tracker to never see those test failures.

Consider using gh api --paginate or at least adding a comment noting the limit. Note that --paginate with --jq requires using jq -s to merge pages, e.g.:

failed_jobs_json=$(gh api --paginate "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \
  --jq '[.jobs[] | select(.conclusion == "failure") | {id, name}]' | jq -s 'add // []')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with claude's comment

echo "$failed_steps"

# --- Download logs and parse test failures ---
gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The || true here silently swallows failures from gh run view --log-failed. If this command fails — due to a rate limit, insufficient token permissions (actions: read is granted, but log access can be restricted on forks), or a transient API error — the file /tmp/failed_logs.txt will either be empty or contain an error message like HTTP 403. The subsequent grep calls will match nothing, both FAILING_TESTS and FAILED_STEPS will be empty, and the workflow will fall through to the unparseable fallback, creating a new ci-failure issue for every such run.

At minimum, log the exit code or file content so failures are visible in the step logs:

gh run view "$RUN_ID" --repo "$REPO" --log-failed > /tmp/failed_logs.txt 2>&1 || {
  echo "WARNING: gh run view --log-failed failed (exit $?); test name parsing will be skipped"
}

RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
REPO: ${{ github.repository }}
COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
FAILING_TESTS: ${{ steps.parse.outputs.tests }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: step output injection via ${{ ... }} expression interpolation.

FAILING_TESTS and FAILED_STEPS are populated from parsed CI log content — test names and step names that ultimately come from source code, commit messages, and workflow definitions. GitHub Actions evaluates ${{ steps.parse.outputs.tests }} by substituting the value directly into the YAML before the runner executes the step.

If a test name or step name contains characters that break the YAML/shell context (e.g. a test name with \n, backticks, or $(...) sequences), this can alter step behaviour. While the workflow_run trigger somewhat limits exposure (only code on this repo can influence test names), it is still best practice to avoid interpolating step outputs directly into env: blocks.

The safe pattern is to write the parsed outputs to a file in the parse step and read that file in the issues step:

# In parse step:
echo "$failing_tests" > /tmp/failing_tests.txt
echo "$infra_steps"   > /tmp/failed_steps.txt

# In issues step (no ${{ }} interpolation of untrusted content):
FAILING_TESTS=$(cat /tmp/failing_tests.txt)
FAILED_STEPS=$(cat /tmp/failed_steps.txt)

This avoids the ${{ }} interpolation path entirely for untrusted content.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine because of the somewhat strict regex pattern matching, but still worth considering this claude's suggestion. I guess if we extract the above to bash script we can pass --output-dir to it where it will put all the text files needed for next steps.


# --- Fallback for unparseable failures ---
if [ -z "$FAILING_TESTS" ] && [ -z "$FAILED_STEPS" ]; then
issue_title="CI failure (unparseable): run #$RUN_ID"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unparseable-failure fallback creates a new issue for every CI run using a unique title (CI failure (unparseable): run #$RUN_ID). There is no deduplication here.

The primary scenario that triggers this path is when the log download fails (see the || true concern on line 68). If that happens consistently — e.g. due to a log permission change or gh CLI upgrade breaking the --log-failed flag — every CI failure on master will produce a new issue indefinitely.

Consider one of:

  1. Writing to $GITHUB_STEP_SUMMARY only, instead of opening an issue, so humans can investigate without issue spam.
  2. Searching for a generic open issue titled CI failures (unparseable) and adding a comment to it rather than creating a new issue per run.
  3. Capping at a daily or weekly issue (use a date-stamped title like CI failure (unparseable): YYYY-MM-DD).

(github.event.workflow_run.head_branch == 'master' ||
github.event.workflow_run.head_branch == '') &&
always() &&
needs.track-failures.result != 'skipped'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notify-slack condition has a subtle logic gap. The combination of always() and needs.track-failures.result != 'skipped' means this job will also fire when track-failures itself fails (e.g. due to a permissions error, API rate limit, or script bug). In that case needs.track-failures.outputs.slack_summary will be empty, and the Slack message will be sent but with no failure details — potentially confusing on-call engineers.

Consider adding needs.track-failures.result != 'failure' to the condition, or at minimum handle the case where SUMMARY is empty by adding a fallback message indicating that issue tracking itself failed:

    if: >-
      (github.event.workflow_run.head_branch == 'master' ||
       github.event.workflow_run.head_branch == '') &&
      always() &&
      needs.track-failures.result != 'skipped' &&
      needs.track-failures.result != 'cancelled'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should indeed notify in case of track-failures failure (otherwise we'll be losing coverage of failures silently), but we should probably have a friendly error message in that case.

"type": "section",
"text": {
"type": "mrkdwn",
"text": "$icon *$WORKFLOW_NAME* $CONCLUSION on \`$BRANCH\`"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shell variables injected unescaped into a JSON string literal inside a heredoc.

$WORKFLOW_NAME, $CONCLUSION, and $BRANCH are substituted directly into the JSON payload string. If any of these contain double quotes, backslashes, or newlines (e.g. a workflow name like Build "with" quotes), the resulting JSON will be malformed and curl will send an invalid payload, silently failing with -f.

The escaped_summary is already being JSON-escaped via python3. Apply the same treatment to the other variables before inserting them:

esc_workflow=$(printf '%s' "$WORKFLOW_NAME" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_conclusion=$(printf '%s' "$CONCLUSION"    | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")
esc_branch=$(printf '%s' "$BRANCH"           | python3 -c "import sys,json; print(json.dumps(sys.stdin.read())[1:-1])")

Then use $esc_workflow, $esc_conclusion, $esc_branch in the heredoc.

jobs:
on-failure:
track-failures:
runs-on: ubuntu-latest

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the track-failures run on PR branches only after a human presses a button. Otherwise we will fill up github with false positives (e.g. I pushed a small change to my PR but broke all my tests because of dumb bug)

GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
run: |
gh label create "flaky-test" --repo "$REPO" --color "FBCA04" --description "Flaky test auto-tracked by CI" 2>/dev/null || true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use an autogenerated label for all of these? (So that we can filter and remove all in case of noise or bugs)

Or prefix all these labels with autogenerated e.g. autogenerated-flaky-test

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe rename the workflow


echo "Fetching failed jobs for run $RUN_ID..."

failed_jobs_json=$(gh api "repos/$REPO/actions/runs/$RUN_ID/jobs?filter=latest&per_page=100" \

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with claude's comment

for job_id in $(echo "$failed_jobs_json" | jq -r '.[].id'); do
job_name=$(echo "$failed_jobs_json" | jq -r ".[] | select(.id == $job_id) | .name")
step_names=$(gh api "repos/$REPO/actions/jobs/$job_id" \
--jq '.steps[] | select(.conclusion == "failure") | .name')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think large bash scripts should live standalone in github workflows.
What do you think about extracting that to a separate file and calling into it with interface like:

track_failures --run-id something

and here we can just parse the stdout.

This will also make manual testing easier. E.g. you can run the track_failures script locally against a few example failures and see the output is correct. I think including such evidence in PR description would also be helpful.

RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}
REPO: ${{ github.repository }}
COMMIT_SHA: ${{ github.event.workflow_run.head_sha }}
FAILING_TESTS: ${{ steps.parse.outputs.tests }}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine because of the somewhat strict regex pattern matching, but still worth considering this claude's suggestion. I guess if we extract the above to bash script we can pass --output-dir to it where it will put all the text files needed for next steps.

body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
body="$body"$'\n'"*Add further failure occurrences as comments below.*"

track_item "$test_name" "Flaky test" "flaky-test" "$body"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think opening a separate issue for each test can be problematic.

Most probably if >10 tests fail together their failures are correlated. This can happen in cases:

  • Someone accidentally breaks 1000 tests on their PR and fat fingers the button to track failure
  • Some fixture or shared setup breaks and results in dozens of failed tests

I guess we can try to produce a single item if >10 failures.

body="$body"$'\n'"*This issue was automatically created by the flaky test tracker workflow.*"
body="$body"$'\n'"*Add further failure occurrences as comments below.*"

track_item "$step_entry" "Flaky step" "flaky-step" "$body"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general gh action outage can cause a lot of steps to fail together. Grouping them if many makes sense.

(github.event.workflow_run.head_branch == 'master' ||
github.event.workflow_run.head_branch == '') &&
always() &&
needs.track-failures.result != 'skipped'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should indeed notify in case of track-failures failure (otherwise we'll be losing coverage of failures silently), but we should probably have a friendly error message in that case.

Comment thread .github/workflows/failure_notification.yaml Fixed
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: >
python3 .github/scripts/parse_ci_failures.py
--run-id "${{ steps.meta.outputs.run_id }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run_id from a workflow_dispatch is user-supplied and still reaches the shell via ${{ steps.meta.outputs.run_id }} expression interpolation (GitHub Actions evaluates ${{ }} before the runner executes the step). If a user inputs a value containing ", `, or $(), it may break the quoted argument or execute shell code.

The meta step already follows the safe pattern of routing context values through env: variables. Apply the same here:

Suggested change
--run-id "${{ steps.meta.outputs.run_id }}"
- name: Parse CI failures
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
META_RUN_ID: ${{ steps.meta.outputs.run_id }}
META_REPO: ${{ github.repository }}
run: |
python3 .github/scripts/parse_ci_failures.py \
--run-id "$META_RUN_ID" \
--repo "$META_REPO" \
--output-dir /tmp/ci_failures

Same fix applies to the identical pattern in the "Create or update issues" step (line 85).

"--json", "number,title",
)
if not raw:
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub Search treats [, ], :, /, and . as special syntax. Pytest parametrized test names like tests/test_foo.py::test_bar[param1-param2] contain all of these. The search query '"tests/test_foo.py::test_bar[param1-param2]" in:title' may return zero results even when an exact-title issue exists, causing find_existing_issue to return None and create a duplicate issue on every subsequent run.

The exact-title match on line 63 (if issue["title"] == title) is correct dedup logic, but it only runs if the search returns any results. Consider broadening the search (e.g. drop brackets/colons from the query term) and relying solely on the exact-match filter, or escaping search-special characters before passing the title to --search.

Comment thread .github/scripts/track_ci_issues.py Outdated

if item_type == "test":
label = "flaky-test"
heading = f"Grouped CI test failures ({count} tests)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grouped-issue title embeds run #{self.run_id}, making it unique per run. This means grouped failures have no deduplication: if 15+ tests fail consistently across many CI runs, a new issue is created for every single run — the same unbounded issue growth the handle_unparseable fix resolved.

Consider using a stable title (e.g. f"Grouped CI {item_type} failures") and applying the same find-or-create pattern as track_item / handle_unparseable: search for an open issue with that title, add a comment if found, create only if not found.

Comment thread .github/scripts/resolve_pr_run.py Outdated
raw = run_gh(
"api", "--paginate",
f"repos/{repo}/actions/runs"
f"?branch={branch}&status=failure&per_page=5",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues on this line:

1. status=failure misses timed_out runs

The GitHub Actions Runs API treats failure and timed_out as separate conclusion (and status) values. Querying status=failure will not return runs that concluded as timed_out. The code then checks run["conclusion"] in ("failure", "timed_out") on the results, but timed_out runs are never returned by this query, so they will always be missed.

Fix: query with status=completed and filter locally, or make a second query with status=timed_out:

for status in ("failure", "timed_out"):
    raw = run_gh(
        "api", "--paginate",
        f"repos/{repo}/actions/runs?branch={branch_encoded}&status={status}&per_page=10",
        ...
    )

2. Branch name is not URL-encoded

branch comes from .head.ref via get_pr_head(). A branch named feat/foo&status=success would inject extra query parameters into the URL, returning wrong results (or all runs regardless of status). Apply URL encoding:

from urllib.parse import quote
branch_encoded = quote(branch, safe='')

@G-D-Petrov G-D-Petrov added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Apr 24, 2026
if: >-
github.event_name == 'issue_comment' &&
github.event.issue.pull_request &&
contains(github.event.comment.body, '/analyse-failures')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /analyse-failures command has no collaborator check. Any GitHub user can comment /analyse-failures on any PR, causing the workflow to consume actions: read quota, call gh issue list/create/comment (using issues: write), and post a PR comment (using pull-requests: write). On a public repo this is open to the world.

Consider gating on author_association:

Suggested change
contains(github.event.comment.body, '/analyse-failures')
contains(github.event.comment.body, '/analyse-failures') &&
(github.event.comment.author_association == 'MEMBER' ||
github.event.comment.author_association == 'OWNER' ||
github.event.comment.author_association == 'COLLABORATOR')

This still allows all repo members/maintainers while blocking external actors.


jobs:
on-failure:
resolve-pr-run:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with triggering via PR comments but I think it has a couple downsides:

  • Pollutes the PR description page with things not related to the PR (e.g. flaky master tests)
  • Can't easily trigger for a failure before the last

I guess I was thinking of adding a button to the job summary of a failed "Build and Test" or "Build with conda" or other like:

    failure-analysis-link:                                                                     
      needs: [cpp-test-linux, cpp-test-windows, cpp-test-macos, build-python-wheels-linux,     
              build-python-wheels-windows, build-python-wheels-macos,                          
              persistent_storage_verify_linux, persistent_storage_verify_windows]              
      if: always() && failure()                                                                
      runs-on: ubuntu-latest                                                                   
      steps:                                                                                   
        - name: Write failure analysis link to summary                                         
          run: |
            
  url="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/workflows/failure_notification.yaml"  
            cat >> "$GITHUB_STEP_SUMMARY" <<'EOF'
            ## Failure Analysis                                                                
                             
            EOF
            echo "[**Analyse failures →**]($url)" >> "$GITHUB_STEP_SUMMARY"                    
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo "Run ID: \`${GITHUB_RUN_ID}\`" >> "$GITHUB_STEP_SUMMARY"

I don't feel strongly about this, but using a link like above would allow us to delete reselov_pr_run.py and simplify logic here

I guess using both activation mechanisms is also an option

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I prefer the implemented approach a bit more because:

  1. It doesn't require changes to "Build and Test", "Build with conda", etc
  2. Makes flaky tests a bit more clear in the PR itself, thus easier to see for reviewers

notify-slack:
runs-on: ubuntu-latest
needs: [track-failures]
# Only send Slack notifications for automatic master triggers, not manual dispatch

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be argued that manual dispatch calls on PRs are still flakyness present on master.

Still probably better to keep the notifications just for master to avoid excessive noise.

Comment thread .github/scripts/parse_ci_failures.py Outdated
def download_failed_logs(repo: str, run_id: str, output_path: str) -> bool:
"""Download logs for the failed run. Returns True on success."""
result = subprocess.run(
["gh", "run", "view", run_id, "--repo", repo, "--log-failed"],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can use run_gh method

Comment thread .github/scripts/parse_ci_failures.py Outdated
)
with open(output_path, "w") as f:
f.write(result.stdout)
if result.returncode != 0:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add some error handling in the general run_gh instead of just this one gh run view command

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tests ever ran on the CI?
I think it makes sense to enforce they pass before merging a PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a flow to test them

Comment on lines +77 to +82
req = urllib.request.Request(
webhook_url,
data=data,
headers={"Content-Type": "application/json"},
method="POST",
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does authentication against the slack webhook work? I see we in gh workflow we have the url itself as a secret.

Any chance urllib exception will leak the url in logs or something similar? It seems like in exception handling we at least don't log the url which is good.

@G-D-Petrov G-D-Petrov Apr 29, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added extra handling for the URLError.

But in any case, since SLACK_WEBHOOK_URL is a repository secret, GitHub Actions automatically masks its value in all log output. Any occurrence of the secret's value gets replaced with ***. So even if it did leak through a traceback, it would be masked.

- name: Checkout repository
uses: actions/checkout@v4

- name: Send Slack notification

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were previously using an existing action ravsamhq/notify-slack-action@be814b201e233b2dc673608aa46e5447c8ab13f2.
Any reason to write our own slack notification code instead of building the message and using the existing action?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ravsamhq/notify-slack-action doesn't support what we need. It uses Slack's legacy attachments format with a fixed schema — no Block Kit support and no way to pass an arbitrary payload. We'd lose the structured summary (per-test
lines, grouped failure messages, timeout indicators) and the tracker-failure warning.

Comment thread .github/scripts/track_ci_issues.py Outdated
GROUPING_THRESHOLD = 10


def run_gh(*args: str, check: bool = True) -> str:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is defined in more than one place, maybe we can add a common method in utils.py?

Comment thread .github/scripts/track_ci_issues.py Outdated
self.run_id = run_id
self.run_url = run_url
self.short_sha = commit_sha[:10]
self.run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run might be e.g. a week old, so the time of report might be misleading. Would be useful to use the time of the failure so we can cross reference against e.g. github outages.

Comment thread .github/scripts/parse_ci_failures.py Outdated
"""Download logs for the failed run. Returns True on success."""
try:
output = run_gh("run", "view", run_id, "--repo", repo, "--log-failed")
except Exception:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues here:

1. Overly broad exception type. run_gh raises subprocess.CalledProcessError on failure — catching Exception also swallows unexpected errors such as MemoryError or a programming bug inside run_gh itself. Prefer except subprocess.CalledProcessError: to be explicit about what you're tolerating.

2. Double error messages. When gh run view --log-failed fails, run_gh already prints "gh command failed (exit N): ..." to stderr before raising. Then this except block prints a second "WARNING: gh run view --log-failed failed; ..." message. Both appear in the same log for the same failure event.

To avoid the duplicate you could call run_gh with check=False and check whether the returned string is empty, but that conflates "command failed" with "no failed logs". The simplest targeted fix is:

Suggested change
except Exception:
except subprocess.CalledProcessError:

and add import subprocess at the top of the file (or rely on it being transitively available via utils). This keeps the warning message but narrows the catch to the expected failure mode.

Comment on lines +238 to +243
body=$(printf '%s\n' "$SUMMARY" \
| sed 's/:rotating_light:/🚨/g' \
| sed 's/:warning:/⚠️/g' \
| sed 's/:question:/❓/g' \
| sed 's/:hourglass:/⏳/g' \
| sed 's/\*\([^*]*\)\*/**\1**/g')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sed pipeline converts Slack emoji codes and *bold* to GitHub Markdown equivalents, but leaves Slack's <url|text> link syntax unconverted. The track_ci_issues.py output produces lines like:

🚨 **New** — `TestSuite.TestName` (<https://github.com/.../issues/1|issue>)

The (<url|text>) part won't render as a hyperlink in GitHub Markdown — it will appear as literal angle-bracket text. Add a sed substitution to convert Slack links to Markdown:

Suggested change
body=$(printf '%s\n' "$SUMMARY" \
| sed 's/:rotating_light:/🚨/g' \
| sed 's/:warning:/⚠️/g' \
| sed 's/:question:/❓/g' \
| sed 's/:hourglass:/⏳/g' \
| sed 's/\*\([^*]*\)\*/**\1**/g')
body=$(printf '%s\n' "$SUMMARY" \
| sed 's/:rotating_light:/🚨/g' \
| sed 's/:warning:/⚠️/g' \
| sed 's/:question:/❓/g' \
| sed 's/:hourglass:/⏳/g' \
| sed 's/\*\([^*]*\)\*/**\1**/g' \
| sed 's/<\([^|>]*\)|\([^>]*\)>/[\2](\1)/g')

run_gh("api", "repos/owner/repo")
captured = capsys.readouterr()
assert "gh command failed" in captured.err
assert "Bad credentials" in captured.err

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like these tests are a bit too much for a simple utility to run a github command.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ognyanstoimenov ognyanstoimenov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this is up and running and it's verified to be working as expected we should probably get rid of the old flaky test tracking. I.e. revert #2862 and #2571

run_gh("api", "repos/owner/repo")
captured = capsys.readouterr()
assert "gh command failed" in captured.err
assert "Bad credentials" in captured.err

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@G-D-Petrov G-D-Petrov merged commit 2ef9463 into master May 4, 2026
222 of 225 checks passed
@G-D-Petrov G-D-Petrov deleted the gpetrov/flaky_ci_issues branch May 4, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-release-notes This PR shouldn't be added to release notes. patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants