Skip to content

Add request check to startup smoke test#6084

Merged
falkoschindler merged 4 commits into
zauberzeug:mainfrom
SSDWGG:fix/startup-smoke-request
Jun 15, 2026
Merged

Add request check to startup smoke test#6084
falkoschindler merged 4 commits into
zauberzeug:mainfrom
SSDWGG:fix/startup-smoke-request

Conversation

@SSDWGG

@SSDWGG SSDWGG commented May 30, 2026

Copy link
Copy Markdown
Contributor

Summary

  • run each startup target under the existing 10 second timeout in the background
  • wait for the server ready log, then issue a real HTTP request to the detected local server URL
  • fail the startup smoke test on request failures or HTTP 5xx responses while preserving the existing log checks

Fixes #6079.

Testing

  • bash -n test_startup.sh
  • git diff --check HEAD~1 HEAD
  • PATH=/tmp/gh-pr-test-bin:$PATH bash -c "$(awk '/^check main.py/{exit} {print}' test_startup.sh); check examples/progress/main.py"
  • PATH=/tmp/gh-pr-test-bin:$PATH bash -c "$(awk '/^check main.py/{exit} {print}' test_startup.sh); check examples/single_page_app/main.py"

The last two local checks used a small GNU timeout shim because macOS does not provide the CI timeout command by default.

@evnchn

evnchn commented May 30, 2026

Copy link
Copy Markdown
Collaborator

Thanks for your PR. Looks quite good.

Well I already have a private attempt at evnchn#163

Will reconcile and see how to take the best of both worlds once have time. But I feel like your work will dominate.

Rest assured you will be credited no matter what happens.

@falkoschindler falkoschindler added infrastructure Type/scope: GitHub Actions and IDE review Status: PR is open and needs review labels May 31, 2026
@falkoschindler falkoschindler added this to the 3.13 milestone May 31, 2026
@evnchn

evnchn commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Let's also do this in early 3.14 instead

@evnchn evnchn modified the milestones: 3.13, Next Jun 3, 2026
@SSDWGG

SSDWGG commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Sounds good, thanks. I'm happy for this to be considered for early 3.14.

@falkoschindler falkoschindler modified the milestones: Next, 3.14 Jun 5, 2026
evnchn
evnchn previously approved these changes Jun 12, 2026

@evnchn evnchn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted by Claude Code on Evan's behalf.

TL;DR: Approving — we verified both directions empirically, including a reproduction of the exact historical bug that motivated #6079. This supersedes evnchn#163 (our own draft for the same issue) and covers everything it did, in one respect better: the URL is extracted from the server log instead of a hard-coded port list. @falkoschindler — could you do a go-around on this one soon? The thread converged on early-3.14, and test fixtures merging early means they get to find bugs for the whole cycle; these crucial CI components also deserve more eyes early.

Empirical verification

1. Negative path — a request-time 500 that boots cleanly (a NiceGUI app whose / page raises):

script result
test_startup.sh on main exit 0 — bug ships undetected (confirms the #6079 premise)
this PR exit 1 — caught twice over: returned HTTP 500 from the new request check, and the server traceback now lands in the captured output

2. Gold standard — the historical #6061 bug (checkout 5328b594, the #5960 merge commit, with this PR's script):

http://localhost:8080/ returned HTTP 500
|   File ".../nicegui/storage.py", line 141, in _create_user_storage
| TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

exit 1 — the exact regression class that shipped green through the old smoke test is now caught.

3. Code audit: all new failure-prone statements are set -euo pipefail-safe (|| exitcode=1 guards; the [[ ]] && var= patterns fall under the &&-list exemption), the ready-poll (50×0.2 s) is aligned with the 10 s timeout, only ≥500 fails (404-tolerant for apps not serving /), the temp file is cleaned before the failure block, the check-target list and 3-attempt retry wrapper are untouched, and CI green covers the positive path on every real target including examples/fastapi on port 8000 (covered by the log-extracted URL).

Nice work @SSDWGG — this held up under adversarial testing on the first try. 🎉

@falkoschindler falkoschindler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the contribution, @SSDWGG!
Unfortunately there are still some major issues that need to be fixed before merging:

  1. The startup check fails for the fastapi example — the PR would fail the merge queue (blocking) — full-test never ran on this PR (it only runs in the merge queue), so I ran test_startup.sh in a CI-like Linux container (Debian, uv sync --locked, same invocation as the workflow). examples/fastapi fails deterministically on all 3 attempts with Failed to request http://127.0.0.1:8000/: <urlopen error [Errno 111] Connection refused>: with reload=True, uvicorn's reloader parent logs Uvicorn running on http://127.0.0.1:8000 before the worker process has bound the socket, so the single-shot request fires into the ~1 s spawn-and-import gap. Probing confirmed it: immediately after the ready line the request is refused; one second later it returns 200. The gap is the worker's startup time, so fast CI runners hit it too. Fix: retry refused connections, e.g. curl --retry 5 --retry-connrefused --retry-delay 1 (with the urllib helper it would need a manual retry loop).

  2. The threaded_nicegui example fails too — it exits before the request (blocking) — examples/threaded_nicegui runs the server in a daemon thread while the main thread reads stdin. With CI's closed stdin, input() raises EOFError immediately, the main thread exits cleanly, and the daemon server dies — after the ready line was printed. The old script passed this (clean exit, ready line present); the request check now gets connection refused from a dead process, deterministically, in both the urllib and curl variants. Retries don't help. Fix: keep stdin open for the test window, e.g. launch the background job with < <(sleep 15) — this emulates an interactive user who hasn't typed yet and is a no-op for all other examples (verified: the example passes with it).

  3. Request failure reason is missing from the failure dump (major) — the helper's error message goes to the script's stderr, so it appears above the "Wrong exit code 1. Output was:" dump while the dump itself shows a clean server log with no hint of why the check failed (see the fastapi failure above). Note for the fix: simply appending to the captured log file does not work — the server's > redirect keeps writing at its own file offset and overwrites the appended text (verified). Collect the request output separately and append it to $output before the dump.

  4. Race between the request and the 10-second timeout (major) — the request only starts after the ready line appears, but the timeout 10 clock has been running since launch. A heavy example that becomes ready at ~8–9 s leaves too little window: the server gets SIGTERMed mid-request, and since slow startup is systematic rather than random, the 3 retries in check won't save it. (The connection-refused retries from finding 1 widen this exposure slightly.) Options: raise the timeout (e.g. 15 s) while keeping the 10 s ready-poll, or kill the server right after a successful request and treat exit 143 like 124 — which would also cut the job's wall clock roughly in half, at the cost of no longer catching errors that appear late in the window.

  5. Replace the bash-Python mashup with curl (major) — the 20-line Python heredoc spawned via uv run --no-sync python is barely readable, depends on the example's environment, adds an interpreter start inside the already-tight timeout window, and contains a dead branch (urlopen raises HTTPError for any status ≥ 400, so the in-with response.status >= 500 check is unreachable). The whole helper reduces to:

    request() {
        local url="$1" code
        # NOTE servers with reload=True log their URL before the socket accepts connections, so retry refused connections
        code=$(curl --silent --location --retry 5 --retry-connrefused --retry-delay 1 --output /dev/null --write-out '%{http_code}' --max-time 5 "$url" || true)
        if [[ $code -lt 200 || $code -ge 500 ]]; then
            echo "Request to $url failed with HTTP status $code"
            return 1
        fi
    }

    Semantics match the Python version: transport errors (000) and 5xx fail, 4xx is tolerated, redirects are followed (--location, matching urllib). curl is preinstalled on GitHub runners and macOS, so the local-testing story is no worse.

    Verified locally end-to-end: a healthy NiceGUI app passes (URL detection picks the ready-line port, not the fallback), and a test app raising on its first GET / is caught with the traceback and request failure line in the dump — the regression class of #6061.

@falkoschindler falkoschindler added in progress Status: Someone is working on it and removed review Status: PR is open and needs review labels Jun 12, 2026
@evnchn

evnchn commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Sorry for this one. I think focused on "Is this superset of evnchn#163" for the winning condition for this PR in the review. Thanks for your look @falkoschindler 🙇

@SSDWGG

SSDWGG commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review. I pushed an update that addresses the requested startup-script changes:

  • replaced the Python request helper with curl
  • added retry handling for reload servers that log before accepting connections
  • kept stdin open during the test window for examples that read from input()
  • captured request failure output and appended it to the final failure dump
  • raised the timeout to 15 seconds while keeping the existing readiness polling window

I could run bash -n test_startup.sh locally. Full local example verification was blocked by local environment issues (timeout is not available on macOS, and uv sync --locked timed out fetching poetry-dynamic-versioning from PyPI), so I’m relying on the Ubuntu CI run for the end-to-end startup matrix.

@falkoschindler falkoschindler added review Status: PR is open and needs review and removed in progress Status: Someone is working on it labels Jun 15, 2026
@falkoschindler falkoschindler self-requested a review June 15, 2026 11:56
falkoschindler

This comment was marked as off-topic.

The ready-poll loop capped at 50 x 0.2 s = 10 s while the process timeout
was 15 s, so a server taking 10-15 s to log its ready line would be marked
failed even though it was still starting. Derive the poll window from a
single timeout_seconds constant so it covers the full timeout window, and
drop the coupled magic numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@falkoschindler falkoschindler left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tightening this up! The smoke test now exercises the request path instead of only the boot log, which is exactly the gap that let #6061 slip through.

I ran the full suite in a CI-like Linux container against the latest state and all 54 checks pass with no retries — including examples/fastapi (the reload-race that the connection-refused retries fix) and examples/threaded_nicegui (which needed the open stdin so it doesn't exit before the request).

I pushed one small follow-up in 2d1b116 to align the ready-poll window with the run timeout: the poll loop was capped at 10 s while the process timeout is 15 s, so a server taking 10–15 s to log its ready line would have been marked failed even though it was still starting. It now derives the poll window from a single timeout_seconds constant.

LGTM — approving. 🎉

@falkoschindler falkoschindler enabled auto-merge June 15, 2026 14:07
@falkoschindler falkoschindler added this pull request to the merge queue Jun 15, 2026
Merged via the queue into zauberzeug:main with commit 049871f Jun 15, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure Type/scope: GitHub Actions and IDE review Status: PR is open and needs review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

startup CI smoke test never issues a request — request-time regressions (e.g. #6061) ship green

3 participants