Add request check to startup smoke test#6084
Conversation
|
Thanks for your PR. Looks quite good. Well I already have a private attempt at evnchn#163 Will reconcile and see how to take the best of both worlds once have time. But I feel like your work will dominate. Rest assured you will be credited no matter what happens. |
|
Let's also do this in early 3.14 instead |
|
Sounds good, thanks. I'm happy for this to be considered for early 3.14. |
evnchn
left a comment
There was a problem hiding this comment.
Posted by Claude Code on Evan's behalf.
TL;DR: Approving — we verified both directions empirically, including a reproduction of the exact historical bug that motivated #6079. This supersedes evnchn#163 (our own draft for the same issue) and covers everything it did, in one respect better: the URL is extracted from the server log instead of a hard-coded port list. @falkoschindler — could you do a go-around on this one soon? The thread converged on early-3.14, and test fixtures merging early means they get to find bugs for the whole cycle; these crucial CI components also deserve more eyes early.
Empirical verification
1. Negative path — a request-time 500 that boots cleanly (a NiceGUI app whose / page raises):
| script | result |
|---|---|
test_startup.sh on main |
exit 0 — bug ships undetected (confirms the #6079 premise) |
| this PR | exit 1 — caught twice over: returned HTTP 500 from the new request check, and the server traceback now lands in the captured output |
2. Gold standard — the historical #6061 bug (checkout 5328b594, the #5960 merge commit, with this PR's script):
http://localhost:8080/ returned HTTP 500
| File ".../nicegui/storage.py", line 141, in _create_user_storage
| TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'
exit 1 — the exact regression class that shipped green through the old smoke test is now caught.
3. Code audit: all new failure-prone statements are set -euo pipefail-safe (|| exitcode=1 guards; the [[ ]] && var= patterns fall under the &&-list exemption), the ready-poll (50×0.2 s) is aligned with the 10 s timeout, only ≥500 fails (404-tolerant for apps not serving /), the temp file is cleaned before the failure block, the check-target list and 3-attempt retry wrapper are untouched, and CI green covers the positive path on every real target including examples/fastapi on port 8000 (covered by the log-extracted URL).
Nice work @SSDWGG — this held up under adversarial testing on the first try. 🎉
falkoschindler
left a comment
There was a problem hiding this comment.
Thanks a lot for the contribution, @SSDWGG!
Unfortunately there are still some major issues that need to be fixed before merging:
-
The startup check fails for the fastapi example — the PR would fail the merge queue (blocking) —
full-testnever ran on this PR (it only runs in the merge queue), so I rantest_startup.shin a CI-like Linux container (Debian,uv sync --locked, same invocation as the workflow).examples/fastapifails deterministically on all 3 attempts withFailed to request http://127.0.0.1:8000/: <urlopen error [Errno 111] Connection refused>: withreload=True, uvicorn's reloader parent logsUvicorn running on http://127.0.0.1:8000before the worker process has bound the socket, so the single-shot request fires into the ~1 s spawn-and-import gap. Probing confirmed it: immediately after the ready line the request is refused; one second later it returns 200. The gap is the worker's startup time, so fast CI runners hit it too. Fix: retry refused connections, e.g.curl --retry 5 --retry-connrefused --retry-delay 1(with the urllib helper it would need a manual retry loop). -
The threaded_nicegui example fails too — it exits before the request (blocking) —
examples/threaded_niceguiruns the server in a daemon thread while the main thread reads stdin. With CI's closed stdin,input()raisesEOFErrorimmediately, the main thread exits cleanly, and the daemon server dies — after the ready line was printed. The old script passed this (clean exit, ready line present); the request check now gets connection refused from a dead process, deterministically, in both the urllib and curl variants. Retries don't help. Fix: keep stdin open for the test window, e.g. launch the background job with< <(sleep 15)— this emulates an interactive user who hasn't typed yet and is a no-op for all other examples (verified: the example passes with it). -
Request failure reason is missing from the failure dump (major) — the helper's error message goes to the script's stderr, so it appears above the
"Wrong exit code 1. Output was:"dump while the dump itself shows a clean server log with no hint of why the check failed (see the fastapi failure above). Note for the fix: simply appending to the captured log file does not work — the server's>redirect keeps writing at its own file offset and overwrites the appended text (verified). Collect the request output separately and append it to$outputbefore the dump. -
Race between the request and the 10-second timeout (major) — the request only starts after the ready line appears, but the
timeout 10clock has been running since launch. A heavy example that becomes ready at ~8–9 s leaves too little window: the server gets SIGTERMed mid-request, and since slow startup is systematic rather than random, the 3 retries incheckwon't save it. (The connection-refused retries from finding 1 widen this exposure slightly.) Options: raise the timeout (e.g. 15 s) while keeping the 10 s ready-poll, or kill the server right after a successful request and treat exit 143 like 124 — which would also cut the job's wall clock roughly in half, at the cost of no longer catching errors that appear late in the window. -
Replace the bash-Python mashup with
curl(major) — the 20-line Python heredoc spawned viauv run --no-sync pythonis barely readable, depends on the example's environment, adds an interpreter start inside the already-tight timeout window, and contains a dead branch (urlopenraisesHTTPErrorfor any status ≥ 400, so the in-withresponse.status >= 500check is unreachable). The whole helper reduces to:request() { local url="$1" code # NOTE servers with reload=True log their URL before the socket accepts connections, so retry refused connections code=$(curl --silent --location --retry 5 --retry-connrefused --retry-delay 1 --output /dev/null --write-out '%{http_code}' --max-time 5 "$url" || true) if [[ $code -lt 200 || $code -ge 500 ]]; then echo "Request to $url failed with HTTP status $code" return 1 fi }
Semantics match the Python version: transport errors (
000) and 5xx fail, 4xx is tolerated, redirects are followed (--location, matching urllib).curlis preinstalled on GitHub runners and macOS, so the local-testing story is no worse.Verified locally end-to-end: a healthy NiceGUI app passes (URL detection picks the ready-line port, not the fallback), and a test app raising on its first
GET /is caught with the traceback and request failure line in the dump — the regression class of #6061.
|
Sorry for this one. I think focused on "Is this superset of evnchn#163" for the winning condition for this PR in the review. Thanks for your look @falkoschindler 🙇 |
|
Thanks for the detailed review. I pushed an update that addresses the requested startup-script changes:
I could run |
The ready-poll loop capped at 50 x 0.2 s = 10 s while the process timeout was 15 s, so a server taking 10-15 s to log its ready line would be marked failed even though it was still starting. Derive the poll window from a single timeout_seconds constant so it covers the full timeout window, and drop the coupled magic numbers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
falkoschindler
left a comment
There was a problem hiding this comment.
Thanks for tightening this up! The smoke test now exercises the request path instead of only the boot log, which is exactly the gap that let #6061 slip through.
I ran the full suite in a CI-like Linux container against the latest state and all 54 checks pass with no retries — including examples/fastapi (the reload-race that the connection-refused retries fix) and examples/threaded_nicegui (which needed the open stdin so it doesn't exit before the request).
I pushed one small follow-up in 2d1b116 to align the ready-poll window with the run timeout: the poll loop was capped at 10 s while the process timeout is 15 s, so a server taking 10–15 s to log its ready line would have been marked failed even though it was still starting. It now derives the poll window from a single timeout_seconds constant.
LGTM — approving. 🎉
Summary
Fixes #6079.
Testing
bash -n test_startup.shgit diff --check HEAD~1 HEADPATH=/tmp/gh-pr-test-bin:$PATH bash -c "$(awk '/^check main.py/{exit} {print}' test_startup.sh); check examples/progress/main.py"PATH=/tmp/gh-pr-test-bin:$PATH bash -c "$(awk '/^check main.py/{exit} {print}' test_startup.sh); check examples/single_page_app/main.py"The last two local checks used a small GNU
timeoutshim because macOS does not provide the CItimeoutcommand by default.