Skip to content

fix(db): retry on asyncpg InternalClientError#158

Merged
ohld merged 5 commits intoproductionfrom
fix/asyncpg-concurrent-connection-retry
Apr 6, 2026
Merged

fix(db): retry on asyncpg InternalClientError#158
ohld merged 5 commits intoproductionfrom
fix/asyncpg-concurrent-connection-retry

Conversation

@ohld
Copy link
Copy Markdown
Member

@ohld ohld commented Apr 6, 2026

Summary

  • Handles InternalClientError: cannot switch to state 15; another operation is in progress from asyncpg — a transient race where the pool hands out a connection whose previous async commit hasn't fully completed
  • Marks InternalClientError connections as disconnect so the pool evicts them (same as existing ConnectionDoesNotExistError handling)
  • Adds retry-once logic in fetch_one, fetch_all, and execute for this error class
  • Sentry issue 7343447228 (8 occurrences, culprit: /tgbot/webhook)

Test plan

  • Verify ruff passes (pre-commit hook ran successfully)
  • Deploy and monitor Sentry for recurrence of the InternalClientError
  • Confirm no regressions in webhook response times under normal traffic

🤖 Generated with Claude Code

CTO Agent and others added 4 commits April 1, 2026 20:18
…user hang

meme_stats: RECENT_MEME_IDS had `OR sent_at > ...` which forced a full
sequential scan on user_meme_reaction (22M rows) even though reacted_at
has an index. The OR condition prevents index usage — Postgres can't use
ix_user_meme_reaction_reacted_at when one branch has no index. Dropping
the sent_at branch lets the query use the index, cutting RECENT_MEME_IDS
lookup from ~200s to <1s. Skip signals for memes sent-but-not-reacted
are captured on the next 15-min run once the user reacts to anything.

broadcast: sequential per-user loop had no timeout — when check_queue()
triggered generate_recommendations() under DB pool exhaustion (e.g. stats
flow running concurrently), a single user could block for 3.5+ minutes,
consuming the entire 600s flow budget. Added asyncio.wait_for(timeout=20)
per user so one slow user is logged and skipped rather than hanging all.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
… upserts

Root cause: concurrent stats flow runs (Prefect retries + overlapping schedules)
both execute INSERT...ON CONFLICT DO UPDATE on the same stats table rows. When
two transactions acquire row locks in different orders, PostgreSQL deadlocks one
of them and rolls it back.

Scenario: calculate_meme_stats has retries=2, retry_delay=30s. A run at :03 that
fails at :06 retries at :06:30. If still running at :18, both instances upsert
the same recently-active meme_ids into meme_stats → deadlock. Same applies to
meme_source_stats (full table scan, retries=2) and user_meme_source_stats.

Fix 1 (database.py): _is_deadlock_error() + retry with 100ms/200ms backoff in
execute(). PostgreSQL's own recommendation: retry the victim transaction.

Fix 2 (meme.py, meme_source.py): Add ORDER BY to the final SELECT in both stats
upserts. Consistent row ordering reduces the probability of circular lock waits
when two transactions do happen to overlap.

Fix 3 (user.py, user_meme_source.py): Correct misleading docstrings claiming
"no deadlock risk" — Prefect retries make concurrent runs possible.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Remove nvidia/nemotron-nano-12b-v2-vl:free (invalid JSON/unterminated strings)
- Remove google/gemma-3-4b-it:free (invalid JSON escape sequences)
- Add meta-llama/llama-3.2-11b-vision-instruct:free as third fallback
- Add 3-stage JSON recovery: standard parse → escape-fix → regex extraction

Fixes 10% success rate (3/30 memes) caused by unreliable free model output.
Circuit breaker must be manually resumed after deploy.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…use)

Under high webhook concurrency the SQLAlchemy pool can hand out an asyncpg
connection whose previous async commit hasn't fully completed, causing
"cannot switch to state 15; another operation is in progress". Mark these
connections as disconnect (pool eviction) and retry once, matching existing
stale-connection handling.

Sentry: 7343447228

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@ohld
Copy link
Copy Markdown
Member Author

ohld commented Apr 6, 2026

Pre-Landing Review: 0 critical, 4 informational

Scope Check: CLEAN — 4 commits, each well-scoped to the stated intent (Sentry 7343447228 + related stability fixes).

CI Status

  • Lint FAIL: src/flows/storage/describe_memes.py needs ruff format. Quick fix.
  • Test FAIL: Pre-existing alembic migration cycle (a1b2c3d4e5f6 etc.) — not caused by this PR.

Structured Review (Pass 1 + Pass 2)

No critical findings. SQL uses parameterized queries. Retry logic is well-bounded. Error classification is correct.

Adversarial Review (Codex, gpt-5.4)

4 informational findings worth being aware of, none blocking:

1. Broadcast timeout can lose a popped meme (broadcasts/meme.py:35)
wait_for(timeout=20) wraps the full pop→send→persist pipeline. If cancellation lands after get_next_meme_for_user() pops from Redis but before send_meme_to_user completes, the meme is lost from the queue without delivery. Mitigated by: queue regenerates at ≤2 items, and this is a strict improvement over the old behavior (unbounded hang blocking all remaining users). Acceptable tradeoff.

2. Sent-but-unreacted memes excluded from incremental stats (stats/meme.py:41)
Removing OR sent_at > ... from RECENT_MEME_IDS means memes sent but not yet reacted to won't trigger incremental stats recompute. Mitigated by: once any user reacts, all stats for that meme recompute from full history. The old condition was causing full-table scans — intentional tradeoff.

3. Regex salvage can permanently store empty description (describe_memes.py:284)
If regex extraction yields only ocr_text but not description, line 284 defaults description to "". The retry query (line 89) checks IS NULL, not = '', so the meme is never re-described. Consider changing the retry filter to: OR COALESCE(M.ocr_result->>'description', '') = ''.

4. Broad InternalClientError matching (database.py:108)
Matches the entire InternalClientError class rather than the specific "cannot switch to state" message. In practice, all InternalClientError subtypes indicate a broken connection state, so retry-once-with-fresh-connection is safe.

Verdict

APPROVED. Solid production stability fixes. The retry logic, timeout handling, and stats query improvements are well-motivated and correctly implemented. Fix the lint issue before merge.

🤖 Reviewed by Staff Engineer agent (FFM-296)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@ohld ohld merged commit 9f5f992 into production Apr 6, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant