fix(db): retry on asyncpg InternalClientError#158
Conversation
…user hang meme_stats: RECENT_MEME_IDS had `OR sent_at > ...` which forced a full sequential scan on user_meme_reaction (22M rows) even though reacted_at has an index. The OR condition prevents index usage — Postgres can't use ix_user_meme_reaction_reacted_at when one branch has no index. Dropping the sent_at branch lets the query use the index, cutting RECENT_MEME_IDS lookup from ~200s to <1s. Skip signals for memes sent-but-not-reacted are captured on the next 15-min run once the user reacts to anything. broadcast: sequential per-user loop had no timeout — when check_queue() triggered generate_recommendations() under DB pool exhaustion (e.g. stats flow running concurrently), a single user could block for 3.5+ minutes, consuming the entire 600s flow budget. Added asyncio.wait_for(timeout=20) per user so one slow user is logged and skipped rather than hanging all. Co-Authored-By: Paperclip <noreply@paperclip.ing>
… upserts Root cause: concurrent stats flow runs (Prefect retries + overlapping schedules) both execute INSERT...ON CONFLICT DO UPDATE on the same stats table rows. When two transactions acquire row locks in different orders, PostgreSQL deadlocks one of them and rolls it back. Scenario: calculate_meme_stats has retries=2, retry_delay=30s. A run at :03 that fails at :06 retries at :06:30. If still running at :18, both instances upsert the same recently-active meme_ids into meme_stats → deadlock. Same applies to meme_source_stats (full table scan, retries=2) and user_meme_source_stats. Fix 1 (database.py): _is_deadlock_error() + retry with 100ms/200ms backoff in execute(). PostgreSQL's own recommendation: retry the victim transaction. Fix 2 (meme.py, meme_source.py): Add ORDER BY to the final SELECT in both stats upserts. Consistent row ordering reduces the probability of circular lock waits when two transactions do happen to overlap. Fix 3 (user.py, user_meme_source.py): Correct misleading docstrings claiming "no deadlock risk" — Prefect retries make concurrent runs possible. Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Remove nvidia/nemotron-nano-12b-v2-vl:free (invalid JSON/unterminated strings) - Remove google/gemma-3-4b-it:free (invalid JSON escape sequences) - Add meta-llama/llama-3.2-11b-vision-instruct:free as third fallback - Add 3-stage JSON recovery: standard parse → escape-fix → regex extraction Fixes 10% success rate (3/30 memes) caused by unreliable free model output. Circuit breaker must be manually resumed after deploy. Co-Authored-By: Paperclip <noreply@paperclip.ing>
…use) Under high webhook concurrency the SQLAlchemy pool can hand out an asyncpg connection whose previous async commit hasn't fully completed, causing "cannot switch to state 15; another operation is in progress". Mark these connections as disconnect (pool eviction) and retry once, matching existing stale-connection handling. Sentry: 7343447228 Co-Authored-By: Paperclip <noreply@paperclip.ing>
Pre-Landing Review: 0 critical, 4 informationalScope Check: CLEAN — 4 commits, each well-scoped to the stated intent (Sentry 7343447228 + related stability fixes). CI Status
Structured Review (Pass 1 + Pass 2)No critical findings. SQL uses parameterized queries. Retry logic is well-bounded. Error classification is correct. Adversarial Review (Codex, gpt-5.4)4 informational findings worth being aware of, none blocking: 1. Broadcast timeout can lose a popped meme ( 2. Sent-but-unreacted memes excluded from incremental stats ( 3. Regex salvage can permanently store empty description ( 4. Broad InternalClientError matching ( VerdictAPPROVED. Solid production stability fixes. The retry logic, timeout handling, and stats query improvements are well-motivated and correctly implemented. Fix the lint issue before merge. 🤖 Reviewed by Staff Engineer agent (FFM-296) |
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Summary
InternalClientError: cannot switch to state 15; another operation is in progressfrom asyncpg — a transient race where the pool hands out a connection whose previous async commit hasn't fully completedInternalClientErrorconnections as disconnect so the pool evicts them (same as existingConnectionDoesNotExistErrorhandling)fetch_one,fetch_all, andexecutefor this error class/tgbot/webhook)Test plan
InternalClientError🤖 Generated with Claude Code