fix(db): retry on asyncpg InternalClientError (concurrent connection use)#160
Closed
ohld wants to merge 8 commits intoproductionfrom
Closed
fix(db): retry on asyncpg InternalClientError (concurrent connection use)#160ohld wants to merge 8 commits intoproductionfrom
ohld wants to merge 8 commits intoproductionfrom
Conversation
…user hang meme_stats: RECENT_MEME_IDS had `OR sent_at > ...` which forced a full sequential scan on user_meme_reaction (22M rows) even though reacted_at has an index. The OR condition prevents index usage — Postgres can't use ix_user_meme_reaction_reacted_at when one branch has no index. Dropping the sent_at branch lets the query use the index, cutting RECENT_MEME_IDS lookup from ~200s to <1s. Skip signals for memes sent-but-not-reacted are captured on the next 15-min run once the user reacts to anything. broadcast: sequential per-user loop had no timeout — when check_queue() triggered generate_recommendations() under DB pool exhaustion (e.g. stats flow running concurrently), a single user could block for 3.5+ minutes, consuming the entire 600s flow budget. Added asyncio.wait_for(timeout=20) per user so one slow user is logged and skipped rather than hanging all. Co-Authored-By: Paperclip <noreply@paperclip.ing>
… upserts Root cause: concurrent stats flow runs (Prefect retries + overlapping schedules) both execute INSERT...ON CONFLICT DO UPDATE on the same stats table rows. When two transactions acquire row locks in different orders, PostgreSQL deadlocks one of them and rolls it back. Scenario: calculate_meme_stats has retries=2, retry_delay=30s. A run at :03 that fails at :06 retries at :06:30. If still running at :18, both instances upsert the same recently-active meme_ids into meme_stats → deadlock. Same applies to meme_source_stats (full table scan, retries=2) and user_meme_source_stats. Fix 1 (database.py): _is_deadlock_error() + retry with 100ms/200ms backoff in execute(). PostgreSQL's own recommendation: retry the victim transaction. Fix 2 (meme.py, meme_source.py): Add ORDER BY to the final SELECT in both stats upserts. Consistent row ordering reduces the probability of circular lock waits when two transactions do happen to overlap. Fix 3 (user.py, user_meme_source.py): Correct misleading docstrings claiming "no deadlock risk" — Prefect retries make concurrent runs possible. Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Remove nvidia/nemotron-nano-12b-v2-vl:free (invalid JSON/unterminated strings) - Remove google/gemma-3-4b-it:free (invalid JSON escape sequences) - Add meta-llama/llama-3.2-11b-vision-instruct:free as third fallback - Add 3-stage JSON recovery: standard parse → escape-fix → regex extraction Fixes 10% success rate (3/30 memes) caused by unreliable free model output. Circuit breaker must be manually resumed after deploy. Co-Authored-By: Paperclip <noreply@paperclip.ing>
…use) Under high webhook concurrency the SQLAlchemy pool can hand out an asyncpg connection whose previous async commit hasn't fully completed, causing "cannot switch to state 15; another operation is in progress". Mark these connections as disconnect (pool eviction) and retry once, matching existing stale-connection handling. Sentry: 7343447228 Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
…current-connection-retry
…6-plus meta-llama/llama-3.2-11b-vision-instruct:free returns HTTP 404 on OpenRouter (model removed from free tier). Replace with qwen/qwen3.6-plus:free which supports image+video input and 1M context. Co-Authored-By: Paperclip <noreply@paperclip.ing>
…current-connection-retry
Member
Author
|
Closing — all changes in this branch were already merged via PR #159 (merged 2026-04-06). The diff against production is empty. The CI test failure (Alembic migration cycle in revisions a1b2c3d4e5f6..f6a7b8c9d0e1) is a pre-existing issue on production — not introduced by this branch. Production CI is also failing with the same error. |
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes asyncpg concurrent connection errors still appearing in Sentry (issue #7343447228, culprit: /tgbot/webhook, count=8 total as of 2026-04-06).
DB connection retry (primary fix):
_mark_bad_connection_as_disconnectto invalidate connections for bothConnectionDoesNotExistErrorandInternalClientError— pool discards the bad connection after retry_is_concurrent_use_error(): detectsInternalClientError("cannot switch to state … another operation is in progress")viaDBAPIError.__cause___is_transient_connection_error(): combines stale + concurrent-use checksfetch_one,fetch_all,execute— retry-once on any transient connection errordescribe_memes model update:
meta-llama/llama-3.2-11b-vision-instruct:free(404 on OpenRouter)qwen/qwen3.6-plus:free(1M context, strong structured output)Minor cleanup:
.gitignore: add.env.*pattern, preserve.env.examplewrapped.py: line-length formatting onlyagents/comms/AGENTS.md: voice/tone docs for comms agentspecs/issues.md: mark H1 asyncpg issue as done, add M7/M8 backlog itemsPre-Landing Review
No issues found. DB retry logic is consistent with existing
_is_stale_connection_errorpattern. All three DB functions updated uniformly.Test plan
InternalClientErrorevents from /tgbot/webhookdocker compose exec app pytest— all integration tests pass🤖 Generated with Claude Code