Skip to content

fix(db): retry on asyncpg InternalClientError (concurrent connection use)#160

Closed
ohld wants to merge 8 commits intoproductionfrom
fix/asyncpg-concurrent-connection-retry
Closed

fix(db): retry on asyncpg InternalClientError (concurrent connection use)#160
ohld wants to merge 8 commits intoproductionfrom
fix/asyncpg-concurrent-connection-retry

Conversation

@ohld
Copy link
Copy Markdown
Member

@ohld ohld commented Apr 6, 2026

Summary

Fixes asyncpg concurrent connection errors still appearing in Sentry (issue #7343447228, culprit: /tgbot/webhook, count=8 total as of 2026-04-06).

DB connection retry (primary fix):

  • Extended _mark_bad_connection_as_disconnect to invalidate connections for both ConnectionDoesNotExistError and InternalClientError — pool discards the bad connection after retry
  • Added _is_concurrent_use_error(): detects InternalClientError("cannot switch to state … another operation is in progress") via DBAPIError.__cause__
  • Added _is_transient_connection_error(): combines stale + concurrent-use checks
  • Applied to all three DB helpers: fetch_one, fetch_all, execute — retry-once on any transient connection error

describe_memes model update:

  • Removed meta-llama/llama-3.2-11b-vision-instruct:free (404 on OpenRouter)
  • Added qwen/qwen3.6-plus:free (1M context, strong structured output)

Minor cleanup:

  • .gitignore: add .env.* pattern, preserve .env.example
  • wrapped.py: line-length formatting only
  • agents/comms/AGENTS.md: voice/tone docs for comms agent
  • specs/issues.md: mark H1 asyncpg issue as done, add M7/M8 backlog items

Pre-Landing Review

No issues found. DB retry logic is consistent with existing _is_stale_connection_error pattern. All three DB functions updated uniformly.

Test plan

  • Deploy and monitor Sentry — no new InternalClientError events from /tgbot/webhook
  • docker compose exec app pytest — all integration tests pass
  • Check describe_memes flow in Prefect — qwen3.6-plus successfully describes memes

🤖 Generated with Claude Code

CTO Agent and others added 8 commits April 1, 2026 20:18
…user hang

meme_stats: RECENT_MEME_IDS had `OR sent_at > ...` which forced a full
sequential scan on user_meme_reaction (22M rows) even though reacted_at
has an index. The OR condition prevents index usage — Postgres can't use
ix_user_meme_reaction_reacted_at when one branch has no index. Dropping
the sent_at branch lets the query use the index, cutting RECENT_MEME_IDS
lookup from ~200s to <1s. Skip signals for memes sent-but-not-reacted
are captured on the next 15-min run once the user reacts to anything.

broadcast: sequential per-user loop had no timeout — when check_queue()
triggered generate_recommendations() under DB pool exhaustion (e.g. stats
flow running concurrently), a single user could block for 3.5+ minutes,
consuming the entire 600s flow budget. Added asyncio.wait_for(timeout=20)
per user so one slow user is logged and skipped rather than hanging all.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
… upserts

Root cause: concurrent stats flow runs (Prefect retries + overlapping schedules)
both execute INSERT...ON CONFLICT DO UPDATE on the same stats table rows. When
two transactions acquire row locks in different orders, PostgreSQL deadlocks one
of them and rolls it back.

Scenario: calculate_meme_stats has retries=2, retry_delay=30s. A run at :03 that
fails at :06 retries at :06:30. If still running at :18, both instances upsert
the same recently-active meme_ids into meme_stats → deadlock. Same applies to
meme_source_stats (full table scan, retries=2) and user_meme_source_stats.

Fix 1 (database.py): _is_deadlock_error() + retry with 100ms/200ms backoff in
execute(). PostgreSQL's own recommendation: retry the victim transaction.

Fix 2 (meme.py, meme_source.py): Add ORDER BY to the final SELECT in both stats
upserts. Consistent row ordering reduces the probability of circular lock waits
when two transactions do happen to overlap.

Fix 3 (user.py, user_meme_source.py): Correct misleading docstrings claiming
"no deadlock risk" — Prefect retries make concurrent runs possible.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Remove nvidia/nemotron-nano-12b-v2-vl:free (invalid JSON/unterminated strings)
- Remove google/gemma-3-4b-it:free (invalid JSON escape sequences)
- Add meta-llama/llama-3.2-11b-vision-instruct:free as third fallback
- Add 3-stage JSON recovery: standard parse → escape-fix → regex extraction

Fixes 10% success rate (3/30 memes) caused by unreliable free model output.
Circuit breaker must be manually resumed after deploy.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…use)

Under high webhook concurrency the SQLAlchemy pool can hand out an asyncpg
connection whose previous async commit hasn't fully completed, causing
"cannot switch to state 15; another operation is in progress". Mark these
connections as disconnect (pool eviction) and retry once, matching existing
stale-connection handling.

Sentry: 7343447228

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
…6-plus

meta-llama/llama-3.2-11b-vision-instruct:free returns HTTP 404 on
OpenRouter (model removed from free tier). Replace with
qwen/qwen3.6-plus:free which supports image+video input and 1M context.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@ohld
Copy link
Copy Markdown
Member Author

ohld commented Apr 6, 2026

Closing — all changes in this branch were already merged via PR #159 (merged 2026-04-06). The diff against production is empty.

The CI test failure (Alembic migration cycle in revisions a1b2c3d4e5f6..f6a7b8c9d0e1) is a pre-existing issue on production — not introduced by this branch. Production CI is also failing with the same error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant