Migrate image/oldimage queries to file/filerevision/filetypes #505
Add AND fr.fr_deleted = 0 to the main filerevision JOIN in both get_files() and get_file_info().

In the old image table schema the current revision was always live; in filerevision, file_latest can theoretically point to a suppressed revision. Without this guard such a file would be excluded from results by the inner JOIN rather than returning NULL metadata, a regression introduced by the migration.

Also add a comment to make_entry() explaining that CSV-imported entries intentionally receive file_id=NULL (file_id only comes from wikireplica imports, not the CSV path).

Refs #505

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
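The effect of the extra JOIN condition can be illustrated with a minimal sketch on in-memory SQLite. Only `file_latest`, `fr_id` and `fr_deleted` are taken from this PR; the remaining columns (`file_name`, `fr_file`, `fr_size`), the sample rows and the query shape are assumptions made for illustration and are not the actual SQL in `labs.py`.

```python
import sqlite3

# Toy stand-ins for the wikireplica tables; the column set is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file (
        file_id INTEGER PRIMARY KEY, file_name TEXT, file_latest INTEGER);
    CREATE TABLE filerevision (
        fr_id INTEGER PRIMARY KEY, fr_file INTEGER,
        fr_deleted INTEGER, fr_size INTEGER);
    INSERT INTO file VALUES (1, 'Live.jpg', 10), (2, 'Suppressed.jpg', 20);
    -- Revision 20 is flagged (fr_deleted != 0), so file 2 has a suppressed
    -- latest revision.
    INSERT INTO filerevision VALUES (10, 1, 0, 1234), (20, 2, 1, 5678);
""")

# Inner JOIN on the latest revision, with the fr_deleted = 0 condition.
GUARDED = """
    SELECT f.file_name, fr.fr_size
    FROM file AS f
    JOIN filerevision AS fr
      ON fr.fr_id = f.file_latest
     AND fr.fr_deleted = 0
    ORDER BY f.file_name
"""

# The same JOIN without the condition, for comparison.
UNGUARDED = """
    SELECT f.file_name, fr.fr_size
    FROM file AS f
    JOIN filerevision AS fr
      ON fr.fr_id = f.file_latest
    ORDER BY f.file_name
"""

guarded_rows = conn.execute(GUARDED).fetchall()
unguarded_rows = conn.execute(UNGUARDED).fetchall()
print(guarded_rows)
print(unguarded_rows)
```

In this toy data set the guarded query returns only `Live.jpg`, because the inner JOIN finds no matching revision row for the file whose latest revision is flagged; the unguarded query returns both files.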
Deployment notes for reviewers
Separate PR needed: setuptools pin
This PR branch includes a …
Production deployment steps
Run these in order when deploying to prod (montage, not montage-beta):
cffi==1.16.0 has no pre-built wheel for Python 3.13 and fails to compile from source on Toolforge (missing libffi-dev headers). Bumped to 1.17.1, which ships a Python 3.13 wheel.

setuptools 82.0.0 removed pkg_resources entirely. python-graph-core uses pkg_resources for namespace package declarations, causing ModuleNotFoundError on startup. Pinned to 81.0.0 until python-graph-core is replaced (#421).

Discovered during Python 3.13 Toolforge deployment of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add tools/migrate_prod_db.sql and revert_prod_db.sql for production MySQL migration (idempotent, exact inverses of each other)
- Add comment to labs.py clarifying that alias `oi` mirrors the old oldimage role but is a filerevision subquery, not the oldimage table
- Change fixture file_ids (conftest.py) for SELECTED and REUPLOAD from 99999/88888 to 1/2, structurally below the FIXTURE_FILE_INFOS range (1000+) so collisions are impossible regardless of future additions

Part of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
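The PR description characterises the `oi` subquery as a correlated `MIN(fr_id)` lookup that selects the earliest non-deleted revision per file. A rough sketch of that pattern on in-memory SQLite follows. The `uploader` column, the `fr_file` link column, the output aliases `first_uploader` and `first_timestamp`, and the sample rows are all hypothetical; the real query in `labs.py` may be shaped differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file (
        file_id INTEGER PRIMARY KEY, file_name TEXT, file_latest INTEGER);
    -- 'uploader' is a hypothetical stand-in for whatever attribution join
    -- the real query performs.
    CREATE TABLE filerevision (
        fr_id INTEGER PRIMARY KEY, fr_file INTEGER, fr_deleted INTEGER,
        fr_timestamp TEXT, uploader TEXT);
    INSERT INTO file VALUES (1, 'Reuploaded.jpg', 12);
    INSERT INTO filerevision VALUES
        (10, 1, 1, '20200101000000', 'Suppressed'),
        (11, 1, 0, '20200201000000', 'Original'),
        (12, 1, 0, '20210101000000', 'Reuploader');
""")

# For each file, join exactly one revision row: the one with the smallest
# fr_id among that file's non-deleted revisions.
QUERY = """
    SELECT f.file_name,
           oi.uploader     AS first_uploader,
           oi.fr_timestamp AS first_timestamp
    FROM file AS f
    JOIN filerevision AS oi
      ON oi.fr_id = (
            SELECT MIN(x.fr_id)
            FROM filerevision AS x
            WHERE x.fr_file = f.file_id
              AND x.fr_deleted = 0
         )
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Because the subquery yields a single `fr_id` per file, the result does not depend on row order the way a `GROUP BY` with a non-aggregated column can. The sketch treats the lowest `fr_id` as the earliest revision, as the PR description does.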
- Fix tablename: actual SQLAlchemy model uses `entries` not `entry`
- Rename index to ix_entries_file_id to match tablename convention
- Both scripts remain idempotent (IF NOT EXISTS / IF EXISTS)

Part of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
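To illustrate what idempotent means for a column-plus-index migration, here is a small sketch against in-memory SQLite. It is not the content of the SQL scripts or of `tools/migrate_beta_db.py`. The `entries` table name, the `file_id` column and the `ix_entries_file_id` index name come from this PR; the other columns and the `migrate()` helper are hypothetical. SQLite has no `ADD COLUMN IF NOT EXISTS`, so the sketch checks `PRAGMA table_info` first.

```python
import sqlite3


def migrate(conn):
    """Add entries.file_id and its index; safe to run more than once."""
    # Row layout of PRAGMA table_info: (cid, name, type, notnull, dflt, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(entries)")}
    if "file_id" not in columns:
        conn.execute("ALTER TABLE entries ADD COLUMN file_id BIGINT")
    # Plain, non-unique index; IF NOT EXISTS makes a re-run a no-op.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_entries_file_id ON entries (file_id)"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical pre-migration shape of the table.
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO entries (name) VALUES ('Example.jpg')")

migrate(conn)
migrate(conn)  # second run changes nothing and raises no error

columns = [row[1] for row in conn.execute("PRAGMA table_info(entries)")]
indexes = [row[1] for row in conn.execute("PRAGMA index_list(entries)")]
existing = conn.execute("SELECT name, file_id FROM entries").fetchall()
print(columns, indexes, existing)
```

Rows that existed before the migration keep `file_id` as NULL, which matches the nullable column described in the PR.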
…evision/filetypes

Closes #425. Relates to #505.

Changes:
- labs.py: new query using file/filerevision/filetypes tables; adds file_id; SELECT DISTINCT to prevent duplicates from multiple linktarget rows
- rdb.py: add file_id column to Entry; deduplicate entries case-insensitively in add_entries() to match MariaDB utf8mb4_unicode_ci collation
- loaders.py: pass file_id through make_entry(); update export dicts
- tests: update fixtures and assertions for new query shape
- tools/migrate_prod_db.sql, revert_prod_db.sql: production schema migration
- requirements.txt: cffi 1.17.1, setuptools pin for Python 3.13
- deployment.md: update python3.9 → python3.13 references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
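The case-insensitive deduplication mentioned for `add_entries()` could look roughly like the sketch below. The helper name `dedupe_case_insensitive` and the sample names are hypothetical, and this is not the code in `rdb.py`. Note also that `str.casefold()` only approximates `utf8mb4_unicode_ci`, which additionally treats some accented and unaccented characters as equal.

```python
def dedupe_case_insensitive(names):
    """Keep the first occurrence of each name, comparing case-insensitively."""
    seen = set()
    unique = []
    for name in names:
        # casefold() is an approximation of the database collation, not an
        # exact reimplementation of it.
        key = name.casefold()
        if key in seen:
            continue
        seen.add(key)
        unique.append(name)
    return unique


names = ["Sunset.JPG", "sunset.jpg", "Harbour.png", "SUNSET.jpg", "Harbour.PNG"]
result = dedupe_case_insensitive(names)
print(result)
```

Deduplicating in the application before the insert avoids handing the database two names that it considers equal under its collation but that Python would treat as distinct strings.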
Beta validation complete (2026-04-19)
Validated end-to-end on montage-beta running MariaDB (10.6.22, utf8mb4_unicode_ci), the same stack as production.
Query parity
Import pipeline
Migration SQL
Production readiness
Summary
Wikimedia is removing the `image` and `oldimage` tables from wikireplicas on 28 May 2026 (cloud-announce, T28741). This PR rewrites the two affected query functions in `labs.py` to use the new `file`, `filerevision`, and `filetypes` tables before that deadline.

Closes #504.
What changed
- `labs.py`: `get_files()` and `get_file_info()` rewritten to use `file`/`filerevision`/`filetypes`. All output key aliases (`img_*`, `oi_archive_name`, `rec_img_*`) are preserved, so no downstream changes are required. Categorylinks query updated to use a `linktarget` join (same fix as "Fix categorylinks query for MediaWiki link target normalization", #469).
- `rdb.py`: Added nullable `file_id` (BIGINT) column to the `Entry` model.
- `loaders.py`: `make_entry()` now stores `file_id` on new entries and uses `.get() or None` for MIME fields to handle NULL gracefully.
- `conftest.py`: Fixture dicts updated with `file_id` and `oi_archive_name` fields to match real `labs.py` output. Added `REUPLOAD_FILE_INFO` fixture.
- Tests: `test_make_entry_reupload`, `test_get_files_parity` (xfail, Toolforge only), `test_get_files_info_by_name`.
- `tools/migrate_beta_db.py`: Atomic migration script for the beta SQLite DB (adds `file_id` column + index).
- `tools/revert_beta_db.py`: Revert script using the SQLite 12-step schema-alteration procedure for full safety.

Implementation notes
GROUP BY non-determinism fixed: The old query used `GROUP BY img_name` with `ORDER BY oi_timestamp ASC`, which was non-deterministic when a file had multiple `oldimage` rows. Replaced with a correlated `MIN(fr_id)` subquery that deterministically selects the earliest non-deleted revision per file (original uploader / original timestamp).

Regression fix (suppressed latest revision): `file.file_latest` can point to a suppressed revision in `filerevision` (unlike the old `image` table, which always held live data). The main JOIN therefore includes `AND fr.fr_deleted = 0` to skip suppressed revisions.

`get_files_legacy()` is intentionally kept: A verbatim copy of the old `image`/`oldimage` query, retained solely to power `test_get_files_parity` (marked `xfail` locally, passes on Toolforge with `TOOLFORGE=1`). While both table sets are live until 28 May, this test asserts the new query returns the same filenames as the old one. Do not remove `get_files_legacy()` in this PR; it and the parity test are removed together in a follow-up after 28 May.

MIME NULL handling: The new `LEFT JOIN filetypes` can theoretically return NULL for `img_major_mime`/`img_minor_mime` if `file.file_type` has no match in `filetypes`. Files with unknown MIME are kept and stored as NULL rather than silently dropped. This surfaces a pre-existing bug in `autodisqualify_by_filetype()` (out of scope, tracked as follow-up).

Index strategy: Toolforge runs MariaDB 10.6.22, which does not support partial unique indexes. The migration therefore creates a plain non-unique index on `file_id`; import-time uniqueness is enforced at the application layer.

Migration
Beta uses SQLite; prod uses MySQL/MariaDB. Run before deploying.
Beta (SQLite) — use the included script:
Prod (MySQL/MariaDB):
A revert script (`tools/revert_beta_db.py`) is included for the beta SQLite DB. Both scripts are idempotent and tested locally (roundtrip: pre-migration schema/indexes/data identical to post-revert).

Test plan
- `pytest montage/tests/`: 8 passed, 1 xfailed
- `tools/test_labs_queries_pr505.py`: all checks passed on Toolforge beta (parity, row shape, `file_id` populated, attribution for reuploaded file verified)
- `file_id` is populated on newly imported entries
- `tools/test_labs_queries_pr505.py`: all checks passed (11866 files, parity confirmed, attribution verified)

Planning pipeline modelled after work by @alissayarmantho in ashishg/dp#1966. Thank you, Alissa.
🤖 Generated with Claude Code