Migrate image/oldimage queries to file/filerevision/filetypes #505

Open
lgelauff wants to merge 1 commit into master from migrate-image-to-filerevision

@lgelauff lgelauff commented Apr 17, 2026

Summary

Wikimedia is removing the image and oldimage tables from wikireplicas on 28 May 2026 (cloud-announce, T28741). This PR rewrites the two affected query functions in labs.py to use the new file, filerevision, and filetypes tables before that deadline.

Closes #504.

What changed

  • labs.py: get_files() and get_file_info() rewritten to use file/filerevision/filetypes. All output key aliases (img_*, oi_archive_name, rec_img_*) are preserved, so no downstream changes are required. The categorylinks query now uses a linktarget join (same fix as #469, "Fix categorylinks query for MediaWiki link target normalization").
  • rdb.py: Added nullable file_id (BIGINT) column to the Entry model.
  • loaders.py: make_entry() now stores file_id on new entries and uses .get() or None for MIME fields to handle NULL gracefully.
  • conftest.py: Fixture dicts updated with file_id and oi_archive_name fields to match real labs.py output. Added REUPLOAD_FILE_INFO fixture.
  • New tests: test_make_entry_reupload, test_get_files_parity (xfail — Toolforge only), test_get_files_info_by_name.
  • tools/migrate_beta_db.py: Atomic migration script for the beta SQLite DB (adds file_id column + index).
  • tools/revert_beta_db.py: Revert script using the SQLite 12-step schema-alteration procedure for full safety.

Implementation notes

GROUP BY non-determinism fixed: The old query used GROUP BY img_name with ORDER BY oi_timestamp ASC, which was non-deterministic when a file had multiple oldimage rows. Replaced with a correlated MIN(fr_id) subquery that deterministically selects the earliest non-deleted revision per file (original uploader / original timestamp).
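
To illustrate the correlated MIN(fr_id) approach described above, here is a minimal sketch against an in-memory SQLite database. The table layout is a heavily simplified assumption (the real wikireplica tables have more columns), fr_uploader is a hypothetical stand-in for the actor lookup, and the query is not the one in labs.py. It also assumes, as the PR does, that the lowest fr_id among non-deleted revisions is the earliest upload.

```python
import sqlite3

# Simplified, assumed schema: only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_name TEXT);
    CREATE TABLE filerevision (
        fr_id INTEGER PRIMARY KEY,
        fr_file INTEGER,
        fr_deleted INTEGER,
        fr_timestamp TEXT,
        fr_uploader TEXT  -- hypothetical stand-in for the actor join
    );
    INSERT INTO file VALUES (1, 'Reuploaded.jpg'), (2, 'Single.jpg');
    INSERT INTO filerevision VALUES
        (10, 1, 1, '20240101000000', 'Suppressed'),
        (11, 1, 0, '20240201000000', 'Original'),
        (12, 1, 0, '20240301000000', 'Reuploader'),
        (20, 2, 0, '20240401000000', 'Solo');
""")

# For each file, join exactly one revision: the lowest fr_id that is not deleted.
EARLIEST_REVISION_SQL = """
    SELECT f.file_name, oi.fr_uploader, oi.fr_timestamp
    FROM file AS f
    JOIN filerevision AS oi
      ON oi.fr_id = (
          SELECT MIN(fr2.fr_id)
          FROM filerevision AS fr2
          WHERE fr2.fr_file = f.file_id AND fr2.fr_deleted = 0
      )
    ORDER BY f.file_id
"""
rows = conn.execute(EARLIEST_REVISION_SQL).fetchall()
for row in rows:
    print(row)
```

Because the subquery returns a single fr_id per file, the result no longer depends on how the database picks a row within a group.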

Regression fix — suppressed latest revision: file.file_latest can point to a suppressed revision in filerevision (unlike the old image table which always held live data). The main JOIN therefore includes AND fr.fr_deleted = 0 to skip suppressed revisions.

get_files_legacy() is intentionally kept: A verbatim copy of the old image/oldimage query, retained solely to power test_get_files_parity (marked xfail locally, passes on Toolforge with TOOLFORGE=1). While both table sets are live until 28 May, this test asserts the new query returns the same filenames as the old one. Do not remove get_files_legacy() in this PR — it and the parity test are removed together in a follow-up after 28 May.

MIME NULL handling: The new LEFT JOIN filetypes can theoretically return NULL for img_major_mime/img_minor_mime if file.file_type has no match in filetypes. Files with unknown MIME are kept and stored as NULL rather than silently dropped. This surfaces a pre-existing bug in autodisqualify_by_filetype() (out of scope, tracked as follow-up).
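
As a rough illustration of the NULL handling described above, the sketch below normalises missing MIME values to None and shows one way a filetype check could tolerate them. Both function names are hypothetical; this is not the code in loaders.py, and it makes no claim about how autodisqualify_by_filetype() actually behaves.

```python
def mime_fields(row):
    """Return (major, minor) MIME values; missing or empty values become None."""
    major = row.get("img_major_mime") or None
    minor = row.get("img_minor_mime") or None
    return major, minor


def is_disallowed_type(row, disallowed_minor=("gif",)):
    """Hypothetical filter: an unknown MIME type is never treated as disallowed."""
    _, minor = mime_fields(row)
    return minor is not None and minor.lower() in disallowed_minor


known = {"img_major_mime": "image", "img_minor_mime": "jpeg"}
unknown = {"img_major_mime": None, "img_minor_mime": None}
print(mime_fields(known), mime_fields(unknown))
print(is_disallowed_type(unknown))
```

The explicit `is not None` guard is the design point: calling a string method on a NULL MIME value would raise instead of keeping the file.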

Index strategy: Toolforge runs MariaDB 10.6.22, which does not support partial unique indexes. The migration therefore creates a plain non-unique index on file_id; import-time uniqueness is enforced at the application layer.
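
For readers wondering what application-layer uniqueness might look like, here is a minimal sketch. The function name and the row shape (dicts with an optional file_id key) are assumptions for illustration only; the actual check in this PR may be structured differently. Rows without a file_id always pass, on the assumption that such rows cannot be compared by file_id at all.

```python
def filter_new_by_file_id(incoming, existing_file_ids):
    """Drop rows whose file_id is already known; rows with file_id None always pass."""
    seen = set(existing_file_ids)
    kept = []
    for row in incoming:
        file_id = row.get("file_id")
        if file_id is not None:
            if file_id in seen:
                continue  # already imported, or repeated within this batch
            seen.add(file_id)
        kept.append(row)
    return kept


batch = [
    {"name": "A.jpg", "file_id": 1},
    {"name": "A-dup.jpg", "file_id": 1},
    {"name": "B.jpg", "file_id": 2},
    {"name": "FromCsv.jpg", "file_id": None},
]
kept = filter_new_by_file_id(batch, existing_file_ids=[2])
print([row["name"] for row in kept])
```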

Migration

Beta uses SQLite; prod uses MySQL/MariaDB. Run before deploying.

Beta (SQLite) — use the included script:

python3 tools/migrate_beta_db.py

Prod (MySQL/MariaDB):

ALTER TABLE entries ADD COLUMN file_id BIGINT DEFAULT NULL;
ALTER TABLE entries ADD INDEX ix_entry_file_id (file_id);

A revert script (tools/revert_beta_db.py) is included for the beta SQLite DB. Both scripts are idempotent and tested locally (roundtrip: pre-migration schema/indexes/data identical to post-revert).
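
The roundtrip idea can be sketched as follows. This is an illustrative toy, not the contents of tools/migrate_beta_db.py or tools/revert_beta_db.py: the two-column entries table is invented, the index name is an assumption, and the revert is a shortened copy-drop-rename rebuild rather than the full SQLite 12-step procedure.

```python
import sqlite3

INDEX_NAME = "ix_entries_file_id"  # name assumed for this sketch


def columns(conn):
    """Return the column names of the entries table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(entries)").fetchall()]


def migrate(conn):
    """Add entries.file_id plus a plain index; a second run changes nothing."""
    conn.execute("BEGIN")
    try:
        if "file_id" not in columns(conn):
            conn.execute("ALTER TABLE entries ADD COLUMN file_id BIGINT DEFAULT NULL")
        conn.execute(f"CREATE INDEX IF NOT EXISTS {INDEX_NAME} ON entries (file_id)")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def revert(conn):
    """Rebuild the toy table without file_id (copy, drop, rename)."""
    if "file_id" not in columns(conn):
        return  # nothing to revert
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP INDEX IF EXISTS {INDEX_NAME}")
        conn.execute("CREATE TABLE entries_new (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO entries_new (id, name) SELECT id, name FROM entries")
        conn.execute("DROP TABLE entries")
        conn.execute("ALTER TABLE entries_new RENAME TO entries")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the functions manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO entries (name) VALUES ('A.jpg'), ('B.jpg')")

migrate(conn)
migrate(conn)  # idempotent: second call is a no-op
migrated = "file_id" in columns(conn)

revert(conn)
revert(conn)  # idempotent: second call is a no-op
columns_after = columns(conn)
data_after = conn.execute("SELECT id, name FROM entries ORDER BY id").fetchall()
indexes_after = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()
print(migrated, columns_after, data_after, indexes_after)
```

Wrapping each step in an explicit transaction means a failure part-way through leaves the table as it was, which is the property the "atomic" wording above refers to.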

Test plan

  • pytest montage/tests/ — 8 passed, 1 xfailed
  • Standalone wikireplica query test (tools/test_labs_queries_pr505.py) — all checks passed on Toolforge beta (parity, row shape, file_id populated, attribution for reuploaded file verified)
  • Migration script applied to montage-beta DB — succeeded
  • montage-beta deployed and running on PR branch
  • Run a category import on beta and verify file_id is populated on newly imported entries
  • Verify original uploader is correct for a known reuploaded file via the UI
  • Run parity test on Toolforge: tools/test_labs_queries_pr505.py — all checks passed (11866 files, parity confirmed, attribution verified)

Planning pipeline modelled after work by @alissayarmantho in ashishg/dp#1966 — thank you Alissa.

🤖 Generated with Claude Code

lgelauff added a commit that referenced this pull request Apr 17, 2026
Add AND fr.fr_deleted = 0 to the main filerevision JOIN in both
get_files() and get_file_info(). In the old image table schema the
current revision was always live; in filerevision, file_latest can
theoretically point to a suppressed revision. Without this guard such
a file would be excluded from results by the inner JOIN rather than
returning NULL metadata — a regression introduced by the migration.

Also add a comment to make_entry() explaining that CSV-imported entries
intentionally receive file_id=NULL (file_id only comes from wikireplica
imports, not the CSV path).

Refs #505

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff lgelauff marked this pull request as draft April 17, 2026 19:45
lgelauff added a commit that referenced this pull request Apr 18, 2026
lgelauff added a commit that referenced this pull request Apr 18, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff commented

Deployment notes for reviewers

Separate PR needed: setuptools pin

This PR branch includes a setuptools==81.0.0 pin in requirements.txt that is not related to the migration — it was added to fix a Python 3.13 Toolforge deployment issue (pkg_resources was removed in setuptools 82). This should be extracted into a separate PR targeting master before or alongside this one, and removed once python-graph-core is replaced (tracked in #421).


Production deployment steps

Run these in order when deploying to prod (montage, not montage-beta):

  1. Check prod is not in active use

    https://montage.toolforge.org/v1/logs/audit
    

    Check create_date on recent entries — proceed only if no active campaigns.

  2. SSH into prod

    ssh <username>@login.toolforge.org
    become montage
  3. Pull the new code

    cd ~/www/python/src
    git pull origin master
  4. Run the DB migration (MySQL/MariaDB — run before restarting the service)

    sql local
    USE s52584__montage;
    ALTER TABLE entries ADD COLUMN file_id BIGINT DEFAULT NULL;
    ALTER TABLE entries ADD INDEX ix_entry_file_id (file_id);
    EXIT;

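    Both statements should be safe to run on a live DB, assuming entries is an InnoDB table: in MariaDB 10.6, ADD COLUMN with DEFAULT NULL is normally an instant, metadata-only change, and ADD INDEX is built in place without blocking concurrent reads or writes.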

  5. Rebuild the frontend (if frontend changed)

    toolforge webservice node18 shell -m 2G
    cd ~/www/python/src/frontend
    npm install
    npm run toolforge:build
    exit
  6. Install any new Python packages

    toolforge webservice python3.13 shell
    pip install -r ~/www/python/src/requirements.txt
    exit
  7. Restart the service

    toolforge webservice python3.13 restart
  8. Verify

@lgelauff lgelauff force-pushed the migrate-image-to-filerevision branch from 5be508f to 13b622e Compare April 18, 2026 14:30
lgelauff added a commit that referenced this pull request Apr 18, 2026
cffi==1.16.0 has no pre-built wheel for Python 3.13 and fails to
compile from source on Toolforge (missing libffi-dev headers).
Bumped to 1.17.1 which ships a Python 3.13 wheel.

setuptools 82.0.0 removed pkg_resources entirely. python-graph-core
uses pkg_resources for namespace package declarations, causing
ModuleNotFoundError on startup. Pinned to 81.0.0 until
python-graph-core is replaced (#421).

Discovered during Python 3.13 Toolforge deployment of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff lgelauff marked this pull request as ready for review April 19, 2026 06:20
@lgelauff lgelauff requested a review from mahmoud April 19, 2026 06:20
lgelauff added a commit that referenced this pull request Apr 19, 2026
- Add tools/migrate_prod_db.sql and revert_prod_db.sql for production
  MySQL migration (idempotent, exact inverses of each other)
- Add comment to labs.py clarifying that alias `oi` mirrors the old
  oldimage role but is a filerevision subquery, not the oldimage table
- Change fixture file_ids (conftest.py) for SELECTED and REUPLOAD from
  99999/88888 to 1/2, structurally below the FIXTURE_FILE_INFOS range
  (1000+) so collisions are impossible regardless of future additions

Part of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lgelauff added a commit that referenced this pull request Apr 19, 2026
- Fix tablename: actual SQLAlchemy model uses `entries` not `entry`
- Rename index to ix_entries_file_id to match tablename convention
- Both scripts remain idempotent (IF NOT EXISTS / IF EXISTS)

Part of #505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…evision/filetypes

Closes #425. Relates to #505.

Changes:
- labs.py: new query using file/filerevision/filetypes tables; adds file_id;
  SELECT DISTINCT to prevent duplicates from multiple linktarget rows
- rdb.py: add file_id column to Entry; deduplicate entries case-insensitively
  in add_entries() to match MariaDB utf8mb4_unicode_ci collation
- loaders.py: pass file_id through make_entry(); update export dicts
- tests: update fixtures and assertions for new query shape
- tools/migrate_prod_db.sql, revert_prod_db.sql: production schema migration
- requirements.txt: cffi 1.17.1, setuptools pin for Python 3.13
- deployment.md: update python3.9 → python3.13 references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lgelauff lgelauff force-pushed the migrate-image-to-filerevision branch from 2063852 to ef40cd7 Compare April 19, 2026 12:12
@lgelauff commented

Beta validation complete (2026-04-19)

Validated end-to-end on montage-beta running MariaDB (10.6.22, utf8mb4_unicode_ci) — same stack as production.

Query parity

  • get_files() (new) and get_files_legacy() (old) return identical results for a real WLM category (WLM France 2025: 9,844 files, 0 discrepancies).
  • compare_queries.py confirms 0 files only in new, 0 files only in legacy.
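
The parity check reported above boils down to a set comparison of filenames. A minimal sketch follows; it is not compare_queries.py itself, and both the function name and the assumption that each row is a dict keyed by img_name are illustrative.

```python
def compare_name_sets(new_rows, legacy_rows, key="img_name"):
    """Report filenames present in only one of the two query results."""
    new_names = {row[key] for row in new_rows}
    legacy_names = {row[key] for row in legacy_rows}
    return {
        "only_in_new": sorted(new_names - legacy_names),
        "only_in_legacy": sorted(legacy_names - new_names),
    }


new_rows = [{"img_name": "A.jpg"}, {"img_name": "B.jpg"}]
legacy_rows = [{"img_name": "B.jpg"}, {"img_name": "A.jpg"}]
report = compare_name_sets(new_rows, legacy_rows)
print(report)
```

Comparing sets ignores row order and duplicates, so it checks which files are returned, not how many times each appears.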

Import pipeline

  • Category import via UI completed successfully (9,844/9,844 entries, zero gaps).
  • Two bugs found and fixed during testing:
    • Duplicate linktarget rows caused duplicate entries → fixed with SELECT DISTINCT in get_files().
    • MariaDB utf8mb4_unicode_ci case-insensitive unique index caused IntegrityError on case-variant filenames → fixed with case-insensitive dedup in add_entries() in rdb.py.
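
The second fix can be pictured with the sketch below. It is a hypothetical helper, not the add_entries() code in rdb.py, and it only approximates the collation: utf8mb4_unicode_ci also treats accented and unaccented letters as equal, which str.casefold() does not.

```python
def dedupe_case_insensitive(names):
    """Keep the first occurrence of each name, comparing case-insensitively."""
    seen = set()
    unique = []
    for name in names:
        key = name.casefold()
        if key in seen:
            continue  # a case variant was already kept
        seen.add(key)
        unique.append(name)
    return unique


unique_names = dedupe_case_insensitive(["Foo.JPG", "foo.jpg", "Bar.png"])
print(unique_names)
```

Deduplicating before the insert avoids relying on the database to raise IntegrityError on the unique index.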

Migration SQL

  • migrate_prod_db.sql (add file_id column + index) runs cleanly on MariaDB.
  • revert_prod_db.sql (drop column + index) is cleanly reversible — verified by running revert + re-apply cycle on beta.

Production readiness

  • PR is ready to merge. After merge: run migrate_prod_db.sql on prod DB, deploy to prod.
