fix(loaders): remove silent flags clobber in make_entry, normalize load_category return signature#507
Open
ayushshukla1807 wants to merge 3 commits intohatnote:migrate-image-to-filerevisionfrom
Conversation
The `image` and `oldimage` tables are being removed from Wikimedia wikireplicas on 28 May 2026 (T28741). Rewrites `get_files()` and `get_file_info()` in `labs.py` to use the new `file`, `filerevision`, and `filetypes` tables. All output key aliases (`img_*`, `oi_archive_name`, `rec_img_*`) are preserved so downstream code is unaffected. Also adds `file_id` (the Commons file identity) as a nullable column on the `Entry` model, threaded through `make_entry()` for all new imports. Key implementation notes: - GROUP BY non-determinism fixed: replaced with a correlated MIN(fr_id) subquery that deterministically selects the earliest non-deleted revision per file (original uploader / original timestamp). - `get_files_legacy()` is a temporary verbatim copy of the old query, kept alive solely to power `test_get_files_parity` (xfail) while both table sets are live. Remove together with that test after 28 May 2026. - The new `filetypes` LEFT JOIN can produce NULL MIME values (the old `image` table never did). Files with unknown MIME are kept (stored as NULL) rather than silently dropped. This surfaces a pre-existing bug in `autodisqualify_by_filetype()` where SQL NOT IN on NULL evaluates to NULL, silently skipping disqualification — left as a follow-up issue. - Toolforge runs MariaDB 10.6.22 (partial unique indexes require 10.8+), so the migration SQL uses a plain non-unique index on `file_id`; import-time uniqueness enforced at the application layer. Migration SQL: `.claude/features/migrate-image-to-filerevision/migration.sql` Run on production before deploying. Closes hatnote#504 Refs T28741 (https://phabricator.wikimedia.org/T28741) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add AND fr.fr_deleted = 0 to the main filerevision JOIN in both get_files() and get_file_info(). In the old image table schema the current revision was always live; in filerevision, file_latest can theoretically point to a suppressed revision. Without this guard such a file would be excluded from results by the inner JOIN rather than returning NULL metadata — a regression introduced by the migration. Also add a comment to make_entry() explaining that CSV-imported entries intentionally receive file_id=NULL (file_id only comes from wikireplica imports, not the CSV path). Refs hatnote#505 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ad_category return signature Two small bugs in loaders.py noticed while working on the migration: 1. make_entry() had a dead code block on lines 75-76 that would overwrite legitimate reupload flags (set a few lines earlier) if an edict happened to carry a top-level 'flags' key. No caller path ever sets this key in an edict, so the block was useless at best and destructive at worst. Removed. 2. load_category() was the only loader function returning a bare list instead of the standard (entries, warnings) tuple used by load_full_csv, load_partial_csv, load_by_filename and load_name_list. The inconsistency was quietly masking import errors. Normalized it to (entries, warnings) and updated the rdb.py call site (add_entries_from_cat) to unpack correctly and surface any skipped-entry count in the audit log.
2063852 to
ef40cd7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two small bugs I spotted while going through the migration code.
1. Silent flags clobber in
make_entry()(loaders.py)There was a dead block at the bottom of
make_entry()that would overwrite the reupload flags dict if anedicthappened to carry a top-levelflagskey. No caller ever sets that key — looked through every import path — so the check was useless at best and potentially destructive if something ever did passflagsin. Removed it.2.
load_category()return inconsistency (loaders.py + rdb.py)Every other loader (
load_full_csv,load_by_filename, etc.) returns(entries, warnings).load_category()was the odd one out — just returned a bare list. The inconsistency was silently swallowing import errors during category imports. Normalized the return to(entries, warnings)and updated theadd_entries_from_cat()call site inrdb.pyto unpack correctly. Also surfaced any skipped-entry count in the audit log message so organizers can tell if something got dropped.Tests pass locally: