Progress on #38 — Replace coarse md5 snapshots with intelligent assertions for CellProfiler outputs#43
Conversation
…lysis module (nf-core#38) Replace the blunt getAllFilesFromDir ignore-list workaround in the CELLPROFILER_ANALYSIS real test with fine-grained, content-aware assertions: - Image.csv: parse header + data row; assert stable columns exist, stable field values match (Metadata_Plate, Metadata_Well, Count_Cells > 0), and volatile PathName_*Outlines columns match /.*\/outlines/ pattern rather than pinning the work-dir hash - Experiment.csv: scan line-by-line for CellProfiler_Version row and assert value is 4.2.8; Run_Timestamp is deliberately not asserted - PNG outlines: assert exists() + size() > 1000 instead of md5 hash, which drifts across libpng versions between macOS and Linux CI - Nuclei/Cells/Cytoplasm: assert existence and column schema (ImageNumber, ObjectNumber); floating-point drift tracked separately in nf-core#41 Remove **/Image.csv and **/Experiment.csv from tests/.nftignore since the module test now handles those files directly. Update snapshot to contain only versions.yml (PNG md5s removed). Made-with: Cursor
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
There was a problem hiding this comment.
Pull request overview
Updates nf-test coverage for the CELLPROFILER_ANALYSIS module by replacing coarse md5 snapshots of unstable outputs with targeted, content-aware assertions, and adjusts snapshot/ignore configuration to reflect the new strategy.
Changes:
- Reworked
modules/local/cellprofiler/analysis/tests/main.nf.testto assert stable CSV fields, tolerate volatile path/timestamp fields, and validate PNG outputs via existence/size checks. - Regenerated
modules/local/cellprofiler/analysis/tests/main.nf.test.snapto drop stale PNG md5 entries and keep deterministicversions.yml. - Updated
tests/.nftignoreto stop excludingImage.csv/Experiment.csv.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
tests/.nftignore |
Removes ignore patterns for Image.csv and Experiment.csv, affecting which files are included in snapshot hashing. |
modules/local/cellprofiler/analysis/tests/main.nf.test.snap |
Updates the analysis module snapshot to retain only deterministic artifacts (versions.yml). |
modules/local/cellprofiler/analysis/tests/main.nf.test |
Replaces snapshot(process.out).match()-style assertions with granular checks for CSV content and PNG sanity. |
Comments suppressed due to low confidence (1)
tests/.nftignore:7
- Removing
**/Image.csvand**/Experiment.csvfrom.nftignorewill cause md5-based snapshots that useignoreFile: 'tests/.nftignore'(e.g., the pipeline test) to start hashing these known non-deterministic files again (work-dir paths and timestamps), making snapshots flaky/fail. Either keep these patterns until the pipeline-level snapshot strategy is updated, or add a pipeline-level preprocessing/intelligent assertion approach for these files before un-ignoring them.
.DS_Store
# Cells.csv, Cytoplasm.csv, Nuclei.csv contain per-cell measurements whose
# floating-point values drift across BLAS implementations (macOS Accelerate
# vs Linux OpenBLAS in CI). Tracked in #41.
**/Cells.csv
**/Cytoplasm.csv
**/Nuclei.csv
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| { | ||
| ["Nuclei.csv", "Cells.csv", "Cytoplasm.csv"].each { csvName -> | ||
| def header = analysisDir.resolve(csvName).readLines()[0].split(",") as List | ||
| assert "ImageNumber" in header | ||
| assert "ObjectNumber" in header | ||
| } |
There was a problem hiding this comment.
readLines()[0] loads the entire per-cell CSV into memory just to read the header. These files can be large; prefer reading only the first line (e.g., newReader().readLine() / eachLine with an early break) to avoid unnecessary memory/time overhead in tests.
kenibrewer
left a comment
There was a problem hiding this comment.
Nice work on this, Mrunali! The module-level assertions are well structured — the per-output approach (header check + stable-field equality + regex for volatile fields + size sanity check for PNGs) is exactly the direction issue #38 was pointing at, and it reads cleanly. Thank you for taking this on.
I have one thing I'd like to sort out before we merge, plus a few smaller suggestions. Happy to pair on any of these if it'd help.
One thing to fix before merging — the tests/.nftignore change
The module test now handles Image.csv and Experiment.csv intelligently, which is great. But the pipeline-level test (tests/default.nf.test) also reads those same output files and md5-snapshots them, via this line:
def stable_path = getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore')That means tests/.nftignore has two consumers: the module test (no longer needs the entries) and the pipeline test (still does). When **/Image.csv and **/Experiment.csv are removed from .nftignore, the pipeline test starts hashing the volatile content (PathName_*Outlines work-dir paths, Run_Timestamp) and the snapshot will mismatch on every run.
Why CI is green: nf-test's --changed-since HEAD^ selects tests by walking script dependencies, and it doesn't see tests/.nftignore as an input to tests/default.nf.test, so the pipeline test wasn't picked up in CI for this PR. The full local suite (nf-test test tests/default.nf.test modules/local --profile test,docker per CLAUDE.md) is where it'll show up.
Suggested fix: put **/Image.csv and **/Experiment.csv back in tests/.nftignore, with a brief comment that they remain because the pipeline-level snapshot still uses md5 hashing on these files. The two layers of testing have different jobs — the module test verifies content, the pipeline test verifies file layout — and .nftignore is the standard nf-core mechanism for letting the pipeline test skip files with volatile bytes. You can see the same pattern in nf-core/sarek, nf-core/rnaseq, etc.
Smaller suggestions
- Hardcoded CellProfiler version:
cpVersionLine == "CellProfiler_Version,4.2.8"will break the next time we bump CellProfiler — someone will have to update both the snap and this assertion. A regex likecpVersionLine ==~ /^CellProfiler_Version,\d+\.\d+(\.\d+)?$/keeps the test decoupled from the specific version. Up to you. Count_Cellsas float:imageRow["Count_Cells"].toFloat() > 0works, but cell counts are integers —.toInteger()is a touch more honest about what's being checked.- Asymmetric split:
imageLines[0].split(",")(header) vsimageLines[1].split(",", -1)(data row). The-1is there to preserve trailing empty fields, which is the right choice — applying it to both keeps them symmetric. Probably doesn't matter in practice for CellProfiler's CSV header, but it's a small consistency win. - Index-then-check ordering (also called out by Copilot):
imageRowis built usingimageLines[1]beforeassert imageLines.size() == 2runs. IfImage.csvever came back with one line, you'd get anIndexOutOfBoundsExceptioninstead of a clear assertion message. Easy reorder: assert size first, then buildimageRow. cpVersionLine != null: redundant once you assert equality (or the regex above), since equality already implies non-null. Fine to leave for readability — just flagging it.
The Copilot suggestions about streaming reads with eachLine/newReader are theoretical — these test files are small and the current readLines() reads finish in milliseconds. Don't feel any pressure to refactor that.
What's good
- Clear, well-commented assertions — easy to read, easy to update later
- Right call to keep
versions.ymlas md5 (it really is deterministic) - Existence + size > 1000 for PNGs is the right level of strictness given the libpng drift issue
- Schema-only checks for Cells/Cytoplasm/Nuclei.csv is the right call until #41 is sorted
This is real progress on #38. Once the .nftignore piece is sorted, this is good to merge.
Progress on #38
This PR implements line-level and schema-based assertions for
modules/local/cellprofiler/analysis/tests/main.nf.test, replacing the bluntsnapshot(process.out).match()approach introduced as a workaround in #37.What changed
modules/local/cellprofiler/analysis/tests/main.nf.testgetAllFilesFromDir(..., ignore: [...])+snapshot().match()with per-output intelligent assertions insideassertAll()Image.csv: parsed line-by-line usingreadLines(); stable columns (Metadata_Plate,Metadata_Well) are exact-matched;Count_Cellsis sanity-checked as > 0; volatilePathName_*columns are regex-matched (==~ /.*\/outlines/) to tolerate work-dir hash changesExperiment.csv: scanned for theCellProfiler_Versionline only;Run_Timestampis deliberately not assertedexists()+size() > 1000checks to tolerate libpng byte drift across macOS and LinuxCells/Cytoplasm/Nuclei.csv: existence and header schema (ImageNumber,ObjectNumber) asserted only; floating-point measurement values not checked pending CellProfiler per-cell measurement CSVs (Cells/Cytoplasm/Nuclei.csv) drift across platforms #41versions.yml: retained as md5 snapshot (fully deterministic)main.nf.test.snapcellprofiler - analysissnapshot entry; regenerated with--update-snapshotto containversions.ymlonlytests/.nftignore**/Image.csvand**/Experiment.csvexclusions since the module test now handles these files directly