Progress on #38 — Replace coarse md5 snapshots with intelligent assertions for CellProfiler outputs by emrunali · Pull Request #43 · nf-core/cellpainting

emrunali · 2026-04-28T18:54:20Z

Progress on #38
This PR implements line-level and schema-based assertions for modules/local/cellprofiler/analysis/tests/main.nf.test, replacing the blunt snapshot(process.out).match() approach introduced as a workaround in #37.

What changed

modules/local/cellprofiler/analysis/tests/main.nf.test

Replaced getAllFilesFromDir(..., ignore: [...]) + snapshot().match() with per-output intelligent assertions inside assertAll()
Image.csv: parsed line-by-line using readLines(); stable columns (Metadata_Plate, Metadata_Well) are exact-matched; Count_Cells is sanity-checked as > 0; volatile PathName_* columns are regex-matched (==~ /.*\/outlines/) to tolerate work-dir hash changes
Experiment.csv: scanned for the CellProfiler_Version line only; Run_Timestamp is deliberately not asserted
PNG outlines: replaced md5 snapshot with exists() + size() > 1000 checks to tolerate libpng byte drift across macOS and Linux
Cells/Cytoplasm/Nuclei.csv: existence and header schema (ImageNumber, ObjectNumber) asserted only; floating-point measurement values are not checked, pending #41 (CellProfiler per-cell measurement CSVs drift across platforms)
versions.yml: retained as md5 snapshot (fully deterministic)
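
For readers who want the shape of the Image.csv checks without opening the diff, here is a minimal standalone sketch in plain Groovy. It is not the code from this PR: the sample rows, the temp file, and the column name PathName_CellOutlines are assumptions for illustration (the PR only says the volatile columns follow the PathName_* pattern).

```groovy
import java.nio.file.Files

// Hypothetical stand-in for CellProfiler's Image.csv: one header line, one data row.
def imageCsv = Files.createTempFile("Image", ".csv").toFile()
imageCsv.text = "ImageNumber,Metadata_Plate,Metadata_Well,Count_Cells,PathName_CellOutlines\n" +
                "1,Plate1,A01,42,/work/ab/12cd34/outlines\n"

def lines = imageCsv.readLines()
assert lines.size() == 2                          // check the shape before indexing

// Pair each header name with its value so fields can be looked up by name.
def header = lines[0].split(",", -1) as List
def values = lines[1].split(",", -1) as List
def row = [header, values].transpose().collectEntries { [(it[0]): it[1]] }

assert row["Metadata_Plate"] == "Plate1"          // stable field: exact match
assert row["Metadata_Well"]  == "A01"             // stable field: exact match
assert row["Count_Cells"].toInteger() > 0         // sanity check, value not pinned
assert row["PathName_CellOutlines"] ==~ /.*\/outlines/   // volatile path: pattern only

imageCsv.delete()
```

The regex only pins the trailing directory name, so a different work-dir hash in the middle of the path still passes.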

main.nf.test.snap

Removed stale PNG md5 hashes from the cellprofiler - analysis snapshot entry; regenerated with --update-snapshot to contain versions.yml only

tests/.nftignore

Removed **/Image.csv and **/Experiment.csv exclusions since the module test now handles these files directly

…lysis module (nf-core#38)

Replace the blunt getAllFilesFromDir ignore-list workaround in the CELLPROFILER_ANALYSIS real test with fine-grained, content-aware assertions:

- Image.csv: parse header + data row; assert stable columns exist, stable field values match (Metadata_Plate, Metadata_Well, Count_Cells > 0), and volatile PathName_*Outlines columns match /.*\/outlines/ pattern rather than pinning the work-dir hash
- Experiment.csv: scan line-by-line for CellProfiler_Version row and assert value is 4.2.8; Run_Timestamp is deliberately not asserted
- PNG outlines: assert exists() + size() > 1000 instead of md5 hash, which drifts across libpng versions between macOS and Linux CI
- Nuclei/Cells/Cytoplasm: assert existence and column schema (ImageNumber, ObjectNumber); floating-point drift tracked separately in nf-core#41

Remove **/Image.csv and **/Experiment.csv from tests/.nftignore since the module test now handles those files directly. Update snapshot to contain only versions.yml (PNG md5s removed).

Made-with: Cursor

nf-core-bot · 2026-04-28T18:54:55Z

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Copilot

Pull request overview

Updates nf-test coverage for the CELLPROFILER_ANALYSIS module by replacing coarse md5 snapshots of unstable outputs with targeted, content-aware assertions, and adjusts snapshot/ignore configuration to reflect the new strategy.

Changes:

Reworked modules/local/cellprofiler/analysis/tests/main.nf.test to assert stable CSV fields, tolerate volatile path/timestamp fields, and validate PNG outputs via existence/size checks.
Regenerated modules/local/cellprofiler/analysis/tests/main.nf.test.snap to drop stale PNG md5 entries and keep deterministic versions.yml.
Updated tests/.nftignore to stop excluding Image.csv / Experiment.csv.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/.nftignore` | Removes ignore patterns for `Image.csv` and `Experiment.csv`, affecting which files are included in snapshot hashing. |
| `modules/local/cellprofiler/analysis/tests/main.nf.test.snap` | Updates the analysis module snapshot to retain only deterministic artifacts (`versions.yml`). |
| `modules/local/cellprofiler/analysis/tests/main.nf.test` | Replaces `snapshot(process.out).match()`-style assertions with granular checks for CSV content and PNG sanity. |

Comments suppressed due to low confidence (1)

tests/.nftignore:7

Removing **/Image.csv and **/Experiment.csv from .nftignore will cause md5-based snapshots that use ignoreFile: 'tests/.nftignore' (e.g., the pipeline test) to start hashing these known non-deterministic files again (work-dir paths and timestamps), making snapshots flaky/fail. Either keep these patterns until the pipeline-level snapshot strategy is updated, or add a pipeline-level preprocessing/intelligent assertion approach for these files before un-ignoring them.

.DS_Store
# Cells.csv, Cytoplasm.csv, Nuclei.csv contain per-cell measurements whose
# floating-point values drift across BLAS implementations (macOS Accelerate
# vs Linux OpenBLAS in CI). Tracked in #41.
**/Cells.csv
**/Cytoplasm.csv
**/Nuclei.csv


Copilot · 2026-04-28T18:59:19Z

+                {
+                    ["Nuclei.csv", "Cells.csv", "Cytoplasm.csv"].each { csvName ->
+                        def header = analysisDir.resolve(csvName).readLines()[0].split(",") as List
+                        assert "ImageNumber"  in header
+                        assert "ObjectNumber" in header
+                    }


readLines()[0] loads the entire per-cell CSV into memory just to read the header. These files can be large; prefer reading only the first line (e.g., newReader().readLine() / eachLine with an early break) to avoid unnecessary memory/time overhead in tests.
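
A header-only read along the lines suggested here could look like the sketch below. It is plain Groovy on a java.io.File with a hypothetical sample file; the column AreaShape_Area and the data rows are illustrative, and the PR's own code works on analysisDir.resolve(csvName) instead.

```groovy
import java.nio.file.Files

// Hypothetical per-cell CSV; real files would have many more rows and columns.
def csv = Files.createTempFile("Nuclei", ".csv").toFile()
csv.text = "ImageNumber,ObjectNumber,AreaShape_Area\n1,1,250.0\n1,2,310.5\n"

// withReader opens a buffered reader, hands it to the closure, and closes it afterwards.
// Only the first line is consumed, instead of materialising every line as a list.
def firstLine = csv.withReader { reader -> reader.readLine() }
def header = firstLine.split(",", -1) as List

assert "ImageNumber"  in header
assert "ObjectNumber" in header

csv.delete()
```

Whether this is worth the change depends on file size; for small fixtures the difference from readLines()[0] is unlikely to be measurable.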

kenibrewer

Nice work on this, Mrunali! The module-level assertions are well structured — the per-output approach (header check + stable-field equality + regex for volatile fields + size sanity check for PNGs) is exactly the direction issue #38 was pointing at, and it reads cleanly. Thank you for taking this on.

I have one thing I'd like to sort out before we merge, plus a few smaller suggestions. Happy to pair on any of these if it'd help.

One thing to fix before merging — the `tests/.nftignore` change

The module test now handles Image.csv and Experiment.csv intelligently, which is great. But the pipeline-level test (tests/default.nf.test) also reads those same output files and md5-snapshots them, via this line:

def stable_path = getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore')

That means tests/.nftignore has two consumers: the module test (no longer needs the entries) and the pipeline test (still does). When **/Image.csv and **/Experiment.csv are removed from .nftignore, the pipeline test starts hashing the volatile content (PathName_*Outlines work-dir paths, Run_Timestamp) and the snapshot will mismatch on every run.

Why CI is green: nf-test's --changed-since HEAD^ selects tests by walking script dependencies, and it doesn't see tests/.nftignore as an input to tests/default.nf.test, so the pipeline test wasn't picked up in CI for this PR. The full local suite (nf-test test tests/default.nf.test modules/local --profile test,docker per CLAUDE.md) is where it'll show up.

Suggested fix: put **/Image.csv and **/Experiment.csv back in tests/.nftignore, with a brief comment that they remain because the pipeline-level snapshot still uses md5 hashing on these files. The two layers of testing have different jobs — the module test verifies content, the pipeline test verifies file layout — and .nftignore is the standard nf-core mechanism for letting the pipeline test skip files with volatile bytes. You can see the same pattern in nf-core/sarek, nf-core/rnaseq, etc.
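
As a rough illustration of that fix, tests/.nftignore might end up looking like this. The existing entries are copied from the file as quoted in the Copilot comment; the wording of the new comment lines is only a suggestion, not text from the repository.

```
.DS_Store
# Cells.csv, Cytoplasm.csv, Nuclei.csv contain per-cell measurements whose
# floating-point values drift across BLAS implementations (macOS Accelerate
# vs Linux OpenBLAS in CI). Tracked in #41.
**/Cells.csv
**/Cytoplasm.csv
**/Nuclei.csv
# Image.csv and Experiment.csv stay ignored because the pipeline-level snapshot
# (tests/default.nf.test) still md5-hashes the outdir. Their content is checked
# in the CELLPROFILER_ANALYSIS module test instead.
**/Image.csv
**/Experiment.csv
```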

Smaller suggestions

Hardcoded CellProfiler version: cpVersionLine == "CellProfiler_Version,4.2.8" will break the next time we bump CellProfiler — someone will have to update both the snap and this assertion. A regex like cpVersionLine ==~ /^CellProfiler_Version,\d+\.\d+(\.\d+)?$/ keeps the test decoupled from the specific version. Up to you.
Count_Cells as float: imageRow["Count_Cells"].toFloat() > 0 works, but cell counts are integers — .toInteger() is a touch more honest about what's being checked.
Asymmetric split: imageLines[0].split(",") (header) vs imageLines[1].split(",", -1) (data row). The -1 is there to preserve trailing empty fields, which is the right choice — applying it to both keeps them symmetric. Probably doesn't matter in practice for CellProfiler's CSV header, but it's a small consistency win.
Index-then-check ordering (also called out by Copilot): imageRow is built using imageLines[1] before assert imageLines.size() == 2 runs. If Image.csv ever came back with one line, you'd get an IndexOutOfBoundsException instead of a clear assertion message. Easy reorder: assert size first, then build imageRow.
cpVersionLine != null: redundant once you assert equality (or the regex above), since equality already implies non-null. Fine to leave for readability — just flagging it.
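
Taken together, the smaller suggestions could be sketched roughly as follows. This is standalone Groovy over hypothetical in-memory lines rather than the PR's test code; the variable names imageLines, imageRow, and cpVersionLine follow the review text, while experimentLines, the Metadata_Note column, and all sample values are assumptions.

```groovy
// Hypothetical stand-ins for the lines read from Image.csv and Experiment.csv.
def imageLines = [
    "ImageNumber,Metadata_Plate,Count_Cells,Metadata_Note",
    "1,Plate1,42,"                                   // trailing empty field on purpose
]
def experimentLines = [
    "Key,Value",
    "CellProfiler_Version,4.2.8",
    "Run_Timestamp,2026-04-28T18:54:20"
]

// Assert the shape first, so a short file fails with a readable message
// instead of an index error when building imageRow.
assert imageLines.size() == 2 : "expected header + one data row, got ${imageLines.size()} line(s)"

// Same split limit on both rows; -1 keeps trailing empty fields.
def header = imageLines[0].split(",", -1) as List
def values = imageLines[1].split(",", -1) as List
assert header.size() == values.size()
def imageRow = [header, values].transpose().collectEntries { [(it[0]): it[1]] }

// Cell counts are integers.
assert imageRow["Count_Cells"].toInteger() > 0

// Version-agnostic pattern instead of a pinned "4.2.8".
def cpVersionLine = experimentLines.find { it.startsWith("CellProfiler_Version") }
assert cpVersionLine ==~ /^CellProfiler_Version,\d+\.\d+(\.\d+)?$/
```

With the default split(","), the data row above would yield three fields against a four-field header, which is the asymmetry the -1 limit avoids.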

The Copilot suggestions about streaming reads with eachLine/newReader are theoretical — these test files are small and the current readLines() reads finish in milliseconds. Don't feel any pressure to refactor that.

What's good

Clear, well-commented assertions — easy to read, easy to update later
Right call to keep versions.yml as md5 (it really is deterministic)
Existence + size > 1000 for PNGs is the right level of strictness given the libpng drift issue
Schema-only checks for Cells/Cytoplasm/Nuclei.csv is the right call until #41 is sorted

This is real progress on #38. Once the .nftignore piece is sorted, this is good to merge.

emrunali requested review from Copilot and kenibrewer April 28, 2026 18:54

Copilot started reviewing on behalf of emrunali April 28, 2026 18:54

Copilot AI reviewed Apr 28, 2026

kenibrewer requested changes Apr 28, 2026

emrunali wants to merge 1 commit into nf-core:dev from emrunali:test/intelligent-assertions-analysis-38
