Skip to content

Progress on #38 — Replace coarse md5 snapshots with intelligent assertions for CellProfiler outputs#43

Open
emrunali wants to merge 1 commit intonf-core:devfrom
emrunali:test/intelligent-assertions-analysis-38
Open

Progress on #38 — Replace coarse md5 snapshots with intelligent assertions for CellProfiler outputs#43
emrunali wants to merge 1 commit intonf-core:devfrom
emrunali:test/intelligent-assertions-analysis-38

Conversation

@emrunali
Copy link
Copy Markdown

@emrunali emrunali commented Apr 28, 2026

Progress on #38
This PR implements line-level and schema-based assertions for modules/local/cellprofiler/analysis/tests/main.nf.test, replacing the blunt snapshot(process.out).match() approach introduced as a workaround in #37.

What changed

modules/local/cellprofiler/analysis/tests/main.nf.test

  • Replaced getAllFilesFromDir(..., ignore: [...]) + snapshot().match() with per-output intelligent assertions inside assertAll()
  • Image.csv: parsed line-by-line using readLines(); stable columns (Metadata_Plate, Metadata_Well) are exact-matched; Count_Cells is sanity-checked as > 0; volatile PathName_* columns are regex-matched (==~ /.*\/outlines/) to tolerate work-dir hash changes
  • Experiment.csv: scanned for the CellProfiler_Version line only; Run_Timestamp is deliberately not asserted
  • PNG outlines: replaced md5 snapshot with exists() + size() > 1000 checks to tolerate libpng byte drift across macOS and Linux
  • Cells/Cytoplasm/Nuclei.csv: existence and header schema (ImageNumber, ObjectNumber) asserted only; floating-point measurement values not checked pending CellProfiler per-cell measurement CSVs (Cells/Cytoplasm/Nuclei.csv) drift across platforms #41
  • versions.yml: retained as md5 snapshot (fully deterministic)

main.nf.test.snap

  • Removed stale PNG md5 hashes from the cellprofiler - analysis snapshot entry; regenerated with --update-snapshot to contain versions.yml only

tests/.nftignore

  • Removed **/Image.csv and **/Experiment.csv exclusions since the module test now handles these files directly

…lysis module (nf-core#38)

Replace the blunt getAllFilesFromDir ignore-list workaround in the
CELLPROFILER_ANALYSIS real test with fine-grained, content-aware assertions:

- Image.csv: parse header + data row; assert stable columns exist, stable
  field values match (Metadata_Plate, Metadata_Well, Count_Cells > 0), and
  volatile PathName_*Outlines columns match /.*\/outlines/ pattern rather
  than pinning the work-dir hash
- Experiment.csv: scan line-by-line for CellProfiler_Version row and assert
  value is 4.2.8; Run_Timestamp is deliberately not asserted
- PNG outlines: assert exists() + size() > 1000 instead of md5 hash, which
  drifts across libpng versions between macOS and Linux CI
- Nuclei/Cells/Cytoplasm: assert existence and column schema (ImageNumber,
  ObjectNumber); floating-point drift tracked separately in nf-core#41

Remove **/Image.csv and **/Experiment.csv from tests/.nftignore since the
module test now handles those files directly. Update snapshot to contain
only versions.yml (PNG md5s removed).

Made-with: Cursor
@emrunali emrunali requested review from Copilot and kenibrewer April 28, 2026 18:54
@nf-core-bot
Copy link
Copy Markdown
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates nf-test coverage for the CELLPROFILER_ANALYSIS module by replacing coarse md5 snapshots of unstable outputs with targeted, content-aware assertions, and adjusts snapshot/ignore configuration to reflect the new strategy.

Changes:

  • Reworked modules/local/cellprofiler/analysis/tests/main.nf.test to assert stable CSV fields, tolerate volatile path/timestamp fields, and validate PNG outputs via existence/size checks.
  • Regenerated modules/local/cellprofiler/analysis/tests/main.nf.test.snap to drop stale PNG md5 entries and keep deterministic versions.yml.
  • Updated tests/.nftignore to stop excluding Image.csv / Experiment.csv.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/.nftignore Removes ignore patterns for Image.csv and Experiment.csv, affecting which files are included in snapshot hashing.
modules/local/cellprofiler/analysis/tests/main.nf.test.snap Updates the analysis module snapshot to retain only deterministic artifacts (versions.yml).
modules/local/cellprofiler/analysis/tests/main.nf.test Replaces snapshot(process.out).match()-style assertions with granular checks for CSV content and PNG sanity.
Comments suppressed due to low confidence (1)

tests/.nftignore:7

  • Removing **/Image.csv and **/Experiment.csv from .nftignore will cause md5-based snapshots that use ignoreFile: 'tests/.nftignore' (e.g., the pipeline test) to start hashing these known non-deterministic files again (work-dir paths and timestamps), making snapshots flaky/fail. Either keep these patterns until the pipeline-level snapshot strategy is updated, or add a pipeline-level preprocessing/intelligent assertion approach for these files before un-ignoring them.
.DS_Store
# Cells.csv, Cytoplasm.csv, Nuclei.csv contain per-cell measurements whose
# floating-point values drift across BLAS implementations (macOS Accelerate
# vs Linux OpenBLAS in CI). Tracked in #41.
**/Cells.csv
**/Cytoplasm.csv
**/Nuclei.csv

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread modules/local/cellprofiler/analysis/tests/main.nf.test
Comment thread modules/local/cellprofiler/analysis/tests/main.nf.test
Comment on lines +115 to +120
{
["Nuclei.csv", "Cells.csv", "Cytoplasm.csv"].each { csvName ->
def header = analysisDir.resolve(csvName).readLines()[0].split(",") as List
assert "ImageNumber" in header
assert "ObjectNumber" in header
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readLines()[0] loads the entire per-cell CSV into memory just to read the header. These files can be large; prefer reading only the first line (e.g., newReader().readLine() / eachLine with an early break) to avoid unnecessary memory/time overhead in tests.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@kenibrewer kenibrewer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this, Mrunali! The module-level assertions are well structured — the per-output approach (header check + stable-field equality + regex for volatile fields + size sanity check for PNGs) is exactly the direction issue #38 was pointing at, and it reads cleanly. Thank you for taking this on.

I have one thing I'd like to sort out before we merge, plus a few smaller suggestions. Happy to pair on any of these if it'd help.

One thing to fix before merging — the tests/.nftignore change

The module test now handles Image.csv and Experiment.csv intelligently, which is great. But the pipeline-level test (tests/default.nf.test) also reads those same output files and md5-snapshots them, via this line:

def stable_path = getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore')

That means tests/.nftignore has two consumers: the module test (no longer needs the entries) and the pipeline test (still does). When **/Image.csv and **/Experiment.csv are removed from .nftignore, the pipeline test starts hashing the volatile content (PathName_*Outlines work-dir paths, Run_Timestamp) and the snapshot will mismatch on every run.

Why CI is green: nf-test's --changed-since HEAD^ selects tests by walking script dependencies, and it doesn't see tests/.nftignore as an input to tests/default.nf.test, so the pipeline test wasn't picked up in CI for this PR. The full local suite (nf-test test tests/default.nf.test modules/local --profile test,docker per CLAUDE.md) is where it'll show up.

Suggested fix: put **/Image.csv and **/Experiment.csv back in tests/.nftignore, with a brief comment that they remain because the pipeline-level snapshot still uses md5 hashing on these files. The two layers of testing have different jobs — the module test verifies content, the pipeline test verifies file layout — and .nftignore is the standard nf-core mechanism for letting the pipeline test skip files with volatile bytes. You can see the same pattern in nf-core/sarek, nf-core/rnaseq, etc.

Smaller suggestions

  • Hardcoded CellProfiler version: cpVersionLine == "CellProfiler_Version,4.2.8" will break the next time we bump CellProfiler — someone will have to update both the snap and this assertion. A regex like cpVersionLine ==~ /^CellProfiler_Version,\d+\.\d+(\.\d+)?$/ keeps the test decoupled from the specific version. Up to you.
  • Count_Cells as float: imageRow["Count_Cells"].toFloat() > 0 works, but cell counts are integers — .toInteger() is a touch more honest about what's being checked.
  • Asymmetric split: imageLines[0].split(",") (header) vs imageLines[1].split(",", -1) (data row). The -1 is there to preserve trailing empty fields, which is the right choice — applying it to both keeps them symmetric. Probably doesn't matter in practice for CellProfiler's CSV header, but it's a small consistency win.
  • Index-then-check ordering (also called out by Copilot): imageRow is built using imageLines[1] before assert imageLines.size() == 2 runs. If Image.csv ever came back with one line, you'd get an IndexOutOfBoundsException instead of a clear assertion message. Easy reorder: assert size first, then build imageRow.
  • cpVersionLine != null: redundant once you assert equality (or the regex above), since equality already implies non-null. Fine to leave for readability — just flagging it.

The Copilot suggestions about streaming reads with eachLine/newReader are theoretical — these test files are small and the current readLines() reads finish in milliseconds. Don't feel any pressure to refactor that.

What's good

  • Clear, well-commented assertions — easy to read, easy to update later
  • Right call to keep versions.yml as md5 (it really is deterministic)
  • Existence + size > 1000 for PNGs is the right level of strictness given the libpng drift issue
  • Schema-only checks for Cells/Cytoplasm/Nuclei.csv is the right call until #41 is sorted

This is real progress on #38. Once the .nftignore piece is sorted, this is good to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants