Add known_hash file() option #5807

Open: edmundmiller wants to merge 6 commits into base: master

edmundmiller
Contributor

Overview

This PR adds file hash verification to Nextflow's file() function. Users can verify a file's integrity by checking its hash against an expected value, which is particularly useful for downloaded files or when data integrity is critical.

Implementation Details

  • Added FileHashVerifier class that supports multiple hash algorithms:
    • MD5
    • SHA-1
    • SHA-256
    • SHA-384
    • SHA-512
  • Extended the file() function to accept a known_hash parameter in the format algorithm:hash
  • Added comprehensive test coverage in FileHashVerifierTest
  • Updated documentation in working-with-files.md
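
To make the algorithm:hash format concrete, here is a minimal sketch of how such a check could work on the JVM using java.security.MessageDigest. It is not the FileHashVerifier class from this PR: the class name KnownHashSketch, its method names, and the prefixes sha1, sha384, and sha512 are assumptions (only md5 and sha256 appear in the PR's examples).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: not the FileHashVerifier class added by this PR.
public class KnownHashSketch {

    // Maps the prefix in "algorithm:hash" to a java.security.MessageDigest name.
    // Only "md5" and "sha256" appear in the PR; the other prefixes are assumed.
    static final Map<String, String> ALGORITHMS = Map.of(
            "md5", "MD5",
            "sha1", "SHA-1",
            "sha256", "SHA-256",
            "sha384", "SHA-384",
            "sha512", "SHA-512");

    // Splits "algorithm:hash" into { MessageDigest name, lower-case hex digest }.
    static String[] parse(String knownHash) {
        int sep = (knownHash == null) ? -1 : knownHash.indexOf(':');
        if (sep <= 0 || sep == knownHash.length() - 1) {
            throw new IllegalArgumentException(
                    "Invalid hash format (expected algorithm:hash): " + knownHash);
        }
        String prefix = knownHash.substring(0, sep).toLowerCase(Locale.ROOT);
        String algorithm = ALGORITHMS.get(prefix);
        if (algorithm == null) {
            throw new IllegalArgumentException("Unsupported hash algorithm: " + prefix);
        }
        return new String[] { algorithm, knownHash.substring(sep + 1).toLowerCase(Locale.ROOT) };
    }

    // Streams the file through the digest and returns the result as lower-case hex.
    static String hexDigest(String algorithm, Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported hash algorithm: " + algorithm, e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Throws IllegalArgumentException when the file's digest differs from the expected one.
    static void verify(Path file, String knownHash) throws IOException {
        String[] expected = parse(knownHash);
        String actual = hexDigest(expected[0], file);
        if (!actual.equals(expected[1])) {
            throw new IllegalArgumentException(
                    "Hash mismatch for " + file + ": expected " + expected[1] + ", got " + actual);
        }
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempFile("known-hash", ".txt"); // zero-byte file
        verify(empty, "md5:d41d8cd98f00b204e9800998ecf8427e");   // MD5 of empty input
        System.out.println("verified " + empty);
        Files.delete(empty);
    }
}
```

The sketch streams the file in fixed-size chunks so large inputs are not held in memory, and it reports all three failure cases through IllegalArgumentException to mirror the behaviour the PR describes.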

Usage Example

// Verify a file with MD5
file('data.txt', known_hash: 'md5:d41d8cd98f00b204e9800998ecf8427e')
// Verify a file with SHA-256
file('data.txt', known_hash: 'sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855')

Limitations

  • Hash verification is not supported when using glob patterns that match multiple files
  • Throws an IllegalArgumentException if:
    • Hash format is invalid
    • Hash algorithm is not supported
    • File hash doesn't match the expected value

Inspiration

This feature was inspired by the Pooch library, which provides similar functionality for Python.

@edmundmiller edmundmiller requested a review from a team as a code owner February 21, 2025 15:53

netlify bot commented Feb 21, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 6368f7a
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67b8a19a17578f000808d3f0
😎 Deploy Preview https://deploy-preview-5807--nextflow-docs-staging.netlify.app

@bentsherman bentsherman linked an issue Feb 21, 2025 that may be closed by this pull request
@bentsherman
Member

I think Jordi took a crack at this a few years ago: #4415

I guess now we have two approaches we can compare 😅


```nextflow
// Verify a file with MD5
file('some/path/to/my_file.file', known_hash: 'md5:d41d8cd98f00b204e9800998ecf8427e')
```
Member

@pditommaso pditommaso Feb 24, 2025

Suggested change
- file('some/path/to/my_file.file', known_hash: 'md5:d41d8cd98f00b204e9800998ecf8427e')
+ file('some/path/to/my_file.file', checksum: 'md5:d41d8cd98f00b204e9800998ecf8427e')

checksum may be preferable

@bentsherman
Member

I think I like Jordi's approach better. The problem with this PR is that it verifies the checksum when you call file(), but file() should only give you a reference; it should not trigger a download.

@edmundmiller would you be willing to test Jordi's PR and build on it instead? If you can confirm that it works, that will make it easier to merge.
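
The concern above, that creating a reference should not touch the data, can be illustrated with a small sketch of deferred verification. This is a hypothetical illustration only: it is not code from this PR and does not describe how #4415 works. The names LazyChecksumSketch, CheckedRef, and readVerified are invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of deferred verification; not code from this PR or from #4415.
public class LazyChecksumSketch {

    // A reference that records the expected digest but performs no I/O when created.
    static final class CheckedRef {
        final Path path;
        final String algorithm;   // MessageDigest name, e.g. "SHA-256"
        final String expectedHex;

        CheckedRef(Path path, String algorithm, String expectedHex) {
            this.path = path;
            this.algorithm = algorithm;
            this.expectedHex = expectedHex;
        }

        // First point at which the file is touched: read it, then compare digests.
        // Reads the whole file into memory for brevity; a real implementation would stream.
        byte[] readVerified() throws IOException {
            byte[] data = Files.readAllBytes(path);
            String actual = hex(digest(data));
            if (!actual.equalsIgnoreCase(expectedHex)) {
                throw new IllegalArgumentException(
                        "Hash mismatch for " + path + ": expected " + expectedHex + ", got " + actual);
            }
            return data;
        }

        private byte[] digest(byte[] data) {
            try {
                return MessageDigest.getInstance(algorithm).digest(data);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException("Unsupported hash algorithm: " + algorithm, e);
            }
        }
    }

    // Encodes bytes as lower-case hexadecimal.
    static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            out.append(String.format("%02x", b));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempFile("lazy-checksum", ".txt"); // zero-byte file
        // Creating the reference does not read the file.
        CheckedRef ref = new CheckedRef(empty, "SHA-256",
                "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855");
        byte[] data = ref.readVerified(); // verification happens here
        System.out.println("verified " + data.length + " bytes");
        Files.delete(empty);
    }
}
```

With this shape, the expected checksum travels with the reference and the comparison runs only when the contents are actually read, so constructing the reference never forces a download.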

Development

Successfully merging this pull request may close these issues.

Add an option to the file method to check for md5sum
3 participants