Addressable data store #5715

Draft
wants to merge 28 commits into
base: master
Member

@pditommaso pditommaso commented Jan 27, 2025

Tentative implementation for addressable data store (very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

  • CID store is specified by workflow.data.store.location
  • Workflow Hash is created based on the workflow and parameters description
  • workflow, tasks and outputs metadata are stored in <cid.store.location>/.meta
  • references to other CID metadata take the form cid://<workflow_hash|task_hash>/output_target_path
  • CID NIO Filesystem to access data based on CID URLs
  • nextflow cid command to log, show and get lineage from CID store metadata
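
The reference format and the .meta layout listed above can be sketched as follows. This is an illustration in Python rather than the Groovy used by the PR; the function names are hypothetical, and the assumption that each hash maps to a subdirectory of .meta is not confirmed by the description.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_cid_uri(uri: str) -> tuple[str, str]:
    """Split a cid:// reference into (hash, relative output path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "cid" or not parsed.netloc:
        raise ValueError(f"not a CID reference: {uri!r}")
    # The hash sits in the authority position; the rest is the output path.
    return parsed.netloc, parsed.path.lstrip("/")


def metadata_location(store_location: str, uri: str) -> PurePosixPath:
    """Resolve a reference under <store_location>/.meta (assumed layout)."""
    key, relative = parse_cid_uri(uri)
    base = PurePosixPath(store_location) / ".meta" / key
    return base / relative if relative else base
```

A reference with no output path, such as the "source" value in a task output record, resolves to the hash directory itself.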

Known Limitations:

  • Outputs published to absolute paths or URLs that are not subfolders of the outputDir are not currently tracked in the CID store, because the relative output target path cannot be inferred. A possible fix is to create a hash of the parent directory of the URL or absolute path and use it as the relative folder.
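
The fallback proposed in this limitation could look roughly like the sketch below, written in Python for illustration. The function name, the use of SHA-256, and the 16-character truncation are all assumptions; the PR does not specify a hashing scheme.

```python
import hashlib


def relative_target(output_dir: str, target: str) -> str:
    """Return the target path relative to output_dir, or a hashed fallback."""
    prefix = output_dir.rstrip("/") + "/"
    if target.startswith(prefix):
        # Normal case: the target is a subfolder of the output directory.
        return target[len(prefix):]
    # Fallback: hash the parent directory and use it as the relative folder.
    parent, _, name = target.rpartition("/")
    digest = hashlib.sha256(parent.encode("utf-8")).hexdigest()[:16]
    return f"{digest}/{name}"
```

Because the digest depends only on the parent directory, two files published to the same external folder would share one relative folder in the store.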

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso marked this pull request as draft January 27, 2025 13:15

netlify bot commented Jan 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: 34cc0b1
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/67c327a700b9b7000831feeb

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 on February 10, 2025 21:46
@pditommaso
Member Author

@jorgee apologies, could the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

@jorgee
Contributor

jorgee commented Feb 13, 2025

@jorgee apologies, could the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

I have reverted the changes in this branch and created a new one in PR #5787

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Member Author

The following exception is reported when using nextflow cid log but no config is provided:

Mar-01 16:16:26.201 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.NullPointerException: Cannot invoke "nextflow.data.cid.CidStore.getHistoryFile()" because "store" is null
	at nextflow.cli.CmdCid$CmdLog.printHistory(CmdCid.groovy:124)
	at nextflow.cli.CmdCid$CmdLog.apply(CmdCid.groovy:117)
	at nextflow.cli.CmdCid.run(CmdCid.groovy:81)
	at nextflow.cli.Launcher.run(Launcher.groovy:504)
	at nextflow.cli.Launcher.main(Launcher.groovy:659)
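
One way to avoid this kind of failure is to check for a missing store before touching its history and to report a readable message instead. The sketch below shows the idea in Python; CidStore here is a simplified stand-in, and MissingStoreError, history_lines and the message text are hypothetical, not part of the Nextflow code base.

```python
from dataclasses import dataclass
from typing import Optional


class MissingStoreError(Exception):
    """Raised when no CID store has been configured."""


@dataclass
class CidStore:
    # Simplified stand-in: only the history entries matter for this sketch.
    history: list


def history_lines(store: Optional[CidStore]) -> list:
    """Return the history entries, failing clearly if no store exists."""
    if store is None:
        # Explicit guard instead of dereferencing a null store.
        raise MissingStoreError(
            "No CID store configured; set the store location in the config"
        )
    return list(store.history)
```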

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Member Author

pditommaso commented Mar 1, 2025

Minor to do:

  • Fix cid log exception when config is missing
  • Add nextflow cid -h help
  • When output files are stored on S3, the path in the metadata object does not include the s3:// prefix (see snippet 1)
  • workflow.data.location should allow remote paths, e.g. S3
  • Decouple CidFileSystem from the underlying local file system (ideally it should depend only on CidStore) [MAJOR]
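
The decoupling item above could take a shape like the following, sketched in Python for illustration. MetadataStore, InMemoryStore and StoreBackedFileSystem are hypothetical names, and the save/load contract is an assumption about what a store-only dependency might need.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical minimal contract: keys in, serialized metadata out."""

    @abstractmethod
    def save(self, key: str, value: str) -> None: ...

    @abstractmethod
    def load(self, key: str) -> str: ...


class InMemoryStore(MetadataStore):
    """Store with no file system behind it, useful for tests."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def save(self, key: str, value: str) -> None:
        self._entries[key] = value

    def load(self, key: str) -> str:
        return self._entries[key]


class StoreBackedFileSystem:
    """Reads go through the store only; no local path is touched."""

    def __init__(self, store: MetadataStore) -> None:
        self._store = store

    def read_metadata(self, cid_uri: str) -> str:
        # Strip the scheme and look the remainder up as a store key.
        return self._store.load(cid_uri.removeprefix("cid://"))
```

With this split, a local or remote store could be swapped in without changing the file system layer.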

snippet 1

{
    "type": "TaskOutput",
    "path": "/nextflow-ci/work/81/dda64885bd6f31254afd55868fc52e/multiqc_report.html",
    "checksum": "0a6ad9fb0a405182fd3a0304a1c202f2",
    "source": "cid://81dda64885bd6f31254afd55868fc52e",
    "size": 0,
    "createdAt": 1740844324000,
    "modifiedAt": 1740844324000,
    "annotations": null
}
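
The "path" value in snippet 1 has lost its scheme. A minimal sketch of how a serializer might keep it is shown below in Python; with_scheme is a hypothetical helper, and it assumes the first path segment is the bucket name.

```python
def with_scheme(scheme: str, path: str) -> str:
    """Render a path as a URI string, keeping the scheme for remote stores."""
    if scheme in ("", "file"):
        # Local paths are left untouched.
        return path
    # Assumption: the leading path segment is the bucket or host.
    return f"{scheme}://{path.lstrip('/')}"
```

Applied to the path in snippet 1 with scheme "s3", this yields a value starting with s3://nextflow-ci/work/.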

return new CidHistoryFile(metaLocation.resolve(HISTORY_FILE_NAME))
}

static Path getMetadataPath(DataConfig config){ config.store.location.resolve('.meta') }
Member Author

This could be an attribute of DataStoreOpts
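
The suggestion amounts to moving the derived .meta path onto the options object instead of computing it in a static helper. A rough Python illustration follows; only the DataStoreOpts name comes from the comment, while the location field and the meta_location property are assumptions.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DataStoreOpts:
    # Assumed field: the configured store location.
    location: PurePosixPath

    @property
    def meta_location(self) -> PurePosixPath:
        """Metadata directory derived from the store location."""
        return self.location / ".meta"
```

Callers would then read opts.meta_location rather than resolving '.meta' themselves.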

2 participants