storeDir does not skip staging of inputs #5468

Open
anoronh4 opened this issue Nov 4, 2024 · 6 comments · May be fixed by #5759

@anoronh4

anoronh4 commented Nov 4, 2024

Bug report

Expected behavior and actual behavior

I expect storeDir to skip processes whose outputs have already been produced. However, I am finding that it does not skip staging of the input files when they are remote. If the input file is very large, the downstream steps of the pipeline are delayed while Nextflow downloads a file that takes up space on the local disk and isn't necessarily used, because the task is skipped.

Steps to reproduce the problem

process A {
    cache true
    storeDir "storeDir"

    input:
    tuple val(meta), path(x)

    output:
    path(x)

    script:
    """
    echo hi
    """
}

workflow {
    remotepath = "https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz"
    A([[:], remotepath])
}

Program output

If I run it for the first time, it downloads the file and performs the task, as expected. If I immediately re-run with -resume, it skips the process and the pipeline completes very quickly, also as expected. However, when I remove the folder work/stage-* and then try to resume, I get this:

N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [nauseous_archimedes] DSL2 - revision: cc0b1bbdb5
[skipped  ] process > A [100%] 1 of 1, stored: 1 ✔
Staging foreign file: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz
[skipping] Stored process > A

This file is only 1.9 GB, which is big enough to cause a noticeable delay, though not a long one in the grand scheme of things. But our pipeline has a 30 GB input reference file that takes anywhere from 4 to 22 minutes to download depending on IO speeds. Is there any way to skip staging? It seems like an unnecessary step in this context.

Environment

  • Nextflow version: 23.10.1
  • Java version: 11
  • Operating system: Linux
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
@anoronh4
Author

I want to add a further consideration: the pipeline cannot run without internet access, even when the output of a process is already in storeDir. Some of our clusters have limited internet access.

@bentsherman
Member

I think it is because remote file staging happens before checking the storeDir:

// -- map the inputs to a map and use to delegate closure values interpolation
final secondPass = [:]
int count = makeTaskContextStage1(task, secondPass, values)
makeTaskContextStage2(task, secondPass, count)
// verify that `when` guard, when specified, is satisfied
if( !checkWhenGuard(task) )
    return
// -- resolve the task command script
task.resolve(taskBody)
// -- verify if exists a stored result for this case,
// if true skip the execution and return the stored data
if( checkStoredOutput(task) )
    return
def hash = createTaskHashKey(task)
checkCachedOrLaunchTask(task, hash, resumable)

Remote file staging happens in makeTaskContextStage2().

I don't see why we couldn't move the remote file staging to after the storeDir check or even later (i.e. right before submitting the task?)

@bentsherman
Member

@jorgee related to #5727, here is another reason why I think we should just move the foreign file staging to right before submitting the task

@jorgee
Contributor

jorgee commented Feb 5, 2025

I will try to move the staging call to the launch step and check that there are no side effects.

@jorgee jorgee self-assigned this Feb 5, 2025
@jorgee jorgee linked a pull request Feb 6, 2025 that will close this issue
@jorgee
Contributor

jorgee commented Feb 6, 2025

Foreign file staging is currently managed in TaskProcessor.makeTaskContextStage2. It is done in two parts. First, it normalizes the path, checks whether the data is in the staging cache, and creates the FileCopy if a staging copy is required. It also sets the resolved path name as the input parameter. Later in the same function, it performs the actual copy. The copy is required before the hash, because the hash computation uses the file attributes.
The good thing is that the storeDir check happens before the hash, so we can modify makeTaskContextStage2 to return the staging batch and do the copy just before the hash computation. I have created PR #5759 to fix the issue.
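
For readers following along, here is a minimal Groovy sketch of the reordering idea described above. It is not the code from PR #5759: PendingCopy, Task, planStaging and invokeTask are hypothetical stand-ins invented for illustration, and the "download" is simulated by flipping a flag.

```groovy
// Hypothetical stand-ins; none of these are Nextflow's real classes or methods.
class PendingCopy {
    String source
    boolean done = false
    void run() { done = true }   // a real implementation would download here
}

class Task {
    String name
    List<String> remoteInputs
}

// Stage 2 only plans the copies and returns them instead of running them.
List<PendingCopy> planStaging(Task task) {
    task.remoteInputs.collect { new PendingCopy(source: it) }
}

// Returns the copies that were actually executed for this task.
List<PendingCopy> invokeTask(Task task, Set<String> stored) {
    def batch = planStaging(task)
    if( task.name in stored )
        return []                // stored output found: skip before any download
    batch.each { it.run() }      // staging runs only now, just before hashing
    return batch
}

def task = new Task(name: 'A', remoteInputs: ['https://example.org/big.tar.gz'])
assert invokeTask(task, ['A'] as Set).isEmpty()      // stored: nothing staged
assert invokeTask(task, [] as Set)*.done == [true]   // not stored: input staged
```

The point of the sketch is only the ordering: planning the copies is cheap and can stay early, while the expensive copy is deferred until after the stored-output check.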

@jorgee jorgee added the planned label Feb 6, 2025
@jorgee
Contributor

jorgee commented Feb 24, 2025

@anoronh4 I was testing the PR with the example you provided, and I noticed that the output is the same file as the input. To avoid a copy, Nextflow creates the output file in the storeDir as a symlink pointing to the file in the staging directory. So, when you remove the stage directory, you are also removing the file that the storeDir entry points to. In the real case where you had the issue, is the output also defined this way, or is the output file different from the input?
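
The symlink behaviour can be reproduced outside Nextflow. The following Groovy sketch uses only java.nio.file and assumes a filesystem that supports symbolic links (for example Linux); the directory and file names are made up for the demonstration and do not match Nextflow's real layout.

```groovy
import java.nio.file.Files
import java.nio.file.LinkOption

// Throwaway directories standing in for work/stage-* and the storeDir.
def root  = Files.createTempDirectory('storedir-demo')
def stage = Files.createDirectory(root.resolve('stage'))
def store = Files.createDirectory(root.resolve('store'))

// The "staged" input, and a stored output that is only a symlink to it.
def staged = Files.write(stage.resolve('input.tar.gz'), 'data'.bytes)
def stored = Files.createSymbolicLink(store.resolve('input.tar.gz'), staged)
assert Files.exists(stored)

// Deleting the staged file leaves the link entry in place but dangling.
Files.delete(staged)
assert Files.exists(stored, LinkOption.NOFOLLOW_LINKS)  // link entry remains
assert !Files.exists(stored)                            // target is gone
```

Under these assumptions, a stored output that is a link into the stage directory stops resolving as soon as the stage directory is cleaned.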

3 participants