storeDir does not skip staging of inputs #5468
Comments
I want to add an additional consideration: the pipeline cannot run without internet access even when the output of a process is already in storeDir. Some of our clusters have limited access to the internet.
I think this is because remote file staging happens before the storeDir check: nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy, lines 639 to 657 at commit 0a29236.
Remote file staging happens in the code linked above. I don't see why we couldn't move the remote file staging to after the storeDir check, or even later (i.e. right before submitting the task).
I will try to move the staging call to the launch step and check that there are no side effects.
Foreign file staging is currently managed in TaskProcessor.makeTaskContextStage2. It is done in two parts: first it normalizes the path, checks whether the data is in the staging cache, and creates the FileCopy if a staging copy is required; it also sets the resolved path name as an input parameter. Later in the same function, it performs the real copy. This must happen before the hash is computed, because the hash uses the file attributes.
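As a rough sketch of the ordering described above (this is illustrative pseudocode, not the actual Nextflow source; all names other than makeTaskContextStage2 and FileCopy are hypothetical):

```groovy
// Illustrative pseudocode only -- not the real TaskProcessor implementation.
void makeTaskContextStage2(TaskRun task) {
    // Part 1: resolve remote inputs and plan any staging copies
    for (input in task.remoteInputs) {
        def normalized = normalizePath(input)           // normalize the remote path
        if (!stagingCache.contains(normalized)) {
            task.copies << new FileCopy(normalized)     // schedule a staging copy
        }
        task.setInputParam(normalized)                  // resolved path becomes the input value
    }
    // Part 2: perform the real copies. This runs before the task hash is
    // computed, because the hash uses the staged files' attributes...
    task.copies.each { copy -> copy.execute() }
    computeTaskHash(task)
    // ...and only after this point is storeDir consulted, which is why
    // staging is not skipped even when the stored output already exists.
}
```

The point of the sketch is the ordering: staging (part 2) precedes both hashing and the storeDir check, so a large remote input is downloaded even for a task that will ultimately be skipped.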
@anoronh4 I was testing the PR with the example you provided, and I noticed that the output is the same file as the input. To avoid a copy, Nextflow creates the output file in the storeDir as a symlink pointing to the file in the staging directory. So when you remove the stage directory, you also remove the file in the storeDir. In the real case where you had the issue, do you also use this definition, or is the output file different from the input?
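A minimal sketch of the pattern being described (process and path names are hypothetical): a process that emits its input unchanged, so the storeDir entry ends up as a symlink into the staging directory:

```nextflow
// Hypothetical example: the output is the same file as the input, so
// Nextflow links the storeDir entry to the staged copy instead of copying it.
process PASS_THROUGH {
    storeDir 'results/refs'

    input:
    path ref          // e.g. a large remote reference file

    output:
    path ref          // emitting the input unchanged

    script:
    """
    true
    """
}
```

Under this definition, deleting the staging directory breaks the symlink in results/refs, which matches the behavior described above.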
Bug report
Expected behavior and actual behavior
I am expecting storeDir to help skip processes that have already been performed. However, I am finding that it does not skip staging of the input files when they are remote. If the input file is very large, downstream steps of the pipeline are delayed while downloading something that takes up space on the local disk and isn't necessarily used, because the task is skipped.
Steps to reproduce the problem
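The original reproducer is not preserved in this thread; a minimal sketch of the kind of pipeline that exhibits the behavior (the URL, process name, and command are all illustrative) might look like:

```nextflow
// Hypothetical reproducer: a large remote input feeding a storeDir process.
params.remote = 'https://example.com/large-file.fa'   // illustrative URL

process STORE_IT {
    storeDir 'store'

    input:
    path infile

    output:
    path 'out.txt'

    script:
    """
    md5sum ${infile} > out.txt
    """
}

workflow {
    STORE_IT(Channel.fromPath(params.remote))
}
```

On re-run, the remote input is staged into work/stage-* before the storeDir check, even though out.txt already exists in store/.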
Program output
If I run for the first time, it downloads and performs the task, as expected. If I immediately re-run with `-resume`, it skips the process immediately and the pipeline completes very quickly, also as expected. However, when I remove the folder `work/stage-*` and then try to resume, I get this:

This file is only 1.9 GB, which is big enough to notice a delay but not very long in the grand scheme of things. But in our pipeline we have a 30 GB input reference file that takes anywhere from 4 to 22 minutes to download, depending on IO speeds. Is there any way to skip staging? It seems like an unnecessary step in this context.
Environment