Description
Bug report
Expected behavior and actual behavior
I expect storeDir to skip processes whose outputs have already been produced. However, I am finding that it does not skip staging of the input files when they are remote. If the input file is very large, downloading it delays the downstream steps of the pipeline and takes up space on the local disk, even though the file isn't actually used, because the task itself is skipped.
Steps to reproduce the problem
process A {
    cache true
    storeDir "storeDir"

    input:
    tuple val(meta), path(x)

    output:
    path(x)

    script:
    """
    echo hi
    """
}
workflow {
    remotepath = "https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz"
    A([[:], remotepath])
}
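For completeness, the command sequence I use (script name as in the launch log below) is roughly:

nextflow run main.nf             # first run: downloads the remote file and runs the task
nextflow run main.nf -resume     # immediate re-run: process skipped, pipeline finishes quickly
rm -rf work/stage-*              # remove the staging directory
nextflow run main.nf -resume     # process is skipped, but the remote file is staged again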
Program output
On the first run it downloads the file and performs the task, as expected. If I immediately re-run with -resume, it skips the process right away and the pipeline completes very quickly, also as expected. However, when I remove the work/stage-* folder and then resume, I get this:
N E X T F L O W ~ version 23.10.1
Launching `main.nf` [nauseous_archimedes] DSL2 - revision: cc0b1bbdb5
[skipped ] process > A [100%] 1 of 1, stored: 1 ✔
Staging foreign file: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz
[skipping] Stored process > A
This file is only 1.9 GB, which is big enough to notice the delay but not very long in the grand scheme of things. In our pipeline, however, we have a 30 GB input reference file that takes anywhere from 4 to 22 minutes to download depending on I/O speed. Is there any way to skip staging? It seems like an unnecessary step in this context.
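To illustrate the kind of guard I can write by hand today (a sketch only, untested; the storeDir layout and the stored file name are assumptions based on the example above), the workflow could check the store before touching the remote file at all:

workflow {
    remotepath = "https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz"
    // Assumed: the stored output keeps the input file's name under storeDir
    stored = file("storeDir/Mouse_GRCm39_M31_CTAT_lib_Nov092022.source.tar.gz")
    result = stored.exists()
        ? Channel.value(stored)    // reuse the stored copy, no foreign-file staging
        : A([[:], remotepath])     // otherwise stage, run, and store as usual
}

But this just duplicates by hand what storeDir is supposed to do, so native support for skipping staging of inputs to stored tasks would be much preferable.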
Environment
- Nextflow version: 23.10.1
- Java version: 11
- Operating system: Linux
- Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)