Some pipeline updates only affect parts of the data. For example, if a new pipeline version improves the cleaning of location names, sequence alignments are not affected and do not need to be reprocessed.
To avoid redundant processing, the backend should be able to return the processed data of the previous pipeline version if such data exist and the pipeline requests it. This allows the pipeline to copy over certain data instead of regenerating everything from the original data.
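To make the request concrete, here is a minimal sketch of how a pipeline might use the previous version's output if the backend could return it. Everything named here is an assumption for illustration: the field names (`location`, `sequence`, `alignedSequence`), the `process_entry` function, and the idea that the previous result arrives as an optional dict are hypothetical, not the backend's actual API.

```python
from typing import Callable, Optional


def process_entry(
    unprocessed: dict,
    previous: Optional[dict],
    clean_location: Callable[[str], str],
    align: Callable[[str], str],
) -> dict:
    """Process one entry, reusing the previous alignment when one is available.

    `previous` stands for the processed data of the previous pipeline version,
    or None if the backend has nothing to return (hypothetical interface).
    """
    # The part that changed in the new pipeline version is always recomputed.
    processed = {"location": clean_location(unprocessed["location"])}

    if previous is not None and "alignedSequence" in previous:
        # Unchanged part: copy it over instead of regenerating it.
        processed["alignedSequence"] = previous["alignedSequence"]
    else:
        # No previous result: fall back to the expensive step.
        processed["alignedSequence"] = align(unprocessed["sequence"])
    return processed
```

Copying is only safe if the unprocessed sequence and the alignment step are the same in both versions; the sketch assumes that and does not check it.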
Thanks for raising the issue. I think we should first consider whether this is something we want to support. IMO the idea that processedData is purely a product of unprocessedData is quite nice, but of course we should weigh the trade-offs of giving that up.
This suggestion came to mind when I was thinking about processing raw reads and about more compute-intensive pipelines. The feature might not matter much for fast pipelines, but I imagine it would be important for long-running ones.
It feels nicer for something like this (avoiding repeating a heavy-duty step of the preprocessing pipeline) to be achieved by something like:
- When the preprocessing pipeline makes e.g. BAMs from a FASTQ, it stores them under a name which is a hash of the FASTQ data, with a salt according to the exact pipeline used to align.
- Before mapping a FASTQ, the preprocessing pipeline first checks whether a BAM with that hash already exists, and if so skips straight to using that file.
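The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the cache is a local directory rather than S3, SHA-256 is an arbitrary choice of hash, and `cache_key`, `align_with_cache`, and the `align` callback are hypothetical names, not part of any existing pipeline.

```python
import hashlib
from pathlib import Path
from typing import Callable, Tuple


def cache_key(fastq_bytes: bytes, pipeline_salt: str) -> str:
    """Hash the FASTQ content, salted with an identifier of the exact alignment pipeline."""
    h = hashlib.sha256()
    h.update(pipeline_salt.encode("utf-8"))
    h.update(b"\0")  # separator so salt and data cannot run together
    h.update(fastq_bytes)
    return h.hexdigest()


def align_with_cache(
    fastq_bytes: bytes,
    pipeline_salt: str,
    cache_dir: str,
    align: Callable[[bytes], bytes],
) -> Tuple[Path, bool]:
    """Return (path to BAM, cache_hit). Runs `align` only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    bam_path = cache / f"{cache_key(fastq_bytes, pipeline_salt)}.bam"

    if bam_path.exists():
        # A BAM for this exact input and pipeline already exists: skip mapping.
        return bam_path, True

    # Write to a temporary name first so a crash never leaves a partial BAM
    # under the final, hash-derived name.
    tmp_path = bam_path.with_suffix(".tmp")
    tmp_path.write_bytes(align(fastq_bytes))
    tmp_path.replace(bam_path)
    return bam_path, False
```

Because the salt is part of the key, changing the aligner or its parameters produces a different name, so stale BAMs from an older alignment step are never picked up by mistake.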
If we went with #3344 (comment) we could still broadly achieve that, as long as the preprocessing pipeline had some space in which it could decide to store a pointer from the hash to the final S3 location
True – we can also leave this up to the pipeline. Sounds equally good to me. Let's put a "discussion" label and see whether other people have thoughts on this.