Give processed data of previous pipeline version to pipeline if requested #3648

chaoran-chen · 2025-02-09T20:55:46Z

Some pipeline updates affect only part of the data. For example, if a new pipeline version improves the cleaning of location names, sequence alignments are not affected and do not need to be reprocessed.

To avoid redundant processing, the backend should be able to return the processed data of the previous pipeline version if such data exist and the pipeline requests it. This allows the pipeline to copy over certain data instead of regenerating everything from the original data.
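To make the proposal concrete, here is a minimal sketch of how a pipeline might reuse an unaffected field. Everything in it is hypothetical: `InMemoryBackend`, `previous_processed`, the field names, and the placeholder alignment step are illustrative stand-ins, not the existing backend API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessedEntry:
    pipeline_version: int
    data: dict


class InMemoryBackend:
    """Hypothetical stand-in for the backend; not the real API."""

    def __init__(self) -> None:
        # (accession, pipeline_version) -> ProcessedEntry
        self._processed: dict = {}

    def store(self, accession: str, entry: ProcessedEntry) -> None:
        self._processed[(accession, entry.pipeline_version)] = entry

    def previous_processed(self, accession: str, current_version: int) -> Optional[ProcessedEntry]:
        # Returns the previous version's processed data, or None if there is none.
        return self._processed.get((accession, current_version - 1))


def expensive_alignment(sequence: str) -> str:
    # Placeholder for a compute-heavy step.
    return sequence.upper()


def clean_location(raw: str) -> str:
    # Placeholder for the step that changed between pipeline versions.
    return raw.strip().title()


def process(backend: InMemoryBackend, accession: str, unprocessed: dict,
            version: int, alignment_runs: list) -> ProcessedEntry:
    previous = backend.previous_processed(accession, version)
    if previous is not None and "alignment" in previous.data:
        # Copy the unaffected field instead of regenerating it.
        alignment = previous.data["alignment"]
    else:
        alignment_runs.append(accession)
        alignment = expensive_alignment(unprocessed["sequence"])
    entry = ProcessedEntry(version, {
        "alignment": alignment,
        "location": clean_location(unprocessed["location"]),
    })
    backend.store(accession, entry)
    return entry
```

In this sketch the pipeline decides which fields are safe to copy; the backend only hands back the previous version's data when asked.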

theosanderson · 2025-02-09T21:21:40Z

Thanks for raising the issue. IMO we should think about whether this is something we want to support. The idea of processedData being purely a product of unprocessedData is quite nice, but of course we should weigh any trade-offs that result.

chaoran-chen · 2025-02-09T21:32:19Z

This suggestion came to mind when I was thinking about processing raw reads and having more compute-intensive pipelines. While this feature might not be too relevant for fast pipelines, I imagine it would be important for long-running ones.

theosanderson · 2025-02-09T22:57:48Z

It feels nicer for something like this (avoiding repeating a heavy duty step of the preprocessing pipeline) to be achieved by something like:

- When the preprocessing pipeline makes e.g. BAMs from a FASTQ, it stores them under a name that is a hash of the FASTQ data, salted according to the exact pipeline used to align.
- Before mapping a FASTQ, the preprocessing pipeline first checks whether a BAM with that hash already exists, and if so skips straight to using that file.

If we went with #3344 (comment) we could still broadly achieve that, as long as the preprocessing pipeline had some space in which it could store a pointer from the hash to the final S3 location.
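A rough sketch of the hash-keyed cache described above, assuming a local directory as the store and SHA-256 as the hash. The salt string, the function names, and the placeholder `align` step are all hypothetical; a real pipeline would call its actual aligner and could keep a pointer from the key to an S3 location instead of the file itself.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical salt: should change whenever the aligner or its settings change.
ALIGNER_SALT = "example-aligner-v1|preset=default"


def cache_key(fastq_bytes: bytes, salt: str) -> str:
    # Hash the salt together with the FASTQ content, separated by a NUL byte.
    h = hashlib.sha256()
    h.update(salt.encode("utf-8"))
    h.update(b"\0")
    h.update(fastq_bytes)
    return h.hexdigest()


def align(fastq_bytes: bytes) -> bytes:
    # Placeholder for the heavy mapping step; not a real aligner.
    return b"BAM:" + fastq_bytes


def get_or_create_bam(fastq_bytes: bytes, cache_dir: Path,
                      salt: str = ALIGNER_SALT) -> tuple:
    """Return (path, cache_hit). Aligns only when no cached BAM exists."""
    bam_path = cache_dir / f"{cache_key(fastq_bytes, salt)}.bam"
    if bam_path.exists():
        return bam_path, True
    bam_path.write_bytes(align(fastq_bytes))
    return bam_path, False
```

Because the salt is part of the key, changing the alignment step invalidates old entries automatically, while changes elsewhere in the pipeline (such as location cleaning) keep hitting the cache.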

chaoran-chen · 2025-02-09T23:04:58Z

True – we can also leave this up to the pipeline. Sounds equally good to me. Let's put a "discussion" label and see whether other people have thoughts on this.

chaoran-chen added the backend (related to the loculus backend component) and feature (Feature proposal) labels Feb 9, 2025

chaoran-chen added the discussion (Open questions) label Feb 9, 2025

chaoran-chen mentioned this issue Feb 10, 2025

Raw data sharing #3344

Open
