Give processed data of previous pipeline version to pipeline if requested #3648

Open
chaoran-chen opened this issue Feb 9, 2025 · 4 comments
Labels
backend (related to the loculus backend component), discussion (Open questions), feature (Feature proposal)

Comments

@chaoran-chen
Member

chaoran-chen commented Feb 9, 2025

Some pipeline updates affect only part of the data. For example, if a new pipeline version improves the cleaning of location names, sequence alignments are not affected and do not need to be reprocessed.

To avoid redundant processing, the backend should be able to return the processed data of the previous pipeline version if such data exist and the pipeline requests it. This allows the pipeline to copy over certain data instead of regenerating everything from the original data.
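To make the proposal concrete, here is a minimal sketch of how a pipeline might use previous processed data if the backend returned it. Everything in it is an assumption for illustration: the function `process_entry`, the field names `sequence`, `location` and `alignedSequence`, and the `align` / `clean_location` callables are hypothetical and not part of the Loculus backend or pipeline API.

```python
from typing import Callable, Optional


def process_entry(
    unprocessed: dict,
    previous: Optional[dict],
    align: Callable[[str], str],
    clean_location: Callable[[str], str],
) -> dict:
    """Build processed data, copying the alignment from `previous` if present.

    All field names here are hypothetical placeholders.
    """
    if previous is not None and "alignedSequence" in previous:
        # The alignment step did not change in this pipeline version,
        # so copy the old result instead of recomputing it.
        aligned = previous["alignedSequence"]
    else:
        # No previous data available: fall back to the expensive step.
        aligned = align(unprocessed["sequence"])

    return {
        "alignedSequence": aligned,
        # Location cleaning is the part that changed, so always redo it.
        "location": clean_location(unprocessed["location"]),
    }
```

In this sketch the pipeline, not the backend, decides which fields are safe to copy; the backend would only need to hand over the previous version's output on request.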

@chaoran-chen chaoran-chen added backend related to the loculus backend component feature Feature proposal labels Feb 9, 2025
@theosanderson
Member

Thanks for raising the issue. IMO we should think about whether this is something we want to support. The idea of processedData being purely a product of unprocessedData is quite nice, but of course we should consider the trade-offs.

@chaoran-chen
Member Author

This suggestion came to mind while I was thinking about processing raw reads and about more compute-intensive pipelines. While this feature might not be very relevant for fast pipelines, I imagine it would be important for long-running ones.

@theosanderson
Member

theosanderson commented Feb 9, 2025

It feels nicer to achieve this (avoiding repeating a heavy-duty step of the preprocessing pipeline) with something like:

  • when the preprocessing pipeline makes e.g. BAMs from a FASTQ it stores them under a name which is a hash of the FASTQ data, with a salt according to the exact pipeline used to align
  • before mapping a FASTQ, the preprocessing pipeline first checks if a BAM with that hash already exists, and if so skips to just using that file
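The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: `cache_key`, `get_or_create_bam`, the `aligner_id` string used as the salt, and the local `cache_dir` are hypothetical names, and `align` stands in for whatever the real mapping step is.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(fastq_bytes: bytes, aligner_id: str) -> str:
    """Hash the FASTQ content, salted with an identifier of the exact aligner setup."""
    digest = hashlib.sha256()
    digest.update(aligner_id.encode("utf-8"))
    digest.update(b"\0")  # separator so salt and data cannot run together
    digest.update(fastq_bytes)
    return digest.hexdigest()


def get_or_create_bam(
    fastq_bytes: bytes,
    aligner_id: str,
    cache_dir: Path,
    align: Callable[[bytes], bytes],
) -> Path:
    """Return the cached BAM for this input, running `align` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    bam_path = cache_dir / f"{cache_key(fastq_bytes, aligner_id)}.bam"
    if not bam_path.exists():
        # Write to a temporary name first, then rename, so a crashed run
        # does not leave a half-written file under the final name.
        tmp_path = bam_path.with_suffix(".tmp")
        tmp_path.write_bytes(align(fastq_bytes))
        tmp_path.replace(bam_path)
    return bam_path
```

Because the salt is part of the key, changing the aligner version or its parameters yields a different file name, so stale BAMs are never reused by accident.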

If we went with #3344 (comment), we could still broadly achieve that, as long as the preprocessing pipeline had some space in which to store a pointer from the hash to the final S3 location.
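A pointer store of that kind could be as small as a key-value mapping owned by the pipeline. The sketch below assumes a JSON file as the "space"; the class `PointerIndex`, its methods, and the example S3 URI are all hypothetical, and nothing here reflects what #3344 actually specifies.

```python
import json
from pathlib import Path
from typing import Optional


class PointerIndex:
    """Maps a content hash to the final storage location (hypothetical sketch)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        # An index file that does not exist yet is treated as empty.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def lookup(self, key: str) -> Optional[str]:
        """Return the stored location for `key`, or None if it is unknown."""
        return self._load().get(key)

    def record(self, key: str, location: str) -> None:
        """Store or overwrite the location for `key`."""
        entries = self._load()
        entries[key] = location
        self.path.write_text(json.dumps(entries, indent=2))
```

Usage would be `index.lookup(key)` before mapping, and `index.record(key, "s3://example-bucket/some/object.bam")` (a made-up URI) once the file has been uploaded.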

@chaoran-chen
Member Author

True – we can also leave this up to the pipeline. Sounds equally good to me. Let's put a "discussion" label and see whether other people have thoughts on this.

@chaoran-chen chaoran-chen added the discussion Open questions label Feb 9, 2025