Raw data / files sharing #3664

Draft: wants to merge 2 commits into main
Conversation

chaoran-chen (Member) commented Feb 10, 2025

This is a rough sketch of the implementation of raw data/files sharing. Based on feedback, further thought, and a look through the code, I have refined the proposal that I posted in #3344 (comment).

The basic flow (a client-side sketch follows the list):

  1. To submit files (which may contain raw reads), the user's client calls /files/request-uploads, providing the number of files it wants to upload.
  2. The backend generates S3 pre-signed URLs to upload files to <private bucket>/files_uploads/<file id> where the file id is a UUID. It sends the pre-signed URLs to the client.
  3. The client uploads the files to S3.
  4. The client calls /files/confirm-uploads with the IDs of the uploaded files.
  5. The backend checks that the files exist and moves them to <private bucket>/files/<file id>.
  6. The client calls /submit and provides the file ID for each entry in the file column of the metadata file.
  7. The backend checks that all mentioned files exist (i.e., uploaded and confirmed) and belong to the same group.
  8. The pipeline asks the backend for unprocessed entries.
  9. The backend creates pre-signed URLs to download the original files and sends them together with the metadata and FASTA to the pipeline.
  10. The pipeline calls /files/request-uploads to get pre-signed URLs for the processed files, uploads them, and calls /files/confirm-uploads to confirm the uploads.
  11. The pipeline calls /submit-processed-data with the processed metadata, consensus sequences, and associated file IDs.
  12. The backend checks that all files exist.
  13. The user can review the processed data. The backend creates pre-signed URLs to download the processed files.
  14. The user approves.
  15. The SILO importer calls /get-released-data.
  16. The backend fetches the released data from the database. It copies all associated processed files to the public bucket if they are not there yet. It then includes the public URLs in the response.
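
To make the intended client interaction concrete, here is a minimal sketch in Python with requests. The endpoint paths are the ones from the list above; the request parameter, response shape, and authentication header are assumptions for illustration only, not a finalized API:

```python
import requests

API = "https://backend.example.org"  # hypothetical backend URL
HEADERS = {"Authorization": "Bearer <token>"}


def upload_files(paths):
    """Steps 1-4: request pre-signed URLs, upload to S3, confirm."""
    # 1. Ask the backend for one pre-signed upload URL per file.
    resp = requests.post(
        f"{API}/files/request-uploads",
        params={"numberFiles": len(paths)},  # parameter name is an assumption
        headers=HEADERS,
    )
    resp.raise_for_status()
    uploads = resp.json()  # assumed shape: [{"fileId": ..., "url": ...}, ...]

    # 3. Upload each file directly to S3 via its pre-signed URL.
    for path, upload in zip(paths, uploads):
        with open(path, "rb") as f:
            requests.put(upload["url"], data=f).raise_for_status()

    # 4. Tell the backend which uploads are complete.
    file_ids = [u["fileId"] for u in uploads]
    requests.post(
        f"{API}/files/confirm-uploads",
        json={"fileIds": file_ids},
        headers=HEADERS,
    ).raise_for_status()

    # 6. These IDs then go into the "file" column of the metadata passed to /submit.
    return file_ids
```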

Some notes:

  • Each file is owned by a group. The backend ensures that files are only linked with sequence entries of the same group.
  • A file can be reused: it is possible to link to the same file from different sequence entries, versions, processed files, etc. That means that a user can revise the metadata without re-uploading the files.
  • By moving the files from /files_uploads to /files, we ensure that the file cannot be edited afterward. This is needed because it is not possible to invalidate a pre-signed URL.
  • We have to implement a garbage collector to delete unused/unlinked files (see the sketch below).
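
A very rough sketch of what such a garbage collector could look like (Python/boto3; the bucket name, the db helper methods, and the grace period are all assumptions, none of this exists yet):

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def collect_garbage(db, min_age_hours: int = 24):
    """Delete files that were uploaded but never linked to any sequence entry.

    db.find_unlinked_file_ids is a hypothetical query returning IDs of files
    older than the cutoff that no sequence entry (or processed entry) references.
    The grace period avoids deleting files that are still mid-submission.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=min_age_hours)
    for file_id in db.find_unlinked_file_ids(older_than=cutoff):
        # Delete both possible locations; deleting a non-existent key is a no-op in S3.
        for prefix in ("files", "files_uploads"):
            s3.delete_object(Bucket=PRIVATE_BUCKET, Key=f"{prefix}/{file_id}")
        db.delete_file_record(file_id)  # hypothetical bookkeeping
```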

Member

An overarching question I'd have is whether we should think about implementing hashing as a part of this workflow. So in

The backend checks that the files exist and moves them to <private bucket>/files/<file id>.

we could instead:

The backend checks that the files exist and moves them to <private bucket>/files/[hash of file contents] and stores the hash with the file id.

That way, if users uploaded a second version with an identical file, even if the pipeline hadn't implemented any special logic, we wouldn't store it twice.
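
To make the suggestion concrete, a sketch of what the confirm step could do (Python/boto3; the bucket name and the db helpers are assumptions, and note the hash computation requires streaming the object through the backend):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def confirm_upload(db, file_id: str):
    """Move an uploaded file to a content-addressed key and record its hash."""
    upload_key = f"files_uploads/{file_id}"

    # Hashing means the backend has to read the whole object once.
    obj = s3.get_object(Bucket=PRIVATE_BUCKET, Key=upload_key)
    digest = hashlib.sha256()
    for chunk in obj["Body"].iter_chunks(chunk_size=1024 * 1024):
        digest.update(chunk)
    content_hash = digest.hexdigest()

    # Only store the content once, however many file IDs point at it.
    target_key = f"files/{content_hash}"
    if not db.hash_already_stored(content_hash):  # hypothetical lookup
        s3.copy_object(
            Bucket=PRIVATE_BUCKET,
            Key=target_key,
            CopySource={"Bucket": PRIVATE_BUCKET, "Key": upload_key},
        )
    s3.delete_object(Bucket=PRIVATE_BUCKET, Key=upload_key)
    db.store_file_hash(file_id, content_hash)  # hypothetical insert
```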

chaoran-chen (Member, Author)

Interesting idea! This would need more computation and require the backend to fetch the file. A simple move within a bucket (if I understand correctly, but not 100% sure) is just a server-side copy and delete and costs next to nothing. But if users do tend to upload the same file, hashing would indeed save storage. I'm not certain what's better.
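
For reference, the move in the current proposal never touches the file contents: S3 has no rename operation, but a server-side copy followed by a delete within the same bucket does the job without the backend downloading anything. Roughly (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def move_to_files(file_id: str):
    """Server-side 'move': copy to the immutable location, then delete the upload key."""
    s3.copy_object(
        Bucket=PRIVATE_BUCKET,
        Key=f"files/{file_id}",
        CopySource={"Bucket": PRIVATE_BUCKET, "Key": f"files_uploads/{file_id}"},
    )
    s3.delete_object(Bucket=PRIVATE_BUCKET, Key=f"files_uploads/{file_id}")
```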

Member

I asked ChatGPT and got pointed to https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html, which is apparently an opt-in feature. But we'd want to check how widely it is supported by other cloud providers etc.
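
With that feature, the backend could read the checksum S3 records at upload time instead of hashing the file itself. A sketch with boto3; this only returns a value if the client's upload actually supplied a SHA-256 checksum, and support outside AWS S3 would need checking:

```python
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def stored_sha256(file_id: str) -> str | None:
    """Return the SHA-256 checksum S3 recorded for an uploaded file, if any."""
    head = s3.head_object(
        Bucket=PRIVATE_BUCKET,
        Key=f"files_uploads/{file_id}",
        ChecksumMode="ENABLED",
    )
    # Present only if the upload was made with a SHA-256 checksum algorithm.
    return head.get("ChecksumSHA256")
```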
