Raw data / files sharing #3664

Draft: wants to merge 2 commits into main
Conversation

chaoran-chen (Member) commented Feb 10, 2025

This is a rough sketch of the implementation of raw data/files sharing. Based on feedback, further thought, and a look through the code, I have refined the proposal that I posted in #3344 (comment).

The basic flow (a client-side sketch follows the list):

  1. To submit files (which may contain raw reads), the user's client calls /files/request-uploads, providing the number of files it wants to upload.
  2. The backend generates S3 pre-signed URLs to upload files to <private bucket>/files_uploads/<file id> where the file id is a UUID. It sends the pre-signed URLs to the client.
  3. The client uploads the files to S3.
  4. The client calls /files/confirm-uploads with the IDs of the uploaded files.
  5. The backend checks that the files exist and moves them to <private bucket>/files/<file id>.
  6. The client calls /submit and provides the file ID for each entry in the file column of the metadata file.
  7. The backend checks that all mentioned files exist (i.e., uploaded and confirmed) and belong to the same group.
  8. The pipeline asks the backend for unprocessed entries.
  9. The backend creates pre-signed URLs to download the original files and sends them together with the metadata and FASTA to the pipeline.
  10. The pipeline calls /files/request-uploads to get pre-signed URLs for the processed files, uploads them, and calls /files/confirm-uploads to confirm the uploads.
  11. The pipeline calls /submit-processed-data with the processed metadata, consensus sequences, and associated file IDs.
  12. The backend checks that all files exist.
  13. The user can review the processed data. The backend creates pre-signed URLs to download the processed files.
  14. The user approves.
  15. The SILO importer calls /get-released-data.
  16. The backend fetches the released data from the database. It copies all associated processed files to the public bucket if they are not there yet. It then includes the public URLs in the response.
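
To make the intended client interaction concrete, here is a minimal sketch in Python with requests. The endpoint paths are the ones from the list above; the request parameter, response shape, and authentication header are assumptions for illustration only, not a finalized API:

```python
import requests

API = "https://backend.example.org"  # hypothetical backend URL
HEADERS = {"Authorization": "Bearer <token>"}


def upload_files(paths):
    """Steps 1-4: request pre-signed URLs, upload to S3, confirm."""
    # 1. Ask the backend for one pre-signed upload URL per file.
    resp = requests.post(
        f"{API}/files/request-uploads",
        params={"numberFiles": len(paths)},  # parameter name is an assumption
        headers=HEADERS,
    )
    resp.raise_for_status()
    uploads = resp.json()  # assumed shape: [{"fileId": ..., "url": ...}, ...]

    # 3. Upload each file directly to S3 via its pre-signed URL.
    for path, upload in zip(paths, uploads):
        with open(path, "rb") as f:
            requests.put(upload["url"], data=f).raise_for_status()

    # 4. Tell the backend which uploads are complete.
    file_ids = [u["fileId"] for u in uploads]
    requests.post(
        f"{API}/files/confirm-uploads",
        json={"fileIds": file_ids},
        headers=HEADERS,
    ).raise_for_status()

    # 6. These IDs then go into the "file" column of the metadata passed to /submit.
    return file_ids
```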

Some notes:

  • Each file is owned by a group. The backend ensures that files are only linked with sequence entries of the same group.
  • A file can be reused: it is possible to link to the same file from different sequence entries, versions, processed files, etc. That means that a user can revise the metadata without re-uploading the files.
  • By moving the files from /files_uploads to /files, we ensure that the file cannot be edited afterward. This is needed because it is not possible to invalidate a pre-signed URL.
  • We have to implement a garbage collector to delete unused/unlinked files (see the sketch below).
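
A very rough sketch of what such a garbage collector could look like (Python/boto3; the bucket name, the db helper methods, and the grace period are all assumptions, none of this exists yet):

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def collect_garbage(db, min_age_hours: int = 24):
    """Delete files that were uploaded but never linked to any sequence entry.

    db.find_unlinked_file_ids is a hypothetical query returning IDs of files
    older than the cutoff that no sequence entry (or processed entry) references.
    The grace period avoids deleting files that are still mid-submission.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=min_age_hours)
    for file_id in db.find_unlinked_file_ids(older_than=cutoff):
        # Delete both possible locations; deleting a non-existent key is a no-op in S3.
        for prefix in ("files", "files_uploads"):
            s3.delete_object(Bucket=PRIVATE_BUCKET, Key=f"{prefix}/{file_id}")
        db.delete_file_record(file_id)  # hypothetical bookkeeping
```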

Member

An overarching question I'd have is whether we should think about implementing hashing as a part of this workflow. So in

The backend checks that the files exist and moves them to <private bucket>/files/<file id>.

we could instead:

The backend checks that the files exist and moves them to <private bucket>/files/[hash of file contents] and stores the hash with the file id.

That way, if users uploaded a second version with an identical file, even if the pipeline hadn't implemented any special logic, we wouldn't store it twice.
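
To make the suggestion concrete, a sketch of what the confirm step could do (Python/boto3; the bucket name and the db helpers are assumptions, and note the hash computation requires streaming the object through the backend):

```python
import hashlib
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def confirm_upload(db, file_id: str):
    """Move an uploaded file to a content-addressed key and record its hash."""
    upload_key = f"files_uploads/{file_id}"

    # Hashing means the backend has to read the whole object once.
    obj = s3.get_object(Bucket=PRIVATE_BUCKET, Key=upload_key)
    digest = hashlib.sha256()
    for chunk in obj["Body"].iter_chunks(chunk_size=1024 * 1024):
        digest.update(chunk)
    content_hash = digest.hexdigest()

    # Only store the content once, however many file IDs point at it.
    target_key = f"files/{content_hash}"
    if not db.hash_already_stored(content_hash):  # hypothetical lookup
        s3.copy_object(
            Bucket=PRIVATE_BUCKET,
            Key=target_key,
            CopySource={"Bucket": PRIVATE_BUCKET, "Key": upload_key},
        )
    s3.delete_object(Bucket=PRIVATE_BUCKET, Key=upload_key)
    db.store_file_hash(file_id, content_hash)  # hypothetical insert
```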

chaoran-chen (Member, Author)

Interesting idea! This would need more computation and require the backend to fetch the file. A simple move within a bucket (if I understand correctly, but not 100% sure) is just a server-side copy and delete and costs next to nothing. But if users do tend to upload the same file, hashing would indeed save storage. I'm not certain what's better.
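
For reference, the move in the current proposal never touches the file contents: S3 has no rename operation, but a server-side copy followed by a delete within the same bucket does the job without the backend downloading anything. Roughly (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def move_to_files(file_id: str):
    """Server-side 'move': copy to the immutable location, then delete the upload key."""
    s3.copy_object(
        Bucket=PRIVATE_BUCKET,
        Key=f"files/{file_id}",
        CopySource={"Bucket": PRIVATE_BUCKET, "Key": f"files_uploads/{file_id}"},
    )
    s3.delete_object(Bucket=PRIVATE_BUCKET, Key=f"files_uploads/{file_id}")
```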

Member

I asked ChatGPT and got pointed to https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html, which is apparently an opt-in feature. But we'd want to check how widely it is supported by other cloud providers etc.
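
With that feature, the backend could read the checksum S3 records at upload time instead of hashing the file itself. A sketch with boto3; this only returns a value if the client's upload actually supplied a SHA-256 checksum, and support outside AWS S3 would need checking:

```python
import boto3

s3 = boto3.client("s3")
PRIVATE_BUCKET = "private-bucket"  # placeholder name


def stored_sha256(file_id: str) -> str | None:
    """Return the SHA-256 checksum S3 recorded for an uploaded file, if any."""
    head = s3.head_object(
        Bucket=PRIVATE_BUCKET,
        Key=f"files_uploads/{file_id}",
        ChecksumMode="ENABLED",
    )
    # Present only if the upload was made with a SHA-256 checksum algorithm.
    return head.get("ChecksumSHA256")
```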
