| 1 | +--- |
| 2 | +title: "Add data" |
| 3 | +description: "An introduction to adding pipeline input data in Seqera Platform" |
| 4 | +date: "21 Jul 2024" |
| 5 | +tags: [platform, data, data explorer, datasets] |
| 6 | +--- |
| 7 | + |
| 8 | +import Tabs from '@theme/Tabs'; |
| 9 | +import TabItem from '@theme/TabItem'; |
| 10 | + |
| 11 | +Most bioinformatics pipelines require an input of some sort. This is typically a samplesheet where each row consists of a sample, the location of files for that sample (such as FASTQ files), and other sample details. Reliable shared access to pipeline input data is crucial to simplify data management, minimize user data-input errors, and facilitate reproducible workflows. |
| 12 | + |
| 13 | +In Platform, samplesheets and other data can be made easily accessible in one of two ways: |
| 14 | +- Use **Data Explorer** to browse and interact with remote data from AWS S3, Azure Blob Storage, and Google Cloud Storage repositories, directly in your organization workspace. |
| 15 | +- Use **Datasets** to upload structured data to your workspace in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format. |
| 16 | + |
| 17 | +## Data Explorer |
| 18 | + |
| 19 | +For pipeline runs in the cloud, users typically need access to buckets or blob storage to upload files (such as samplesheets and reference data) for analysis and to view pipeline results. Managing credentials and permissions for multiple users, and training users to navigate cloud consoles and CLIs, can be complicated. Data Explorer provides a simpler alternative: viewing your data directly in Platform. |
| 20 | + |
| 21 | +### Add a cloud bucket |
| 22 | + |
| 23 | +Private cloud storage buckets accessible by the [credentials](../../credentials/overview.mdx) in your workspace are added to Data Explorer by default. You can also add custom directory paths within buckets to your workspace to simplify direct access. |
| 24 | + |
| 25 | +To add individual buckets (or directory paths within buckets): |
| 26 | + |
| 27 | +1. From the **Data Explorer** tab, select **Add cloud bucket**. |
| 28 | +1. Specify the bucket details: |
| 29 | + - The cloud **Provider**. |
| 30 | + - An existing cloud **Bucket path**. |
| 31 | + - A unique **Name** for the bucket. |
| 32 | + - The **Credentials** used to access the bucket. For public cloud buckets, select **Public** from the dropdown menu. |
| 33 | + - An optional bucket **Description**. |
| 34 | +1. Select **Add**. |
| 35 | + |
| 36 | +  |
| 37 | + |
| 38 | +You can now use this data in your analysis without the need to interact with cloud consoles or CLI tools. |
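
The **Provider** and **Bucket path** fields above correspond to the parts of a cloud storage URI. The following is a minimal sketch of that relationship, not Platform's own validation logic. The `split_bucket_path` helper, the example bucket name, and the scheme-to-provider mapping (`s3://`, `az://`, `gs://`) are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed mapping of URI schemes to the providers named in this guide.
PROVIDERS = {
    "s3": "AWS S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
}


def split_bucket_path(path):
    """Split a bucket path into provider, bucket name, and directory prefix."""
    parsed = urlparse(path)
    if parsed.scheme not in PROVIDERS:
        raise ValueError(f"Unsupported or missing scheme in {path!r}")
    if not parsed.netloc:
        raise ValueError(f"No bucket name found in {path!r}")
    return {
        "provider": PROVIDERS[parsed.scheme],
        "bucket": parsed.netloc,
        # The prefix is the directory path within the bucket, if any.
        "prefix": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    # "example-bucket" is a hypothetical bucket name.
    print(split_bucket_path("s3://example-bucket/data/run1/"))
```

A check like this can catch a missing scheme or bucket name before you paste the path into the **Bucket path** field.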
| 39 | + |
| 40 | +#### Public data sources |
| 41 | + |
| 42 | +Select **Public** from the credentials dropdown menu to add public cloud storage buckets from resources such as: |
| 43 | + |
| 44 | +- [The Cancer Genome Atlas (TCGA)](https://registry.opendata.aws/tcga/) |
| 45 | +- [1000 Genomes Project](https://registry.opendata.aws/1000-genomes/) |
| 46 | +- [NCBI SRA](https://registry.opendata.aws/ncbi-sra/) |
| 47 | +- [Genome in a Bottle Consortium](https://docs.opendata.aws/giab/readme.html) |
| 48 | +- [MSSNG Database](https://cloud.google.com/life-sciences/docs/resources/public-datasets/mssng) |
| 49 | +- [Genome Aggregation Database (gnomAD)](https://cloud.google.com/life-sciences/docs/resources/public-datasets/gnomad) |
| 50 | + |
| 51 | +### View pipeline outputs |
| 52 | + |
| 53 | +In Data Explorer, you can: |
| 54 | + |
| 55 | + - **View bucket details**: |
| 56 | + Select the information icon next to a bucket in the list to view the cloud provider, bucket address, and credentials. |
| 57 | + |
| 58 | +  |
| 59 | + |
| 60 | + - **View bucket contents**: |
| 61 | + Select a bucket name from the list to view the bucket contents. The file type, size, and path of objects are displayed in columns next to the object name. For example, view the outputs of your [nf-core/rnaseq](./comm-showcase.mdx#launch-the-nf-corernaseq-pipeline) run: |
| 62 | + |
| 63 | +  |
| 64 | + |
| 65 | + - **Preview files**: |
| 66 | + Select a file to open a preview window that includes a **Download** button. For example, view the resultant gene counts of the salmon quantification step of your [nf-core/rnaseq](./comm-showcase.mdx#launch-the-nf-corernaseq-pipeline) run: |
| 67 | + |
| 68 | +  |
| 69 | + |
| 70 | +## Datasets |
| 71 | + |
| 72 | +Datasets in Platform are CSV (comma-separated values) and TSV (tab-separated values) files stored in a workspace. You can select stored datasets as input data when launching a pipeline. |
| 73 | + |
| 74 | +<details> |
| 75 | + <summary>**Example: nf-core/rnaseq test samplesheet**</summary> |
| 76 | + |
| 77 | + The [nf-core/rnaseq](https://github.com/nf-core/rnaseq) pipeline works with input datasets (samplesheets) containing sample names, FASTQ file locations, and indications of strandedness. The Seqera Community Showcase sample dataset for nf-core/rnaseq specifies the paths to 7 small sub-sampled FASTQ files from a yeast RNAseq dataset: |
| 78 | + |
| 79 | + **Example nf-core/rnaseq dataset** |
| 80 | + |
| 81 | + | sample | fastq_1 | fastq_2 | strandedness | |
| 82 | + | ------------------- | ------------------------------------ | ------------------------------------ | ------------ | |
| 83 | + | WT_REP1 | s3://nf-core-awsmegatests/rnaseq/... | s3://nf-core-awsmegatests/rnaseq/... | reverse | |
| 84 | + | WT_REP1 | s3://nf-core-awsmegatests/rnaseq/... | s3://nf-core-awsmegatests/rnaseq/... | reverse | |
| 85 | + | WT_REP2 | s3://nf-core-awsmegatests/rnaseq/... | s3://nf-core-awsmegatests/rnaseq/... | reverse | |
| 86 | + | RAP1_UNINDUCED_REP1 | s3://nf-core-awsmegatests/rnaseq/... | | reverse | |
| 87 | + | RAP1_UNINDUCED_REP2 | s3://nf-core-awsmegatests/rnaseq/... | | reverse | |
| 88 | + | RAP1_UNINDUCED_REP2 | s3://nf-core-awsmegatests/rnaseq/... | | reverse | |
| 89 | + | RAP1_IAA_30M_REP1 | s3://nf-core-awsmegatests/rnaseq/... | s3://nf-core-awsmegatests/rnaseq/... | reverse | |
| 90 | + |
| 91 | +</details> |
| 92 | + |
| 93 | +Download the nf-core/rnaseq [samplesheet_test.csv](samplesheet_test.csv). |
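
Before uploading a samplesheet as a dataset, it can help to confirm that it has the columns shown in the example above. The following is a rough sketch using Python's standard `csv` module. The `check_samplesheet` helper, the `s3://example-bucket/...` paths, and the rule that `sample` and `fastq_1` must be non-empty are assumptions for illustration, not the pipeline's own validation.

```python
import csv
import io

# Column names taken from the example nf-core/rnaseq dataset above.
REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def check_samplesheet(text, delimiter=","):
    """Return a list of problems found in samplesheet text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # Data rows start on line 2 because the header occupies line 1.
    for line_no, row in enumerate(reader, start=2):
        if not row["sample"]:
            problems.append(f"line {line_no}: empty sample name")
        # Assumed rule: fastq_2 may be empty (single-end), fastq_1 may not.
        if not row["fastq_1"]:
            problems.append(f"line {line_no}: fastq_1 is required")
    return problems


# Hypothetical file locations; the real test samplesheet uses different paths.
EXAMPLE = """sample,fastq_1,fastq_2,strandedness
WT_REP1,s3://example-bucket/wt_rep1_1.fastq.gz,s3://example-bucket/wt_rep1_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,s3://example-bucket/rap1_rep1.fastq.gz,,reverse
"""

if __name__ == "__main__":
    print(check_samplesheet(EXAMPLE) or "no problems found")
```

For a TSV file, pass `delimiter="\t"` to the same function.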
| 94 | + |
| 95 | +### Add a dataset |
| 96 | + |
| 97 | +From the **Datasets** tab, select **Add Dataset**. |
| 98 | + |
| 99 | + |
| 100 | + |
| 101 | +Specify the following dataset details: |
| 102 | + |
| 103 | +- A **Name** for the dataset, such as `nf-core-rnaseq-test-dataset`. |
| 104 | +- A **Description** for the dataset. |
| 105 | +- Select the **First row as header** option to prevent Platform from parsing the header row of the samplesheet as sample data. |
| 106 | +- Select **Upload file** and browse to your CSV or TSV file in local storage, or simply drag and drop it into the box. |
| 107 | + |
| 108 | +Notice that the file locations in the nf-core/rnaseq example dataset point to paths on S3. These could also be paths on a shared filesystem, if you use an HPC compute environment. Nextflow uses these paths to stage the files into the task working directory. |
| 109 | + |
| 110 | +:::info |
| 111 | +Platform does not store the data used for analysis in pipelines. Datasets must instead provide the locations of data stored on your own infrastructure. |
| 112 | +::: |