Documentation
anngvu committed Feb 14, 2025
1 parent a14a1a3 commit 3690afc
Showing 2 changed files with 31 additions and 15 deletions.
12 changes: 12 additions & 0 deletions man/make_case_list_maf.Rd


34 changes: 19 additions & 15 deletions vignettes/bringing-portal-data-to-other-platforms-cbioportal.Rmd
@@ -14,12 +14,17 @@ knitr::opts_chunk$set(
)
```

**Document Status:** Working
**Estimated Reading Time:** 8 min

## Special acknowledgments

Utils demonstrated in this vignette benefited greatly from code originally written by [hhunterzinck](https://github.com/hhunterzinck).

## Important note

The requirements for cBioPortal change, just like with any software or database.
The package is updated to keep pace on a yearly submission cycle, but there may be occasional points in time when the workflow is out of date with this external system.

## Intro

@@ -47,7 +52,10 @@ syn_login()
## Create a new study dataset

First create the study dataset "package" where we can put together the data.
Each study dataset combines multiple data types -- clinical, gene expression, gene variants, etc.
The study meta can be edited after the file has been created.
This will also set the working directory to the new study directory.


```{r cbp_new_study, eval=FALSE}
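# A minimal sketch -- the argument names below are illustrative assumptions rather
# than the exact signature; check ?cbp_new_study in nfportalutils for current usage.
# This sets up the study "package" directory with its study meta and makes it the
# working directory for the steps that follow.
cbp_new_study(cancer_study_identifier = "npst_nfosi_ntap_2022",
              name = "NF Study Dataset (NF-OSI, 2022)",
              citation = "TBD")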
```

@@ -64,15 +72,15 @@ These functions download data files and create the meta for them.

Note that:

- These should be run with the working directory set to the study directory as set up above to ensure consistent metadata.
- **Defaults are for known NF-OSI processed data outputs**.
- If these defaults don't apply to your scenario, take a look at the lower-level utils `make_meta_*` or edit the generated files manually afterwards (see the example meta file below).
- Data types can vary in how much additional work is needed in remapping, reformatting, custom sanity checks, etc.
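
For orientation, each meta file written by these utils is a small key-value text file that cBioPortal reads alongside the corresponding data file. A mutations meta, for example, looks roughly like the following (values are illustrative, not necessarily the package defaults):

```
cancer_study_identifier: npst_nfosi_ntap_2022
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Somatic mutations from the merged MAF
data_filename: data_mutations.txt
```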

### Add mutations data

- `maf_data` references a final merged maf output file from the NF-OSI processing pipeline (vcf2maf) that is OK for public release.
- Under the hood, a required case list file is also generated.

```{r add_maf, eval=FALSE}
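# Minimal sketch with assumed inputs -- see ?cbp_add_maf for the exact interface.
# maf_data points to the merged MAF released on Synapse; the util downloads it,
# writes the mutations data and meta files, and generates the required case list.
maf_data <- "syn00000000"  # placeholder Synapse id for the merged MAF file
cbp_add_maf(maf_data)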
```

@@ -109,10 +117,8 @@ cbp_add_expression(mrna_data,

### Add clinical data

- Clinical data **should be added last**, after all other data has been added, for sample checks to work properly.
- `clinical_data` is prepared from an existing Synapse table. The table can be a subsetted version of those released in the study dataset, or you can pass in a query that selects the subset. For example, suppose the full clinical cohort comprises patients 1-50, but the dataset can only release expression data for patients 1-20 and cna data for patients 15-30. Here, `clinical_data` can be a smaller table of just those patients 1-30, or it can be the original table with a suitable additional filter passed in, e.g. `where release = 'batch1'`.
- Clinical data requires mapping to be as consistent with other public datasets as possible. `ref_map` defines the mapping of clinical variables from the NF-OSI data dictionary to cBioPortal's. Only variables in the mapping are exported to cBioPortal. Follow the link below to inspect the default file and format used.

```{r add_clinical, eval=FALSE}
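# Illustrative setup (assumed values -- substitute your own). clinical_data can be a
# Synapse table id or a query selecting the releasable subset, and ref_map points to
# the default mapping file and format referenced in the notes above.
clinical_data <- "select * from syn00000000 where release = 'batch1'"  # placeholder query
ref_map <- "path/to/cbp_ref_map.yaml"  # placeholder path to the NF-OSI -> cBioPortal mapping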
cbp_add_clinical(clinical_data, ref_map)
```

@@ -124,15 +130,13 @@

## Validation

Validation has to be done with a cBioPortal instance. Each portal may have specific configurations (such as genomic reference) to validate against.

For an example of a simple *offline* validation, assuming you are at `~/datahub/public` and a study folder called `npst_nfosi_ntap_2022` has been placed into it, mount the dataset into the container and run validation like:
```
STUDY=npst_nfosi_ntap_2022
sudo docker run --rm -v $(pwd):/datahub cbioportal/cbioportal:6.0.25 validateData.py -s /datahub/$STUDY -n -v
```

**See the [general docs for dataset validation](https://docs.cbioportal.org/using-the-dataset-validator) for more examples.**
