diff --git a/docs/yaml_docs/spatial_deconvolution.md b/docs/yaml_docs/spatial_deconvolution.md index 1e3fef01..61e1ffa2 100644 --- a/docs/yaml_docs/spatial_deconvolution.md +++ b/docs/yaml_docs/spatial_deconvolution.md @@ -1,125 +1,206 @@ + # Spatial Deconvolution YAML -In this documentation, the parameters of the `deconvolution_spatial` yaml file are explained. -This file is generated running `panpipes deconvolution config`. -In general, the user can leave parameters empty to use defaults.
The individual steps run by the pipeline are described in the [spatial deconvolution workflow](../workflows/deconvolute_spatial.md). +In this documentation, the parameters of the `deconvolution_spatial` configuration yaml file are explained. +This file is generated running `panpipes deconvolution_spatial config`.
The individual steps run by the pipeline are described in the [spatial deconvolution workflow](../workflows/deconvolute_spatial.md). + +When running the deconvolution workflow, panpipes provides a basic `pipeline.yml` file. +To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data. +However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html). +You can download the different deconvolution pipeline.yml files here: +- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes deconvolution_spatial config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_deconvolution_spatial/pipeline.yml) +- `pipeline.yml` file for [Deconvoluting spatial data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/deconvolution/deconvoluting_spatial_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/deconvolution/pipeline.yml) ## 0. Compute Resource Options -| `resources` | | -| --- | --- | -| `threads_high` | __`int`__ (default: 1)
Number of threads used for high intensity computing tasks. | -| `threads_medium` | __`int`__ (default: 1)
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks. | -| `threads_low` | __`int`__ (default: 1)
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two.| +resources
+Computing resources to use, specifically the number of threads used for parallel jobs. +Specified by the following three parameters: + - threads_high `Integer`, Default: 1
+ Number of threads used for high intensity computing tasks. + + - threads_medium `Integer`, Default: 1
+ Number of threads used for medium intensity computing tasks. + For each thread, there must be enough memory to load your mudata and do computationally light tasks. + + - threads_low `Integer`, Default: 1
+ Number of threads used for low intensity computing tasks. + For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two. + +condaenv `String`
+ Path to conda environment that should be used to run panpipes. + Leave blank if running natively or if your cluster automatically inherits the login node environment. -| | | -| ---- | --- | -| `condaenv` | __`str`__ (default: None)
Path to conda environment that should be used to run `Panpipes`. Leave blank if running native or your cluster automatically inherits the login node environment. | ## 1. Input Options With the `deconvolution_spatial` workflow, one or multiple spatial slides can be deconvoluted in one run. For that, a `MuData` object for each slide is expected, with the spatial data saved in `mdata.mod["spatial"]`. The spatial slides are deconvoluted **using the same reference**. For the reference, one `MuData` with the gene expression data saved in `mdata.mod["rna"]` is expected as input. Please note, that the same parameter setting is used for each slide.
For the **spatial** input, the workflow, therefore, reads in **all `.h5mu` objects of a directory** (see below). **The spatial and single-cell data thus need to be saved in different folders.** +
+ +input
+ - spatial `String`, Mandatory parameter
+ Path to folder containing one or multiple `MuData` objects of spatial data. The pipeline reads in all `MuData` files in that folder and assumes that they are `MuData` objects of spatial slides. + + - singlecell `String`, Mandatory parameter
+ Path to the MuData **file** (not folder) of the reference single-cell data. -| `input` | | -| ---- | --- | -| `spatial` | __`str`__ (not optional)
Path to folder containing one or multiple `MuDatas` of spatial data. The pipeline is reading in all `MuData` files in that folder and assuming that they are `MuDatas` of spatial slides.| -| `singlecell` | __`str`__ (not optional)
Path to the MuData **file** (not folder) of the reference single-cell data.| ## 2. Cell2Location Options For each deconvolution method you can specify whether to run it or not: -| | | -| ---- | --- | -| `run` | __`bool`__ (default: None)
Whether to run Cell2location| +
+ +run `Boolean`, Default: None
+ Whether to run Cell2location -### Feature Selection + +### 2.1 Feature Selection You can select genes that are used for deconvolution in two ways. The first option is to provide a reduced feature set as a csv-file that is then used for deconvolution. The second option is to perform gene selection [according to Cell2Location](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html).
Please note, that gene selection is **not optional**. If no csv-file is provided, feature selection [according to Cell2Location.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html) is performed. +
+ +feature_selection
+ - gene_list `String`, Default: None
+ Path to a csv file containing a reduced feature set. A header in the csv is expected in the first row. All genes of that gene list need to be present in both the spatial slides and the scRNA-Seq reference. + + - remove_mt `Boolean`, Default: True
+ Whether to remove mitochondrial genes from the dataset. This step is performed **before** running gene selection. + + - cell_count_cutoff `Integer`, Default: 15
+ All genes detected in fewer than `cell_count_cutoff` cells will be excluded. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html) + + - cell_percentage_cutoff2 `Float`, Default: 0.05
+ All genes detected in at least this percentage of cells will be included. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html) + + - nonz_mean_cutoff `Float`, Default: 1.12
+ Genes detected in the number of cells between the above-mentioned cutoffs are selected only when their average expression in non-zero cells is above this cutoff. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html) + + +### 2.2 Reference Model + +reference
+ - labels_key `String`, Default: None
+ Key in `.obs` for label (cell type) information. + + - batch_key `String`, Default: None
+ Key in `.obs` for batch information. + + - layer `String`, Default: None
+ Layer in `.layers` to use for the reference model. If None, `.X` will be used. Please note that Cell2Location expects raw counts as input. + - categorical_covariate_keys `String`, Default: None
+ Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). -| `feature_selection` | | -| ---- | --- | -| `gene_list` | __`str`__ (default: None)
Path to a csv file containing a reduced feature set. A header in the csv is expected in the first row. All genes of that gene list need to be present in both, spatial slides and scRNA-Seq reference.| -| `remove_mt` | __`bool`__ (default: True)
Whether to remove mitochondrial genes from the dataset. This step is performed **before** running gene selection. | -| `cell_count_cutoff` | __`int`__ (default: 15)
All genes detected in less than cell_count_cutoff cells will be excluded. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html)| -| `cell_percentage_cutoff2` | __`float`__ (default: 0.05)
All genes detected in at least this percentage of cells will be included. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html)| -| `nonz_mean_cutoff` | __`float`__ (default: 1.12)
Genes detected in the number of cells between the above-mentioned cutoffs are selected only when their average expression in non-zero cells is above this cutoff. Parameter of the [Cell2Location's gene selection function.](https://cell2location.readthedocs.io/en/latest/cell2location.utils.filtering.html) | + - continuous_covariate_keys `String`, Default: None
+ Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space) + - max_epochs `Integer`, Default: _np.min([round((20000 / n_cells) * 400), 400])_
+ Number of epochs. -### Reference Model + - use_gpu `Boolean`, Default: True
+ Whether to use GPU for training. + -| `reference` | | -| ---- | --- | -| `labels_key` | __`str`__ (default: None)
Key in `.obs` for label (cell type) information. | -| `batch_key` | __`str`__ (default: None)
Key in `.obs` for batch information. | -| `layer` | __`float`__ (default: None)
Layer in `.layers` to use for the reference model. If None, `.X` will be used. Please note, that Cell2Location expects raw counts as input.| -| `categorical_covariate_keys` | __`str`__ (default: None)
Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space).| -| `continuous_covariate_keys` | __`str`__ (default: None)
Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)| -| `max_epochs` | __`int`__ (default: _np.min([round((20000 / n_cells) * 400), 400])_)
Number of epochs.| -| `use_gpu` | __`bool`__ (default: True)
Whether to use GPU for training. | +### 2.3 Spatial Model -### Spatial Model +spatial
+ - batch_key `String`, Default: None
+ Key in `.obs` for batch information. + - layer `String`, Default: None
+ Layer in `.layers` to use for the spatial model. If None, `.X` will be used. Please note that Cell2Location expects raw counts as input. -| `spatial` | | -| ---- | --- | -| `batch_key` | __`str`__ (default: None)
Key in `.obs` for batch information. | -| `layer` | __`float`__ (default: None)
Layer in `.layers` to use for the reference model. If None, `.X` will be used. Please note, that Cell2Location expects raw counts as input.| -| `categorical_covariate_keys` | __`str`__ (default: None)
Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space).| -| `continuous_covariate_keys` | __`str`__ (default: None)
Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)| -| `N_cells_per_location` | __`int`__ (not optional)
Expected cell abundance per voxel. Please refer to the [Cell2Location documentation](https://cell2location.readthedocs.io/en/latest/index.html) for more information. | -| `detection_alpha` | __`float`__ (not optional)
Regularization of with-in experiment variation in RNA detection sensitivity. Please refer to the [Cell2Location documentation](https://cell2location.readthedocs.io/en/latest/index.html) for more information. | -| `max_epochs` | __`int`__ (default: _np.min([round((20000 / n_cells) * 400), 400])_)
Number of epochs.| -| `use_gpu` | __`bool`__ (default: True)
Whether to use GPU for training. | + - categorical_covariate_keys `String`, Default: None
+ Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). + + - continuous_covariate_keys `String`, Default: None
+ Comma-separated without spaces, e.g. _key1,key2,key3_. Keys in `.obs` that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space) + + - N_cells_per_location `Integer`, Mandatory parameter
+ Expected cell abundance per voxel. Please refer to the [Cell2Location documentation](https://cell2location.readthedocs.io/en/latest/index.html) for more information. + + - detection_alpha `Float`, Mandatory parameter
+ Regularization of within-experiment variation in RNA detection sensitivity. Please refer to the [Cell2Location documentation](https://cell2location.readthedocs.io/en/latest/index.html) for more information. + + - max_epochs `Integer`, Default: _np.min([round((20000 / n_cells) * 400), 400])_
+ Number of epochs. + + - use_gpu `Boolean`, Default: True
+ Whether to use GPU for training. -###
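The default `max_epochs` expression given for the reference and spatial models above can be read as a cap of 400 epochs that is scaled down for large datasets. The following is a minimal sketch of that reading, not the pipeline's own code. It assumes `n_cells` is the number of observations in the object being trained on (the documentation does not state this explicitly). `default_max_epochs` is a hypothetical helper name, and the built-in `min` stands in for `np.min`:

```python
def default_max_epochs(n_cells: int) -> int:
    """Mirror the documented default: np.min([round((20000 / n_cells) * 400), 400])."""
    if n_cells <= 0:
        raise ValueError("n_cells must be a positive integer")
    # Scale 400 epochs by 20000 / n_cells, then cap the result at 400.
    return min(round((20000 / n_cells) * 400), 400)


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to show both sides of the cap.
    for n in (5_000, 20_000, 40_000):
        print(n, default_max_epochs(n))
```

Under this reading, datasets with up to 20,000 cells train for 400 epochs, and larger datasets train for proportionally fewer epochs.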
-You can specify whether both models should be saved with the following parameter: -| | | -| ---- | --- | -| `save_models` | __`bool`__ (default: False)
Whether to save the reference & spatial mapping models| +You can specify whether both models (spatial and reference) should be saved with the following parameter: +
+ +save_models `Boolean`, Default: False
+ Whether to save the reference & spatial mapping models. ## 3. Tangram Options For each deconvolution method you can specify whether to run it or not: -| | | -| ---- | --- | -| `run` | __`bool`__ (default: None)
Whether to run Tangram| +
+run `Boolean`, Default: None
+ Whether to run Tangram -### Feature Selection -You can select genes that are used for deconvolution in two ways. The first option is to provide a reduced feature set as a csv-file that is then used for deconvolution. The second option is to perform gene selection via [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) **on the reference scRNA-Seq data**, as [suggested by Tangram](https://tangram-sc.readthedocs.io/en/latest/tutorial_sq_link.html#Pre-processing). The top `n_genes` of each group make up the reduced gene set.
Please note, that gene selection is **not optional**. If no csv-file is provided, feature selection via [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) is performed. +### 3.1 Feature Selection +You can select genes that are used for deconvolution in two ways. The first option is to provide a reduced feature set as a csv-file that is then used for deconvolution. The second option is to perform gene selection via [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) **on the reference scRNA-Seq data**, as [suggested by Tangram](https://tangram-sc.readthedocs.io/en/latest/tutorial_sq_link.html#Pre-processing). The top `n_genes` of each group make up the reduced gene set.
Please note, that gene selection is **not optional**. If no csv-file is provided, feature selection via [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) is performed. +
-| `feature_selection` | | -| ---- | --- | -| `gene_list` | __`str`__ (default: None)
Path to a csv file containing a reduced feature set. A header in the csv is expected in the first row. All genes of that gene list need to be present in both, spatial slides and scRNA-Seq reference.| +feature_selection
+ - gene_list `String`, Default: None
+ Path to a csv file containing a reduced feature set. A header in the csv is expected in the first row. All genes of that gene list need to be present in both the spatial slides and the scRNA-Seq reference. ___Parameters for `scanpy.tl.rank_genes_groups` gene selection___ -| `rank_genes` | | -| ---- | --- | -| `labels_key` | __`str`__ (default: None)
Which column in `.obs` of the reference to use for the `groupby` parameter of [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) .| -| `layer` | __`str`__ (default: None)
Which layer of the reference to use for [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html). If None, `.X` is used.| -| `n_genes` | __`int`__ (default: 100)
How many top genes to select of each `groupby` group| -| `test_method` | __`str ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']`__ (default: 't-test_overestim_var')
Which test method to use.| -| `correction_method` | __`str ['benjamini-hochberg', 'bonferroni']`__ (default: ' benjamini-hochberg')
Which p-value correction method to use. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'. | - -### Model - -| `model` | | -| ---- | --- | -| `labels_key` | __`str`__ (default: None)
Key in `.obs` for label (cell type) information. | -| `num_epochs` | __`int`__ (default: 1000)
Number of epochs. | -| `device` | __`str`__ (default: 'cpu')
Which device to use. | -| `kwargs` | In `kwargs`, the user has the possibility to specify parameters for [tangram.mapping_utils.map_cells_to_space](https://tangram-sc.readthedocs.io/en/latest/classes/tangram.mapping_utils.map_cells_to_space.html?highlight=mapping_utils%20map_cells_to_space#tangram.mapping_utils.map_cells_to_space). You can add or remove any parameters of the function.| + - rank_genes
+ - labels_key `String`, Default: None
+ Which column in `.obs` of the reference to use for the `groupby` parameter of [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html). + + - layer `String`, Default: None
+ Which layer of the reference to use for [scanpy.tl.rank_genes_groups](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html). If None, `.X` is used. + + - n_genes `Integer`, Default: 100
+ How many top genes to select of each `groupby` group. + + - test_method `['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']`, Default: 't-test_overestim_var'
+ Which test method to use. + + - correction_method `['benjamini-hochberg', 'bonferroni']`, Default: 'benjamini-hochberg'
+ Which p-value correction method to use. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'. + + +### 3.2 Model + +model
+ - labels_key `String`, Default: None
+ Key in `.obs` for label (cell type) information. + + - num_epochs `Integer`, Default: 1000
+ Number of epochs. + + - device `String`, Default: 'cpu'
+ Which device to use. + + - kwargs
+ In `kwargs`, the user has the possibility to specify parameters for [tangram.mapping_utils.map_cells_to_space](https://tangram-sc.readthedocs.io/en/latest/classes/tangram.mapping_utils.map_cells_to_space.html?highlight=mapping_utils%20map_cells_to_space#tangram.mapping_utils.map_cells_to_space). You can add or remove any parameters of the function. + diff --git a/docs/yaml_docs/spatial_preprocess.md b/docs/yaml_docs/spatial_preprocess.md index b443d4e2..1e2dac66 100644 --- a/docs/yaml_docs/spatial_preprocess.md +++ b/docs/yaml_docs/spatial_preprocess.md @@ -1,41 +1,69 @@ + # Spatial Preprocessing YAML -In this documentation, the parameters of the `preprocess_spatial` yaml file are explained. -This file is generated running `panpipes preprocess_spatial config`. In general, the user can leave parameters empty to use defaults.
The individual steps run by the pipeline are described in the [spatial preprocess workflow](../workflows/preprocess_spatial.md). +In this documentation, the parameters of the `preprocess_spatial` configuration yaml file are explained. +This file is generated running `panpipes preprocess_spatial config`.
The individual steps run by the pipeline are described in the [spatial preprocessing workflow](../workflows/preprocess_spatial.md). +When running the preprocess workflow, panpipes provides a basic `pipeline.yml` file. +To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data. +However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html). +You can download the different preprocess pipeline.yml files here: +- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes preprocess_spatial config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_preprocess_spatial/pipeline.yml) +- `pipeline.yml` file for [Preprocessing spatial data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/preprocess_spatial_data/preprocess_spatial_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/preprocess_spatial_data/pipeline.yml) ## 0. Compute Resource Options -| `resources` | | -| --- | --- | -| `threads_high` | __`int`__ (default: 1)
Number of threads used for high intensity computing tasks. | -| `threads_medium` | __`int`__ (default: 1)
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks. | -| `threads_low` | __`int`__ (default: 1)
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two.| +resources
+Computing resources to use, specifically the number of threads used for parallel jobs. +Specified by the following three parameters: + - threads_high `Integer`, Default: 1
+ Number of threads used for high intensity computing tasks. -| | | -| ---- | --- | -| `condaenv` | __`str`__ (default: None)
Path to conda environment that should be used to run `Panpipes`. Leave blank if running native or your cluster automatically inherits the login node environment. | + - threads_medium `Integer`, Default: 1
+ Number of threads used for medium intensity computing tasks. + For each thread, there must be enough memory to load your mudata and do computationally light tasks. + + - threads_low `Integer`, Default: 1
+ Number of threads used for low intensity computing tasks. + For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two. + +condaenv `String`
+ Path to conda environment that should be used to run panpipes. + Leave blank if running natively or if your cluster automatically inherits the login node environment. ## 1. Input Options With the preprocess_spatial workflow, one or multiple `MuData` objects can be preprocessed in one run. The workflow **reads in all `.h5mu` objects of a directory**. The `MuData` objects in the directory need to be of the same assay (vizgen or visium). The workflow then runs the preprocessing of each `MuData` object separately with the same parameters that are specified in the yaml file. +
+ +input_dir `String`, Mandatory parameter
+ Path to the folder containing all input `h5mu` files. + +assay [`'visium'`, `'vizgen'`], Default: `'visium'`
+ Spatial transcriptomics assay of the `h5mu` files in `input_dir`. -| | | -| ---- | --- | -| `input_dir` | __`str`__ (not optional)
Path to the folder containing all input `h5mu` files. | -| `assay` | __`str` [`'visium'`, `'vizgen'`]__ (default: 'visium')
Spatial transcriptomics assay of the `h5mu` files in `input_dir`.| ## 2. Filtering Options +filtering
+ - run `Boolean`, Default: False
+ Whether to run filtering. **If `False`, will not filter the data and will not produce post-filtering plots.** -| `filtering` | | -| --- | --- | -| `run` | __`bool`__ (default: False)
Whether to run filtering. **If `False`, will not filter the data and will not produce post-filtering plots.** | -| `keep_barcodes` | __`str`__ (default: None)
Path to a csv-file that has **no header** containing barcodes you want to keep. Barcodes that are not in the file, will be removed from the dataset before filtering the dataset with the thresholds specified below. | + - keep_barcodes `String`, Default: None
+ Path to a csv-file that has **no header** containing barcodes you want to keep. Barcodes that are not in the file will be removed from the dataset before filtering the dataset with the thresholds specified below. +
With the parameters below you can specify thresholds for filtering. The filtering is fully customisable to any columns in `.obs` or `.var`. You are not restricted by the columns given as default. When specifying a column name, please make sure it exactly matches the column name in the h5mu object.
Please also make sure that the specified metrics are present in all `h5mu` objects of the `input_dir`, i.e. the `MuData` objects for which the preprocessing is run. @@ -60,51 +88,65 @@ With the parameters below you can specify thresholds for filtering. The filterin ## 3. Post-Filter Plotting The parameters below specify which metrics of the filtered data to plot. As for the [QC](./spatial_qc.md), violin and spatial embedding plots are generated for each slide separately. +
-| `plotqc` | | -| --- | --- | -| `grouping_var` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _sample_id,batch_ of categorical columns in `.obs`. One violin will be created for each group in the violin plot. Not mandatory, can be left empty. | -| `spatial_metrics` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _total_counts,n_genes_by_counts_ of columns in `.obs` or `.var`.
Specifies which metrics to plot. If metric is present in both, `.obs` and `.var`, **both will be plotted.** | +plotqc
+ - grouping_var `String`, Default: None
+ Comma-separated string without spaces, e.g. _sample_id,batch_ of categorical columns in `.obs`. One violin will be created for each group in the violin plot. Not mandatory, can be left empty. + - spatial_metrics `String`, Default: None
+ Comma-separated string without spaces, e.g. _total_counts,n_genes_by_counts_ of columns in `.obs` or `.var`.
Specifies which metrics to plot. If metric is present in both, `.obs` and `.var`, **both will be plotted.** + ## 4. Normalization, HVG Selection, and PCA Options -### **Normalization and HVG Selection**
+### **4.1 Normalization and HVG Selection** +`Panpipes` offers two different normalization and HVG selection flavours, `'seurat'` and `'squidpy'`.
The `'seurat'` flavour first selects HVGs on the raw counts using analytic Pearson residuals, i.e. [scanpy.experimental.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.highly_variable_genes.html). Afterwards, analytic Pearson residual normalization is applied, i.e. [scanpy.experimental.pp.normalize_pearson_residuals](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.normalize_pearson_residuals.html). Parameters of both functions can be specified by the user in the yaml file.
The `'squidpy'` flavour runs the basic scanpy normalization and HVG selection functions, i.e. [scanpy.pp.normalize_total](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html), [scanpy.pp.log1p](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.log1p.html), and [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). +
+ +norm_hvg_flavour [`'squidpy'`, `'seurat'`], Default: None
+ Normalization and HVG selection flavour to use. If None, neither normalization nor HVG selection is run. +
+ +___Parameters for `norm_hvg_flavour` == `'squidpy'`___
+ +squidpy_hvg_flavour [`'seurat'`, `'cellranger'`, `'seurat_v3'`], Default: 'seurat'
+ Flavour to select HVGs, i.e. the `flavor` parameter of the function [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). + +min_mean `Float`, Default: 0.05
+ Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). + +max_mean `Float`, Default: 1.5
+ Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). + +min_disp `Float`, Default: 0.5
+ Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). + +___Parameters for `norm_hvg_flavour` == `'seurat'`___
-`Panpipes` offers two different normalization and HVG selection flavours, `'seurat'` and `'squidpy'`.
The `'seurat'` flavour first selects HVGs on the raw counts using analytic Pearson residuals, i.e. [scanpy.experimental.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.highly_variable_genes.html). Afterwards, analytic Pearson residual normalization is applied, i.e. [scanpy.experimental.pp.normalize_pearson_residuals](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.normalize_pearson_residuals.html). Parameters of both functions can be specified by the user in the yaml file.
The `'squidpy'` flavour runs the basic scanpy normalization and HVG selection functions, i.e. [scanpy.pp.normalize_total](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html), [scanpy.pp.log1p](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.log1p.html), and [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html).
+theta `Float`, Default: 100
+ The negative binomial overdispersion parameter for Pearson residuals. The same value is used for [HVG selection](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.highly_variable_genes.html) and [normalization](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.normalize_pearson_residuals.html). +clip `Float`, Default: None
+ Specifies clipping of the residuals.
`clip` can be specified as:
-| `spatial` | | -| --- | --- | -| `norm_hvg_flavour` | __`str` [`'squidpy'`, `'seurat'`]__ (default: None)
Normalization and HVG selection flavour to use. If None, will not run normalization nor HVG selection. | +___Parameters for both `norm_hvg_flavour` flavours___
-___Parameters for `norm_hvg_flavour` == `'squidpy'`___ -| | | -| --- | --- | -| `squidpy_hvg_flavour` | __`str` [`'seurat'`,`'cellranger'`,`'seurat_v3'`]__ (default: 'seurat')
Flavour to select HVGs, i.e.`flavor` parameter of the function [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html).| -| `min_mean` | __`float`__ (default: 0.05)
Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html).| -| `max_mean` | __`float`__ (default: 1.5)
Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). | -| `min_disp` | __`float`__ (default: 0.5)
Parameter in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html). | +n_top_genes `Integer`, Default: 2000
+ Number of genes to select. Mandatory for `norm_hvg_flavour='seurat'` and `squidpy_hvg_flavour='seurat_v3'`. -___Parameters for `norm_hvg_flavour` == `'seurat'`___ -| | | -| --- | --- | -| `theta` | __`float`__ (default: 100)
The negative binomial overdispersion parameter for pearson residuals. The same value is used for [HVG selection]((https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.highly_variable_genes.html)) and [normalization](https://scanpy.readthedocs.io/en/stable/generated/scanpy.experimental.pp.normalize_pearson_residuals.html). | -| `clip` | __`float`__ (default: None)
Specifies clipping of the residuals.
`clip` can be specified as:
| +filter_by_hvg `Boolean`, Default: False
+ Subset the data to the HVGs. -___Parameters for both `norm_hvg_flavour` flavours___ -| | | -| --- | --- | -| `n_top_genes` | __`int`__ (default: 2000)
Number of genes to select. Mandatory for `norm_hvg_flavour='seurat'` and `squidpy_hvg_flavour='seurat_v3'`.| -| `filter_by_hvg` | __`bool`__ (default: False)
Subset the data to the HVGs.
| -| `hvg_batch_key` | __`str`__ (default: None)
If specified, HVGs are selected within each batch separately and merged. | +hvg_batch_key `String`, Default: None
+ If specified, HVGs are selected within each batch separately and merged. -### **PCA** +### **4.2 PCA** After normalization and HVG selection, PCA is run, and the PCA and elbow plots are generated. The user specifies a single number of PCs, which is used for both the PCA computation and the elbow plot. +
-| | | -| --- | --- | -| `n_pcs` | __`int`__ (default: 50)
Number of PCs to compute. | +n_pcs `Integer`, Default: 50
+ Number of PCs to compute. diff --git a/docs/yaml_docs/spatial_qc.md b/docs/yaml_docs/spatial_qc.md index 6c933d7f..a2795c53 100644 --- a/docs/yaml_docs/spatial_qc.md +++ b/docs/yaml_docs/spatial_qc.md @@ -1,50 +1,86 @@ + # Spatial QC YAML -In this documentation, the parameters of the `qc_spatial` yaml file are explained. -This file is generated running `panpipes qc_spatial config`. -In general, the user can leave parameters empty to use defaults.
The individual steps run by the pipeline are described in the [spatial QC workflow](../workflows/ingest_spatial.md). +This documentation explains the parameters of the `qc_spatial` configuration yaml file. +This file is generated by running `panpipes qc_spatial config`.
The individual steps run by the pipeline are described in the [spatial QC workflow](../workflows/ingest_spatial.md). + +When running the qc workflow, panpipes provides a basic `pipeline.yml` file. +To run the workflow on your own data, you need to specify the parameters described below in the `pipeline.yml` file to meet the requirements of your data. +However, we do provide pre-filled versions of the `pipeline.yml` file for individual [tutorials](https://panpipes-pipelines.readthedocs.io/en/latest/tutorials/index.html). +You can download the different ingestion pipeline.yml files here: +- Basic `pipeline.yml` file (not prefilled) that is generated when calling `panpipes qc_spatial config`: [Download here](https://github.com/DendrouLab/panpipes/blob/main/panpipes/panpipes/pipeline_qc_spatial/pipeline.yml) +- `pipeline.yml` file for [Ingesting 10X Visium data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_visium_data/Ingesting_visium_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/ingesting_visium_data/pipeline.yml) +- `pipeline.yml` file for [Ingesting MERFISH data Tutorial](https://panpipes-tutorials.readthedocs.io/en/latest/ingesting_merfish_data/Ingesting_merfish_data_with_panpipes.html): [Download here](https://github.com/DendrouLab/panpipes-tutorials/blob/main/docs/ingesting_merfish_data/pipeline.yml) ## 0. Compute Resource Options -| `resources` | | -| --- | --- | -| `threads_high` | __`int`__ (default: 1)
Number of threads used for high intensity computing tasks. For each thread, there must be enough memory to load all input files and create MuDatas. | -| `threads_medium` | __`int`__ (default: 1)
Number of threads used for medium intensity computing tasks. For each thread, there must be enough memory to load your mudata and do computationally light tasks. | -| `threads_low` | __`int`__ (default: 1)
Number of threads used for low intensity computing tasks. For each thread, there must be enough memory to load text files and do plotting, requires much less memory than the other two.| +resources
+Computing resources to use, specifically the number of threads used for parallel jobs. +Specified by the following three parameters: + - threads_high `Integer`, Default: 1
+ Number of threads used for high intensity computing tasks. + For each thread, there must be enough memory to load all your input files at once and create the MuData object. + + - threads_medium `Integer`, Default: 1
+ Number of threads used for medium intensity computing tasks. + For each thread, there must be enough memory to load your mudata and do computationally light tasks. + - threads_low `Integer`, Default: 1
+ Number of threads used for low intensity computing tasks. + For each thread, there must be enough memory to load text files and do plotting; this requires much less memory than the other two. -| | | -| ---- | --- | -| `condaenv` | __`str`__ (default: None)
Path to conda environment that should be used to run `Panpipes`. Leave blank if running native or your cluster automatically inherits the login node environment | +condaenv `String`
+ Path to conda environment that should be used to run panpipes. + Leave blank if running natively or if your cluster automatically inherits the login node environment. ## 1. Loading Options -| | | -| ---- | --- | -| `project` | __`str`__ (default: None)
Project name | -| `submission_file` | __`str`__ (not optional)
Path to the submission file. The submission file specifies the input files. Please refer to the [general guidelines](../usage/setup_for_spatial_workflows.md) for details on the format of the file. | +project `String`, Default: None
+ Project name. + +submission_file `String`, Mandatory parameter
+ Path to the submission file. The submission file specifies the input files. Please refer to the [general guidelines](../usage/setup_for_spatial_workflows.md) for details on the format of the file. ## 2. QC Options This part of the workflow generates additional QC metrics that can be used for filtering/preprocessing. Basic QC metrics are always calculated using [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html). The computation of the additional QC metrics is **optional**; leave the parameters empty to skip it. +
+ +ccgenes `String`, Default: None
+ Path to tsv-file used to run the function [scanpy.tl.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html). The tsv-file is expected to have two columns named `cc_phase` and `gene_name`. `cc_phase` can be either `s` or `g2m`. **Varying the column names or `cc_phase` values will result in an error.** Please refer to the [general guidelines](../usage/gene_list_format.md) for more information on the tsv file.
Instead of a path, the user can specify the parameter as "default" which then uses a [provided tsv file](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/cell_cycle_genes.tsv). + +custom_genes_file `String`, Default: None
+ Path to csv-file containing a gene list with columns `group` and `feature`. **Varying the column names will result in an error.** Please refer to the [general guidelines](../usage/gene_list_format.md) for more information about the file.
The gene list is used to calculate the proportions of genes of a group in the cells/spots. More precisely, the groups and genes are passed to the `qc_vars` parameter of the function [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html), which calculates the proportions accordingly.
Additionally, the gene list is used to compute gene scores with [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html).
Instead of a path, the user can specify the parameter as "default" which then uses a [provided csv file](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv). + +calc_proportions `String`, Default: None
+ Comma-separated string without spaces, e.g. _mito,hp,rp_.
Groups of the csv-file specified in `custom_genes_file` for which to calculate percentages. + +score_genes `String`, Default: None
+ Comma-separated string without spaces, e.g. _mito,hp,rp_.
Groups of the csv-file specified in `custom_genes_file` for which to run [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html). + +
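As an illustration of how the optional QC parameters above fit together, a minimal sketch of the relevant `pipeline.yml` entries could look as follows. It assumes the keys are placed at the top level of the file. The group names `mito`, `hp`, and `rp` are taken from the example above and must exist in the `group` column of the gene list that is actually used.

```yaml
# Illustrative sketch only; top-level placement of these keys is an assumption.
ccgenes: default              # use the provided cell cycle tsv file
custom_genes_file: default    # use the provided QC gene list csv file
calc_proportions: mito,hp,rp  # comma-separated, no spaces
score_genes: mito,hp,rp       # comma-separated, no spaces
```

Leaving any of these entries empty skips the corresponding optional computation.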
+The following parameters specify the QC metrics to plot in violin and spatial embedding plots. Plots are generated for each slide specified in the submission file separately.
+
-| | | -| ---- | --- | -| `ccgenes` | __`str`__ (default: None)
Path to tsv-file used to run the function [scanpy.tl.score_genes_cell_cycle](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes_cell_cycle.html). It is expected, that the tsv-file has two columns with names `cc_phase` and `gene_name`. `cc_phase` can either be `s` or `g2m`.**Varying the column names or `cc_phase` values will result in an error.** Please refer to the [general guidelines](../usage/gene_list_format.md) for more information on the tsv file.
Instead of a path, the user can specify the parameter as "default" which then uses a [provided tsv file](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/cell_cycle_genes.tsv).| -| `custom_genes_file` | __`str`__ (default: None)
Path to csv-file containing a gene list with columns `group` and `feature`. **Varying the column names will result in an error.** Please refer to the [general guidelines](../usage/gene_list_format.md) for more information about the file.
The gene list is used to calculate the proportions of genes of a group in the cells/spots. More precise, the groups & genes are used for the `qc_vars` parameter of the function [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html) which accordingly calculates proportions.
Additionally the gene list is used to compute gene scores with [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html)
Instead of a path, the user can specify the parameter as "default" which then uses a [provided csv file](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv).| -| `calc_proportions` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _mito,hp,rp_.
For which groups of the csv-file specified in `custom_genes_file` to calculate percentages. | -| `score_genes` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _mito,hp,rp_.
For which groups of the csv-file specified in `custom_genes_file` to run [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html) | +plotqc
+ - grouping_var `String`, Default: None
+ Comma-separated string without spaces, e.g. _sample_id,batch_ of categorical columns in `.obs`. One violin will be created for each group in the violin plot. Not mandatory, can be left empty. -The following parameters specify the QC metrics to plot in violin and spatial embedding plots. Plots are generated for each slide specified in the submission file separately. + - spatial_metrics `String`, Default: None
+ Comma-separated string without spaces, e.g. _total_counts,n_genes_by_counts_ of columns in `.obs` or `.var`.
Specifies which metrics to plot. If a metric is present in both `.obs` and `.var`, **both will be plotted.** -| `plotqc` | | -| --- | --- | -| `grouping_var` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _sample_id,batch_ of categorical columns in `.obs`. One violin will be created for each group in the violin plot. Not mandatory, can be left empty.| -| `spatial_metrics` | __`str`__ (default: None)
Comma-separated string without spaces, e.g. _total_counts,n_genes_by_counts_ of columns in `.obs` or `.var`.
Specifies which metrics to plot. If metric is present in both, `.obs` and `.var`, **both will be plotted.**|
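
For illustration, a minimal sketch of a `plotqc` block, using the example values given in the parameter descriptions. The column names are assumptions about what is present in `.obs` or `.var` of your data.

```yaml
# Illustrative sketch only; column names must exist in your data.
plotqc:
  grouping_var: sample_id,batch                    # categorical columns in .obs
  spatial_metrics: total_counts,n_genes_by_counts  # columns in .obs or .var
```

With this configuration, one violin would be drawn per group of `sample_id` and of `batch`, and both metrics would be plotted for each slide listed in the submission file.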