Merge branch 'main' of github.com:vadimnazarov/panpipes

DendrouLab · Feb 4, 2025 · 410c897
2 parents b76bfa9 + 304a9f8

Showing 16 changed files with 307 additions and 9 deletions.
diff --git a/docs/yaml_docs/index.rst b/docs/yaml_docs/index.rst
@@ -14,5 +14,6 @@ Workflows configuration files
     spatial_deconvolution
     pipeline_visualization_yml
     pipeline_refmap_yml
+    threads_tasks_panpipes
 
 
diff --git a/docs/yaml_docs/pipeline_clustering_yml.md b/docs/yaml_docs/pipeline_clustering_yml.md
@@ -31,7 +31,7 @@ You can download the different clustering pipeline.yml files here:
 ## Compute resources options
 
 - <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 2<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/pipeline_ingestion_yml.md b/docs/yaml_docs/pipeline_ingestion_yml.md
@@ -31,7 +31,7 @@ You can download the different ingestion `pipeline.yml` files here:
 ## Compute resources options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/pipeline_integration_yml.md b/docs/yaml_docs/pipeline_integration_yml.md
@@ -23,7 +23,7 @@ For more information on functionalities implemented in `panpipes` to read the co
 ## Compute resources options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
    Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/pipeline_preprocess_yml.md b/docs/yaml_docs/pipeline_preprocess_yml.md
@@ -27,7 +27,7 @@ You can download the different preprocess `pipeline.yml` files here:
 ## Compute resources options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 2<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/pipeline_refmap_yml.md b/docs/yaml_docs/pipeline_refmap_yml.md
@@ -27,7 +27,7 @@ You can download the different refmap `pipeline.yml` files here:
 ## Compute resources options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
 Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/pipeline_visualization_yml.md b/docs/yaml_docs/pipeline_visualization_yml.md
@@ -24,7 +24,7 @@ You can download the different ingestion `pipeline.yml` files here:
 
 ## Compute resources options
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/spatial_deconvolution.md b/docs/yaml_docs/spatial_deconvolution.md
@@ -26,7 +26,7 @@ For more information on functionalities implemented in `panpipes` to read the co
 ## 0. Compute Resource Options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/spatial_preprocess.md b/docs/yaml_docs/spatial_preprocess.md
@@ -28,7 +28,7 @@ You can download the different preprocess pipeline.yml files here:
 ## 0. Compute Resource Options
 
 <span class="parameter">resources</span><br>
-Computing resources to use, specifically the number of threads used for parallel jobs.
+Computing resources to use, specifically the number of threads used for parallel jobs. Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/spatial_qc.md b/docs/yaml_docs/spatial_qc.md
@@ -29,6 +29,7 @@ For more information on functionalities implemented in `panpipes` to read the co
 
 <span class="parameter">resources</span><br>
 Computing resources to use, specifically the number of threads used for parallel jobs.
+Check [threads_tasks_panpipes](./threads_tasks_panpipes.md) for more information on which threads each specific task requires.
 Specified by the following three parameters:
   - <span class="parameter">threads_high</span> `Integer`, Default: 1<br>
         Number of threads used for high intensity computing tasks. 

diff --git a/docs/yaml_docs/threads_tasks_panpipes.md b/docs/yaml_docs/threads_tasks_panpipes.md
@@ -0,0 +1,271 @@
+# Threads for individual workflow tasks 
+
+<table>
+  <tr>
+    <th colspan="3">Task ingest</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th>threads_medium</th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>Creating h5mu from filtered data files</td>
+    <td>load_mudatas</td>
+    <td>run_repertoire_qc</td>
+  </tr>
+  <tr>
+    <td>Creating h5mu from bg data files</td>
+    <td>load_bg_mudatas</td>
+    <td>run_atac_qc</td>
+  </tr>
+  <tr>
+    <td>rna QC</td>
+    <td>downsample_bg_mudatas</td>
+    <td>plot_qc</td>
+  </tr>
+  <tr>
+    <td>prot QC</td>
+    <td>run_scrublet</td>
+    <td>10X metrics plotting</td>
+  </tr>
+  <tr>
+    <td>prot QC</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task preprocess</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th>threads_medium</th>
+    <th>threads_low</th>
+    <th></th>
+    <th></th>
+  </tr>
+  <tr>
+    <td>assess background</td>
+    <td></td>
+    <td>filter_mudata</td>
+  </tr>
+  <tr>
+    <td>rna_preprocess</td>
+    <td></td>
+    <td>downsample</td>
+  </tr>
+  <tr>
+    <td>prot_preprocess</td>
+    <td></td>
+    <td>postfilterplot</td>
+  </tr>
+  <tr>
+    <td>atac_preprocess</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task integration</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th>threads_medium</th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>run_no_batch_correct_rna</td>
+    <td>Evaluation</td>
+    <td>run_lisi</td>
+  </tr>
+  <tr>
+    <td>run_bbknn_rna</td>
+    <td>plot_umaps</td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_harmony_rna</td>
+    <td>run_scib_metrics</td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_combat_rna</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_scanorama_rna</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_scvi_rna</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_no_batch_correct_prot</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_harmony_prot</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_bbknn_prot</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_combat_prot</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_no_batch_correct_atac</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_harmony_atac</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_bbknn_atac</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_totalvi</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_multivi</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_mofa</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_wnn</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>merge_integration</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task clustering</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th>threads_medium</th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>run_neighbors</td>
+    <td>run_clustering</td>
+    <td>plot_clustree</td>
+  </tr>
+  <tr>
+    <td>run_umap</td>
+    <td>collate_mdata</td>
+    <td>aggregate_clusters</td>
+  </tr>
+  <tr>
+    <td>find_markers</td>
+    <td>plot_cluster_umaps</td>
+    <td></td>
+  </tr>
+  <tr>
+    <td></td>
+    <td>plot_markers</td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task vis</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th></th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>plot_custom_markers_per_group</td>
+    <td></td>
+    <td>plot_metrics</td>
+  </tr>
+  <tr>
+    <td>plot_custom_markers_umap</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>plot_categorical_umaps</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>write_obs</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>plot_scatters</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task refmap</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th></th>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_refmap_scvi</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <td>run_scib_refmap</td>
+    <td></td>
+    <td></td>
+  </tr>
+  <tr>
+    <th colspan="3">Task preprocess spatial</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th></th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>spatial_preprocess</td>
+    <td></td>
+    <td>filter_mudata</td>
+  </tr>
+  <tr>
+    <th colspan="3">Task Spatial</th>
+  </tr>
+  <tr>
+    <th>threads_high</th>
+    <th></th>
+    <th>threads_low</th>
+  </tr>
+  <tr>
+    <td>load_mudata</td>
+    <td></td>
+    <td>plotQC_spatial</td>
+  </tr>
+</table>
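
The table above assigns each task to one of the three thread settings. As a rough illustration of how that mapping could be used, the sketch below looks up the thread count for a clustering task. The tier assignments are copied from the "Task clustering" rows; the `RESOURCES` values and the `threads_for` helper are hypothetical and are not part of panpipes.

```python
# Hypothetical thread counts, standing in for the `resources` block of a pipeline.yml.
RESOURCES = {"threads_high": 2, "threads_medium": 1, "threads_low": 1}

# Tier assignments copied from the "Task clustering" rows of the table.
CLUSTERING_TIERS = {
    "run_neighbors": "threads_high",
    "run_umap": "threads_high",
    "find_markers": "threads_high",
    "run_clustering": "threads_medium",
    "collate_mdata": "threads_medium",
    "plot_cluster_umaps": "threads_medium",
    "plot_markers": "threads_medium",
    "plot_clustree": "threads_low",
    "aggregate_clusters": "threads_low",
}


def threads_for(task, tiers=CLUSTERING_TIERS, resources=RESOURCES):
    """Return the thread count configured for the tier a task belongs to (illustrative helper)."""
    return resources[tiers[task]]


for task in ("run_umap", "plot_markers", "plot_clustree"):
    print(task, "->", CLUSTERING_TIERS[task], "=", threads_for(task))
```

Raising `threads_high` in this sketch changes the count for `run_neighbors`, `run_umap` and `find_markers` only, which is the kind of question the table is meant to answer.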
diff --git a/panpipes/panpipes/pipeline_ingest.py b/panpipes/panpipes/pipeline_ingest.py
@@ -104,6 +104,12 @@ def unfilt_file():
 def gen_load_filtered_anndata_jobs():
     caf = pd.read_csv(PARAMS["submission_file"], sep="\t")
 
+    duplicated_rows = caf.duplicated()
+
+    if duplicated_rows.any():
+        print(f"Duplicated rows found and removed: {duplicated_rows.sum()} rows.")
+        caf = caf.drop_duplicates()
+
     return gen_load_anndata_jobs(
         caf,
         load_raw=False,
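
To see what the added duplicate check does, the following minimal sketch runs the same pandas calls on an in-memory tab-separated table. The column names, sample values, and the `read_submission` helper are invented for illustration and do not reflect the real submission file layout.

```python
import io

import pandas as pd

# Hypothetical submission file content; the columns are illustrative only.
SUBMISSION_TSV = (
    "sample_id\trna_path\n"
    "sampleA\tdata/A\n"
    "sampleB\tdata/B\n"
    "sampleA\tdata/A\n"  # exact repeat of the first data row
)


def read_submission(handle):
    """Read a tab-separated table and drop rows that exactly repeat an earlier row."""
    caf = pd.read_csv(handle, sep="\t")
    duplicated_rows = caf.duplicated()  # True for every repeat of an earlier identical row
    if duplicated_rows.any():
        print(f"Duplicated rows found and removed: {duplicated_rows.sum()} rows.")
        caf = caf.drop_duplicates()
    return caf


caf = read_submission(io.StringIO(SUBMISSION_TSV))
print(caf)
```

Note that `duplicated()` with default arguments compares whole rows, so two rows that share a sample identifier but differ in any other column are both kept.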

diff --git a/panpipes/panpipes/pipeline_integration.py b/panpipes/panpipes/pipeline_integration.py
@@ -817,6 +817,7 @@ def plot_umaps(infile, outfile):
 
 #this can follow now any mtd generation, but it will collate only RNA jobs for lisi
 @follows(collate_integration_outputs)
+@active_if(PARAMS['lisi_run'])
 @transform(collate_integration_outputs, 
            formatter(),  'logs/7_lisi.log')
 def run_lisi(infile, outfile):
@@ -834,6 +835,7 @@ def run_lisi(infile, outfile):
 
 
 @follows(collate_integration_outputs)
+@active_if(PARAMS['scib_run'])
 @transform(collate_integration_outputs, formatter(),  'logs/scib.log')
 def run_scib_metrics(infile, outfile):
     cell_mtd_file = sprefix + "_cell_mtd.csv"
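
The two `@active_if` lines make `run_lisi` and `run_scib_metrics` optional, controlled by the `lisi_run` and `scib_run` parameters. The sketch below imitates that behaviour in plain Python so the effect is easy to follow. The `active_if` defined here is a simplified stand-in written for this example, not the decorator the pipeline imports, and the `PARAMS` values and file names are assumptions.

```python
import functools

# Hypothetical parameters; in the pipeline these values come from pipeline.yml.
PARAMS = {"lisi_run": True, "scib_run": False}


def active_if(condition):
    """Simplified stand-in: run the task only when `condition` is truthy."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not condition:
                return None  # task is skipped
            return func(*args, **kwargs)
        return wrapper
    return decorator


@active_if(PARAMS.get("lisi_run", False))
def run_lisi(infile, outfile):
    return f"lisi: {infile} -> {outfile}"


@active_if(PARAMS.get("scib_run", False))
def run_scib_metrics(infile, outfile):
    return f"scib: {infile} -> {outfile}"


lisi_result = run_lisi("collated.txt", "logs/7_lisi.log")
scib_result = run_scib_metrics("collated.txt", "logs/scib.log")
print(lisi_result, scib_result)
```

The sketch uses `PARAMS.get(..., False)` so a missing key disables the task, whereas the diff indexes `PARAMS['lisi_run']` directly, which assumes the key is always present in the configuration.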