Introduction

This is the order of commands to replicate the analyses performed.

Whenever a step runs snakemake all, the pertinent output files are listed under rule all in that directory's Snakefile.

Many of the Snakefiles have include: statements at the top. Source code for those dependency Snakemake pipelines can be found in supplementary_snakemake_workflows.

The exact Python environment is described in py311.yml. To replicate this environment, run:

mamba env create -f py311.yml

Genotype Preparation

`Locus_Extraction`

The code in this sub-directory takes the Rahmioglu lead SNPs and creates BED files with a 500kb window on either side of each SNP (in both build 37 and build 38).

snakemake all
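
To illustrate the windowing step, here is a minimal sketch of turning a 1-based SNP position into a BED interval with 500kb of flank. The function names and the example coordinates are hypothetical and are not taken from the Snakefile. The sketch does not cover lifting positions between builds or clamping at the chromosome end.

```python
def snp_to_bed_window(chrom, pos, flank=500_000):
    """Return a (chrom, start, end) BED interval around a 1-based SNP position."""
    # BED starts are 0-based and ends are exclusive, so a 1-based position p
    # occupies [p - 1, p). Extend by `flank` on either side and clamp at 0.
    start = max(0, pos - 1 - flank)
    end = pos + flank
    return chrom, start, end


def write_bed(snps, path, flank=500_000):
    """Write one tab-separated BED line per (chrom, pos) pair in `snps`."""
    with open(path, "w") as handle:
        for chrom, pos in snps:
            c, s, e = snp_to_bed_window(chrom, pos, flank)
            handle.write(f"{c}\t{s}\t{e}\n")


# Hypothetical coordinates, for illustration only.
print(snp_to_bed_window("chr1", 1_000_000))  # ('chr1', 499999, 1500000)
```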

`1KG_LD_Testing`

Within the 500kb windows from the previous section, we use 1KG to find LD tag SNPs for each GWAS lead SNP. SNPs were considered tag SNPs if their R2 with the lead SNP was at least 0.1.

snakemake all
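
As a rough illustration of the tagging rule, the sketch below computes the squared Pearson correlation between genotype dosage vectors and keeps candidates at or above 0.1. This is an assumption about how the threshold could be applied, not the pipeline's code: the workflow may well compute LD with a dedicated tool, and the function names and toy dosages here are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined, treat as no LD
    return sxy * sxy / (sxx * syy)


def tag_snps(lead, candidates, threshold=0.1):
    """Return names of candidate SNPs whose r^2 with the lead SNP meets the threshold."""
    return [name for name, dosages in candidates.items()
            if r_squared(lead, dosages) >= threshold]


# Toy dosages (0/1/2 copies of the alternate allele), made up for illustration.
lead = [0, 1, 2, 1, 0, 2]
candidates = {
    "snp_a": [0, 1, 2, 1, 0, 2],  # identical to the lead SNP
    "snp_b": [0, 2, 1, 1, 2, 0],  # weakly correlated
    "snp_c": [1, 1, 1, 1, 1, 1],  # monomorphic
}
print(tag_snps(lead, candidates))
```

Genotype-level correlation is only an approximation of haplotype-level LD, so values from this sketch would not necessarily match those from an LD tool exactly.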

Cluster Preparation

`Cluster_Training`

This section details how we defined the clusters using endometriosis cases from the non-genotyped PMBB.

Make sure that Feature_Extraction/config_pheno.yaml is up to date
snakemake Pheno/FULL_PMBB_all_cleaned_phenos.csv
Notebook: construct_input_data_no_snps_full_PMBB.ipynb
Notebook: input_data_feature_figures.ipynb <- Figures S1 and S2
Notebook: cluster_method_testing.ipynb <- Figure 1, Table S2
Notebook: cluster_training_spectral_k5.ipynb <- Assigns clusters and creates the model pickle
Notebook: spectral_clustering_vis.ipynb <- Figure 2
Notebook: endo_subtype_cluster_tests.ipynb <- Figures 3 and 4, Tables S3 and S4

`Feature_Extraction`

This is where we extract the EHR features for the other three local datasets (PMBB, eMERGE, and UKBB). Also, this is where we examine the input feature prevalence among the datasets.

snakemake all <- Tables 1 and S5
Notebook: get_dataset_prevalences.ipynb <- Figure S2

Candidate Gene Association Testing

`Association_Testing`

In this top directory we first collect all of the phenotype and sample size information

snakemake all <- Tables 1 and 2

Then in each sub-directory for PMBB, eMERGE, and UKBB, we do:

snakemake sample_size_table.csv
snakemake all

For AOU, I uploaded the tarball summary stats downloaded from the workbench and ran the following scripts:

bash make_sumstats_dirs.sh
python get_aou_sample_sizes_from_sumstats.py

For BioVU, I was able to copy the summary stats from Box:

bash rclone_copy_from_box.sh
bash link_box_sumstats.sh
snakemake all (I had to do some post-processing and help James Jaworski with the random sampling for the negative control)

`Association_Testing_Negative_Control`

Handled the same way as the previous section, except that the overall endometriosis phenotype is excluded.

Meta-Analyses

`Meta_Analysis`

Here, we use plink --meta to meta-analyze the effects from the individual studies.

snakemake all <- prepares input files
Notebook: Perform_Meta_Analysis.ipynb
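
For readers unfamiliar with what the meta-analysis step computes, the sketch below shows a standard inverse-variance weighted, fixed-effect combination of per-study effect estimates. It assumes the fixed-effect estimate is the quantity of interest and is not PLINK's implementation; the function name and the example numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors, same order as betas
    Returns (pooled beta, pooled SE, z statistic, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]  # each study weighted by 1 / variance
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under a normal null
    return beta, se, z, p


# Made-up effects from two studies, for illustration only.
beta, se, z, p = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
print(round(beta, 4), round(se, 4))
```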

`Meta_Analysis_Negative_Control`

Same as previous section

Examining Cluster Genetic Heterogeneity

`Cluster_Heterogeneity`

Looking at genetic differences between the clusters, incorporating information from the positive and negative controls

Notebook: Test_Cluster_Heterogeneity.ipynb <- Figures 5, 6, S5, Table S6
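
One common way to quantify whether an effect differs between groups is Cochran's Q. The sketch below is offered only as an illustration of that general idea, on the assumption that each cluster has its own effect estimate and standard error for a SNP. It is not necessarily the test used in the notebook, and the function name and example values are hypothetical.

```python
def cochran_q(betas, ses):
    """Cochran's Q heterogeneity statistic for per-group effect estimates.

    Returns (Q, degrees of freedom, I^2). Under the null of a shared effect,
    Q approximately follows a chi-square distribution with len(betas) - 1 df.
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Made-up per-cluster effects, for illustration only.
q, df, i2 = cochran_q([0.1, 0.3], [0.1, 0.1])
print(round(q, 3), df, round(i2, 3))
```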

Testing Clustering Methodology in Other Datasets

`Cluster_Validation`

To evaluate robustness of the patterns we observed, we applied the same clustering method to the five genetic + EHR datasets.

Notebook: clustering_in_other_datasets.ipynb <- Local application in G-PMBB, eMERGE, UKBB
Script: clustering_script_for_other_data.py <- External application in AOU and BioVU
Notebook: compare_dataset_clusters.ipynb <- Comparing and visualizing results (Figure S4)


Setia-Verma-Lab/endometriosis_subtyping
