PMBB-Informatics-and-Genomics/pmbb-nf-toolkit-exwas-meta-analysis

Documentation for ExWAS Meta-Analysis

Module Overview

This module performs several types of statistical tests for meta-analyzing effects and/or p-values from exome-wide association studies (ExWASs). It covers both gene-burden (region-based) tests and single rare-variant tests, with analyses including Fisher’s tests and inverse variance-weighted meta-analysis.
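As a rough illustration of the two ideas named above (combining effects and combining p-values), here is a minimal Python sketch of a fixed-effect inverse variance-weighted meta-analysis and a sample-size-weighted Stouffer combination. It is not the pipeline's implementation: the function names, the two-sided normal p-value, and the square-root-of-N weights are assumptions chosen for the example, loosely matching the `*_inv_var_meta` and `*_stouffer_meta` columns in the output tables.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()  # standard normal distribution


def ivw_meta(betas, ses):
    """Fixed-effect inverse variance-weighted estimate from per-cohort betas and SEs."""
    weights = [1.0 / se ** 2 for se in ses]           # weight = 1 / variance
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = 2.0 * _NORM.cdf(-abs(z))                      # two-sided normal p-value
    return beta, se, p


def stouffer_meta(p_values, betas, sample_sizes):
    """Stouffer's Z with sqrt(N) weights (an assumed weighting scheme).

    p_values must lie in (0, 1]; the sign of each beta gives the direction.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    # Convert each two-sided p-value to a signed z-score.
    z_scores = [math.copysign(-_NORM.inv_cdf(p / 2.0), b)
                for p, b in zip(p_values, betas)]
    z_meta = (sum(w * z for w, z in zip(weights, z_scores))
              / math.sqrt(sum(w * w for w in weights)))
    p_meta = 2.0 * _NORM.cdf(-abs(z_meta))
    return z_meta, p_meta


if __name__ == "__main__":
    # Hypothetical per-cohort results for one variant (illustrative values only).
    print(ivw_meta([0.10, 0.30], [0.10, 0.10]))
    print(stouffer_meta([0.05, 0.05], [0.2, 0.4], [1000, 1000]))
```

Stouffer's method needs only p-values, effect directions, and sample sizes, so it can still be applied when a cohort does not report a usable standard error; the inverse variance-weighted estimate needs both a beta and a standard error from every cohort.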

Example Module Config File

Example nextflow.config File

Cloning Github Repository

  • Navigate to relevant workflow directory...

Software Requirements

Commands for Running the Workflow

  • Singularity Command: singularity build exwas_meta.sif docker://pennbiobank/exwas-meta:latest

  • Docker Command: docker pull pennbiobank/exwas-meta:latest

  • Pull from Google Container Registry: docker pull gcr.io/verma-pmbb-codeworks-psom-bf87/exwas-meta:latest

  • Run Command: nextflow run /path/to/toolkit/module/exwas_meta_analysis.nf

  • Common nextflow run flags:

    • -resume flag picks up workflow where it left off

    • -stub performs a dry run, checks channels without executing code

    • -profile selects the compute profiles in nextflow.config

    • -profile standard uses the Docker image to execute processes

    • -profile cluster uses the Singularity container and submits processes to a queue

    • -profile all_of_us uses the Docker image on All of Us Workbench

  • More info: Nextflow documentation

Detailed Pipeline Steps

Part I: Setup

  1. Start your own tools directory and go there. You may do this in your project analysis directory, but it often makes sense to clone into a general tools location.
# Make a directory to clone the pipeline into
TOOLS_DIR="/path/to/tools/directory"
mkdir -p "$TOOLS_DIR"
cd "$TOOLS_DIR"
  2. Download the source code by cloning from git
git clone https://github.com/PMBB-Informatics-and-Genomics/pmbb-nf-toolkit-exwas-meta-analysis.git pmbb-nf-toolkit-exwas-meta
cd "$TOOLS_DIR/pmbb-nf-toolkit-exwas-meta"
  3. Build the singularity image
    • You may call the image whatever you like and store it wherever you like. Just make sure you specify the name in nextflow.config.
    • This does NOT have to be done for every analysis, but it is good practice to re-build every so often, as we update regularly.
cd "$TOOLS_DIR/pmbb-nf-toolkit-exwas-meta"
singularity build exwas_meta.sif docker://pennbiobank/exwas-meta:latest

Part II: Configure your run

  1. Make a separate analysis/run/working directory.
    • The quickest way to get started is to run the analysis in the directory where the pipeline is stored. However, subsequent analyses will overwrite results from previous analyses.
    • ❗This step is optional, but we highly recommend keeping your tools directory separate from your run directory. We recommend storing nextflow.config in the tools directory, as it shouldn't change between runs.
WDIR="/path/to/analysis/run1"
mkdir -p "$WDIR"
cd "$WDIR"
  2. Fill out the nextflow.config file for your system.

    • See the Nextflow configuration documentation for information on how to configure this file. An example can be found on our GitHub: Nextflow Config.
    • ❗Importantly, you must configure a user-defined profile for each of your run environments (local, docker, saige, cluster, etc.). If multiple profiles are specified, run with a specific profile using nextflow run -profile $MY_PROFILE.
    • For Singularity, the profile's process.container attribute should be set to '/path/to/exwas_meta.sif' (replace /path/to with the location where you built the image above). See Nextflow Executor Information for more details.
    • ⚠️As this file remains mostly unchanged for your system, we recommend storing it in the tools/pipeline directory and passing it to the pipeline with -c /path/to/nextflow.config.
  3. Create a pipeline-specific .config file specifying your run parameters and input files. See below for workflow-specific parameters and what they mean.

    • Everything in here can be configured in nextflow.config; however, we find it easier to separate the system-level profiles from the individual run parameters.

    • Examples can be found in our Pipeline-Specific Example Config Files.

    • You can compartmentalize your config files as much as you like by passing more than one -c option.

    • There are 2 ways to specify the config file during a run:

      • with the -c option on the command line: nextflow run -c ExWAS_Meta_Analysis/exwas_meta_analysis.config
      • in nextflow.config: at the top of the file, add: includeConfig 'ExWAS_Meta_Analysis/exwas_meta_analysis.config'

Part III: Run your analysis

❗We HIGHLY recommend doing a STUB run to test the analysis using the -stub flag. This is a dry run to make sure your environment, parameters, and input files are specified and formatted correctly.

❗We also HIGHLY recommend doing a TEST run with the included test data in $TOOLS_DIR/pmbb-nf-toolkit-exwas-meta/test_data. We have several pre-configured analysis runs with input data and fully specified config files.

# run an exwas stub
nextflow run $TOOLS_DIR/pmbb-nf-toolkit-exwas-meta/exwas_meta_analysis.nf \
   -profile cluster \
   -c /path/to/nextflow.config \
   -c ExWAS_Meta_Analysis/exwas_meta_analysis.config \
   -stub

# run an exwas for real
nextflow run $TOOLS_DIR/pmbb-nf-toolkit-exwas-meta/exwas_meta_analysis.nf \
   -profile cluster \
   -c /path/to/nextflow.config \
   -c ExWAS_Meta_Analysis/exwas_meta_analysis.config

# resume an exwas run if it was interrupted or ran into an error
nextflow run $TOOLS_DIR/pmbb-nf-toolkit-exwas-meta/exwas_meta_analysis.nf \
   -profile cluster \
   -c /path/to/nextflow.config \
   -c ExWAS_Meta_Analysis/exwas_meta_analysis.config \
   -resume

Pipeline Parameters

Input Files for ExWAS_Meta-Analysis

  • ExWAS Singles Summary Statistics

    • Input summary statistics of the singles tests. They are expected to be organized in the directory from which you’re running the pipeline like so: “COHORT/Sumstats/PHENO.SUFFIX”. This matches output of your other workflows, but if you’re starting here, you may use the python script “scripts/set_up_cohort_directory_structure.py” to create symlinks with the correct structure

    • Type: Summary Statistics

    • Format: tsv.gz

    • File Header:

    #CHROM  BP      A1      A2      BETA    OR      SE      P       A1_FREQ N       N_CASES N_CONTROLS
    21      46482292        G       T       0.06082811129594274     1.0627162295714525      0.061462174537242044    0.48893376459755766     0.10478298166200951     49575   12384      37191
    21      38975062        T       C       0.17065668690623242     1.1860834811261391      0.705489644020694       0.7748787400623763      0.026454221287319335    49575   12384      37191
    21      44118844        A       C       0.1934617601789459      1.2134429838217287      0.09367821095381963     0.09458445160078595     0.2519137577341014      49575   12384      37191
    21      24024049        G       A       0.2981114423114272      1.3473119270688356      0.08850429395727123     0.0027432656092397415   0.07047045978001559     49575   12384      37191
    
    
  • ExWAS Regions Summary Statistics

    • Input summary statistics of the regions tests. They are expected to be organized in the directory from which you’re running the pipeline like so: “COHORT/Sumstats/PHENO.SUFFIX”. This matches output of your other workflows, but if you’re starting here, you may use the python script “scripts/set_up_cohort_directory_structure.py” to create symlinks with the correct structure

    • Type: Summary Statistics

    • Format: tsv.gz

    • File Header:

    BETA    OR      SE      P       N       N_CASES N_CONTROLS      REGION  MAX_MAF ANNOT
    0.10749207338367353     1.113482034566681       0.10861255486849287     0.48893376459755766     16082   5316    10766   ENSG00000160256 0.01    pLOF
    0.2752070632715491      1.3168033082414594      1.1376977756875413      0.7748787400623763      16082   5316    10766   ENSG00000160256 0.01    damaging_missense
    0.09866625123248854     1.103697880275314       0.04777625246679377     0.09458445160078595     16082   5316    10766   ENSG00000160256 0.01    other_missense
    0.1869359835432516      1.2055501075333186      0.05549816240001925     0.0027432656092397415   16082   5316    10766   ENSG00000160256 0.01    synonymous
    
    
  • Gene Location File

    • CSV file of

    • Type: Data Table

    • Format: tsv

    • File Header:

```
gene_id chromosome  seq_region_start    seq_region_end  gene_symbol
GENE1   1   1   90  GS1
GENE2   2   91  100 GS2
```
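If you are starting from summary statistics stored elsewhere, the expected “COHORT/Sumstats/PHENO.SUFFIX” layout can be built with symlinks. The sketch below is a hypothetical stand-in, not the repository's scripts/set_up_cohort_directory_structure.py: the function name, its arguments, and the assumption that the suffix is joined to the phenotype with a dot are all illustrative.

```python
from pathlib import Path


def link_sumstats(source_file, run_dir, cohort, pheno, suffix):
    """Symlink one summary-statistics file to run_dir/COHORT/Sumstats/PHENO.SUFFIX.

    Assumes the suffix is given without a leading dot (for example "singles.tsv.gz").
    """
    target_dir = Path(run_dir) / cohort / "Sumstats"
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / f"{pheno}.{suffix}"
    # Replace a stale link or file so the function can be re-run safely.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(Path(source_file).resolve())
    return link


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own cohort, phenotype, and file.
    source = Path("/path/to/existing/AFR_T2D_singles.tsv.gz")
    if source.exists():
        print(link_sumstats(source, "/path/to/analysis/run1", "AFR", "T2D", "singles.tsv.gz"))
```

Symlinks keep a single copy of each large file on disk while letting every run directory present the layout the pipeline expects.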

Output Files for ExWAS_Meta-Analysis

  • Meta-Analysis Sample Sizes

    • A table containing the maximum sample size of each of the meta-analyses based on the input cohorts and phenotypes. The actual numbers for each test may vary if there is missingness for certain variants, but this captures the largest sample size.

    • Type: Summary Table

    • Format: csv

    • File Header:

    ANALYSIS,PHENO,N_Samples
    AFR_EUR,AAA,31265
    AFR_EUR,AAA,31265
    AFR_EUR,BMI_median,38134
    AFR_EUR,BMI_median,38134
    
    
  • Singles Meta-Analysis Top Hits Table

    • A FILTERED top hits csv summary file of results including cohort, phenotype, gene, group annotation, p-values, and other counts. One single summary file will be aggregated from all the “top hits” in each “Singles (Variant) Summary Statistics” file.

    • Type: Summary Table

    • Format: csv

    • File Header:

    chr,pos,effect_allele,other_allele,analysis,phenotype,p_single_stouffer_meta,p_single_stouffer_N_eff,p_single_stouffer_N_studies,beta_single_inv_var_meta,se_single_inv_var_meta,p_single_inv_var_meta,N_eff_inv_var_meta,N_studies_inv_var_meta,p_single_chi2_stat,p_single_heterogeneity
    1,69745,T,C,ALL_M,AAA,0.579096,15172.0,2,-1.06221,1.91491,0.5790965097927716,15172.0,2,,
    1,930282,A,G,ALL_M,LDL_median,0.4106934,12364.0,2,-8.493,10.3236,0.410691052172855,12364.0,2,,
    1,935839,T,C,ALL,T2D,0.619276,39632.0,2,-0.334515,0.673235,0.6192757781492377,39632.0,2,,
    1,935849,C,G,ALL,T2D,0.1225267999999999,39632.0,2,-0.515683,0.333937,0.1225272089986704,39632.0,2,,
    
  • Singles Meta-Analysis Summary Statistics

    • A gzipped, unfiltered TSV (tab-separated) file of the results for the variant (singles) analysis. One file will be created for each unique Cohort, Phenotype, and analysis (regular, cauchy, rare, ultra rare) combination.

    • Type: Summary Statistics

    • Format: tsv.gz

    • File Header:

    phenotype|chromosome|base_pair_location|variant_id        |other_allele|effect_allele|effect_allele_count|effect_allele_frequency|missing_rate|beta      |standard_error|t_statistic|variance|p_value   |p_value_na|is_spa_test|allele_freq_case|allele_freq
    T2Diab   |21        |41801254          |21_41801254_TCTG_T|TCTG        |T            |277                |0.0046611              |0.0         |-0.099231 |0.167775      |-3.52526   |35.5258 |0.5542179 |0.5542179 |False      |0.00426841      |0.00474126
    T2Diab   |21        |41801360          |21_41801360_C_T   |C           |T            |41                 |0.00068991             |0.0         |-0.864441 |0.633121      |-3.98228   |5.38237 |0.08606924|0.08606924|False      |0.000297796     |0.000769948
    T2Diab   |21        |41801603          |21_41801603_C_T   |C           |T            |24                 |0.00040385             |0.0         |0.322923  |0.570593      |0.991852   |3.07148 |0.5714322 |0.5714322 |False      |0.000496327     |0.000384974
    T2Diab   |21        |41801645          |21_41801645_G_A   |G           |A            |58                 |0.000975971            |0.0         |0.0167811 |0.35132       |0.135962   |8.10206 |0.9619027 |0.9619027 |False      |0.00109192      |0.000952304
    
      * Parallel By: Cohort, Phenotype
    
  • Singles Meta-Analysis QQ Plots

    • A QQ Plot of the Null Model vs Log10P results of the analysis for variants. One plot will be created for each unique combination of phenotype, cohort, annotation group (pLof, etc.), and MAF threshold.

    • Type: QQ Plot

    • Format: png

      • Parallel By: Cohort, Phenotype
  • Singles Meta-Analysis Manhattan Plots

    • A dot plot (manhattan plot) of significant variants associated with a phenotype. One plot will be created for each unique combination of phenotype, cohort, annotation group (pLof, etc.), and MAF threshold.

    • Type: Manhattan Plot

    • Format: png

      • Parallel By: Cohort, Phenotype
  • Regions Meta-Analysis Top Hits Table

    • A FILTERED top hits csv summary file of results including cohort, phenotype, gene, group annotation, p-values, and other counts. One single summary file will be aggregated from all the “top hits” in each “Regions Summary Statistics” file.

    • Type: Summary Table

    • Format: csv

    • File Header:

    region,annot_group,max_maf,chr,pos_start,pos_stop,gene_symbol,analysis,phenotype,p_burden_stouffer_meta,p_burden_stouffer_N_eff,p_burden_stouffer_N_studies,beta_burden_inv_var_meta,se_burden_inv_var_meta,p_burden_inv_var_meta,N_eff_inv_var_meta,N_studies_inv_var_meta,p_burden_chi2_stat,p_burden_heterogeneity
    ENSG00000000419,damaging_missense,0.0001,20,50934867,50959140,DPM1,Leave_EUR_Out,T2D,0.606368639793734,11702.0,3,0.0336482650736499,0.0653029773095249,0.6063686397937349,11702.0,3,,
    ENSG00000000419,damaging_missense,0.0001,20,50934867,50959140,DPM1,AFR_EUR,LDL_median,0.4348806366999341,24879.0,2,-0.2393382100454541,0.3751452681678022,0.5234814383226609,24879.0,2,38.57666608752835,
    ENSG00000000419,damaging_missense,0.0001,20,50934867,50959140,DPM1,ALL_F,LDL_median,0.0654224944953294,12515.0,2,-0.9459153671090688,0.5391298446427648,0.0793410420455428,12515.0,2,25.260017289413383,
    ENSG00000000419,damaging_missense,0.0001,20,50934867,50959140,DPM1,ALL_M,AAA,0.4794196005233119,15172.0,2,-0.0451279677642891,0.0638088900735573,0.4794196005233122,15172.0,2,,
    
    
  • Regions Meta-Analysis Summary Statistics

    • A gzipped, unfiltered TSV (tab-separated) file of the results for the gene (regions) analysis if run. One file will be created for each unique Cohort, Phenotype, and analysis (regular, cauchy, rare, ultra rare) combination.

    • Type: Summary Statistics

    • Format: tsv.gz

    • File Header:


```
phenotype|gene           |annot            |max_maf|p_value           |p_value_burden    |p_value_skat      |beta_burden        |se_burden         |mac   |mac_case|mac_control|rare_var_count|ultrarare_var_count
T2Diab   |ENSG00000141956|pLoF             |0.0001 |0.0479451461682565|0.0479451461682565|0.0479451461682565|0.0588652858997042 |0.0297621953331829|12.0  |5.0     |7.0        |0.0           |9.0
T2Diab   |ENSG00000141956|pLoF             |0.001  |0.0479451461682565|0.0479451461682565|0.0479451461682565|0.0588652858997042 |0.0297621953331829|12.0  |5.0     |7.0        |0.0           |9.0
T2Diab   |ENSG00000141956|pLoF             |0.01   |0.0479451461682565|0.0479451461682565|0.0479451461682565|0.0588652858997042 |0.0297621953331829|12.0  |5.0     |7.0        |0.0           |9.0
T2Diab   |ENSG00000141956|damaging_missense|0.0001 |0.464219450219203 |0.464219450219203 |0.464219450219203 |-0.0110759683619445|0.0151328276810456|52.0  |7.0     |45.0       |0.0           |41.0
```

      • Parallel By: Cohort, Phenotype

  • Regions Meta-Analysis QQ Plots

    • A QQ Plot of the Null Model vs Log10P results of the analysis for gene regions. One plot will be created for each unique combination of phenotype, cohort, annotation group (pLof, etc.), and MAF threshold.

    • Type: QQ Plot

    • Format: png

      • Parallel By: Cohort, Phenotype, Annot Group, MAF

  • Regions Meta-Analysis Manhattan Plots

    • A dot plot (manhattan plot) of significant gene regions associated with a phenotype. One plot will be created for each unique combination of phenotype, cohort, annotation group (pLof, etc.), and MAF threshold.

    • Type: Manhattan Plot

    • Format: png

      • Parallel By: Cohort, Phenotype, Annot Group, MAF
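
For readers who want to reproduce a QQ plot from one of the summary-statistics files listed above, the coordinates can be computed as follows. This is a minimal sketch under assumptions, not the pipeline's plotting code: the function name is hypothetical, and the expected quantiles use the common (rank - 0.5) / n convention, which may differ from what the workflow uses.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first.

    p_values must lie in (0, 1]; expected values assume uniform p-values under the null.
    """
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n                 # midpoint plotting position
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


if __name__ == "__main__":
    # Illustrative p-values only.
    for expected, obs in qq_points([0.5, 0.05, 0.005]):
        print(f"{expected:.3f}\t{obs:.3f}")
```

Points that rise well above the diagonal (observed greater than expected) across most of the range suggest inflation, while a departure only in the upper tail is what a few true associations would look like.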

Parameters for ExWAS_Meta-Analysis

Post-Processing

  • region_plot_pcol (Type: String)

    • One of three values: p_burden, p_skat, or p_skato. While all possible p-values will be utilized for meta-analyses, this flag chooses which will be plotted. Can be left null (defaults to p_skato)
  • gene_location_file (Type: File Path)

    • This file is used for getting gene-based coordinates for plotting.

    • Corresponding Input File: Gene Location File

      • CSV file of

      • Type: Data Table

      • Format: tsv

      • File Header:

    ```
    gene_id chromosome  seq_region_start    seq_region_end  gene_symbol
    GENE1   1   1   90  GS1
    GENE2   2   91  100 GS2
    ```

Pre-Processing

  • singles_effect_cols (Type: Map (Dictionary))

    • A map with three keys: beta, se, and p_value, where the values are the corresponding test statistic columns in the singles files.
  • skato_p_col (Type: String)

    • The name of the SKAT-O test p-value from the ExWAS summary stats. It can be set to null if you don’t want to meta-analyze these p-values.
  • skat_p_col (Type: String)

    • The name of the SKAT test p-value from the ExWAS summary stats. It can be set to null if you don’t want to meta-analyze these p-values.
  • burden_cols (Type: Map (Dictionary))

    • A map with three keys: beta, se, and p_value, where the values are the corresponding gene burden test statistic columns in the regions files.
  • singles_info_cols (Type: Map (Dictionary))

    • A map with seven keys: chr, pos, effect_allele, other_allele, n, n_case, and n_control, where the values are the corresponding test information columns in the singles files.
  • regions_info_cols (Type: Map (Dictionary))

    • A map with six keys: region, annot_group, max_maf, n, n_case, and n_control, where the values are the corresponding test information columns in the regions files.
  • singles_sumstats_suffix (Type: String)

    • Suffix for singles files from the ExWAS summary stats

    • Corresponding Input File: ExWAS Singles Summary Statistics

      • Input summary statistics of the singles tests. Relative to the directory from which you run the pipeline, they are expected to be organized as “COHORT/Sumstats/PHENO.SUFFIX”. This matches the output of your other workflows; if you are starting here, you can use the Python script “scripts/set_up_cohort_directory_structure.py” to create symlinks with the correct structure.

      • Type: Summary Statistics

      • Format: tsv.gz

      • File Header:

      #CHROM  BP      A1      A2      BETA    OR      SE      P       A1_FREQ N       N_CASES N_CONTROLS
      21      46482292        G       T       0.06082811129594274     1.0627162295714525      0.061462174537242044    0.48893376459755766     0.10478298166200951     49575   12384      37191
      21      38975062        T       C       0.17065668690623242     1.1860834811261391      0.705489644020694       0.7748787400623763      0.026454221287319335    49575   12384      37191
      21      44118844        A       C       0.1934617601789459      1.2134429838217287      0.09367821095381963     0.09458445160078595     0.2519137577341014      49575   12384      37191
      21      24024049        G       A       0.2981114423114272      1.3473119270688356      0.08850429395727123     0.0027432656092397415   0.07047045978001559     49575   12384      37191
      
      
  • regions_sumstats_suffix (Type: String)

    • Suffix for regions files from the ExWAS summary stats

    • Corresponding Input File: ExWAS Regions Summary Statistics

      • Input summary statistics of the regions tests. Relative to the directory from which you run the pipeline, they are expected to be organized as “COHORT/Sumstats/PHENO.SUFFIX”. This matches the output of your other workflows; if you are starting here, you can use the Python script “scripts/set_up_cohort_directory_structure.py” to create symlinks with the correct structure.

      • Type: Summary Statistics

      • Format: tsv.gz

      • File Header:

      BETA    OR      SE      P       N       N_CASES N_CONTROLS      REGION  MAX_MAF ANNOT
      0.10749207338367353     1.113482034566681       0.10861255486849287     0.48893376459755766     16082   5316    10766   ENSG00000160256 0.01    pLOF
      0.2752070632715491      1.3168033082414594      1.1376977756875413      0.7748787400623763      16082   5316    10766   ENSG00000160256 0.01    damaging_missense
      0.09866625123248854     1.103697880275314       0.04777625246679377     0.09458445160078595     16082   5316    10766   ENSG00000160256 0.01    other_missense
      0.1869359835432516      1.2055501075333186      0.05549816240001925     0.0027432656092397415   16082   5316    10766   ENSG00000160256 0.01    synonymous
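
To make the expected “COHORT/Sumstats/PHENO.SUFFIX” layout concrete, here is a minimal Python sketch that symlinks one file into that structure. It is not the toolkit's scripts/set_up_cohort_directory_structure.py; the function name link_sumstats, the source file name, and the cohort and phenotype used in the demonstration are assumptions for illustration. The suffix is assumed to include its leading dot, as in the example config.

```python
import tempfile
from pathlib import Path


def link_sumstats(source_file, run_dir, cohort, pheno, suffix):
    """Symlink one summary statistics file to RUN_DIR/COHORT/Sumstats/PHENO+SUFFIX."""
    target_dir = Path(run_dir) / cohort / "Sumstats"
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / f"{pheno}{suffix}"
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link from an earlier run
    link.symlink_to(Path(source_file).resolve())
    return link


# Demonstration in a throwaway directory with a hypothetical cohort and phenotype
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "raw" / "t2d_singles.tsv.gz"
    source.parent.mkdir()
    source.write_bytes(b"")  # empty stand-in for a real summary statistics file
    run_dir = Path(tmp) / "run1"
    link = link_sumstats(source, run_dir, "PMBB_AFR_ALL", "T2D", ".exwas_singles.saige.gz")
    relative = link.relative_to(run_dir).as_posix()
    print(relative)
```

Calling the function once per cohort, phenotype, and suffix would build the full layout under the run directory.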
      
      

Workflow

  • my_python (Type: File Path)

    • Path to the python executable to be used for python scripts - often it comes from the docker/singularity container (/opt/conda/bin/python)
  • analyses (Type: Map (Dictionary))

    • Map of lists where keys are meta-analysis group nicknames and lists are groups of cohorts to include in that meta-analysis. This allows for multiple combinations of meta-analyses, for example, all cohorts of one sex or ancestry, or a leave-one-biobank-out analysis.

  • bin_pheno_list (Type: List)

    • Binary phenotype list
  • quant_pheno_list (Type: List)

    • Quantitative phenotype list

Configuration and Advanced Workflow Files

Example Config File Contents (From Path)

params {
    // Map the overall "analyses" (meta-analysis combinations) to cohort/study population lists
    analyses = [
        'AFR_EUR': ['PMBB_AFR_ALL', 'PMBB_EUR_ALL'],
        'ALL': ['PMBB_AFR_ALL', 'PMBB_EUR_ALL', 'PMBB_EAS_ALL', 'PMBB_AMR_ALL', 'PMBB_SAS_ALL'],
        'ALL_M': ['PMBB_AFR_M', 'PMBB_EUR_M', 'PMBB_EAS_M', 'PMBB_AMR_M', 'PMBB_SAS_M'],
        'ALL_F': ['PMBB_AFR_F', 'PMBB_EUR_F', 'PMBB_EAS_F', 'PMBB_AMR_F', 'PMBB_SAS_F'],
        'Leave_EUR_Out': ['PMBB_AFR_ALL', 'PMBB_EAS_ALL', 'PMBB_AMR_ALL', 'PMBB_SAS_ALL']
    ]

    // Executable for python
    my_python = '/opt/conda/bin/python'

    // Lists of phenotypes
    bin_pheno_list =  ['T2D', 'AAA']
    quant_pheno_list = ['LDL_median', 'BMI_median']

    // Pre- and Post-Processing Params: suffixes of the input summary statistics files (they usually start with '.')
    regions_sumstats_suffix = '.exwas_regions.saige.gz'
    singles_sumstats_suffix = '.exwas_singles.saige.gz'

    // Top-Hits tables will be filtered to this p-value
    p_cutoff_summarize = 0.00001

    // When plotting regions, choose p-values from different tests:
    // Possible values are p_burden, p_skat, and p_skato
    region_plot_pcol = 'p_burden'

    // this is for getting ENSEMBL gene symbols and coordinates for summary stats and plotting
    // tab-separated, columns include: gene_id, chromosome, seq_region_start, seq_region_end, gene_symbol
    gene_location_file = '/path/to/data/homo_sapiens_111_b38.txt'

    // Meta-Analysis Test Info
    regions_info_cols = [
        'region': 'gene',
        'annot_group': 'annot',
        'max_maf': 'max_maf',
        'n': 'N',
        'n_case': 'N_case',
        'n_control': 'N_ctrl'
    ]

    // Single-Variant Test Info
    singles_info_cols = [
        'chr': 'chromosome',
        'pos': 'base_pair_location',
        'effect_allele': 'effect_allele',
        'other_allele': 'other_allele',
        'n': 'n',
        'n_case': 'n_case',
        'n_control': 'n_ctrl'
    ]

    // set any of these column parameters to null 
    // if you don't want to meta-analyze those effects
    burden_cols = ['beta': 'beta_burden', 'se': 'se_burden', 'p_value' : 'p_value_burden']
    skat_p_col = 'p_value_skat'
    skato_p_col = null
    singles_effect_cols = ['beta': 'beta', 'se': 'standard_error', 'p_value': 'p_value']
}
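
The burden_cols, skat_p_col, and skato_p_col settings above name the columns that are meta-analyzed. As a rough illustration of the two test families named in the module overview (inverse variance-weighted meta-analysis for effects and Fisher's method for p-values), here is a minimal Python sketch. It is not the pipeline's own code: the function names ivw_meta and fisher_meta and the input numbers are hypothetical, the IVW model is assumed to be fixed-effect, and Fisher's method assumes independent p-values in (0, 1].

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse variance-weighted meta-analysis of per-cohort effects."""
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, p


def fisher_meta(pvalues):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom (2k):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical per-cohort burden effects, standard errors, and SKAT p-values
beta, se, p_burden = ivw_meta([0.1, 0.3], [0.1, 0.1])
p_skat = fisher_meta([0.5, 0.5])
print(f"IVW beta={beta:.4f} se={se:.4f} p={p_burden:.4g}; Fisher p={p_skat:.4f}")
```

Fisher's method needs only p-values, which is why it suits tests such as SKAT and SKAT-O that report a p-value column, whereas the IVW estimate needs both beta and se.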

Current Dockerfile for Container/Image

FROM continuumio/miniconda3
WORKDIR /app

# biofilter version argument
ARG BIOFILTER_VERSION=2.4.3

RUN apt-get update \
    # install packages needed to install biofilter and NEAT-plots
    && apt-get install -y --no-install-recommends libz-dev g++ gcc git wget tar unzip make \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    # install python packages needed for pipeline
    && conda install -y -n base -c conda-forge wget libtiff conda-build scipy pandas seaborn matplotlib numpy apsw sqlite \
    && conda clean --all --yes \
    # install NEAT-plots
    && git clone https://github.com/PMBB-Informatics-and-Genomics/NEAT-Plots.git \
    && mv NEAT-Plots/manhattan-plot/ /app/ \
    && conda develop /app/manhattan-plot/ \
    # install biofilter
    && wget https://github.com/RitchieLab/biofilter/releases/download/Biofilter-${BIOFILTER_VERSION}/biofilter-${BIOFILTER_VERSION}.tar.gz -O biofilter.tar.gz \
    && tar -zxvf biofilter.tar.gz --strip-components=1 -C /app \
    && /opt/conda/bin/python setup.py install \
    # make biofilter executable
    && chmod a+rx /app/biofilter.py \
    # remove biofilter tarball and NEAT-plots directory
    && rm -R biofilter.tar.gz NEAT-Plots

USER root

Current nextflow.config contents

includeConfig 'exwas_meta_analysis.config'

profiles {
    // 'awsbatch-or-lsf-or-slurm-etc' is a placeholder: replace it with your executor (e.g. 'awsbatch', 'lsf', or 'slurm')
    non_docker_dev {
        process.executor = 'awsbatch-or-lsf-or-slurm-etc'
        process.queue = 'epistasis_normal'
        process.memory = '15GB'
    }

    standard {
        process.executor = 'awsbatch-or-lsf-or-slurm-etc'
        process.container = 'guarelin/exwas_meta:latest'
        docker.enabled = true
    }

    cluster0 {
        process.executor = 'awsbatch-or-lsf-or-slurm-etc'
        process.queue = 'epistasis_normal'
        process.memory = '15GB'
        process.container = 'exwas_meta.sif'
        singularity.enabled = true
        singularity.runOptions = '-B /root/,/directory/,/names/'
    }

    all_of_us {
        process.executor = 'awsbatch-or-lsf-or-slurm-etc'
        process.memory = '15GB'
        process.container = 'gcr.io/ritchie-aou-psom-9015/exwas_meta:latest'
        docker.enabled = true
    }
}

params {
    skip_postprocessing_errors = true
}

process {
    withLabel: safe_to_skip {
        errorStrategy=params.skip_postprocessing_errors ? 'ignore' : 'terminate'
    }
}

Detailed Pipeline Steps

Detailed Steps for Running One of our Pipelines

Note: test data were obtained from the SAIGE GitHub repo (https://github.com/saigegit/SAIGE).

Part I: Setup

  1. Start your own tools directory and go there. You may do this in your project analysis directory, but it often makes sense to clone into a general tools location.

```sh
# Make a directory to clone the pipeline into
TOOLS_DIR="/path/to/tools/directory"
mkdir -p "$TOOLS_DIR"
cd "$TOOLS_DIR"
```

  2. Download the source code by cloning from git.

```sh
git clone https://github.com/PMBB-Informatics-and-Genomics/pmbb-nf-toolkit-saige-family.git
cd ${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/
```

  3. Build the saige.sif Singularity image.
    • You may call the image whatever you like and store it wherever you like. Just make sure you specify the name in nextflow.config.
    • This does NOT have to be done for every SAIGE-based analysis, but it is good practice to rebuild every so often, as we update regularly.

```sh
cd ${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/
singularity build saige.sif docker://pennbiobank/saige:latest
```

Part II: Configure your run

  1. Make a separate analysis/run/working directory.
    • The quickest way to get started is to run the analysis in the folder that contains the pipeline. However, subsequent analyses will overwrite results from previous analyses.
    • ❗This step is optional, but we highly recommend keeping your tools directory separate from your run directory. The only items that need to be in the run directory are the nextflow.config file and the ${workflow}.conf file.
WDIR="/path/to/analysis/run1"
mkdir -p $WDIR
cd $WDIR
  2. Fill out the nextflow.config file for your system.

    • See the Nextflow configuration documentation for information on how to configure this file. An example can be found on our GitHub: Nextflow Config.
    • ❗Importantly, you must configure a user-defined profile for your run environment (local, docker, saige, cluster, etc.). If multiple profiles are defined, run with a specific profile using nextflow run -profile ${MY_PROFILE}.
    • For Singularity, the profile's process.container attribute should be set to '/path/to/saige.sif' (replace /path/to with the location where you built the image above). See Nextflow Executor Information for more details.
    • ⚠️As this file remains mostly unchanged for your system, we recommend storing it in the tools/pipeline directory and symlinking it to your run directory.
  3. Create a pipeline-specific .config file specifying your run parameters and input files. See below for workflow-specific parameters and what they mean.

    • Everything in here can be configured in nextflow.config; however, we find it easier to separate the system-level profiles from the individual run parameters.
    • Examples can be found in our Pipeline-Specific Example Config Files.
    • you can compartmentalize your config file as much as you like by passing
    • There are 2 ways to specify the config file during a run:
      • with the -c option on the command line: nextflow run -c /path/to/workflow.conf
      • in nextflow.config: at the top of the file, add includeConfig '/path/to/workflow.conf'

Part III: Run your analysis

  • ❗We HIGHLY recommend doing a STUB run to test the analysis using the -stub flag. This is a dry run to make sure your environment, parameters, and input files are specified and formatted correctly.
  • ❗We HIGHLY recommend doing a test run with the included test data in ${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/test_data
  • In the test_data/ directory for each pipeline, we have several pre-configured analysis runs with input data and fully specified config files.
# run an exwas stub
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf -stub
# run an exwas for real
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf
# resume an exwas run if it was interrupted or ran into an error
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf -resume
