PRScsx is a tool for calculating polygenic scores. It integrates GWAS summary statistics and LD reference panels across multiple ancestry groups to improve cross-population polygenic prediction. This Nextflow pipeline seamlessly takes GWAS summary statistics, builds the input files for PRScsx, runs PRScsx, concatenates the chromosome-separated outputs, and produces a summary violin plot, all in a parallelized fashion.
Command:

git clone https://github.com/PMBB-Informatics-and-Genomics/geno_pheno_workbench.git

Then navigate to the relevant workflow directory.

Singularity Command:

singularity build prscsx.sif docker://pennbiobank/prscsx:latest

Docker Command:

docker pull pennbiobank/prscsx:latest

Pull from Google Container Registry:

docker pull gcr.io/verma-pmbb-codeworks-psom-bf87/prscsx:latest

Run Command:

nextflow run /path/to/toolkit/module/prscsx.nf
Common nextflow run flags:

- `-resume` picks up the workflow where it left off
- `-stub` performs a dry run, checking channels without executing code
- `-profile` selects a compute profile from nextflow.config
- `-profile standard` uses the Docker image to execute processes
- `-profile cluster` uses the Singularity container and submits processes to a queue
- `-profile all_of_us` uses the Docker image on the All of Us Workbench

More info: Nextflow documentation
my_prscsx

Path to the PRScsx.py script; in the container this is /app/PRScsx/PRScsx.py.

Type: Script

Format: py
GWAS Summary Statistics

Genome-wide association study summary statistic files.

Type: Summary Statistics

Format: txt

File Header:

chromosome base_pair_location variant_id effect_allele other_allele beta standard_error p_value effect_allele_frequency ci_lower odds_ratio ci_upper
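As a quick sanity check before a run, you can peek at a sumstats file with pandas. This is a minimal sketch; the file name is hypothetical, the delimiter is sniffed automatically, and the columns are those listed above:

```python
import pandas as pd

# hypothetical file name; sep=None lets pandas sniff the delimiter
df = pd.read_csv("my_gwas_sumstats.txt", sep=None, engine="python")

# confirm the expected header fields are present before launching the pipeline
expected = ["chromosome", "base_pair_location", "variant_id", "effect_allele",
            "other_allele", "beta", "standard_error", "p_value"]
print(df[expected].head())
```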
Multi-ancestry SNP File

Multi-ancestry SNP information file with ancestry-specific allele frequencies and allele flips. The SNPs match those in the LD panels. This can be downloaded from the PRS-CSx GitHub: https://github.com/getian107/PRScsx

Type: LD Panels

Format: txt

File Header:

CHR SNP BP A1 A2 FRQ_AFR FRQ_AMR FRQ_EAS FRQ_EUR FRQ_SAS FLP_AFR FLP_AMR FLP_EAS FLP_EUR FLP_SAS
LD Panels

Genetic-ancestry-specific linkage disequilibrium (LD) panels. These can be generated from 1000 Genomes (1KG) or UK Biobank (UKBB) data and must correspond to one of the five 1000 Genomes super populations (AFR, AMR, EAS, EUR, SAS). They can be downloaded from the PRS-CSx GitHub: https://github.com/getian107/PRScsx

Type: LD Panels

Format: hdf5
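If you want to inspect a downloaded panel, h5py (which ships in the pipeline's container) can list its contents. A minimal sketch, assuming a hypothetical 1KG EUR panel path; the internal layout of the PRScsx LD files may differ by release:

```python
import h5py

# hypothetical path to one per-chromosome LD block file from the PRScsx download
with h5py.File("ldblk_1kg_eur/ldblk_1kg_chr22.hdf5", "r") as f:
    f.visit(print)  # print the name of every group/dataset in the file
```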
Violin Plots

Summary violin plots of the PRScsx outputs.

Type: Distribution Plot

Format: png
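The pipeline generates these plots itself; for intuition, a comparable plot can be drawn from a concatenated score table with seaborn (a hedged sketch; the file and column names below are hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical table of SNP weights labeled by ancestry group
df = pd.read_csv("all_snp_weights.txt", sep="\t")
sns.violinplot(data=df, x="ANCESTRY", y="PGS")  # one violin per ancestry group
plt.savefig("pgs_violin.png", dpi=300)
```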
SNP Weight Files

Variant-level scores produced by PRScsx, for all chromosomes.

Type: PGS SNP Weights

Format: txt

File Header:

CHR RSID POS A1 A2 PGS

Parallel By: Cohort, Phenotype, Ancestry
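The pipeline performs the chromosome-level concatenation for you; conceptually the step looks something like this (a sketch with hypothetical file names, assuming the per-chromosome files lack a header row, as raw PRS-CSx output does):

```python
import pandas as pd

cols = ["CHR", "RSID", "POS", "A1", "A2", "PGS"]
# hypothetical per-chromosome outputs, e.g. chromosomes 21-22 as in the test config
chrom_files = [f"prscsx_output_chr{c}.txt" for c in (21, 22)]
all_weights = pd.concat(
    (pd.read_csv(f, sep="\t", names=cols) for f in chrom_files),
    ignore_index=True,
)
all_weights.to_csv("all_snp_weights.txt", sep="\t", index=False)
```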
`my_prscsx` (Type: File Path) - Path to the PRScsx.py script; in the container this is /app/PRScsx/PRScsx.py.

`seed` (Type: String) - Non-negative integer that seeds the random number generator. Any value works; the example config uses 7.

`write_posterior_samples` (Type: Bool (Java: true or false)) - If True and meta_flag is also True, write all posterior samples of meta-analyzed SNP effect sizes after thinning. Default is False.

`meta_flag` (Type: Bool (Java: true or false)) - If True, return combined SNP effect sizes across populations using an inverse-variance-weighted meta-analysis of the population-specific posterior effect size estimates. Default is False.

`mcmc_thinning_factor` (Type: String) - Thinning factor of the Markov chain. Default is 5.

`mcmc_burnin` (Type: String) - Number of burn-in iterations. Default is 500 * the number of discovery populations.

`mcmc_iterations` (Type: String) - Total number of MCMC iterations. Default is 1,000 * the number of discovery populations.

`param_phi` (Type: String) - Global shrinkage parameter phi; if NOT fixing it, specify 'auto'. When phi is not fixed, it is learnt from the data using a fully Bayesian approach, which usually works well for polygenic traits with very large GWAS sample sizes (hundreds of thousands of subjects). For GWAS with limited sample sizes (including most current disease GWAS), fixing phi to 1e-02 (for highly polygenic traits) or 1e-04 (for less polygenic traits), or doing a small-scale grid search (e.g., phi = 1e-06, 1e-04, 1e-02, 1) to find the optimal value in a validation dataset, often improves predictive performance. Values should be written as 1e-02 (make sure to include the zero after the minus sign).
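To make the grid-search advice concrete, here is a hedged sketch of picking phi in a validation set. It assumes you have already run the pipeline once per candidate phi and scored your validation individuals; every file and column name below is hypothetical:

```python
import pandas as pd
from scipy.stats import pearsonr

# hypothetical phenotype table with columns ID and PHENO
pheno = pd.read_csv("validation_phenotypes.csv")

best_phi, best_r2 = None, -1.0
for phi in ["1e-06", "1e-04", "1e-02", "1"]:
    # hypothetical per-phi score files with columns ID and PGS
    scores = pd.read_csv(f"validation_pgs_phi{phi}.csv")
    merged = pheno.merge(scores, on="ID")
    r, _ = pearsonr(merged["PGS"], merged["PHENO"])
    if r ** 2 > best_r2:
        best_phi, best_r2 = phi, r ** 2

print(f"best phi = {best_phi} (validation R^2 = {best_r2:.3f})")
```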
`param_b` (Type: String) - Parameter b in the gamma-gamma prior. Default is 0.5.

`param_a` (Type: String) - Parameter a in the gamma-gamma prior. Default is 1.

`ld_directory` (Type: File Path) - FULL FILE PATH to the directory containing the LD panels and the multi-ancestry SNP info file. You MUST use LD panels from the PRScsx GitHub because their RSIDs match the multi-ancestry SNP file that is specific to PRScsx. Include a trailing slash (/) after the directory name. To use a relative path, add "${launchDir}/" in front of the path.

`genome_build` (Type: String) - Genome build of the summary statistics; can be "b37" (build 37/hg19) or "b38" (build 38/hg38).

`sumstats_colnames` (Type: Map (Dictionary)) - Dictionary specifying the column names in your sumstats files: the chromosome, position, A1, and A2 columns, the quantitative-phenotype effect size column (beta or odds ratio), the binary-phenotype effect size column (beta or odds ratio), and the significance column (p-value or standard error). This assumes all sumstats files share the same column names. Do not change the keys (chr_colname, pos_colname, etc.); only change the values.
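As a sanity check, you can verify that the values in this map actually appear in a sumstats file before launching the pipeline. A minimal sketch; the file name is hypothetical and the values are the defaults from the example config:

```python
import pandas as pd

# values mirror the example config; replace them with your own headers
sumstats_colnames = {
    "chr_colname": "#CHROM",
    "pos_colname": "BP",
    "a1_colname": "A1",
    "a2_colname": "A2",
    "quant_beta_or_colname": "BETA",
    "bin_beta_or_colname": "OR",
    "pval_se_colname": "P",
}

df = pd.read_csv("my_gwas_sumstats.txt", sep=None, engine="python")  # hypothetical file
missing = [c for c in sumstats_colnames.values() if c not in df.columns]
if missing:
    raise ValueError(f"sumstats file is missing expected columns: {missing}")
```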
`ld_panel_type` (Type: String) - LD panel type: 1000 Genomes or UK Biobank. Label 1000 Genomes as "1kg" and UKBB as "ukbb".

`prscsx_group_cohort_map` (Type: Map (Dictionary)) - PRScsx computes scores for multiple ancestry groups (defined by LD reference panels) at the same time, and cohort names may include ancestry labels (especially when stitching a GWAS workflow to PRScsx). This map links a "score group nickname" (key) to a list of cohort names (values) from the descriptor table; PRScsx then computes scores simultaneously by score group and phenotype. Because PRScsx can only include one analysis per ancestry group, DO NOT duplicate ancestry groups within a score group. A pre-flight check for this is sketched below.
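Because duplicated ancestry groups within a score group are an easy mistake to make, a quick check like this can help (a sketch; the column names are the defaults from the example config and the descriptor file name comes from the test config):

```python
import pandas as pd

desc = pd.read_csv("test_descriptor_file.csv")

prscsx_group_cohort_map = {
    "PRScsx_test_data": ["PRScsx_test_data1", "PRScsx_test_data2",
                         "PRScsx_test_data3", "PRScsx_test_data4"],
}

for group, cohorts in prscsx_group_cohort_map.items():
    ancestries = desc.loc[desc["COHORT"].isin(cohorts), "LD_PANEL_ANCESTRY"]
    if ancestries.duplicated().any():
        raise ValueError(f"score group '{group}' uses an ancestry panel more than once")
```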
`input_descriptor_table_colnames` (Type: Map (Dictionary)) - Dictionary giving the names of the essential columns in the input descriptor table: the cohort (nickname or dataset), LD reference panel ancestry (AFR, AMR, EAS, EUR, SAS), phenotype, GWAS sample size, and summary statistics FULL FILE PATH AND FILENAME columns. (To use a relative path, add "${launchDir}/" in front of the path.)

`input_descriptor_table_filename` (PRS-CSx) (Type: File Path) - File that specifies the cohort, ancestry, phenotype, GWAS sample size, and summary statistics filename. It must have five columns in this order: cohort, ancestry, phenotype, GWAS sample size, summary statistics filename. MUST be a CSV (comma delimited); a sketch of building one follows.
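For illustration, a descriptor table could be built like this (a hedged sketch; every cohort, phenotype, and path below is hypothetical, and the header names are the defaults from the example config):

```python
import pandas as pd

# columns in the required order: cohort, ancestry, phenotype, sample size, sumstats path
desc = pd.DataFrame({
    "COHORT": ["PMBB_AFR", "PMBB_EUR"],
    "LD_PANEL_ANCESTRY": ["AFR", "EUR"],
    "PHENOTYPE": ["LDL", "LDL"],
    "SAMPLE_SIZE": [15000, 30000],
    "SUMSTATS_FILEPATH": ["/path/to/ldl_afr.txt", "/path/to/ldl_eur.txt"],
})
desc.to_csv("test_descriptor_file.csv", index=False)  # comma delimited, as required
```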
`gwas_meta_workflow_stitching` (Type: Bool (Java: true or false)) - Whether you are stitching another workflow to this one, specifically a GWAS or meta-analysis that runs BEFORE PRScsx. Only set this to true when stitching a workflow that runs before PRScsx; do NOT set it to true when stitching PRScsx to a workflow that runs AFTER it.

`my_python` (Type: File Path) - Path to the python executable used for python scripts; often it comes from the docker/singularity container (/opt/conda/bin/python).
// set up params
// can be customized by user
params {
// python path
my_python = "/opt/conda/bin/python" // path to python in docker/singularity container
// PRScsx path
my_prscsx = "/app/PRScsx/PRScsx.py" // path to PRScsx.py script in docker/singularity container (If you use your own, please use v1.1.0 or later to match the options used in the script)
// whether an individual is stitching another workflow to this one, specifically, stitching a GWAS or meta analysis to this workflow
// true or false
// DO NOT put true if stitching PRScsx to a workflow AFTER. Only put true if stitching to a workflow BEFORE PRScsx
gwas_meta_workflow_stitching = false
// list of chromosomes
chromosome_list = (21..22)
/*
file that specifies the cohort, ancestry, phenotype, GWAS sample size, and summary statistics filename
must have five columns: cohort, ancestry, phenotype, GWAS sample size, summary statistics filename
cohort: a nickname, for example, "AFR_ALL", "AFR_M", "PMBB"
ld_ref_ancestry: LD reference panel to be used in PRScsx, specified by 1000 genomes (1KG) genetically inferred ancestry group (AFR, AMR, EAS, EUR, SAS)
sumstats_full_filepath: specify summary statistics filenames with FULL file path (If you want to use a relative path, add "${launchDir}/" in front of the path)
MUST BE A CSV (comma delimited)
descriptor filename:
*/
input_descriptor_table_filename='test_descriptor_file.csv'
/*
dictionary including names of essential columns in input descriptor table
specifies cohort, ancestry of LD reference panel, phenotype, sample size, and summary statistic FULL FILE PATH AND FILENAME column names
*/
input_descriptor_table_colnames = [cohort_colname: 'COHORT',
ld_ref_ancestry_colname: 'LD_PANEL_ANCESTRY',
phenotype_colname: 'PHENOTYPE',
sample_size_colname: 'SAMPLE_SIZE',
sumstats_full_filepath_colname: 'SUMSTATS_FILEPATH']
/*
prscsx computes scores for multiple ancestry groups (defined by LD reference panels) at the same time
it is possible that cohort names can include ancestry labels (especially when stitching a GWAS workflow to PRScsx)
this maps a "score group nickname" (key) to cohort names (values) included in descriptor table
because prscsx can only include one analysis from each ancestry group, DO NOT duplicate ancestry groups in analysis groups
thus, prscsx will compute scores simultaneously by score group and phenotype
keys should be score groups and values should be a list of cohorts in each score group
*/
prscsx_group_cohort_map = ['PRScsx_test_data': ['PRScsx_test_data1','PRScsx_test_data2','PRScsx_test_data3','PRScsx_test_data4']]
/*
dictionary that specifies the column names in your sumstats files
specifies chromosome column, position column, A1 column, A2 column, quantitative phenotype effect size column (beta or odds ratio), binary phenotype effect size column (beta or odds ratio), and significance column (pvalue or standard error)
this assumes that all sumstats files have the same column names
do not change the keys (chr_colname, pos_colname, etc), only change the value accordingly
*/
sumstats_colnames = [chr_colname: '#CHROM', // chromosome column name in sumstats file
pos_colname: 'BP', // base pair/position column name in sumstats file
a1_colname: 'A1', // A1 or alternate (ALT) allele column name in sumstats file
a2_colname: 'A2', // A2 or reference (REF) allele column name in sumstats file
quant_beta_or_colname: 'BETA', // beta or odds ratio column name for quantitative phenotypes in sumstats file (either works)
bin_beta_or_colname: 'OR', // beta or odds ratio column name for binary phenotypes in sumstats file (either works)
pval_se_colname: 'P'] // p-value or standard error column name in sumstats file (either works)
/*
genome build of summary statistics
can be build 37 (b37/hg19) or build 38 (b38/hg38)
specify build 37 as "b37" and build 38 as "b38"
*/
genome_build='b38'
/*
FULL FILE PATH to directory with LD panels and Multi-Ancestry SNP info file
you MUST use LD panels from the PRScsx github because they have RSIDs matching a multi-ancestry SNP file that is specific to PRScsx
(https://github.com/getian107/PRScsx)
you must include a slash (/) after the directory name
If you want to use a relative path, add "${launchDir}/" in front of the path
*/
ld_directory = 'LD_Reference/'
// LD panel types: 1000 genomes or UKBiobank
// 1000 genomes label: 1kg
// UKBB label: ukbb
ld_panel_type = '1kg'
// true/false statement to update output genome build to b38 (PRScsx outputs it to b37)
// must be all lower case with no quotes (true or false)
b38_output_genome_build = true
// other prscsx parameters
param_a = '1' // Parameter a in the gamma-gamma prior. Default is 1
param_b = '0.5' // Parameter b in the gamma-gamma prior. Default is 0.5
param_phi = 'auto' /* if NOT using, specify as 'auto'.
if using: Global shrinkage parameter phi.
If phi is not specified, it will be learnt from the data using a fully Bayesian approach.
This usually works well for polygenic traits with very large GWAS sample sizes (hundreds of thousands of subjects).
For GWAS with limited sample sizes (including most of the current disease GWAS),
fixing phi to 1e-02 (for highly polygenic traits)
or 1e-04 (for less polygenic traits),
or doing a small-scale grid search (e.g., phi = 1e-6, 1e-4, 1e-2, 1)
to find the optimal phi value in the validation dataset often improves predictive performance.
format should be 1e-02 (make sure to include the zero after the minus) */
mcmc_iterations = '2000' // Total number of MCMC iterations. Default is 1,000 * the number of discovery populations
mcmc_burnin = '1000' // Number of burnin iterations. Default is 500 * the number of discovery populations
mcmc_thinning_factor = '5' // Thinning factor of the Markov chain. Default is 5
meta_flag = 'False' /* If True,
return combined SNP effect sizes across populations
using an inverse-variance-weighted meta-analysis of the population-specific posterior effect size estimates.
Default is False. */
write_posterior_samples = 'False' // If True and META_FLAG also True, write all posterior samples of meta-analyzed SNP effect sizes after thinning. Default is False.
seed = '7' // Non-negative integer which seeds the random number generator. This can be any number, I always pick 7.
}
FROM continuumio/miniconda3
WORKDIR /app
# copy local SNP maps (match SNPs in LD panels) into docker image
COPY snpinfo_mult_1kg_hm3_map_b37_b38 /app/snpinfo_mult_1kg_hm3_map_b37_b38
COPY snpinfo_mult_ukbb_hm3_map_b37_b38 /app/snpinfo_mult_ukbb_hm3_map_b37_b38
# give SNP maps read permissions
RUN chmod +r /app/snpinfo_mult_1kg_hm3_map_b37_b38 \
&& chmod +r /app/snpinfo_mult_ukbb_hm3_map_b37_b38 \
# install tools needed to install PRScsx
&& apt-get update \
&& apt-get install -y --no-install-recommends git \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
# download PRScsx
&& git clone --depth 1 -b master https://github.com/getian107/PRScsx.git \
# install python packages needed for pipeline
&& conda install -y -n base -c conda-forge -c bioconda scipy=1.11.4 h5py pandas seaborn matplotlib pathlib python=3.11 \
&& conda clean --all --yes
USER root
includeConfig "${launchDir}/prscsx.config"
// set up profile
// change these parameters as needed
profiles {
non_docker_dev {
process.executor = 'awsbatch-or-lsf-or-slurm-etc' // replace with your executor
}
standard {
process.executor = 'awsbatch-or-lsf-or-slurm-etc' // replace with your executor
process.container = 'pennbiobank/prscsx:latest'
docker.enabled = true
}
cluster {
process.executor = 'awsbatch-or-lsf-or-slurm-etc' // replace with your executor
process.queue = 'epistasis_normal'
process.memory = '30GB'
process.container = 'prscsx.sif'
singularity.enabled = true
singularity.runOptions = '-B /root/,/directory/,/names/'
}
all_of_us {
// CHANGE EVERY TIME! These are specific for each user, see docs
google.lifeSciences.serviceAccountEmail = '[email protected]'
workDir = '/path/to/workdir/' // can be gs://
google.project = 'your-terra-project-id' // Terra project ID
// These should not be changed unless you are an advanced user
process.container = 'gcr.io/verma-pmbb-codeworks-psom-bf87/prscsx:latest' // GCR PRScsx docker container (static)
// these are AoU, GCR parameters that should NOT be changed
process.memory = '15GB' // minimum memory per process (static)
process.executor = 'awsbatch-or-lsf-or-slurm-etc' // replace with your executor
google.zone = "us-central1-a" // AoU compute runs in us-central1 (static)
google.location = "us-central1"
google.lifeSciences.debug = true
google.lifeSciences.network = "network"
google.lifeSciences.subnetwork = "subnetwork"
google.lifeSciences.usePrivateAddress = false
google.lifeSciences.copyImage = "gcr.io/google.com/cloudsdktool/cloud-sdk:alpine"
google.enableRequesterPaysBuckets = true
// google.lifeSciences.bootDiskSize = "20.GB" // probably don't need this
}
}
Note: test data were obtained from the SAIGE GitHub repo.
- Start your own tools directory and go there. You may do this in your project analysis directory, but it often makes sense to clone into a general `tools` location.

# Make a directory to clone the pipeline into
TOOLS_DIR="/path/to/tools/directory"
mkdir $TOOLS_DIR
cd $TOOLS_DIR

- Download the source code by cloning from git:
git clone https://github.com/PMBB-Informatics-and-Genomics/pmbb-nf-toolkit-saige-family.git
cd ${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/

- Build the `saige.sif` singularity image.
  - You may call the image whatever you like and store it wherever you like; just make sure you specify the name in `nextflow.config`.
  - This does NOT have to be done for every saige-based analysis, but it is good practice to re-build every so often, as we update regularly.

cd ${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/
singularity build saige.sif docker://pennbiobank/saige:latest

- Make a separate analysis/run/working directory.
  - The quickest way to get started is to run the analysis in the folder where the pipeline was cloned; however, subsequent analyses will overwrite results from previous analyses.
  - ❗This step is optional, but we HIGHLY recommend keeping a `tools` directory separate from your `run` directory. The only items that need to be in the run directory are the `nextflow.config` file and the `${workflow}.config` file.

WDIR="/path/to/analysis/run1"
mkdir -p $WDIR
cd $WDIR
- Fill out the `nextflow.config` file for your system.
  - See the Nextflow configuration documentation for information on how to configure this file. An example can be found on our GitHub: Nextflow Config.
  - ❗IMPORTANTLY, you must configure a user-defined profile for your run environments (local, docker, saige, cluster, etc.). If multiple profiles are specified, run with a specific profile using `nextflow run -profile ${MY_PROFILE}`.
  - For singularity, the profile's attribute `process.container` should be set to `'/path/to/saige.sif'` (replace `/path/to` with the location where you built the image above). See the Nextflow Executor Information for more details. ⚠️ As this file remains mostly unchanged for your system, we recommend storing it in the `tools` pipeline directory and symlinking it into your run directory.
- Create a pipeline-specific `.config` file specifying your run parameters and input files. See below for workflow-specific parameters and what they mean.
  - Everything in here can be configured in `nextflow.config`; however, we find it easier to separate the system-level profiles from the individual run parameters.
  - Examples can be found in our Pipeline-Specific Example Config Files.
  - You can compartmentalize your config file as much as you like by passing additional config files (e.g. via `includeConfig`).
  - There are 2 ways to specify the config file during a run:
    - with the `-c` option on the command line: `nextflow run -c /path/to/workflow.config`
    - in `nextflow.config`: at the top of the file, add `includeConfig '/path/to/workflow.config'`
- ❗We HIGHLY recommend doing a STUB run to test the analysis using the `-stub` flag. This is a dry run to make sure your environment, parameters, and input files are specified and formatted correctly.
- ❗We HIGHLY recommend doing a test run with the included test data in `${TOOLS_DIR}/pmbb-nf-toolkit-saige-family/test_data`.
  - In the `test_data/` directory for each pipeline, we have several pre-configured analysis runs with input data and fully-specified config files.
# run an exwas stub
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf -stub
# run an exwas for real
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf
# resume an exwas run if it was interrupted or ran into an error
nextflow run /path/to/pmbb-nf-toolkit-saige-family/workflows/saige_exwas.nf -profile cluster -c /path/to/run1/exwas.conf -resume