Author: Wei Zhu ([email protected])
Date: 2023-08-21 14:30:53
This new Snakemake workflow is to annotate variants using AutoGVP. It splits the input vcf files in multiple chunks as the original pipeline. The main AutoGVP worflow followed Jung's design (/data/CGB_share/autoGVP_submit.sh @ biowulf).
#!/bin/bash
#SBATCH --time=2-00:00:00
#SBATCH -o ${PWD}/snakemake.%j.out
#SBATCH -e ${PWD}/snakemake.%j.err
module load singularity
mkdir -p TMP
export TMPDIR=TMP
# conda activate snakemake
snakemake --profile workflow/profiles/biowulf --verbose -p --use-conda --jobs 400 --default-resources "mem_mb=10000 " --use-envmodules --use-singularity --singularity-args " -B /vf,/spin1,/data,/fdb,/gpfs " --latency-wait 120 -T 0 -s workflow/Snakefile --configfile $1
### Create the conda env AutoGVP
conda env create -f PLP_prediction_workflow/environment.yaml
### activate the workflow
conda activate AutoGVP
### Launch the workflow
sbatch -J hkbc --export=ALL --mem=12g -p norm -o ${PWD}/slurm-%j.out -e ${PWD}/slurm-%j.err --time=24:00:00 --wrap='./run_it2.sh config/HKBC_Dec2024.yaml'
- /data/GenoMEL/AutoGVP/AutoGVP/data
- Homo_sapiens_assembly19.fasta
- clinvar_20231230.vcf.gz
- clinvar_20231230.vcf.gz
- variant_summary.txt.gz
- ClinVar-selected-submissions.tsv
- /data/GenoMEL/PLP_prediction_workflow [working directory]
- run_it2.sh
- config/genomel_config.yaml
- workflow/Snakefile
- output/
- merge_call
As the original AutoGVP pipeline has not been designed and tested for hg19, we made some minor revisions in certain steps.
There is no special requirement by AutoGVP for the VCF input files: it could one or multiple-part VCF file(s) and the parent folder of the vcf file(s) should be specified as vcf_input_dir in the configure file.
If the input files are individual VCF file for each sample/subject, those VCF files should be merged before employing this workflow. Generally, we do not want repeat the annotation process for each sample; it is more efficient to apply annotation on the joint variant call output.
We removed the original annotation and the format columns to reduce file size and avoid any conflict due to the existing annotations. This step has been built in the workflow.
bcftools view -e 'ALT="*"' -Ou {input} |bcftools norm -m-both -Ou --threads {threads} | bcftools norm -f {params.ref} | bcftools annotate -Ou -x ID -I +"%CHROM:%POS:%REF:%ALT" --threads {threads} | bcftools annotate -Oz -x FORMAT,^INFO/AC,^INFO/AF,^INFO/AN -o {output.vcf}
tabix -p vcf {output.vcf}
The module "VEP/104" is used for the Vep annotation at Biowulf, as suggested by Jung. We introduced parameter {params.grc} ("GRCh37" is used for hg19):
- --fasta $VEP_CACHEDIR/{params.grc}.fa
- --assembly {params.grc}
In this step, we used locally installed InterVar for the standard configure setting (where instervar databases under intervardb were used):
In the workflow, we had introduced the setting from the module annovar to reduce the dependencies on the local settings at Biowulf:
/home/zhuw10/git/InterVar-2.2.1/Intervar.py -i {input.vcf} --input_type=VCF -o {params.prefix} -b {params.genome} -t /home/zhuw10/git/InterVar-2.2.1/intervardb -d $ANNOVAR_DATA/{params.genome} --table_annovar=$ANNOVAR_HOME/table_annovar.pl --convert2annovar=$ANNOVAR_HOME/convert2annovar.pl --annotate_variation=$ANNOVAR_HOME/annotate_variation.pl
As InterVar-2.2.1 biowulf module is available now, we replaced the locally install version with the new biowulf module adaptation:
rule intervar_ann:
input:
vcf = output_dir+"/vep_ann/{chunk}_VEP.vcf"
output:
multiext(output_dir+"/intervar_ann/{chunk}_VEP." + genome + "_multianno", ".txt", ".txt.grl_p", ".txt.intervar")
envmodules: "annovar", "intervar/2.2.1"
params:
genome=genome,
prefix= output_dir+"/intervar_ann/{chunk}_VEP"
shell: """
Intervar.py -i {input.vcf} --input_type=VCF -o {params.prefix} -b {params.genome} -t $INTERVAR_DATA/intervardb -d $ANNOVAR_DATA/{params.genome} --table_annovar=$ANNOVAR_HOME/table_annovar.pl --convert2annovar=$ANNOVAR_HOME/convert2annovar.pl --annotate_variation=$ANNOVAR_HOME/annotate_variation.pl
"""
We made a revision at data/autoPVS1_from_VEP_vcf.py to introduce genome_version option back to the python script:
### clone autoPVS1 repo under PLP_prediction_workflow and copy the revised code to replace the original one
git clone https://github.com/d3b-center/D3b-autoPVS1.git
cp data/autoPVS1_from_VEP_vcf.py D3b-autoPVS1/
# python /data/CGB_share/autopvs1_wz/autoPVS1_from_VEP_vcf.py --genome_version {params.genome} --vep_vcf {input.vcf} > {output}
python ./D3b-autoPVS1/autoPVS1_from_VEP_vcf.py --genome_version {params.genome} --vep_vcf {input.vcf} > {output}
Besides, we had copied the reference genomes to the folder /data/CGB_share/autopvs1_wz/data, and revised the file config.ini accordingly for autopvs1.
We used the annovar module installed at Biowulf with the data sets "--protocol gnomad211_exome,gnomad211_genome", which are opted for the WES data on hg19.
π this setting needs to be customized for other studies in the future.
To run AutoGVP, latest ClinVar data need to be downloaded and prepared.
cd /data/GenoMEL/AutoGVP
git clone [email protected]:diskin-lab-chop/AutoGVP.git
wget -nc -e robots=off --reject "index.html*" -r --level=1 -nd -np -A "clinvar_20231230.vcf.gz*" -P data/ https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/submission_summary.txt.gz -P data/
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz -P data/
ls -al data/
total 514410
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jan 4 11:02 .
drwxr-sr-x 2 zhuw10 GenoMEL 4096 Jan 4 10:56 ..
-rw-r--r-- 1 zhuw10 GenoMEL 91217146 Dec 31 02:44 clinvar_20231230.vcf.gz
-rw-r--r-- 1 zhuw10 GenoMEL 126 Dec 31 02:44 clinvar_20231230.vcf.gz.md5
-rw-r--r-- 1 zhuw10 GenoMEL 522144 Dec 31 02:44 clinvar_20231230.vcf.gz.tbi
-rw-r--r-- 1 zhuw10 GenoMEL 211364657 Dec 30 19:27 submission_summary.txt.gz
-rw-r--r-- 1 zhuw10 GenoMEL 223649945 Dec 31 00:33 variant_summary.txt.gz
### run select-clinVar-submissions_hg19.R
# to have ClinVar-selected-submissions.tsv
module load singularity
singularity pull docker://pgc-images.sbgenomics.com/naqvia/autogvp:latest
singularity exec -B $PWD:/share --pwd /share autogvp_latest.sif Rscript ./AutoGVP/scripts/select-clinVar-submissions_hg19.R --variant_summary data/variant_summary.txt.gz --submission_summary data/submission_summary.txt.gz --outdir data/
After all the input files prepared for the use of hg19, AutoGVP is ready to be launched via container docker://pgc-images.sbgenomics.com/naqvia/autogvp:latest
The final merged tab-delimited files are generated under output. The format is as described, with one additional column "vid" (e.g., 1:12198:G:C).
tree output/merge_call/
output/merge_call/
βββ genomel.autogvp_abridged.tsv
βββ genomel.autogvp_full.tsv
- Table 1. Comparison of the annotation tools used between AutoGVP and the original P/LP prediction workflow.
Annotation Tool | AutoGVP | Original P/LP prediction pipeline |
---|---|---|
VEP | VEP/104 (/fdb/VEP/104/cache) | NA |
snpEff | NA | snpEff/5.1d |
annovar | annovar/2020-06-08 | annovar/2020-06-08 |
--protocols | gnomad211_exome,gnomad211_genome | refGene,knownGene,ensGene,exac03nontcga,gnomad211_exome,gnomad211_genome,esp6500siv2_all,1000g2015aug_all,clinvar_20210123,dbnsfp41a,dbscsnv11,spliceai_filtered,spidex,cosmic92_coding |
InterVar | Version 2.2.1 | Version 2.2.1 |
ClinVar | clinvar_20231230 | clinvar_20210123 (via annovar) |
AutoPSV1 | v2.0-2-g7fb1be9 | NA |
π It is noted that annotation approaches are significantly different between the two workflows, but the same version of InterVar were used. One of the differences to be highlighted is the data source of the ClinVar: clinvar_20210123 was used and annotated variants via AnnoVar in the original P/LP prediction workflow, whereas clinvar_20231230 was used directly by AutoGVP.
AutoGVP has been updated due to a revison in the recent ClinVar release, and a new docker image is released:
- docker://pgc-images.sbgenomics.com/diskin-lab/autogvp:v1.0.1
Accordingly, we updated our Snakemake workflow and also added one example to run AutoGVP with the hg38 reference genome.
cd /data/GenoMEL/AutoGVP
mkdir clinvar_20240603
### Download the latest data
wget -nc -e robots=off --reject "index.html*" -r --level=1 -nd -np -A "clinvar_20240603.vcf.gz*" -P clinvar_20240603/ https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/submission_summary.txt.gz -P clinvar_20240603/
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz -P clinvar_20240603/
ls -al clinvar_20240603/
total 623525
drwxr-s---. 2 zhuw10 GenoMEL 4096 Jun 4 10:07 .
drwxr-sr-x. 2 zhuw10 GenoMEL 4096 Jun 4 10:01 ..
-rw-r-----. 1 zhuw10 GenoMEL 101041151 Jun 4 09:08 clinvar_20240603.vcf.gz
-rw-r-----. 1 zhuw10 GenoMEL 126 Jun 4 09:08 clinvar_20240603.vcf.gz.md5
-rw-r-----. 1 zhuw10 GenoMEL 534810 Jun 4 09:08 clinvar_20240603.vcf.gz.tbi
-rw-r-----. 1 zhuw10 GenoMEL 260853747 Jun 3 14:31 submission_summary.txt.gz
-rw-r-----. 1 zhuw10 GenoMEL 276058849 Jun 4 09:07 variant_summary.txt.gz
### Run select-clinVar-submissions.R
module load singularity
# https://github.com/diskin-lab-chop/AutoGVP
singularity pull docker://pgc-images.sbgenomics.com/diskin-lab/autogvp:v1.0.1
### Output data/ClinVar-selected-submissions.tsv
singularity exec -B $PWD:/share --pwd /share autogvp_v1.0.1.sif Rscript ./AutoGVP/scripts/select-clinVar-submissions.R --variant_summary clinvar_20240603/variant_summary.txt.gz --submission_summary clinvar_20240603/submission_summary.txt.gz --outdir clinvar_20240603/
# Warning message:
# In left_join(., variant_summary_df, by = "VariationID", multiple = "all", :
# Detected an unexpected many-to-many relationship between `x` and `y`.
# βΉ Row 25499 of `x` matches multiple rows in `y`.
# βΉ Row 2 of `y` matches multiple rows in `x`.
# βΉ If a many-to-many relationship is expected, set `relationship =
# "many-to-many"` to silence this warning.
ls -altr clinvar_20240603/
total 2115825
-rw-r----- 1 zhuw10 GenoMEL 260853747 Jun 3 14:31 submission_summary.txt.gz
-rw-r----- 1 zhuw10 GenoMEL 276058849 Jun 4 09:07 variant_summary.txt.gz
-rw-r----- 1 zhuw10 GenoMEL 534810 Jun 4 09:08 clinvar_20240603.vcf.gz.tbi
-rw-r----- 1 zhuw10 GenoMEL 126 Jun 4 09:08 clinvar_20240603.vcf.gz.md5
-rw-r----- 1 zhuw10 GenoMEL 101041151 Jun 4 09:08 clinvar_20240603.vcf.gz
drwxr-sr-x 2 zhuw10 GenoMEL 4096 Jun 4 10:01 ..
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 10:14 .
-rw-r----- 1 zhuw10 GenoMEL 1528115185 Jun 4 10:14 ClinVar-selected-submissions.tsv
- /data/GenoMEL/PLP_prediction_workflow
- config/UKB_June2024.yaml
# given a list vcf file under the folder
vcf_input_dir: "/data/GenoMEL/UKB/final_vcf"
### output_dir is sufficient to keep data unique and output_prefix is to for the merged file
output_dir: "/data/GenoMEL/PLP_prediction_workflow/output_June2024"
output_prefix: "UKB"
# ref: "/data/GenoMEL/AutoGVP/AutoGVP/data/Homo_sapiens_assembly19.fasta"
ref: /data/zhuw10/ukbb/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa
genome: "hg38"
genome_GRC: "GRCh38"
clinvar:
dir: "/data/GenoMEL/AutoGVP/clinvar_20240603"
vcf: "clinvar_20240603.vcf.gz"
selected_submission: "ClinVar-selected-submissions.tsv"
variant_summary: "variant_summary.txt.gz"
submission_summary: "submission_summary.txt.gz"
conceptIDs: "/data/GenoMEL/AutoGVP/AutoGVP/data/clinvar_cpg_concept_ids.txt"
split_total: 100
conda activate snakemake
cd /data/GenoMEL/PLP_prediction_workflow
snakemake -np -s workflow/Snakefile --configfile config/UKB_June2024.yaml --verbose
sbatch -J autogvs --export=ALL --mem=12g -p norm -o ${PWD}/slurm-%j.out -e ${PWD}/slurm-%j.err --time=24:00:00 --wrap='./run_it2.sh config/UKB_June2024.yaml'
ls -alt output_June2024/
total 14158
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:12 merge_call
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:10 .
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:08 autogvp
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:06 intervar_ann
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:04 vep_ann
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:03 annovar_ann
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 11:01 autopvs1
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 10:39 splitted
-rw-r----- 1 zhuw10 GenoMEL 6844 Jun 4 10:28 UKB.vcf.gz.tbi
-rw-r----- 1 zhuw10 GenoMEL 14489781 Jun 4 10:28 UKB.vcf.gz
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 10:26 prep
drwxr-s--- 2 zhuw10 GenoMEL 4096 Jun 4 10:26 ..
csvtk cut -t -f autogvp_call output_June2024/merge_call/UKB.autogvp_abridged.tsv | csvtk del-header | sort | uniq -c
7966 Benign
12038 Likely_benign
976 Likely_pathogenic
819 Pathogenic
1102177 Uncertain_significance
csvtk cut -t -f autogvp_call_reason output_June2024/merge_call/UKB.autogvp_abridged.tsv | csvtk del-header | sort | uniq -c
14536 ClinVar
1109440 InterVar
csvtk pretty -t output_april2024/merge_call/UKB.autogvp_abridged.tsv| less -S
csvtk filter2 -t -f '$autogvp_call == "Pathogenic" ' output_June2024/merge_call/UKB.autogvp_abridged.tsv