This repository contains CHM13v2-aligned 1000 Genomes Project (1KGP) variant data (Rhie et al 2022) that have been computationally phased using SHAPEIT5 (Hofmeister et al 2022). Phased panels (both unrelated 2504 member panels and full 3202 member panels) are available at the T2T/HPRC aws bucket.
Code used to produce @andrew-bortvin's T2T-CHM13 recombination maps can be found at https://github.com/mccoy-lab/1kgp_chm13_maps
All intermediate and summary dataframes generated by scripts/create_summary_dataframes.py, along with all resources necessary to replicate publication figures has been uploaded to zenodo at this link. Please cite the zenodo repository if you find this panel to be useful in your work. A preprint is in the works, and once it is up we will have a proper citation for you.
As part of this work, we created a SHAPEIT5 branch that is able to determine the number of flip errors and true switch errors alongside the number of switch errors. The publication utilized commit 585af40.
We also created a small python script to accurately lift SNPs and indels between GRCh38 and CHM13v2.0, modifying Ref/Alt alleles as necessary to reflect the differences between reference genomes. The publication used v0.1.
- resources: bed files and vcf files used to create and benchmark this panel.
- pedigrees: 1000 Genomes Project pedigree files (and subsets of that pedigree)
- sample_subsets: text files containing different groups of samples used for annotating or producing subsets of vcf files.
- recombination maps
- t2t_lifted_from_hapmap: HapMap recombination maps lifted from GRCh38 recombination map.
- t2t_native_scaled_maps: T2T recombination maps generated from pyrho analysis of 1KGP T2T variant panel by @andrew-bortvin.
- scripts: various scripts used to evaluate performance and gather GLIMPSE2_concordance or SHAPEIT5_concordance output into single dataframes.
- notebooks: notebooks used to generate figures for publication and calculate individual statistics that are quoted in the text of the manuscript.
- 1KGP variants called against T2Tv2.0: T2T_1KGP_calls
- T2T reference fasta: chm13v2.0.fa.gz
- HPRC Pangenome sample variation in T2T coordinates: hprc-v1.1-mc-chm13.vcfbub.a100k.wave.vcf.gz
- HPRC Pangenome sample variation in GRCh38 coordinates: hprc-v1.1-mc-chm13.vcfbub.a100k.wave.vcf.gz
- SHAPEIT5 5.1.1 release files: SHAPEIT5 v5.1.1 release
Performance was measured by comparing the phased haplotypes of 39 samples shared between the Human Pangenome Reference Consortium (HPRC)'s draft human pangenome and the 1000 Genotypes dataset, or the 61 samples shared between HGSVC3 released assemblies and the 1000 Genomes Project.
Panel | Source of ground truth assembly | Trio Membership | Switch Error Rate (%) | Flip Error Rate (%) | True Switch Error Rate (%) | Genotype Discordance Rate (%) |
---|---|---|---|---|---|---|
GRCh38 | HPRC | Proband | 0.408 | 0.186 | 0.029 | 1.114 |
GRCh38 | HGSVC3 | Proband | 0.504 | 0.232 | 0.030 | 1.333 |
GRCh38 | HGSVC3 | Parent | 0.705 | 0.303 | 0.080 | 1.457 |
GRCh38 | HGSVC3 | Non-trio | 1.285 | 0.487 | 0.257 | 1.443 |
T2T-CHM13 | HPRC | Proband | 0.355 | 0.166 | 0.019 | 0.918 |
T2T-CHM13 | HGSVC3 | Proband | 0.428 | 0.202 | 0.019 | 1.153 |
T2T-CHM13 | HGSVC3 | Parent | 0.495 | 0.235 | 0.020 | 1.277 |
T2T-CHM13 | HGSVC3 | Non-Trio | 1.110 | 0.474 | 0.130 | 1.253 |
Information on how to generate recombination maps can be found at https://github.com/mccoy-lab/1kgp_chm13_maps. This repository contains the portion of the analysis devoted to producing and evaluating a phased T2T-CHM13 reference panel.
Please note: These scripts were written to run on a server with > 500GB of RAM. Researchers who wish to replicate this work on their personal computers should download the summary dataframes and run notebooks/make_plots.ipynb and/or notebooks/calc_figures_for_paper.ipynb
To replicate, first clone this repository and enter the folder:
git clone https://www.github.com/JosephLalli/phasing_T2T
cd phasing_T2T
Next, ensure you have the requisite resource files downloaded. Go to the Zenodo repository linked to above and download resources_for_phasing_and_verification.tar.gz This file should then be untarred into the phasing_T2T folder. In addition, download the variant calls you will be phasing and/or evaluating.
wget resources_for_phasing_and_verification.tar.gz
tar -xvzf resources_for_phasing_and_verification.tar.gz .
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/variants/1000_Genomes_Project/chm13v2.0/all_samples_3202/1KGP.CHM13v2.0.chr22.recalibrated.snp_indel.pass.vcf.gz
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/variants/1000_Genomes_Project/chm13v2.0/all_samples_3202/1KGP.CHM13v2.0.chr22.recalibrated.snp_indel.pass.vcf.gz.tbi
The main script used to generate this panel can be found in scripts/phase_and_assess_panel.sh. It is designed to work in an environment specified by the working folder that the script is stored in.
Ensure all scripts and SHAPEIT files are executable, and then simply run that script. By default it should recognize the resource files in your folder.
chmod +x scripts/*.sh SHAPEIT*
mv scripts/* .
num_threads=16 #Or however many threads you want to work with
./phase_and_assess_panel.sh chr22 $num_threads test_phasing T2T
Hopefully, the script runs to completion and you have generated a great number of files in your phasing_T2T folder. Note that at one point the script tries to run a docker container. If docker is not installed on your machine, or you do not have permissions to launch a docker container, then go into the script, search for "docker", and comment out the entire if/then block that contains the command.
To collect data into a single dataframe, run create_summary_dataframes.py. Requirements for this script are polars, pandas, and numpy.
This will generate summary dataframes of all phasing statistics, and will use up to 400GB of RAM to process the data from a whole genome panel.
Finally, the notebooks folder contains the notebooks necessary to generate the figures in our paper. Figure 1 was made with code available at https://github.com/mccoy-lab/1kgp_chm13_maps. Figures 2-5, Figure 7, and supplemental figures 1, 4-9, 11-13, was made with notebooks/make_plots.ipynb. Figure s6 was by stitching together images created by scripts/Figure_6_script.R
Other scripts: scripts/assess_imputation.sh
- script called by phase_and_assess_panel.sh that imputes SGDP data using both GRCh38 and CHM13v2.0 reference panels, and then calls GLIMPSE2_concordance to evaluate the accuracy of imputation. scripts/generate_regions_for_rare_phasing.py
- script to split the CHM13v2.0 genome into 40mb chunks (overlapping by 1mb) for rare phasing. Note that we encountered regions of low variant density in some overlapping regions, and manually adjusted region breakpoints to avoid this problem. regions2.txt and GRCh38_panel_evaluation/regions2.txt are these adjusted region files. scripts/create_syn_nonsyn_bins.sh
- script to create two input file for assess_imputation.sh: one that puts each variant into MAF bins that are split by syntenic status, and another that puts each variant into MAF bins that are split by syntenic status and indel/snp status. scripts/find_trio_singletons.sh
- When evaluating phasing or imputation accuracy of an out-of-panel dataset that was phased/imputed using a reference panel, we bin variants by the reference panel minor allele frequency. This can cause problems when a 1KGP trio contains a variant that is in one parent and a child, as we use the unrelated 2504 member panel as a reference when phasing/imputing. This panel drops all trio children, resulting in an apparant singleton variant (present in the parent and no other samples) that was phased with trio information from the child. These trio-private 'singleton' variants are almost perfectly phased in the panel, and their presence makes singleton variants appear to be far more accurate than they actually are. This is particularly a problem when using trio information to identify switch errors, as the parent and child were phased to be consistent with eachother. Before excluding this class of variants, it looked like GRCh38 singletons were extremely accurately phased compared to even variants with an MAC of 2 or 3. Because true singletons were excluded from that panel, the only singletons present in the 2504 file are variants that were uniquely found in parent-child duos in the 3202 panel. This script creates a list of all trio-private variants, which is then read by create_summary_dataframes.py to identify and remove these variants from further analysis. scripts/get_discordant_multiallelic_site.py
- A short script to identify variants that are shared between a genome-native panel and a lifted over panel. We group biallelic variants by loci to ensure that we can identify situations where only a portion of alt alleles are present in both panels. These sites are considered discordant between the two panels, even if a portion of the alt alleles are concordant. scripts/liftover_indels.py
- This script is taken from github.com/josephlalli/LiftoverIndels, and implements a liftover algorithm that adjusts variant ref/alt identify to accomodate differences in build sequence. For example, if a GRCh38 variant is Ref AA/Alt ATGC, and the CHM13v2.0 reference contains an AA -> ATG variation at that locus, the variant will be changed to ATG -> ATGC, minimized to G -> GC, and the position of the variant will be lifted to the chain-specified lifted position + two, to account for the change in locus induced when minimizing the variant. The script also checks each variant to make sure variant ref/alt changes do not alter the implied local reference or alt genomic sequence. scripts/liftover_panel.sh
- A short bash script to ensure vcf files are appropriately formatted before running liftover_indels.
-
exclude FILTER (column in the VCF) = PASS:
FILTER!="PASS"
-
exclude variants with an alt allele of '*' after multiallelic splitting:
ALT=='*'
-
exclude GT missingness rate < 5%
F_MISSING>0.05
-
exclude Hardy-Weinberg p-value < 1e−10 in any 1000G subpopulation (as calculated in the 2504 unrelated 1KGP samples):
INFO/HWE_EUR<1e-10 && INFO/HWE_AFR<1e-10 && INFO/HWE_EAS<1e-10 && INFO/HWE_AMR<1e-10 && INFO/HWE_SAS<1e-10
-
exclude sites where Mendelian Error Rate (Mendelian errors/num alleles) >= 0.05 (Note: 0.05*602 trios = 30 mendelian errors)
INFO/MERR>=30
-
exclude homoalellelic sites
MAC==0
-
exclude variants with a high chance of being errors as predicted by computational modeling*
INFO/VQSLOD<0
Note: the SHAPEIT5 UK Biobank phasing paper excludes alternative alleles with AAscore < 0.5. This is a statistic produced by GraphTyper, which was not used to produce this dataset. The closest equivelent is the VQSLOD produced by Haplotype Caller. GraphTyper's AA score is simply the likelihood of an alternative allele truly being present in the dataset, so a cutoff of 0.5 is equivelent to 50% odds. Log odds of 1:1 is 0, so the VQSLOD log odds equivelent would be to exclude sites with a VQSLOD score of less than a cutoff of 0.
Phasing was performed with SHAPEIT v5.1.1, largely in accordance with the recommendations outlined in SHAPEIT5's online tutorial. For each chromosome, common variants were phased in one chunk. Rare variants were phased in chunks of 40 megabases.
Chromosome X PAR regions were phased separately from the rest of chromosome X. To phase the body of chromsome X, male samples were provided to SHAPEIT5 via the --halpoid option. The provided Ne value was reduced to 75% of the overall Ne, to account for the reduced population of haplotypes in this region of chrX. Phasing statistics were only calculated for female samples.
We rely on two different methods of phasing quality evaluation:
-
Using family data as outlined in SHAPEIT5's documentation.
-
Using the haplotype-phased samples present in the draft human pangenome as a ground truth.
To perform these evaluations, we phase the data set four times:
- all 3202 samples, providing a pedigree
- all samples excluding those identified as parents in the 1KGP pedigree (2002 samples)
- all samples excluding those identified as parents of samples included in the draft pangenome (note: all shared 1KGP-pangenome samples are in the 1KGP dataset as children of trios).
- Only pangenome samples, using the first data set (with pangenome samples and their parents excluded) as a reference panel.
We then calculate different sets of performance statistics using SHAPEIT5's switch tool.
Using family data as 'ground truth', we can:
- Evaluate switch error rate of pedigree-informed 3202 sample panel, using family data
- Measures best-case performance of phasing when using pedigree data
- Evaluate switch error rate of 2002 sample no-parent phased panel using family data as outlined in SHAPEIT5's documentation.
- Most valid measure of phasing accuracy when not using pedigree data
- Sets upper bound on error rate, as phasing is performed with ~66% of our dataset's haplotypes
Using the 39 1KGP samples present in the draft pangenome, we can use the HPRC's empirically phased haplotypes to:
- Evaluate switch error rate of the of pedigree-informed 3202 sample panel
- Measures phasing accuracy when using a pedigree
- Evaluate switch error rate of a panel generated using samples excluding those identified as parents in the 1KGP pedigree (2002 samples)
- Measures phasing accuracy without a pedigree
- Evaluate switch error rate when phasing a small number of samples (aka the 39 HPRC samples) using 2502 unrelated samples from the pedigree-informed 1KGP phased panel dataset as a reference (with the HPRC samples and parents excluded, of course)
- Measures accuracy in most realistic use case