D: Running PanGenie on HPRC data

Jana Ebler edited this page Feb 22, 2024 · 6 revisions

Using PanGenie-ready VCFs produced by HPRC

For the HPRC Minigraph-Cactus graph published in https://doi.org/10.1038/s41586-023-05896-x, we have generated PanGenie-ready VCFs containing haplotype data from 44 human samples (88 haplotypes). VCFs were generated based on GRCh38 and CHM13. They are available at:

| Dataset | PanGenie input VCF | Callset VCF |
|---|---|---|
| HPRC-GRCh38 (88 haplotypes) | graph-VCF | callset-VCF |
| HPRC-CHM13 (88 haplotypes) | graph-VCF | callset-VCF |

For each dataset, there are two versions: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is to be used as input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in Section "Genotyping variation nested inside of bubbles", and the same commands can be used to run PanGenie and the postprocessing step:

# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index

# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24


# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The script convert-to-biallelic.py is provided here.
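
Since PanGenie-index is run only once and PanGenie is then run per sample, the genotyping step can be wrapped in a loop. The following is a minimal sketch, not part of PanGenie itself: the helper name `genotype_samples`, the `PANGENIE` and `THREADS` variables, and the `reads/<sample>.fastq` layout are assumptions made for illustration. Only the `-f`, `-i`, `-o`, `-j` and `-t` options shown above are used.

```shell
#!/usr/bin/env bash
# Sketch: genotype several samples against one shared PanGenie index.
# Assumptions (not from this page): reads are stored as reads/<sample>.fastq
# and the index prefix was already produced by PanGenie-index.
set -euo pipefail

# Command used to launch PanGenie; set PANGENIE=echo for a dry run.
PANGENIE="${PANGENIE:-PanGenie}"
THREADS="${THREADS:-24}"

# genotype_samples INDEX_PREFIX READS_FILE...
# Runs PanGenie once per reads file, using the file name (minus its last
# extension) as the output prefix, so each run writes <sample>_genotyping.vcf.
genotype_samples() {
    local index_prefix="$1"; shift
    local reads sample
    for reads in "$@"; do
        sample="$(basename "${reads%.*}")"
        "$PANGENIE" -f "$index_prefix" -i "$reads" -o "$sample" -j "$THREADS" -t "$THREADS"
    done
}

# Example call (hypothetical paths):
#   genotype_samples index reads/*.fastq
```

Each resulting `<sample>_genotyping.vcf` can then be passed through convert-to-biallelic.py as shown above.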

Preparing PanGenie-ready VCFs from Minigraph-Cactus graphs

You can also generate your own PanGenie-ready VCFs from a Minigraph-Cactus graph containing human samples. To do so, you need the VCFs produced by the [Minigraph-Cactus pipeline](https://github.com/ComparativeGenomicsToolkit/cactus), as well as the GFA file of the graph itself.

The pipeline provided at https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC can then be used to produce a graph-VCF and the corresponding callset-VCF, in the same format as described in Section "Genotyping variation nested inside of bubbles". Again, the same commands as listed above can be used to run PanGenie and the postprocessing step.
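
As a quick sanity check on the two output files, one could count the records that list more than one ALT allele: a bi-allelic callset-VCF should report 0, whereas a multi-allelic graph-VCF is expected to report a positive number. The sketch below assumes an uncompressed, tab-separated VCF; the function name `count_multiallelic` is hypothetical and not part of the pipeline.

```shell
# count_multiallelic VCF_FILE
# Prints the number of non-header records whose ALT column (column 5)
# contains a comma, i.e. records with more than one alternative allele.
count_multiallelic() {
    awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }' "$1"
}

# Example calls (hypothetical file names):
#   count_multiallelic graph.vcf
#   count_multiallelic callset.vcf
```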