C: Running PanGenie on HGSVC data

Jana Ebler edited this page Aug 3, 2023 · 4 revisions

For the HGSVC data published in https://www.science.org/doi/10.1126/science.abf7117, we have generated PanGenie-ready VCFs containing haplotype data from 32 human samples (64 haplotypes). VCFs were generated based on GRCh38.

| Dataset | PanGenie input VCF | Callset VCF |
| --- | --- | --- |
| HGSVC-GRCh38 (freeze3, 64 haplotypes) | graph-VCF | callset-VCF |
| HGSVC-GRCh38 (freeze4, 64 haplotypes) | graph-VCF | callset-VCF |

For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is used as input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in Section [Genotyping-variation-nested-inside-of-bubbles](https://github.com/eblerjana/pangenie/wiki/A:--Genotyping-variation-nested-inside-of-bubbles). The same commands can be used to run PanGenie and the postprocessing step:

# run PanGenie (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf"
PanGenie -i <input-reads> -v <graph-vcf> -r <reference-genome> -o pangenie -j 24 -t 24


# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The script convert-to-biallelic.py is provided here.
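
To check that the right file goes into each step, it can help to confirm that the graph-VCF really is multi-allelic and the callset-VCF really is bi-allelic. The sketch below is not part of PanGenie; the function names `alt_allele_counts` and `is_biallelic` and the toy records are made up for illustration. It assumes uncompressed, tab-separated VCF text and only looks at the ALT column.

```python
from collections import Counter
from typing import Iterable


def alt_allele_counts(vcf_lines: Iterable[str]) -> Counter:
    """Tally VCF records by their number of ALT alleles (hypothetical helper)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        alt = fields[4]  # ALT is the fifth column; alleles are comma-separated
        n_alt = 0 if alt == "." else len(alt.split(","))
        counts[n_alt] += 1
    return counts


def is_biallelic(vcf_lines: Iterable[str]) -> bool:
    """True if every record carries exactly one ALT allele."""
    return set(alt_allele_counts(vcf_lines)) <= {1}


# Toy records, invented for this example (not taken from the HGSVC files).
graph_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT,AGG\t.\tPASS\t.",
    "chr1\t250\t.\tG\tC\t.\tPASS\t.",
]
callset_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t100\t.\tA\tAGG\t.\tPASS\t.",
    "chr1\t250\t.\tG\tC\t.\tPASS\t.",
]

graph_counts = alt_allele_counts(graph_records)
callset_ok = is_biallelic(callset_records)
graph_ok = is_biallelic(graph_records)
```

To run the same check on a real file, pass an open file handle instead of a list, for example `with open(path) as f: print(alt_allele_counts(f))`, where `path` points to an uncompressed VCF.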