C: Running PanGenie on HGSVC data

For the HGSVC data published in https://www.science.org/doi/10.1126/science.abf7117, we have generated PanGenie-ready VCFs containing haplotype data from 32 human samples (64 haplotypes). VCFs were generated based on GRCh38.

Dataset	PanGenie input VCF	Callset VCF
HGSVC-GRCh38 (freeze3, 64 haplotypes)	graph-VCF	callset-VCF
HGSVC-GRCh38 (freeze4, 64 haplotypes)	graph-VCF	callset-VCF

For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is used as input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in the section "Genotyping variation nested inside of bubbles", so the same commands can be used to run PanGenie and the postprocessing step:

# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index

# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24


# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf

The script convert-to-biallelic.py is provided here.
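
As noted in the comments above, the index is built once and PanGenie is then run separately for each sample. A minimal sketch of such a loop follows. The sample names, the `reads/<sample>.fastq` layout, and the `DRY_RUN` switch are assumptions made for illustration and are not part of PanGenie; only the `PanGenie` flags are taken from the commands above. Each sample's name is used as the output prefix, so results should land in `<sample>_genotyping.vcf`.

```shell
#!/bin/sh
# Sketch: genotype several samples against one shared PanGenie index.
# Hypothetical (not from the wiki): sample names, reads/<sample>.fastq layout, DRY_RUN.

SAMPLES="${SAMPLES:-sampleA sampleB}"   # hypothetical sample names
READS_DIR="${READS_DIR:-reads}"         # hypothetical directory holding <sample>.fastq
THREADS="${THREADS:-24}"
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Assumes PanGenie-index has already been run once with output prefix "index".
genotype_all() {
    for sample in $SAMPLES; do
        run PanGenie -f index -i "$READS_DIR/$sample.fastq" -o "$sample" -j "$THREADS" -t "$THREADS"
    done
}

genotype_all
```

With the default `DRY_RUN=1` the script only prints the commands, which makes it easy to check paths before starting long jobs; set `DRY_RUN=0` to execute them. The conversion to a bi-allelic VCF can then be applied to each resulting file as shown above.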
