D: Running PanGenie on HPRC data
For the HPRC Minigraph-Cactus graph published in https://doi.org/10.1038/s41586-023-05896-x, we have generated PanGenie-ready VCFs containing haplotype data from 44 human samples (88 haplotypes). VCFs were generated based on GRCh38 and CHM13. They are available at:
Dataset | PanGenie input VCF | Callset VCF |
---|---|---|
HPRC-GRCh38 (88 haplotypes) | graph-VCF | callset-VCF |
HPRC-CHM13 (88 haplotypes) | graph-VCF | callset-VCF |
Each dataset comes with two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is the input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in Section "Genotyping variation nested inside of bubbles", and the same commands can be used to run PanGenie and the postprocessing step:
# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index
# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24
# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
The script convert-to-biallelic.py is provided here.
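Since the index is built once and PanGenie is then run once per sample, the per-sample commands can be generated programmatically. The following is a minimal sketch, not part of PanGenie itself: the helper names `pangenie_commands` and `run_all`, the sample names, and the read file names are hypothetical. It uses only the flags shown above and assumes that passing `-o <sample>` yields `<sample>_genotyping.vcf`, by analogy with the `pangenie_genotyping.vcf` output described above.

```python
import subprocess


def pangenie_commands(index_prefix, reads, threads=24):
    """Build one PanGenie command per sample.

    index_prefix: prefix passed to `PanGenie-index -o` (built once beforehand).
    reads: dict mapping a sample name to its reads file (hypothetical names).
    """
    commands = {}
    for sample, reads_file in reads.items():
        commands[sample] = [
            "PanGenie",
            "-f", index_prefix,      # shared index, created once
            "-i", str(reads_file),   # reads of this sample
            "-o", sample,            # assumption: output prefix per sample
            "-j", str(threads),
            "-t", str(threads),
        ]
    return commands


def run_all(commands, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for sample, cmd in commands.items():
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch if PanGenie fails for a sample
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical sample names and read files, for illustration only.
    samples = {"sampleA": "sampleA.fastq", "sampleB": "sampleB.fastq"}
    run_all(pangenie_commands("index", samples), dry_run=True)
```

With `dry_run=True` the script only prints the commands, so they can be inspected or copied into a job scheduler before anything is executed.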
You can also generate your own PanGenie-ready VCFs from a Minigraph-Cactus graph containing human samples. To do so, you need the VCFs produced by the [Minigraph-Cactus pipeline](https://github.com/ComparativeGenomicsToolkit/cactus), as well as the GFA file of the graph itself.
The pipeline provided at https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC can then be used to produce a graph-VCF and the corresponding callset-VCF in the format described in Section "Genotyping variation nested inside of bubbles". Again, the commands listed above can be used to run PanGenie and the postprocessing step.
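After generating your own VCFs, a quick sanity check can confirm that the callset-VCF is bi-allelic (exactly one ALT allele per record) and show how many records of the graph-VCF are multi-allelic. The sketch below is an illustration under stated assumptions, not part of the pipeline: the function names `alt_allele_counts` and `is_biallelic` are hypothetical, and the check looks only at the ALT column, not at any other PanGenie-specific format requirement.

```python
import gzip
from collections import Counter


def alt_allele_counts(vcf_path):
    """Map number of ALT alleles per record -> number of such records."""
    # Plain-text and gzip-compressed VCFs are both supported.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and empty lines
            alt = line.rstrip("\n").split("\t")[4]  # column 5 is ALT
            counts[len(alt.split(","))] += 1
    return counts


def is_biallelic(vcf_path):
    """True if every record lists exactly one ALT allele."""
    return all(n == 1 for n in alt_allele_counts(vcf_path))
```

For example, `is_biallelic("callset.vcf")` would be expected to return `True` for a callset-VCF, while `alt_allele_counts("graph.vcf")` would typically contain keys greater than 1 for a graph-VCF; both file names are placeholders.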