C: Running PanGenie on HGSVC data
Jana Ebler edited this page Aug 22, 2023 · 4 revisions
For the HGSVC data published in https://www.science.org/doi/10.1126/science.abf7117, we have generated PanGenie-ready VCFs containing haplotype data from 32 human samples (64 haplotypes). VCFs were generated based on GRCh38.
Dataset | PanGenie input VCF | Callset VCF |
---|---|---|
HGSVC-GRCh38 (freeze3, 64 haplotypes) | graph-VCF | callset-VCF |
HGSVC-GRCh38 (freeze4, 64 haplotypes) | graph-VCF | callset-VCF |
For each dataset, there are two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is the input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in the section "Genotyping variation nested inside of bubbles", and the same commands can be used to run PanGenie and the postprocessing step:
# run PanGenie (v3.0.0) preprocessing
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index
# run PanGenie (v3.0.0) on a specific sample (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf".
# to genotype multiple samples, run this command on each sample separately. PanGenie-index needs to be run only once.
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24
# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
The script convert-to-biallelic.py is provided here.
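Since the index is built once and PanGenie is then run separately per sample, genotyping several samples can be scripted as a loop. The following is a minimal sketch, not part of the PanGenie distribution: the sample names (`sampleA`, `sampleB`), the `reads/<sample>.fastq` layout, and the helper `pangenie_cmd` are hypothetical, and only the options shown in the commands above are used. It prints one command per sample rather than executing it, so the commands can be inspected first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the PanGenie command for one sample.
# Assumes PanGenie-index was already run once with "-o index" (see above).
# The sample name is used as the output prefix, so by analogy with the
# command above the result should be written to "<sample>_genotyping.vcf".
pangenie_cmd() {
    local sample="$1" reads="$2"
    printf 'PanGenie -f index -i %s -o %s -j 24 -t 24\n' "$reads" "$sample"
}

# Hypothetical sample names and read file layout; adjust to your data.
for sample in sampleA sampleB; do
    pangenie_cmd "$sample" "reads/${sample}.fastq"
done
```

Once the printed commands look right, the output of the script can be piped to `bash` to run them one after another. Using a distinct output prefix per sample keeps the genotyped VCFs from overwriting each other.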