C: Running PanGenie on HGSVC data
For the HGSVC data published in https://www.science.org/doi/10.1126/science.abf7117, we have generated PanGenie-ready VCFs containing haplotype data from 32 human samples (64 haplotypes). VCFs were generated based on GRCh38.
Dataset | PanGenie input VCF | Callset VCF |
---|---|---|
HGSVC-GRCh38 (freeze3, 64 haplotypes) | graph-VCF | callset-VCF |
HGSVC-GRCh38 (freeze4, 64 haplotypes) | graph-VCF | callset-VCF |
Each dataset comes with two VCFs: a multi-allelic graph-VCF (second column), which represents the pangenome graph and is used as input to PanGenie, and a bi-allelic callset-VCF (third column), which describes all variant alleles contained in the bubbles of the pangenome graph. These VCFs follow the format described in Section [Genotyping-variation-nested-inside-of-bubbles](https://github.com/eblerjana/pangenie/wiki/A:--Genotyping-variation-nested-inside-of-bubbles). The same commands can be used to run PanGenie and the postprocessing step:
# run PanGenie (using 24 cores), produces genotyped VCF "pangenie_genotyping.vcf"
PanGenie -i <input-reads> -v <graph-vcf> -r <reference-genome> -o pangenie -j 24 -t 24
# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
The script convert-to-biallelic.py is provided here.
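The two commands above can be chained in a small wrapper script. The following is only a minimal sketch and not part of PanGenie: the default file names (`reads.fastq`, `graph.vcf`, `callset.vcf`, `reference.fa`), the environment variables, and the `DRY_RUN` switch are hypothetical placeholders introduced for illustration. It assumes `PanGenie` is on the `PATH` and that `convert-to-biallelic.py` is in the working directory. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: run PanGenie and the bi-allelic conversion as one pipeline.
# All default file names are hypothetical placeholders; override them
# through the environment, e.g. READS=sample.fastq ./run.sh
set -euo pipefail

READS="${READS:-reads.fastq}"
GRAPH_VCF="${GRAPH_VCF:-graph.vcf}"
CALLSET_VCF="${CALLSET_VCF:-callset.vcf}"
REFERENCE="${REFERENCE:-reference.fa}"
THREADS="${THREADS:-24}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Step 1: genotype the graph-VCF; PanGenie writes pangenie_genotyping.vcf
genotype_cmd() {
  printf 'PanGenie -i %q -v %q -r %q -o pangenie -j %q -t %q\n' \
    "$READS" "$GRAPH_VCF" "$REFERENCE" "$THREADS" "$THREADS"
}

# Step 2: decompose bubbles into bi-allelic records using the callset-VCF
convert_cmd() {
  printf 'python3 convert-to-biallelic.py %q < pangenie_genotyping.vcf > pangenie_genotyping_biallelic.vcf\n' \
    "$CALLSET_VCF"
}

# Print a command to stderr and execute it unless DRY_RUN=1
run_step() {
  local cmd
  cmd="$1"
  echo "+ $cmd" >&2
  if [ "$DRY_RUN" != "1" ]; then
    bash -c "$cmd"
  fi
}

run_step "$(genotype_cmd)"
run_step "$(convert_cmd)"
```

Setting `DRY_RUN=0` executes both steps. Because of `set -e`, the conversion step is skipped if PanGenie exits with an error.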