MoLLuDiC is the MoBiDiC toolset for CNV calling, annotation and interpretation on capture sequencing data, based on the clamms workflow https://github.com/rgcgithub/clamms.
This workflow is easily accessible through a Singularity image, has low memory usage, handles batch effects, and annotates calls with databases useful for human diagnostics.
CNV calling on NGS capture libraries is difficult to implement in a bioinformatics pipeline, and its performance is uncertain because of capture bias. Moreover, since there is no standard output format for CNV callers, CNVs are difficult to annotate.
We propose an exportable WDL workflow based on open source tools and open source MoBiDiC scripts for CNV calling, annotation and interpretation. The performance of MoLLuDiC improves as the data collection grows.
- MoLLuDiC can be easily installed with a Singularity container.
- MoLLuDiC is powered by WDL and Cromwell from the Broad Institute. The pipeline is adaptable and easy to use via a JSON input file.
- CNV calling and batch-effect removal are based on clamms https://github.com/rgcgithub/clamms.
- CNV familial segregation is performed with bedtools.
- CNV annotation is based on bedtools and a MoBiDiC-made master annotation file.
- CNV calls can be interpreted in a TSV viewer, and joint SNV/CNV interpretation can be done with Captain ACHAB https://github.com/mobidic/Captain-ACHAB.
molludic.sh
- Select mode : "from scratch" to process all your data or "routine" to process new samples
- Select a library : any capture, from gene panel to whole exome sequencing
- Remove batch effect : selection of the X most similar samples according to 7 technical parameters (sex, AT and GC dropout, mean insert size, percentage of targeted bases covered at 10X and 50X...) using a k-d tree nearest-neighbour search. The number of samples X (KNN) can be adjusted to your data.
- Remove relatives from CNV calling : adding a family list file (tab-delimited file of sample identifiers) removes relatives from the calling
- CNV calling : statistical analysis of read depth coverage
- CNV familial segregation : bedtools intersect and merge with all CNV calls from relatives
- CNV annotation : bedtools intersect with a master annotation file containing cytoband, OMIM, ExAC CNV population frequency and metrics, and in silico prediction tools. Instructions for master annotation file creation are described below.
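The batch-effect step above picks, for each sample, its X nearest neighbours in the space of technical metrics. As a rough illustration only, the sketch below does this by brute force with awk and sort instead of a k-d tree (MoLLuDiC itself delegates this step to R and the FNN package). The function name `nearest_samples`, the input layout (tab-separated, sample ID in column 1, numeric metrics afterwards) and the assumption that the metrics are already scaled to comparable ranges are all hypothetical.

```shell
#!/usr/bin/env bash
# Brute-force k-nearest-neighbour selection (illustrative, not the MoLLuDiC code).
# Assumed input: tab-separated file, column 1 = sample ID, other columns = scaled metrics.
nearest_samples() {
    local metrics_file=$1 target=$2 k=$3
    awk -F'\t' -v target="$target" '
        # First pass: remember the metrics of the target sample.
        NR == FNR {
            if ($1 == target) { found = 1; for (i = 2; i <= NF; i++) ref[i] = $i }
            next
        }
        # Second pass: stop if the target was never seen.
        !found { exit 1 }
        # Euclidean distance from every other sample to the target.
        $1 != target {
            d = 0
            for (i = 2; i <= NF; i++) { diff = $i - ref[i]; d += diff * diff }
            printf "%.6f\t%s\n", sqrt(d), $1
        }
    ' "$metrics_file" "$metrics_file" | LC_ALL=C sort -k1,1n | head -n "$k" | cut -f2
}

# Toy example with two made-up metrics per sample.
demo=$(mktemp)
printf 'S1\t0.10\t0.20\nS2\t0.12\t0.21\nS3\t0.90\t0.80\nS4\t0.15\t0.25\n' > "$demo"
nearest_samples "$demo" S1 2   # prints S2 then S4
rm -f "$demo"
```

A brute-force scan is quadratic in the number of samples, which is why a k-d tree is preferable for large collections; the selected neighbours should be the same for an exact search.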
MoLLuDiC : Exportable CNV calling, annotating and interpretating workflow for NGS Capture sequencing (2019). https://github.com/mobidic/MoLLuDiC
- Library bed file
- Metrics from Picard Tools (CollectHsMetrics and InsertSizeMetrics)
- Coverage from samtools bedcov or GATK DepthOfCoverage
- An annotated CNV calling bed file (with or without familial segregation)
To download MoLLuDiC, use git to fetch the most recent development tree. The tree is currently hosted on GitHub and can be obtained via:
$ git clone https://github.com/mobidic/MoLLuDiC.git
- Linux OS
- Cromwell
- C
- git
- python 3
- bedtools (v2.27.1)
- R and the FNN package (install.packages("FNN"))
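Before running the workflow, it can help to confirm that the command-line prerequisites listed above are on your PATH. The sketch below is one way to do that; the helper name `missing_tools` is hypothetical, and the executable names (`git`, `python3`, `bedtools`, `Rscript`) are assumptions about how these tools are installed on your system. Cromwell is left out because it is commonly run as a jar file rather than as a command on PATH.

```shell
#!/usr/bin/env bash
# Print the name of every requested tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing=$(missing_tools git python3 bedtools Rscript)
if [ -n "$missing" ]; then
    # $missing is intentionally unquoted: one message line per missing tool.
    printf 'Missing from PATH: %s\n' $missing
else
    echo "All checked tools were found."
fi
```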
MoLLuDiC (version ${VERSION}) is a CNV workflow for calling and annotation !
Usage : ./molludic.sh
General arguments :
help : show this help message
-v : decrease or increase verbosity level (ERROR : 1 | WARNING : 2 | INFO [default] : 3 | DEBUG : 4)
MoLLuDiC is composed of several functions. You can print help for each function by typing help after the function name.
Example : ./molludic.sh install help
List of MoLLuDiC's functions :
dirpreparation <OPTION> : Create folders to use correctly clamms
install <CLAMMS_DIRECTORY> : install Clamms in a specific directory
mapinstall <CLAMMS_DIRECTORY> <BigWigToWig_PATH> : create Mapability bed
windowsBed <CLAMMS_DIRECTORY> <INSERT_SIZE> <INTERVALBEDFILE> <REFFASTA> <CLAMMS_SPECIAL_REGIONS> <LIBRARY_DIRECTORY> : run clamms annotate windows
normalizeFS <COVERAGE_PATH> <CLAMMS_DIRECTORY> <WINDOWS_BED> <LIBRARY_DIRECTORY> : normalize bed files from scratch
normalize <CLAMMS_DIRECTORY> <SAMPLEID> <CLAMMSCOVERAGEFILE> <WINDOWS_BED> <LIBRARY_DIRECTORY> : normalize one bed file
metricsMatrixFS <LIBRARY_DIRECTORY> <HS_FOLDER> <PYTHON_PATH> <MATCH_METRICS> : create kd tree metrics from scratch
metricsMatrix <LIBRARY_DIRECTORY> <SAMPLEID> <HSMETRICSTXT> <INSERT_SIZE_METRICS_TXT> <PYTHON_PATH> <MATCH_METRICS> : create kd tree metrics for 1 sample
removeRelatives <ALLKDTREE> <FAMILYLIST> <LIBRARY_DIRECTORY> : remove relatives from all kd tree file
makekdtree <RSCRIPT_PATH> <RSCRIPT_FILE> <KNN> <ALL_TREE> <LIBRARY_DIRECTORY> <FROM_SCRATCH> : use Rscript to do kd tree
cnvCallingFS <CLAMMS_DIRECTORY> <LIBRARY_DIRECTORY> <LIST_KDTREE> <WINDOWS_BED> <KNN> : do calling from scratch
cnvCalling <CLAMMS_DIRECTORY> <LIBRARY_DIRECTORY> <NORMCOVBED> <LIST_KDTREE> <WINDOWS_BED> <KNN> : do calling for 1 sample
annotation <LIBRARY_DIRECTORY> <SAMPLEID> <BEDTOOLS_PATH> <HGBED> <HEADER_FILE> <CNV_BED> <DAD> (optional) <MUM> (optional) : annotate cnv bed file
Soon, you will be able to launch MoLLuDiC via a singularity container.
GATK needs the bed file to contain "chr" before the chromosome number, whereas Clamms needs no "chr" before the chromosome number. Be careful with your bed data !
sed 's/^chr//g' yourbedwithchr > bedforclamms.bed
sed 's/^/chr/g' yourbedwithoutchr.bed > bedforgatk.bed
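Since the two tools expect opposite naming conventions, it is worth checking which one a bed file uses before converting it. The helper below is a minimal sketch of such a check; the function name `bed_chr_style` and its four return labels are hypothetical, and it assumes that header lines, if present, start with `#`, `track` or `browser`.

```shell
#!/usr/bin/env bash
# Report whether the first column of a bed file uses "chr"-prefixed names.
# Prints one of: chr, nochr, mixed, empty.
bed_chr_style() {
    awk '
        # Skip comments, UCSC-style header lines and blank lines.
        /^#/ || /^track/ || /^browser/ || NF == 0 { next }
        $1 ~ /^chr/ { n_chr++; next }
        { n_plain++ }
        END {
            if (n_chr && n_plain) print "mixed"
            else if (n_chr)       print "chr"
            else if (n_plain)     print "nochr"
            else                  print "empty"
        }
    ' "$1"
}

# Toy example: a two-line bed file without the prefix.
demo=$(mktemp)
printf '1\t100\t200\n2\t300\t400\n' > "$demo"
bed_chr_style "$demo"   # prints nochr
rm -f "$demo"
```

A "mixed" result usually means the file was concatenated from different sources and should be fixed before running either sed command.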
Creating the windowsBed requires that your capture library.bed contain the same chromosomes as the hg19.fa genome. To create a specific hg19.fa genome without selected chromosomes, the following shell command should work (example here with chromosomes 13, 21 and 22 removed).
sed '/chr13/,/chr14/{//!d}' /usr/local/share/refData/genome/hg19/hg19.fa | grep -v "chr13" | sed '/chr21/,/chr22/{//!d}' | sed '/chr22/,/chrX/{//!d}' | grep -v "chr21" | grep -v "chr22" | sed 's/chr//g' > hg19_moins132122_nochr.fa
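The command above depends on the order of chromosomes in the FASTA file. As an alternative sketch, the awk function below drops sequences by exact header name, independent of their order, and strips the "chr" prefix from the headers it keeps. The function name `drop_chromosomes` and its argument layout are hypothetical, and it assumes that the sequence name is the first whitespace-delimited word after `>`.

```shell
#!/usr/bin/env bash
# Remove named sequences from a FASTA file and strip the "chr" prefix from kept headers.
# $1 = input FASTA, $2 = comma-separated list of sequence names to drop (e.g. chr13,chr21).
drop_chromosomes() {
    awk -v drop="$2" '
        BEGIN {
            n = split(drop, names, ",")
            for (i = 1; i <= n; i++) skip[names[i]] = 1
        }
        /^>/ {
            name = substr($1, 2)          # header word without the leading ">"
            keep = !(name in skip)        # exact match, not substring match
            if (keep) sub(/^>chr/, ">")   # rename ">chr1" to ">1"
        }
        keep                              # print header and sequence lines of kept records
    ' "$1"
}

# Toy example with three tiny records.
demo=$(mktemp)
printf '>chr1\nACGT\n>chr13\nGGGG\n>chr2\nTTTT\n' > "$demo"
drop_chromosomes "$demo" chr13   # prints the records for 1 and 2 only
rm -f "$demo"
```

For the example in this section, the call would be `drop_chromosomes hg19.fa chr13,chr21,chr22 > hg19_moins132122_nochr.fa`, with the input path adjusted to your own installation.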
Montpellier Bioinformatique pour le Diagnostique Clinique (MoBiDiC)
CHU de Montpellier
France