Philip Reiner Kensche edited this page Apr 10, 2019 · 19 revisions

Alignment and Quality Control Plugin for Roddy

This plugin contains alignment and quality control related Roddy workflows:

  • PanCancer alignment workflow for whole genome (WGS) and exome (WES)
  • Bisulfite core workflow (WGBS) using methylCtools

These are basically BWA alignment (bwa mem) workflows with plenty of additional quality control steps. QC steps acting on the BAM files are mostly done by piping the input data into multiple QC tools simultaneously to achieve high performance. All workflows can be run for each sample individually or with combinations of tumor and control samples.
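The simultaneous-QC piping can be pictured with a toy shell sketch. This is not the workflow's actual code; wc and md5sum merely stand in for QC tools. One stream is duplicated into several consumers through named pipes, so no intermediate file is written:

```shell
# Duplicate one data stream into two "QC tools" via named pipes.
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/qc1" "$tmpdir/qc2"

# Two stand-in QC tools read from the pipes in the background.
wc -l < "$tmpdir/qc1" > "$tmpdir/line_count.txt" &
md5sum < "$tmpdir/qc2" > "$tmpdir/checksum.txt" &

# tee duplicates the stream: one copy per pipe, no temporary file on disk.
printf 'read1\nread2\nread3\n' | tee "$tmpdir/qc1" > "$tmpdir/qc2"
wait

cat "$tmpdir/line_count.txt"   # number of "reads" seen by the first consumer
```

The same pattern scales to more consumers by adding pipes, which is why the BAM-level QC steps can run in parallel at roughly the cost of a single pass over the data.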

Configuration

All configuration variables are documented in the XML files in the resources/configurationFiles/ directory. There is one XML per workflow. Note that the workflows depend on each other, i.e. the WES and WGBS workflows are extensions of the WGS workflow -- this can be recognized from the imports attribute of the top-level configuration tag in the XMLs. This means that most options of the WGS workflow also affect the other two workflows. Conversely, settings in the WGBS and WES workflows may override those in the WGS workflow. Some processing steps, notably those of the ACEseq quality control (QC), are not valid for the WES workflow. Note that the plugin depends on the COWorkflowBasePlugin, which has its own configurations affecting this alignment plugin.

See the Roddy documentation for a description of how to configure and run workflows.

Software Requirements

Conda

The workflow contains a description of a Conda environment. A number of Conda packages from BioConda are required. You should set up the Conda environment at a central location accessible from all compute hosts.

First configure the required Conda channels:

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Then install the environment:

conda env create -n AlignmentAndQCWorkflows -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/qcAnalysis/environments/conda.yml

The name of the Conda environment is arbitrary but needs to be consistent with the condaEnvironmentName variable. The default for that variable is set in resources/configurationFiles/qcAnalysis.xml.
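As a site-specific sanity check (command sketch only; it depends on your cluster setup and is not runnable standalone), you can verify that the environment exists under the name expected by condaEnvironmentName and that the tools inside it resolve:

```shell
# The environment name must match the condaEnvironmentName variable.
conda env list | grep AlignmentAndQCWorkflows

# 'source activate' on older conda versions.
conda activate AlignmentAndQCWorkflows

# bwa prints its usage, including a "Version:" line, when run without arguments.
bwa 2>&1 | grep Version
```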

Disclaimer for the Conda Version

The software versions in the Conda environment are not yet exactly the same as the ones in our local HPC infrastructure at the German Cancer Research Center. Among the many differences, the following are probably the most interesting for users of the data:

| Package | DKFZ version | Conda version | Comment |
| --- | --- | --- | --- |
| biobambam | 0.0.148 | 2.0.79 | As long as you do not select markDuplicatesVariant=biobambam this won't be a problem, as biobambam is only used for sorting BAMs. Note further that we did not manage to get bamsort 2 from Conda to run on a CentOS 7 VM. You can also use useBioBamBamSort=false to sort with samtools. |
| picard | 1.125 | 1.126 | Probably no big deal. |
| bwa | patched 0.7.8 | 0.7.8 | For the WGBS workflow we currently use a patched version of BWA that does not check for the "/1" and "/2" first- and second-read marks. This version is not available in BioConda and thus the WGBS workflow won't work with the Conda environment. |
| R | 3.4.0 | 3.4.1 | Probably no big deal. |
| trimmomatic | 0.30 | 0.33 | |

We successfully tested the Conda environment, imported as described above, on WGS data using the parameters useBioBamBamSort=false, markDuplicatesVariant=sambamba, workflowEnvironmentScript=workflowEnvironment_conda, and condaEnvironmentName=AlignmentAndQCWorkflows.
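A hypothetical invocation with these parameters might look as follows. The project name, analysis identifier, PID, and ini path are invented and depend entirely on your Roddy setup; consult the Roddy documentation for the exact call syntax:

```shell
# Sketch only: run the WGS analysis with the Conda-related parameters
# passed as configuration values. All identifiers below are placeholders.
roddy.sh run MyProject@WGS pid_001 \
  --useconfig=/path/to/applicationProperties.ini \
  --cvalues="useBioBamBamSort:false,markDuplicatesVariant:sambamba,workflowEnvironmentScript:workflowEnvironment_conda,condaEnvironmentName:AlignmentAndQCWorkflows"
```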

WGBS Data Processing and methylCtools

The current implementation of the WGBS workflow uses methylCtools, which requires a patched BWA version. Note that the methylCtools version in this repository differs from the one in the official GitHub repository.

Recompiling the D-based Components

Two programs in this repository -- genomeCoverage.d and coverageQc.d -- were written in the programming language D and are provided as binaries and source code. If the need arises to recompile them you can find the build instructions in resources/. For the compilation you will need

  • the LDC 0.12.1 D compiler
  • the BioD library, master branch (commit 8b633de)

Workflow Job Structure

The workflow submits various jobs. The exact job types depend on whether you analyse WGS, WES, or WGBS data.

Whole Genome Sequencing (WGS)

The WGS variant does some GC- and replication-timing bias corrections for the coverage estimates, as described in the documentation of the ACEseq workflow.

WGS job structure

Whole Exome Sequencing (WES)

For exome sequencing the QC statistics need to account for the target regions only; otherwise the estimates would be wildly off.

WES job structure

Whole Genome Bisulfite Sequencing (WGBS)

The WGBS variant does bisulfite calling on the fly with a patched version of methylCtools that is included in this repository. The workflow can handle not only WGBS but also tagmentation-WGBS data as described by Wang et al., 2013. Tagmentation data is based on independently amplified libraries, which makes it necessary to do independent duplication marking for each individual library before merging everything into a final merged-BAM.

WGBS job structure

Resource Requirements

The workflow is tuned to minimize IO; for instance, the tools are glued together using pipes. However, the duplication-marking and BAM-sorting steps produce temporary files. These two steps and the BWA step are also the memory-hungry ones, while BWA requires the most CPU time.

Have a look at the resource definitions in the XMLs in the resources/configurationFiles/ directory; they are rather conservative and will cover almost all 30x-80x human data sets we received from our X10 sequencers. The XMLs contain multiple parameters that allow you to tweak the memory and cores actually used:

  • bwa (BWA_MEM_THREADS)
  • mbuffer (MBUFFER_SIZE_LARGE, MBUFFER_SIZE_SMALL)
  • sambamba (SAMBAMBA_MARKDUP_OPTS)
  • samtools (SAMPESORT_MEMSIZE)
  • picard (PICARD_MARKDUP_JVM_OPTS)

Other relevant options are:

  • The resource requirements depend on the workflow variant used (e.g. whether biobambam's bamsort or samtools sort is used to sort the BAM file). Use a fast duplication marker, selected via the markDuplicatesVariant variable; we found sambamba-0.5.9 to be optimal.
  • If you have a large and fast local filesystem, set useRoddyScratchAsBigFileScratch to true and set scratchBaseDirectory in the applicationProperties.ini to a path on that filesystem. This will speed up all temporary file IO.
  • mbuffer is used to buffer short-timescale throughput fluctuations and to copy data to multiple output (named) pipes.

BAM and FASTQ Usage

The workflow was originally designed to retrieve the files -- FASTQs and possible BAMs -- from the filesystem by predicting their names using the filename patterns. Usually, the workflow was run on one directory (both input and output directory) and the workflow made sure that all files it expects are present.

Which BAM file is used as input, e.g. when doing an incremental merge, depends on a set of parameters:

  • If the bam parameter is set to some file, an existing target BAM file (derived from the filename pattern in the XML) will be rescued by renaming with a date suffix.
  • If fastq_list is set (to a semicolon-separated list of files), FASTQ files matching filename patterns are ignored. FASTQ files for read groups that are already in the BAM file (no matter whether it was provided via bam or via filename-pattern matching) are also ignored.
  • bam and fastq_list can be set together. Then only the explicitly set files are used; other BAM and FASTQ files matching filename patterns are ignored.
  • If useOnlyExistingTargetBam is set to true, then all existing FASTQ files matching filename patterns are ignored.
  • Using useOnlyExistingTargetBam=true with 'bam' or 'fastq_list' set is considered a configuration error. The workflow will check this and stop before the job submission.
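A hypothetical sketch of setting bam and fastq_list together (all paths, project, and PID names below are invented; the exact call syntax depends on your Roddy setup):

```shell
# Sketch only: merge new FASTQs into an existing BAM, bypassing
# filename-pattern discovery for both inputs. Note the semicolon as
# separator inside the fastq_list value.
roddy.sh run MyProject@WGS pid_001 \
  --useconfig=/path/to/applicationProperties.ini \
  --cvalues="bam:/data/pid_001/old_merged.bam,fastq_list:/data/run2/lane1_R1.fastq.gz;/data/run2/lane1_R2.fastq.gz"
```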

Read Group Identifiers

Read group IDs in BAMs are determined from (input files) or stored in (output files) the ID attribute of @RG header lines. Read group IDs for FASTQ files are usually determined from filenames using the pattern ${RUN}_${LANE}. With a metadata input table you can provide FASTQ files with arbitrary file names, because the metadata is taken from the table's columns.
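The ID extraction can be illustrated on a literal header; on a real BAM you would feed the output of 'samtools view -H sample.bam' into the same pipeline instead:

```shell
# A toy SAM header standing in for 'samtools view -H sample.bam' output.
printf '@HD\tVN:1.5\tSO:coordinate\n@RG\tID:run1_lane1\tSM:tumor\n@RG\tID:run1_lane2\tSM:tumor\n' > header.sam

# Pull the ID attribute out of each @RG line.
grep '^@RG' header.sam | tr '\t' '\n' | grep '^ID:' | cut -d: -f2 > rg_ids.txt
cat rg_ids.txt
```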

Running the Workflow

The most important parameters are:

| Parameter | Example | Description |
| --- | --- | --- |
| INDEX_PREFIX | /path/to/assembly/assembly.fa | The path to the FASTA file with the assembled genome(s). Note that the BWA index needs to be in this directory and use the string 'assembly.fa' as prefix. |
| CHROM_SIZES_FILE | /path/to/assembly/sizes.tsv | A two-column TSV file with chromosome identifiers (1) and numbers of bases (2). Usually you want the number of bases to count only bases from the set {A, T, C, G}, to ignore all lower-quality bases in the genome in the statistics. |
| CHR_PREFIX | chrMmu | With xenograft data, used to discern 'matching' and 'nonmatching' identifiers, i.e. those that match the /$CHR_PREFIX$chr$CHR_SUFFIX/ pattern and those that do not, respectively. Also used for WGBS. |
| CHR_SUFFIX | _hsa | See CHR_PREFIX. |
| CHR_GROUP_NOT_MATCHING | human | See CHR_PREFIX. Default: "nonmatching" |
| CHR_GROUP_MATCHING | mouse | See CHR_PREFIX. Default: "matching" |
| CHROMOSOME_INDICES | "( 1 2 3 )" | Needed for the WGBS workflow to select the chromosomes to be processed. This should be a quoted bash array, i.e. with spaces as element separators and including the parentheses. |
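As an illustration of the CHROM_SIZES_FILE format, here is a hypothetical helper (not shipped with the workflow) that derives such a two-column TSV from a FASTA, counting only upper-case A, T, C, G so that soft-masked and ambiguous bases are excluded:

```shell
# Toy FASTA: chromosome 1 has 4 upper-case ACGT bases (lower-case and N
# are ignored), chromosome 2 has 5.
cat > toy.fa <<'EOF'
>1 some description
ACGTacgtNN
>2
GGGCC
EOF

# gsub(/[ACGT]/, "") returns the number of matched bases per line.
awk '/^>/ {name = substr($1, 2); next}
     {size[name] += gsub(/[ACGT]/, "")}
     END {for (c in size) print c "\t" size[c]}' toy.fa | sort -k1,1 > sizes.tsv
cat sizes.tsv
```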

A full description of all options in the different workflows can be found in the XML files in resources/configurationFiles. Note that workflow configurations inherit from each other in the order "WGS" <- "WES" <- "WGBS". Thus the WGS configuration (analysisQc.xml) contains variables that are overridden by values in the WES configuration (analysisExome.xml), and so forth.

Xenograft

The WGS and WES workflows can deal with xenograft data simply by aligning against the combined genome. Thus, to process xenograft data you need a FASTA file, a genome index, and a matching "chromosome sizes file" -- all containing both the host's and the xenografted species's genomes. Make sure that the chromosomes from both species have different identifiers, e.g. by prefixing or suffixing one of the two sets of chromosome names, for instance with chrMmu or whatever is appropriate.

If you want to have species-specific coverage you additionally need to set some variables. The chromosomes of one of the two species need to be prefixed (CHR_PREFIX) and/or suffixed (CHR_SUFFIX). For instance you may use a FASTA file with the human chromosomes without (explicitly configured) prefixes, e.g. with chromosomes 1, 2, ..., X, Y, MT and mouse chromosomes prefixed by 'chrMmu'. In this situation use the following configuration values:

  • CHR_PREFIX=chrMmu
  • CHR_GROUP_MATCHING=mouse
  • CHR_GROUP_NOT_MATCHING=human
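The combined reference for this example configuration could be sketched as follows (toy sequences and invented file names; real references are the full per-species FASTAs):

```shell
# Toy per-species FASTAs standing in for full reference genomes.
printf '>1\nACGT\n>2\nTTAA\n' > human.fa
printf '>1\nGGCC\n' > mouse.fa

# Prefix the mouse chromosome names with chrMmu and concatenate, so that
# both species' chromosomes have distinct identifiers in the combined FASTA.
sed 's/^>/>chrMmu/' mouse.fa | cat human.fa - > combined.fa
grep '^>' combined.fa
```

Remember that the BWA index and the chromosome sizes file must be built from this combined FASTA.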

The result will be that in the files $sample_$pid(_targetExtract)?.rmdup.bam.DepthOfCoverage(_Target)?_Grouped.txt two lines are created, called "matching" (or "mouse" in the example) and "nonmatching" (or "human" in the example). Additionally, these values are collected in the _qualitycontrol.json files.

NOTE: Currently, the WGBS workflow variant uses the CHR_PREFIX variable for another purpose and therefore cannot collect dedicated statistics for xenograft data.

Release Branches

Various versions are or have been in production mode at the DKFZ/ODCF. These often have dedicated release branches named "ReleaseBranch_$major.$minor[.$patch]" in which only certain changes have been made:

  • No changes that alter previous output.
  • New important features are sometimes backported -- as long as they do not change the previous results.
  • Bugfixes that allow running the workflow on some data on which it previously crashed, but that do not alter the existing output, are included.

Note that ReleaseBranch_1.2.182 is not the newest branch, but the oldest! It was derived from a very old version of the workflow (QualityControlWorkflows_1.0.182) at a time when the versioning system was not fixed to semver 2.0.

Change Logs

Please see the changelogs file for details.
