Merge pull request #50 from Pathogen-Genomics-Cymru/tbprofiler
Tbprofiler
annacprice authored Feb 21, 2024
2 parents 77e2a3b + c074d6f commit af99d4a
Showing 17 changed files with 423 additions and 353 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/build-push-quay.yml
@@ -2,12 +2,11 @@ name: build-push-quay
on:
push:
branches:
- v0.9.6
- 0.9.7-dev
- climb
- main
paths:
- '**/Dockerfile*'
- "bin/"
- "resources/"

workflow_dispatch:

@@ -46,6 +45,7 @@ jobs:
- name: Copy folders to docker
run: |
cp -r bin docker/bin
cp -r resources docker/resources
- name: Get image name
id: image_name
16 changes: 7 additions & 9 deletions README.md
@@ -9,7 +9,7 @@ Pipeline cleans and QCs reads with fastp and FastQC, classifies with Kraken2 & A

Note that while Mykrobe is included within this pipeline, it runs as an independent process and is not used for any downstream reporting.

**WARNING**: There are currently known errors with vcfmix and gnomonicus; as such, `errorStrategy 'ignore'` has been added to the processes vcfpredict:vcfmix and vcfpredict:gnomonicus to stop the pipeline from crashing. Please check the stdout from nextflow to see whether these processes have run successfully.
**WARNING**: There are currently known errors with vcfmix; as such, `errorStrategy 'ignore'` has been added to the process vcfpredict:vcfmix to stop the pipeline from crashing. Please check the stdout from nextflow to see whether this process has run successfully.

## Quick Start ##
This is a Nextflow DSL2 pipeline, it requires a version of Nextflow that supports DSL2 and the stub-run feature. It is recommended to run the pipeline with `NXF_VER=20.11.0-edge`, as the pipeline has been tested using this version. E.g. to download
@@ -29,6 +29,8 @@ NXF_VER=20.11.0-edge nextflow run main.nf -profile docker --filetype bam --input
--output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj
```

There is also a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```-profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of Kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` directories respectively. These are mounted on a public S3 bucket hosted on CLIMB.
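
A minimal sketch of an invocation using the climb profile, assuming the Kraken2, Bowtie2 and Afanc databases are left at the profile defaults. The bucket name, output directory and `--pattern` value are placeholders, and `build_climb_cmd` is a hypothetical helper (not part of the pipeline) that only prints the command so it can be inspected before running:

```shell
# Hypothetical helper: assemble (but do not run) a Lodestone command for CLIMB.
# The paths passed in are placeholders; database locations are assumed to come
# from the climb profile defaults.
build_climb_cmd() {
    input_dir=$1
    output_dir=$2
    set -- nextflow run main.nf -profile climb \
        --filetype fastq \
        --input_dir "$input_dir" \
        --pattern '*_R{1,2}.fq.gz' \
        --unmix_myco yes \
        --output_dir "$output_dir"
    # "$*" joins the arguments with single spaces, for display only
    printf '%s\n' "$*"
}

build_climb_cmd "s3://my-team/bucket" "results"
```

Replacing the final `printf` with `"$@"` would execute the command rather than print it. The printed form drops the quotes around the pattern, so it is meant for inspection rather than copy-and-paste.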

### Executors ###

By default, the pipeline will just run on the local machine. To run on a cluster, modifications will have to be made to the `nextflow.config` to add in the executor. E.g. for a SLURM cluster add `process.executor = 'slurm'`. For more information on executor options see the Nextflow docs: https://www.nextflow.io/docs/latest/executor.html
@@ -63,10 +65,8 @@ Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/b
Name of the bowtie index, e.g. hg19_1kgmaj<br />
* **vcfmix**<br />
Run [vcfmix](https://github.com/AlexOrlek/VCFMIX), yes or no. Set to no for synthetic samples<br />
* **gnomonicus**<br />
Run [gnomonicus](https://github.com/oxfordmmm/gnomonicus), yes or no<br />
* **amr_cat**<br />
Path to AMR catalogue for gnomonicus<br />
* **resistance_profiler**<br />
Run resistance profiling for Mycobacterium tuberculosis. Either ["tb-profiler"](https://tbdr.lshtm.ac.uk/) or "none".
* **afanc_myco_db**<br />Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
<br />

@@ -125,12 +125,10 @@ process clockwork:alignToRef\
25. (Fail) If < 50% of the reference genome was covered at 10-fold depth

process clockwork:minos\
26. (Warn) If sample is not TB, then it is not passed to gnomonicus

## Running on CLIMB Jupyter Hub
There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` respectively. These are mounted on a public shared volume.
26. (Warn) If sample is not TB, then it is not passed to a resistance profiler

## Acknowledgements ##
For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in the version.json

The preprocessing sub-workflow is based on the preprocessing nextflow DSL1 pipeline written by Stephen Bush, University of Oxford. The clockwork sub-workflow uses aspects of the variant calling workflow from https://github.com/iqbal-lab-org/clockwork, lead author Martin Hunt, Iqbal Lab at EMBL-EBI

48 changes: 48 additions & 0 deletions config/containers.config
@@ -0,0 +1,48 @@
params{
container_enabled = "true"
container_enabled = "true"
}


process {
update_tbprofiler = "false"


withLabel:low_cpu {cpus = 2}
withLabel:normal_cpu { cpus = 8 }
withLabel:low_memory { memory = '5GB' }
withLabel:medium_memory { memory = '10GB' }
withLabel:high_memory { memory = '18GB' }

withLabel:getversion {
container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.8"
}

withLabel:preprocessing {
container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.8"
}

withLabel:tbprofiler {
container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.8"
}

withName:downloadContamGenomes {
shell = ['/bin/bash','-u']
errorStrategy = { task.exitStatus in 100..113 ? 'retry' : 'terminate' }
maxRetries = 5
}

withLabel:retryAfanc {
shell = ['/bin/bash','-u']
errorStrategy = {task.exitStatus == 1 ? 'retry' : 'ignore' }
maxRetries = 5
}

withLabel:clockwork {
container = "quay.io/pathogen-genomics-cymru/clockwork:0.9.8"
}

withLabel:vcfpredict {
container = "quay.io/pathogen-genomics-cymru/vcfpredict:0.9.8"
}
}
54 changes: 54 additions & 0 deletions docker/Dockerfile.tbprofiler-0.9.8
@@ -0,0 +1,54 @@
FROM mambaorg/micromamba:1.3.0 as app

#copy the reference genome to pre-compute our index
COPY resources/tuberculosis.fasta /data/tuberculosis.fasta

USER root
WORKDIR /

ARG TBPROFILER_VER="5.0.1"

# this version is the shortened commit hash on the `master` branch here https://github.com/jodyphelan/tbdb/
# commits are found on https://github.com/jodyphelan/tbdb/commits/master
# this was the latest commit as of 2023-10-26
ARG TBDB_VER="e25540b"

# LABEL instructions tag the image with metadata that might be important to the user
LABEL base.image="micromamba:1.3.0"
LABEL dockerfile.version="1"
LABEL software="tbprofiler"
LABEL software.version="${TBPROFILER_VER}"
LABEL description="The pipeline aligns reads to the H37Rv reference using bowtie2, BWA or minimap2 and then calls variants using bcftools. These variants are then compared to a drug-resistance database."
LABEL website="https://github.com/jodyphelan/TBProfiler/"
LABEL license="https://github.com/jodyphelan/TBProfiler/blob/master/LICENSE"
LABEL maintainer="John Arnn"
LABEL maintainer.email="[email protected]"
LABEL maintainer2="Curtis Kapsak"
LABEL maintainer2.email="[email protected]"

# Install dependencies via apt-get; cleanup apt garbage
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
ca-certificates \
procps && \
apt-get autoclean && rm -rf /var/lib/apt/lists/*

# install tb-profiler via bioconda; install into 'base' conda env
RUN micromamba install --yes --name base --channel conda-forge --channel bioconda \
tb-profiler=${TBPROFILER_VER}

RUN micromamba install --yes --name base --channel conda-forge --channel bioconda gatk4
RUN micromamba install --yes --name base --channel conda-forge --channel bioconda samtools
RUN micromamba install --yes --name base --channel conda-forge jq
RUN micromamba clean --all --yes

# hardcode 'base' env bin into PATH, so conda env does not have to be "activated" at run time
ENV PATH="/opt/conda/bin:${PATH}"

# Version of database can be confirmed at /opt/conda/share/tbprofiler/tbdb.version.json
# can also run 'tb-profiler list_db' to find the same version info
# In 5.0.1 updating_tbdb does not work with tb-profiler update_tbdb --commit ${TBDB_VER}
RUN tb-profiler update_tbdb --commit ${TBDB_VER}

WORKDIR /data
RUN tb-profiler update_tbdb --match_ref tuberculosis.fasta
36 changes: 6 additions & 30 deletions docker/Dockerfile.vcfpredict-0.9.8
@@ -3,19 +3,16 @@ FROM ubuntu:20.04
LABEL maintainer="[email protected]" \
about.summary="container for the vcf predict workflow"

#add run-vcf to container
COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH

ENV PACKAGES="procps curl wget git build-essential libhdf5-dev libffi-dev r-base-core jq" \
PYTHON="python3 python3-pip python3-dev"

ENV vcfmix_version=d4693344bf612780723e39ce27c8ae3868f95417 \
gumpy_version=1.0.15 \
piezo_version=0.3 \
gnomonicus_version=1.1.2 \
tuberculosis_amr_catalogues=12d38733ad2e238729a3de9f725081e1d4872968

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH


#apt updates
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata \
&& apt-get install -y $PACKAGES $PYTHON \
@@ -27,25 +24,4 @@ RUN apt-get update \
&& pip3 install awscli \
&& pip3 install . \
&& cp -r data /usr/local/lib/python3.8/dist-packages \
&& cd ..

RUN curl -fsSL https://github.com/oxfordmmm/gumpy/archive/refs/tags/v${gumpy_version}.tar.gz | tar -xz \
&& cd gumpy-${gumpy_version} \
&& pip3 install . \
&& cd ..

RUN curl -fsSL https://github.com/oxfordmmm/piezo/archive/refs/tags/v${piezo_version}.tar.gz | tar -xz \
&& cd piezo-${piezo_version} \
&& pip3 install . \
&& cd ..

RUN curl -fsSL https://github.com/oxfordmmm/gnomonicus/archive/refs/tags/v${gnomonicus_version}.tar.gz | tar -xz \
&& cd gnomonicus-${gnomonicus_version} \
&& pip3 install . \
&& cd ..

RUN git clone https://github.com/oxfordmmm/tuberculosis_amr_catalogues.git \
&& cd tuberculosis_amr_catalogues \
&& git checkout ${tuberculosis_amr_catalogues} \
&& cd ..

&& cd ..
61 changes: 38 additions & 23 deletions main.nf
@@ -36,24 +36,24 @@ Produces as output one directory per sample, containing the relevant reports & a
Mandatory and conditional parameters:
------------------------------------------------------------------------
--input_dir Directory containing fastq OR bam files. Workflow will process one or the other, so don't mix
--filetype File type in input_dir. One of either "fastq" or "bam". fastq files can be gzipped and do not
--filetype File type in input_dir. One of either "fastq" or "bam". fastq files can be gzipped and do not
have to literally take the form "*.fastq"; see --pattern
--pattern Regex to match files in input_dir, e.g. "*_R{1,2}.fq.gz". Only mandatory if --filetype is "fastq"
--output_dir Output directory, in which will be created subdirectories matching base name of fastq/bam files
--unmix_myco Do you want to disambiguate mixed-mycobacterial samples by read alignment? One of "yes" or "no"
If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so
--unmix_myco Do you want to disambiguate mixed-mycobacterial samples by read alignment? One of "yes" or "no"
If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so
WILL ALMOST CERTAINLY ALSO reduce coverage of the principal species
If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria
If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria
will still be disambiguated
--kraken_db Directory containing Kraken2 database files (obtain from https://benlangmead.github.io/aws-indexes/k2)
--bowtie2_index Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19_1kgmaj_bt2.zip
This is the Langmead lab pre-built major-allele-SNP reference; see https://github.com/BenLangmead/bowtie-majref)
--bowtie_index_name Name of the bowtie index, e.g. hg19_1kgmaj
--vcfmix Run VCFMIX "yes" or "no". Should be set to "no" for synthetic samples
--gnomonicus Run gnomon "yes" or "no"
--vcfmix Run VCFMIX "yes" or "no". Should be set to "no" for synthetic samples
--resistance_profiler Tool to profile resistance with. At the moment options are "tb-profiler" or "none"
--amr_cat Path to the AMR catalogue (https://github.com/oxfordmmm/tuberculosis_amr_catalogues is at /tuberculosis_amr_catalogues
in the vcfpredict container)
--afanc_myco_db Path to the Afanc database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_3.0.tar.gz
--afanc_myco_db Path to the Afanc database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_3.0.tar.gz
Optional parameters:
------------------------------------------------------------------------
@@ -63,17 +63,17 @@ Optional parameters:
default: null
using this parameter will apply an additional sanity test to your sample
if you DO NOT use this parameter (default option), pipeline will determine principal species from
if you DO NOT use this parameter (default option), pipeline will determine principal species from
the reads and consider any other species a contaminant
if you DO use this parameter, pipeline will expect this to be the principal species. It will fail
the sample if reads from this species are not actually the majority
If you DO use this parameter, pipeline will expect this to be the principal species. It will fail
the sample if reads from this species are not actually the majority
Profiles:
------------------------------------------------------------------------
singularity to run with singularity
docker to run with docker
docker to run with docker
Examples:
@@ -86,6 +86,21 @@ nextflow run main.nf -profile docker --filetype bam --input_dir bam_dir --unmix_
}


resistance_profilers = ["tb-profiler", "none"]

if(!resistance_profilers.contains(params.resistance_profiler)){
exit 1, 'Invalid resistance profiler. Must be one of "tb-profiler" or "none" to skip.'
}

//tbprofiler container already has the reference genome in the DB, so skip if using docker
if((params.resistance_profiler == "tb-profiler") && (params.container_enabled == true)) {
update_tbprofiler = true
} else {
update_tbprofiler = false
}

resistance_profiler = params.resistance_profiler
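
The allowed-value check above can also be mirrored in a launcher script, so that an invalid `--resistance_profiler` value is caught before Nextflow starts. A minimal shell sketch under that assumption; `check_resistance_profiler` is a hypothetical helper and not part of the pipeline:

```shell
# Hypothetical pre-flight check mirroring the allowed values in main.nf.
# Returns 0 for "tb-profiler" or "none", otherwise prints the error to stderr
# and returns 1.
check_resistance_profiler() {
    case "$1" in
        tb-profiler|none)
            return 0
            ;;
        *)
            echo 'Invalid resistance profiler. Must be one of "tb-profiler" or "none" to skip.' >&2
            return 1
            ;;
    esac
}

# Example: validate the value before launching the pipeline
if check_resistance_profiler "tb-profiler"; then
    echo "resistance profiler accepted"
fi
```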

// confirm that mandatory parameters have been set and that the conditional parameter, --pattern, has been used appropriately
if ( params.input_dir == "" ) {
exit 1, "error: --input_dir is mandatory (run with --help to see parameters)"
@@ -118,18 +133,17 @@ M Y C O B A C T E R I A L P I P E L I N E
Parameters used:
------------------------------------------------------------------------
--input_dir ${params.input_dir}
--filetype ${params.filetype}
--pattern ${params.pattern}
--output_dir ${params.output_dir}
--unmix_myco ${params.unmix_myco}
--kraken_db ${params.kraken_db}
--input_dir ${params.input_dir}
--filetype ${params.filetype}
--pattern ${params.pattern}
--output_dir ${params.output_dir}
--unmix_myco ${params.unmix_myco}
--kraken_db ${params.kraken_db}
--bowtie2_index ${params.bowtie2_index}
--bowtie_index_name ${params.bowtie_index_name}
--species ${params.species}
--vcfmix ${params.vcfmix}
--gnomonicus ${params.gnomonicus}
--amr_cat ${params.amr_cat}
--resistance_profiler ${params.resistance_profiler}
--species ${params.species}
--vcfmix ${params.vcfmix}
--afanc_myco_db ${params.afanc_myco_db}
Runtime data:
@@ -198,9 +212,10 @@ workflow {

mpileup_vcf = clockwork.out.mpileup_vcf
minos_vcf = clockwork.out.minos_vcf
genbank = channel.fromPath(params.gnomonicus_genbank)
reference = clockwork.out.reference
bam = clockwork.out.bam

vcfpredict(mpileup_vcf, minos_vcf, genbank)
vcfpredict(bam, mpileup_vcf, minos_vcf, reference)

}

11 changes: 5 additions & 6 deletions modules/clockworkModules.nf
@@ -47,7 +47,7 @@ process alignToRef {
doWeAlign =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/

output:
tuple val(sample_name), path("${sample_name}_report.json"), path("${sample_name}.bam"), path("${sample_name}.fa"), stdout, emit: alignToRef_bam
tuple val(sample_name), path("${sample_name}_report.json"), path("${sample_name}.bam"), path(reference_path), stdout, emit: alignToRef_bam
path("${sample_name}.bam.bai", emit: alignToRef_bai)
path("${sample_name}_alignmentStats.json", emit: alignToRef_json)
path "${sample_name}_err.json", emit: alignToRef_log optional true
@@ -63,9 +63,8 @@ process alignToRef {

"""
echo $reference_path
cp ${reference_path} ${sample_name}.fa
minimap2 -ax sr ${sample_name}.fa -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference ${sample_name}.fa - minimap.bam
minimap2 -ax sr $reference_path -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference $reference_path - minimap.bam
java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups INPUT=minimap.bam OUTPUT=${bam} RGID=${sample_name} RGLB=lib RGPL=Illumina RGPU=unit RGSM=sample
@@ -206,7 +205,7 @@ process callVarsCortex {

process minos {
/**
* @QCcheckpoint check if top species is TB, if yes pass vcf to gnomonicus
* @QCcheckpoint check if top species is TB, if yes pass vcf to resistance profiling
*/

tag { sample_name }
@@ -241,7 +240,7 @@ process minos {
cp ${sample_name}_report.json ${sample_name}_report_previous.json
if [[ \$top_hit =~ ^"Mycobacterium tuberculosis" ]]; then printf "CREATE_ANTIBIOGRAM_${sample_name}"; else echo '{"gnomonicus-warning":"sample is not TB so cannot produce antibiogram using gnomonicus"}' | jq '.' > ${error_log} && printf "no" && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
if [[ \$top_hit =~ ^"Mycobacterium tuberculosis" ]]; then printf "CREATE_ANTIBIOGRAM_${sample_name}"; else echo '{"resistance-profiling-warning":"sample is not TB so cannot produce antibiogram using resistance profiling tools"}' | jq '.' > ${error_log} && printf "no" && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
"""

stub:
@@ -296,7 +295,7 @@ process gvcf {
cp ${sample_name}_report.json ${sample_name}_report_previous.json
if [ ${params.vcfmix} == "no" ] && [ ${params.gnomonicus} == "no" ]; then echo '{"complete":"workflow complete without error"}' | jq '.' > ${error_log} && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
if [ ${params.vcfmix} == "no" ] && [ ${params.resistance_profiler} == "none" ]; then echo '{"complete":"workflow complete without error"}' | jq '.' > ${error_log} && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
"""

stub: