Merge pull request #50 from Pathogen-Genomics-Cymru/tbprofiler

Tbprofiler
Pathogen-Genomics-Cymru · Feb 21, 2024 · af99d4a
2 parents 77e2a3b + c074d6f
commit af99d4a
Showing 17 changed files with 423 additions and 353 deletions.
diff --git a/.github/workflows/build-push-quay.yml b/.github/workflows/build-push-quay.yml
@@ -2,12 +2,11 @@ name: build-push-quay
 on:
   push:
     branches:
-      - v0.9.6
-      - 0.9.7-dev
-      - climb
+      - main
     paths:
       - '**/Dockerfile*'
       - "bin/"
+      - "resources/"
 
   workflow_dispatch:
 
@@ -46,6 +45,7 @@ jobs:
       - name: Copy folders to docker
         run: |
           cp -r bin docker/bin
+          cp -r resources docker/resources
 
       - name: Get image name
         id: image_name

diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@ Pipeline cleans and QCs reads with fastp and FastQC, classifies with Kraken2 & A
 
 Note that while Mykrobe is included within this pipeline, it runs as an independent process and is not used for any downstream reporting.
 
-**WARNING**: There are currently known errors with vcfmix and gnomonicus, as such `errorStrategy 'ignore'` has been added to the processes vcfpredict:vcfmix and vcfpredict:gnomonicus to stop the pipeline from crashing. Please check the stdout from nextflow to see whether these processes have ran successfully.
+**WARNING**: There are currently known errors with vcfmix. As such, `errorStrategy 'ignore'` has been added to the process vcfpredict:vcfmix to stop the pipeline from crashing. Please check the stdout from nextflow to see whether this process has run successfully.
 
 ## Quick Start ## 
 This is a Nextflow DSL2 pipeline, it requires a version of Nextflow that supports DSL2 and the stub-run feature. It is recommended to run the pipeline with  `NXF_VER=20.11.0-edge`, as the pipeline has been tested using this version. E.g. to download
@@ -29,6 +29,8 @@ NXF_VER=20.11.0-edge nextflow run main.nf -profile docker --filetype bam --input
 --output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj
 ```
 
+There is also a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```-profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of Kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` directories respectively. These are mounted on a public S3 bucket hosted on CLIMB.
+
 ### Executors ###
 
 By default, the pipeline will just run on the local machine. To run on a cluster, modifications will have to be made to the `nextflow.config` to add in the executor. E.g. for a SLURM cluster add `process.executor = 'slurm'`. For more information on executor options see the Nextflow docs: https://www.nextflow.io/docs/latest/executor.html
@@ -63,10 +65,8 @@ Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/b
 Name of the bowtie index, e.g. hg19_1kgmaj<br />
 * **vcfmix**<br />
 Run [vcfmix](https://github.com/AlexOrlek/VCFMIX), yes or no. Set to no for synthetic samples<br />
-* **gnomonicus**<br />
-Run [gnomonicus](https://github.com/oxfordmmm/gnomonicus), yes or no<br />
-* **amr_cat**<br />
-Path to AMR catalogue for gnomonicus<br />
+* **resistance_profiler**<br />
+Run resistance profiling for Mycobacterium tuberculosis. Either ["tb-profiler"](https://tbdr.lshtm.ac.uk/) or "none".
 * **afanc_myco_db**<br />Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from  https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
 <br />
 
@@ -125,12 +125,10 @@ process clockwork:alignToRef\
 25. (Fail) If < 50% of the reference genome was covered at 10-fold depth
 
 process clockwork:minos\
-26. (Warn) If sample is not TB, then it is not passed to gnomonicus
-
-## Running on CLIMB Jupyter Hub
-There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` respectively. These are mounted on a public shared volume.
+26. (Warn) If sample is not TB, then it is not passed to a resistance profiler
 
 ## Acknowledgements ##
 For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in the version.json
 
 The preprocessing sub-workflow is based on the preprocessing nextflow DSL1 pipeline written by Stephen Bush, University of Oxford. The clockwork sub-workflow uses aspects of the variant calling workflow from https://github.com/iqbal-lab-org/clockwork, lead author Martin Hunt, Iqbal Lab at EMBL-EBI
+
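To illustrate the climb profile and the new `--resistance_profiler` option described in the README changes above, here is a dry-run sketch that assembles a command line and prints it without running Nextflow. The flag names come from the README and the `main.nf` help text. The `build_cmd` helper, the `--pattern` value and the output directory are assumptions for illustration only, and the set of flags actually required under the climb profile may differ.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build (but do not execute) a Lodestone invocation for the climb profile.
# build_cmd is a hypothetical helper, not part of the pipeline.
build_cmd() {
    local input_dir="$1" output_dir="$2"
    # An array keeps each argument intact, so the quoted pattern is never glob-expanded.
    local cmd=(nextflow run main.nf
        -profile climb
        --filetype fastq
        --pattern '*_R{1,2}.fq.gz'
        --input_dir "$input_dir"
        --output_dir "$output_dir"
        --unmix_myco no
        --resistance_profiler tb-profiler)
    printf '%s ' "${cmd[@]}"
    printf '\n'
}

# The bucket path is the README's own example.
build_cmd s3://my-team/bucket ./results
```

Printing the command first makes it easy to check the flags before submitting work to the cluster.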
diff --git a/config/containers.config b/config/containers.config
@@ -0,0 +1,48 @@
+params{
+    container_enabled = "true"
+}
+
+
+process {
+    update_tbprofiler = "false"
+
+
+    withLabel:low_cpu {cpus = 2}
+    withLabel:normal_cpu { cpus = 8 }
+    withLabel:low_memory { memory = '5GB' }
+    withLabel:medium_memory { memory = '10GB' }
+    withLabel:high_memory { memory = '18GB' }
+
+    withLabel:getversion {
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.8"
+    }
+
+    withLabel:preprocessing {
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.8"
+    }
+
+    withLabel:tbprofiler {
+        container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.8"
+    }
+
+    withName:downloadContamGenomes {
+        shell = ['/bin/bash','-u']
+        errorStrategy = { task.exitStatus in 100..113 ? 'retry' : 'terminate' }
+        maxRetries = 5
+    }
+
+    withLabel:retryAfanc {
+        shell = ['/bin/bash','-u']
+        errorStrategy = { task.exitStatus == 1 ? 'retry' : 'ignore' }
+        maxRetries = 5
+    }
+
+    withLabel:clockwork {
+        container = "quay.io/pathogen-genomics-cymru/clockwork:0.9.8"
+    }
+
+    withLabel:vcfpredict {
+        container = "quay.io/pathogen-genomics-cymru/vcfpredict:0.9.8"
+    }
+ }
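The two `errorStrategy` closures in this config map a failed task's exit status to an action. The sketch below restates that mapping as a plain shell function so it can be checked by hand. `strategy_for` is a hypothetical helper and not part of the pipeline. The retry cap (`maxRetries = 5`) is applied by Nextflow itself and is not modelled here.

```shell
#!/usr/bin/env bash
# strategy_for is a hypothetical helper mirroring the errorStrategy closures in containers.config.
strategy_for() {
    local selector="$1" status="$2"
    case "$selector" in
        downloadContamGenomes)
            # Groovy's 100..113 is an inclusive range.
            if [ "$status" -ge 100 ] && [ "$status" -le 113 ]; then
                echo retry
            else
                echo terminate
            fi
            ;;
        retryAfanc)
            if [ "$status" -eq 1 ]; then
                echo retry
            else
                echo ignore
            fi
            ;;
        *)
            echo unknown
            return 1
            ;;
    esac
}

strategy_for downloadContamGenomes 104   # prints: retry
strategy_for downloadContamGenomes 2     # prints: terminate
strategy_for retryAfanc 1                # prints: retry
strategy_for retryAfanc 137              # prints: ignore
```

The asymmetry is worth noting: a non-retryable failure stops the run for `downloadContamGenomes`, whereas for `retryAfanc` it is ignored and the pipeline carries on.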
diff --git a/docker/Dockerfile.tbprofiler-0.9.8 b/docker/Dockerfile.tbprofiler-0.9.8
@@ -0,0 +1,54 @@
+FROM mambaorg/micromamba:1.3.0 as app
+
+#copy the reference genome to pre-compute our index
+COPY resources/tuberculosis.fasta /data/tuberculosis.fasta
+
+USER root
+WORKDIR /
+
+ARG TBPROFILER_VER="5.0.1"
+
+# this version is the shortened commit hash on the `master` branch here https://github.com/jodyphelan/tbdb/
+# commits are found on https://github.com/jodyphelan/tbdb/commits/master
+# this was the latest commit as of 2023-10-26
+ARG TBDB_VER="e25540b"
+
+# LABEL instructions tag the image with metadata that might be important to the user
+LABEL base.image="micromamba:1.3.0"
+LABEL dockerfile.version="1"
+LABEL software="tbprofiler"
+LABEL software.version="${TBPROFILER_VER}"
+LABEL description="The pipeline aligns reads to the H37Rv reference using bowtie2, BWA or minimap2 and then calls variants using bcftools. These variants are then compared to a drug-resistance database."
+LABEL website="https://github.com/jodyphelan/TBProfiler/"
+LABEL license="https://github.com/jodyphelan/TBProfiler/blob/master/LICENSE"
+LABEL maintainer="John Arnn"
+LABEL maintainer.email="[email protected]"
+LABEL maintainer2="Curtis Kapsak"
+LABEL maintainer2.email="[email protected]"
+
+# Install dependencies via apt-get; cleanup apt garbage
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    wget \
+    ca-certificates \
+    procps && \
+    apt-get autoclean && rm -rf /var/lib/apt/lists/*
+
+# install tb-profiler via bioconda; install into 'base' conda env
+RUN micromamba install --yes --name base --channel conda-forge --channel bioconda  \
+    tb-profiler=${TBPROFILER_VER}
+
+RUN micromamba install --yes --name base --channel conda-forge --channel bioconda gatk4 
+RUN micromamba install --yes --name base --channel conda-forge --channel bioconda samtools 
+RUN micromamba install --yes --name base --channel conda-forge jq
+RUN micromamba clean --all --yes
+
+# hardcode 'base' env bin into PATH, so conda env does not have to be "activated" at run time
+ENV PATH="/opt/conda/bin:${PATH}"
+
+# Version of database can be confirmed at /opt/conda/share/tbprofiler/tbdb.version.json
+# can also run 'tb-profiler list_db' to find the same version info
+# In 5.0.1 updating_tbdb does not work with tb-profiler update_tbdb --commit ${TBDB_VER}
+RUN tb-profiler update_tbdb --commit ${TBDB_VER}
+
+WORKDIR /data
+RUN tb-profiler update_tbdb --match_ref tuberculosis.fasta
diff --git a/docker/Dockerfile.vcfpredict-0.9.8 b/docker/Dockerfile.vcfpredict-0.9.8
@@ -3,19 +3,16 @@ FROM ubuntu:20.04
 LABEL maintainer="[email protected]" \
 about.summary="container for the vcf predict workflow"
 
+#add run-vcf to container
+COPY bin/ /opt/bin/
+ENV PATH=/opt/bin:$PATH
+
 ENV PACKAGES="procps curl wget git build-essential libhdf5-dev libffi-dev r-base-core jq" \
 PYTHON="python3 python3-pip python3-dev"
 
 ENV vcfmix_version=d4693344bf612780723e39ce27c8ae3868f95417 \
-gumpy_version=1.0.15 \
-piezo_version=0.3 \
-gnomonicus_version=1.1.2 \
-tuberculosis_amr_catalogues=12d38733ad2e238729a3de9f725081e1d4872968
-
-COPY bin/ /opt/bin/
-ENV PATH=/opt/bin:$PATH
-
 
+#apt updates
 RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata \
 && apt-get install -y $PACKAGES $PYTHON \
@@ -27,25 +24,4 @@ RUN apt-get update \
 && pip3 install awscli \
 && pip3 install . \
 && cp -r data /usr/local/lib/python3.8/dist-packages \
-&& cd ..
-
-RUN curl -fsSL https://github.com/oxfordmmm/gumpy/archive/refs/tags/v${gumpy_version}.tar.gz | tar -xz \
-&& cd gumpy-${gumpy_version} \
-&& pip3 install . \
-&& cd ..
-
-RUN curl -fsSL https://github.com/oxfordmmm/piezo/archive/refs/tags/v${piezo_version}.tar.gz | tar -xz \
-&& cd piezo-${piezo_version} \
-&& pip3 install . \
-&& cd ..
-
-RUN curl -fsSL https://github.com/oxfordmmm/gnomonicus/archive/refs/tags/v${gnomonicus_version}.tar.gz | tar -xz \
-&& cd gnomonicus-${gnomonicus_version} \
-&& pip3 install . \
-&& cd ..
-
-RUN git clone https://github.com/oxfordmmm/tuberculosis_amr_catalogues.git \
-&& cd tuberculosis_amr_catalogues \
-&& git checkout ${tuberculosis_amr_catalogues} \
-&& cd ..
-
+&& cd ..
diff --git a/main.nf b/main.nf
@@ -36,24 +36,24 @@ Produces as output one directory per sample, containing the relevant reports & a
 Mandatory and conditional parameters:
 ------------------------------------------------------------------------
 --input_dir           Directory containing fastq OR bam files. Workflow will process one or the other, so don't mix
---filetype	      File type in input_dir. One of either "fastq" or "bam". fastq files can be gzipped and do not
+--filetype            File type in input_dir. One of either "fastq" or "bam". fastq files can be gzipped and do not
                       have to literally take the form "*.fastq"; see --pattern
 --pattern             Regex to match files in input_dir, e.g. "*_R{1,2}.fq.gz". Only mandatory if --filetype is "fastq"
 --output_dir          Output directory, in which will be created subdirectories matching base name of fastq/bam files
---unmix_myco	      Do you want to disambiguate mixed-mycobacterial samples by read alignment? One of "yes" or "no"
-	              If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so
+--unmix_myco          Do you want to disambiguate mixed-mycobacterial samples by read alignment? One of "yes" or "no"
+                      If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so
                       WILL ALMOST CERTAINLY ALSO reduce coverage of the principal species
-	              If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria
+                      If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria
                       will still be disambiguated
 --kraken_db           Directory containing Kraken2 database files (obtain from https://benlangmead.github.io/aws-indexes/k2)
 --bowtie2_index       Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19_1kgmaj_bt2.zip
                       This is the Langmead lab pre-built major-allele-SNP reference; see https://github.com/BenLangmead/bowtie-majref)
 --bowtie_index_name   Name of the bowtie index, e.g. hg19_1kgmaj
---vcfmix	      Run VFCMIX "yes" or "no". Should be set to "no" for synthetic samples
---gnomonicus          Run gnomon "yes" or "no"
+--vcfmix              Run VCFMIX "yes" or "no". Should be set to "no" for synthetic samples
+--resistance_profiler Tool to profile resistance with. At the moment options are "tb-profiler" or "none"
 --amr_cat             Path to the AMR catalogue (https://github.com/oxfordmmm/tuberculosis_amr_catalogues is at /tuberculosis_amr_catalogues
                       in the vcfpredict container)
---afanc_myco_db	      Path to the Afanc database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_3.0.tar.gz
+--afanc_myco_db       Path to the Afanc database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_3.0.tar.gz
 
 Optional parameters:
 ------------------------------------------------------------------------
@@ -63,17 +63,17 @@ Optional parameters:
                    default: null
                    using this parameter will apply an additional sanity test to your sample
 
-	           if you DO NOT use this parameter (default option), pipeline will determine principal species from
+                   if you DO NOT use this parameter (default option), pipeline will determine principal species from
                    the reads and consider any other species a contaminant
 
-	           if you DO use this parameter, pipeline will expect this to be the principal species. It will fail
-		   the sample if reads from this species are not actually the majority
+                   if you DO use this parameter, pipeline will expect this to be the principal species. It will fail
+                   the sample if reads from this species are not actually the majority
 
 
 Profiles:
 ------------------------------------------------------------------------
 singularity        to run with singularity
-docker		   to run with docker
+docker             to run with docker
 
 
 Examples:
@@ -86,6 +86,21 @@ nextflow run main.nf -profile docker --filetype bam --input_dir bam_dir --unmix_
 }
 
 
+resistance_profilers = ["tb-profiler", "none"]
+
+if(!resistance_profilers.contains(params.resistance_profiler)){
+    exit 1, 'Invalid resistance profiler. Must be one of "tb-profiler" or "none" to skip.'
+    }
+
+//tbprofiler container already has the reference genome in the DB, so skip if using docker
+if((params.resistance_profiler == "tb-profiler") && (params.container_enabled == true)) {
+    update_tbprofiler = true
+} else {
+    update_tbprofiler = false
+}
+
+resistance_profiler = params.resistance_profiler
+
 // confirm that mandatory parameters have been set and that the conditional parameter, --pattern, has been used appropriately
 if ( params.input_dir == "" ) {
     exit 1, "error: --input_dir is mandatory (run with --help to see parameters)"
@@ -118,18 +133,17 @@ M Y C O B A C T E R I A L  P I P E L I N E
 
 Parameters used:
 ------------------------------------------------------------------------
---input_dir		${params.input_dir}
---filetype		${params.filetype}
---pattern		${params.pattern}
---output_dir	        ${params.output_dir}
---unmix_myco	        ${params.unmix_myco}
---kraken_db		${params.kraken_db}
+--input_dir             ${params.input_dir}
+--filetype              ${params.filetype}
+--pattern               ${params.pattern}
+--output_dir            ${params.output_dir}
+--unmix_myco            ${params.unmix_myco}
+--kraken_db             ${params.kraken_db}
 --bowtie2_index         ${params.bowtie2_index}
 --bowtie_index_name     ${params.bowtie_index_name}
---species		${params.species}
---vcfmix		${params.vcfmix}
---gnomonicus		${params.gnomonicus}
---amr_cat		${params.amr_cat}
+--resistance_profiler   ${params.resistance_profiler}
+--species               ${params.species}
+--vcfmix                ${params.vcfmix}
 --afanc_myco_db         ${params.afanc_myco_db}
 
 Runtime data:
@@ -198,9 +212,10 @@ workflow {
 
       mpileup_vcf = clockwork.out.mpileup_vcf
       minos_vcf = clockwork.out.minos_vcf
-      genbank = channel.fromPath(params.gnomonicus_genbank)
+      reference = clockwork.out.reference
+      bam = clockwork.out.bam
 
-      vcfpredict(mpileup_vcf, minos_vcf, genbank)
+      vcfpredict(bam, mpileup_vcf, minos_vcf, reference)
 
 }
 

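The `main.nf` changes above validate `--resistance_profiler` against a fixed list before the workflow starts. A minimal shell sketch of the same check follows. `validate_profiler` is a hypothetical stand-in for the Groovy `contains` test and reuses the error message from the diff.

```shell
#!/usr/bin/env bash
# validate_profiler is a hypothetical stand-in for the Groovy check added to main.nf.
validate_profiler() {
    local choice="$1"
    local allowed
    # The allowed values are taken from the resistance_profilers list in the diff.
    for allowed in tb-profiler none; do
        if [ "$choice" = "$allowed" ]; then
            return 0
        fi
    done
    echo 'Invalid resistance profiler. Must be one of "tb-profiler" or "none" to skip.' >&2
    return 1
}

validate_profiler tb-profiler && echo "tb-profiler accepted"
validate_profiler gnomonicus 2>/dev/null || echo "gnomonicus rejected"
```

Failing early on an unknown value avoids starting the alignment and variant-calling steps only to stop at the resistance profiling stage.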
diff --git a/modules/clockworkModules.nf b/modules/clockworkModules.nf
@@ -47,7 +47,7 @@ process alignToRef {
     doWeAlign =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/
 
     output:
-    tuple val(sample_name), path("${sample_name}_report.json"), path("${sample_name}.bam"), path("${sample_name}.fa"), stdout, emit: alignToRef_bam
+    tuple val(sample_name), path("${sample_name}_report.json"), path("${sample_name}.bam"), path(reference_path), stdout, emit: alignToRef_bam
     path("${sample_name}.bam.bai", emit: alignToRef_bai)
     path("${sample_name}_alignmentStats.json", emit: alignToRef_json)
     path "${sample_name}_err.json", emit: alignToRef_log optional true
@@ -63,9 +63,8 @@ process alignToRef {
 
     """
     echo $reference_path
-    cp ${reference_path} ${sample_name}.fa
 
-    minimap2 -ax sr ${sample_name}.fa -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference ${sample_name}.fa - minimap.bam
+    minimap2 -ax sr $reference_path -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference $reference_path - minimap.bam
 
     java -jar /usr/local/bin/picard.jar AddOrReplaceReadGroups INPUT=minimap.bam OUTPUT=${bam} RGID=${sample_name} RGLB=lib RGPL=Illumina RGPU=unit RGSM=sample
 
@@ -206,7 +205,7 @@ process callVarsCortex {
 
 process minos {
     /**
-    * @QCcheckpoint check if top species is TB, if yes pass vcf to gnomonicus
+    * @QCcheckpoint check if top species is TB, if yes pass vcf to resistance profiling
     */
 
     tag { sample_name }
@@ -241,7 +240,7 @@ process minos {
 
     cp ${sample_name}_report.json ${sample_name}_report_previous.json
 
-    if [[ \$top_hit =~ ^"Mycobacterium tuberculosis" ]]; then printf "CREATE_ANTIBIOGRAM_${sample_name}"; else echo '{"gnomonicus-warning":"sample is not TB so cannot produce antibiogram using gnomonicus"}' | jq '.' > ${error_log} && printf "no" && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
+    if [[ \$top_hit =~ ^"Mycobacterium tuberculosis" ]]; then printf "CREATE_ANTIBIOGRAM_${sample_name}"; else echo '{"resistance-profiling-warning":"sample is not TB so cannot produce antibiogram using resistance profiling tools"}' | jq '.' > ${error_log} && printf "no" && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
     """
 
     stub:
@@ -296,7 +295,7 @@ process gvcf {
 
     cp ${sample_name}_report.json ${sample_name}_report_previous.json
 
-    if [ ${params.vcfmix} == "no" ] && [ ${params.gnomonicus} == "no" ]; then echo '{"complete":"workflow complete without error"}' | jq '.' > ${error_log} && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
+    if [ ${params.vcfmix} == "no" ] && [ ${params.resistance_profiler} == "none" ]; then echo '{"complete":"workflow complete without error"}' | jq '.' > ${error_log} && jq -s ".[0] * .[1]" ${error_log} ${sample_name}_report_previous.json > ${report_json}; fi
     """
 
     stub:
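
The two shell conditionals changed in `minos` and `gvcf` decide whether a sample goes on to resistance profiling and whether the workflow can already be marked complete. The sketch below isolates that logic. Both function names are hypothetical, and the JSON report updates done with `jq` in the real processes are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the conditionals in the minos and gvcf process scripts.

# Print the antibiogram token when the top hit is TB, otherwise "no" (as minos writes to stdout).
antibiogram_signal() {
    local top_hit="$1" sample_name="$2"
    # The quoted part of the regex is matched literally; ^ anchors it to the start.
    if [[ $top_hit =~ ^"Mycobacterium tuberculosis" ]]; then
        printf 'CREATE_ANTIBIOGRAM_%s' "$sample_name"
    else
        printf 'no'
    fi
}

# Succeed only when both downstream steps are disabled, so nothing remains to run after gvcf.
workflow_complete_at_gvcf() {
    local vcfmix="$1" resistance_profiler="$2"
    [ "$vcfmix" = "no" ] && [ "$resistance_profiler" = "none" ]
}

antibiogram_signal "Mycobacterium tuberculosis" sampleA; echo
antibiogram_signal "Mycobacterium avium" sampleA; echo
if workflow_complete_at_gvcf no none; then echo "complete"; fi
if ! workflow_complete_at_gvcf no tb-profiler; then echo "resistance profiling still to run"; fi
```

Note that the completion test now compares against `"none"` rather than `"no"`, matching the values accepted by `--resistance_profiler`.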