Pathogen-Genomics-Cymru
diff --git a/.github/workflows/build-push-quay.yml b/.github/workflows/build-push-quay.yml
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 40 additions & 42 deletions
diff --git a/bin/identify_tophit_and_contaminants2.py b/bin/identify_tophit_and_contaminants2.py
Lines changed: 12 additions & 12 deletions
diff --git a/bin/parse_kraken_report2.py b/bin/parse_kraken_report2.py
Lines changed: 3 additions & 3 deletions
diff --git a/config/containers.config b/config/containers.config
Lines changed: 3 additions & 3 deletions
diff --git a/docker/Dockerfile.preprocessing-0.9.9r1 b/docker/Dockerfile.preprocessing-0.9.9r2
Renamed: docker/Dockerfile.preprocessing-0.9.9r1 to docker/Dockerfile.preprocessing-0.9.9r2
diff --git a/docker/Dockerfile.tbprofiler-0.9.9 b/docker/Dockerfile.tbprofiler-0.9.9
Lines changed: 1 addition & 1 deletion
diff --git a/docker/Dockerfile.tbprofiler-0.9.9r1 b/docker/Dockerfile.tbprofiler-0.9.9r1
Lines changed: 0 additions & 62 deletions
@@ -3,6 +3,7 @@ on:
   push:
     branches:
       - main
+      - validate_test
     paths:
       - '**/Dockerfile*'
       - "bin/"
 
@@ -53,48 +53,46 @@ By default, the pipeline will just run on the local machine. To run on a cluster
 Minimum recommended requirements: 32GB RAM, 8CPU
 
 ## Parameters ##
-The following parameters should be set in `nextflow.config` or specified on the command line:
-
-* **input_dir**<br /> 
-Directory containing fastq OR bam files
-* **filetype**<br />
-File type in input_dir. Either "fastq" or "bam"
-* **pattern**<br />
-Regex to match fastq files in input_dir, e.g. "*_R{1,2}.fq.gz". Only mandatory if --filetype is "fastq"
-* **output_dir**<br />
-Output directory for results
-* **unmix_myco**<br />
-Do you want to disambiguate mixed-mycobacterial samples by read alignment? Either "yes" or "no":
-  * If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so WILL ALMOST CERTAINLY ALSO reduce coverage of the principal species
-  * If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria will still be disambiguated
-* **species**<br />
-Principal species in each sample, assuming genus Mycobacterium. Default 'null'. If parameter used, takes 1 of 10 values: abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis. Using this parameter will apply an additional sanity test to your sample
-  * If you DO NOT use this parameter (default option), pipeline will determine principal species from the reads and consider any other species a contaminant
-  * If you DO use this parameter, pipeline will expect this to be the principal species. It will fail the sample if reads from this species are not actually the majority
-* **kraken_db**<br />
-Directory containing `*.k2d` Kraken2 database files (k2_pluspf_16gb recommended, obtain from https://benlangmead.github.io/aws-indexes/k2)
-* **bowtie2_index**<br />
-Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19_1kgmaj_bt2.zip). The specified path should NOT include the index name
-* **bowtie_index_name**<br />
-Name of the bowtie index, e.g. hg19_1kgmaj<br />
-* **vcfmix**<br />
-Run [vcfmix](https://github.com/AlexOrlek/VCFMIX), yes or no. Set to no for synthetic samples<br />
-* **resistance_profiler**<br />
-Run resistance profiling for Mycobacterium tubercuclosis. Either ["tb-profiler"](https://tbdr.lshtm.ac.uk/), ["tbtamr"](https://github.com/MDU-PHL/tbtamr) or "none".
-* **afanc_myco_db**<br />
-Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from  https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
-* **update_tbprofiler**<br />
-Update tb-profiler. Either "yes" or "no". "yes" may be useful when running outside of a container for the first time as we will not have constructed a tb-profiler database matching our reference. This is not needed with the climb, docker and singluarity profiles as the reference has already been added. Alternatively you can run ```tb-profiler update_tbdb --match_ref <lodestone_dir>/resources/tuberculosis.fasta```.
-* **refseq**<br />
-Path to assembly summary refseq file (taken from [here](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt)). A local version is stored for reproducibility purposes in ```resources/``` but for best results download the latest version. Instead of downloading, the link can be supplied directly in the refseq argument e.g. `--refseq "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txtftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"`
-* **permissive**<br />
-One of "yes" or "no". If "yes", continue to clockwork flags will be ignored and alignment will be performed anyway. If there are not enough reads and/or not a reference found the programme will still exit.
-* **collate**<br />
-One of "yes" or "no". If "yes" collate function will be ran to collect all resistance profiling reports. Will be outputted to the base level output directory (e.g. ```output/tbprofiler.variants.csv```)
-
-For more information on the parameters run `nextflow run main.nf --help`
-
-The path to the singularity images can also be changed in the singularity profile in `nextflow.config`. Default value is `${baseDir}/singularity`
+The following parameters should be set in `nextflow.config`. To list them, run `nextflow run main.nf --help`:
+
+```
+--input_dir                      [string]          Input directory containing FASTQs or BAMs
+--pattern                        [string]          Glob pattern for FASTQs or BAMs
+--output_dir                     [string]          Output directory
+--permissive                     [boolean]         Flag. If True, errors in decontamination will be demoted to warnings
+--filetype                       [string]          Either "fastq" or "bam". Assumes FASTQs are PE Illumina reads and BAMs are mapped against one of the references in resources/  (accepted: bam, fastq) [default: fastq]
+--unmix_myco                     [boolean]         Flag. If True then minority Mycobacteriaceae reads will be removed. If False, they will be recorded but left in place
+--species                        [string]          Species which will be mapped against, corresponding to references in resources/: can be one of abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis or null. If 'null' the top hit as determined by Afanc will be used  (accepted: null, abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis)
+--sing_dir                       [string]          Directory to singularity definition files. Used to parse versions for reporting [default: ${baseDir}/resources]
+--config_file                    [string]          Path to Nextflow config file. Used for parsing arguments to write to results if needed [default: ${baseDir}/nextflow.config]
+--help                           [boolean, string] Show the help message for all top level parameters. When a parameter is given to `--help`, the full help message of that parameter will be printed.
+--helpFull                       [boolean]         Show the help message for all non-hidden parameters.
+--showHidden                     [boolean]         Show all hidden parameters in the help message. This needs to be used in combination with `--help` or `--helpFull`.
+
+resources
+  --resource_dir                 [string] Path to resources directory where utility files are stored [default: ${baseDir}/resources]
+  --refseq                       [string] Path to NCBI refseq summary file [default: ${baseDir}/resources/assembly_summary_refseq.txt]
+
+resistance
+  --resistance_profiler          [string]  Tool used for resistance profiling. Either tb-profiler or tbtamr  (accepted: tb-profiler, tbtamr) [default: tb-profiler]
+  --collate                      [boolean] Flag. If True resistance reports will be summarised
+
+bowtie
+  --bowtie_index                 [string] Bowtie index directory [default: ${baseDir}/bowtie2/]
+  --bowtie_index_name            [string] Prefix for the Bowtie2 index (minus the file extensions). [default: hg19_1kgmaj]
+
+afanc
+  --afanc_percent_threshold      [number]  Minimum percentage threshold for reads in order for a taxon to be considered in Afanc [default: 5]
+  --afanc_n_reads_threshold      [integer] Minimum reads threshold for reads in order for a taxon to be considered in Afanc [default: 500]
+  --afanc_fail_percent_threshold [number]  Minimum percentage threshold for reads in order for a taxon to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 2]
+  --afanc_fail_n_reads_threshold [integer] Minimum reads threshold for reads in order for a taxon to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 200]
+
+kraken
+  --kraken_percent_threshold     [number]  Percentage threshold of reads required for taxa to be included in Kraken reports [default: 10]
+  --kraken_n_reads_threshold     [integer] Raw reads threshold required for taxa to be included in Kraken reports [default: 10000]
+  --kraken_db                    [string]  Kraken2 database path [default: kraken2/]
+```
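
The two Kraken thresholds listed above can be read as a filter over reported taxa. The following is a minimal sketch only, assuming a taxon is kept when it meets both thresholds (the diff does not state whether the pipeline requires both or either, nor whether the comparison is inclusive). The `Taxon` record and the `filter_taxa` helper are hypothetical names used for illustration and are not part of the pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Taxon:
    """Hypothetical record for one line of a Kraken-style report."""
    name: str
    percent: float  # percentage of reads assigned to this taxon
    n_reads: int    # raw number of reads assigned to this taxon


def filter_taxa(
    taxa: List[Taxon],
    percent_threshold: float = 10.0,   # mirrors --kraken_percent_threshold default
    n_reads_threshold: int = 10000,    # mirrors --kraken_n_reads_threshold default
) -> List[Taxon]:
    """Keep taxa that meet BOTH thresholds (an assumption, see lead-in)."""
    return [
        t for t in taxa
        if t.percent >= percent_threshold and t.n_reads >= n_reads_threshold
    ]


# Illustrative, made-up read counts.
example = [
    Taxon("Mycobacterium tuberculosis", 85.0, 120000),
    Taxon("Homo sapiens", 12.0, 9000),
    Taxon("Escherichia coli", 3.0, 15000),
]
kept = filter_taxa(example)
```

With the defaults, only the first record in `example` passes, because the second falls short on raw reads and the third on percentage.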
 
 ## Stub runs ##
 To test the stub run:
 
@@ -60,11 +60,11 @@ def process_requirements(args):
     if ((supposed_species != 'null') & (supposed_species not in species)):
         sys.exit('ERROR: if you provide a species ID, it must be one of either: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis')
 
-    if ((unmix_myco != 'yes') & (unmix_myco != 'no')):
-        sys.exit('ERROR: \'unmix myco\' should be either \'yes\' or \'no\'')
+    if ((unmix_myco != 'true') & (unmix_myco != 'false')):
+        sys.exit('ERROR: \'unmix myco\' should be either \'true\' or \'false\'')
 
-    if ((permissive != 'yes') & (permissive != 'no')):
-        sys.exit('ERROR: \'permissive\' should be either \'yes\' or \'no\'')
+    if ((permissive != 'true') & (permissive != 'false')):
+        sys.exit('ERROR: \'permissive\' should be either \'true\' or \'false\'')
 
     ## check IDs from the file names
 
@@ -149,7 +149,7 @@ def match_taxonomy(spec):
 # define main function to process data
 def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_myco, myco_dir_path, prev_species_json_path, urls, tax_ids, sample_id, permissive):
 
-    if permissive == "yes":
+    if permissive == "true":
         permissive = True
     else:
         permissive = False
@@ -312,7 +312,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
             if len(re_species[0]) > 1:
                  contaminant_genus = re_species[0][0]
                  contaminant_species = re_species[0][1]
-            if ((unmix_myco == 'no') & (match_taxonomy(top_species)) & (match_taxonomy(spec))):
+            if ((unmix_myco == 'false') & (match_taxonomy(top_species)) & (match_taxonomy(spec))):
                 if spec not in ignored_mixed_myco: ignored_mixed_myco[spec] = 0
                 ignored_mixed_myco[spec] += 1
             else:
@@ -446,7 +446,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
     # WHAT IS LIKELY TO HAVE HAPPENED IS THAT THE ALIGNMENT-BASED DECONTAMINATION PROCESS HAS TRIED TO DISAMBIGUATE A MIXTURE OF VERY SIMILAR MYCOBACTERIA AND INADVERTENTLY REMOVED TOO MANY READS. THERE WILL BE NOTHING SUBSTANTIVE LEFT FOR AFANC TO CLASSIFY.
     if ((num_afanc_species == afanc_finds_nothing) & (num_afanc_species == 1)):
         if out['summary_questions']['were_contaminants_removed'] == 'yes':
-            warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification. If this is a mixed-mycobacterial sample, then an alignment-based contaminant-removal process may not be appropriate. Suggestion: re-run with --unmix_myco 'no'")
+            warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification. If this is a mixed-mycobacterial sample, then an alignment-based contaminant-removal process may not be appropriate. Suggestion: re-run with --unmix_myco 'false'")
         elif out['summary_questions']['were_contaminants_removed'] == 'no':
             warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification")
 
@@ -483,20 +483,20 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
     description += "A 'reference genome' is a manually-selected community standard for that species. Note that some prokaryotes can have more than one reference genome\n"
     description += "[species] refers to what you believe this sample to be. You will be warned if this differs from the Kraken/afanc predictions\n"
     description += "By defining [species] you will automatically select this to be the genome against which reads will be aligned using Clockwork\n"
-    description += "[unmix myco] is either 'yes' or 'no', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
-    description += "If 'no', any contaminating mycobacteria will be recorded but NOT acted upon\n"
+    description += "[unmix myco] is either 'true' or 'false', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
+    description += "If 'false', any contaminating mycobacteria will be recorded but NOT acted upon\n"
     usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes] [aws_config]\n"
-    usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis yes myco_dir\n\n\n"
+    usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis true myco_dir\n\n\n"
 
     parser = argparse.ArgumentParser(description=description, usage=usage, formatter_class=argparse.RawTextHelpFormatter)
     parser.add_argument('afanc_json', metavar='afanc_json', type=str, help='Path to afanc json report')
     parser.add_argument('kraken_json', metavar='kraken_json', type=str, help='Path to Kraken json report')
     parser.add_argument('assembly_file', metavar='assembly_file', type=str, help='Path to RefSeq assembly summary file')
     parser.add_argument('species', metavar='species', type=str, help='Refers to what you believe this sample to be. You will be warned if this differs from the Kraken/afanc predictions')
-    parser.add_argument('unmix_myco', metavar='unmix_myco', type=str, help='Is either \'yes\' or \'no\', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\nIf \'no\', any contaminating mycobacteria will be recorded but NOT acted upon')
+    parser.add_argument('unmix_myco', metavar='unmix_myco', type=str, help='Is either \'true\' or \'false\', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\nIf \'false\', any contaminating mycobacteria will be recorded but NOT acted upon')
     parser.add_argument('myco_dir', metavar='myco_dir', type=str, help='Path to myco directory')
     parser.add_argument('prev_species_json', metavar='prev_species_json', type=str, help='Path to previous species json file. Can be set to \'null\'')
-    parser.add_argument('permissive', metavar='permissive', type=str, help="Is either \'yes\' or \'no\', given in response to the question: do you want to carry on to Clockwork regardless of errors?")
+    parser.add_argument('permissive', metavar='permissive', type=str, help="Is either \'true\' or \'false\', given in response to the question: do you want to carry on to Clockwork regardless of errors?")
     parser.add_argument('pass_number', metavar='pass_number', type=int, help="Pass number. Refers to what pass of decontamination the pipeline is on")
     args = parser.parse_args()
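
The hunks above validate the `'true'`/`'false'` strings in `process_requirements` and then coerce them to booleans separately in `process_reports`. One way to do both in a single place is an argparse `type=` converter. This is a sketch under assumptions, not the script's actual implementation: `str_to_bool` and `build_parser` are hypothetical helpers, and only the two flag arguments are shown.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert the lowercase strings 'true'/'false' to a bool.

    Anything else (including 'yes'/'no' or 'True') is rejected, matching
    the strict comparison used in the diff.
    """
    if value == "true":
        return True
    if value == "false":
        return False
    raise argparse.ArgumentTypeError(
        f"expected 'true' or 'false', got {value!r}"
    )


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two flag-like positionals."""
    parser = argparse.ArgumentParser()
    parser.add_argument("unmix_myco", type=str_to_bool,
                        help="'true' or 'false'")
    parser.add_argument("permissive", type=str_to_bool,
                        help="'true' or 'false'")
    return parser


# Parse an explicit argument list rather than sys.argv so the sketch is testable.
args = build_parser().parse_args(["true", "false"])
```

Because argparse calls the converter while parsing, an invalid value such as `yes` stops the script with a usage error before any report processing starts, and downstream code receives real booleans rather than strings.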
 
 
@@ -226,8 +226,8 @@ def process_requirements(args):
     if pct_threshold > 100:
         sys.exit('ERROR: %f is a %% and cannot be > 100' %(pct_threshold))
 
-    if ((permissive != 'yes') & (permissive != 'no')):
-        sys.exit('ERROR: \'permissive\' should be either \'yes\' or \'no\'')
+    if ((permissive != 'true') & (permissive != 'false')):
+        sys.exit('ERROR: \'permissive\' should be either \'true\' or \'false\'')
 
     return
 
@@ -258,7 +258,7 @@ def process_requirements(args):
     permissive = sys.argv[5]
 
 	#coerce permissive into a bool
-    if permissive == "yes":
+    if permissive == "true":
         permissive = True
     else:
         permissive = False
 
@@ -1,15 +1,15 @@
 process { 
 
     withLabel:getversion {
-        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r1"
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r2"
     }
 
     withLabel:preprocessing {
-        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9"
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r2"
     }
 
     withLabel:tbprofiler {
-        container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.9r1"
+        container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.9"
     }
 
     withLabel:tbtamr {
 
@@ -11,7 +11,7 @@ ENV tbdb_version=a5e1d48 \
 #USER root
 WORKDIR /
 ENV TMPDIR="/data"
-ARG TBPROFILER_VER="6.2.1"
+ARG TBPROFILER_VER="2c92475"
 
 # this version is the shortened commit hash on the `master` branch here https://github.com/jodyphelan/tbdb/
 # commits are found on https://github.com/jodyphelan/tbdb/commits/master