Commit 3d7e723

Merge pull request #113 from Pathogen-Genomics-Cymru/validate_test
Validate test
2 parents 6ea6971 + 9eeda4e commit 3d7e723

21 files changed, +437 -398 lines changed

.github/workflows/build-push-quay.yml

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ on:
   push:
     branches:
       - main
+      - validate_test
     paths:
       - '**/Dockerfile*'
       - "bin/"

README.md

Lines changed: 40 additions & 42 deletions
@@ -53,48 +53,46 @@ By default, the pipeline will just run on the local machine. To run on a cluster
 Minimum recommended requirements: 32GB RAM, 8CPU

 ## Paramaters ##
-The following parameters should be set in `nextflow.config` or specified on the command line:
-
-* **input_dir**<br />
-Directory containing fastq OR bam files
-* **filetype**<br />
-File type in input_dir. Either "fastq" or "bam"
-* **pattern**<br />
-Regex to match fastq files in input_dir, e.g. "*_R{1,2}.fq.gz". Only mandatory if --filetype is "fastq"
-* **output_dir**<br />
-Output directory for results
-* **unmix_myco**<br />
-Do you want to disambiguate mixed-mycobacterial samples by read alignment? Either "yes" or "no":
-* If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so WILL ALMOST CERTAINLY ALSO reduce coverage of the principal species
-* If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria will still be disambiguated
-* **species**<br />
-Principal species in each sample, assuming genus Mycobacterium. Default 'null'. If parameter used, takes 1 of 10 values: abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis. Using this parameter will apply an additional sanity test to your sample
-* If you DO NOT use this parameter (default option), pipeline will determine principal species from the reads and consider any other species a contaminant
-* If you DO use this parameter, pipeline will expect this to be the principal species. It will fail the sample if reads from this species are not actually the majority
-* **kraken_db**<br />
-Directory containing `*.k2d` Kraken2 database files (k2_pluspf_16gb recommended, obtain from https://benlangmead.github.io/aws-indexes/k2)
-* **bowtie2_index**<br />
-Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19_1kgmaj_bt2.zip). The specified path should NOT include the index name
-* **bowtie_index_name**<br />
-Name of the bowtie index, e.g. hg19_1kgmaj<br />
-* **vcfmix**<br />
-Run [vcfmix](https://github.com/AlexOrlek/VCFMIX), yes or no. Set to no for synthetic samples<br />
-* **resistance_profiler**<br />
-Run resistance profiling for Mycobacterium tubercuclosis. Either ["tb-profiler"](https://tbdr.lshtm.ac.uk/), ["tbtamr"](https://github.com/MDU-PHL/tbtamr) or "none".
-* **afanc_myco_db**<br />
-Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
-* **update_tbprofiler**<br />
-Update tb-profiler. Either "yes" or "no". "yes" may be useful when running outside of a container for the first time as we will not have constructed a tb-profiler database matching our reference. This is not needed with the climb, docker and singluarity profiles as the reference has already been added. Alternatively you can run ```tb-profiler update_tbdb --match_ref <lodestone_dir>/resources/tuberculosis.fasta```.
-* **refseq**<br />
-Path to assembly summary refseq file (taken from [here](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt)). A local version is stored for reproducibility purposes in ```resources/``` but for best results download the latest version. Instead of downloading, the link can be supplied directly in the refseq argument e.g. `--refseq "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txtftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt"`
-* **permissive**<br />
-One of "yes" or "no". If "yes", continue to clockwork flags will be ignored and alignment will be performed anyway. If there are not enough reads and/or not a reference found the programme will still exit.
-* **collate**<br />
-One of "yes" or "no". If "yes" collate function will be ran to collect all resistance profiling reports. Will be outputted to the base level output directory (e.g. ```output/tbprofiler.variants.csv```)
-
-For more information on the parameters run `nextflow run main.nf --help`
-
-The path to the singularity images can also be changed in the singularity profile in `nextflow.config`. Default value is `${baseDir}/singularity`
+The following parameters should be set in `nextflow.config`. They can be accessed by `nextflow run main.nf --help`:
+
+```
+--input_dir [string] Input directory containing FASTQs or BAMs
+--pattern [string] Glob pattern for FASTQs or BAM
+--output_dir [string] Output directory
+--permissive [boolean] Flag. If True, errors in decontamination will be demoted to warnings
+--filetype [string] Either "fastq" or "bam". Assumes FASTQs are PE Illumina reads and BAMs are mapped against one of the references in resources/ (accepted: bam, fastq) [default: fastq]
+--unmix_myco [boolean] Flag. If True then minority Mycobacteriaceae reads will be removed. If False, they will be discarded
+--species [string] Species which will be mapped against, corresponding to references in resources/: can be one of abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis or null. If 'null' the top hit as determined by Afanc will be used (accepted: null, abscessus, africanum,
+avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis)
+--sing_dir [string] Directory to singularity definition files. Used to parse versions for reporting [default: ${baseDir}/resources]
+--config_file [string] Path to Nextflow config file. Used for parsing arguments to write to results if needed [default: ${baseDir}/nextflow.config]
+--help [boolean, string] Show the help message for all top level parameters. When a parameter is given to `--help`, the full help message of that parameter will be printed.
+--helpFull [boolean] Show the help message for all non-hidden parameters.
+--showHidden [boolean] Show all hidden parameters in the help message. This needs to be used in combination with `--help` or `--helpFull`.
+
+resources
+--resource_dir [string] Path to resources directroy where utility files are stored [default: ${baseDir}/resources]
+--refseq [string] Path to NCBI refseq summary file [default: ${baseDir}/resources/assembly_summary_refseq.txt]
+
+resistance
+--resistance_profiler [string] Tool used for tb-profiler. Either tb-profiler or tbtamr (accepted: tb-profiler, tbtamr) [default: tb-profiler]
+--collate [boolean] Flag. If True resistance reports will be summarised
+
+bowtie
+--bowtie_index [string] Bowtie index directory [default: ${baseDir}/bowtie2/]
+--bowtie_index_name [string] Prefix for the Bowtie2 index (minus the file extensions). [default: hg19_1kgmaj]
+
+afanc
+--afanc_percent_threshold [number] Minimum percentage threshold for reads in order for a taxa to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 5]
+--afanc_n_reads_threshold [integer] Minimum reads threshold for reads in order for a taxa to be considered in Afanc [default: 500]
+--afanc_fail_percent_threshold [number] Minimum percentage threshold for reads in order for a taxa to be considered in Afanc [default: 2]
+--afanc_fail_n_reads_threshold [integer] Minimum reads threshold for reads in order for a taxa to be considered in Afanc if the pipeline has failed earlier on (for reporting) [default: 200]
+
+kraken
+--kraken_percent_threshold [number] Percentage threshold of reads required for taxa to be included in Kraken reports [default: 10]
+--kraken_n_reads_threshold [integer] Raw reads threshold required for taxa to be included in Kraken reports [default: 10000]
+--kraken_db [string] Kraken2 database path [default: kraken2/]
+```

 ## Stub runs ##
 To test the stub run:

bin/identify_tophit_and_contaminants2.py

Lines changed: 12 additions & 12 deletions
@@ -60,11 +60,11 @@ def process_requirements(args):
     if ((supposed_species != 'null') & (supposed_species not in species)):
         sys.exit('ERROR: if you provide a species ID, it must be one of either: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis')

-    if ((unmix_myco != 'yes') & (unmix_myco != 'no')):
-        sys.exit('ERROR: \'unmix myco\' should be either \'yes\' or \'no\'')
+    if ((unmix_myco != 'true') & (unmix_myco != 'false')):
+        sys.exit('ERROR: \'unmix myco\' should be either \'true\' or \'false\'')

-    if ((permissive != 'yes') & (permissive != 'no')):
-        sys.exit('ERROR: \'permissive\' should be either \'yes\' or \'no\'')
+    if ((permissive != 'true') & (permissive != 'false')):
+        sys.exit('ERROR: \'permissive\' should be either \'true\' or \'false\'')

     ## check IDs from the file names

@@ -149,7 +149,7 @@ def match_taxonomy(spec):
 # define main function to process data
 def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_myco, myco_dir_path, prev_species_json_path, urls, tax_ids, sample_id, permissive):

-    if permissive == "yes":
+    if permissive == "true":
         permissive = True
     else:
         permissive = False
@@ -312,7 +312,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
     if len(re_species[0]) > 1:
         contaminant_genus = re_species[0][0]
         contaminant_species = re_species[0][1]
-        if ((unmix_myco == 'no') & (match_taxonomy(top_species)) & (match_taxonomy(spec))):
+        if ((unmix_myco == 'false') & (match_taxonomy(top_species)) & (match_taxonomy(spec))):
             if spec not in ignored_mixed_myco: ignored_mixed_myco[spec] = 0
             ignored_mixed_myco[spec] += 1
         else:
@@ -446,7 +446,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
     # WHAT IS LIKELY TO HAVE HAPPENED IS THAT THE ALIGNMENT-BASED DECONTAMINATION PROCESS HAS TRIED TO DISAMBIGUATE A MIXTURE OF VERY SIMILAR MYCOBACTERIA AND INADVERTENTLY REMOVED TOO MANY READS. THERE WILL BE NOTHING SUBSTANTIVE LEFT FOR AFANC TO CLASSIFY.
     if ((num_afanc_species == afanc_finds_nothing) & (num_afanc_species == 1)):
         if out['summary_questions']['were_contaminants_removed'] == 'yes':
-            warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification. If this is a mixed-mycobacterial sample, then an alignment-based contaminant-removal process may not be appropriate. Suggestion: re-run with --unmix_myco 'no'")
+            warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification. If this is a mixed-mycobacterial sample, then an alignment-based contaminant-removal process may not be appropriate. Suggestion: re-run with --unmix_myco 'false'")
         elif out['summary_questions']['were_contaminants_removed'] == 'no':
             warnings.append("warning: regardless of what Kraken reports, afanc did not make a species-level mycobacterial classification")

@@ -483,20 +483,20 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
     description += "A 'reference genome' is a manually-selected community standard for that species. Note that some prokaryotes can have more than one reference genome\n"
     description += "[species] refers to what you believe this sample to be. You will be warned if this differs from the Kraken/afanc predictions\n"
     description += "By defining [species] you will automatically select this to be the genome against which reads will be aligned using Clockwork\n"
-    description += "[unmix myco] is either 'yes' or 'no', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
-    description += "If 'no', any contaminating mycobacteria will be recorded but NOT acted upon\n"
+    description += "[unmix myco] is either 'true' or 'false', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
+    description += "If 'false', any contaminating mycobacteria will be recorded but NOT acted upon\n"
     usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes] [aws_config]\n"
-    usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis yes myco_dir\n\n\n"
+    usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis true myco_dir\n\n\n"

     parser = argparse.ArgumentParser(description=description, usage=usage, formatter_class=argparse.RawTextHelpFormatter)
     parser.add_argument('afanc_json', metavar='afanc_json', type=str, help='Path to afanc json report')
     parser.add_argument('kraken_json', metavar='kraken_json', type=str, help='Path to Kraken json report')
     parser.add_argument('assembly_file', metavar='assembly_file', type=str, help='Path to RefSeq assembly summary file')
     parser.add_argument('species', metavar='species', type=str, help='Refers to what you believe this sample to be. You will be warned if this differs from the Kraken/afanc predictions')
-    parser.add_argument('unmix_myco', metavar='unmix_myco', type=str, help='Is either \'yes\' or \'no\', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\nIf \'no\', any contaminating mycobacteria will be recorded but NOT acted upon')
+    parser.add_argument('unmix_myco', metavar='unmix_myco', type=str, help='Is either \'true\' or \'false\', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\nIf \'false\', any contaminating mycobacteria will be recorded but NOT acted upon')
     parser.add_argument('myco_dir', metavar='myco_dir', type=str, help='Path to myco directory')
     parser.add_argument('prev_species_json', metavar='prev_species_json', type=str, help='Path to previous species json file. Can be set to \'null\'')
-    parser.add_argument('permissive', metavar='permissive', type=str, help="Is either \'yes\' or \'no\', given in response to the question: do you want to carry on to Clockwork regardless of errors?")
+    parser.add_argument('permissive', metavar='permissive', type=str, help="Is either \'true\' or \'false\', given in response to the question: do you want to carry on to Clockwork regardless of errors?")
     parser.add_argument('pass_number', metavar='pass_number', type=int, help="Pass number. Refers to what pass of decontamination the pipeline is on")
     args = parser.parse_args()

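Review note: the hunks above keep `unmix_myco` and `permissive` as plain strings and check them by hand in `process_requirements`. As a possible alternative, and not something this commit does, argparse's `choices` option could reject any other value at parse time. The sketch below is a trimmed, hypothetical parser (`build_parser` is an illustrative name and only the two flag arguments are shown), assuming the flags still arrive as the lowercase strings 'true' and 'false'.

```python
import argparse


def build_parser():
    """Hypothetical, trimmed-down parser: only the two flag arguments are shown."""
    parser = argparse.ArgumentParser()
    # `choices` makes argparse itself reject anything other than 'true' or 'false'
    parser.add_argument('unmix_myco', type=str, choices=['true', 'false'],
                        help="Either 'true' or 'false'")
    parser.add_argument('permissive', type=str, choices=['true', 'false'],
                        help="Either 'true' or 'false'")
    return parser


# Parse an explicit argument list so the example runs without a command line
args = build_parser().parse_args(['true', 'false'])

# Coerce the validated strings into real booleans once, at the boundary
unmix_myco = args.unmix_myco == 'true'
permissive = args.permissive == 'true'
print(unmix_myco, permissive)
```

With this approach an old-style value such as `yes` would make argparse print a usage error and exit, so the separate string comparison would no longer be needed for these two arguments.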
bin/parse_kraken_report2.py

Lines changed: 3 additions & 3 deletions
@@ -226,8 +226,8 @@ def process_requirements(args):
     if pct_threshold > 100:
         sys.exit('ERROR: %f is a %% and cannot be > 100' %(pct_threshold))

-    if ((permissive != 'yes') & (permissive != 'no')):
-        sys.exit('ERROR: \'permissive\' should be either \'yes\' or \'no\'')
+    if ((permissive != 'true') & (permissive != 'false')):
+        sys.exit('ERROR: \'permissive\' should be either \'true\' or \'false\'')

     return

@@ -258,7 +258,7 @@ def process_requirements(args):
     permissive = sys.argv[5]

     #coerce permissive into a bool
-    if permissive == "yes":
+    if permissive == "true":
         permissive = True
     else:
         permissive = False

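Review note: both scripts in this commit repeat the same two steps for each flag, first checking that the string is 'true' or 'false' and then coercing it to a bool. A minimal sketch of how those steps could be folded into one function is given below. `coerce_flag` is a hypothetical helper that does not exist in the repository; it only mirrors the error message style used in the hunks above.

```python
import sys


def coerce_flag(name, value):
    """Hypothetical helper: validate a 'true'/'false' string and return a bool.

    Exits with an error message in the same style as the checks in the diff
    when the value is anything else (for example the old 'yes'/'no').
    """
    if value not in ('true', 'false'):
        sys.exit("ERROR: '%s' should be either 'true' or 'false'" % name)
    return value == 'true'


# Example: the string passed on the command line becomes a real boolean
permissive = coerce_flag('permissive', 'true')
print(permissive)
```

Keeping the check and the coercion together would mean a future change to the accepted values only has to be made in one place per script.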
config/containers.config

Lines changed: 3 additions & 3 deletions
@@ -1,15 +1,15 @@
 process {

     withLabel:getversion {
-        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r1"
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r2"
     }

     withLabel:preprocessing {
-        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9"
+        container = "quay.io/pathogen-genomics-cymru/preprocessing:0.9.9r2"
     }

     withLabel:tbprofiler {
-        container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.9r1"
+        container = "quay.io/pathogen-genomics-cymru/tbprofiler:0.9.9"
     }

     withLabel:tbtamr {

docker/Dockerfile.tbprofiler-0.9.9

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ENV tbdb_version=a5e1d48 \
 #USER root
 WORKDIR /
 ENV TMPDIR="/data"
-ARG TBPROFILER_VER="6.2.1"
+ARG TBPROFILER_VER="2c92475"

 # this version is the shortened commit hash on the `master` branch here https://github.com/jodyphelan/tbdb/
 # commits are found on https://github.com/jodyphelan/tbdb/commits/master

docker/Dockerfile.tbprofiler-0.9.9r1

Lines changed: 0 additions & 62 deletions
This file was deleted.
