This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data using the following steps:
- fastp - a tool for all-in-one FASTQ processing, including quality filtering, adaptor-trimming, and quality-trimming, as well as quality profiling
- Samtools stats - a tool for collecting statistics from BAM files and outputting them in a text format
- Mash - a tool for species screening via fast genome and metagenome distance estimation using MinHash
- Shovill - an assembly tool for Illumina paired-end reads
- QUAST - a tool for evaluating assemblies through calculating and reporting quality metrics
- Snippy - a tool for rapid haploid variant calling and core genome alignment
- mlst - a tool for scanning contigs against PubMLST typing schemes.
- NGMASTER - a tool for performing multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) and Neisseria gonorrhoeae sequence typing for antimicrobial resistance (NG-STAR)
- BLASTn - basic local alignment search tool (BLAST) for comparing nucleotide sequences to those in a database.
- Samtools depth - a tool for calculating the read depth at a given position from an alignment.
- snp-dists - a tool for generating a SNP distance matrix from a FASTA core alignment
- Gubbins - a tool for marking recombination regions and constructing a phylogeny based on mutations outside of those regions
- RAxML - a tool for performing Maximum Likelihood based inference of large phylogenetic trees
- Gotree - a tool for manipulating phylogenetic trees and generating visualizations
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
- QC_FASTQ – neissflow_FASTQ_QC_report.tsv
- MASH – neissflow_Mash_contaminants.tsv, neissflow_Mash_top_hit_report.tsv, neissflow_Mash_plasmids.tsv
- processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt
- amr_profiler - neissflow_amr_report.tsv, neissflow_avg_depth_report.tsv
- This report is present if the QC profile is included in the run. This report contains the same columns as run-name_final_report.tsv
In this directory, there are 2 subdirectories, named Reports and Samples.
Samples - contains subdirectories for each of the isolates, which contain the trimmed & filtered paired read files produced by fastp for that isolate.
Reports - contains the files neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc1.tsv, and neissflow_passed_qc1.tsv.
- neissflow_FASTQ_QC_report.tsv - aggregated report containing the sequence reports generated by fastp for all of the samples. This report is used for the first QC check to remove low quality isolates from analysis.
- neissflow_failed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the low-quality samples that failed the first QC check
- neissflow_passed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the samples that passed the first QC check and moved on for further analysis
The columns for the following metrics should be found in neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc1.tsv, and neissflow_passed_qc1.tsv (this information can also be found in the fastp repository):
| Metric | Description |
|---|---|
| Isolate | Isolate ID |
| before_filtering_total_reads | Total number of reads in the isolate FASTQ file before quality filtering and trimming |
| before_filtering_total_bases | Total number of bases in the isolate FASTQ file before quality filtering and trimming |
| before_filtering_q20_bases | Total number of bases with a Phred score of >= Q20 before quality filtering and trimming |
| before_filtering_q30_bases | Total number of bases with a Phred score of >= Q30 before quality filtering and trimming |
| before_filtering_read1_mean_length | Mean length of reads for the R1 file for the isolate paired-end read files before quality filtering and trimming |
| before_filtering_read2_mean_length | Mean length of reads for the R2 file for the isolate paired-end read files before quality filtering and trimming |
| before_filtering_gc_content | Fraction of bases that are guanine or cytosine before quality filtering and trimming |
| after_filtering_total_reads | Total number of reads in the isolate FASTQ file after quality filtering and trimming |
| after_filtering_total_bases | Total number of bases in the isolate FASTQ file after quality filtering and trimming |
| after_filtering_q20_bases | Total number of bases with a Phred score of >= Q20 after quality filtering and trimming |
| after_filtering_q30_bases | Total number of bases with a Phred score of >= Q30 after quality filtering and trimming |
| after_filtering_read1_mean_length | Mean length of reads for the R1 file for the isolate paired-end read files after quality filtering and trimming |
| after_filtering_read2_mean_length | Mean length of reads for the R2 file for the isolate paired-end read files after quality filtering and trimming |
| after_filtering_gc_content | Fraction of bases that are guanine or cytosine after quality filtering and trimming |
| passed_filter_reads | Total number of reads in the isolate FASTQ file that passed the quality filter |
| low_quality_reads | Total number of reads in the isolate FASTQ file that did NOT pass the quality filter |
| too_many_N_reads | Total number of reads that contain too many “N” bases (bases that could not be called by the sequencer) |
| too_short_reads | Total number of reads that were filtered out due to having length <= 100 bases |
| too_long_reads | Total number of reads that were filtered out due to being too long (limit not currently set) |
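
As an illustration of how the columns above can be used, the following minimal sketch derives the fraction of bases at or above Q30 for each isolate. The two-row excerpt is hypothetical and contains only the columns needed here; the function names are not part of the pipeline.

```python
import csv
import io


def load_report(handle):
    """Read a tab-separated report into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def q30_fraction(row, stage="after_filtering"):
    """Fraction of bases with Phred score >= Q30 for one report row.

    `stage` is either "before_filtering" or "after_filtering",
    matching the column prefixes in the table above.
    """
    total = float(row[f"{stage}_total_bases"])
    q30 = float(row[f"{stage}_q30_bases"])
    return q30 / total if total else 0.0


# Hypothetical excerpt; a real report has all of the columns listed above.
EXAMPLE = (
    "Isolate\tafter_filtering_total_bases\tafter_filtering_q30_bases\n"
    "iso1\t1000\t900\n"
    "iso2\t2000\t1000\n"
)

rows = load_report(io.StringIO(EXAMPLE))
for row in rows:
    print(row["Isolate"], round(q30_fraction(row), 3))
```

To run this against a real report, pass an opened file handle for neissflow_FASTQ_QC_report.tsv to `load_report` instead of the in-memory example.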
In this directory, there are subdirectories for each reference used (default is FA19 and AMR), which contain directories for each of the isolates in the collection.
The isolate directories contain all of the Snippy output for that specific isolate. Below is a table of the produced files (this information can also be found in the Snippy repository):
| Extension | Description |
|---|---|
| .tab | A simple tab-separated summary of all the variants |
| .csv | A comma-separated version of the .tab file |
| .html | An HTML version of the .tab file |
| .vcf | The final annotated variants in VCF format |
| .bed | The variants in BED format |
| .gff | The variants in GFF3 format |
| .bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
| .bam.bai | Index for the .bam file |
| .log | A log file with the commands run and their outputs |
| .aligned.fa | A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants) |
| .consensus.fa | A version of the reference genome with all variants instantiated |
| .consensus.subs.fa | A version of the reference genome with only substitution variants instantiated |
| .raw.vcf | The unfiltered variant calls from Freebayes |
| .filt.vcf | The filtered variant calls from Freebayes |
| .vcf.gz | Compressed .vcf file via BGZIP |
| .vcf.gz.csi | Index for the .vcf.gz via bcftools index |
The TAB, CSV, and HTML files all should contain the same variant summary information in tables with the following columns:
| Name | Description |
|---|---|
| CHROM | The sequence the variant was found in, e.g. the name after the > in the FASTA reference |
| POS | Position in the sequence, counting from 1 |
| TYPE | The variant type: single nucleotide polymorphism (snp), multinucleotide polymorphism (mnp), insertion (ins), deletion (del), combination of snp/mnp (complex) |
| REF | The nucleotide(s) in the reference |
| ALT | The alternate nucleotide(s) supported by the reads |
| EVIDENCE | Frequency counts for REF and ALT |
| FTYPE | Class of feature affected: CDS tRNA rRNA ... |
| STRAND | Strand the feature was on: + - |
| NT_POS | Nucleotide position of the variant within the feature / Length in nucleotides |
| AA_POS | Residue position / Length in amino acids (only if FTYPE is CDS) |
| EFFECT | The snpEff annotated consequence of this variant |
| LOCUS_TAG | The /locus_tag of the feature (if it existed) |
| GENE | The /gene tag of the feature (if it existed) |
| PRODUCT | The /product tag of the feature (if it existed) |
In this directory, there are report text files for each of the samples, as well as a file called “neissflow_coverage.tsv”.
The “neissflow_coverage.tsv” file contains rows for each isolate in the input directory. The columns for the following fields can be found in the output file:
| Metric | Description |
|---|---|
| ID | Isolate ID |
| %target>10x | Percent of the target genome (FA19) with greater than 10x coverage |
The content contained in the report text files is not used for further analysis or QC checks at this time, but information on metrics included can be found in SOP-TBA or on the Samtools stats manual page.
In this directory there are TSV reports for each sample containing the complete Mash screening report for each respective sample, as well as neissflow_Mash_contaminants.tsv, neissflow_Mash_plasmids.tsv, and neissflow_Mash_top_hit_report.tsv.
- neissflow_Mash_top_hit_report.tsv - Reports the top species hit (greatest identity score + fraction of hashes) from the Mash screening with 1000 total hashes. (non plasmid hits)
- neissflow_Mash_contaminants.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) non-Neisseria hits from the Mash screening. (non plasmid hits)
- neissflow_Mash_plasmids.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) plasmid hits from the Mash screening.
neissflow_Mash_contaminants.tsv, neissflow_Mash_plasmids.tsv, and neissflow_Mash_top_hit_report.tsv contain the following columns:
| Metric | Description |
|---|---|
| ID | Isolate ID |
| ident | What fraction of the bases are shared between the reference genome and input reads. Sequencing errors and gaps in coverage reduce this. |
| hashes | For each k-mer in the dataset screened, a “hash” is created and compared with the k-mers or “hashes” in each reference genome in the databases. When all k-mers are shared with a particular reference genome then 1000/1000 is the value for shared hashes. |
| median_mult | Is a “rough estimate” for abundance, but it is affected by the size of the reference genome. Small genomes tend to produce large values, while having low identity and few shared hashes. |
| p_val | The probability of observing the number of shared k-mers with the estimated identity |
| hit_name | Truncated name of the reference genome. Ex: Neisseria_gonorrhoeae_MS11 |
The individual reports contain the same information; however, the columns are not labeled and the “hit_name” is the long version of the Refseq FASTA hit.
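
The "good hit" criterion described for the contaminant and plasmid reports (identity score >= 0.95 and fraction of hashes >= 0.95) can be sketched as below. This assumes the hashes column is written as `shared/total` (for example `980/1000`), as the column description suggests; the function names are illustrative and not taken from the pipeline.

```python
def hash_fraction(hashes):
    """Convert a 'shared/total' string such as '980/1000' to a fraction."""
    shared, total = hashes.split("/")
    return int(shared) / int(total)


def is_good_hit(ident, hashes, min_ident=0.95, min_hash_fraction=0.95):
    """True when both the identity and the shared-hash fraction meet the thresholds."""
    return float(ident) >= min_ident and hash_fraction(hashes) >= min_hash_fraction


# Hypothetical hits: (ident, hashes, hit_name)
hits = [
    ("0.99", "980/1000", "Neisseria_gonorrhoeae_MS11"),
    ("0.99", "900/1000", "hypothetical_low_hash_hit"),
    ("0.90", "990/1000", "hypothetical_low_identity_hit"),
]
good = [name for ident, hashes, name in hits if is_good_hit(ident, hashes)]
print(good)
```

Both thresholds are keyword arguments so that the same check can be reused if different cutoffs are wanted.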
In this directory there are FASTA assemblies for each sample, a QUAST directory, and a neissflow_Denovo_assembly_Stats_QC_report.txt file.
- QUAST - Contains subdirectories for each of the samples in the set. These directories contain output from QUAST, which are reports on the quality metrics for that sample's assembly. The description of the QUAST output can be found below:
| File | Description |
|---|---|
| report.txt | Summary table |
| report.tsv | Tab-separated version, for parsing or for spreadsheets (Google Docs, Excel, etc.) |
| report.tex | LaTeX version |
| report.pdf | PDF version, includes all tables and plots for some statistics |
| report.html | Everything in an interactive HTML file |
| icarus.html | Icarus main menu with links to interactive viewers |
| contigs_reports/misassemblies_report | Detailed report on misassemblies |
| contigs_reports/unaligned_report | Detailed report on unaligned and partially unaligned contigs |
- Sample Assemblies – FASTA files produced by shovill, follow sample_name_contigs.fa nomenclature
- neissflow_Denovo_assembly_Stats_QC_report.txt – Compiled QC metrics report for all of the isolates in the collection. This report contains the following columns:
| Metric | Description |
|---|---|
| Filename | Isolate ID |
| Contig_Count | Total number of contigs |
| Bases_In_Contigs | Total number of bases included in contigs |
| Large_Contig_Count | Total number of large contigs (length > 10,000) |
| Small_Contig_Count | Total number of small contigs (length <= 10,000) |
| >500bp_Contig_Count | Total number of contigs with >500 basepairs |
| Bases_In_Large_Contigs | Total number of bases included in large contigs (length > 10,000) |
| Bases_In_Small_Contigs | Total number of bases included in small contigs (length <= 10,000) |
| Fraction_Of_Contigs_That_Are_Large | (large contig count) / (total contig count) |
| Min_Coverage_Large_Contigs | Minimum of the coverages of the large contigs |
| Max_Ratio_of_Coverage_Large_Contigs | (maximum of the coverages of the large contigs) / (minimum of the coverages of the large contigs) |
| Low_Coverage_Contig_Count | Total number of contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2) |
| Low_Coverage_Contig_Bases | Total number of bases in contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2) |
| Mean_Coverage | (sum over every contig in the assembly of (contig length * coverage of contig)) / (total number of bases in the assembly) |
| Ambiguous_nucleotides | Total number of ambiguous nucleotides in the assembly |
| N50 | The length for which the collection of all contigs of that length or longer covers at least half the assembly |
| N75 | Similar to N50, but using 75% of the assembly covered |
| N90 | Similar to N50 and N75, but using 90% of the assembly covered |
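
The N50/N75/N90 and Mean_Coverage definitions above can be made concrete with a short sketch. This follows the definitions as written in the table; whether the pipeline computes these values with exactly this logic is an assumption, and the contig lengths and coverages are made up for illustration.

```python
def nx(lengths, fraction=0.5):
    """Length L such that contigs of length >= L cover at least `fraction` of the assembly.

    fraction=0.5 gives N50, 0.75 gives N75, 0.9 gives N90.
    """
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0  # empty assembly


def mean_coverage(contigs):
    """Length-weighted mean coverage; `contigs` is a list of (length, coverage) pairs."""
    total_bases = sum(length for length, _ in contigs)
    if total_bases == 0:
        return 0.0
    return sum(length * cov for length, cov in contigs) / total_bases


# Hypothetical assembly of five contigs (300 bases in total).
lengths = [100, 80, 60, 40, 20]
print(nx(lengths, 0.5), nx(lengths, 0.75), nx(lengths, 0.9))
print(mean_coverage([(100, 10.0), (300, 20.0)]))
```

Weighting by contig length means a short, deeply covered contig (for example a plasmid fragment) does not dominate the assembly-wide average.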
In this directory there are 2 files named neissflow_failed_qc2.tsv and neissflow_passed_qc2.tsv.
These reports are an aggregation of some of the generated reports up to this point in order to perform a species and assembly QC check.
Aggregated Reports:
- QC_FASTQ – neissflow_FASTQ_QC_report.tsv
- COVERAGE – neissflow_coverage.tsv
- MASH – neissflow_Mash_top_hit_report.tsv
- processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt
- neissflow_failed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that failed the species and assembly QC check and did not continue for further analysis.
- neissflow_passed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that passed the species and assembly QC check and continued for further analysis.
In this directory there are subdirectories for each of the isolates, as well as two aggregated reports: amr_report.tsv and avg_depth_report.tsv. These reports are based on those generated by the GC-Genome-Profiler and largely contain the same columns and information. The isolate directories contain the individual reports for that isolate, which are aggregated so that amr_report.tsv and avg_depth_report.tsv hold the results for all of the samples in the set: amr_report.tsv is an aggregation of the < isolate >_amr_report.tsv files, and avg_depth_report.tsv is an aggregation of the < isolate >_amr_depth.tsv reports. The reports found in the isolate directories are:
- < isolate >_amr_blast.tsv - this report contains the results of the blastn runs for that isolate. All of its columns are included in the amr_report.tsv report.
| Metric | Description |
|---|---|
| Sample | Isolate ID |
| penA allele | Best hit from penA allele BLAST database |
| porB allele | Best hit from porB allele BLAST database |
| mtrR mosaic | True if isolate reaches 98% match threshold |
- < isolate >_amr_depth.tsv - this report contains the depth of coverage for each of the relevant AMR genes
| Metric | Description |
|---|---|
| Sample | Isolate ID |
| ermC | Depth of coverage for the erythromycin resistance ermC gene |
| ermB | Depth of coverage for the erythromycin resistance ermB gene |
| ermF | Depth of coverage for the erythromycin resistance ermF gene |
| gyrB | Depth of coverage for the gyrB gene (positions 1547933-1550323 in FA19) |
| gyrA | Depth of coverage for the gyrA gene (positions 357412-360162 in FA19) |
| mtrR-CDEprom | Depth of coverage for the mtrR CDE promoter (positions 1110651-1110900 in FA19) |
| macA_and_prom | Depth of coverage for the macA gene and promoter (positions 1191001-1192230 in FA19) |
| norMprom | Depth of coverage for the norM promoter (positions 129494-129496 in FA19) |
| ftsX | Depth of coverage for the ftsX gene (positions 1707518-1708435 in FA19) |
| ponA | Depth of coverage for the ponA gene (positions 2078911-2081307 in FA19) |
| TetM-partial | Depth of coverage for partial reference of the tetracycline resistance determinant (tetM) gene |
| FA19_16SrRNA | Depth of coverage for 16S rRNA from FA19 |
| porB | Depth of coverage for the porB gene (positions 1598044-1599027 in FA19) |
| 23SrRNA | Depth of coverage for the 23S rRNA from FA19 |
| penA | Depth of coverage for the penA gene (positions 1301424-1303169 in FA19) |
| blaTEM | Depth of coverage for extended spectrum beta-lactamase gene blaTEM |
| Nm_sodC | Depth of coverage for Neisseria meningitidis gene sodC |
| mefA | Depth of coverage for Macrolide efflux protein A (mefA) |
| parC | Depth of coverage for the parC gene (positions 993563-995866 in FA19) |
| acnB | Depth of coverage for the acnB gene (positions 963428-966013 in FA19) |
| rplD | Depth of coverage for the rplD gene (positions 1614768-1615388 in FA19) |
| rplV | Depth of coverage for the rplV gene (positions 1612996-1613325 in FA19) |
| mtrR | Depth of coverage for the mtrR gene (positions 1110901-1111533 in FA19) |
| mtrD | Depth of coverage for the mtrD gene (positions 1106197-1109400 in FA19) |
| rpsE | Depth of coverage for the rpsE gene (positions 1607799-1608317 in FA19) |
| rpsJ | Depth of coverage for the rpsJ gene (positions 1616818-1617129 in FA19) |
- < isolate >_amr_report.tsv - This report includes data from the blast, mlst, ngmaster, and variant reports for this isolate.
- < isolate >_amr_vcf.tsv - This is a tab delimited version of a VCF file containing all of the variants found in the AMR associated genes for the isolate. This includes SNPs that are not reported in the variant report. If new positions of interest are added later, these files can be analyzed to see if they appeared in previous neissflow runs. This file is formatted exactly like the TAB file produced by Snippy, shown in the Snippy output section of this README.
- < isolate >_mlst.tsv - Contains the MLST sequence type for the isolate
- < isolate >_ngmaster.tsv - This report contains the NGMAST and NGSTAR sequence types for the isolate as well as the allele calls made to determine those types.
| Metric | Description |
|---|---|
| Sample | Isolate ID |
| SCHEME | ngmaSTar |
| NG-MAST | Sequence Type for the NG-MAST typing scheme |
| NG-STAR | Sequence Type for the NG-STAR typing scheme |
| porB_NG-MAST | porB allele per the NG-MAST typing scheme |
| tbpB | tbpB allele |
| penA | penA allele |
| mtrR | mtrR allele |
| porB_NG-STAR | porB allele per the NG-STAR typing scheme |
| ponA | ponA allele |
| gyrA | gyrA allele |
| parC | parC allele |
| 23S | 23S allele |
- < isolate >_variant_report.tsv - This report includes either a mutation variant call or the FA19 nucleotide or amino acid for AMR loci as well as frequencies for certain calls, whether certain genes have an early stop, and the presence of certain horizontally transferred genes.
| Metric | Description | Default | Associated Drug(s) |
|---|---|---|---|
| Sample | Isolate ID | - | - |
| 23S-2611 base | FA19 or SNP base call at nucleotide 2599 in the 23S gene | C | AZM |
| 23S-2611 freq | Frequency of FA19 or SNP base call at nucleotide 2599 in the 23S gene | 1.0 | AZM |
| 23S-2059 base | FA19 or SNP base call at nucleotide 2047 in the 23S gene | A | AZM |
| 23S-2059 freq | Frequency of FA19 or SNP base call at nucleotide 2047 in the 23S gene | 1.0 | AZM |
| 23S-2058 base | FA19 or SNP base call at nucleotide 2046 in the 23S gene | A | AZM |
| 23S-2058 freq | Frequency of FA19 or SNP base call at nucleotide 2046 in the 23S gene | 1.0 | AZM |
| mtrR promoter | The mtrR promoter contains the sequence AAAAA, and this field is to report a mutation at any one of these positions (but only one nucleotide is reported). The FA19 default is reported as A and if a mutation is found at any of the 5 positions, that nucleotide will be reported instead. | A | AZM/PEN/TET/CFM/CRO |
| mtr120 promoter | FA19 or SNP base call at nucleotide 1110770 relative to the reference for the mtr120 promoter | G | AZM/PEN/TET/CFM/CRO |
| mtrR -35 | FA19 or SNP base call at nucleotide 1110836 relative to the reference for mtrR -35 | G | AZM/PEN/TET/CFM/CRO |
| mtrR WHOP | FA19 or SNP base call at nucleotide 1110839 relative to the reference for mtrR -35 | C | AZM/PEN/TET/CFM/CRO |
| mtrA binding site | FA19 or SNP base call at nucleotide 1110865 relative to the reference for the mtrA binding site | G | AZM/PEN/TET/CFM/CRO |
| mtrR aa39 | FA19 or mutated variant amino acid at position 39 in the mtrR gene | A | AZM/PEN/TET/CFM/CRO |
| mtrR aa44 | FA19 or mutated variant amino acid at position 44 in the mtrR gene | R | AZM/PEN/TET/CFM/CRO |
| mtrR aa45 | FA19 or mutated variant amino acid at position 45 in the mtrR gene | G | AZM/PEN/TET/CFM/CRO |
| mtrR aa47 | FA19 or mutated variant amino acid at position 47 in the mtrR gene | L | AZM/PEN/TET/CFM/CRO |
| mtrR aa79 | FA19 or mutated variant amino acid at position 79 in the mtrR gene | D | AZM/PEN/TET/CFM/CRO |
| mtrR aa105 | FA19 or mutated variant amino acid at position 105 in the mtrR gene | H | AZM/PEN/TET/CFM/CRO |
| mtrR premature stop | True if early stop is found in the mtrR gene | False | AZM/PEN/TET/CFM/CRO |
| penA aa311 | FA19 or mutated variant amino acid at position 311 in the penA gene | A | PEN/CFM/CRO |
| penA aa312 | FA19 or mutated variant amino acid at position 312 in the penA gene | I | PEN/CFM/CRO |
| penA aa316 | FA19 or mutated variant amino acid at position 316 in the penA gene | V | PEN/CFM/CRO |
| penA aa483 | FA19 or mutated variant amino acid at position 483 in the penA gene | T | PEN/CFM/CRO |
| penA aa501 | FA19 or mutated variant amino acid at position 501 in the penA gene | A | PEN/CFM/CRO |
| penA aa504 | FA19 or mutated variant amino acid at position 504 in the penA gene | L | PEN/CFM/CRO |
| penA aa512 | FA19 or mutated variant amino acid at position 512 in the penA gene | N | PEN/CFM/CRO |
| penA aa516 | FA19 or mutated variant amino acid at position 516 in the penA gene | A | PEN/CFM/CRO |
| penA aa542 | FA19 or mutated variant amino acid at position 542 in the penA gene | G | PEN/CFM/CRO |
| penA aa545 | FA19 or mutated variant amino acid at position 545 in the penA gene | G | PEN/CFM/CRO |
| penA aa549 | FA19 or mutated variant amino acid at position 549 in the penA gene | A | PEN/CFM/CRO |
| penA aa551 | FA19 or mutated variant amino acid at position 551 in the penA gene | P | PEN/CFM/CRO |
| penA D345ins | True if there is an aspartate (D) insertion at amino acid position 345 in the penA gene | False | PEN/CFM/CRO |
| ponA aa375 | FA19 or mutated variant amino acid at position 375 in the ponA gene | A | PEN/CFM/CRO |
| ponA aa421 | FA19 or mutated variant amino acid at position 421 in the ponA gene | L | PEN/CFM/CRO |
| pilQ full length | False if early stop is found in the pilQ gene | True | PEN/TET |
| pilQ aa341 | FA19 or mutated variant amino acid at position 341 in the pilQ gene | S | PEN/TET |
| pilQ aa526 | FA19 or mutated variant amino acid at position 526 in the pilQ gene | D | PEN/TET |
| pilQ aa648 | FA19 or mutated variant amino acid at position 648 in the pilQ gene | S | PEN/TET |
| pilQ aa666 | FA19 or mutated variant amino acid at position 666 in the pilQ gene | E | PEN/TET |
| gyrA aa91 | FA19 or mutated variant amino acid at position 91 in the gyrA gene | S | CIP |
| gyrA aa92 | FA19 or mutated variant amino acid at position 92 in the gyrA gene | A | CIP |
| gyrA aa95 | FA19 or mutated variant amino acid at position 95 in the gyrA gene | D | CIP |
| parC aa86 | FA19 or mutated variant amino acid at position 86 in the parC gene | D | CIP |
| parC aa87 | FA19 or mutated variant amino acid at position 87 in the parC gene | S | CIP |
| parC aa88 | FA19 or mutated variant amino acid at position 88 in the parC gene | S | CIP |
| parC aa91 | FA19 or mutated variant amino acid at position 91 in the parC gene | E | CIP |
| blaTEM present | True if blaTEM is found with sufficient depth (>=2) | False | PEN |
| tetM present | True if tetM is found with sufficient depth (>=2) | False | TET |
| rpsJ aa57 | FA19 or mutated variant amino acid at position 57 in the rpsJ gene | V | TET |
| ftsX aa31 | FA19 or mutated variant amino acid at position 31 in the ftsX gene | T | CTX |
| rplD aa68 | FA19 or mutated variant amino acid at position 68 in the rplD gene | G | AZM |
| rplD aa70 | FA19 or mutated variant amino acid at position 70 in the rplD gene | G | AZM |
| rplV ins | True if any insertions are found in the rplV gene | False | ASK!! |
| macA aa99 | FA19 or mutated variant amino acid at position 99 in the macA gene | N | AZM |
| mtrD aa42 | FA19 or mutated variant amino acid at position 42 in the mtrD gene | T | AZM |
| mtrD aa46 | FA19 or mutated variant amino acid at position 46 in the mtrD gene | H | AZM |
| mtrD aa48 | FA19 or mutated variant amino acid at position 48 in the mtrD gene | I | AZM |
| mtrD aa101 | FA19 or mutated variant amino acid at position 101 in the mtrD gene | N | AZM |
| mtrD aa174 | FA19 or mutated variant amino acid at position 174 in the mtrD gene | R | AZM |
| mtrD aa612 | FA19 or mutated variant amino acid at position 612 in the mtrD gene | F | AZM |
| mtrD aa662 | FA19 or mutated variant amino acid at position 662 in the mtrD gene | V | AZM |
| mtrD aa669 | FA19 or mutated variant amino acid at position 669 in the mtrD gene | E | AZM |
| mtrD aa714 | FA19 or mutated variant amino acid at position 714 in the mtrD gene | R | AZM |
| mtrD aa821 | FA19 or mutated variant amino acid at position 821 in the mtrD gene | S | AZM |
| mtrD aa823 | FA19 or mutated variant amino acid at position 823 in the mtrD gene | K | AZM |
| mtrD aa826 | FA19 or mutated variant amino acid at position 826 in the mtrD gene | A | AZM |
| macA promoter | FA19 or SNP base call at nucleotide 1192227 relative to the reference for the macA promoter | A | AZM |
| norM promoter | FA19 or SNP base call at nucleotide 129495 relative to the reference for the norM promoter | G | CIP |
| ermB present | True if ermB is found with sufficient depth (>=2) | False | AZM |
| ermC present | True if ermC is found with sufficient depth (>=2) | False | AZM |
| ermF present | True if ermF is found with sufficient depth (>=2) | False | AZM |
| mefA present | True if mefA is found with sufficient depth (>=2) | False | AZM |
| gyrB aa429 | FA19 or mutated variant amino acid at position 429 in the gyrB gene | D | ETX0914 |
| gyrB aa450 | FA19 or mutated variant amino acid at position 450 in the gyrB gene | K | ETX0914 |
| acnB aa348 | FA19 or mutated variant amino acid at position 348 in the acnB gene | G | CFM/CRO |
| acnB aa371 | FA19 or mutated variant amino acid at position 371 in the acnB gene | Q | CFM/CRO |
| 16S-1053 base | FA19 or SNP base call at nucleotide 1053 in the 16S gene | G | ERA |
| 16S-1053 freq | Frequency of FA19 or SNP base call at nucleotide 1053 in the 16S gene | 1.0 | ERA |
| 16S-1186 base | FA19 or SNP base call at nucleotide 1186 in the 16S gene | G | ERA |
| 16S-1186 freq | Frequency of FA19 or SNP base call at nucleotide 1186 in the 16S gene | 1.0 | ERA |
| rpsE aa24 | FA19 or mutated variant amino acid at position 24 in the rpsE gene | T | SPC |
| rpsE aa28 | FA19 or mutated variant amino acid at position 28 in the rpsE gene | K | SPC |
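
Because the Default column gives the FA19 value for each field, one simple way to read a variant report is to list the fields whose value differs from that default. The sketch below does this for a few ciprofloxacin-associated loci; the defaults are copied from the table above, while the example row and the function name are hypothetical. It only flags differences and does not interpret them as resistance.

```python
# FA19 defaults copied from the table above for a subset of columns.
FA19_DEFAULTS = {
    "gyrA aa91": "S",
    "gyrA aa95": "D",
    "parC aa87": "S",
}


def non_default_calls(row, defaults=FA19_DEFAULTS):
    """Return {column: observed value} for columns that differ from the FA19 default."""
    return {
        column: row[column]
        for column, default in defaults.items()
        if column in row and row[column] != default
    }


# Hypothetical row from an < isolate >_variant_report.tsv file.
example_row = {"Sample": "iso1", "gyrA aa91": "F", "gyrA aa95": "D", "parC aa87": "R"}
print(non_default_calls(example_row))
```

Extending `FA19_DEFAULTS` with further rows from the table would cover additional loci in the same way.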
The phylogeny directory contains the snippy_core, gubbins, snp_dists, RAxML, and Gotree subdirectories. It also contains the phylogeny_qc_report.tsv.
snippy_core - The snippy_core directory contains the output from running the entire collection of isolates against the same reference, building a core genome, and then generating core alignment files using that core genome. This information can be used for building a phylogenetic tree. The core genome is composed of genomic positions that are present in all of the isolates in the collection. The files found in the snippy_core directory are detailed in the table below:
| File | Description |
|---|---|
| core.aln | A core SNP alignment in the --aformat format (default FASTA) |
| core.full.aln | A whole genome SNP alignment (includes invariant sites) |
| core.tab | Tab-separated columnar list of core SNP sites with alleles but NO annotations |
| core.txt | Tab-separated columnar list of alignment/core-size statistics |
| core.ref.fa | FASTA version/copy of the --ref |
The core.txt file contains columns with the metrics seen in the following table with each of the isolates and the reference as rows.
| Metric | Description |
|---|---|
| ID | Sample or reference name |
| LENGTH | Total length of the genome measured in bases |
| ALIGNED | Number of bases that aligned with the reference |
| UNALIGNED | Number of bases that did not align with the reference |
| VARIANT | Total number of variants relative to the reference |
| HET | Number of heterozygotes |
| MASKED | Number of masked bases |
| LOWCOV | Number of bases with low coverage |
gubbins - The gubbins subdirectory contains output from Gubbins, a tool used to mark recombinant regions and construct a phylogeny based on mutations outside of those regions. The Phylip file from this output is used during phylogenetic analysis. This directory also contains _partition_data.txt, which contains the monomorphic counts for the unambiguous nucleotides, and _partition.txt, which references _partition_data.txt and is used for the ascertainment bias correction in the RAxML step.
The following table includes all of the files produced by Gubbins that can be found in the gubbins directory, as detailed in the Gubbins manual:
| Extension | Description |
|---|---|
| .recombination_predictions.embl | Recombination predictions in EMBL file format |
| .recombination_predictions.gff | Recombination predictions in GFF format |
| .branch_base_reconstruction.embl | Base substitution reconstruction in EMBL format |
| .summary_of_snp_distribution.vcf | Per branch reporting of the base substitutions inside and outside recombination events |
| .filtered_polymorphic_sites.fasta | FASTA format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration |
| .filtered_polymorphic_sites.phylip | Phylip format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration |
| .final_tree.tre | This file contains the final phylogeny in Newick format; branch lengths are in point mutations |
| .node_labelled.final_tree.tre | Final phylogenetic tree in Newick format but with the internal node labels; branch lengths are in point mutations |
| .log | A log file specifying the software used at each step of the analysis, with accompanying citations |
| .per_branch_statistics.csv | File containing summary statistics for each branch in the tree in comma delimited format |
The .per_branch_statistics.csv file contains columns with the metrics seen in the following table with rows for each sample, per the Gubbins manual.
| Metric | Description |
|---|---|
| Node | Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered |
| Total SNPs | Total number of base substitutions reconstructed onto the branch |
| Number of SNPs Inside Recombinations | Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (r) |
| Number of SNPs Outside Recombinations | Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (m) |
| Number of Recombination Blocks | Total number of recombination blocks reconstructed onto the branch |
| Bases in Recombinations | Total length of all recombination events reconstructed onto the branch |
| Cumulative Bases in Recombinations | Total number of bases in the alignment affected by recombination on this branch and its ancestors |
| r/m | The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch |
| rho/theta | The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation |
| Genome Length | The total number of aligned bases between the ancestral and descendant nodes for the branch, excluding any missing data or gaps in either |
| Bases in Clonal Frame | The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation) |
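As a rough illustration of how the r/m column relates to the two SNP counts in the table above, the sketch below computes the ratio for a single branch. The function name `r_over_m` and the choice to return `None` when there are no point mutations are assumptions made for this example; they are not taken from Gubbins itself.

```python
from typing import Optional


def r_over_m(snps_inside: int, snps_outside: int) -> Optional[float]:
    """Ratio of SNPs inside recombinations (r) to SNPs outside them (m).

    Returns None when m is zero. How Gubbins reports that case is not
    covered here, so this behaviour is an assumption of the sketch.
    """
    if snps_inside < 0 or snps_outside < 0:
        raise ValueError("SNP counts must be non-negative")
    if snps_outside == 0:
        return None
    return snps_inside / snps_outside


# Hypothetical branch: 30 SNPs inside recombinations, 10 outside.
print(r_over_m(30, 10))  # prints 3.0
```

A value above 1 for a branch would indicate that more of its reconstructed substitutions fall inside predicted recombinations than outside them.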
snp_dists - The snp_dists directory contains all of the output from snp-dists, as well as any potential outbreak clusters detected in the sample set.
This output includes the following files:
- isolate_clusters.txt - Each line describes one potential outbreak cluster. It gives the name of each isolate in the cluster, and each name is followed by a list of the SNP distances between that isolate and every isolate in the cluster (including itself). Clusters are identified by first constructing a graph with the isolates as nodes and an edge between every pair of isolates whose SNP distance is < 20 (the default; this value can be changed). A disjoint set union (DSU) analysis is then performed on this graph to identify the connected components, which are reported as the potential outbreak clusters.
Example cluster:
GCWGS-27036-WA-M5130-240405_S28_L001 [ 0 22 24 14 20] GCWGS-27038-WA-M5130-240405_S30_L001 [22 0 30 16 14] GCWGS-27039-WA-M5130-240405_S31_L001 [24 30 0 21 18] GCWGS-27046-WA-M5130-240405_S38_L001 [14 16 21 0 16] GCWGS-27419-WA-M5130-240417_S68_L001 [20 14 18 16 0]
- matrix.tsv - Contains the pairwise SNP distances between all isolates in the sample set in matrix format.
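The clustering described for isolate_clusters.txt can be sketched as follows. This is a minimal illustration of connected-component detection with a disjoint set union, not the pipeline's own code: the function names, the shortened isolate names, and the decision to drop single-isolate groups are all assumptions made for the example. The distances are the ones shown in the example cluster above.

```python
from typing import Dict, List, Sequence


def find(parent: List[int], i: int) -> int:
    """Return the root of i's set, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_isolates(
    names: Sequence[str],
    dist: Sequence[Sequence[int]],
    threshold: int = 20,
) -> List[List[str]]:
    """Group isolates into connected components of the graph whose edges
    join every pair of isolates with a SNP distance below the threshold."""
    parent = list(range(len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] < threshold:
                root_i, root_j = find(parent, i), find(parent, j)
                if root_i != root_j:
                    parent[root_j] = root_i  # union of the two sets
    groups: Dict[int, List[str]] = {}
    for i, name in enumerate(names):
        groups.setdefault(find(parent, i), []).append(name)
    # Dropping single-isolate groups is an assumption of this sketch.
    return [members for members in groups.values() if len(members) > 1]


# Distances from the example cluster, with isolate names shortened.
names = ["27036", "27038", "27039", "27046", "27419"]
dist = [
    [0, 22, 24, 14, 20],
    [22, 0, 30, 16, 14],
    [24, 30, 0, 21, 18],
    [14, 16, 21, 0, 16],
    [20, 14, 18, 16, 0],
]
clusters = cluster_isolates(names, dist)
print(clusters)
```

All five isolates end up in one cluster even though some pairs are 20 or more SNPs apart (for example 27038 and 27039 at 30), because they are linked through intermediate isolates such as 27419, which is 14 and 18 SNPs from them respectively.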
RAxML - The RAxML directory contains the output produced by RAxML. RAxML_bestTree.core_ is the best tree generated during the phylogenetic analysis, in Newick format, and is the tree visualized by Gotree.
The following table includes all of the files produced by RAxML that can be found in the RAxML directory, as detailed in the RAxML manual:
| Prefix | Description |
|---|---|
| RAxML_bestTree. | Contains the best-scoring ML tree of a thorough ML analysis |
| RAxML_bipartitions. | BS support values on the best tree found during the ML search |
| RAxML_bipartitionsBranchLabels. | Contains the same information as the file above, but support values are correctly displayed as Newick branch labels and not node labels! Support values always refer to branches/splits of trees and never to nodes of the tree. |
| RAxML_bootstrap. | All final bootstrapped trees |
| RAxML_info | Contains information about the model and algorithm used and how RAxML was called. The final GAMMA-based likelihood(s) as well as the alpha shape parameter(s) are printed to this file. In addition, if the rearrangement setting was determined automatically (-i has not been used), the rearrangement setting found by the program will be indicated. This is the most important output file because it tells you what RAxML did and is always written irrespective of the command line options. In addition, it contains information about all other output files that were written by your run. |
Gotree - The Gotree directory contains the output from Gotree, including the midrooted version of the best tree output by RAxML and bestTree.png, which is an image of the midrooted phylogenetic tree. The clusters identified in the outbreak detection step are color coded in this tree, and the cluster_annotation.tsv file contains this annotation information. This directory also contains phylogeny_report.html, which can be opened in your browser. This report contains the image of the phylogenetic tree as well as the outbreak clusters from isolate_clusters.txt.
phylogeny_qc_report.tsv - This file contains the QC report for the phylogenetic analysis. The table below contains an example of this report.
| QC Parameter | Accepted Value | Actual Value | Pass/Fail |
|---|---|---|---|
| Num_Samples_Aligned | 7 | 7 | pass |
| Match_Ref_Length | true | true | pass |
| Num_Lines_w_Invalid_Nuc | 0 | 0 | pass |
| All_Present_in_Tree | true | true | pass |
| Num_Outliers | - | 0 | NA |
| Core_Mono_Nuc_bp_Count | - | 1892518 | NA |
The following table contains the descriptions of these parameters:
| Metric | Description |
|---|---|
| Num_Samples_Aligned | Number of samples in the .full.aln FASTA alignment produced by snippy-core. A run will pass if this value (reported under Actual Value) matches the number of samples given to the pipeline (reported under Accepted Value) |
| Match_Ref_Length | “true” if the sequences in the .full.aln FASTA file produced by snippy-core are the same length as the reference. This being “true” results in a pass. |
| Num_Lines_w_Invalid_Nuc | Number of lines in the sequences in the .full.aln FASTA file produced by snippy-core that contain invalid nucleotides. There being 0 lines containing invalid nucleotides results in a pass. |
| All_Present_in_Tree | “true” if all of the samples in the run are present in the Newick file produced by RAxML. This being “true” results in a pass. |
| Num_Outliers | Number of statistical outliers in the resulting tree. A statistical outlier is a sample whose branch length is greater than the mean branch length + 2*(the standard deviation of branch lengths). There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails. In a lineage-specific phylogenetic analysis (same MLST or coregeno group) you should not see any outliers. If you do, we suggest reviewing your isolate list and the phylogenetic visuals in order to assess whether the isolates with unusually long branches actually belong to the same lineage. |
| Core_Mono_Nuc_bp_Count | Size of the core alignment, excluding any portions that are not perfectly aligned between the samples in the set. There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails. |
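The Num_Outliers rule (branch length greater than the mean plus two standard deviations) can be illustrated with the short sketch below. The function name, the input format (a mapping from sample name to branch length), and the use of the population standard deviation are assumptions made for this example; the pipeline may compute the standard deviation differently.

```python
from statistics import mean, pstdev
from typing import Dict, List


def branch_length_outliers(lengths: Dict[str, float]) -> List[str]:
    """Return samples whose branch length exceeds mean + 2 * std dev."""
    values = list(lengths.values())
    if not values:
        return []
    # Population standard deviation is an assumption of this sketch;
    # the sample standard deviation would give a slightly higher cutoff.
    cutoff = mean(values) + 2 * pstdev(values)
    return [name for name, length in lengths.items() if length > cutoff]


# Hypothetical branch lengths: six similar samples and one long branch.
example = {f"sample{i}": 1.0 for i in range(1, 7)}
example["sample7"] = 10.0
print(branch_length_outliers(example))  # prints ['sample7']
```

With only a handful of samples, a single long branch inflates both the mean and the standard deviation, so the rule becomes harder to trigger as the sample set gets smaller.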
Output files
- multiqc/
  - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
  - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
  - multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.