neissflow: Output

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  • fastp - a tool for all-in-one FASTQ processing, including quality filtering, adaptor-trimming, and quality-trimming, as well as quality profiling
  • Samtools stats - a tool for collecting statistics from BAM files and outputting them in a text format
  • Mash - a tool for species screening via fast genome and metagenome distance estimation using MinHash
  • Shovill - an assembly tool for Illumina paired-end reads
  • QUAST - a tool for evaluating assemblies through calculating and reporting quality metrics
  • Snippy - a tool for rapid haploid variant calling and core genome alignment
  • mlst - a tool for scanning contigs against PubMLST typing schemes
  • NGMASTER - a tool for performing multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) and Neisseria gonorrhoeae sequence typing for antimicrobial resistance (NG-STAR)
  • BLASTn - basic local alignment search tool (BLAST) for comparing nucleotide sequences to those in a database
  • Samtools depth - a tool for calculating the read depth at a given position from an alignment
  • snp-dists - a tool for generating a SNP distance matrix from a FASTA core alignment
  • Gubbins - a tool for marking recombination regions and constructing a phylogeny based on mutations outside of those regions
  • RAxML - a tool for performing Maximum Likelihood based inference of large phylogenetic trees
  • Gotree - a tool for manipulating phylogenetic trees and generating visualizations
  • MultiQC - Aggregate report describing results and QC from the whole pipeline
  • Pipeline information - Report metrics generated during the workflow execution

run-name_final_report.tsv - aggregated report that combines data from the following reports:

  • QC_FASTQ – neissflow_FASTQ_QC_report.tsv
  • MASH – neissflow_Mash_contaminants.tsv, neissflow_Mash_top_hit_report.tsv, neissflow_Mash_plasmids.tsv
  • processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt
  • amr_profiler - neissflow_amr_report.tsv, neissflow_avg_depth_report.tsv

QC_final_report.tsv - aggregated report containing only the control samples

  • This report is present if the QC profile is included in the run
  • This report contains the same columns as run-name_final_report.tsv

QC_FASTQ – output from the Preprocessing Subworkflow

In this directory, there are 2 subdirectories, named Reports and Samples.

Samples - contains subdirectories for each of the isolates, which contain the trimmed & filtered paired read files produced by fastp for that isolate.

Reports - contains the files neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc.tsv, and neissflow_passed_qc.tsv.

  • neissflow_FASTQ_QC_report.tsv - aggregated report containing the sequence reports generated by fastp for all of the samples. This report is used for the first QC check to remove low quality isolates from analysis.

  • neissflow_failed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the low-quality samples that failed the first QC check

  • neissflow_passed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the samples that passed the first QC check and moved on for further analysis

The columns for the following metrics should be found in neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc.tsv, and neissflow_passed_qc.tsv (this information can also be found in the fastp repository):

Metric Description
Isolate Isolate ID
before_filtering_total_reads Total number of reads in the isolate FASTQ file before quality filtering and trimming
before_filtering_total_bases Total number of bases in the isolate FASTQ file before quality filtering and trimming
before_filtering_q20_bases Total number of bases with a Phred score of >= Q20 before quality filtering and trimming
before_filtering_q30_bases Total number of bases with a Phred score of >= Q30 before quality filtering and trimming
before_filtering_read1_mean_length Mean length of reads for the R1 file for the isolate paired-end read files before quality filtering and trimming
before_filtering_read2_mean_length Mean length of reads for the R2 file for the isolate paired-end read files before quality filtering and trimming
before_filtering_gc_content Fraction of bases that are guanine or cytosine before quality filtering and trimming
after_filtering_total_reads Total number of reads in the isolate FASTQ file after quality filtering and trimming
after_filtering_total_bases Total number of bases in the isolate FASTQ file after quality filtering and trimming
after_filtering_q20_bases Total number of bases with a Phred score of >= Q20 after quality filtering and trimming
after_filtering_q30_bases Total number of bases with a Phred score of >= Q30 after quality filtering and trimming
after_filtering_read1_mean_length Mean length of reads for the R1 file for the isolate paired-end read files after quality filtering and trimming
after_filtering_read2_mean_length Mean length of reads for the R2 file for the isolate paired-end read files after quality filtering and trimming
after_filtering_gc_content Fraction of bases that are guanine or cytosine after quality filtering and trimming
passed_filter_reads Total number of reads in the isolate FASTQ file that passed the quality filter
low_quality_reads Total number of reads in the isolate FASTQ file that did NOT pass the quality filter
too_many_N_reads Total number of reads that contain too many “N” bases (bases that could not be called by the sequencer)
too_short_reads Total number of reads that were filtered out due to having length <= 100 bases
too_long_reads Total number of reads that were filtered out due to being too long (limit not currently set)
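
As a rough illustration of how these columns relate to one another, the sketch below derives the post-filtering Q30 fraction and the fraction of reads retained for each isolate. The column names come from the table above; the helper names `summarize_fastp_row` and `summarize_report` and the sample values are made up for illustration, and the thresholds the pipeline applies in its first QC check are not shown here.

```python
import csv
import io

# Made-up excerpt of neissflow_FASTQ_QC_report.tsv (only the columns used below).
SAMPLE_REPORT = (
    "Isolate\tbefore_filtering_total_reads\tafter_filtering_total_reads\t"
    "after_filtering_total_bases\tafter_filtering_q30_bases\n"
    "isolate_A\t1000000\t950000\t140000000\t133000000\n"
)


def summarize_fastp_row(row):
    """Derive simple ratios from one row of the FASTQ QC report."""
    reads_before = float(row["before_filtering_total_reads"])
    reads_after = float(row["after_filtering_total_reads"])
    bases_after = float(row["after_filtering_total_bases"])
    q30_after = float(row["after_filtering_q30_bases"])
    return {
        "Isolate": row["Isolate"],
        # Share of post-filtering bases with a Phred score of at least Q30.
        "q30_fraction": q30_after / bases_after if bases_after else 0.0,
        # Share of the original reads that survived filtering and trimming.
        "reads_retained": reads_after / reads_before if reads_before else 0.0,
    }


def summarize_report(handle):
    """Read a tab-separated report and summarize every isolate in it."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [summarize_fastp_row(row) for row in reader]


summaries = summarize_report(io.StringIO(SAMPLE_REPORT))
print(summaries)
```

To run this against a real report, pass an open file handle for neissflow_FASTQ_QC_report.tsv instead of the in-memory sample.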

snippy – output from the SNIPPY module and from the AMR_Profiler Subworkflow SNIPPY_AMR module

In this directory, there are subdirectories for each reference used (default is FA19 and AMR), which contain directories for each of the isolates in the collection.

The isolate directories contain all of the Snippy output for that specific isolate. Below is a table of the produced files (this information can also be found in the Snippy repository):

Extension Description
.tab A simple tab-separated summary of all the variants
.csv A comma-separated version of the .tab file
.html An HTML version of the .tab file
.vcf The final annotated variants in VCF format
.bed The variants in BED format
.gff The variants in GFF3 format
.bam The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai Index for the .bam file
.log A log file with the commands run and their outputs
.aligned.fa A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants)
.consensus.fa A version of the reference genome with all variants instantiated
.consensus.subs.fa A version of the reference genome with only substitution variants instantiated
.raw.vcf The unfiltered variant calls from Freebayes
.filt.vcf The filtered variant calls from Freebayes
.vcf.gz Compressed .vcf file via BGZIP
.vcf.gz.csi Index for the .vcf.gz via bcftools index

The TAB, CSV, and HTML files all should contain the same variant summary information in tables with the following columns:

Name Description
CHROM The sequence the variant was found in eg. the name after the > in the FASTA reference
POS Position in the sequence, counting from 1
TYPE The variant type: single nucleotide polymorphism (snp), multinucleotide polymorphism (mnp), insertion (ins), deletion (del), combination of snp/mnp (complex)
REF The nucleotide(s) in the reference
ALT The alternate nucleotide(s) supported by the reads
EVIDENCE Frequency counts for REF and ALT
FTYPE Class of feature affected: CDS tRNA rRNA ...
STRAND Strand the feature was on: + -
NT_POS Nucleotide position of the variant within the feature / Length in nucleotides
AA_POS Residue position / Length in amino acids (only if FTYPE is CDS)
EFFECT The snpEff annotated consequence of this variant
LOCUS_TAG The /locus_tag of the feature (if it existed)
GENE The /gene tag of the feature (if it existed)
PRODUCT The /product tag of the feature (if it existed)
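
For readers who want to work with the TAB file programmatically, here is a minimal sketch that tallies variants by TYPE and lists the variants annotated to one gene. It assumes the file is tab-separated with a header row that uses the column names listed above. The sample rows and the function names `count_variant_types` and `variants_in_gene` are invented for illustration and are not part of Snippy or neissflow.

```python
import csv
import io
from collections import Counter

# Invented rows using a subset of the columns described above.
SAMPLE_TAB = (
    "CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\tFTYPE\tGENE\n"
    "chr1\t101\tsnp\tA\tG\tG:40 A:0\tCDS\tgyrA\n"
    "chr1\t250\tsnp\tC\tT\tT:35 C:1\tCDS\tparC\n"
    "chr1\t900\tdel\tGA\tG\tG:28 GA:0\t\t\n"
)


def count_variant_types(handle):
    """Count how many variants of each TYPE (snp, mnp, ins, del, complex) appear."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["TYPE"] for row in reader)


def variants_in_gene(handle, gene):
    """Return (POS, REF, ALT) for every variant whose GENE column matches `gene`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (int(row["POS"]), row["REF"], row["ALT"])
        for row in reader
        if row.get("GENE") == gene
    ]


type_counts = count_variant_types(io.StringIO(SAMPLE_TAB))
gyra_variants = variants_in_gene(io.StringIO(SAMPLE_TAB), "gyrA")
print(type_counts, gyra_variants)
```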

COVERAGE – output from the STATS step

In this directory, there are report text files for each of the samples, as well as a file called “neissflow_coverage.tsv”.

The “neissflow_coverage.tsv” file contains rows for each isolate in the input directory. The columns for the following fields can be found in the output file:

Metric Description
ID Isolate ID
%target>10x Percent of the target genome (FA19) with greater than 10x coverage

The content contained in the report text files is not used for further analysis or QC checks at this time, but information on metrics included can be found in SOP-TBA or on the Samtools stats manual page.

MASH – output from the MASH and COMBINE_MASH_REPORTS steps

In this directory there are TSV reports for each sample containing the complete Mash screening report for that sample, as well as neissflow_Mash_contaminants.tsv, neissflow_Mash_plasmids.tsv, and neissflow_Mash_top_hit_report.tsv.

  • neissflow_Mash_top_hit_report.tsv - Reports the top species hit (greatest identity score + fraction of hashes) from the Mash screening with 1000 total hashes. (non plasmid hits)
  • neissflow_Mash_contaminants.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) non-Neisseria hits from the Mash screening. (non plasmid hits)
  • neissflow_Mash_plasmids.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) plasmid hits from the Mash screening.

neissflow_Mash_contaminants.tsv, neissflow_Mash_plasmids.tsv, and neissflow_Mash_top_hit_report.tsv contain the following columns:

Metric Description
ID Isolate ID
ident What fraction of the bases are shared between the reference genome and input reads. Sequencing errors and gaps in coverage reduce this.
hashes For each k-mer in the dataset screened, a “hash” is created and compared with the k-mers or “hashes” in each reference genome in the databases. When all k-mers are shared with a particular reference genome then 1000/1000 is the value for shared hashes.
median_mult Is a “rough estimate” for abundance, but it is affected by the size of the reference genome. Small genomes tend to produce large values, while having low identity and few shared hashes.
p_val The probability of observing the number of shared k-mers with the estimated identity
hit_name Truncated name of the reference genome. Ex: Neisseria_gonorrhoeae_MS11

The individual reports contain the same information; however, the columns are not labeled and the “hit_name” is the long version of the Refseq FASTA hit.
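
The "good hit" rule described above (identity score of >= 0.95 and fraction of hashes of >= 0.95) can be sketched as follows. This assumes the hashes value is stored as a "shared/total" string such as 1000/1000, as in the column description. The function names are illustrative and are not taken from the pipeline code.

```python
def hash_fraction(hashes):
    """Convert a 'shared/total' string such as '980/1000' into a fraction."""
    shared, total = hashes.split("/")
    return int(shared) / int(total)


def is_good_hit(ident, hashes, min_ident=0.95, min_hash_fraction=0.95):
    """Apply the documented thresholds to one Mash screen hit."""
    return float(ident) >= min_ident and hash_fraction(hashes) >= min_hash_fraction


# Illustrative values only.
print(is_good_hit("0.998", "980/1000"))  # both values clear the thresholds
print(is_good_hit("0.998", "400/1000"))  # too few shared hashes
```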

processed_genomes – output from the Assembly Subworkflow

In this directory there are FASTA assemblies for each sample, a QUAST directory, and a Denovo_assembly_Stats_QC_report.txt file.

  • QUAST - Contains subdirectories for each of the samples in the set. These directories contain output from QUAST, which are reports on the quality metrics for that sample's assembly. The description of the QUAST output can be found below:
report.txt      summary table
report.tsv      tab-separated version, for parsing, or for spreadsheets (Google Docs, Excel, etc)  
report.tex      Latex version
report.pdf      PDF version, includes all tables and plots for some statistics
report.html     everything in an interactive HTML file
icarus.html     Icarus main menu with links to interactive viewers
contigs_reports/        
  misassemblies_report  detailed report on misassemblies
  unaligned_report      detailed report on unaligned and partially unaligned contigs

  • Sample Assemblies – FASTA files produced by Shovill, named following the sample_name_contigs.fa convention

  • neissflow_Denovo_assembly_Stats_QC_report.txt – Compiled QC metrics report for all of the isolates in the collection. This report contains the following columns:

    Metric Description
    Filename Isolate ID
    Contig_Count Total number of contigs
    Bases_In_Contigs Total number of bases included in contigs
    Large_Contig_Count Total number of large contigs (length > 10,000)
    Small_Contig_Count Total number of small contigs (length <= 10,000)
    >500bp_Contig_Count Total number of contigs with >500 basepairs
    Bases_In_Large_Contigs Total number of bases included in large contigs (length > 10,000)
    Bases_In_Small_Contigs Total number of bases included in small contigs (length <= 10,000)
    Fraction_Of_Contigs_That_Are_Large (large contig count) / (total contig count)
    Min_Coverage_Large_Contigs Minimum of the coverages of the large contigs
    Max_Ratio_of_Coverage_Large_Contigs (maximum of the coverages of the large contigs) / (minimum of the coverages of the large contigs)
    Low_Coverage_Contig_Count Total number of contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2)
    Low_Coverage_Contig_Bases Total number of bases in contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2)
    Mean_Coverage Length-weighted mean coverage: (sum over every contig in the assembly of [contig length * coverage of contig]) / (total number of bases in the assembly)
    Ambiguous_nucleotides Total number of ambiguous nucleotides in the assembly
    N50 The contig length for which the collection of all contigs of that length or longer covers at least half of the assembly
    N75 Similar to N50, but using 75% of the assembly covered
    N90 Similar to N50 and N75, but using 90% of the assembly covered
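
The N50/N75/N90 and Mean_Coverage definitions above can be made concrete with a small sketch. It follows the definitions as written in the table and is not taken from the pipeline's own code. The function names `nx` and `mean_coverage` and the toy contig values are illustrative.

```python
def nx(contig_lengths, percent):
    """Return the length L such that contigs of length >= L cover at least
    `percent` percent of the assembly (percent=50 gives N50)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length
    return 0


def mean_coverage(contigs):
    """Length-weighted mean coverage for an iterable of (length, coverage) pairs."""
    contigs = list(contigs)
    total_bases = sum(length for length, _ in contigs)
    if total_bases == 0:
        return 0.0
    return sum(length * cov for length, cov in contigs) / total_bases


# Toy assembly: four contigs totalling 100 bases.
toy_contigs = [(50, 40.0), (30, 20.0), (10, 10.0), (10, 10.0)]
toy_lengths = [length for length, _ in toy_contigs]
print(nx(toy_lengths, 50), nx(toy_lengths, 75), nx(toy_lengths, 90))
print(mean_coverage(toy_contigs))
```

For the toy assembly, the largest contig alone reaches half of the total length, so N50 is 50. N75 and N90 require progressively smaller contigs to be included.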

QC_CHECK – output from the second QC check

In this directory there are 2 files named neissflow_failed_qc.tsv and neissflow_passed_qc.tsv.

These reports are an aggregation of some of the generated reports up to this point in order to perform a species and assembly QC check.

Aggregated Reports:
- QC_FASTQ – neissflow_FASTQ_QC_report.tsv
- COVERAGE – neissflow_coverage.tsv
- MASH – neissflow_Mash_top_hit_report.tsv
- processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt

  • neissflow_failed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that failed the species and assembly QC check and did not continue for further analysis.

  • neissflow_passed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that passed the species and assembly QC check and continued for further analysis.

amr_profiler – output from the AMR_Profiler Subworkflow

In this directory there are subdirectories for each of the isolates, as well as two aggregated reports: amr_report.tsv and avg_depth_report.tsv. These reports are based on those generated by the GC-Genome-Profiler and largely contain the same columns and information. The isolate directories contain the individual reports for that isolate, which are aggregated so that amr_report.tsv and avg_depth_report.tsv hold the results for all of the samples in the set: amr_report.tsv is an aggregation of the < isolate >_amr_report.tsv files, and avg_depth_report.tsv is an aggregation of the < isolate >_amr_depth.tsv reports. The reports found in the isolate directories are:

  • < isolate >_amr_blast.tsv - this report contains the results of the blastn runs for that isolate. All of its columns are also included in the amr_report.tsv report.
Metric Description
Sample Isolate ID
penA allele Best hit from penA allele BLAST database
porB allele Best hit from porB allele BLAST database
mtrR mosaic True if isolate reaches 98% match threshold
  • < isolate >_amr_depth.tsv - this report contains the depth of coverage for each of the relevant AMR genes
Metric Description
Sample Isolate ID
ermC Depth of coverage for the erythromycin resistance ermC gene
ermB Depth of coverage for the erythromycin resistance ermB gene
ermF Depth of coverage for the erythromycin resistance ermF gene
gyrB Depth of coverage for the gyrB gene (positions 1547933-1550323 in FA19)
gyrA Depth of coverage for the gyrA gene (positions 357412-360162 in FA19)
mtrR-CDEprom Depth of coverage for the mtrR CDE promoter (positions 1110651-1110900 in FA19)
macA_and_prom Depth of coverage for the macA gene and promoter (positions 1191001-1192230 in FA19)
norMprom Depth of coverage for the norM promoter (positions 129494-129496 in FA19)
ftsX Depth of coverage for the ftsX gene (positions 1707518-1708435 in FA19)
ponA Depth of coverage for the ponA gene (positions 2078911-2081307 in FA19)
TetM-partial Depth of coverage for partial reference of the tetracycline resistance determinant (tetM) gene
FA19_16SrRNA Depth of coverage for 16S rRNA from FA19
porB Depth of coverage for the porB gene (positions 1598044-1599027 in FA19)
23SrRNA Depth of coverage for the 23S rRNA from FA19
penA Depth of coverage for the penA gene (positions 1301424-1303169 in FA19)
blaTEM Depth of coverage for extended spectrum beta-lactamase gene blaTEM
Nm_sodC Depth of coverage for Neisseria meningitidis gene sodC
mefA Depth of coverage for Macrolide efflux protein A (mefA)
parC Depth of coverage for the parC gene (positions 993563-995866 in FA19)
acnB Depth of coverage for the acnB gene (positions 963428-966013 in FA19)
rplD Depth of coverage for the rplD gene (positions 1614768-1615388 in FA19)
rplV Depth of coverage for the rplV gene (positions 1612996-1613325 in FA19)
mtrR Depth of coverage for the mtrR gene (positions 1110901-1111533 in FA19)
mtrD Depth of coverage for the mtrD gene (positions 1106197-1109400 in FA19)
rpsE Depth of coverage for the rpsE gene (positions 1607799-1608317 in FA19)
rpsJ Depth of coverage for the rpsJ gene (positions 1616818-1617129 in FA19)
  • < isolate >_amr_report.tsv - This report includes data from the blast, mlst, ngmaster, and variant reports for this isolate.

  • < isolate >_amr_vcf.tsv - This is a tab delimited version of a VCF file containing all of the variants found in the AMR associated genes for the isolate. This includes SNPs that are not reported in the variant report. If new positions of interest are added later, these files can be analyzed to see if they appeared in previous neissflow runs. This file is formatted exactly like the TAB file produced by Snippy, shown in the Snippy output section of this README.

  • < isolate >_mlst.tsv - Contains the MLST sequence type for the isolate

  • < isolate >_ngmaster.tsv - This report contains the NGMAST and NGSTAR sequence types for the isolate as well as the allele calls made to determine those types.

Metric Description
Sample Isolate ID
SCHEME ngmaSTar
NG-MAST Sequence Type for the NG-MAST typing scheme
NG-STAR Sequence Type for the NG-STAR typing scheme
porB_NG-MAST porB allele per the NG-MAST typing scheme
tbpB tbpB allele
penA penA allele
mtrR mtrR allele
porB_NG-STAR porB allele per the NG-STAR typing scheme
ponA ponA allele
gyrA gyrA allele
parC parC allele
23S 23S allele
  • < isolate >_variant_report.tsv - This report includes either a mutation variant call or the FA19 nucleotide or amino acid for AMR loci as well as frequencies for certain calls, whether certain genes have an early stop, and the presence of certain horizontally transferred genes.

Metric Description Default Associated Drug(s)
Sample Isolate ID - -
23S-2611 base FA19 or SNP base call at nucleotide 2599 in the 23S gene C AZM
23S-2611 freq Frequency of FA19 or SNP base call at nucleotide 2599 in the 23S gene 1.0 AZM
23S-2059 base FA19 or SNP base call at nucleotide 2047 in the 23S gene A AZM
23S-2059 freq Frequency of FA19 or SNP base call at nucleotide 2047 in the 23S gene 1.0 AZM
23S-2058 base FA19 or SNP base call at nucleotide 2046 in the 23S gene A AZM
23S-2058 freq Frequency of FA19 or SNP base call at nucleotide 2046 in the 23S gene 1.0 AZM
mtrR promoter The mtrR promoter contains the sequence AAAAA, and this field is to report a mutation at any one of these positions (but only one nucleotide is reported). The FA19 default is reported as A and if a mutation is found at any of the 5 positions, that nucleotide will be reported instead. A AZM/PEN/TET/CFM/CRO
mtr120 promoter FA19 or SNP base call at nucleotide 1110770 relative to the reference for the mtr120 promoter G AZM/PEN/TET/CFM/CRO
mtrR -35 FA19 or SNP base call at nucleotide 1110836 relative to the reference for mtrR -35 G AZM/PEN/TET/CFM/CRO
mtrR WHOP FA19 or SNP base call at nucleotide 1110839 relative to the reference for mtrR WHOP C AZM/PEN/TET/CFM/CRO
mtrA binding site FA19 or SNP base call at nucleotide 1110865 relative to the reference for the mtrA binding site G AZM/PEN/TET/CFM/CRO
mtrR aa39 FA19 or mutated variant amino acid at position 39 in the mtrR gene A AZM/PEN/TET/CFM/CRO
mtrR aa44 FA19 or mutated variant amino acid at position 44 in the mtrR gene R AZM/PEN/TET/CFM/CRO
mtrR aa45 FA19 or mutated variant amino acid at position 45 in the mtrR gene G AZM/PEN/TET/CFM/CRO
mtrR aa47 FA19 or mutated variant amino acid at position 47 in the mtrR gene L AZM/PEN/TET/CFM/CRO
mtrR aa79 FA19 or mutated variant amino acid at position 79 in the mtrR gene D AZM/PEN/TET/CFM/CRO
mtrR aa105 FA19 or mutated variant amino acid at position 105 in the mtrR gene H AZM/PEN/TET/CFM/CRO
mtrR premature stop True if early stop is found in the mtrR gene False AZM/PEN/TET/CFM/CRO
penA aa311 FA19 or mutated variant amino acid at position 311 in the penA gene A PEN/CFM/CRO
penA aa312 FA19 or mutated variant amino acid at position 312 in the penA gene I PEN/CFM/CRO
penA aa316 FA19 or mutated variant amino acid at position 316 in the penA gene V PEN/CFM/CRO
penA aa483 FA19 or mutated variant amino acid at position 483 in the penA gene T PEN/CFM/CRO
penA aa501 FA19 or mutated variant amino acid at position 501 in the penA gene A PEN/CFM/CRO
penA aa504 FA19 or mutated variant amino acid at position 504 in the penA gene L PEN/CFM/CRO
penA aa512 FA19 or mutated variant amino acid at position 512 in the penA gene N PEN/CFM/CRO
penA aa516 FA19 or mutated variant amino acid at position 516 in the penA gene A PEN/CFM/CRO
penA aa542 FA19 or mutated variant amino acid at position 542 in the penA gene G PEN/CFM/CRO
penA aa545 FA19 or mutated variant amino acid at position 545 in the penA gene G PEN/CFM/CRO
penA aa549 FA19 or mutated variant amino acid at position 549 in the penA gene A PEN/CFM/CRO
penA aa551 FA19 or mutated variant amino acid at position 551 in the penA gene P PEN/CFM/CRO
penA D345ins True if there is an aspartic acid (D) insertion at amino acid position 345 in the penA gene False PEN/CFM/CRO
ponA aa375 FA19 or mutated variant amino acid at position 375 in the ponA gene A PEN/CFM/CRO
ponA aa421 FA19 or mutated variant amino acid at position 421 in the ponA gene L PEN/CFM/CRO
pilQ full length False if early stop is found in the pilQ gene True PEN/TET
pilQ aa341 FA19 or mutated variant amino acid at position 341 in the pilQ gene S PEN/TET
pilQ aa526 FA19 or mutated variant amino acid at position 526 in the pilQ gene D PEN/TET
pilQ aa648 FA19 or mutated variant amino acid at position 648 in the pilQ gene S PEN/TET
pilQ aa666 FA19 or mutated variant amino acid at position 666 in the pilQ gene E PEN/TET
gyrA aa91 FA19 or mutated variant amino acid at position 91 in the gyrA gene S CIP
gyrA aa92 FA19 or mutated variant amino acid at position 92 in the gyrA gene A CIP
gyrA aa95 FA19 or mutated variant amino acid at position 95 in the gyrA gene D CIP
parC aa86 FA19 or mutated variant amino acid at position 86 in the parC gene D CIP
parC aa87 FA19 or mutated variant amino acid at position 87 in the parC gene S CIP
parC aa88 FA19 or mutated variant amino acid at position 88 in the parC gene S CIP
parC aa91 FA19 or mutated variant amino acid at position 91 in the parC gene E CIP
blaTEM present True if blaTEM is found with sufficient depth (>=2) False PEN
tetM present True if tetM is found with sufficient depth (>=2) False TET
rpsJ aa57 FA19 or mutated variant amino acid at position 57 in the rpsJ gene V TET
ftsX aa31 FA19 or mutated variant amino acid at position 31 in the ftsX gene T CTX
rplD aa68 FA19 or mutated variant amino acid at position 68 in the rplD gene G AZM
rplD aa70 FA19 or mutated variant amino acid at position 70 in the rplD gene G AZM
rplV ins True if any insertions are found in the rplV gene False ASK!!
macA aa99 FA19 or mutated variant amino acid at position 99 in the macA gene N AZM
mtrD aa42 FA19 or mutated variant amino acid at position 42 in the mtrD gene T AZM
mtrD aa46 FA19 or mutated variant amino acid at position 46 in the mtrD gene H AZM
mtrD aa48 FA19 or mutated variant amino acid at position 48 in the mtrD gene I AZM
mtrD aa101 FA19 or mutated variant amino acid at position 101 in the mtrD gene N AZM
mtrD aa174 FA19 or mutated variant amino acid at position 174 in the mtrD gene R AZM
mtrD aa612 FA19 or mutated variant amino acid at position 612 in the mtrD gene F AZM
mtrD aa662 FA19 or mutated variant amino acid at position 662 in the mtrD gene V AZM
mtrD aa669 FA19 or mutated variant amino acid at position 669 in the mtrD gene E AZM
mtrD aa714 FA19 or mutated variant amino acid at position 714 in the mtrD gene R AZM
mtrD aa821 FA19 or mutated variant amino acid at position 821 in the mtrD gene S AZM
mtrD aa823 FA19 or mutated variant amino acid at position 823 in the mtrD gene K AZM
mtrD aa826 FA19 or mutated variant amino acid at position 826 in the mtrD gene A AZM
macA promoter FA19 or SNP base call at nucleotide 1192227 relative to the reference for the macA promoter A AZM
norM promoter FA19 or SNP base call at nucleotide 129495 relative to the reference for the norM promoter G CIP
ermB present True if ermB is found with sufficient depth (>=2) False AZM
ermC present True if ermC is found with sufficient depth (>=2) False AZM
ermF present True if ermF is found with sufficient depth (>=2) False AZM
mefA present True if mefA is found with sufficient depth (>=2) False AZM
gyrB aa429 FA19 or mutated variant amino acid at position 429 in the gyrB gene D ETX0914
gyrB aa450 FA19 or mutated variant amino acid at position 450 in the gyrB gene K ETX0914
acnB aa348 FA19 or mutated variant amino acid at position 348 in the acnB gene G CFM/CRO
acnB aa371 FA19 or mutated variant amino acid at position 371 in the acnB gene Q CFM/CRO
16S-1053 base FA19 or SNP base call at nucleotide 1053 in the 16S gene G ERA
16S-1053 freq Frequency of FA19 or SNP base call at nucleotide 1053 in the 16S gene 1.0 ERA
16S-1186 base FA19 or SNP base call at nucleotide 1186 in the 16S gene G ERA
16S-1186 freq Frequency of FA19 or SNP base call at nucleotide 1186 in the 16S gene 1.0 ERA
rpsE aa24 FA19 or mutated variant amino acid at position 24 in the rpsE gene T SPC
rpsE aa28 FA19 or mutated variant amino acid at position 28 in the rpsE gene K SPC
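
One way to read a variant report row is to compare each call against the FA19 default from the table above and keep only the differences. The sketch below does this for a handful of columns. It assumes the report stores values exactly as they appear in the Default column (for example the strings "S" or "False"), which has not been verified against the pipeline. The isolate row and the function name `non_default_calls` are hypothetical.

```python
# Excerpt of the Default column from the table above.
FA19_DEFAULTS = {
    "gyrA aa91": "S",
    "gyrA aa95": "D",
    "parC aa87": "S",
    "penA aa501": "A",
    "blaTEM present": "False",
    "tetM present": "False",
}


def non_default_calls(row, defaults=FA19_DEFAULTS):
    """Return the columns whose value differs from the FA19 default."""
    return {
        column: row[column]
        for column, default in defaults.items()
        if column in row and row[column] != default
    }


# Hypothetical isolate row, as it might be read by csv.DictReader.
example_row = {
    "Sample": "isolate_A",
    "gyrA aa91": "F",
    "gyrA aa95": "G",
    "parC aa87": "S",
    "penA aa501": "A",
    "blaTEM present": "False",
    "tetM present": "True",
}
differences = non_default_calls(example_row)
print(differences)
```

Extending `FA19_DEFAULTS` with the remaining rows of the table would cover the full report.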

phylogeny - output from the Phylogeny Subworkflow

The phylogeny directory contains the snippy_core, gubbins, snp_dists, RAxML, and Gotree subdirectories. It also contains the phylogeny_qc_report.tsv.

snippy_core - The snippy_core directory contains the output from running the entire collection of isolates against the same reference, building a core genome, and then generating core alignment files using that core genome. This information can be used for building a phylogenetic tree. The core genome is composed of the genomic positions that are present in all of the isolates in the collection. The files found in the snippy_core directory are detailed in the table below:

File Description
core.aln A core SNP alignment in the --aformat format (default FASTA)
core.full.aln A whole genome SNP alignment (includes invariant sites)
core.tab Tab-separated columnar list of core SNP sites with alleles but NO annotations
core.txt Tab-separated columnar list of alignment/core-size statistics
core.ref.fa FASTA version/copy of the --ref

The core.txt file contains columns with the metrics seen in the following table with each of the isolates and the reference as rows.

Metric Description
ID Sample or reference name
LENGTH Total length of the genome measured in bases
ALIGNED Number of bases that aligned with the reference
UNALIGNED Number of bases that did not align with the reference
VARIANT Total number of variants relative to the reference
HET Number of heterozygotes
MASKED Number of masked bases
LOWCOV Number of bases with low coverage

gubbins - The gubbins subdirectory contains output from Gubbins, a tool used to mark recombinant regions and construct a phylogeny based on mutations outside of those regions. The Phylip file from this output is used during phylogenetic analysis. This directory also contains _partition_data.txt, which contains the monomorphic counts for the unambiguous nucleotides, as well as _partition.txt, which references _partition_data.txt and is used for the ascertainment bias correction in the RAxML step.

The following table includes all of the files produced by Gubbins that can be found in the gubbins directory, as detailed in the Gubbins manual:

Extension Description
.recombination_predictions.embl Recombination predictions in EMBL file format
.recombination_predictions.gff Recombination predictions in GFF format
.branch_base_reconstruction.embl Base substitution reconstruction in EMBL format
.summary_of_snp_distribution.vcf Per branch reporting of the base substitutions inside and outside recombination events
.filtered_polymorphic_sites.fasta FASTA format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration
.filtered_polymorphic_sites.phylip Phylip format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration
.final_tree.tre This file contains the final phylogeny in Newick format; branch lengths are in point mutations
.node_labelled.final_tree.tre Final phylogenetic tree in Newick format but with the internal node labels; branch lengths are in point mutations
.log A log file specifying the software used at each step of the analysis, with accompanying citations
.per_branch_statistics.csv File containing summary statistics for each branch in the tree in comma delimited format

The .per_branch_statistics.csv file contains columns with the metrics seen in the following table, with one row for each node (sample or internal node), per the Gubbins manual.

Metric Description
Node Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered
Total SNPs Total number of base substitutions reconstructed onto the branch
Number of SNPs Inside Recombinations Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (r)
Number of SNPs Outside Recombinations Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (m)
Number of Recombination Blocks Total number of recombination blocks reconstructed onto the branch
Bases in Recombinations Total length of all recombination events reconstructed onto the branch
Cumulative Bases in Recombinations Total number of bases in the alignment affected by recombination on this branch and its ancestors
r/m The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch
rho/theta The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation
Genome Length The total number of aligned bases between the ancestral and descendant nodes for the branch, excluding any missing data or gaps in either
Bases in Clonal Frame The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation)
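
As an illustration of how the (r) and (m) columns relate to the r/m value, the following is a minimal Python sketch that recomputes r/m for each branch. The sample rows, the node names, and the comma delimiter are assumptions made for this example; check the header and delimiter of your own .per_branch_statistics.csv before reusing it.

```python
import csv
import io

# Hypothetical excerpt of a per-branch statistics file (values are made up).
# The delimiter and exact header spelling in a real file may differ.
SAMPLE = """Node,Total SNPs,Number of SNPs Inside Recombinations,Number of SNPs Outside Recombinations
isolate_A,30,20,10
isolate_B,5,0,5
Node_1,12,12,0
"""


def r_over_m(row):
    """Return r/m for one branch, or None when m is zero (ratio undefined)."""
    r = int(row["Number of SNPs Inside Recombinations"])
    m = int(row["Number of SNPs Outside Recombinations"])
    return r / m if m else None


def branch_ratios(text, delimiter=","):
    """Map each node name to its recomputed r/m value."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row["Node"]: r_over_m(row) for row in reader}


ratios = branch_ratios(SAMPLE)
for node, value in ratios.items():
    print(node, value)
```

A branch with no SNPs outside recombinations has an undefined ratio, which this sketch reports as None rather than dividing by zero.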

snp_dists - The snp_dists directory contains all of the output from snp-dists, as well as any potential outbreak clusters detected in the sample set.

This output includes the following files:

  • isolate_clusters.txt - Each line describes one potential outbreak cluster. It lists every isolate in the cluster, each followed by a list of the SNP distances between that isolate and every isolate in the cluster (including itself, hence the 0). Clusters are identified by first constructing a graph with the isolates as nodes and an edge between any two isolates whose SNP distance is < 20 (the default; this threshold can be changed). A disjoint set union (DSU) analysis is then performed on this graph to identify its connected components, which are reported as the potential outbreak clusters. Because clusters are connected components, two isolates in the same cluster can be further apart than the threshold as long as they are linked through intermediate isolates.
    Example cluster:
    GCWGS-27036-WA-M5130-240405_S28_L001 [ 0 22 24 14 20] GCWGS-27038-WA-M5130-240405_S30_L001 [22 0 30 16 14] GCWGS-27039-WA-M5130-240405_S31_L001 [24 30 0 21 18] GCWGS-27046-WA-M5130-240405_S38_L001 [14 16 21 0 16] GCWGS-27419-WA-M5130-240417_S68_L001 [20 14 18 16 0]
  • matrix.tsv - Contains the pairwise SNP distances between all isolates in the sample set in matrix format.
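
The clustering described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's own code: the function names and isolate labels are hypothetical, the first five rows reuse the distances from the example cluster, the sixth isolate is invented, and dropping single-isolate components is an assumption.

```python
def find(parent, i):
    """Return the root of the set containing i (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def snp_clusters(names, matrix, threshold=20):
    """Group isolates into connected components of the graph whose
    edges join isolates with a SNP distance below `threshold`."""
    parent = list(range(len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] < threshold:
                union(parent, i, j)
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(parent, i), []).append(name)
    # Assumption: components with a single isolate are not reported.
    return [members for members in groups.values() if len(members) > 1]


# Rows A-E use the distances from the example cluster; F is a made-up
# distant isolate added to show that unlinked isolates are left out.
names = ["A", "B", "C", "D", "E", "F"]
matrix = [
    [0, 22, 24, 14, 20, 100],
    [22, 0, 30, 16, 14, 100],
    [24, 30, 0, 21, 18, 100],
    [14, 16, 21, 0, 16, 100],
    [20, 14, 18, 16, 0, 100],
    [100, 100, 100, 100, 100, 0],
]
clusters = snp_clusters(names, matrix)
print(clusters)
```

Isolates A and C are 24 SNPs apart, yet they end up in the same cluster because they are connected through D and E.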

RAxML - The RAxML directory contains the output produced by RAxML. The RAxML_bestTree.core_ file is the best tree generated during the phylogenetic analysis, in Newick format. This is the tree visualized by Gotree.

The following table includes all of the files produced by RAxML that can be found in the RAxML directory, as detailed in the RAxML manual:

| Prefix | Description |
| --- | --- |
| RAxML_bestTree. | Contains the best-scoring ML tree of a thorough ML analysis |
| RAxML_bipartitions. | BS support values on the best tree found during the ML search |
| RAxML_bipartitionsBranchLabels. | Contains the same information as the file above, but support values are correctly displayed as Newick branch labels and not node labels! Support values always refer to branches/splits of trees and never to nodes of the tree. |
| RAxML_bootstrap. | All final bootstrapped trees |
| RAxML_info | Contains information about the model and algorithm used and how RAxML was called. The final GAMMA-based likelihood(s) as well as the alpha shape parameter(s) are printed to this file. In addition, if the rearrangement setting was determined automatically (-i has not been used) the rearrangement setting found by the program will be indicated. This is the most important output file because it tells you what RAxML did and is always written irrespective of the command line option. In addition, it contains information about all other output files that were written by your run. |

Gotree - The Gotree directory contains the output from Gotree, including the midrooted version of the best tree output by RAxML and bestTree.png, which is an image of the midrooted phylogenetic tree. The clusters identified in the outbreak detection step are color coded in this tree, and the cluster_annotation.tsv file contains this annotation information. This directory also contains phylogeny_report.html, which can be opened in your browser. This report contains the image of the phylogenetic tree as well as the outbreak clusters from isolate_clusters.txt.

phylogeny_qc_report.tsv - This file contains the QC report for the phylogenetic analysis. The table below contains an example of this report.

| QC Parameter | Accepted Value | Actual Value | Pass/Fail |
| --- | --- | --- | --- |
| Num_Samples_Aligned | 7 | 7 | pass |
| Match_Ref_Length | true | true | pass |
| Num_Lines_w_Invalid_Nuc | 0 | 0 | pass |
| All_Present_in_Tree | true | true | pass |
| Num_Outliers | - | 0 | NA |
| Core_Mono_Nuc_bp_Count | - | 1892518 | NA |

The following table contains the descriptions of these parameters:

| Metric | Description |
| --- | --- |
| Num_Samples_Aligned | Number of samples in the .full.aln FASTA alignment produced by snippy-core. A run will pass if this value (reported under Actual Value) matches the number of samples given to the pipeline (reported under Accepted Value). |
| Match_Ref_Length | “true” if the sequences in the .full.aln FASTA file produced by snippy-core are the same length as the reference. This being “true” results in a pass. |
| Num_Lines_w_Invalid_Nuc | Number of sequence lines in the .full.aln FASTA file produced by snippy-core that contain invalid nucleotides. There being 0 lines containing invalid nucleotides results in a pass. |
| All_Present_in_Tree | “true” if all of the samples in the run are present in the Newick file produced by RAxML. This being “true” results in a pass. |
| Num_Outliers | Number of statistical outliers in the resulting tree. A statistical outlier is a sample whose branch length is greater than the mean branch length + 2 × (the standard deviation of branch lengths). There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails. If you perform a lineage-specific (same MLST or coregeno group) phylogenetic analysis, you should not see any outliers. If you do, we suggest reviewing your isolate list and the phylogenetic visuals in order to assess whether the isolates with unusually long branches actually belong to the same lineage. |
| Core_Mono_Nuc_bp_Count | Size of the core alignment, excluding any portions that are not perfectly aligned between the samples in the set. There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails. |
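
To make the Num_Outliers rule concrete, here is a minimal Python sketch of the mean + 2 standard deviations cutoff. The function name, sample names, and branch lengths are hypothetical, and the use of the population standard deviation over per-sample branch lengths is an assumption; the pipeline may compute this differently.

```python
from statistics import mean, pstdev


def branch_outliers(branch_lengths, k=2.0):
    """Return the samples whose branch length exceeds
    mean + k * standard deviation of all supplied branch lengths.

    Assumption: population standard deviation (pstdev) is used.
    """
    values = list(branch_lengths.values())
    cutoff = mean(values) + k * pstdev(values)
    return sorted(name for name, length in branch_lengths.items() if length > cutoff)


# Hypothetical branch lengths for seven samples; s7 sits on a long branch.
lengths = {"s1": 1, "s2": 1, "s3": 1, "s4": 1, "s5": 1, "s6": 1, "s7": 10}
outliers = branch_outliers(lengths)
print(outliers)
```

In this made-up set the cutoff is roughly 8.6, so only s7 is flagged and Num_Outliers would be 1.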

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools, e.g. fastp. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
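
As a troubleshooting aid, the trace file can be scanned for tasks that did not finish successfully. The following is a minimal Python sketch, assuming execution_trace.txt is tab-separated and includes the name and status columns from Nextflow's default trace fields. The sample rows and process names are made up for illustration.

```python
import csv
import io

# Hypothetical excerpt of a trace file; a real one has more columns.
SAMPLE_TRACE = (
    "task_id\thash\tname\tstatus\texit\n"
    "1\tab/12cd34\tFASTP (s1)\tCOMPLETED\t0\n"
    "2\tcd/56ef78\tSHOVILL (s2)\tFAILED\t1\n"
    "3\tef/90ab12\tQUAST (s1)\tCACHED\t0\n"
)


def unsuccessful_tasks(text):
    """Return (name, status) for every task that neither completed
    nor was restored from the cache."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    ok = {"COMPLETED", "CACHED"}
    return [(row["name"], row["status"]) for row in reader if row["status"] not in ok]


problems = unsuccessful_tasks(SAMPLE_TRACE)
print(problems)
```

To use it on a real run, read pipeline_info/execution_trace.txt into a string and pass that to unsuccessful_tasks.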