neissflow: Output

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

fastp - a tool for all-in-one FASTQ processing, including quality filtering, adaptor-trimming, and quality-trimming, as well as quality profiling
Samtools stats - a tool for collecting statistics from BAM files and outputting them in a text format
Mash - a tool for species screening via fast genome and metagenome distance estimation using MinHash
Shovill - an assembly tool for Illumina paired-end reads
QUAST - a tool for evaluating assemblies through calculating and reporting quality metrics
Snippy - a tool for rapid haploid variant calling and core genome alignment
mlst - a tool for scanning contigs against PubMLST typing schemes.
NGMASTER - a tool for performing multi-antigen sequence typing for Neisseria gonorrhoeae (NG-MAST) and Neisseria gonorrhoeae sequence typing for antimicrobial resistance (NG-STAR)
BLASTn - basic local alignment search tool (BLAST) for comparing nucleotide sequences to those in a database.
Samtools depth - a tool for calculating the read depth at a given position from an alignment.
snp-dists - a tool for generating a SNP distance matrix from a FASTA core alignment
Gubbins - a tool for marking recombination regions and constructing a phylogeny based on mutations outside of those regions
RAxML - a tool for performing Maximum Likelihood based inference of large phylogenetic trees
Gotree - a tool for manipulating phylogenetic trees and generating visualizations
MultiQC - Aggregate report describing results and QC from the whole pipeline
Pipeline information - Report metrics generated during the workflow execution

run-name_final_report.tsv - aggregated report that combines data from the following reports:

QC_FASTQ – neissflow_FASTQ_QC_report.tsv
MASH – neissflow_Mash_contaminants.tsv, neissflow_Mash_top_hit_report.tsv, neissflow_Mash_plasmids.tsv
processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt
amr_profiler - neissflow_amr_report.tsv, neissflow_avg_depth_report.tsv

QC_final_report.tsv - aggregated report containing only the control samples

This report is present if the QC profile is included in the run
This report contains the same columns as run-name_final_report.tsv

QC_FASTQ – output from the Preprocessing Subworkflow

In this directory, there are 2 subdirectories, named Reports and Samples.

Samples - contains subdirectories for each of the isolates, which contain the trimmed & filtered paired read files produced by fastp for that isolate.

Reports - contains the files neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc.tsv, and neissflow_passed_qc.tsv.

neissflow_FASTQ_QC_report.tsv - aggregated report containing the sequence reports generated by fastp for all of the samples. This report is used for the first QC check to remove low quality isolates from analysis.
neissflow_failed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the low-quality samples that failed the first QC check
neissflow_passed_qc1.tsv - aggregated report containing the sequence reports generated by fastp for the samples that passed the first QC check and moved on for further analysis

The columns for the following metrics should be found in neissflow_FASTQ_QC_report.tsv, neissflow_failed_qc.tsv, and neissflow_passed_qc.tsv (this information can also be found in the fastp repository):

Metric	Description
Isolate	Isolate ID
before_filtering_total_reads	Total number of reads in the isolate FASTQ file before quality filtering and trimming
before_filtering_total_bases	Total number of bases in the isolate FASTQ file before quality filtering and trimming
before_filtering_q20_bases	Total number of bases with a Phred score of >= Q20 before quality filtering and trimming
before_filtering_q30_bases	Total number of bases with a Phred score of >= Q30 before quality filtering and trimming
before_filtering_read1_mean_length	Mean length of reads for the R1 file for the isolate paired-end read files before quality filtering and trimming
before_filtering_read2_mean_length	Mean length of reads for the R2 file for the isolate paired-end read files before quality filtering and trimming
before_filtering_gc_content	Fraction of bases that are guanine or cytosine before quality filtering and trimming
after_filtering_total_reads	Total number of reads in the isolate FASTQ file after quality filtering and trimming
after_filtering_total_bases	Total number of bases in the isolate FASTQ file after quality filtering and trimming
after_filtering_q20_bases	Total number of bases with a Phred score of >= Q20 after quality filtering and trimming
after_filtering_q30_bases	Total number of bases with a Phred score of >= Q30 after quality filtering and trimming
after_filtering_read1_mean_length	Mean length of reads for the R1 file for the isolate paired-end read files after quality filtering and trimming
after_filtering_read2_mean_length	Mean length of reads for the R2 file for the isolate paired-end read files after quality filtering and trimming
after_filtering_gc_content	Fraction of bases that are guanine or cytosine after quality filtering and trimming
passed_filter_reads	Total number of reads in the isolate FASTQ file that passed the quality filter
low_quality_reads	Total number of reads in the isolate FASTQ file that did NOT pass the quality filter
too_many_N_reads	Total number of reads that contain too many “N” bases (bases that could not be basecalled by the sequencer)
too_short_reads	Total number of reads that were filtered out due to having length <= 100 bases
too_long_reads	Total number of reads that were filtered out due to being too long (limit not currently set)
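
As an illustration of how these columns can be combined, the sketch below computes the fraction of post-filter bases at >= Q30 for each isolate. The two-row TSV excerpt is invented for the example and only the column names come from the table above. This derived fraction is not a documented pipeline threshold.

```python
import csv
import io

# Invented excerpt; only the column names are taken from the table above.
SAMPLE_TSV = (
    "Isolate\tafter_filtering_total_bases\tafter_filtering_q30_bases\n"
    "isolate_A\t1000000\t950000\n"
    "isolate_B\t800000\t600000\n"
)


def q30_fraction(handle):
    """Return {isolate: fraction of post-filter bases with Phred >= Q30}."""
    fractions = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        total = int(row["after_filtering_total_bases"])
        q30 = int(row["after_filtering_q30_bases"])
        # Guard against an isolate with no bases left after filtering.
        fractions[row["Isolate"]] = q30 / total if total else 0.0
    return fractions


fractions = q30_fraction(io.StringIO(SAMPLE_TSV))
for isolate, fraction in fractions.items():
    print(f"{isolate}\t{fraction:.3f}")
```

To run this against a real report, pass an open file handle for neissflow_FASTQ_QC_report.tsv instead of the in-memory sample.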

snippy – output from the SNIPPY module and from the AMR_Profiler Subworkflow SNIPPY_AMR module

In this directory, there are subdirectories for each reference used (default is FA19 and AMR), which contain directories for each of the isolates in the collection.

The isolate directories contain all of the Snippy output for that specific isolate. Below is a table of the produced files (this information can also be found in the Snippy repository):

Extension	Description
.tab	A simple tab-separated summary of all the variants
.csv	A comma-separated version of the .tab file
.html	An HTML version of the .tab file
.vcf	The final annotated variants in VCF format
.bed	The variants in BED format
.gff	The variants in GFF3 format
.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai	Index for the .bam file
.log	A log file with the commands run and their outputs
.aligned.fa	A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants)
.consensus.fa	A version of the reference genome with all variants instantiated
.consensus.subs.fa	A version of the reference genome with only substitution variants instantiated
.raw.vcf	The unfiltered variant calls from Freebayes
.filt.vcf	The filtered variant calls from Freebayes
.vcf.gz	Compressed .vcf file via BGZIP
.vcf.gz.csi	Index for the .vcf.gz via bcftools index

The TAB, CSV, and HTML files all should contain the same variant summary information in tables with the following columns:

Name	Description
CHROM	The sequence the variant was found in eg. the name after the > in the FASTA reference
POS	Position in the sequence, counting from 1
TYPE	The variant type: single nucleotide polymorphism (snp), multinucleotide polymorphism (mnp), insertion (ins), deletion (del), combination of snp/mnp (complex)
REF	The nucleotide(s) in the reference
ALT	The alternate nucleotide(s) supported by the reads
EVIDENCE	Frequency counts for REF and ALT
FTYPE	Class of feature affected: CDS tRNA rRNA ...
STRAND	Strand the feature was on: + -
NT_POS	Nucleotide position of the variant within the feature / Length in nucleotides
AA_POS	Residue position / Length in amino acids (only if FTYPE is CDS)
EFFECT	The snpEff annotated consequence of this variant
LOCUS_TAG	The /locus_tag of the feature (if it existed)
GENE	The /gene tag of the feature (if it existed)
PRODUCT	The /product tag of the feature (if it existed)
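
A minimal sketch of reading one of these variant summaries is shown below: it counts variants per TYPE value. The three-row table is invented for illustration and carries only a subset of the columns listed above; a real Snippy .tab file has all of them.

```python
import csv
import io
from collections import Counter

# Invented excerpt with a subset of the documented columns.
SAMPLE_TAB = (
    "CHROM\tPOS\tTYPE\tREF\tALT\n"
    "contig_1\t1045\tsnp\tC\tT\n"
    "contig_1\t2210\tdel\tGA\tG\n"
    "contig_1\t5120\tsnp\tA\tG\n"
)


def count_variant_types(handle):
    """Count rows per value of the TYPE column in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["TYPE"] for row in reader)


type_counts = count_variant_types(io.StringIO(SAMPLE_TAB))
for variant_type, count in sorted(type_counts.items()):
    print(f"{variant_type}\t{count}")
```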

COVERAGE – output from the STATS step

In this directory, there are report text files for each of the samples, as well as a file called “neissflow_coverage.tsv”

The “neissflow_coverage.tsv” file contains rows for each isolate in the input directory. The columns for the following fields can be found in the output file:

Metric	Description
ID	Isolate ID
%target>10x	Percent of the target genome (FA19) with greater than 10x coverage

The content contained in the report text files is not used for further analysis or QC checks at this time, but information on metrics included can be found in SOP-TBA or on the Samtools stats manual page.

MASH – output from the MASH and COMBINE_MASH_REPORTS steps

In this directory there are TSV reports for each sample containing the complete Mash screening report for each respective sample, as well as the aggregated reports neissflow_Mash_contaminants.tsv, neissflow_Mash_top_hit_report.tsv, and neissflow_Mash_plasmids.tsv.

neissflow_Mash_top_hit_report.tsv - Reports the top species hit (greatest identity score + fraction of hashes) from the Mash screening with 1000 total hashes. (non plasmid hits)
neissflow_Mash_contaminants.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) non-Neisseria hits from the Mash screening. (non plasmid hits)
neissflow_Mash_plasmids.tsv - Reports all the good (identity score of >= 0.95 and fraction of hashes of >= 0.95) plasmid hits from the Mash screening.

neissflow_Mash_contaminants.tsv, neissflow_Mash_plasmids.tsv, and neissflow_Mash_top_hit_report.tsv contain the following columns:

Metric	Description
ID	Isolate ID
ident	What fraction of the bases are shared between the reference genome and input reads. Sequencing errors and gaps in coverage reduce this.
hashes	For each k-mer in the dataset screened, a “hash” is created and compared with the k-mers or “hashes” in each reference genome in the databases. When all k-mers are shared with a particular reference genome then 1000/1000 is the value for shared hashes.
median_mult	Is a “rough estimate” for abundance, but it is affected by the size of the reference genome. Small genomes tend to produce large values, while having low identity and few shared hashes.
p_val	The probability of observing the number of shared k-mers with the estimated identity
hit_name	Truncated name of the reference genome. Ex: Neisseria_gonorrhoeae_MS11

The individual reports contain the same information; however, the columns are not labeled and the “hit_name” is the long version of the Refseq FASTA hit.
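
The "good hit" criterion described above (identity >= 0.95 and fraction of hashes >= 0.95) can be sketched as follows. This assumes the hashes column is written as "shared/total" (for example 1000/1000, as in the column description). The function names are hypothetical and this is not the pipeline's own code.

```python
def hash_fraction(hashes):
    """Convert a 'shared/total' string such as '980/1000' to a fraction."""
    shared, total = hashes.split("/")
    return int(shared) / int(total)


def is_good_hit(ident, hashes, min_ident=0.95, min_hash_fraction=0.95):
    """Apply the two documented thresholds to one Mash screen hit."""
    return (
        float(ident) >= min_ident
        and hash_fraction(hashes) >= min_hash_fraction
    )


# Invented example hits: (ident, hashes).
example_hits = [("0.99", "980/1000"), ("0.99", "900/1000"), ("0.94", "1000/1000")]
for ident, hashes in example_hits:
    print(ident, hashes, is_good_hit(ident, hashes))
```

Separating plasmid from non-plasmid hits and Neisseria from non-Neisseria hits would need an extra test on hit_name, which is not sketched here because the exact naming rules are not described in this document.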

processed_genomes – output from the Assembly Subworkflow

In this directory there are FASTA assemblies for each sample, a QUAST directory, and a Denovo_assembly_Stats_QC_report.txt file.

QUAST - Contains subdirectories for each of the samples in the set. These directories contain output from QUAST, which are reports on the quality metrics for that sample's assembly. The description of the QUAST output can be found below:

report.txt      summary table
report.tsv      tab-separated version, for parsing, or for spreadsheets (Google Docs, Excel, etc)  
report.tex      Latex version
report.pdf      PDF version, includes all tables and plots for some statistics
report.html     everything in an interactive HTML file
icarus.html     Icarus main menu with links to interactive viewers
contigs_reports/        
  misassemblies_report  detailed report on misassemblies
  unaligned_report      detailed report on unaligned and partially unaligned contigs

Sample Assemblies – FASTA files produced by shovill, follow sample_name_contigs.fa nomenclature

neissflow_Denovo_assembly_Stats_QC_report.txt – Compiled QC metrics report for all of the isolates in the collection. This report contains the following columns:

Metric	Description
Filename	Isolate ID
Contig_Count	Total number of contigs
Bases_In_Contigs	Total number of bases included in contigs
Large_Contig_Count	Total number of large contigs (length > 10,000)
Small_Contig_Count	Total number of small contigs (length <= 10,000)
>500bp_Contig_Count	Total number of contigs with >500 basepairs
Bases_In_Large_Contigs	Total number of bases included in large contigs (length > 10,000)
Bases_In_Small_Contigs	Total number of bases included in small contigs (length <= 10,000)
Fraction_Of_Contigs_That_Are_Large	(large contig count) / (total contig count)
Min_Coverage_Large_Contigs	Minimum of the coverages of the large contigs
Max_Ratio_of_Coverage_Large_Contigs	(maximum of the coverages of the large contigs) / (minimum of the coverages of the large contigs)
Low_Coverage_Contig_Count	Total number of contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2)
Low_Coverage_Contig_Bases	Total number of bases in contigs with low coverage (coverage < (minimum of the coverages of the large contigs)/2)
Mean_Coverage	Length-weighted mean coverage: the sum over all contigs in the assembly of (contig length * contig coverage), divided by the total number of bases in the assembly
Ambiguous_nucleotides	Total number of ambiguous nucleotides in the assembly
N50	The length for which the collection of all contigs of that length or longer cover at least half the assembly
N75	Similar to N50, but using 75% of the assembly covered
N90	Similar to N50 and N75, but using 90% of the assembly covered
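
To make the N50/N75/N90 and Mean_Coverage definitions concrete, here is a small sketch computed from a list of (contig length, coverage) pairs. The contig values are invented and the function names are hypothetical. It follows the definitions in the table above and is not the pipeline's own implementation.

```python
def n_stat(contig_lengths, percent=50):
    """Length L such that contigs of length >= L cover at least
    `percent` of the assembly (N50 when percent=50)."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if covered * 100 >= percent * total:
            return length
    return 0  # empty assembly


def mean_coverage(contigs):
    """Length-weighted mean coverage over (length, coverage) pairs."""
    total_bases = sum(length for length, _ in contigs)
    if total_bases == 0:
        return 0.0
    weighted = sum(length * coverage for length, coverage in contigs)
    return weighted / total_bases


# Invented toy assembly: (contig length, coverage).
contigs = [(100, 30.0), (80, 20.0), (60, 20.0), (40, 10.0), (20, 10.0)]
lengths = [length for length, _ in contigs]

print("N50", n_stat(lengths, 50))
print("N75", n_stat(lengths, 75))
print("N90", n_stat(lengths, 90))
print("Mean coverage", round(mean_coverage(contigs), 2))
```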

QC_CHECK – output from the second QC check

In this directory there are 2 files named neissflow_failed_qc.tsv and neissflow_passed_qc.tsv.

These reports are an aggregation of some of the generated reports up to this point in order to perform a species and assembly QC check.

Aggregated Reports:
- QC_FASTQ – neissflow_FASTQ_QC_report.tsv
- COVERAGE – neissflow_coverage.tsv
- MASH – neissflow_Mash_top_hit_report.tsv
- processed_genomes - neissflow_Denovo_assembly_Stats_QC_report.txt

neissflow_failed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that failed the species and assembly QC check and did not continue for further analysis.
neissflow_passed_qc2.tsv - contains the same columns as initial_merge.tsv, but only contains data for the samples that passed the species and assembly QC check and continued for further analysis.
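
One way to picture this aggregation is as a join on the isolate ID. The key column is named differently in each source report (Isolate, ID, and Filename in the tables above). The sketch below assumes an inner join and uses invented rows; the pipeline's actual merge logic is not described in this document and may differ.

```python
import csv
import io


def read_keyed(text, key):
    """Read a TSV string into {isolate_id: remaining columns}."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        isolate = row.pop(key)
        rows[isolate] = row
    return rows


def merge_reports(*keyed_reports):
    """Inner join: keep only isolates present in every report (assumption)."""
    shared = set(keyed_reports[0])
    for report in keyed_reports[1:]:
        shared &= set(report)
    merged = {}
    for isolate in sorted(shared):
        combined = {}
        for report in keyed_reports:
            combined.update(report[isolate])
        merged[isolate] = combined
    return merged


# Invented excerpts; column names come from the report tables above.
FASTQ_QC = "Isolate\tafter_filtering_total_reads\nisolate_A\t1200000\nisolate_B\t900000\n"
COVERAGE = "ID\t%target>10x\nisolate_A\t98.7\nisolate_B\t97.1\n"
ASSEMBLY = "Filename\tContig_Count\tN50\nisolate_A\t85\t61000\n"

merged = merge_reports(
    read_keyed(FASTQ_QC, "Isolate"),
    read_keyed(COVERAGE, "ID"),
    read_keyed(ASSEMBLY, "Filename"),
)
for isolate, columns in merged.items():
    print(isolate, columns)
```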

amr_profiler – output from the AMR_Profiler Subworkflow

In this directory there are subdirectories for each of the isolates, as well as two aggregated reports, those being the amr_report.tsv and avg_depth_report.tsv. These reports are based on those generated by the GC-Genome-Profiler, and largely contain the same columns/information. The isolate directories contain the various reports for that isolate. These reports are aggregated to generate the amr_report.tsv and avg_depth_report.tsv files to contain the results for all of the samples in the set. The amr_report.tsv is an aggregation of the < isolate >_amr_report.tsv files. The avg_depth_report.tsv is an aggregation of the < isolate >_amr_depth.tsv reports. The reports found in the isolate directories are:

< isolate >_amr_blast.tsv - this report contains the results of the blastn runs for that isolate, all of the columns are included in the amr_report.tsv report

Metric	Description
Sample	Isolate ID
penA allele	Best hit from penA allele BLAST database
porB allele	Best hit from porB allele BLAST database
mtrR mosaic	True if isolate reaches 98% match threshold

< isolate >_amr_depth.tsv - this report contains the depth of coverage for each of the relevant AMR genes

Metric	Description
Sample	Isolate ID
ermC	Depth of coverage for the erythromycin resistance ermC gene
ermB	Depth of coverage for the erythromycin resistance ermB gene
ermF	Depth of coverage for the erythromycin resistance ermF gene
gyrB	Depth of coverage for the gyrB gene (positions 1547933-1550323 in FA19)
gyrA	Depth of coverage for the gyrA gene (positions 357412-360162 in FA19)
mtrR-CDEprom	Depth of coverage for the mtrR CDE promoter (positions 1110651-1110900 in FA19)
macA_and_prom	Depth of coverage for the macA gene and promoter (positions 1191001-1192230 in FA19)
norMprom	Depth of coverage for the norM promoter (positions 129494-129496 in FA19)
ftsX	Depth of coverage for the ftsX gene (positions 1707518-1708435 in FA19)
ponA	Depth of coverage for the ponA gene (positions 2078911-2081307 in FA19)
TetM-partial	Depth of coverage for partial reference of the tetracycline resistance determinant (tetM) gene
FA19_16SrRNA	Depth of coverage for 16S rRNA from FA19
porB	Depth of coverage for the porB gene (positions 1598044-1599027 in FA19)
23SrRNA	Depth of coverage for the 23S rRNA from FA19
penA	Depth of coverage for the penA gene (positions 1301424-1303169 in FA19)
blaTEM	Depth of coverage for extended spectrum beta-lactamase gene blaTEM
Nm_sodC	Depth of coverage for Neisseria meningitidis gene sodC
mefA	Depth of coverage for Macrolide efflux protein A (mefA)
parC	Depth of coverage for the parC gene (positions 993563-995866 in FA19)
acnB	Depth of coverage for the acnB gene (positions 963428-966013 in FA19)
rplD	Depth of coverage for the rplD gene (positions 1614768-1615388 in FA19)
rplV	Depth of coverage for the rplV gene (positions 1612996-1613325 in FA19)
mtrR	Depth of coverage for the mtrR gene (positions 1110901-1111533 in FA19)
mtrD	Depth of coverage for the mtrD gene (positions 1106197-1109400 in FA19)
rpsE	Depth of coverage for the rpsE gene (positions 1607799-1608317 in FA19)
rpsJ	Depth of coverage for the rpsJ gene (positions 1616818-1617129 in FA19)

< isolate >_amr_report.tsv - This report includes data from the blast, mlst, ngmaster, and variant reports for this isolate.
< isolate >_amr_vcf.tsv - This is a tab delimited version of a VCF file containing all of the variants found in the AMR associated genes for the isolate. This includes SNPs that are not reported in the variant report. If new positions of interest are added later, these files can be analyzed to see if they appeared in previous neissflow runs. This file is formatted exactly like the TAB file produced by Snippy, shown in the Snippy output section of this README.
< isolate >_mlst.tsv - Contains the MLST sequence type for the isolate
< isolate >_ngmaster.tsv - This report contains the NGMAST and NGSTAR sequence types for the isolate as well as the allele calls made to determine those types.

Metric	Description
Sample	Isolate ID
SCHEME	ngmaSTar
NG-MAST	Sequence Type for the NG-MAST typing scheme
NG-STAR	Sequence Type for the NG-STAR typing scheme
porB_NG-MAST	porB allele per the NG-MAST typing scheme
tbpB	tbpB allele
penA	penA allele
mtrR	mtrR allele
porB_NG-STAR	porB allele per the NG-STAR typing scheme
ponA	ponA allele
gyrA	gyrA allele
parC	parC allele
23S	23S allele

< isolate >_variant_report.tsv - This report includes either a mutation variant call or the FA19 nucleotide or amino acid for AMR loci as well as frequencies for certain calls, whether certain genes have an early stop, and the presence of certain horizontally transferred genes.

Metric	Description	Default	Associated Drug(s)
Sample	Isolate ID	-	-
23S-2611 base	FA19 or SNP base call at nucleotide 2599 in the 23S gene	C	AZM
23S-2611 freq	Frequency of FA19 or SNP base call at nucleotide 2599 in the 23S gene	1.0	AZM
23S-2059 base	FA19 or SNP base call at nucleotide 2047 in the 23S gene	A	AZM
23S-2059 freq	Frequency of FA19 or SNP base call at nucleotide 2047 in the 23S gene	1.0	AZM
23S-2058 base	FA19 or SNP base call at nucleotide 2046 in the 23S gene	A	AZM
23S-2058 freq	Frequency of FA19 or SNP base call at nucleotide 2046 in the 23S gene	1.0	AZM
mtrR promoter	The mtrR promoter contains the sequence AAAAA, and this field is to report a mutation at any one of these positions (but only one nucleotide is reported). The FA19 default is reported as A and if a mutation is found at any of the 5 positions, that nucleotide will be reported instead.	A	AZM/PEN/TET/CFM/CRO
mtr120 promoter	FA19 or SNP base call at nucleotide 1110770 relative to the reference for the mtr120 promoter	G	AZM/PEN/TET/CFM/CRO
mtrR -35	FA19 or SNP base call at nucleotide 1110836 relative to the reference for mtrR -35	G	AZM/PEN/TET/CFM/CRO
mtrR WHOP	FA19 or SNP base call at nucleotide 1110839 relative to the reference for mtrR WHOP	C	AZM/PEN/TET/CFM/CRO
mtrA binding site	FA19 or SNP base call at nucleotide 1110865 relative to the reference for the mtrA binding site	G	AZM/PEN/TET/CFM/CRO
mtrR aa39	FA19 or mutated variant amino acid at position 39 in the mtrR gene	A	AZM/PEN/TET/CFM/CRO
mtrR aa44	FA19 or mutated variant amino acid at position 44 in the mtrR gene	R	AZM/PEN/TET/CFM/CRO
mtrR aa45	FA19 or mutated variant amino acid at position 45 in the mtrR gene	G	AZM/PEN/TET/CFM/CRO
mtrR aa47	FA19 or mutated variant amino acid at position 47 in the mtrR gene	L	AZM/PEN/TET/CFM/CRO
mtrR aa79	FA19 or mutated variant amino acid at position 79 in the mtrR gene	D	AZM/PEN/TET/CFM/CRO
mtrR aa105	FA19 or mutated variant amino acid at position 105 in the mtrR gene	H	AZM/PEN/TET/CFM/CRO
mtrR premature stop	True if early stop is found in the mtrR gene	False	AZM/PEN/TET/CFM/CRO
penA aa311	FA19 or mutated variant amino acid at position 311 in the penA gene	A	PEN/CFM/CRO
penA aa312	FA19 or mutated variant amino acid at position 312 in the penA gene	I	PEN/CFM/CRO
penA aa316	FA19 or mutated variant amino acid at position 316 in the penA gene	V	PEN/CFM/CRO
penA aa483	FA19 or mutated variant amino acid at position 483 in the penA gene	T	PEN/CFM/CRO
penA aa501	FA19 or mutated variant amino acid at position 501 in the penA gene	A	PEN/CFM/CRO
penA aa504	FA19 or mutated variant amino acid at position 504 in the penA gene	L	PEN/CFM/CRO
penA aa512	FA19 or mutated variant amino acid at position 512 in the penA gene	N	PEN/CFM/CRO
penA aa516	FA19 or mutated variant amino acid at position 516 in the penA gene	A	PEN/CFM/CRO
penA aa542	FA19 or mutated variant amino acid at position 542 in the penA gene	G	PEN/CFM/CRO
penA aa545	FA19 or mutated variant amino acid at position 545 in the penA gene	G	PEN/CFM/CRO
penA aa549	FA19 or mutated variant amino acid at position 549 in the penA gene	A	PEN/CFM/CRO
penA aa551	FA19 or mutated variant amino acid at position 551 in the penA gene	P	PEN/CFM/CRO
penA D345ins	True if there is an aspartic acid (D) insertion at amino acid position 345 in the penA gene	False	PEN/CFM/CRO
ponA aa375	FA19 or mutated variant amino acid at position 375 in the ponA gene	A	PEN/CFM/CRO
ponA aa421	FA19 or mutated variant amino acid at position 421 in the ponA gene	L	PEN/CFM/CRO
pilQ full length	False if early stop is found in the pilQ gene	True	PEN/TET
pilQ aa341	FA19 or mutated variant amino acid at position 341 in the pilQ gene	S	PEN/TET
pilQ aa526	FA19 or mutated variant amino acid at position 526 in the pilQ gene	D	PEN/TET
pilQ aa648	FA19 or mutated variant amino acid at position 648 in the pilQ gene	S	PEN/TET
pilQ aa666	FA19 or mutated variant amino acid at position 666 in the pilQ gene	E	PEN/TET
gyrA aa91	FA19 or mutated variant amino acid at position 91 in the gyrA gene	S	CIP
gyrA aa92	FA19 or mutated variant amino acid at position 92 in the gyrA gene	A	CIP
gyrA aa95	FA19 or mutated variant amino acid at position 95 in the gyrA gene	D	CIP
parC aa86	FA19 or mutated variant amino acid at position 86 in the parC gene	D	CIP
parC aa87	FA19 or mutated variant amino acid at position 87 in the parC gene	S	CIP
parC aa88	FA19 or mutated variant amino acid at position 88 in the parC gene	S	CIP
parC aa91	FA19 or mutated variant amino acid at position 91 in the parC gene	E	CIP
blaTEM present	True if blaTEM is found with sufficient depth (>=2)	False	PEN
tetM present	True if tetM is found with sufficient depth (>=2)	False	TET
rpsJ aa57	FA19 or mutated variant amino acid at position 57 in the rpsJ gene	V	TET
ftsX aa31	FA19 or mutated variant amino acid at position 31 in the ftsX gene	T	CTX
rplD aa68	FA19 or mutated variant amino acid at position 68 in the rplD gene	G	AZM
rplD aa70	FA19 or mutated variant amino acid at position 70 in the rplD gene	G	AZM
rplV ins	True if any insertions are found in the rplV gene	False	ASK!!
macA aa99	FA19 or mutated variant amino acid at position 99 in the macA gene	N	AZM
mtrD aa42	FA19 or mutated variant amino acid at position 42 in the mtrD gene	T	AZM
mtrD aa46	FA19 or mutated variant amino acid at position 46 in the mtrD gene	H	AZM
mtrD aa48	FA19 or mutated variant amino acid at position 48 in the mtrD gene	I	AZM
mtrD aa101	FA19 or mutated variant amino acid at position 101 in the mtrD gene	N	AZM
mtrD aa174	FA19 or mutated variant amino acid at position 174 in the mtrD gene	R	AZM
mtrD aa612	FA19 or mutated variant amino acid at position 612 in the mtrD gene	F	AZM
mtrD aa662	FA19 or mutated variant amino acid at position 662 in the mtrD gene	V	AZM
mtrD aa669	FA19 or mutated variant amino acid at position 669 in the mtrD gene	E	AZM
mtrD aa714	FA19 or mutated variant amino acid at position 714 in the mtrD gene	R	AZM
mtrD aa821	FA19 or mutated variant amino acid at position 821 in the mtrD gene	S	AZM
mtrD aa823	FA19 or mutated variant amino acid at position 823 in the mtrD gene	K	AZM
mtrD aa826	FA19 or mutated variant amino acid at position 826 in the mtrD gene	A	AZM
macA promoter	FA19 or SNP base call at nucleotide 1192227 relative to the reference for the macA promoter	A	AZM
norM promoter	FA19 or SNP base call at nucleotide 129495 relative to the reference for the norM promoter	G	CIP
ermB present	True if ermB is found with sufficient depth (>=2)	False	AZM
ermC present	True if ermC is found with sufficient depth (>=2)	False	AZM
ermF present	True if ermF is found with sufficient depth (>=2)	False	AZM
mefA present	True if mefA is found with sufficient depth (>=2)	False	AZM
gyrB aa429	FA19 or mutated variant amino acid at position 429 in the gyrB gene	D	ETX0914
gyrB aa450	FA19 or mutated variant amino acid at position 450 in the gyrB gene	K	ETX0914
acnB aa348	FA19 or mutated variant amino acid at position 348 in the acnB gene	G	CFM/CRO
acnB aa371	FA19 or mutated variant amino acid at position 371 in the acnB gene	Q	CFM/CRO
16S-1053 base	FA19 or SNP base call at nucleotide 1053 in the 16S gene	G	ERA
16S-1053 freq	Frequency of FA19 or SNP base call at nucleotide 1053 in the 16S gene	1.0	ERA
16S-1186 base	FA19 or SNP base call at nucleotide 1186 in the 16S gene	G	ERA
16S-1186 freq	Frequency of FA19 or SNP base call at nucleotide 1186 in the 16S gene	1.0	ERA
rpsE aa24	FA19 or mutated variant amino acid at position 24 in the rpsE gene	T	SPC
rpsE aa28	FA19 or mutated variant amino acid at position 28 in the rpsE gene	K	SPC
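
Because every column has a documented FA19 default, a quick way to read a variant report row is to list only the columns that differ from their default. The sketch below does this for a handful of columns from the table above. The example row is invented, and the sketch assumes each cell holds just the base, residue, or True/False value, which should be checked against a real report.

```python
# Subset of the defaults listed in the table above.
FA19_DEFAULTS = {
    "gyrA aa91": "S",
    "gyrA aa95": "D",
    "parC aa87": "S",
    "penA aa501": "A",
    "tetM present": "False",
}


def non_default_calls(row, defaults=FA19_DEFAULTS):
    """Return {column: (default, observed)} for columns that differ."""
    return {
        column: (default, row[column])
        for column, default in defaults.items()
        if column in row and row[column] != default
    }


# Invented report row for illustration.
example_row = {
    "Sample": "isolate_A",
    "gyrA aa91": "F",
    "gyrA aa95": "D",
    "parC aa87": "R",
    "penA aa501": "A",
    "tetM present": "False",
}

changes = non_default_calls(example_row)
for column, (default, observed) in changes.items():
    print(f"{column}: {default} -> {observed}")
```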

phylogeny - output from the Phylogeny Subworkflow

The phylogeny directory contains the snippy_core, gubbins, snp_dists, RAxML, and Gotree subdirectories. It also contains the phylogeny_qc_report.tsv.

snippy_core - The snippy_core directory contains the output from running the entire collection of isolates against the same reference, building a core genome, and then generating core alignment files using that core genome. This information can be used for building a phylogenetic tree. The core genome consists of the genomic positions that are present in all of the isolates in the collection. The files found in the snippy_core directory are detailed in the table below:

File	Description
core.aln	A core SNP alignment in the --aformat format (default FASTA)
core.full.aln	A whole genome SNP alignment (includes invariant sites)
core.tab	Tab-separated columnar list of core SNP sites with alleles but NO annotations
core.txt	Tab-separated columnar list of alignment/core-size statistics
core.ref.fa	FASTA version/copy of the --ref

The core.txt file contains columns with the metrics seen in the following table with each of the isolates and the reference as rows.

Metric	Description
ID	Sample or reference name
LENGTH	Total length of the genome measured in bases
ALIGNED	Number of bases that aligned with the reference
UNALIGNED	Number of bases that did not align with the reference
VARIANT	Total number of variants relative to the reference
HET	Number of heterozygotes
MASKED	Number of masked bases
LOWCOV	Number of bases with low coverage

gubbins - The gubbins subdirectory contains output from Gubbins, a tool used to mark regions as recombinations and construct a phylogeny based on mutations outside of those recombination regions. The Phylip file from this output is used during phylogenetic analysis. This directory also contains _partition_data.txt, which contains the monomorphic counts for the unambiguous nucleotides, as well as _partition.txt, which references _partition_data.txt and is used for the ascertainment bias correction in the RAxML step.

The following table includes all of the files produced by Gubbins that can be found in the gubbins directory, as detailed in the Gubbins manual:

Extension	Description
.recombination_predictions.embl	Recombination predictions in EMBL file format
.recombination_predictions.gff	Recombination predictions in GFF format
.branch_base_reconstruction.embl	Base substitution reconstruction in EMBL format
.summary_of_snp_distribution.vcf	Per branch reporting of the base substitutions inside and outside recombination events
.filtered_polymorphic_sites.fasta	FASTA format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration
.filtered_polymorphic_sites.phylip	Phylip format alignment of filtered polymorphic sites used to generate the phylogeny in the final iteration
.final_tree.tre	This file contains the final phylogeny in Newick format; branch lengths are in point mutations
.node_labelled.final_tree.tre	Final phylogenetic tree in Newick format but with the internal node labels; branch lengths are in point mutations
.log	A log file specifying the software used at each step of the analysis, with accompanying citations
.per_branch_statistics.csv	File containing summary statistics for each branch in the tree in comma delimited format

The .per_branch_statistics.csv file contains columns with the metrics seen in the following table with rows for each sample, per the Gubbins manual.

Metric	Description
Node	Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered
Total SNPs	Total number of base substitutions reconstructed onto the branch
Number of SNPs Inside Recombinations	Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (r)
Number of SNPs Outside Recombinations	Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (m)
Number of Recombination Blocks	Total number of recombination blocks reconstructed onto the branch
Bases in Recombinations	Total length of all recombination events reconstructed onto the branch
Cumulative Bases in Recombinations	Total number of bases in the alignment affected by recombination on this branch and its ancestors
*r/m*	The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch
*rho/theta*	The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation
Genome Length	The total number of aligned bases between the ancestral and descendant nodes for the branch excluding any missing data or gaps in either
Bases in Clonal Frame	The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation)
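
The two ratio columns follow from the count columns in the table: r/m divides the SNPs inside recombinations (r) by the SNPs outside recombinations (m), and rho/theta divides the number of recombination blocks by the SNPs outside recombinations. A minimal sketch is shown below. Returning None when a branch has no point mutations is an assumption of this example and may not match what Gubbins writes in that case.

```python
def r_over_m(snps_inside, snps_outside):
    """SNPs inside recombinations (r) divided by SNPs outside (m)."""
    if snps_outside == 0:
        return None  # assumption: ratio undefined without point mutations
    return snps_inside / snps_outside


def rho_over_theta(recombination_blocks, snps_outside):
    """Recombination events divided by point mutations on the branch."""
    if snps_outside == 0:
        return None  # same assumption as above
    return recombination_blocks / snps_outside


# Invented branch values for illustration.
print(r_over_m(30, 10))
print(rho_over_theta(2, 10))
```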

snp_dists - The snp_dists directory contains all of the output from snp-dists as well as any potential outbreak clusters detected in the sample set.

This output includes the following files:

isolate_clusters.txt - Each line describes one distinct outbreak cluster. The line lists the name of each isolate in the cluster, and each name is followed by a list of the SNP distances between that isolate and every isolate in the cluster. Clusters are identified by first constructing a graph in which the isolates are nodes and an edge connects any two isolates with a SNP distance of < 20 (the default; this value can be changed). A disjoint set union (DSU) analysis is then performed on this graph to identify its connected components, which are reported as the potential outbreak clusters. Because clusters are connected components, two isolates in the same cluster may be separated by more SNPs than the threshold, provided they are linked through intermediate isolates.
Example cluster:
GCWGS-27036-WA-M5130-240405_S28_L001 [ 0 22 24 14 20] GCWGS-27038-WA-M5130-240405_S30_L001 [22 0 30 16 14] GCWGS-27039-WA-M5130-240405_S31_L001 [24 30 0 21 18] GCWGS-27046-WA-M5130-240405_S38_L001 [14 16 21 0 16] GCWGS-27419-WA-M5130-240417_S68_L001 [20 14 18 16 0]
matrix.tsv - Contains the pairwise SNP distances between all isolates in the sample set in matrix format.
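
The clustering step described for isolate_clusters.txt can be sketched as follows. This is a minimal illustration of DSU-based connected components over a pairwise SNP distance matrix, not the pipeline's own code: the function names are hypothetical, and dropping single-isolate components is an assumption. The distances are taken from the example cluster above.

```python
from itertools import combinations


def find(parent, i):
    """Return the root of i's set, shortening the path as it goes."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def cluster_isolates(names, dist, threshold=20):
    """Group isolates into connected components of the graph whose edges
    join isolates with a SNP distance strictly below the threshold."""
    parent = list(range(len(names)))
    for a, b in combinations(range(len(names)), 2):
        if dist[a][b] < threshold:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a  # union the two sets
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(parent, i), []).append(name)
    # Assumption for this sketch: a lone isolate is not a cluster.
    return [members for members in groups.values() if len(members) > 1]


names = [
    "GCWGS-27036-WA-M5130-240405_S28_L001",
    "GCWGS-27038-WA-M5130-240405_S30_L001",
    "GCWGS-27039-WA-M5130-240405_S31_L001",
    "GCWGS-27046-WA-M5130-240405_S38_L001",
    "GCWGS-27419-WA-M5130-240417_S68_L001",
]
dist = [
    [0, 22, 24, 14, 20],
    [22, 0, 30, 16, 14],
    [24, 30, 0, 21, 18],
    [14, 16, 21, 0, 16],
    [20, 14, 18, 16, 0],
]
clusters = cluster_isolates(names, dist)
print(len(clusters), len(clusters[0]))
```

With the default threshold all five isolates fall into one cluster, even though some pairs are 22 to 30 SNPs apart, because each isolate is linked to the others through a chain of pairs that are fewer than 20 SNPs apart.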

RAxML - The RAxML directory contains the output produced by RAxML. The RAxML_bestTree.core_ file is the best tree generated during the phylogenetic analysis, in Newick format, and is the tree visualized by Gotree.

The following table includes all of the files produced by RAxML that can be found in the RAxML directory, as detailed in the RAxML manual:

Prefix	Description
RAxML_bestTree.	Contains the best-scoring ML tree of a thorough ML analysis
RAxML_bipartitions.	BS support values on the best tree found during the ML search
RAxML_bipartitionsBranchLabels.	Contains the same information as the file above, but support values are correctly displayed as Newick branch labels and not node labels! Support values always refer to branches/splits of trees and never to nodes of the tree.
RAxML_bootstrap.	All final bootstrapped trees
RAxML_info	Contains information about the model and algorithm used and how RAxML was called. The final GAMMA-based likelihood(s) as well as the alpha shape parameter(s) are printed to this file. In addition, if the rearrangement setting was determined automatically (-i has not been used), the rearrangement setting found by the program will be indicated. This is the most important output file because it tells you what RAxML did and is always written irrespective of the command line option. In addition, it contains information about all other output files that were written by your run.

Gotree - The Gotree directory contains the output from Gotree, including the midrooted version of the best tree output by RAxML and bestTree.png, which is an image of the midrooted phylogenetic tree. The clusters identified in the outbreak detection step are color coded in this tree, and the cluster_annotation.tsv file contains this annotation information. This directory also contains phylogeny_report.html, which can be opened in your browser. This report contains the image of the phylogenetic tree as well as the outbreak clusters from isolate_clusters.txt.

phylogeny_qc_report.tsv - This file contains the QC report for the phylogenetic analysis. The table below contains an example of this report.

QC Parameter	Accepted Value	Actual Value	Pass/Fail
Num_Samples_Aligned	7	7	pass
Match_Ref_Length	true	true	pass
Num_Lines_w_Invalid_Nuc	0	0	pass
All_Present_in_Tree	true	true	pass
Num_Outliers	-	0	NA
Core_Mono_Nuc_bp_Count	-	1892518	NA

The following table contains the descriptions of these parameters:

Metric	Description
Num_Samples_Aligned	Number of samples in the .full.aln FASTA alignment produced by snippy-core. A run will pass if this value (reported under Actual Value) matches the number of samples given to the pipeline (reported under Accepted Value)
Match_Ref_Length	“true” if the sequences in the .full.aln FASTA file produced by snippy-core are the same length as the reference. This being “true” results in a pass.
Num_Lines_w_Invalid_Nuc	Number of lines in the sequences in the .full.aln FASTA file produced by snippy-core that contain invalid nucleotides. There being 0 lines containing invalid nucleotides results in a pass.
All_Present_in_Tree	“true” if all of the samples in the run are present in the Newick file produced by RAxML. This being “true” results in a pass.
Num_Outliers	Reports the number of statistical outliers in the resulting tree. A statistical outlier is a sample whose branch length is greater than the mean branch length + 2*(the standard deviation of branch lengths). There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails. If you perform a lineage-specific (same MLST or coregeno group) phylogenetic analysis, you should not have any outliers. If there are outliers, we suggest reviewing your isolate list and the phylogenetic visuals in order to assess whether the isolates with unusually long branch lengths belong to the same lineage.
Core_Mono_Nuc_bp_Count	Reports the size of the core alignment, excluding any portions that are not perfectly aligned between the samples in the set. There is no “Accepted Value” to compare this against, and this value does not determine whether a run passes or fails.
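
The Num_Outliers rule can be illustrated with a short sketch. It assumes that each sample's branch length has already been extracted from the tree and that the population standard deviation is used; the pipeline may compute either differently. The function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(branch_lengths):
    """Return the samples whose branch length exceeds
    mean + 2 * (standard deviation) of all branch lengths.

    branch_lengths maps sample name -> branch length.
    """
    values = list(branch_lengths.values())
    # Assumption: population standard deviation (pstdev), not sample stdev.
    cutoff = mean(values) + 2 * pstdev(values)
    return [name for name, length in branch_lengths.items() if length > cutoff]


# Hypothetical branch lengths: seven similar samples and one long branch.
lengths = {
    "s1": 0.010, "s2": 0.012, "s3": 0.011, "s4": 0.009,
    "s5": 0.010, "s6": 0.011, "s7": 0.010, "s8": 0.500,
}
outliers = find_outliers(lengths)
print(len(outliers), outliers)
```

Note that with very few samples this rule can rarely flag anything, since a single long branch also inflates the mean and standard deviation it is compared against.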

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
