Dada2 MergePairs : consensus between merging & concatenating reads #803

Merged: 16 commits, Jan 27, 2025
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -7,8 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- []() - New parameters introduced for `--concatenate_reads consensus`:

| **Parameter** | **Description** | **Default Value** |
| ------------------------- | ----------------------------------------------------------------------------------------- | ----------------- |
| **asv_match** | The score assigned for each matching base pair during sequence alignment. | 1 |
| **asv_mismatch** | The penalty score assigned for each mismatched base pair during sequence alignment. | 0 |
| **asv_gap** | The penalty score assigned for each gap introduced during sequence alignment. | -64 |
| **asv_minoverlap** | The minimum number of overlapping base pairs required to merge forward and reverse reads. | 12 |
| **asv_maxmismatch** | The maximum number of mismatches allowed within the overlapping region for merging reads. | 0 |
| **asv_percentile_cutoff** | The percentile cutoff determining the minimum observed overlap in the dataset. | 0.001 |
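In plain terms, the consensus strategy merges read pairs that overlap well and falls back to concatenation only for pairs whose observed overlap is below a data-driven cutoff. The sketch below is a hypothetical Python re-implementation of that decision rule (function names are mine; the pipeline implements this in R inside DADA2_DENOISING, and R's `quantile()` interpolates where this sketch uses nearest-rank):

```python
import math

def overlap_cutoff(overlaps, percentile_cutoff=0.001):
    # nearest-rank percentile of overlap lengths observed in accepted merges;
    # a simplification of R's interpolating quantile()
    ranked = sorted(overlaps)
    idx = max(0, math.ceil(percentile_cutoff * len(ranked)) - 1)
    return ranked[idx]

def resolve_pair(merged_ok, overlap_len, cutoff):
    # merged pairs are kept as-is; failed pairs that overlap less than the
    # cutoff are concatenated; failed pairs with ample overlap are rejected,
    # since they overlapped enough that a real merge should have succeeded
    if merged_ok:
        return "merge"
    return "concatenate" if overlap_len < cutoff else "reject"

observed = [18, 20, 25, 30, 45, 60]      # overlaps of successfully merged pairs
cutoff = overlap_cutoff(observed)        # 18 for this toy example
print(resolve_pair(False, 4, cutoff))    # concatenate: too little overlap to merge
print(resolve_pair(False, 30, cutoff))   # reject: overlap was sufficient
```

This is why `asv_percentile_cutoff` is described as "determining the minimum observed overlap in the dataset": it anchors the concatenate-or-reject boundary to what real merged pairs actually look like.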

### `Changed`

- []() - Changed DADA2_DENOISING to support a new method named "consensus", enabled by setting `--concatenate_reads consensus`.

### `Fixed`

### `Dependencies`
9 changes: 7 additions & 2 deletions conf/modules.config
@@ -232,8 +232,13 @@ process {
].join(',').replaceAll('(,)*$', "")
// setting from https://rdrr.io/bioc/dada2/man/mergePairs.html & https://rdrr.io/bioc/dada2/man/nwalign.html & match = getDadaOpt("MATCH"), mismatch = getDadaOpt("MISMATCH"), gap = getDadaOpt("GAP_PENALTY"), missing from the list below is: 'band = -1'
ext.args2 = [
'minOverlap = 12, maxMismatch = 0, returnRejects = FALSE, propagateCol = character(0), trimOverhang = FALSE, match = 1, mismatch = -64, gap = -64, homo_gap = NULL, endsfree = TRUE, vec = FALSE',
params.concatenate_reads ? "justConcatenate = TRUE" : "justConcatenate = FALSE"
"minOverlap = ${params.asv_minoverlap ?: 12}, maxMismatch = ${params.asv_maxmismatch ?: 0}, propagateCol = character(0), gap = ${params.asv_gap ?: -64}, homo_gap = NULL, endsfree = TRUE, vec = FALSE",
params.asv_concatenate_reads == "consensus" ?
"returnRejects = TRUE, match = ${params.asv_match ?: 5}, mismatch = ${params.asv_mismatch ?: -6}" :
"justConcatenate = ${params.asv_concatenate_reads == 'concatenate' ? 'TRUE' : 'FALSE'}, returnRejects = FALSE, match = ${params.asv_match ?: 1}, mismatch = ${params.asv_mismatch ?: -64}"
].join(',').replaceAll('(,)*$', "")
ext.args3 = [
"quantile = ${params.asv_percentile_cutoff ?: 0.001}"
].join(',').replaceAll('(,)*$', "")
publishDir = [
[
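The nested ternary in `ext.args2` above is dense; this hypothetical Python mirror (function name and defaults are mine, not the pipeline's) shows which `mergePairs()` argument string each mode produces, including the trailing-comma cleanup performed by `.join(',').replaceAll('(,)*$', "")`:

```python
import re

def build_args2(mode, minoverlap=12, maxmismatch=0, gap=-64, match=1, mismatch=-64):
    # shared mergePairs()/nwalign() options, mirroring ext.args2's first element
    common = (f"minOverlap = {minoverlap}, maxMismatch = {maxmismatch}, "
              f"propagateCol = character(0), gap = {gap}, homo_gap = NULL, "
              f"endsfree = TRUE, vec = FALSE")
    if mode == "consensus":
        # returnRejects = TRUE keeps failed merges so they can be concatenated later
        extra = f"returnRejects = TRUE, match = {match}, mismatch = {mismatch}"
    else:
        just = "TRUE" if mode == "concatenate" else "FALSE"
        extra = (f"justConcatenate = {just}, returnRejects = FALSE, "
                 f"match = {match}, mismatch = {mismatch}")
    # equivalent of Groovy's .join(',').replaceAll('(,)*$', ""): strip trailing commas
    return re.sub(r",+$", "", ",".join([common, extra]))

print(build_args2("consensus"))
```

The key design choice the diff encodes: only the consensus mode sets `returnRejects = TRUE`, because the downstream R code needs the rejected rows in order to decide which pairs to replace with their concatenated form.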
52 changes: 47 additions & 5 deletions modules/local/dada2_denoising.nf
@@ -26,13 +26,14 @@ process DADA2_DENOISING {
def prefix = task.ext.prefix ?: "prefix"
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''
def args3 = task.ext.args3 ?: '0.001'
if (!meta.single_end) {
"""
#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(dada2))

errF = readRDS("${errormodel[0]}")
errR = readRDS("${errormodel[1]}")
errF <- readRDS("${errormodel[0]}")
errR <- readRDS("${errormodel[1]}")

filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE), method = "radix")
filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE), method = "radix")
@@ -45,9 +46,50 @@
saveRDS(dadaRs, "${prefix}_2.dada.rds")
sink(file = NULL)

#make table
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, verbose=TRUE)
saveRDS(mergers, "${prefix}.mergers.rds")
# merge
if ("${params.asv_concatenate_reads}" == "consensus") {
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, justConcatenate = FALSE, verbose=TRUE)
concats <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, justConcatenate = TRUE, verbose=TRUE)

# in case there is only one sample in the entire run
if (is.data.frame(mergers)) {
mergers <- list(sample = mergers)
concats <- list(sample = concats)
}

# define the overlap threshold used to decide whether to concatenate
min_overlap_obs <- lapply(mergers, function(X) {
mergers_accepted <- X[["accept"]]
if (sum(mergers_accepted) > 0) {
min_overlap_obs <- X[["nmatch"]][mergers_accepted] + X[["nmismatch"]][mergers_accepted]
rep(min_overlap_obs, X[["abundance"]][mergers_accepted])
} else {
NA
}
})

min_overlap_obs <- Reduce(c, min_overlap_obs)
min_overlap_obs <- min_overlap_obs[!is.na(min_overlap_obs)]
min_overlap_obs <- quantile(min_overlap_obs, $args3)

for (x in names(mergers)) {
to_concat <- !mergers[[x]][["accept"]] & (mergers[[x]][["nmismatch"]] + mergers[[x]][["nmatch"]]) < min_overlap_obs

if (sum(to_concat) > 0) {
mergers[[x]][to_concat, ] <- concats[[x]][to_concat, ]
# filter out unaccepted, non-concatenated sequences
mergers[[x]] <- mergers[[x]][mergers[[x]][["accept"]], ]
}

}

} else {
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, $args2, verbose=TRUE)
}

saveRDS(mergers, "${prefix}.mergers.rds")

# make table
seqtab <- makeSequenceTable(mergers)
saveRDS(seqtab, "${prefix}.seqtab.rds")

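One subtlety in the R block above: each accepted merge's observed overlap (`nmatch + nmismatch`) is repeated by its `abundance` (via `rep(...)`) before the quantile is taken, so abundant sequence variants dominate the cutoff. A hypothetical Python sketch of that weighting (data-frame rows mocked as dicts; nearest-rank quantile used for simplicity, whereas R's `quantile()` interpolates):

```python
import math

def weighted_overlaps(rows):
    # rows mimic one sample's mergePairs() data frame: each accepted row
    # contributes its overlap length once per read (abundance-weighted),
    # matching rep(min_overlap_obs, abundance) in the R code
    obs = []
    for r in rows:
        if r["accept"]:
            obs.extend([r["nmatch"] + r["nmismatch"]] * r["abundance"])
    return obs

def nearest_rank_quantile(values, q):
    ranked = sorted(values)
    return ranked[max(0, math.ceil(q * len(ranked)) - 1)]

sample = [
    {"accept": True,  "nmatch": 28, "nmismatch": 0, "abundance": 100},
    {"accept": True,  "nmatch": 12, "nmismatch": 1, "abundance": 2},
    {"accept": False, "nmatch": 3,  "nmismatch": 0, "abundance": 50},  # rejected: ignored
]
obs = weighted_overlaps(sample)          # 100 copies of 28 plus 2 copies of 13
print(nearest_rank_quantile(obs, 0.001)) # the low tail of observed overlaps: 13
```

Without the abundance weighting, a handful of rare variants with short overlaps would drag the cutoff down and cause well-overlapping failed merges to be concatenated rather than rejected.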
8 changes: 7 additions & 1 deletion nextflow.config
@@ -46,7 +46,13 @@ params {
sample_inference = "independent"
illumina_novaseq = false
illumina_pe_its = false
concatenate_reads = false
asv_concatenate_reads = false
asv_minoverlap = 12
asv_maxmismatch = 0
asv_gap = -64
asv_match = 1
asv_mismatch = -64
asv_percentile_cutoff = 0.001
cut_its = "none"
its_partial = 0
picrust = false
98 changes: 86 additions & 12 deletions nextflow_schema.json
@@ -82,7 +82,9 @@
"pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$"
}
},
"required": ["outdir"],
"required": [
"outdir"
],
"fa_icon": "fas fa-terminal"
},
"sequencing_input": {
@@ -259,12 +261,57 @@
"default": "independent",
"help_text": "If samples are treated independent (lowest sensitivity and lowest resources), pooled (highest sensitivity and resources) or pseudo-pooled (balance between required resources and sensitivity).",
"description": "Mode of sample inference: \"independent\", \"pooled\" or \"pseudo\"",
"enum": ["independent", "pooled", "pseudo"]
"enum": [
"independent",
"pooled",
"pseudo"
]
},
"concatenate_reads": {
"type": "boolean",
"description": "Not recommended: When paired end reads are not sufficiently overlapping for merging.",
"help_text": "This parameter specifies that paired-end reads are not merged after denoising but concatenated (separated by 10 N's). This is advantageous when an amplicon was sequenced that is too long for merging (i.e. bad experimental design). This is an alternative to only analyzing the forward or reverse read in case of non-overlapping paired-end sequencing data.\n\n**This parameter is not recommended! Only if all other options fail.**"
"asv_concatenate_reads": {
"type": "string",
"default": "false",
"description": "Strategy to merge paired-end reads. When paired-end reads are not sufficiently overlapping for merging, you can use \"concatenate\" (not recommended). When you have a mix of overlapping and non-overlapping reads, use \"consensus\".",
"help_text": "This parameter specifies that paired-end reads are not merged after denoising but concatenated (separated by 10 N's). This is advantageous when an amplicon was sequenced that is too long for merging (i.e. bad experimental design). This is an alternative to only analyzing the forward or reverse read in case of non-overlapping paired-end sequencing data.\n\n**This parameter is not recommended! Only if all other options fail.**",
"enum": [
"false",
"true",
"consensus"
]
},
"asv_match": {
"type": "integer",
"default": 1,
"description": "The score assigned for each matching base pair during sequence alignment.",
"help_text": "This parameter specifies the numerical value added to the alignment score for every pair of bases that match between the forward and reverse reads. A higher value increases the preference for alignments with more matching bases."
},
"asv_mismatch": {
"type": "integer",
"default": 0,
"description": "The penalty score assigned for each mismatched base pair during sequence alignment.",
"help_text": "This parameter defines the numerical penalty subtracted from the alignment score for each base pair mismatch between the forward and reverse reads. A higher penalty reduces the likelihood of accepting alignments with mismatches."
},
"asv_gap": {
"type": "integer",
"default": -64,
"description": "The penalty score assigned for each gap introduced during sequence alignment.",
"help_text": "This parameter sets the numerical penalty subtracted from the alignment score for each gap (insertion or deletion) introduced to align the forward and reverse reads. A higher penalty discourages alignments that require gaps."
},
"asv_minoverlap": {
"type": "integer",
"default": 12,
"description": "The minimum number of overlapping base pairs required to merge forward and reverse reads.",
"help_text": "This parameter specifies the smallest number of consecutive base pairs that must overlap between the forward and reverse reads for them to be merged. Ensuring sufficient overlap is crucial for accurate merging."
},
"asv_maxmismatch": {
"type": "integer",
"default": 0,
"description": "The maximum number of mismatches allowed within the overlapping region for merging reads.",
"help_text": "This parameter defines the highest number of mismatched base pairs permitted in the overlap region between forward and reverse reads for a successful merge. Setting this value helps control the stringency of read merging, balancing between sensitivity and accuracy."
},
"asv_percentile_cutoff": {
"type": "number",
"default": 0.001,
"description": "The percentile used to determine a stringent cutoff which will correspond to the minimum observed overlap in the dataset. This ensures that only read pairs with high overlap are merged into consensus sequences. Those with insufficient overlap are concatenated."
}
},
"fa_icon": "fas fa-braille"
@@ -438,7 +485,10 @@
"type": "string",
"description": "Method used for alignment, \"hmmer\" or \"mafft\"",
"default": "hmmer",
"enum": ["hmmer", "mafft"]
"enum": [
"hmmer",
"mafft"
]
},
"pplace_taxonomy": {
"type": "string",
@@ -454,7 +504,13 @@
"type": "string",
"help_text": "Choose any of the supported databases, and optionally also specify the version. Database and version are separated by an equal sign (`=`, e.g. `silva=138`) . This will download the desired database and initiate taxonomic classification with QIIME2 and the chosen database.\n\nIf both, `--dada_ref_taxonomy` and `--qiime_ref_taxonomy` are used, DADA2 classification will be used for downstream analysis.\n\nThe following databases are supported:\n- SILVA ribosomal RNA gene database project - 16S rRNA\n- UNITE - eukaryotic nuclear ribosomal ITS region - ITS\n- Greengenes (only testing!)\n\nGenerally, using `silva`, `unite-fungi`, or `unite-alleuk` will select the most recent supported version. For testing purposes, the tiny database `greengenes85` (dereplicated at 85% sequence similarity) is available. For details on what values are valid, please either use an invalid value such as `x` (causing the pipeline to send an error message with all valid values) or see `conf/ref_databases.config`.",
"description": "Name of supported database, and optionally also version number",
"enum": ["silva=138", "silva", "greengenes85", "greengenes2", "greengenes2=2022.10"]
"enum": [
"silva=138",
"silva",
"greengenes85",
"greengenes2",
"greengenes2=2022.10"
]
},
"qiime_ref_tax_custom": {
"type": "string",
@@ -529,7 +585,12 @@
"help_text": "If data is long read ITS sequences, that need to be cut to ITS region (full ITS, only ITS1, or only ITS2) for taxonomy assignment.",
"description": "Part of ITS region to use for taxonomy assignment: \"full\", \"its1\", or \"its2\"",
"default": "none",
"enum": ["none", "full", "its1", "its2"]
"enum": [
"none",
"full",
"its1",
"its2"
]
},
"its_partial": {
"type": "integer",
@@ -549,7 +610,13 @@
"type": "string",
"help_text": "",
"description": "Name of supported database, and optionally also version number",
"enum": ["silva", "silva=128", "greengenes", "greengenes=13_8", "greengenes88"]
"enum": [
"silva",
"silva=128",
"greengenes",
"greengenes=13_8",
"greengenes88"
]
},
"sidle_ref_tax_custom": {
"type": "string",
@@ -822,7 +889,14 @@
"description": "Method used to save pipeline results to output directory.",
"help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.",
"fa_icon": "fas fa-copy",
"enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"],
"enum": [
"symlink",
"rellink",
"link",
"copy",
"copyNoFollow",
"move"
],
"hidden": true
},
"email_on_fail": {
@@ -996,4 +1070,4 @@
"$ref": "#/$defs/institutional_config_options"
}
]
}
}