VSEARCH 2.11.0: Added ability to filter paired reads + xee option

torognes · Feb 13, 2019 · 97c8924
1 parent 6f6f30e
commit 97c8924
Showing 94 changed files with 818 additions and 466 deletions.
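
As a rough sketch of the paired-read filtering named in the commit title, the command below combines `--fastx_filter` with `--reverse` and the new `_rev` output options described in the man page changes of this commit. The file names are hypothetical and the `--fastq_maxee 1.0` threshold is only an illustrative choice. The command is built as a string so it can be inspected, and it runs only if a `vsearch` binary and both input files are present.

```sh
# Hypothetical input files: forward and reverse reads in the same order.
fwd=sample_R1.fastq
rev=sample_R2.fastq

# Options as named in the updated man page; the 1.0 threshold is illustrative.
paired_cmd="vsearch --fastx_filter $fwd --reverse $rev \
--fastq_maxee 1.0 \
--fastqout kept_R1.fastq --fastqout_rev kept_R2.fastq \
--fastqout_discarded dropped_R1.fastq \
--fastqout_discarded_rev dropped_R2.fastq"

# Print the command for inspection.
echo "$paired_cmd"

# Run it only when vsearch and both input files are available.
if command -v vsearch >/dev/null 2>&1 && [ -f "$fwd" ] && [ -f "$rev" ]; then
  # Word splitting is intended here; the file names contain no spaces.
  $paired_cmd
fi
```

According to the updated man page, both reads of a pair must pass the filter, so the kept forward and reverse files should stay in step with each other.
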
diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
 
 ## Getting Help
 
-If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
+If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
 
 ## Example
 
@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 **Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
 
 ```
-wget https://github.com/torognes/vsearch/archive/v2.10.4.tar.gz
-tar xzf v2.10.4.tar.gz
-cd vsearch-2.10.4
+wget https://github.com/torognes/vsearch/archive/v2.11.0.tar.gz
+tar xzf v2.11.0.tar.gz
+cd vsearch-2.11.0
 ./autogen.sh
 ./configure
 make
@@ -68,43 +68,43 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
 Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-x86_64.tar.gz
-tar xzf vsearch-2.10.4-linux-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch-2.11.0-linux-x86_64.tar.gz
+tar xzf vsearch-2.11.0-linux-x86_64.tar.gz
 ```
 
 Or these commands if you are using a Linux ppc64le system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-ppc64le.tar.gz
-tar xzf vsearch-2.10.4-linux-ppc64le.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch-2.11.0-linux-ppc64le.tar.gz
+tar xzf vsearch-2.11.0-linux-ppc64le.tar.gz
 ```
 
 Or these commands if you are using a Linux aarch64 system:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-linux-aarch64.tar.gz
-tar xzf vsearch-2.10.4-linux-aarch64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch-2.11.0-linux-aarch64.tar.gz
+tar xzf vsearch-2.11.0-linux-aarch64.tar.gz
 ```
 
 Or these commands if you are using a Mac:
 
 ```sh
-wget https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-macos-x86_64.tar.gz
-tar xzf vsearch-2.10.4-macos-x86_64.tar.gz
+wget https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch-2.11.0-macos-x86_64.tar.gz
+tar xzf vsearch-2.11.0-macos-x86_64.tar.gz
 ```
 
 Or if you are using Windows, download and extract (unzip) the contents of this file:
 
 ```
-https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch-2.10.4-win-x86_64.zip
+https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch-2.11.0-win-x86_64.zip
 ```
 
-Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.4-linux-x86_64` or `vsearch-2.10.4-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
+Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.11.0-linux-x86_64` or `vsearch-2.11.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
 
-Windows: You will now have the binary distribution in a folder called `vsearch-2.10.4-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
+Windows: You will now have the binary distribution in a folder called `vsearch-2.11.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
 
 
-**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.4/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
+**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.11.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
 
 
 ## Plugins, packages, and wrappers
@@ -176,11 +176,11 @@ The code is written in C++ but most of it is actually mostly C with some C++ syn
 
 File | Description
 ---|---
-**abundance.cc** | Code for extracting and printing abundance information from FASTA headers
 **align.cc** | New Needleman-Wunsch global alignment, serial. Only for testing.
 **align_simd.cc** | SIMD parallel global alignment of 1 query with 8 database sequences
 **allpairs.cc** | All-vs-all optimal global pairwise alignment (no heuristics)
 **arch.cc** | Architecture specific code (Mac/Linux)
+**attributes.cc** | Extraction and printing of attributes in FASTA headers
 **bitmap.cc** | Implementation of bitmaps
 **chimera.cc** | Chimera detection
 **city.cc** | CityHash code

diff --git a/configure.ac b/configure.ac
@@ -2,7 +2,7 @@
 # Process this file with autoconf to produce a configure script.
 
 AC_PREREQ([2.63])
-AC_INIT([vsearch], [2.10.4], [[email protected]])
+AC_INIT([vsearch], [2.11.0], [[email protected]])
 AC_CANONICAL_TARGET
 AM_INIT_AUTOMAKE([subdir-objects])
 AC_LANG([C++])

diff --git a/man/vsearch.1 b/man/vsearch.1
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "January 10, 2019" "version 2.10.4" "USER COMMANDS"
+.TH vsearch 1 "February 13, 2019" "version 2.11.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication and
@@ -51,9 +51,9 @@ FASTA/FASTQ file processing:
 \fBvsearch\fR (\-\-fastq_eestats | \-\-fastq_eestats2) \fIfastqfile\fR
 \-\-output \fIoutputfile\fR [\fIoptions\fR]
 .PP
-\fBvsearch\fR \-\-fastq_filter \fIfastqfile\fR (\-\-fastaout |
-\-\-fastaout_discarded | \-\-fastqout | \-\-fastqout_discarded)
-\fIoutputfile\fR [\fIoptions\fR]
+\fBvsearch\fR \-\-fastq_filter \fIfastqfile\fR [\-\-reverse
+\fIfastqfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout |
+\-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev | \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR]
 .PP
 \fBvsearch\fR \-\-fastq_join \fIfastqfile\fR \-\-reverse
 \fIfastqfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR
@@ -68,7 +68,11 @@ FASTA/FASTQ file processing:
 \fBvsearch\fR \-\-fastq_stats \fIfastqfile\fR
 [\-\-log \fIlogfile\fR] [\fIoptions\fR]
 .PP
-\fBvsearch\fR \-\-fastx_revcomp \fIfastxfile\fR (\-\-fastaout |
+\fBvsearch\fR \-\-fastx_filter \fIinputfile\fR [\-\-reverse
+\fIinputfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout |
+\-\-fastqout_discarded | \-\-fastaout_rev | \-\-fastaout_discarded_rev | \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR]
+.PP
+\fBvsearch\fR \-\-fastx_revcomp \fIinputfile\fR (\-\-fastaout |
 \-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
 .PP
 \fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
@@ -957,15 +961,15 @@ file.
 FASTA/FASTQ file processing options:
 .RS
 .PP
-Analyse, shorten, filter, convert or merge sequences in FASTQ files,
-or reverse complement sequences in FASTA or FASTQ files. The
+Analyse, trim, filter, convert or merge sequences in FASTQ files, or
+reverse complement sequences in FASTA or FASTQ files. The
 \-\-fastq_chars command can be used to analyse FASTQ files to identify
 the quality encoding and the range of quality score values used. To
 convert between different FASTQ file variants, use the
 \-\-fastq_convert command. Statistical analysis of the quality and
 length of the sequences in a FASTQ file may be performed with the
 \-\-fastq_stats, \-\-fastq_eestats, and \-\-fastq_eestats2
-commands. Sequences may be shortened, filtered and converted by the
+commands. Sequences may be trimmed, filtered and converted by the
 \-\-fastq_filter or \-\-fastx_filter commands. Paired-end reads can be
 merged using the \-\-fastq_mergepairs command. The \-\-fastx_revcomp
 command reverse-complements sequences. Finally, the \-\-sff_convert
@@ -975,7 +979,9 @@ command can be used to convert SFF files to FASTQ.
 .B \-\-eeout
 When using \-\-fastq_filter or \-\-fastq_mergepairs, include the
 number of expected errors (ee) in the sequence header of FASTQ and
-FASTA files. This option is a synonym of the \-\-fastq_eeout option.
+FASTA files. This option is a synonym of the \-\-fastq_eeout
+option. Use the \-\-xee option to remove this information from
+headers.
 .TP
 .BI \-\-eetabbedout \0filename
 When specified with the \-\-fastq_mergepairs command, write statistics
@@ -992,6 +998,11 @@ When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
 write to the given FASTA-formatted file the sequences passing the
 filter, or the merged sequences.
 .TP
+.BI \-\-fastaout_rev \0filename
+When using \-\-fastq_filter or \-\-fastx_filter,
+write to the given FASTA-formatted file the reverse reads passing the
+filter.
+.TP
 .BI \-\-fastaout_notmerged_fwd \0filename
 When using \-\-fastq_mergepairs, write forward reads not merged to the
 specified FASTA file.
@@ -1004,6 +1015,11 @@ specified FASTA file.
 Write sequences that do not pass the filter of the \-\-fastq_filter or
 \-\-fastx_filter command to the given FASTA-formatted file.
 .TP
+.BI \-\-fastaout_discarded_rev \0filename
+Write reverse reads that do not pass the filter of the
+\-\-fastq_filter or \-\-fastx_filter command to the given
+FASTA-formatted file.
+.TP
 .B \-\-fastq_allowmergestagger
 When using \-\-fastq_mergepairs, allow to merge staggered read
 pairs. Staggered pairs are pairs where the 3' end of the reverse read
@@ -1051,9 +1067,11 @@ be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout
 options. The output file is specified with the \-\-fastqout option.
 .TP
 .B \-\-fastq_eeout
-When using \-\-fastq_filter or \-\-fastq_mergepairs, include the
-number of expected errors (ee) in the sequence header of FASTQ and
-FASTA files. This option is a synonym of the \-\-eeout option.
+When using \-\-fastq_filter, \-\-fastx_filter or \-\-fastq_mergepairs,
+include the number of expected errors (ee) in the sequence header of
+FASTQ and FASTA files. This option is a synonym of the \-\-eeout
+option. Use the \-\-xee option to remove this information from
+headers.
 .TP
 .BI \-\-fastq_eestats \0filename
 Analyze a FASTQ file and report statistics on the distributions of
@@ -1098,7 +1116,7 @@ as its argument. The default setting is "0.5,1.0,2.0" that indicates
 that expected error levels of 0.5, 1.0 and 2.0 should be used.
 .TP
 .BI \-\-fastq_filter \0filename
-Shorten and/or filter sequences in the given FASTQ file. Similar to
+Trim and/or filter sequences in the given FASTQ file. Similar to
 the \-\-fastx_filter command, but works only on FASTQ files. See
 \-\-fastx_filter for details.
 .TP
@@ -1341,10 +1359,19 @@ When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter,
 write to the given FASTQ-formatted file the sequences passing the
 filter, or the merged sequences.
 .TP
+.BI \-\-fastqout_rev \0filename
+When using \-\-fastq_filter or \-\-fastx_filter,
+write to the given FASTQ-formatted file the reverse reads passing the
+filter.
+.TP
 .BI \-\-fastqout_discarded \0filename
 When using \-\-fastq_filter or \-\-fastx_filter, write sequences that
 do not pass the filter to the given FASTQ-formatted file.
 .TP
+.BI \-\-fastqout_discarded_rev \0filename
+When using \-\-fastq_filter or \-\-fastx_filter, write reverse reads that
+do not pass the filter to the given FASTQ-formatted file.
+.TP
 .BI \-\-fastqout_notmerged_fwd \0filename
 When using \-\-fastq_mergepairs, write forward reads not merged to the
 specified FASTQ file.
@@ -1354,26 +1381,34 @@ When using \-\-fastq_mergepairs, write reverse reads not merged to the
 specified FASTQ file.
 .TP
 .BI \-\-fastx_filter \0filename
-Shorten and/or filter the sequences in the given FASTA or FASTQ file
-and output the remaining sequences to the FASTQ file specified with
-the \-\-fastqout option and to the FASTA file specified with the
-\-\-fastaout option. The discarded sequences are written to the files
+Trim and/or filter the sequences in the given FASTA or FASTQ file and
+output the remaining sequences to the FASTQ file specified with the
+\-\-fastqout option and/or to the FASTA file specified with the
+\-\-fastaout option. Discarded sequences are written to the files
 specified with the \-\-fastaout_discarded and \-\-fastqout_discarded
 options. The input format (FASTA or FASTQ) is automatically
-detected. Output can not be written to FASTQ files if the input is in
-FASTA format. Sequences may be shortened using the options
-\-\-fastq_stripleft, \-\-fastq_stripright, \-\-fastq_truncee,
-\-\-fastq_trunclen, \-\-fastq_trunclen_keep and
-\-\-fastq_truncqual. The sequences may be filtered using the options
+detected. If the input consists of paired sequences, an input file
+with reverse reads may be specified with the \-\-reverse option, and
+corresponding output will be written to the files specified with the
+\-\-fastqout_rev, \-\-fastaout_rev, \-\-fastqout_discarded_rev, and
+\-\-fastaout_discarded_rev options. Output can not be written to FASTQ files
+if the input is in FASTA format. The sequences are first trimmed and
+then filtered based on the remaining bases. Sequences may be trimmed
+using the options \-\-fastq_stripleft, \-\-fastq_stripright,
+\-\-fastq_truncee, \-\-fastq_trunclen, \-\-fastq_trunclen_keep and
+\-\-fastq_truncqual.  The sequences may be filtered using the options
 \-\-fastq_maxee, \-\-fastq_maxee_rate, \-\-fastq_maxlen,
-\-\-fastq_maxns, \-\-fastq_minlen, \-\-fastq_trunclen, \-\-maxsize,
-and \-\-minsize. If shortening results in an empty sequence, it is
-discarded. The sequences are first shortened and then filtered based
-on the remaining bases. If no shortening or filtering options are
-given, all sequences are written to the output files, possibly after
-conversion from FASTQ to FASTA format. The \-\-relabel option may be
-used to relabel the output sequences. The \-\-eeout may be used to
-output the expected number of errors in each sequence.
+\-\-fastq_maxns, \-\-fastq_minlen (default 1), \-\-fastq_trunclen,
+\-\-maxsize, and \-\-minsize. Sequences not satisfying the
+requirements are discarded. For pairs of sequences, both sequences in
+a pair must satisfy the requirements, otherwise both are
+discarded. If no trimming or filtering options are given, all
+sequences are written to the output files, possibly after conversion
+from FASTQ to FASTA format. The \-\-relabel option may be used to
+relabel the output sequences. The \-\-eeout option may be used to output the
+expected number of errors in each sequence. After all sequences have
+been processed, the number of kept and discarded sequences will be
+shown, as well as how many of the kept sequences were trimmed.
 .TP
 .BI \-\-fastx_revcomp \0filename
 Reverse-complement the sequences in the given FASTA or FASTQ file to a
@@ -1426,8 +1461,9 @@ Please see the description of the same option under Chimera detection
 for details.
 .TP
 .BI \-\-reverse \0filename
-When using \-\-fastq_mergepairs or \-\-fastq_join, specify the FASTQ
-file containing containing the reverse reads.
+When using \-\-fastq_filter, \-\-fastx_filter, \-\-fastq_mergepairs or
+\-\-fastq_join, specify the FASTQ file containing the
+reverse reads.
 .TP
 .BI \-\-sff_convert \0filename
 Convert the given SFF file to FASTQ. The FASTQ output file is
@@ -1447,6 +1483,11 @@ default no clipping is performed.
 .B \-\-xsize
 Strip abundance information from the headers when writing the output
 file.
+.TP
+.B \-\-xee
+Strip information about expected errors (ee) from the output file
+headers. This information is added by the \-\-fastq_eeout and
+\-\-eeout options.
 .RE
 .PP
 .\" ----------------------------------------------------------------------------
@@ -3508,6 +3549,12 @@ Fixed serious bug in x86_64 SIMD alignment code introduced in version
 2.10.3. Added link to BioConda in README. Fixed bug in fastq_stats
 with sequence length 1. Fixed use of equals symbol in UC files for
 identical sequences with cluster_fast.
+.TP
+.BR v2.11.0\~ "released February 13th, 2019"
+Added ability to trim and filter paired-end reads using the reverse
+option with the fastx_filter and fastq_filter commands. Added \-\-xee
+option to remove ee attributes from FASTA headers. Minor invisible
+improvement to the progress indicator.
 .RE
 .LP
 .\" ============================================================================
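
To make the `--xee` entry above a little more concrete, here is a small sketch of its intended effect on a single header. It assumes the annotation written by `--eeout` has the form `;ee=<number>`; the `sed` expression is only a stand-in for illustration, not how vsearch implements the option, and the file names in the final command are hypothetical.

```sh
# Assumed header form: a label followed by ";ee=<number>", as added by --eeout.
header='>read1;ee=0.5000'

# Rough stand-in for the effect of --xee on one header (not vsearch's own code).
stripped=$(printf '%s\n' "$header" | sed -E 's/;ee=[0-9.]+//')
echo "$stripped"

# Hypothetical vsearch call that would request the same stripping directly.
xee_cmd="vsearch --fastx_filter reads_with_ee.fastq --xee --fastaout reads_no_ee.fasta"
echo "$xee_cmd"
```

The pattern mirrors `--xsize`, which strips abundance information in the same way, so the two options can presumably be combined when plain labels are wanted.
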

diff --git a/src/Makefile.am b/src/Makefile.am
@@ -15,11 +15,11 @@ AM_CFLAGS=$(AM_CXXFLAGS)
 export MACOSX_DEPLOYMENT_TARGET
 
 VSEARCHHEADERS=\
-abundance.h \
 align.h \
 align_simd.h \
 allpairs.h \
 arch.h \
+attributes.h \
 bitmap.h \
 chimera.h \
 city.h \
@@ -108,11 +108,11 @@ endif
 endif
 
 __top_builddir__bin_vsearch_SOURCES = $(VSEARCHHEADERS) \
-abundance.cc \
 align.cc \
 align_simd.cc \
 allpairs.cc \
 arch.cc \
+attributes.cc \
 bitmap.cc \
 chimera.cc \
 cluster.cc \

diff --git a/src/align.cc b/src/align.cc
@@ -2,7 +2,7 @@
 
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <[email protected]>,

diff --git a/src/align.h b/src/align.h
@@ -2,7 +2,7 @@
 
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <[email protected]>,

diff --git a/src/align_simd.h b/src/align_simd.h
@@ -2,7 +2,7 @@
 
   VSEARCH: a versatile open source tool for metagenomics
 
-  Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
+  Copyright (C) 2014-2019, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
   All rights reserved.
 
   Contact: Torbjorn Rognes <[email protected]>,