Skip to content

Commit

Permalink
VSEARCH 1.0.1: Bug fixes and minor changes
Browse files Browse the repository at this point in the history
  • Loading branch information
torognes committed Dec 1, 2014
1 parent 8da0059 commit 75c03e7
Show file tree
Hide file tree
Showing 16 changed files with 228 additions and 96 deletions.
12 changes: 6 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t

The latest releases of VSEARCH are available [here](https://github.com/torognes/vsearch/releases).

Executable binaries of VSEARCH are available in the `bin` folder for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-osx-x86_64). These binaries include support for input files compressed by zlib and bzip2 (with files usually ending in .gz or .bz2).
Binary executables of VSEARCH are available in the `bin` folder for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-osx-x86_64). These executables include support for input files compressed by zlib and bzip2 (with files usually ending in .gz or .bz2).

Download the appropriate binary and make a symbolic link in a folder included in your `$PATH` from `vsearch` to the appropriate binary. You may use the following commands (assuming `~/bin` is in your `$PATH`):
Download the appropriate executable and make a symbolic link in a folder included in your `$PATH` from `vsearch` to the appropriate binary. You may use the following commands (assuming `~/bin` is in your `$PATH`):

```sh
cd ~
Expand All @@ -56,15 +56,15 @@ The alternative makefiles Makefile.ZLIB, Makefile.BZLIB and Makefile.static may

## Implementation details and initial assessment

**Search algorithm:** VSEARCH indexes the unique kmers in the database in a way similar USEARCH, but is currently limited to continuous words (non-spaced seeds) of length 3--15. It samples every unique kmer from each query sequence and identifies the number of matching kmers in each database sequence. It then examines the database sequences in order of decreasing number of kmer matches. A full global alignment is computed and those target sequences that satisfy all accept options are retained while the others are rejected. The `--maxrejects` and `--maxaccepts` options are supported in this process, indicating the maximum number of non-matching and matching target sequences considered, respectively. Please see the USEARCH paper and supplementary for details.
**Search algorithm:** VSEARCH indexes the unique kmers in the database in a way similar to USEARCH, but is currently limited to continuous words (non-spaced seeds) of length 3-15. It samples every unique kmer from each query sequence and identifies the number of matching kmers in each database sequence. It then examines the database sequences in order of decreasing number of kmer matches. A full global alignment is computed and those target sequences that satisfy all accept options are retained while the others are rejected. The `--maxrejects` and `--maxaccepts` options are supported in this process, indicating the maximum number of non-matching and matching target sequences considered, respectively. Please see the USEARCH paper and supplementary for details.

**Kmer selection:** How many and which kmers USEARCH chooses from the query sequence is not well documented. It is also not known exactly which database sequences are examined, and in which order. We have therefore experimented with various strategies in order to obtain good performance. Our procedure seems to give results equal to or better than USEARCH.

We have chosen to select all unique kmers from the query. At least 6 of these kmers must be present in the database sequence before it will be considered. Also, at least 1 out of 16 query kmers need to be present in the database sequence. Furthermore, if several database sequences have the same number of kmer matches, they will be examined in order of decreasing sequence length.

It appears that there are differences in usearch between the searches performed by the `--usearch_global` command and the clustering commands. Notably, it appears like `--usearch_global` simply ignores the options `--wordlength`, `--slots` and `--pattern`, while the clustering commands takes them into account. VSEARCH supports the `--wordlength` option for kmer lengths from 3 to 15, but the options `--slots` and `--pattern` are ignored.

**Alignment:** VSEARCH uses a 8-way 16-bit SIMD vectorized implementation of the full dynamic programming algorithm (Needleman-Wunsch) for global sequence alignment. It is an adaptation of the method described by Rognes (2011). USEARCH by default uses heuristic procedure involving seeding, extension and banded dynamic programming. If the `--fulldp` option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower.
**Alignment:** VSEARCH uses a 8-way 16-bit SIMD vectorized implementation of the full dynamic programming algorithm (Needleman-Wunsch) for global sequence alignment. It is an adaptation of the method described by Rognes (2011). USEARCH by default uses a heuristic procedure involving seeding, extension and banded dynamic programming. If the `--fulldp` option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower.

**Search Accuracy:** The accuracy of VSEARCH searches has been assessed and compared to USEARCH version 7.0.1090. The Rfam 11.0 database was used for the assessment, as described on the [USEARCH website](http://drive5.com/usearch/benchmark_rfam.html). A similar procedure was described in the USEARCH paper using the Rfam 9.1 database.

Expand Down Expand Up @@ -98,7 +98,7 @@ VSEARCH is about 40% faster than USEARCH on *de novo* chimera detection and abou

**Extensions:** A shuffle command has been added. By specifying a FASTA file using the `--shuffle` option, and an output file with the `--output` option, VSEARCH will shuffle the sequences in a pseudo-random order. An integer may be specified as the seed with the `--seed` option to generate the same shuffling several times. By default, or when `--seed 0` is specified, the pseudo-random number generator will be initialized with pseudo-random data from the machine to give different numbers each time it is run.

Another extension implemented is that dereplication will honour the `--sizein` option and add together the abundances of the sequences that are merged.
Another extension implemented is that `derep_fulllength` and `--cluster_fast` will honour the `--sizein` option and add together the abundances of the sequences that are merged.

The commands `--sortbylength` and `--sortbysize` supports the `--topn` option to output no more than the given number of sequences.

Expand Down Expand Up @@ -323,7 +323,7 @@ The following people have contributed to VSEARCH:
## Citing VSEARCH

No papers about VSEARCH have been published yet, but a manuscript is in preparation.
Please refer to the [VSEARCH GitHub repository](https://github.com/torognes/vsearch).
For now, please cite the [VSEARCH GitHub repository](https://github.com/torognes/vsearch).


## Test datasets
Expand Down
25 changes: 18 additions & 7 deletions doc/vsearch.1
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
.\" ============================================================================
.TH vsearch 1 "November 28, 2014" "version 1.0.0" "USER COMMANDS"
.TH vsearch 1 "December 1, 2014" "version 1.0.1" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication, masking, searching, shuffling and sorting of amplicons from metagenomic projects.
Expand Down Expand Up @@ -391,7 +391,7 @@ headers).
.TP
.B --sizeout
Add abundance annotations to the output fasta files (add the pattern
";size=\fIinteger\fR" to sequence headers). If --sizein is specified,
";size=\fIinteger\fR;" to sequence headers). If --sizein is specified,
abundance annotations are reported to output files, and each cluster
centroid receives a new abundance value corresponding to the total
abundance of the amplicons included in the cluster (--centroids
Expand Down Expand Up @@ -447,7 +447,7 @@ and sorted by decreasing abundance. Identical sequences receive the
header of the first sequence of their group. If --sizeout is used, the
number of occurrences (i.e. abundance) of each sequence is indicated
at the end of their fasta header using the pattern
";size=\fIinteger\fR".
";size=\fIinteger\fR;".
.TP
.B --sizein
Take into account the abundance annotations present in the input fasta
Expand All @@ -456,7 +456,7 @@ headers).
.TP
.B --sizeout
Add abundance annotations to the output fasta file (add the pattern
";size=\fIinteger\fR" to sequence headers). If --sizein is specified,
";size=\fIinteger\fR;" to sequence headers). If --sizein is specified,
each unique sequence receives a new abundance value corresponding to
its total abundance (sum of the abundances of its occurrences). If
--sizein is not specified, input abundances are set to 1, and each
Expand Down Expand Up @@ -607,7 +607,7 @@ commands become case sensitive. The default is to mask using
Write database target sequences matching at least one query sequence
to \fIfilename\fR, in fasta format. If the option --sizeout is used,
the number of queries that matched each target sequence is indicated
using the pattern ";size=\fIinteger\fR".
using the pattern ";size=\fIinteger\fR;".
.TP
.BI --dbnotmatched \0filename
Write database target sequences not matching query sequences to
Expand Down Expand Up @@ -847,7 +847,7 @@ identical.
.TP
.B --sizeout
Add abundance annotations to the output of the option --dbmatched
(using the pattern ";size=\fIinteger\fR").
(using the pattern ";size=\fIinteger\fR;").
.TP
.BI --strand\~ "plus|both"
When searching for similar sequences, check the \fIplus\fR strand only
Expand Down Expand Up @@ -958,7 +958,7 @@ abundance annotations.
.TP
.B --sizeout
When using --relabel, report abundance annotations to the output fasta
file (using the pattern ";size=\fIinteger\fR").
file (using the pattern ";size=\fIinteger\fR;").
.TP
.BI --sortbylength \0filename
Sort by decreasing length the sequences contained in
Expand Down Expand Up @@ -1183,6 +1183,11 @@ definitions that were removed from usearch.
.PP
\fBvsearch\fR extends the --topn option to sorting commands.
.PP
\fBvsearch\fR extends the --sizein option to dereplication and
--cluster_fast.
.PP
\fBvsearch\fR treats T and U as identical nucleotides for dereplication.
.PP
.\" ============================================================================
.SH NOVELTIES
\fBvsearch\fR introduces new options not present in usearch. They are
Expand Down Expand Up @@ -1327,6 +1332,12 @@ or minor bug releases are not mentioned):
.TP
.BR v1.0.0\~ "released November 28th, 2014"
First public release
.TP
.BR v1.0.1\~ "released December 1st, 2014"
Bug fixes (sortbysize, semicolon after size annotation in headers)
and minor changes
(labels as secondary sort key for most sorts, treat T and U as identical
for dereplication, only output size in dbmatched file if sizeout specified).
.LP
.\" ============================================================================
.\" TODO:
Expand Down
Binary file modified doc/vsearch_manual.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion eval/eval.sh
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ DB=../data/Rfam_11_0.fasta
if [ $(uname -s) == "Linux" ]; then
VSEARCH=$(ls -v ../bin/vsearch*linux* | tail -1)
else
VSEARCH=$(ls -t ../bin/vsearch*macosx* | head -1)
VSEARCH=$(ls -t ../bin/vsearch*osx* | head -1)
fi

#VSEARCH=../src/vsearch
Expand Down
61 changes: 41 additions & 20 deletions src/db.cc
Original file line number Diff line number Diff line change
Expand Up @@ -464,7 +464,7 @@ void db_read(const char * filename, int upcase)
/* read sizein annotation if appropriate */
if ((opt_usearch_global ||
opt_uchime_denovo ||
(opt_sortbysize && opt_relabel && opt_sizeout) ||
opt_sortbysize ||
(opt_sortbylength && opt_relabel && opt_sizeout) ||
(opt_cluster_fast && opt_sizein) ||
(opt_cluster_smallmem && opt_sizein) ||
Expand Down Expand Up @@ -549,16 +549,21 @@ void db_fprint_fasta_with_size(FILE * fp, unsigned long seqno, unsigned long siz
int pat_start = pmatch[0].rm_so;
int pat_end = pmatch[0].rm_eo;

fprintf(fp, ">%.*s%ssize=%lu%s%.*s\n",
fprintf(fp,
">%.*s%s%.*s%ssize=%lu;\n",
pat_start, hdr,
pat_start > 0 ? ";" : "",
size,
pat_end < hdrlen ? ";" : "",
hdrlen - pat_end, hdr + pat_end);
hdrlen - pat_end, hdr + pat_end,
((pat_end < hdrlen) && (hdr[hdrlen - 1] != ';')) ? ";" : "",
size);
}
else
{
fprintf(fp, ">%s;size=%lu\n", hdr, size);
fprintf(fp,
">%s%ssize=%lu;\n",
hdr,
((hdrlen == 0) || (hdr[hdrlen - 1] != ';')) ? ";" : "",
size);
}

fprint_fasta_seq_only(fp, seq, seqlen, opt_fasta_width);
Expand All @@ -569,39 +574,55 @@ int compare_bylength(const void * a, const void * b)
seqinfo_t * x = (seqinfo_t *) a;
seqinfo_t * y = (seqinfo_t *) b;

/* longest first, otherwise keep order */
/* longest first, then by label, otherwise keep order */

if (x->seqlen < y->seqlen)
return +1;
else if (x->seqlen > y->seqlen)
return -1;
else
if (x < y)
return -1;
else if (x > y)
return +1;
else
return 0;
{
int r = strcmp(x->header, y->header);
if (r != 0)
return r;
else
{
if (x < y)
return -1;
else if (x > y)
return +1;
else
return 0;
}
}
}

inline int compare_byabundance(const void * a, const void * b)
{
seqinfo_t * x = (seqinfo_t *) a;
seqinfo_t * y = (seqinfo_t *) b;

/* most abundant first, otherwise keep order */
/* most abundant first, then by label, otherwise keep order */

if (x->size > y->size)
return -1;
else if (x->size < y->size)
return +1;
else
if (x < y)
return -1;
else if (x > y)
return +1;
else
return 0;
{
int r = strcmp(x->header, y->header);
if (r != 0)
return r;
else
{
if (x < y)
return -1;
else if (x > y)
return +1;
else
return 0;
}
}
}

void db_sortbylength()
Expand Down
62 changes: 45 additions & 17 deletions src/derep.cc
Original file line number Diff line number Diff line change
Expand Up @@ -52,28 +52,56 @@ int derep_compare(const void * a, const void * b)
struct bucket * x = (struct bucket *) a;
struct bucket * y = (struct bucket *) b;

/* highest abundance first, otherwize keep order */
/* highest abundance first, then by label, otherwise keep order */

if (x->size < y->size)
return +1;
else if (x->size > y->size)
return -1;
else
if (x->seqno_first < y->seqno_first)
return -1;
else if (x->seqno_first > y->seqno_first)
return +1;
else
return 0;
{
int r = strcmp(db_getheader(x->seqno_first),
db_getheader(y->seqno_first));
if (r != 0)
return r;
else
{
if (x->seqno_first < y->seqno_first)
return -1;
else if (x->seqno_first > y->seqno_first)
return +1;
else
return 0;
}
}
}

void string_upcase(char * upcase, char * s, unsigned long len)
void string_normalize(char * normalized, char * s, unsigned long len)
{
/* convert string to upper case, given all chars are in A-Za-z */
/* convert string to upper case and replace U by T */
char * p = s;
char * q = upcase;
char * q = normalized;
for(unsigned long i=0; i<len; i++)
*q++ = *p++ & 0xdf;
*q++ = chrmap_normalize[(int)(*p++)];
}

int seqcmp(char * a, char * b, int n)
{
char * p = a;
char * q = b;

if (n <= 0)
return 0;

while ((n-- > 0) && (chrmap_4bit[(int)(*p)] == chrmap_4bit[(int)(*q)]))
{
if ((n == 0) || (*p == 0) || (*q == 0))
break;
p++;
q++;
}

return chrmap_4bit[(int)(*p)] - chrmap_4bit[(int)(*q)];
}

void derep_fulllength()
Expand Down Expand Up @@ -142,8 +170,8 @@ void derep_fulllength()
unsigned long seqlen = db_getsequencelen(i);
char * seq = db_getsequence(i);

/* upper case sequence */
string_upcase(seq_up, seq, seqlen);
/* normalize sequence: uppercase and replace U by T */
string_normalize(seq_up, seq, seqlen);

/* reverse complement if necessary */
if (opt_strand > 1)
Expand All @@ -170,7 +198,7 @@ void derep_fulllength()
&&
((bp->hash != hash) ||
(seqlen != db_getsequencelen(bp->seqno_first)) ||
(strncasecmp(seq_up, db_getsequence(bp->seqno_first), seqlen))))
(seqcmp(seq_up, db_getsequence(bp->seqno_first), seqlen))))
{
bp++;
j++;
Expand Down Expand Up @@ -199,9 +227,9 @@ void derep_fulllength()
&&
((rc_bp->hash != rc_hash) ||
(seqlen != db_getsequencelen(rc_bp->seqno_first)) ||
(strncasecmp(rc_seq_up,
db_getsequence(rc_bp->seqno_first),
seqlen))))
(seqcmp(rc_seq_up,
db_getsequence(rc_bp->seqno_first),
seqlen))))
{
rc_bp++;
k++;
Expand Down
Loading

0 comments on commit 75c03e7

Please sign in to comment.