VSEARCH 1.0.1: Bug fixes and minor changes

torognes · Dec 1, 2014 · 75c03e7 · 75c03e7
1 parent 8da0059
commit 75c03e7
Show file tree

Hide file tree

Showing 16 changed files with 228 additions and 96 deletions.
diff --git a/README.md b/README.md
@@ -32,9 +32,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
 
 The latest releases of VSEARCH are available [here](https://github.com/torognes/vsearch/releases).
 
-Executable binaries of VSEARCH are available in the `bin` folder for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-osx-x86_64). These binaries include support for  input files compressed by zlib and bzip2 (with files usually ending in .gz or .bz2).
+Binary executables of VSEARCH are available in the `bin` folder for [GNU/Linux on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-linux-x86_64) and [Apple Mac OS X on x86-64 systems](https://github.com/torognes/vsearch/blob/master/bin/vsearch-1.0.0-osx-x86_64). These executables include support for  input files compressed by zlib and bzip2 (with files usually ending in .gz or .bz2).
 
-Download the appropriate binary and make a symbolic link in a folder included in your `$PATH` from `vsearch` to the appropriate binary. You may use the following commands (assuming `~/bin` is in your `$PATH`):
+Download the appropriate executable and make a symbolic link in a folder included in your `$PATH` from `vsearch` to the appropriate binary. You may use the following commands (assuming `~/bin` is in your `$PATH`):
 
 ```sh
 cd ~
@@ -56,15 +56,15 @@ The alternative makefiles Makefile.ZLIB, Makefile.BZLIB and Makefile.static may
 
 ## Implementation details and initial assessment
 
-**Search algorithm:** VSEARCH indexes the unique kmers in the database in a way similar USEARCH, but is currently limited to continuous words (non-spaced seeds) of length 3--15. It samples every unique kmer from each query sequence and identifies the number of matching kmers in each database sequence. It then examines the database sequences in order of decreasing number of kmer matches. A full global alignment is computed and those target sequences that satisfy all accept options are retained while the others are rejected. The `--maxrejects` and `--maxaccepts` options are supported in this process, indicating the maximum number of non-matching and matching target sequences considered, respectively. Please see the USEARCH paper and supplementary for details.
+**Search algorithm:** VSEARCH indexes the unique kmers in the database in a way similar to USEARCH, but is currently limited to continuous words (non-spaced seeds) of length 3-15. It samples every unique kmer from each query sequence and identifies the number of matching kmers in each database sequence. It then examines the database sequences in order of decreasing number of kmer matches. A full global alignment is computed and those target sequences that satisfy all accept options are retained while the others are rejected. The `--maxrejects` and `--maxaccepts` options are supported in this process, indicating the maximum number of non-matching and matching target sequences considered, respectively. Please see the USEARCH paper and supplementary for details.
 
 **Kmer selection:** How many and which kmers USEARCH chooses from the query sequence is not well documented. It is also not known exactly which database sequences are examined, and in which order. We have therefore experimented with various strategies in order to obtain good performance. Our procedure seems to give results equal to or better than USEARCH.
 
 We have chosen to select all unique kmers from the query. At least 6 of these kmers must be present in the database sequence before it will be considered. Also, at least 1 out of 16 query kmers need to be present in the database sequence. Furthermore, if several database sequences have the same number of kmer matches, they will be examined in order of decreasing sequence length.
 
 It appears that there are differences in usearch between the searches performed by the `--usearch_global` command and the clustering commands. Notably, it appears like `--usearch_global` simply ignores the options `--wordlength`, `--slots` and `--pattern`, while the clustering commands takes them into account. VSEARCH supports the `--wordlength` option for kmer lengths from 3 to 15, but the options `--slots` and `--pattern` are ignored.
 
-**Alignment:** VSEARCH uses a 8-way 16-bit SIMD vectorized implementation of the full dynamic programming algorithm (Needleman-Wunsch) for global sequence alignment. It is an adaptation of the method described by Rognes (2011). USEARCH by default uses heuristic procedure involving seeding, extension and banded dynamic programming. If the `--fulldp` option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower.
+**Alignment:** VSEARCH uses a 8-way 16-bit SIMD vectorized implementation of the full dynamic programming algorithm (Needleman-Wunsch) for global sequence alignment. It is an adaptation of the method described by Rognes (2011). USEARCH by default uses a heuristic procedure involving seeding, extension and banded dynamic programming. If the `--fulldp` option is specified to USEARCH it will also use a full dynamic programming approach, but USEARCH is then considerably slower.
 
 **Search Accuracy:** The accuracy of VSEARCH searches has been assessed and compared to USEARCH version 7.0.1090. The Rfam 11.0 database was used for the assessment, as described on the [USEARCH website](http://drive5.com/usearch/benchmark_rfam.html). A similar procedure was described in the USEARCH paper using the Rfam 9.1 database.
 
@@ -98,7 +98,7 @@ VSEARCH is about 40% faster than USEARCH on *de novo* chimera detection and abou
 
 **Extensions:** A shuffle command has been added. By specifying a FASTA file using the `--shuffle` option, and an output file with the `--output` option, VSEARCH will shuffle the sequences in a pseudo-random order. An integer may be specified as the seed with the `--seed` option to generate the same shuffling several times. By default, or when `--seed 0` is specified, the pseudo-random number generator will be initialized with pseudo-random data from the machine to give different numbers each time it is run.
 
-Another extension implemented is that dereplication will honour the `--sizein` option and add together the abundances of the sequences that are merged.
+Another extension implemented is that `derep_fulllength` and `--cluster_fast` will honour the `--sizein` option and add together the abundances of the sequences that are merged.
 
 The commands `--sortbylength` and `--sortbysize` supports the `--topn` option to output no more than the given number of sequences.
 
@@ -323,7 +323,7 @@ The following people have contributed to VSEARCH:
 ## Citing VSEARCH
 
 No papers about VSEARCH have been published yet, but a manuscript is in preparation.
-Please refer to the [VSEARCH GitHub repository](https://github.com/torognes/vsearch).
+For now, please cite the [VSEARCH GitHub repository](https://github.com/torognes/vsearch).
 
 
 ## Test datasets

diff --git a/doc/vsearch.1 b/doc/vsearch.1
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH vsearch 1 "November 28, 2014" "version 1.0.0" "USER COMMANDS"
+.TH vsearch 1 "December 1, 2014" "version 1.0.1" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 vsearch \(em chimera detection, clustering, dereplication, masking, searching, shuffling and sorting of amplicons from metagenomic projects.
@@ -391,7 +391,7 @@ headers).
 .TP
 .B --sizeout
 Add abundance annotations to the output fasta files (add the pattern
-";size=\fIinteger\fR" to sequence headers). If --sizein is specified,
+";size=\fIinteger\fR;" to sequence headers). If --sizein is specified,
 abundance annotations are reported to output files, and each cluster
 centroid receives a new abundance value corresponding to the total
 abundance of the amplicons included in the cluster (--centroids
@@ -447,7 +447,7 @@ and sorted by decreasing abundance. Identical sequences receive the
 header of the first sequence of their group. If --sizeout is used, the
 number of occurrences (i.e. abundance) of each sequence is indicated
 at the end of their fasta header using the pattern
-";size=\fIinteger\fR".
+";size=\fIinteger\fR;".
 .TP
 .B --sizein
 Take into account the abundance annotations present in the input fasta
@@ -456,7 +456,7 @@ headers).
 .TP
 .B --sizeout
 Add abundance annotations to the output fasta file (add the pattern
-";size=\fIinteger\fR" to sequence headers).  If --sizein is specified,
+";size=\fIinteger\fR;" to sequence headers).  If --sizein is specified,
 each unique sequence receives a new abundance value corresponding to
 its total abundance (sum of the abundances of its occurrences). If
 --sizein is not specified, input abundances are set to 1, and each
@@ -607,7 +607,7 @@ commands become case sensitive. The default is to mask using
 Write database target sequences matching at least one query sequence
 to \fIfilename\fR, in fasta format. If the option --sizeout is used,
 the number of queries that matched each target sequence is indicated
-using the pattern ";size=\fIinteger\fR".
+using the pattern ";size=\fIinteger\fR;".
 .TP
 .BI --dbnotmatched \0filename
 Write database target sequences not matching query sequences to
@@ -847,7 +847,7 @@ identical.
 .TP
 .B --sizeout
 Add abundance annotations to the output of the option --dbmatched
-(using the pattern ";size=\fIinteger\fR").
+(using the pattern ";size=\fIinteger\fR;").
 .TP
 .BI --strand\~ "plus|both"
 When searching for similar sequences, check the \fIplus\fR strand only
@@ -958,7 +958,7 @@ abundance annotations.
 .TP
 .B --sizeout
 When using --relabel, report abundance annotations to the output fasta
-file (using the pattern ";size=\fIinteger\fR").
+file (using the pattern ";size=\fIinteger\fR;").
 .TP
 .BI --sortbylength \0filename
 Sort by decreasing length the sequences contained in
@@ -1183,6 +1183,11 @@ definitions that were removed from usearch.
 .PP
 \fBvsearch\fR extends the --topn option to sorting commands.
 .PP
+\fBvsearch\fR extends the --sizein option to dereplication and
+--cluster_fast.
+.PP
+\fBvsearch\fR treats T and U as identical nucleotides for dereplication.
+.PP
 .\" ============================================================================
 .SH NOVELTIES
 \fBvsearch\fR introduces new options not present in usearch. They are
@@ -1327,6 +1332,12 @@ or minor bug releases are not mentioned):
 .TP
 .BR v1.0.0\~ "released November 28th, 2014"
 First public release
+.TP
+.BR v1.0.1\~ "released December 1st, 2014"
+Bug fixes (sortbysize, semicolon after size annotation in headers)
+and minor changes
+(labels as secondary sort key for most sorts, treat T and U as identical
+for dereplication, only output size in dbmatched file if sizeout specified).
 .LP
 .\" ============================================================================
 .\" TODO:

diff --git a/doc/vsearch_manual.pdf b/doc/vsearch_manual.pdf
diff --git a/eval/eval.sh b/eval/eval.sh
@@ -12,7 +12,7 @@ DB=../data/Rfam_11_0.fasta
 if [ $(uname -s) == "Linux" ]; then
     VSEARCH=$(ls -v ../bin/vsearch*linux* | tail -1)
 else
-    VSEARCH=$(ls -t ../bin/vsearch*macosx* | head -1)
+    VSEARCH=$(ls -t ../bin/vsearch*osx* | head -1)
 fi
 
 #VSEARCH=../src/vsearch

diff --git a/src/db.cc b/src/db.cc
@@ -464,7 +464,7 @@ void db_read(const char * filename, int upcase)
     /* read sizein annotation if appropriate */
     if ((opt_usearch_global || 
          opt_uchime_denovo ||
-         (opt_sortbysize && opt_relabel && opt_sizeout) ||
+         opt_sortbysize ||
          (opt_sortbylength && opt_relabel && opt_sizeout) ||
          (opt_cluster_fast && opt_sizein) ||
          (opt_cluster_smallmem && opt_sizein) ||
@@ -549,16 +549,21 @@ void db_fprint_fasta_with_size(FILE * fp, unsigned long seqno, unsigned long siz
       int pat_start = pmatch[0].rm_so;
       int pat_end = pmatch[0].rm_eo;
 
-      fprintf(fp, ">%.*s%ssize=%lu%s%.*s\n",
+      fprintf(fp,
+              ">%.*s%s%.*s%ssize=%lu;\n",
               pat_start, hdr,
               pat_start > 0 ? ";" : "",
-              size,
-              pat_end < hdrlen ? ";" : "",
-              hdrlen - pat_end, hdr + pat_end);
+              hdrlen - pat_end, hdr + pat_end,
+              ((pat_end < hdrlen) && (hdr[hdrlen - 1] != ';')) ? ";" : "",
+              size);
     }
   else
     {
-      fprintf(fp, ">%s;size=%lu\n", hdr, size);
+      fprintf(fp,
+              ">%s%ssize=%lu;\n", 
+              hdr,
+              ((hdrlen == 0) || (hdr[hdrlen - 1] != ';')) ? ";" : "", 
+              size);
     }
 
   fprint_fasta_seq_only(fp, seq, seqlen, opt_fasta_width);
@@ -569,39 +574,55 @@ int compare_bylength(const void * a, const void * b)
   seqinfo_t * x = (seqinfo_t *) a;
   seqinfo_t * y = (seqinfo_t *) b;
 
-  /* longest first, otherwise keep order */
+  /* longest first, then by label, otherwise keep order */
 
   if (x->seqlen < y->seqlen)
     return +1;
   else if (x->seqlen > y->seqlen)
     return -1;
   else
-    if (x < y)
-      return -1;
-    else if (x > y)
-      return +1;
-    else
-      return 0;
+    {
+      int r = strcmp(x->header, y->header);
+      if (r != 0)
+        return r;
+      else
+        {
+          if (x < y)
+            return -1;
+          else if (x > y)
+            return +1;
+          else
+            return 0;
+        }
+    }
 }
 
 inline int compare_byabundance(const void * a, const void * b)
 {
   seqinfo_t * x = (seqinfo_t *) a;
   seqinfo_t * y = (seqinfo_t *) b;
 
-  /* most abundant first, otherwise keep order */
+  /* most abundant first, then by label, otherwise keep order */
 
   if (x->size > y->size)
     return -1;
   else if (x->size < y->size)
     return +1;
   else
-    if (x < y)
-      return -1;
-    else if (x > y)
-      return +1;
-    else
-      return 0;
+    {
+      int r = strcmp(x->header, y->header);
+      if (r != 0)
+        return r;
+      else
+        {
+          if (x < y)
+            return -1;
+          else if (x > y)
+            return +1;
+          else
+            return 0;
+        }
+    }
 }
 
 void db_sortbylength()

diff --git a/src/derep.cc b/src/derep.cc
@@ -52,28 +52,56 @@ int derep_compare(const void * a, const void * b)
   struct bucket * x = (struct bucket *) a;
   struct bucket * y = (struct bucket *) b;
 
-  /* highest abundance first, otherwize keep order */
+  /* highest abundance first, then by label, otherwise keep order */
 
   if (x->size < y->size)
     return +1;
   else if (x->size > y->size)
     return -1;
   else
-    if (x->seqno_first < y->seqno_first)
-      return -1;
-    else if (x->seqno_first > y->seqno_first)
-      return +1;
-    else
-      return 0;
+    {
+      int r = strcmp(db_getheader(x->seqno_first),
+                     db_getheader(y->seqno_first));
+      if (r != 0)
+        return r;
+      else
+        {
+          if (x->seqno_first < y->seqno_first)
+            return -1;
+          else if (x->seqno_first > y->seqno_first)
+            return +1;
+          else
+            return 0;
+        }
+    }
 }
 
-void string_upcase(char * upcase, char * s, unsigned long len)
+void string_normalize(char * normalized, char * s, unsigned long len)
 {
-  /* convert string to upper case, given all chars are in A-Za-z */
+  /* convert string to upper case and replace U by T */
   char * p = s;
-  char * q = upcase;
+  char * q = normalized;
   for(unsigned long i=0; i<len; i++)
-    *q++ = *p++ & 0xdf;
+    *q++ = chrmap_normalize[(int)(*p++)];
+}
+
+int seqcmp(char * a, char * b, int n)
+{
+  char * p = a;
+  char * q = b;
+
+  if (n <= 0)
+    return 0;
+
+  while ((n-- > 0) && (chrmap_4bit[(int)(*p)] == chrmap_4bit[(int)(*q)]))
+    {
+      if ((n == 0) || (*p == 0) || (*q == 0))
+        break;
+      p++;
+      q++;
+    }
+
+  return chrmap_4bit[(int)(*p)] - chrmap_4bit[(int)(*q)];
 }
 
 void derep_fulllength()
@@ -142,8 +170,8 @@ void derep_fulllength()
       unsigned long seqlen = db_getsequencelen(i);
       char * seq = db_getsequence(i);
 
-      /* upper case sequence */
-      string_upcase(seq_up, seq, seqlen);
+      /* normalize sequence: uppercase and replace U by T  */
+      string_normalize(seq_up, seq, seqlen);
 
       /* reverse complement if necessary */
       if (opt_strand > 1)
@@ -170,7 +198,7 @@ void derep_fulllength()
              &&
              ((bp->hash != hash) ||
               (seqlen != db_getsequencelen(bp->seqno_first)) ||
-              (strncasecmp(seq_up, db_getsequence(bp->seqno_first), seqlen))))
+              (seqcmp(seq_up, db_getsequence(bp->seqno_first), seqlen))))
         {
           bp++;
           j++;
@@ -199,9 +227,9 @@ void derep_fulllength()
                  &&
                  ((rc_bp->hash != rc_hash) ||
                   (seqlen != db_getsequencelen(rc_bp->seqno_first)) ||
-                  (strncasecmp(rc_seq_up,
-                               db_getsequence(rc_bp->seqno_first), 
-                               seqlen))))
+                  (seqcmp(rc_seq_up,
+                          db_getsequence(rc_bp->seqno_first),
+                          seqlen))))
             {
               rc_bp++;
               k++;