Feature request: change the option "--top_hits_only" to "--top_N_hits_only" #592

tao-bioinfo · 2025-02-21T01:13:55Z

Using --top_hits_only in usearch_global is risky because reference databases contain taxonomic misannotations.

For example, if the identity of hit A is 99.124% while that of hit B is 99.123%, the option --top_hits_only will keep only hit A. However, I have frequently found that A is taxonomically mislabelled while B appears to be correct.

Currently, my strategy is to set a low identity threshold such as --id 0.6 to obtain as many hits as possible, and then select the top-N-hits. The remaining hits are useless.

I would be glad if there were an option called --top_N_hits_only N, with the conventional --top_hits_only being equivalent to --top_N_hits_only 1.
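The workaround described above (collect many hits, then keep the best N per query) can be sketched as a small post-processing step. This is only an illustration: `top_n_hits` is a hypothetical helper name, and it assumes tab-separated lines with the query, target, and percent identity in the first three columns, as in blast6out-style output.

```python
from collections import defaultdict


def top_n_hits(blast6_lines, n):
    """Keep the n highest-identity hits per query.

    Assumes each line is tab-separated with query, target and percent
    identity in columns 1-3 (blast6out-style). Hypothetical helper, not
    part of vsearch.
    """
    hits = defaultdict(list)
    for line in blast6_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        query, target, identity = fields[0], fields[1], float(fields[2])
        hits[query].append((target, identity))

    result = {}
    for query, rows in hits.items():
        # Highest identity first; sorted() is stable, so ties keep input order.
        ranked = sorted(rows, key=lambda row: row[1], reverse=True)
        result[query] = ranked[:n]
    return result
```

With `n=1` this keeps a single best hit per query, which is stricter than --top_hits_only when several hits tie for the best identity.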


torognes · 2025-02-28T10:09:55Z

Hi, thanks for your suggestion.

I am not sure if I fully understand your request.

The option --top_hits_only is designed to include only the top hits that have exactly the same residue identity percentage as the best hit. It may include several hits if they all have exactly the same ID percentage.
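As a rough sketch of that tie rule (not vsearch's actual code): given the accepted hits for one query, keep only those whose identity equals the best identity. The function name and the `(target, identity)` tuple layout are assumptions made for illustration, and exact equality on floats stands in for whatever comparison vsearch performs internally.

```python
def keep_top_hits(hits):
    """Return the hits that tie with the best identity.

    hits: list of (target, identity_percent) tuples for a single query.
    Illustrative only; mirrors the behaviour described in the comment above.
    """
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return [(target, identity) for target, identity in hits if identity == best]
```

In the example from the original request, a hit at 99.123% is dropped when another hit reaches 99.124%, because the two percentages are not exactly equal.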

However, for vsearch to include more than one hit in the results you need to adjust the argument to the --maxaccepts option. The default here is 1, which makes vsearch stop as soon as it has found one acceptable hit. If you use --maxaccepts 10 it will show up to 10 acceptable hits. Combining this with --top_hits_only will show up to 10 hits that have exactly the same ID percentage. If you use --maxaccepts 0 it will show all acceptable hits.

You may also need to adjust the argument to --maxrejects to get all the hits you want, because vsearch will also stop when it has encountered a certain number of unacceptable hits (e.g. ID% too low). The default is 32, but you may increase it to for example 100 or 1000. If you set it to 0, it will examine all target sequences.

Please note that increasing these values will slow vsearch down considerably.
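The stopping rule can be pictured with a simplified model. This is a sketch under stated assumptions, not vsearch's implementation: `examine` is a hypothetical function, candidates arrive already ordered, identities are fractions, and the defaults (1 and 32, with 0 meaning no limit) are the ones mentioned above.

```python
def examine(candidates, min_id, maxaccepts=1, maxrejects=32):
    """Walk ordered candidates until enough accepts or rejects are seen.

    candidates: iterable of (target, identity) in examination order.
    min_id: identity threshold as a fraction, as given to --id.
    A limit of 0 disables that stopping condition.
    """
    accepted = []
    rejects = 0
    for target, identity in candidates:
        if identity >= min_id:
            accepted.append((target, identity))
            if maxaccepts and len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break
    return accepted
```

With both limits at 0 every candidate is examined, which is why raising or disabling the limits costs run time.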

The way vsearch works is to order all target sequences by the number of k-mers (usually 8 nucleotides in a row) they share with the query and then start examining those with the highest number of shared k-mers, until a certain number of acceptable or unacceptable sequences have been checked. So it is heuristic and may not always find all the hits you expect.
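A toy version of the k-mer ordering step might look like the following. It is a simplification for illustration: the helper names are made up, and it counts distinct shared k-mers, which may differ from how vsearch counts them.

```python
def kmers(seq, k=8):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets(query, targets, k=8):
    """Order target names by shared k-mers with the query, highest first.

    targets: dict mapping target name to sequence.
    Counts distinct shared k-mers only; a simplification, not vsearch's code.
    """
    query_kmers = kmers(query, k)
    shared = {name: len(query_kmers & kmers(seq, k)) for name, seq in targets.items()}
    return sorted(shared, key=lambda name: shared[name], reverse=True)
```

Targets near the front of this ordering are examined first, so a true hit that shares few k-mers with the query can be missed once the accept or reject limit is reached.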
