Using --top_hits_only with --usearch_global is very dangerous because reference databases contain taxonomic mis-annotations.
For example, if hit A has an identity of 99.124% and hit B has 99.123%, --top_hits_only keeps only hit A. However, I have frequently found that A is taxonomically mislabelled while B appears to be correct.
My current strategy is to set a low identity threshold, such as --id 0.6, to obtain as many hits as possible, and then select the top N hits. The remaining hits are not needed.
I would be glad to have an option such as --top_N_hits_only N, where the existing --top_hits_only would be equivalent to --top_N_hits_only 1.
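Until such an option exists, the top-N selection described above can be done as a post-processing step. The following is only a minimal sketch: it assumes the hits were written as tab-separated rows with the query label in column 1, the target label in column 2 and the percent identity in column 3 (the layout of blast6out-style output). The function name `top_n_hits` is hypothetical and not part of vsearch.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def top_n_hits(lines: Iterable[str], n: int) -> Dict[str, List[List[str]]]:
    """Keep the n highest-identity hits for each query.

    Assumes tab-separated rows: query label, target label, percent identity,
    followed by any further columns (which are kept untouched).
    """
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        hits[row[0]].append(row)
    # sorted() is stable, so hits with equal identity keep their input order.
    return {
        query: sorted(rows, key=lambda r: float(r[2]), reverse=True)[:n]
        for query, rows in hits.items()
    }


if __name__ == "__main__":
    # Illustrative rows using the identities from the example above.
    example = [
        "q1\tA\t99.124",
        "q1\tB\t99.123",
        "q1\tC\t97.5",
    ]
    for row in top_n_hits(example, 2)["q1"]:
        print(row[1], row[2])
```

With `n = 2`, both hit A and hit B survive, so a mislabelled A can be checked against B before a taxonomy is assigned.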
The option --top_hits_only is designed to include only the top hits that have exactly the same residue identity percentage as the best hit. It may include several hits if they all have exactly the same ID percentage.
However, for vsearch to include more than one hit in the results you need to adjust the argument to the --maxaccepts option. The default is 1, which makes vsearch stop as soon as it has found one acceptable hit. If you use --maxaccepts 10 it will show up to 10 acceptable hits. Combining this with --top_hits_only will show up to 10 hits that have exactly the same ID percentage. If you use --maxaccepts 0 it will show all acceptable hits.
You may also need to adjust the argument to --maxrejects to get all the hits you want, because vsearch also stops once it has encountered a certain number of unacceptable hits (e.g. ID% too low). The default is 32, but you may increase it to, for example, 100 or 1000. If you set it to 0, it will examine all target sequences.
Please note that increasing these values will slow vsearch down considerably.
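To make the combination of these options concrete, here is a small sketch that assembles such a command line without running it. The file names, the `build_search_command` helper and the identity value of 0.97 are illustrative assumptions. `--db` and `--blast6out` are vsearch options that are not mentioned above and are used here only to complete the command.

```python
import shlex
from typing import List


def build_search_command(query: str, db: str, out: str,
                         identity: float = 0.97,
                         maxaccepts: int = 10,
                         maxrejects: int = 100) -> List[str]:
    """Assemble a vsearch argument list; 0 for maxaccepts/maxrejects means no limit."""
    return [
        "vsearch",
        "--usearch_global", query,
        "--db", db,
        "--id", str(identity),
        "--maxaccepts", str(maxaccepts),   # report up to this many acceptable hits
        "--maxrejects", str(maxrejects),   # give up after this many unacceptable hits
        "--blast6out", out,
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of executing it.
    print(shlex.join(build_search_command("queries.fasta", "reference.fasta", "hits.b6")))
```

The resulting list can be passed to `subprocess.run` once the paths point to real files. Larger limits mean more alignments per query and therefore longer run times.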
The way vsearch works is to order all target sequences by the number of k-mers (usually 8 nucleotides in a row) they share with the query, and then to examine them starting with those that share the most k-mers, until a certain number of acceptable or unacceptable sequences have been checked. So it is heuristic and may not always find all the hits you expect.
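The stopping behaviour can be illustrated with a toy model. This is a simplified sketch of the idea only, not vsearch's actual implementation: all function names are hypothetical, and `naive_identity` compares positions directly instead of performing the global alignment a real search would use.

```python
from typing import Callable, Dict, List, Set, Tuple


def kmers(seq: str, k: int = 8) -> Set[str]:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def naive_identity(a: str, b: str) -> float:
    """Placeholder identity: fraction of matching positions, without alignment."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length


def heuristic_search(query: str, targets: Dict[str, str], min_id: float,
                     maxaccepts: int = 1, maxrejects: int = 32, k: int = 8,
                     identity: Callable[[str, str], float] = naive_identity
                     ) -> List[Tuple[str, float]]:
    """Examine targets in order of shared k-mers until a limit is reached.

    A limit of 0 means "no limit", mirroring the option semantics described above.
    """
    query_kmers = kmers(query, k)
    # Targets sharing the most k-mers with the query are examined first.
    ranked = sorted(targets,
                    key=lambda name: len(query_kmers & kmers(targets[name], k)),
                    reverse=True)
    accepted: List[Tuple[str, float]] = []
    rejects = 0
    for name in ranked:
        ident = identity(query, targets[name])
        if ident >= min_id:
            accepted.append((name, ident))
            if maxaccepts and len(accepted) >= maxaccepts:
                break  # enough acceptable hits found
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break  # too many unacceptable hits seen
    return accepted
```

In this model, a good target that happens to share few k-mers with the query can sit behind the reject limit and never be examined, which is why a heuristic search may miss hits.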