Feature request: change the option "--top_hits_only" to "--top_N_hits_only" #592

Open
tao-bioinfo opened this issue Feb 21, 2025 · 1 comment

Comments

@tao-bioinfo

tao-bioinfo commented Feb 21, 2025

Using --top_hits_only in usearch_global is very dangerous because the reference database contains taxonomic mis-annotations.

For example, if hit A has an identity of 99.124% and hit B has 99.123%, the option --top_hits_only keeps only hit A. However, I have frequently found that A is taxonomically mislabelled while B appears to be correct.

Currently, my strategy is to set a low identity threshold, such as --id 0.6, to obtain as many hits as possible, and then select the top N hits. The remaining hits are useless.
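The post-filtering step described above could be sketched roughly as follows. This is only an illustration, not part of vsearch: the function name `top_n_hits` is hypothetical, and it assumes tab-separated rows whose first three columns are query, target, and percent identity (the layout of blast6-style output).

```python
from collections import defaultdict


def top_n_hits(lines, n):
    """Keep the n highest-identity hits per query.

    Assumes tab-separated rows with query, target, and percent identity
    in the first three columns (an assumption about the input file).
    """
    by_query = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        query, target, identity = fields[0], fields[1], float(fields[2])
        by_query[query].append((target, identity))

    result = {}
    for query, hits in by_query.items():
        # Highest identity first; sorted() is stable, so ties keep input order.
        ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
        result[query] = ranked[:n]
    return result
```

With the example from this issue, `top_n_hits(rows, 2)` would keep both hit A (99.124) and hit B (99.123) for the same query, so a mislabelled A no longer hides B.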

I would be glad if there were an option called --top_N_hits_only N, with the conventional --top_hits_only being equivalent to --top_N_hits_only 1.

@torognes
Owner

Hi, thanks for your suggestion.

I am not sure if I fully understand your request.

The option --top-hits-only is designed to include only the top hits that have exactly the same residue identity percentage as the best hit. It may include several hits if they all have exactly the same ID percentage.
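To make the tie behaviour concrete, here is a minimal sketch of that selection rule. It is not vsearch's code: the function name `top_hits_only` is hypothetical, and comparing identities with exact float equality is an assumption (the thread does not say at what precision vsearch compares them).

```python
def top_hits_only(hits):
    """Return every hit whose identity equals the best identity.

    hits is a list of (target, identity) tuples. Exact equality is an
    assumption made for this sketch.
    """
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    # All hits tied with the best identity are kept, in input order.
    return [(target, identity) for target, identity in hits if identity == best]
```

Under this rule, a hit at 99.123% is dropped when the best hit is 99.124%, while two hits at exactly 99.124% are both kept.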

However, for vsearch to include more than one hit in the results, you need to adjust the argument to the --maxaccepts option. The default here is 1, which makes vsearch stop as soon as it has found one acceptable hit. If you use --maxaccepts 10, it will show up to 10 acceptable hits. Combining this with --top-hits-only will show up to 10 hits that have exactly the same ID percentage. If you use --maxaccepts 0, it will show all acceptable hits.

You may also need to adjust the argument to --maxrejects to get all the hits you want, because vsearch will also stop when it has encountered a certain number of unacceptable hits (e.g. ID% too low). The default is 32, but you may increase it to, for example, 100 or 1000. If you set it to 0, it will examine all target sequences.

Please note that increasing these values will slow vsearch down considerably.

The way vsearch works is to order all target sequences by the number of shared k-mers (usually 8 nucleotides in a row) with the query and then start examining those with the highest number of shared k-mers, until a certain number of acceptable or unacceptable sequences have been checked. So it is heuristic and may not always find all the hits you expected.
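The search order and stopping rules described above can be illustrated with a toy model. This is a rough sketch and not vsearch's actual implementation: all function names are hypothetical, `toy_identity` is a crude stand-in for a real pairwise alignment, and counting distinct shared k-mers is a simplifying assumption.

```python
def kmers(seq, k=8):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(a, b):
    """Stand-in for a real alignment: matching positions over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def search(query, targets, min_id, maxaccepts=1, maxrejects=32, k=8):
    """Examine targets in order of shared k-mers until a limit is reached.

    targets maps names to sequences. A limit of 0 means "no limit",
    mirroring the behaviour described for --maxaccepts and --maxrejects.
    """
    query_kmers = kmers(query, k)
    # Most shared k-mers first; ties keep insertion order (stable sort).
    ranked = sorted(
        targets.items(),
        key=lambda item: len(query_kmers & kmers(item[1], k)),
        reverse=True,
    )
    accepted, rejects = [], 0
    for name, seq in ranked:
        identity = toy_identity(query, seq)
        if identity >= min_id:
            accepted.append((name, identity))
            if maxaccepts and len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break
    return accepted
```

In this model, a good target that shares few k-mers with the query is ranked late and may never be examined if one of the limits is reached first, which is why the search can miss expected hits unless both limits are raised or set to 0.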
