WIP: reduce minimizer threshold from 0.3 to 0.1 #1409
The `sort` command currently writes a sequence to the file for a dataset prefix whenever a dataset with that prefix is a hit. If we changed this to the more stringent condition that the dataset must be the best hit, the command would work more like a hierarchical sort.
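To make the difference concrete, here is a minimal sketch of the two routing rules. It is not the actual implementation: the function names, the `scores` mapping (dataset name to minimizer match fraction) and the example values are all hypothetical and chosen only for illustration.

```python
def route_any_hit(scores, threshold):
    """Behaviour as described above: every dataset that is a hit gets the sequence."""
    return sorted(name for name, score in scores.items() if score >= threshold)


def route_best_hit(scores, threshold):
    """Proposed stricter rule: only the best-scoring hit gets the sequence."""
    hits = {name: score for name, score in scores.items() if score >= threshold}
    if not hits:
        return []
    return [max(hits, key=hits.get)]


# Hypothetical match fractions for one input sequence (made-up numbers).
scores = {"rsv/a": 0.8, "rsv/b": 0.15}

print(route_any_hit(scores, 0.1))   # sequence lands in both outputs
print(route_best_hit(scores, 0.1))  # sequence lands only in the best-matching output
```

With the stricter rule each sequence ends up in at most one output, which is what makes the result behave like a sort.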
If I understand this correctly, one could process only the first (best) dataset here:
This makes it possible to map entries in the input FASTA unambiguously and reliably to entries in the output TSV, which is important in the presence of duplicated sequence names.
For very diverse viruses, our current minimizer match-fraction threshold of 0.3 is too stringent. A threshold of 0.1 is still specific enough, in the sense that no random hits are produced, but it yields many suboptimal hits between related viruses (like RSV-A matching RSV-B). This is not a problem as long as we consider only the best hit. But in the `sort` command, where every hit is written out, we end up not sorting anymore.
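The trade-off can be sketched with a small helper that separates the best hit from the suboptimal ones at a given threshold. This is an illustration under assumptions, not the project's code: `split_hits` and the match fractions below are hypothetical.

```python
def split_hits(scores, threshold):
    """Return (best_hit, suboptimal_hits) among datasets scoring at or above the threshold."""
    ranked = sorted(
        ((score, name) for name, score in scores.items() if score >= threshold),
        reverse=True,
    )
    names = [name for _, name in ranked]
    if not names:
        return None, []
    return names[0], names[1:]


# Hypothetical match fractions for a divergent sequence (made-up numbers).
divergent = {"rsv/a": 0.22, "rsv/b": 0.12}

print(split_hits(divergent, 0.3))  # nothing clears the stricter threshold
print(split_hits(divergent, 0.1))  # best hit found, plus a suboptimal hit to a related virus
```

At 0.3 the divergent sequence is not assigned at all; at 0.1 it is assigned, but the related dataset also appears as a suboptimal hit, which only matters if a consumer acts on every hit rather than on the best one.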