Sorting fastq by length? Feature suggestion #500
Comments
It does, but the memory would be very high as it's in-memory sorting. An alternative way is to use multiple-stage external sorting.
If you do want this, try this multi-command solution, which relies on a shell command. I'm unsure which type of question you are asking; please choose one.
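As a rough illustration of the multi-stage external sorting idea (a plain-Python sketch, not the maintainer's command-line solution and not how seqkit works internally): sort fixed-size chunks in memory, spill each chunk to a temporary file as tab-linearized records, then do a k-way merge. The function names and the `chunk_records` parameter are hypothetical, and the sketch assumes a well-formed 4-line-per-record FASTQ whose headers contain no tab characters.

```python
import heapq
import os
import tempfile
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # end of file
        yield tuple(lines)


def seq_len(line):
    # A linearized record is "header\tseq\tplus\tqual"; field 1 is the sequence.
    return len(line.split("\t")[1])


def external_sort_by_length(in_path, out_path, chunk_records=100_000):
    """Sort a FASTQ file by descending read length using bounded memory."""
    chunk_paths = []
    try:
        # Stage 1: sort each chunk in memory and spill it to a temp file.
        with open(in_path) as fin:
            records = read_fastq(fin)
            while True:
                chunk = list(islice(records, chunk_records))
                if not chunk:
                    break
                chunk.sort(key=lambda r: len(r[1]), reverse=True)
                fd, path = tempfile.mkstemp(suffix=".tsv")
                with os.fdopen(fd, "w") as tmp:
                    for rec in chunk:
                        tmp.write("\t".join(rec) + "\n")
                chunk_paths.append(path)

        # Stage 2: k-way merge of the sorted chunks, longest reads first.
        handles = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as fout:
                for line in heapq.merge(*handles, key=seq_len, reverse=True):
                    # Turn the tab-linearized record back into four lines.
                    fout.write(line.rstrip("\n").replace("\t", "\n") + "\n")
        finally:
            for h in handles:
                h.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

Peak memory is governed by `chunk_records` rather than by the size of the input file, at the cost of writing the data to disk one extra time.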
Oh, thank you! Interesting. I ended up writing a sorting pipeline using paste and awk, but it's painful, especially rewriting the FASTQ afterwards. Sorry my question was not clear. I am not totally sure any of your solutions works for it, so here is a more concrete example: I have a 50 GB FASTQ file (file size, not bases) and I want to sort it in descending read-length order, then "slice" through it to get the first 8 GB of the file, for instance using head -c 8G sorted.fastq > extract.fastq. But of course this counts the read names too, and it can break the extract at the end. I found how to rename reads with seqkit to get very short read names and so reduce the space they take (it's OK if I am not taking exactly 8 GB of pure reads). I also found that you have a tool to sanitize a FASTQ, which I haven't tried yet and which I hope would fix the potential issue of my slice cutting inside the last read. That could also be edited manually, since I don't need to do this on many files. Thanks for your patience, and for memusg, which I didn't know at all. It seems like a very useful tool. Thanks again.
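One way to avoid cutting inside the last read is to copy whole records until a byte budget is reached, instead of slicing by raw bytes. The following is a minimal Python sketch of that idea, not a seqkit feature; the function name is hypothetical, and it assumes an already sorted, 4-line-per-record FASTQ.

```python
def take_records_within_budget(in_path, out_path, max_bytes):
    """Copy whole FASTQ records until adding another would exceed max_bytes.

    Returns (number of records kept, number of bytes written).
    """
    written = 0
    kept = 0
    # Binary mode so that sizes are exact byte counts, as with head -c.
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[3]:
                break  # end of file, or an incomplete trailing record
            size = sum(len(line) for line in record)
            if written + size > max_bytes:
                break  # the next record would not fit in the budget
            fout.writelines(record)
            written += size
            kept += 1
    return kept, written


# Example call (file names taken from the comment above, budget of 8 GiB):
# take_records_within_budget("sorted.fastq", "extract.fastq", 8 * 1024**3)
```

The output is at most `max_bytes` long and always ends on a record boundary, so headers still count toward the budget but no read is split.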
Damn, I was stupid. We can directly sort linearized FASTQ records... Here's a one-line solution. Extracting <=1Mb reads with the minimum number of reads. Please change the buffer size (4G) of
It works, but it cuts through a record, and my file ends with half a header. How could seqkit sana handle that? I guess the slice could also end with a header and a partial read, which seems to cause errors with some assemblers such as hifiasm. Any ideas? Thanks a lot for your patience; it is greatly appreciated and helps me become more efficient. Alex
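If a file already ends in a partial record, one option is to keep only the complete records and drop the damaged tail. The sketch below is plain Python under stated assumptions (4-line records, no line wrapping); the function name is hypothetical and it does not reproduce anything seqkit does.

```python
def drop_incomplete_tail(in_path, out_path):
    """Copy complete 4-line FASTQ records; stop at the first malformed one.

    Returns the number of records kept.
    """
    kept = 0
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        while True:
            header, seq, plus, qual = (fin.readline() for _ in range(4))
            complete = (
                header.startswith(b"@")
                and plus.startswith(b"+")
                # Sequence and quality must have the same length,
                # ignoring line endings.
                and len(seq.rstrip(b"\r\n")) == len(qual.rstrip(b"\r\n"))
            )
            if not complete:
                break  # end of file or a truncated record
            if not qual.endswith(b"\n"):
                qual += b"\n"  # the last line of the file may lack a newline
            fout.writelines((header, seq, plus, qual))
            kept += 1
    return kept
```

Because it stops at the first record that fails the checks, this is only suitable for files damaged at the very end, such as the output of a byte-based slice.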
seqkit sana is not for that. It's the inappropriate column delimiter setting in
Hello, for assembly purposes it's sometimes useful to sort reads by length and extract the N longest reads. There are specific applications where this should be done with FASTQ and not FASTA, and it's surprisingly "complicated". For instance, your sort algorithm doesn't work with FASTQ as input. It's not a bug, of course, but maybe it could be a feature to implement? Just an idea. Another would be a tool to extract the N longest or shortest reads from the sorted FASTQ. If it's too specialized, feel free to just close this. (I haven't found a way to do it with seqkit; I hope I didn't miss something.) Thanks again for the tool. It is very useful for many applications, and with FASTA it's incredible: I have yet to find a use case that is not implemented.