Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting fastq by length? Feature suggestion #500

Open
Axzevd opened this issue Jan 16, 2025 · 5 comments
Open

Sorting fastq by length? Feature suggestion #500

Axzevd opened this issue Jan 16, 2025 · 5 comments

Comments

@Axzevd
Copy link

Axzevd commented Jan 16, 2025

Hello, for assembly purposes, it's sometimes useful to sort by length and extract the Nth longest reads. There are specific applications where it should be done with fastq and not fasta, and it's surprisingly "complicated". For instance your sort algorithm doesn't work with fastq as input. It's not a bug of course, but maybe it could be a feature to implement? Just an idea. Another would be a tool to extract the Nth longest or shortest reads from the sorted fastq. If it's too specialized feel free to just close. Thanks. (I haven't found a way to do it with seqkit I hope I didn't miss something). Thanks again for the tool, it is very useful for many applications, with fasta it's incredible I still have to find a use case that is not implemented.

@shenwei356
Copy link
Owner

your sort algorithm doesn't work with fastq as input

It does, but the memory would be very high as it's in-memory sorting. An alternative way is to use multiple-stage external sorting.

$ seqkit stats GCF_000005845.2.fna.gz.i87.q1000.fastq.gz
file                                       format  type  num_seqs     sum_len  min_len  avg_len  max_len
GCF_000005845.2.fna.gz.i87.q1000.fastq.gz  FASTQ   DNA     46,416  46,416,846      556    1,000    1,042

$ memusg -t -s "seqkit sort -l GCF_000005845.2.fna.gz.i87.q1000.fastq.gz -o t.fq.gz"
[INFO] read sequences ...
[INFO] 46416 sequences loaded
[INFO] sorting ...
[INFO] output ...

elapsed time: 0.899s
peak rss: 225.14 MB

to extract the Nth longest or shortest reads from the sorted fastq

If you do want this, try this multiple commands solution, utilizing the shell command sort (limiting buffer size and using current directory as the tmp dir).

I'm unsure which type of question you are asking, choose one, please.

  • Extract the Nth longest read (sorting reads by descending length, and extract the Nth record). One read.

     # Extract sequence length and sort sequence IDs by sequence length
     seqkit fx2tab -nil GCF_000005845.2.fna.gz.i87.q1000.fastq.gz \
         | sort -k2,2r  -S 4G -T . \
         | gzip -c > seqid2len.tsv.gz
    
     # Extract sequence 123th record
     seqkit grep GCF_000005845.2.fna.gz.i87.q1000.fastq.gz \
         -p $(zcat seqid2len.tsv.gz | awk 'NR==123' | cut -f 1) \
         -o result.fastq.gz
    
  • Extract one or more reads, of which the length is the Nth longest length (sorting unique sequence lengths).

     # Extract sequence length and sort sequence IDs by sequence length
     seqkit fx2tab -nil GCF_000005845.2.fna.gz.i87.q1000.fastq.gz \
         | sort -k2,2r  -S 4G -T . \
         | gzip -c > seqid2len.tsv.gz
    
     # sort unique sequence lengths
     # csvtk uniq is memory-efficient. You can also use 'sort | uniq'.
     csvtk cut -Ht -f 2 seqid2len.tsv.gz \
         | csvtk uniq -Ht  \
         | csvtk sort  -Ht -k 1:nr \
         > seqlen.tsv
    
    # Extract the longest reads. might be more than one.
    csvtk grep -Ht -f 2 -p $(awk 'NR==1' seqlen.tsv) seqid2len.tsv.gz | cut -f 1 \
        | seqkit grep -f - GCF_000005845.2.fna.gz.i87.q1000.fastq.gz -o result.fastq.gz
    

@Axzevd
Copy link
Author

Axzevd commented Jan 18, 2025

Oh thank you! Interesting, I ended up writing a sorting pipeline using paste and awk but it's painful especially rewriting the fastq afterwards. For my question, I am sorry it was not clear, I am not totally sure any of your solution works, I give a more concrete example: I have a 50 GB (size not bases) and I want to sort the fastq in descending read length order then "slice" through it to get the first 8 GB of the file for instance, using

Head -c 8G sorted.fastq > extract.fastq

But of course this counts the read names and can break the extract at the end. I found how to rename with seqkit to have very short read names and so reduce the memory size (it's ok if I am not taking exactly 8 GB of pure reads).

I also found you have a tool to sanitize a fastq which I haven't yet tried and hope would fix the potential issue of my slice cutting inside the last read, though this would be manually editable, I don't need to do this on many files.

Thanks for your patience and memusg which I didn't know at all. Seems like a very useful tool. Thanks again.

@shenwei356
Copy link
Owner

Damn, I was stupid. We can directly sort linearized FASTQ records...

Here's a one-line solution. Extracting <=1Mb reads with the minimum number of reads. Please change the buffer size (4G) of sort, and the total base 1000000.

seqkit fx2tab -l GCF_000005845.2.fna.gz.i87.q1000.fastq.gz \
   | sort -k5,5nr -S 4G -T . \
   | awk -F'\t' '{size+=$4; if(size>=1000000){exit}; print}'  \
   | cut -f 1-3 \
   | seqkit tab2fx -o 1M.fq.gz

# check the result
$ seqkit stats 1M.fq.gz 
file      format  type  num_seqs  sum_len  min_len  avg_len  max_len
1M.fq.gz  FASTQ   DNA      1,000  999,534      967    999.5    1,033

@Axzevd
Copy link
Author

Axzevd commented Jan 22, 2025

It works but cut through, and my file ends with half a header. How could seqkit sana handle that? I guess it could also slice through and have a header and partial read, this seems to cause errors with some assemblers such as hifiasm. Any Ideas?

Thanks a lot for your patience, greatly appreciated it helps me become more efficient.

Alex

@shenwei356
Copy link
Owner

my file ends with half a header.

seqkit sana is not for that. It's the inappropriate column delimiter setting in sort that uses space or/and tab.

seqkit fx2tab -l GCF_000005845.2.fna.gz.i87.q1000.fastq.gz \
   | sort -t$'\t' -k5,5nr -S 4G -T . \
   | awk -F'\t' '{size+=$4; if(size>=1000000){exit}; print}'  \
   | cut -d$'\t' -f 1-3 \
   | seqkit tab2fx -o 1M.fq.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants