Seqkit seq could filter by a %of nucleotides above a specific quality threshold (both user-defined) #472

MiguelHK · 2024-06-27T20:45:42Z

Currently, seqkit seq filters by quality using the average quality score of each read. Other tools, such as FASTX's fastq_quality_filter, let the user specify what percentage of nucleotides must have a minimum PHRED score of X. Example:

Suppose we have 5 sequences, each 100 nucleotides long, and in every one of them:

50 nucleotides have a phred score of 20.
50 nucleotides have a phred score of 40.

With an average phred score of 30, these sequences might be accepted by seqkit seq --min-qual 30. But if we want to make sure that only a small percentage of the nucleotides have low quality (say, at most 20% of nucleotides below a phred score of 30), all of these sequences should be discarded, because 50% of their nucleotides fall below that threshold. This is currently not possible with seqkit, but it is possible (albeit slower) with other tools.
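To make the requested behaviour concrete, here is a minimal Python sketch of a per-base percentage filter. It is not seqkit code: the function names `passes_quality_filter` and `filter_fastq`, their parameters, and the Phred+33 encoding are all assumptions made for illustration.

```python
def passes_quality_filter(qual, min_phred=30, max_low_pct=20.0, offset=33):
    """Return True if at most max_low_pct percent of bases score below min_phred.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    Empty quality strings are rejected.
    """
    if not qual:
        return False
    low = sum(1 for ch in qual if ord(ch) - offset < min_phred)
    return 100.0 * low / len(qual) <= max_low_pct


def filter_fastq(lines, **kwargs):
    """Yield (header, seq, plus, qual) for 4-line FASTQ records that pass.

    Assumes unwrapped FASTQ (exactly four lines per record); a truncated
    trailing record is silently dropped.
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        if passes_quality_filter(qual.rstrip("\n"), **kwargs):
            yield header, seq, plus, qual


# The read from the example: 50 bases at Phred 20 ('5') and 50 at Phred 40 ('I').
example_qual = "5" * 50 + "I" * 50
print(passes_quality_filter(example_qual))                    # False
print(passes_quality_filter(example_qual, max_low_pct=50.0))  # True
```

With the defaults (threshold 30, at most 20% of bases below it), the example read is rejected because 50% of its bases are below the threshold.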

Now, knowing how flexible and fast seqkit is, I would love to see this feature included!

shenwei356 · 2024-06-27T20:53:17Z

I'd recommend fastp, which supports this. See https://github.com/OpenGene/fastp?tab=readme-ov-file#quality-filter. Maybe you can use -q 30 -u 20.

BTW, for a read with 50 bp at score 20 and 50 bp at score 40, the average quality score is not 30.

MiguelHK · 2024-06-27T21:28:20Z

I am aware that fastp is capable of doing this. However, I use seqkit for several steps, and it would be great if this were also a feature of seqkit.

About the average quality score, you are correct: the average score is ~23. Thanks for pointing it out!
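The ~23 figure can be checked with a short calculation, assuming the average is taken over per-base error probabilities (P = 10^(-Q/10)) and then converted back to a Phred score, rather than over the Phred scores directly. The helper name `mean_phred_by_error` is hypothetical and is not part of seqkit.

```python
import math


def mean_phred_by_error(scores):
    """Average Phred scores via their error probabilities, P = 10^(-Q/10)."""
    probs = [10 ** (-q / 10) for q in scores]
    mean_p = sum(probs) / len(probs)
    return -10 * math.log10(mean_p)


scores = [20] * 50 + [40] * 50
print(round(sum(scores) / len(scores), 2))    # 30.0  (arithmetic mean of Phred scores)
print(round(mean_phred_by_error(scores), 2))  # 22.97 (mean of error probabilities)
```

The Phred 20 bases (1% error each) dominate the mean error probability, which is why the result sits much closer to 20 than to 30.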
