-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
subseq enhancement - fasta ID and comment for bed, gtf and gff #177
Comments
Thank you Thomas! Hmm, seqkit does not mean to implement comprehensive gene feature files manipulations that bedtools and bedops have already provided.
GFF3 is much more complicated, I think.
Good point.
I don't know, may be try later. Sorry, I'm a little bit busy these days. I may handle this when I'm free. |
Sure, no worries! This is really just cherries on the cake. But just in case you pick it up at some point:
Actually not at all. GFF3 derives from GTF and can store more complicated features, but when it comes to the 9th columns with the attributes, they are almost the same from a parsing perspective (they have different reserved keys but we don't care about that here)
The differences in gff3: key and value are separated by "=" instead of a space; values are not enclosed in "". And no extra space before ";". In Perl, for example, they could be read with one and the same regexp
|
Hi,
thanks so much for seqkit - it's one of the tools I use almost on a daily basis and I love it!
There is one feature though, that I feel is lacking a bit. I regularly use
seqkit subseq
to extract features based on annotations files. Ideally, I would like to have more control over the ID that is given to the extracted fasta sequences and possibly the comment as well. Usually, my annotations come in gff3 (which is not supported), so I convert to bed, and then I use a perl-one-liner to turn the fasta comment into the fasta ID. It works, but since I use it so often, and I guess others might too, it feels like it would be very useful if seqkit natively provided that functionality.This issue has been raised before: #154, #89
So what I would which for in terms of enhancements for seqkit would be:
gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.
ID tag: A flag like
--id-tag
could be added, similar to--gtf-tag
. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs...In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple
--gtf-tags gene_id,transcript_id,product
.Obviously, those aren't crucial issues, but I think they would help to further improve seqkit API.
Cheers
Thomas
The text was updated successfully, but these errors were encountered: