Annotate sequences for upload to ENA and for download by user #3739

Open
rneher opened this issue Feb 24, 2025 · 1 comment
Labels
deposition related to ENA/INSDC deposition

Comments

rneher commented Feb 24, 2025

We translate genes and report amino acid mutations, but we don't relay the coordinates of the features to ENA (NCBI wouldn't allow that). Hence the sequences we submit look "draft-y" on GenBank. Since we have this information, we should make it available to INSDC. We could also allow users to download annotated genomes, e.g. as a GFF3 or GenBank file.
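
As a rough illustration of the GFF3 download option, here is a minimal sketch of serialising features into GFF3's nine tab-separated columns. The function names, the `EXAMPLE_1` sequence ID, and the `geneX` feature are hypothetical and not taken from the codebase; attribute values are written without the escaping that the GFF3 specification requires for reserved characters, so this is a starting point rather than a complete writer.

```python
from typing import Dict, Iterable, Tuple

# (type, start, end, strand, attributes); coordinates are 1-based, inclusive.
Feature = Tuple[str, int, int, str, Dict[str, str]]


def gff3_line(seqid: str, feature: Feature, source: str = ".") -> str:
    """Format one feature as a single GFF3 record (no attribute escaping)."""
    ftype, start, end, strand, attributes = feature
    # Phase is only meaningful for CDS features; assume the CDS starts in frame.
    phase = "0" if ftype == "CDS" else "."
    attrs = ";".join(f"{key}={value}" for key, value in attributes.items()) or "."
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, phase, attrs]
    return "\t".join(columns)


def to_gff3(seqid: str, features: Iterable[Feature]) -> str:
    """Return a GFF3 document with the version pragma and one line per feature."""
    lines = ["##gff-version 3"]
    lines.extend(gff3_line(seqid, feature) for feature in features)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [("CDS", 10, 99, "+", {"ID": "cds-1", "Name": "geneX"})]
    print(to_gff3("EXAMPLE_1", example), end="")
```

The coordinates passed in here would have to be in the coordinate system of the unaligned genome, not of the reference.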

anna-parker added the "deposition" label (related to ENA/INSDC deposition) on Feb 24, 2025
theosanderson (Member) commented Feb 26, 2025

(all agreed with the general idea of this issue)

> Since we have this information

I'm not sure to what extent we currently have this information stored in a format that is useful for INSDC, or indeed for displaying it ourselves (e.g. with Gensplore, which is available as a React component). Below I start thinking this through:

As I understand it, what we store is:

  • Unaligned genome
  • Nucleotide sequences hard-aligned to reference sequence
  • Amino acids hard-aligned to reference sequence
  • Lists of insertions and deletions, both for amino acids and nucleotides
  • We also have, in the Nextclade dataset, the coordinates for feature locations in the reference genome

What we need for INSDC is:

  • an unaligned genome (which we have)
  • a list of coordinates for the various features in that genome. We don't have this to hand, but I guess we could compute it by applying the list of nucleotide insertions and deletions to the feature coordinates in the reference genome?
  • likely also unaligned amino acid translations, calculated from these coordinates.
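
To make the coordinate idea concrete, here is a minimal sketch of lifting a reference feature interval into the coordinates of the unaligned genome. The data shapes are assumptions for illustration, not the formats we actually store: deletions are 1-based inclusive ranges in reference coordinates, insertions map a reference position to the sequence inserted directly after it, and the query is assumed to cover the reference from position 1 (a real implementation would also need to handle the alignment start offset). The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def ref_to_query(
    pos: int,
    insertions: Dict[int, str],
    deletions: List[Tuple[int, int]],
) -> Optional[int]:
    """Map a 1-based reference position to a 1-based unaligned-genome position.

    Returns None if the reference base at `pos` is deleted in the query.
    """
    deleted_before = 0
    for d_start, d_end in deletions:
        if d_start <= pos <= d_end:
            return None  # this base does not exist in the query
        if d_end < pos:
            deleted_before += d_end - d_start + 1
    # An insertion "after p" shifts every reference position greater than p.
    inserted_before = sum(len(seq) for p, seq in insertions.items() if p < pos)
    return pos - deleted_before + inserted_before


def lift_feature(
    start: int,
    end: int,
    insertions: Dict[int, str],
    deletions: List[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Lift a reference interval [start, end] to query coordinates.

    The interval is trimmed to its first and last surviving bases;
    None is returned when the whole feature is deleted.
    """
    surviving = [
        p for p in range(start, end + 1)
        if ref_to_query(p, insertions, deletions) is not None
    ]
    if not surviving:
        return None
    return (
        ref_to_query(surviving[0], insertions, deletions),
        ref_to_query(surviving[-1], insertions, deletions),
    )


if __name__ == "__main__":
    # Toy case: reference ACGTACGTAC, "TT" inserted after position 2,
    # reference positions 6-7 deleted, so the query is ACTTGTATAC.
    print(lift_feature(4, 9, {2: "TT"}, [(6, 7)]))
```

In the toy case the reference feature 4..9 (TACGTA) lands on query positions 6..9 (TATA), i.e. the feature with the two deleted bases removed. Slicing the unaligned genome at the lifted interval would then give the nucleotides to translate, though frame-shifting indels would need extra care before submitting a CDS.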


3 participants