Sequence context: tasks list #147

Open
2 of 5 tasks
eboileau opened this issue Aug 20, 2024 · 0 comments

Collaborator

eboileau commented Aug 20, 2024

A clear and concise description of todo items.

  • The FASTA files were all prepared ad hoc (human and mouse only), copied to the server by hand, and had their permissions set manually. Ideally, this wrangling
    should be integrated into the AssemblyService (cf. FileService) and the CLI, so that it happens when a new assembly is created for the current version. One problem, though, is that we need bgzip for compression and samtools for indexing.

    I took some time to figure out how best to do this without adding too many dependencies. The options I considered:

    1. Install samtools from the Debian package repositories. However, the packaged version (1.16.1) is far behind the current release (1.21).
    2. Install samtools in the Dockerfile. This adds quite a few dependencies, and it does not cover development or Jenkins.
    3. Take the samtools binary from an installation-free bundle of scripts and precompiled binaries, e.g. Bwakit. This is untested, and I don't know whether bgzip is included.
    4. Use an alternative, e.g. pyfaidx. I prefer to avoid this for obvious reasons. Biopython might work, but it's a big package.
    

    In the end it was easy: CrossMap, which is already installed as a Python dependency, itself depends on pysam.
    I didn't know this, but it looks like the pysam wheel bundles everything (or, failing that, the htslib source code can be compiled),
    so samtools does not need to be installed separately. The only problem is the pysam API, which is not well documented.
    I'm using pysam.tabix_compress for bgzip. The resulting file is larger, and the number of blocks slightly different,
    compared with what I had previously generated with samtools/bgzip, but this seems to work anyway.

    pysam=0.22.1 (wraps htslib/samtools/bcftools 1.18).
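
    A minimal sketch of what the pysam-based wrangling could look like. The function names prepare_fasta and index_files are hypothetical and not part of the AssemblyService; the sketch assumes that pysam.faidx (pysam's wrapper around "samtools faidx") is acceptable for indexing, alongside pysam.tabix_compress mentioned above. pysam is imported lazily so the module can still be imported where it is missing.

```python
from pathlib import Path
from typing import List


def index_files(compressed: Path) -> List[Path]:
    """Sidecar files that "samtools faidx" writes next to a bgzipped FASTA."""
    return [
        compressed.with_name(compressed.name + ".fai"),  # sequence index
        compressed.with_name(compressed.name + ".gzi"),  # BGZF block index
    ]


def prepare_fasta(fasta: Path, force: bool = False) -> Path:
    """Compress a plain FASTA file with BGZF and build its faidx index.

    Hypothetical helper: returns the path of the compressed file.
    """
    import pysam  # already available as a dependency of CrossMap

    compressed = fasta.with_name(fasta.name + ".gz")
    # Writes bgzip-compatible (BGZF) output; raises if the target
    # exists and force is False.
    pysam.tabix_compress(str(fasta), str(compressed), force=force)
    # Dispatches to "samtools faidx" from the bundled samtools.
    pysam.faidx(str(compressed))
    return compressed
```

    With this in place, a region can then be read back through pysam.FastaFile(str(compressed)).fetch(name, start, end), using 0-based, half-open coordinates. Setting permissions on the files returned by index_files would still need to be handled separately.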

  • modification_api.get_genomic_sequence_context needs refactoring; in its current form it is not really testable.

  • It is currently assumed that pybedtools outputs a FASTA file in which the sequence sits on a single line, no matter how long it is, i.e. that the requested sequence always fits on the second line of the file. It would be wiser to read this file with a proper FASTA parser, e.g. Biopython's SeqIO:

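from Bio import SeqIO

# Option 1: index the file and load records lazily, on access
record_dict = SeqIO.index("example.fasta", "fasta")
print(record_dict["gi:12345678"])  # use any record ID

# Option 2: parse the whole file into an in-memory dict
record_dict = SeqIO.to_dict(SeqIO.parse("example.fasta", "fasta"))
print(record_dict["gi:12345678"])  # use any record ID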
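
    If pulling in Biopython only for this turns out to be too heavy, the single-line assumption could also be dropped with a few lines of standard-library code. This is only a sketch: read_fasta is a hypothetical helper, and it assumes that the record ID is the first whitespace-delimited token of each header line.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map record IDs to sequences, joining sequences wrapped over lines."""
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Record ID: first token of the header (assumption).
            current = header.split()[0] if header else ""
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before any FASTA header")
        else:
            chunks[current].append(line)
    # Concatenate the collected lines of each record.
    return {name: "".join(parts) for name, parts in chunks.items()}
```

    Usage would be along the lines of opening the pybedtools output and calling read_fasta(handle), then looking up the requested interval by its header rather than by line position.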
  • Should we allow users to query a different context length? I don't believe this is a must. We should, however, think about cDNA/transcript context. This requires careful consideration (does it affect the data model? should it be integrated into data annotation? etc.) and will be handled in a separate issue in due time.

  • Change modification color to primary green (and related docs).

@eboileau eboileau added type:enhancement New feature or request needs:refactoring Code smell labels Aug 20, 2024
@eboileau eboileau added this to the Features milestone Aug 20, 2024
@eboileau eboileau self-assigned this Aug 20, 2024