Sequence context: tasks list #147

eboileau · 2024-08-20T10:29:43Z

A clear and concise description of todo items.

The FASTA files were all prepared ad hoc (human and mouse only), copied to the server by hand, and their permissions set manually. Ideally, this wrangling should be integrated into the AssemblyService (cf. FileService and the CLI) and run when a new assembly is created for the current version. One problem, though, is that we need bgzip for compression and samtools for indexing.

I took some time to figure out how best to do this without adding too many dependencies. The options I considered:
```
1. Install samtools from the Debian package repositories, but the packaged 1.16.1 is far behind the current 1.21.
2. Install samtools in the Dockerfile, but this adds quite a few dependencies and doesn't work for development or Jenkins.
3. Take the samtools binary from an installation-free bundle of scripts and precompiled binaries, e.g. Bwakit, but this is untested and I don't know whether bgzip is included.
4. Use an alternative, e.g. pyfaidx. I prefer to avoid this for obvious reasons. Biopython might work, but it's a big package...
```
In the end it was easy: CrossMap, which is installed as a Python dependency, itself depends on pysam.
I didn't know this, but it looks like the pysam wheel contains everything, or else the htslib source code can be compiled,
so samtools does not need to be installed separately. The only problem is the API, which is not well documented.
I'm using pysam.tabix_compress for bgzip. The output is larger and the number of blocks is slightly different
from what I had previously generated with samtools/bgzip, but it seems to work anyway.
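
To make the pysam route concrete, here is a minimal sketch of compressing and indexing a FASTA file and then reading a slice back. The helper names `bgzip_and_index` and `fetch_context` are hypothetical and not part of the code base. The calls used (`pysam.tabix_compress`, `pysam.faidx`, `pysam.FastaFile`) exist in pysam, but the exact behaviour should be checked against the installed version.

```python
from pathlib import Path


def bgzip_and_index(fasta: Path, force: bool = False) -> Path:
    """Compress a FASTA file to BGZF and index it (hypothetical helper)."""
    import pysam  # imported lazily so this module loads even where pysam is absent

    compressed = fasta.with_name(fasta.name + ".gz")
    # Writes BGZF output; raises if the target exists and force is False.
    pysam.tabix_compress(str(fasta), str(compressed), force=force)
    # Dispatches to "samtools faidx": writes <name>.gz.fai and <name>.gz.gzi.
    pysam.faidx(str(compressed))
    return compressed


def fetch_context(compressed: Path, chrom: str, start: int, end: int) -> str:
    """Return the 0-based, half-open interval [start, end) of chrom."""
    import pysam

    fasta = pysam.FastaFile(str(compressed))
    try:
        return fasta.fetch(reference=chrom, start=start, end=end)
    finally:
        fasta.close()
```

Something like this could be called from the AssemblyService when a new assembly is created, which would remove the need for a separate samtools installation.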

pysam=0.22.1 (wraps htslib/samtools/bcftools 1.18).
modification_api.get_genomic_sequence_context needs refactoring, and is currently not really "testable".
It currently assumes that pybedtools outputs a FASTA file in which the sequence sits on a single line, however long it is, i.e. that the requested sequence fits entirely on the second line. It would be wiser to read this file with a proper FASTA parser, e.g. Biopython's SeqIO:

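```
from Bio import SeqIO

# Index the file lazily: records are parsed on access
record_dict = SeqIO.index("example.fasta", "fasta")
print(record_dict["gi:12345678"])  # use any record ID
# or load all records into memory at once
record_dict = SeqIO.to_dict(SeqIO.parse("example.fasta", "fasta"))
print(record_dict["gi:12345678"])  # use any record ID
```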
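
If pulling in Biopython only for this is undesirable, the single-line assumption can also be dropped with a few lines of standard-library code. The sketch below is only an illustration: `read_fasta` is a hypothetical helper, and the header shown in the usage note is an assumed example of what the sequence file might contain.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record ID: sequence}, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split(maxsplit=1)
            name = tokens[0] if tokens else ""
            chunks[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks[name].append(line)
    # Concatenate the pieces, so it does not matter how the sequence was wrapped.
    return {key: "".join(parts) for key, parts in chunks.items()}
```

For example, `read_fasta(open("example.fasta"))` returns the full sequence for each record whether it was written on one line or wrapped over several.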

Should we allow users to query a different context length? I don't believe this is a must. However, we should think about cDNA/transcript context; this requires careful consideration (does it affect the data model? should it be integrated into data annotation? etc.) and will be handled in a separate issue in due time.
Change modification color to primary green (and related docs).


eboileau added type:enhancement New feature or request needs:refactoring Code smell labels Aug 20, 2024

eboileau added this to the Features milestone Aug 20, 2024

eboileau self-assigned this Aug 20, 2024
