FASTA files were all prepared ad hoc, for human and mouse only, copied to the server, and permissions were set by hand. Ideally, this wrangling
should be integrated into the AssemblyService (cf. FileService) and the CLI, and run when creating a new assembly for the current version. One problem, though, is that we need bgzip for compression and samtools for indexing.
I took some time to figure out how best to do this without adding too many dependencies. The options I considered:
1. Install samtools from the Debian package repositories, but the packaged 1.16.1 is far behind the current 1.21.
2. Install samtools in the Dockerfile, but this adds quite a few dependencies and doesn't work for development or Jenkins.
3. Take the samtools binary from an installation-free package of scripts and precompiled binaries, e.g. Bwakit, but this is untested and I don't know whether bgzip is included.
4. Use alternatives, e.g. pyfaidx. I prefer to avoid this for obvious reasons. Biopython might work, but it's a big package...
In the end it was easy: CrossMap, which is installed as a Python dependency, itself depends on pysam.
I didn't know this, but it looks like the wheel contains everything (otherwise the htslib source code can be compiled),
so samtools does not need to be installed separately! The only problem is the API, which is not well documented.
I'm using pysam.tabix_compress for bgzip. The resulting file is larger and the number of blocks is slightly different
from what I had previously generated with samtools/bgzip, but this seems to work anyway.
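A minimal sketch of the compress-and-index step using only pysam, assuming the input is an uncompressed FASTA file. The helper names `derived_paths` and `compress_and_index` are hypothetical and not part of the code base; `pysam.tabix_compress` is the call mentioned above, and `pysam.faidx` is pysam's wrapper around `samtools faidx`, which for a bgzipped FASTA writes a `.fai` and a `.gzi` file next to it.

```python
from pathlib import Path


def derived_paths(fasta: Path) -> dict:
    """Return the files expected next to the input: BGZF file and faidx indexes."""
    gz = fasta.with_name(fasta.name + ".gz")
    return {
        "bgzf": gz,
        "fai": gz.with_name(gz.name + ".fai"),
        "gzi": gz.with_name(gz.name + ".gzi"),
    }


def compress_and_index(fasta: Path) -> dict:
    """Compress a FASTA file to BGZF and index it (hypothetical helper)."""
    # Imported lazily: pysam is assumed to be present as a CrossMap dependency.
    import pysam

    paths = derived_paths(fasta)
    # BGZF compression, equivalent in purpose to running `bgzip`.
    pysam.tabix_compress(str(fasta), str(paths["bgzf"]), force=True)
    # Same as `samtools faidx file.fa.gz`.
    pysam.faidx(str(paths["bgzf"]))
    return paths
```

Calling `compress_and_index(Path("GRCh38.fa"))` would be expected to leave `GRCh38.fa.gz`, `GRCh38.fa.gz.fai` and `GRCh38.fa.gz.gzi` in the same directory; the file name here is only an example.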
modification_api.get_genomic_sequence_context needs refactoring, and is currently not really "testable".
It currently assumes that pybedtools outputs a FASTA file with the sequence on a single line, no matter how long it is, or that the requested sequence fits on the second line. It would be wiser to read this file with a proper FASTA parser, e.g.
from Bio import SeqIO
record_dict = SeqIO.index("example.fasta", "fasta")
print(record_dict["gi:12345678"])  # use any record ID
# or
record_dict = SeqIO.to_dict(SeqIO.parse("example.fasta", "fasta"))
print(record_dict["gi:12345678"])  # use any record ID
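If pulling in Biopython only for this turns out to be too heavy, the same robustness can be had with a few lines of standard-library Python. The following is a sketch, not the current implementation: `read_fasta` is a hypothetical helper that joins wrapped sequence lines, so it no longer matters whether the sequence is on one line or many.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record ID: sequence}, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Record ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}
```

Usage, assuming a FASTA file written by pybedtools at a hypothetical path `out.fa`:

```python
with open("out.fa") as handle:
    sequences = read_fasta(handle)
```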
Should we allow users to query a different context length? I don't believe this is a must. However, we should think about cDNA/transcript context. This requires careful consideration (does it affect the data model? should it be integrated into data annotation? etc.) and will be handled in a separate issue in due time.
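If a configurable context length were ever allowed, the genomic case reduces to widening the interval and clipping it to the chromosome. A small sketch, assuming 0-based half-open (BED-style) coordinates; `context_interval` and its parameters are hypothetical and do not reflect the current API.

```python
from typing import Tuple


def context_interval(
    start: int, end: int, context: int, chrom_length: int
) -> Tuple[int, int]:
    """Widen [start, end) by `context` bases on each side, clipped to the chromosome."""
    if context < 0:
        raise ValueError("context must be non-negative")
    if not 0 <= start < end <= chrom_length:
        raise ValueError("interval must lie within the chromosome")
    return max(0, start - context), min(chrom_length, end + context)
```

Near chromosome ends the returned interval is shorter than `2 * context + 1`, so callers that need a fixed-width context would have to pad or flag such sites.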
Change modification color to primary green (and related docs).
pysam=0.22.1 (wraps htslib/samtools/bcftools 1.18).