Releases: B-UMMI/chewBBACA
v3.5.3
- Fixed issue in the PrepExternalSchema module related to reading empty FASTA files after attempting to translate FASTA files from external schemas that contained no valid alleles. This issue did not affect the end result because the PrepExternalSchema module would detect that no alleles could be translated and skip the next steps for that locus. However, not reading empty FASTA files avoids a warning raised by Biopython that could lead to errors in future releases.
- Add support for more recent versions of NumPy, SciPy, and Pandas (these dependencies were previously pinned to older versions due to past issues installing Pandas).
- Drop support for Python<=3.9. chewBBACA now requires Python>=3.10.
v3.5.2
Fix the following bugs introduced in v3.5.1:
- Fixed issue in the PrepExternalSchema module related to the addition of loci prefixes to sequence headers when the original schema did not use prefixes (e.g., schemas from cgMLST.org).
- Fixed issue in the PrepExternalSchema module related to saving some alleles to the final FASTA files in the incorrect orientation.
v3.5.1
chewBBACA no longer checks if input files have unique basename prefixes shorter than 30 characters. In the past, this was performed to ensure that sequence identifiers did not exceed the character limit (50 characters) enforced by BLAST when creating a database. The main changes to file name processing are the following:
- chewBBACA uses the file basename without the file extension as unique identifier (e.g. GCF_008632635.1.fasta is converted to GCF_008632635.1), instead of trying to determine the shortest unique prefix that can be used to identify each input file. It is still necessary for each file to have a unique identifier after the removal of the file extension (e.g. GCF_008632635.1.fasta and GCF_008632635.1.fna have different file extensions but the same identifier after removing the file extension, which is not allowed).
- The CreateSchema module uses the input file basenames without the file extension to define the identifiers for the loci in the created schemas (e.g. loci initially identified in the genomes GCF_008632635.1.fasta and GCA_000006785.2_ASM678v2.fasta are named as GCF_008632635.1-proteinN.fasta and GCA_000006785.2_ASM678v2-proteinN.fasta, respectively). We still recommend using short and unique file names without special characters (e.g.: !@#?$^*()+) for conciseness and to avoid potential issues.
- The AlleleCall module accepts and uses the new loci identifier format used by the CreateSchema module. The input genome or CDS files can also have basenames of any length as long as the basename without the file extension for each input file is unique. The output files created by the AlleleCall module use the full unique basenames (e.g. for the genome GCA_000006785.2_ASM678v2.fasta, the genome identifier used in the output files will be GCA_000006785.2_ASM678v2, instead of GCA_000006785 used up until chewBBACA v3.5.0).
- The PrepExternalSchema module accepts schemas containing loci FASTA files with basenames longer than 30 characters.
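The identifier rule described in the list above can be sketched in a few lines of Python. This is only an illustration of the rule, not chewBBACA's own code: the function name map_identifiers is hypothetical, and the sketch assumes that "file extension" means the final suffix of the file name.

```python
from collections import defaultdict
from pathlib import Path


def map_identifiers(file_paths):
    """Map each input file to its basename without the final file extension.

    Raises ValueError if two files share the same identifier after the
    extension is removed (hypothetical helper, for illustration only).
    """
    groups = defaultdict(list)
    for path in file_paths:
        # Path.stem drops only the last suffix:
        # GCF_008632635.1.fasta -> GCF_008632635.1
        groups[Path(path).stem].append(str(path))

    duplicated = {ident: files for ident, files in groups.items() if len(files) > 1}
    if duplicated:
        raise ValueError(f"Non-unique identifiers: {duplicated}")

    return {files[0]: ident for ident, files in groups.items()}
```

With this rule, GCF_008632635.1.fasta and GCF_008632635.1.fna in the same input set raise an error, because both reduce to GCF_008632635.1.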
Additionally, the CDS identifiers are converted to a different format (lcl|SEQ1, lcl|SEQ2...lcl|SEQN) before creating a BLAST database with makeblastdb and the -parse_seqids option. This avoids issues caused by some sequence identifiers being interpreted and modified (e.g. interpreted as PDB Chain IDs) when creating a database, which resulted in errors when a modified identifier no longer matched the original identifier. This change made it possible to remove the check that verified that unique prefixes were not modified by BLAST during database creation.
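A minimal sketch of this kind of identifier conversion is shown below. It is not taken from chewBBACA: relabel_records and to_fasta are hypothetical helpers, and records are assumed to be simple (identifier, sequence) tuples. The reverse mapping is what allows BLAST results to be translated back to the original identifiers.

```python
def relabel_records(records):
    """Rename records to lcl|SEQ1..lcl|SEQN and keep a reverse mapping.

    records: iterable of (identifier, sequence) tuples (assumed format).
    Returns the renamed records and a dict {new_id: original_id}.
    """
    renamed = []
    reverse = {}
    for index, (identifier, sequence) in enumerate(records, start=1):
        new_id = f"lcl|SEQ{index}"
        reverse[new_id] = identifier
        renamed.append((new_id, sequence))
    return renamed, reverse


def to_fasta(records):
    """Serialize (identifier, sequence) tuples as FASTA text."""
    return "".join(f">{identifier}\n{sequence}\n" for identifier, sequence in records)
```

The FASTA text produced from the renamed records would be the input for makeblastdb, and the reverse dictionary restores the original identifiers when parsing the alignments.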
Additional changes:
- Added the --output-masked option to the AlleleCall module to create a TSV file with the masked profiles (INF- prefixes are removed and the NIPH, NIPHEM, ASM, ALM, PLOT3, PLOT5, LOTSC, and PAMA classes are converted to 0).
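The masking rule can be illustrated with a short sketch. This is an assumption-laden example rather than the AlleleCall implementation: mask_value and mask_profile_line are hypothetical names, and each profile line is assumed to hold a sample identifier followed by tab-separated classifications.

```python
# Classes converted to 0 according to the release note above.
MASKED_CLASSES = {"NIPH", "NIPHEM", "ASM", "ALM", "PLOT3", "PLOT5", "LOTSC", "PAMA"}


def mask_value(value):
    """Mask a single classification (hypothetical helper)."""
    if value in MASKED_CLASSES:
        return "0"
    if value.startswith("INF-"):
        # Keep only the allele identifier: INF-12 -> 12
        return value[len("INF-"):]
    return value


def mask_profile_line(line, sep="\t"):
    """Mask every classification in one profile line, keeping the sample ID."""
    sample, *calls = line.rstrip("\n").split(sep)
    return sep.join([sample] + [mask_value(call) for call in calls])
```

For example, the line `sampleA	1	INF-12	NIPH	LOTSC	7` becomes `sampleA	1	12	0	0	7`.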
v3.5.0
Added the ComputeMSA module to compute MSAs from allele calling results or from a folder containing FASTA files. The ComputeMSA module includes the following functionalities:
- Compute loci, sample and complete MSAs based on the allelic profiles determined by chewBBACA (e.g. at the wg/cgMLST level). Gap sequences (the character used to represent gaps is -) are added whenever a locus was not identified in a sample (e.g. when working at the wgMLST level).
- Compute an MSA for each FASTA file in a folder (a convenient way to run MAFFT to compute MSAs).
- MSAs can be computed both at the protein and DNA level (i.e. by converting protein MSAs back to DNA).
- The --output-variable option identifies the variable positions (SNVs) and creates MSAs only for those positions. When determining variable positions, positions with gaps or ambiguous bases can be excluded (--gaps exclude and --ambiguous exclude) or included (--gaps ignore and --ambiguous ignore) in the MSA if the sequences have other variable non-gap and non-ambiguous nucleotides or amino acids.
- The SchemaEvaluator and AlleleCallEvaluator modules use the ComputeMSA module to compute the loci MSAs (SchemaEvaluator) and the complete MSA used by FastTree to compute a tree (AlleleCallEvaluator).
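The idea behind selecting variable positions can be sketched as follows. This is a simplified reading of the description above for DNA alignments only, not the ComputeMSA code: the function names are hypothetical, "exclude" is assumed to drop any column containing a gap or ambiguous base, and "ignore" is assumed to keep such a column when its remaining standard bases still differ.

```python
def variable_positions(sequences, gaps="ignore", ambiguous="ignore"):
    """Return the indices of variable columns in an aligned set of DNA sequences.

    A column is variable when it contains more than one distinct standard
    base (A, C, G, T). Simplified sketch; sequences must have equal length.
    """
    standard = set("ACGT")
    positions = []
    for index, column in enumerate(zip(*sequences)):
        chars = {char.upper() for char in column}
        has_gap = "-" in chars
        has_ambiguous = bool(chars - standard - {"-"})
        if gaps == "exclude" and has_gap:
            continue
        if ambiguous == "exclude" and has_ambiguous:
            continue
        if len(chars & standard) > 1:
            positions.append(index)
    return positions


def extract_columns(sequences, positions):
    """Build the reduced alignment that contains only the selected columns."""
    return ["".join(seq[i] for i in positions) for seq in sequences]
```

For the alignment ACGTA, AT-TA, ATCTG, ACGTN, the default settings select columns 1, 2 and 4, while gaps="exclude" drops column 2 and ambiguous="exclude" drops column 4.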
v3.4.2
- Fixed issue in the ExtractCgMLST module related to using the deprecated Plotly titlefont attribute. Support for the titlefont attribute was dropped in Plotly v6.0.0. The ExtractCgMLST module would exit with an error and fail to generate the HTML plot if Plotly >= v6.0.0 was installed.
- The LoadSchema module no longer queries UniProt's SPARQL endpoint to retrieve annotations. The current implementation was failing to retrieve annotations. Users should use the UniprotFinder module or the annotation functionalities provided by Schema Refinery to annotate the schema loci and create a TSV file with annotations to submit to Chewie-NS.
v3.4.1
- Changed the -max_target_seqs value used by the select_representatives function to the square of the number of potential new representative alleles or to a minimum of 100. This change tries to fix an issue where BLASTp would not report the self-alignment for some alleles because it reached the limit of the number of alignments to report before reporting all self-alignments (e.g. for very large datasets, the number of potential new representatives may lead to a number of alignments that exceeds the value passed to -max_target_seqs).
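The rule amounts to taking the larger of the squared candidate count and 100. The snippet below is only a sketch of that arithmetic; the function name compute_max_target_seqs is hypothetical and is not part of chewBBACA.

```python
def compute_max_target_seqs(num_candidates, minimum=100):
    """Return the square of the number of candidates, but never less than minimum."""
    return max(num_candidates ** 2, minimum)


def blast_limit_args(num_candidates):
    """Build the command-line fragment that sets the BLAST alignment limit."""
    return ["-max_target_seqs", str(compute_max_target_seqs(num_candidates))]
```

With 5 candidates the value stays at the minimum of 100, while 250 candidates give 62500, which leaves room for every candidate to align against every other candidate and itself.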
v3.4.0
v3.3.10
v3.3.9
v3.3.8
- Added support for genetic codes 2, 3, 5, 6, 9, 10, 12-16, 21-25 (complete list available here). Values passed to --t, --translation-table are ignored if a training file is used. The CreateSchema, AlleleCall and PrepExternalSchema modules use the genetic code used to create the training file.
- Fixed issue related to data about CDSs close to the contig tips not being available if input FASTA files contain CDSs and --cds is used.
- Fixed issue in the AlleleCallEvaluator module related to entirely numeric columns.