Releases: B-UMMI/chewBBACA

v3.5.3

10 Mar 23:16

  • Fixed an issue in the PrepExternalSchema module where empty FASTA files were read after attempting to translate FASTA files from external schemas that contained no valid alleles. This issue did not affect the end result because the PrepExternalSchema module would detect that no alleles could be translated and skip the next steps for that locus. However, not reading empty FASTA files avoids a warning raised by Biopython that could lead to errors in future releases.

  • Added support for more recent versions of NumPy, SciPy, and pandas (these dependencies were previously pinned to older versions due to past issues installing pandas).

  • Drop support for Python<=3.9. chewBBACA now requires Python>=3.10.

v3.5.2

03 Mar 16:36
6f48a82

Fix the following bugs introduced in v3.5.1:

  • Fixed issue in the PrepExternalSchema module related to the addition of loci prefixes to sequence headers when the original schema did not use prefixes (e.g., schemas from cgMLST.org).
  • Fixed issue in the PrepExternalSchema module related to saving some alleles to the final FASTA files in the incorrect orientation.

v3.5.1

08 Jan 16:01
b0c1e37

chewBBACA no longer checks if input files have unique basename prefixes shorter than 30 characters. In the past, this was performed to ensure that sequence identifiers did not exceed the character limit (50 characters) enforced by BLAST when creating a database. The main changes to file name processing are the following:

  • chewBBACA uses the file basename without the file extension as unique identifier (e.g. GCF_008632635.1.fasta is converted to GCF_008632635.1), instead of trying to determine the shortest unique prefix that can be used to identify each input file. It is still necessary for each file to have a unique identifier after the removal of the file extension (e.g. GCF_008632635.1.fasta and GCF_008632635.1.fna have different file extensions but the same identifier after removing the file extension, which is not allowed).
  • The CreateSchema module uses the input file basenames without the file extension to define the identifiers for the loci in the created schemas (e.g. loci initially identified in the genomes GCF_008632635.1.fasta and GCA_000006785.2_ASM678v2.fasta are named as GCF_008632635.1-proteinN.fasta and GCA_000006785.2_ASM678v2-proteinN.fasta, respectively). We still recommend using short and unique file names without special characters (e.g.: !@#?$^*()+) for conciseness and to avoid potential issues.
  • The AlleleCall module accepts and uses the new loci identifier format used by the CreateSchema module. The input genome or CDS files can also have basenames of any length as long as the basename without the file extension for each input file is unique. The output files created by the AlleleCall module use the full unique basenames (e.g. for the genome GCA_000006785.2_ASM678v2.fasta, the genome identifier used in the output files will be GCA_000006785.2_ASM678v2, instead of GCA_000006785 used up until chewBBACA v3.5.0).
  • The PrepExternalSchema module accepts schemas containing loci FASTA files with basenames longer than 30 characters.
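
The identifier rule described above can be sketched as follows. This is a minimal illustration, not chewBBACA's actual code: the function names `basename_identifiers` and `find_duplicate_identifiers` are hypothetical, and the sketch assumes that only the final file extension is removed.

```python
from collections import defaultdict
from pathlib import PurePath


def basename_identifiers(paths):
    """Map each input path to its basename without the final file extension."""
    # Assumption: only the last suffix is stripped, so
    # "GCF_008632635.1.fasta" becomes "GCF_008632635.1".
    return {path: PurePath(path).stem for path in paths}


def find_duplicate_identifiers(paths):
    """Return the identifiers that are shared by more than one input file."""
    groups = defaultdict(list)
    for path, identifier in basename_identifiers(paths).items():
        groups[identifier].append(path)
    return {ident: files for ident, files in groups.items() if len(files) > 1}


if __name__ == "__main__":
    inputs = ["GCF_008632635.1.fasta", "GCF_008632635.1.fna"]
    # Both files collapse to the same identifier, which is not allowed.
    print(find_duplicate_identifiers(inputs))
```

Running a check of this kind before starting an analysis makes it easy to spot files that differ only in their extension.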

Additionally, the CDS identifiers are converted to a different format (lcl|SEQ1, lcl|SEQ2...lcl|SEQN) before a BLAST database is created with makeblastdb and the -parse_seqids option. This avoids issues where some sequence identifiers are interpreted and modified during database creation (e.g. interpreted as PDB chain IDs), which resulted in errors when a modified identifier no longer matched the original identifier. This change made it possible to remove the check that verified that unique prefixes were not modified by BLAST during database creation.
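
The relabelling idea can be illustrated with a short sketch. It is not the implementation used by chewBBACA: `relabel_for_blast` and `restore_identifiers` are hypothetical helpers, records are plain `(identifier, sequence)` tuples, and the sketch assumes that identifiers in the tabular BLAST results appear exactly as they were written to the FASTA file.

```python
def relabel_for_blast(records):
    """Replace sequence identifiers with lcl|SEQ1, lcl|SEQ2...lcl|SEQN.

    Returns the relabelled records and a dict that maps each new
    identifier back to the original one.
    """
    relabelled = []
    new_to_original = {}
    for index, (identifier, sequence) in enumerate(records, start=1):
        new_id = f"lcl|SEQ{index}"
        new_to_original[new_id] = identifier
        relabelled.append((new_id, sequence))
    return relabelled, new_to_original


def restore_identifiers(blast_rows, new_to_original):
    """Translate the query and subject columns of tabular rows back."""
    # Unknown identifiers are left untouched rather than dropped.
    return [
        [new_to_original.get(row[0], row[0]),
         new_to_original.get(row[1], row[1]),
         *row[2:]]
        for row in blast_rows
    ]


if __name__ == "__main__":
    records = [("1ABC_A", "ATGAAATAA"), ("genome-protein2", "ATGCCCTAA")]
    relabelled, mapping = relabel_for_blast(records)
    print(relabelled)
    print(restore_identifiers([["lcl|SEQ1", "lcl|SEQ2", "98.5"]], mapping))
```

Because the database only ever sees the neutral lcl|SEQN identifiers, the original identifiers can contain patterns that BLAST would otherwise parse.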

Additional changes:

  • Added the --output-masked option to the AlleleCall module to create a TSV file with the masked profiles (INF- prefixes are removed and the NIPH, NIPHEM, ASM, ALM, PLOT3, PLOT5, LOTSC, and PAMA classes are converted to 0).
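
A minimal sketch of the masking rule described above, for readers who want to apply it to existing results. The helper names are hypothetical, and the sketch assumes that the first field of each row is the sample identifier and that the remaining fields are classifications.

```python
# Classes converted to 0, as listed in the release note above.
MASKED_CLASSES = {"NIPH", "NIPHEM", "ASM", "ALM", "PLOT3", "PLOT5", "LOTSC", "PAMA"}


def mask_value(value):
    """Mask a single classification from an allelic profile."""
    if value.startswith("INF-"):
        # Inferred alleles keep the allele identifier without the prefix.
        return value[len("INF-"):]
    if value in MASKED_CLASSES:
        return "0"
    return value


def mask_profile(row):
    """Mask one profile row; the first field is assumed to be the sample ID."""
    return [row[0]] + [mask_value(value) for value in row[1:]]


if __name__ == "__main__":
    print(mask_profile(["sampleA", "INF-3", "ASM", "5", "PLOT3"]))
```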

v3.5.0

05 Dec 20:30

Added the ComputeMSA module to compute MSAs from allele calling results or from a folder containing FASTA files. The ComputeMSA module includes the following functionalities:

  • Compute loci, sample and complete MSAs based on the allelic profiles determined by chewBBACA (e.g. at the wg/cgMLST level). Gap sequences (the character used to represent gaps is -) are added whenever a locus was not identified in a sample (e.g. when working at the wgMLST level).
  • Compute an MSA for each FASTA file in a folder (a convenient way to run MAFFT to compute MSAs).
  • MSAs can be computed both at the protein and DNA level (i.e. by converting protein MSAs back to DNA).
  • The --output-variable option identifies the variable positions (SNVs) and creates MSAs only for those positions. When determining variable positions, positions with gaps or ambiguous bases can be excluded (--gaps exclude and --ambiguous exclude) or included (--gaps ignore and --ambiguous ignore) in the MSA if the sequences have other variable non-gap and non-ambiguous nucleotides or amino acids.
  • The SchemaEvaluator and AlleleCallEvaluator modules use the ComputeMSA module to compute the loci MSAs (SchemaEvaluator) and the complete MSA used by FastTree to compute a tree (AlleleCallEvaluator).
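
One possible reading of the variable-position logic described above is sketched below for DNA alignments. This is an assumption-laden illustration rather than the ComputeMSA implementation: the function names are hypothetical, "exclude" is taken to mean that a column containing a gap or ambiguous base is dropped, and "ignore" is taken to mean that such characters are disregarded when deciding whether the remaining bases vary.

```python
# IUPAC ambiguity codes for nucleotides (DNA only; proteins would need a different set).
AMBIGUOUS = set("RYSWKMBDHVN")


def variable_positions(sequences, gaps="exclude", ambiguous="exclude"):
    """Return the 0-based indices of variable columns in aligned DNA sequences."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("Aligned sequences must have the same length.")
    positions = []
    for index, column in enumerate(zip(*sequences)):
        chars = {char.upper() for char in column}
        has_gap = "-" in chars
        has_ambiguous = bool(chars & AMBIGUOUS)
        if (has_gap and gaps == "exclude") or (has_ambiguous and ambiguous == "exclude"):
            continue
        # Only unambiguous, non-gap characters decide whether a column varies.
        informative = chars - {"-"} - AMBIGUOUS
        if len(informative) > 1:
            positions.append(index)
    return positions


def subset_columns(sequences, positions):
    """Build a reduced alignment that keeps only the selected columns."""
    return ["".join(seq[i] for i in positions) for seq in sequences]


if __name__ == "__main__":
    msa = ["ACGT-A", "ACTTAN", "ATGTAC"]
    kept = variable_positions(msa, gaps="ignore", ambiguous="ignore")
    print(kept, subset_columns(msa, kept))
```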

v3.4.2

02 Sep 23:16

  • Fixed issue in the ExtractCgMLST module related to using the deprecated Plotly titlefont attribute. Support for the titlefont attribute was dropped in Plotly v6.0.0. The ExtractCgMLST module would exit with an error and fail to generate the HTML plot if Plotly >= v6.0.0 was installed.

  • The LoadSchema module no longer queries UniProt's SPARQL endpoint to retrieve annotations. The current implementation was failing to retrieve annotations. Users should use the UniprotFinder module or the annotation functionalities provided by Schema Refinery to annotate the schema loci and create a TSV file with annotations to submit to Chewie-NS.
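
For readers maintaining their own Plotly figures, the titlefont change mentioned above amounts to moving the font settings under the axis title. The sketch below works on plain layout dictionaries and does not show the actual fix applied in the ExtractCgMLST module; `migrate_titlefont` is a hypothetical helper.

```python
import copy


def migrate_titlefont(axis):
    """Return a copy of an axis layout dict that uses title.font instead of titlefont."""
    migrated = copy.deepcopy(axis)
    font = migrated.pop("titlefont", None)
    if font is None:
        return migrated
    title = migrated.get("title", {})
    if isinstance(title, str):
        # The title text moves into the nested title object.
        title = {"text": title}
    # Keep an existing title.font if one is already defined.
    title.setdefault("font", font)
    migrated["title"] = title
    return migrated


if __name__ == "__main__":
    old_axis = {"title": "Number of loci", "titlefont": {"size": 20}}
    print(migrate_titlefont(old_axis))
```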

v3.4.1

31 Jul 12:54

  • Changed the -max_target_seqs value used by the select_representatives function to the square of the number of potential new representative alleles or to a minimum of 100. This change tries to fix an issue where BLASTp would not report the self-alignment for some alleles because it reached the limit of the number of alignments to report before reporting all self-alignments (e.g. for very large datasets, the number of potential new representatives may lead to a number of alignments that exceeds the value passed to -max_target_seqs).
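
The rule above reduces to a one-line calculation. The sketch below is an illustration with hypothetical function names, not the code of the select_representatives function; it only shows how the value grows with the number of candidates and how it could be passed to BLASTp as a command-line argument.

```python
def max_target_seqs(num_candidates, minimum=100):
    """Square of the number of candidate representatives, but never below the minimum."""
    return max(num_candidates ** 2, minimum)


def max_target_seqs_args(num_candidates):
    """Build the argument pair to append to a BLASTp command (illustrative only)."""
    return ["-max_target_seqs", str(max_target_seqs(num_candidates))]


if __name__ == "__main__":
    for count in (5, 10, 11, 250):
        print(count, max_target_seqs_args(count))
```

With 10 or fewer candidates the minimum of 100 applies; above that, the limit scales quadratically so that every candidate can in principle align against every other candidate and itself.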

v3.4.0

24 Jun 17:48

  • Add the GetAlleles module to create FASTA files containing the alleles identified by the AlleleCall module.

v3.3.10

06 Aug 16:17

  • Fixed issue in the UniprotFinder module related to TrEMBL and Swiss-Prot IDs being parsed by BLAST when the qacc and sacc format specifiers were used with -outfmt 6. Switched back to the qseqid and sseqid format specifiers.

v3.3.9

16 Jul 14:48

  • Fixed an issue related to sequence IDs interpreted by BLAST as PDB chain IDs.

  • Fixed an issue related to CDS counting when gene prediction returns no CDSs for one or more inputs.

v3.3.8

02 Jul 12:17

  • Added support for genetic codes 2, 3, 5, 6, 9, 10, 12-16, 21-25 (complete list available here). Values passed to --t, --translation-table are ignored if a training file is used. The CreateSchema, AlleleCall and PrepExternalSchema modules use the genetic code used to create the training file.

  • Fixed issue related to data about CDSs close to the contig tips not being available if input FASTA files contain CDSs and --cds is used.

  • Fixed issue in the AlleleCallEvaluator module related to entirely numeric columns.