Commit 4cf89f5

Merge pull request #99 from inab/v1.5.0_release
v1.5.0 release
2 parents b335b7d + f97d5ed commit 4cf89f5

45 files changed

+1068
-95
lines changed

.readthedocs.yaml

+35
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
# Read the Docs configuration file for Sphinx projects
2+
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
3+
4+
# Required
5+
version: 2
6+
7+
# Set the OS, Python version and other tools you might need
8+
build:
9+
os: ubuntu-22.04
10+
tools:
11+
python: "3.11"
12+
# You can also specify other tool versions:
13+
# nodejs: "20"
14+
# rust: "1.70"
15+
# golang: "1.20"
16+
17+
# Build documentation in the "docs/" directory with Sphinx
18+
sphinx:
19+
configuration: docs/source/conf.py
20+
# You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
21+
# builder: "dirhtml"
22+
# Fail on all warnings to avoid broken references
23+
# fail_on_warning: true
24+
25+
# Optionally build your docs in additional formats such as PDF and ePub
26+
# formats:
27+
# - pdf
28+
# - epub
29+
30+
# Optional but recommended, declare the Python requirements required
31+
# to build your documentation
32+
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
33+
python:
34+
install:
35+
- requirements: docs/requirements.txt

AUTHORS

+12-8
@@ -1,14 +1,18 @@
1-
** Authors **
1+
* Authors*
2+
Victor Fernández Rodríguez.
3+
INB Hub. Life Sciences Department.
4+
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
5+
e-mail: victor.fernandez _at_ bsc.es
26

3-
Salvador Capella-Gutierrez.
4-
Comparative Genomics Group. Bioinformatics and Genomics Department.
5-
Centre for Genomic Regulation. Barcelona, Spain.
6-
e-mail: scapella _at_ crg.es
7+
Salvador Capella-Gutiérrez.
8+
INB Hub. Life Sciences Department.
9+
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
10+
e-mail: salvador.capella _at_ bsc.es
711

812
Toni Gabaldón.
9-
Comparative Genomics Group. Bioinformatics and Genomics Department.
10-
Centre for Genomic Regulation. Barcelona, Spain.
11-
e-mail: tgabaldon _at_ crg.es
13+
Comparative Genomics Group. Life Sciences Department.
14+
Barcelona Supercomputing Center (BSC). Barcelona, Spain.
15+
e-mail: toni.gabaldon _at_ bsc.es
1216

1317
** Authors (until trimAl v1.1) **
1418

docs/requirements.txt

+3
@@ -0,0 +1,3 @@
1+
sphinxcontrib-email
2+
sphinx_copybutton
3+
sphinx-rtd-theme

docs/source/_static/benchmarkings.pdf

3.44 MB
Binary file not shown.

docs/source/_static/gappyout_plot.png

12.4 KB

docs/source/_static/strict_plot.png

15.6 KB

docs/source/_static/trimal_doc.css

+5
@@ -0,0 +1,5 @@
1+
2+
3+
.wy-nav-content {
4+
max-width: 1200px;
5+
}

docs/source/algorithms.rst

+174
@@ -0,0 +1,174 @@
1+
Trimming algorithms
2+
***********************
3+
4+
Manual methods
5+
========================
6+
7+
Custom columns
8+
------------------------
9+
This algorithm eliminates a specified set of columns defined by the user. The set of columns to be removed should be provided as individual column numbers separated by commas, and/or as blocks of consecutive columns indicated by the first and last column numbers separated by a hyphen. In the following example::
10+
11+
-selectcols { n,l,m-k }
12+
13+
where n and l are interpreted as single column numbers, while m-k is a range of columns (from column m to column k, both included) to be deleted. Note that column numbering starts from 0. For instance, the command::
14+
15+
-selectcols { 2,7,20-25,80-100 }
16+
17+
will remove columns 2 and 7, along with two blocks of columns ranging from column 20 to 25 and 80 to 100, respectively.
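
As an illustration only, the column specification above could be expanded as in the following Python sketch. The function names ``parse_selectcols`` and ``drop_columns`` are hypothetical and are not part of trimAl; the sketch only mirrors the rules stated in the text (comma-separated items, inclusive ranges, numbering from 0).

```python
def parse_selectcols(spec):
    """Expand a spec such as "2,7,20-25" into a sorted list of 0-based column indices."""
    columns = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            first, last = (int(part) for part in token.split("-"))
            if first > last:
                raise ValueError(f"invalid range: {token}")
            columns.update(range(first, last + 1))  # both ends are included
        else:
            columns.add(int(token))
    return sorted(columns)


def drop_columns(alignment, spec):
    """Return the aligned sequences without the columns named in spec."""
    removed = set(parse_selectcols(spec))
    return ["".join(c for i, c in enumerate(seq) if i not in removed) for seq in alignment]
```

For the specification ``2,7,20-25,80-100`` this yields 29 column indices: two single columns, six from the first block and twenty-one from the second.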
18+
19+
20+
Threshold-based trimming
21+
------------------------
22+
The user can choose to remove all columns that do not meet a specified threshold or a combination of thresholds. The gap threshold (*-gt*) and similarity threshold (*-st*) represent the minimum values of the respective scores explained above and can be used individually or in combination. Like the scores they refer to, both thresholds range from 0 to 1.
23+
24+
trimAl provides two shortcuts to commonly used thresholds: *-nogaps* (equivalent to *-gt 1*), which deletes all columns with at least one gap, and *-noallgaps*, which removes columns composed solely of gaps.
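
A minimal sketch of gap-threshold filtering follows. It assumes that the gap score of a column is the fraction of sequences without a gap in that column, which is consistent with *-gt 1* being equivalent to *-nogaps*; the function names are hypothetical and the code is not taken from trimAl.

```python
GAP = "-"


def gap_scores(alignment):
    """Assumed gap score: fraction of sequences that have a residue (no gap) in each column."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [sum(seq[c] != GAP for seq in alignment) / n_seqs for c in range(n_cols)]


def columns_passing_gap_threshold(alignment, gt):
    """Indices of the columns whose gap score is at least gt (0 <= gt <= 1)."""
    return [c for c, score in enumerate(gap_scores(alignment)) if score >= gt]


def columns_not_all_gaps(alignment):
    """Indices of the columns that contain at least one residue."""
    return [c for c, score in enumerate(gap_scores(alignment)) if score > 0]
```

With the toy alignment ``["AC-T", "A--T", "AG-T"]``, a threshold of 1 keeps only the gap-free columns 0 and 3, while the all-gaps filter drops only column 2.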
25+
26+
In addition, the user can set a conservation threshold (*-cons*), indicating the minimum percentage of columns from the input alignment that should be retained in the trimmed alignment. This threshold is defined between 0 and 100 and takes precedence over all other thresholds. If any other threshold would result in a trimmed alignment with fewer columns than specified by the conservation threshold, trimAl adds more columns to meet the conservation threshold. These columns are added based on their scores, with a preference for columns with higher scores. In the case of equal scores, columns adjacent to already selected column-blocks and closer to the center of the alignment are added first,
27+
prioritizing the extension of longer and central blocks.
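
The following sketch shows one simplified way the conservation threshold could restore columns. It is an assumption-laden illustration, not trimAl's implementation: the required number of columns is rounded up, dropped columns are ranked by score, and ties are broken only by distance to the alignment center (the preference for columns adjacent to already selected blocks is omitted). The name ``apply_conservation`` is hypothetical.

```python
import math


def apply_conservation(scores, kept, cons_percent):
    """Add back the best-scoring dropped columns until cons_percent of columns is kept."""
    total = len(scores)
    needed = math.ceil(total * cons_percent / 100)  # rounding up is an assumption
    kept = set(kept)
    center = (total - 1) / 2
    # Higher scores first; on equal scores, columns closer to the center first.
    candidates = sorted(
        (c for c in range(total) if c not in kept),
        key=lambda c: (-scores[c], abs(c - center)),
    )
    for c in candidates:
        if len(kept) >= needed:
            break
        kept.add(c)
    return sorted(kept)
```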
28+
29+
30+
When provided with a set of multiple sequence alignments, trimAl calculates a consistency score for each alignment in the set. Subsequently, the alignment with the highest score is selected. The chosen alignment can undergo various trimming methods, one of which involves removing columns that exhibit lower consistency across the other alignments. To achieve this, the user can utilize the *-ct* parameter to define the minimum values for the consistency score, within the range of 0 to 1. Any columns not meeting this specified value will be removed. Alternatively, the conservation score, as explained previously, can also be employed here. Moreover, it can be used in conjunction with gap and/or similarity methods.
31+
32+
33+
Overlap trimming
34+
------------------------
35+
trimAl can also remove poorly aligned or incomplete sequences considering the rest of
36+
sequences in the MSA. For that purpose, the user has to define two thresholds:
37+
First, the residue overlap threshold (-resoverlap) corresponds to the minimum residue
38+
overlap score for each residue.
39+
Second, the sequence overlap threshold (-seqoverlap) sets up the minimum percentage of
40+
the residues for each sequence that should pass the residue overlap threshold in order to
41+
maintain the sequence in the new alignment. Sequences that do not pass the sequence
42+
overlap threshold will be removed from the alignment. Finally, all columns that only have
43+
gaps in the new alignment will also be removed from the final alignment.
44+
45+
trimAl can effectively eliminate poorly aligned or incomplete sequences while considering the entire multiple sequence alignment (MSA). To achieve this, users need to specify two thresholds:
46+
47+
1. **Residue Overlap Threshold (-resoverlap):** This threshold corresponds to the minimum residue overlap score required for each residue.
48+
49+
2. **Sequence Overlap Threshold (-seqoverlap):** This threshold establishes the minimum percentage of residues within each sequence that must surpass the residue overlap threshold to retain the sequence in the new alignment. Sequences failing to meet this criterion will be excluded from the final alignment.
50+
51+
Additionally, columns exclusively filled with gaps in the new alignment will be systematically removed.
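
The two-threshold logic can be sketched as below. The residue overlap score is not defined in this section, so the sketch assumes, purely for illustration, that the score of a residue is the fraction of sequences that also have a residue in the same column. The function name ``trim_by_overlap`` is hypothetical.

```python
GAP = "-"


def trim_by_overlap(alignment, resoverlap, seqoverlap):
    """Drop sequences with too few well-overlapping residues, then drop all-gap columns.

    resoverlap is a fraction in [0, 1]; seqoverlap is a percentage in [0, 100].
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    # Assumed residue overlap score: share of sequences with a residue in the column.
    occupancy = [sum(seq[c] != GAP for seq in alignment) / n_seqs for c in range(n_cols)]

    kept = []
    for seq in alignment:
        residues = [c for c in range(n_cols) if seq[c] != GAP]
        passing = sum(occupancy[c] >= resoverlap for c in residues)
        if residues and 100 * passing / len(residues) >= seqoverlap:
            kept.append(seq)

    # Remove the columns that contain only gaps in the remaining sequences.
    columns = [c for c in range(n_cols) if any(seq[c] != GAP for seq in kept)]
    return ["".join(seq[c] for c in columns) for seq in kept]
```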
52+
53+
.. _overlap-example-figure:
54+
.. figure:: _static/overlap_example.png
55+
:name: overlap-example
56+
:width: 500px
57+
:align: center
58+
:alt: Overlap example MSA
59+
60+
An example of an alignment trimmed with the overlap method. We have used the same alignment as in :numref:`gappyout-example-figure`. Conserved (grey) and trimmed (white) columns are indicated.
61+
62+
Automated methods
63+
========================
64+
65+
.. _gappyout_method:
66+
Gappyout method
67+
------------------------
68+
This method relies on the gap distribution within the multiple sequence alignment (MSA). Initially, the method calculates gap scores for all columns and sorts the columns by this score, generating a plot of potential gap score thresholds versus the percentage of the alignment below each threshold (see :numref:`gappyout-figure`). Next, for every set of three consecutive points on this plot, trimAl computes the slope between the first and third points, represented by blue lines. After comparing all slopes, trimAl identifies the point with the maximal variation between consecutive slopes, indicated by a vertical red line in :numref:`gappyout-figure`.
69+
70+
71+
.. _gappyout-figure:
72+
.. figure:: _static/gappyout_plot.png
73+
:name: gappyout-plot
74+
:width: 500px
75+
:align: center
76+
:alt: Gappyout plot
77+
78+
Example of an internal trimAl plot showing possible gap score thresholds (y axis) versus percentages of alignment length below that threshold (x axis). Thin blue lines indicate slopes computed by the program. The vertical red line indicates the cut-off point
79+
selected by the gappyout algorithm.
80+
81+
82+
After determining a gap score cut-off point, trimAl removes all columns that do not meet this specified value (see :numref:`gappyout-example-figure`). In practical terms, this method effectively identifies the bimodal distribution of gap scores (columns rich in gaps and columns with fewer gaps) within an alignment. Subsequently, it eliminates the mode associated with a higher concentration of gaps. Our benchmarks indicate that this method efficiently eliminates a significant portion of poorly aligned regions.
83+
84+
85+
.. _gappyout-example-figure:
86+
.. figure:: _static/gappyout_example.png
87+
:name: gappyout-example
88+
:width: 500px
89+
:align: center
90+
:alt: Gappyout example MSA
91+
92+
An example of an alignment trimmed with the gappyout method. Conserved (grey) and trimmed (white) columns are indicated. This figure was generated with trimAl's -htmlout option.
93+
94+
.. _strict_method:
95+
Strict method
96+
------------------------
97+
98+
This method combines gappyout trimming with subsequent trimming based on an automatically selected similarity threshold. To determine the similarity threshold, trimAl utilizes the residue similarity scores distribution from the multiple sequence alignment (MSA). This distribution is transformed to a logarithmic scale (refer to :numref:`strict-figure`), and the residue similarity cutoff is selected as explained below.
99+
100+
.. _strict-figure:
101+
.. figure:: _static/strict_plot.png
102+
:name: strict-plot
103+
:width: 500px
104+
:align: center
105+
:alt: Strict plot
106+
107+
trimAl's internal plot representing similarity score values versus the percentage of the alignment above that value. Vertical blue lines indicate the significant values at 20 and 80 percentiles. The cut-off point is indicated with a red vertical line.
108+
109+
From this similarity distribution, trimAl selects the values at percentiles 20 and 80 of the alignment length (vertical blue lines in :numref:`strict-figure`). The residue similarity threshold (vertical red line in :numref:`strict-figure`) is computed as follows:
110+
111+
.. math::
112+
113+
P_{20} = \log(\text{Simvalue}_{20})
114+
115+
P_{80} = \log(\text{Simvalue}_{80})
116+
117+
SimThreshold = 10^{\,P_{80} + \frac{P_{20} - P_{80}}{10}}
118+
119+
This process is equivalent to establishing upper and lower boundaries for the threshold at percentiles 20 and 80, respectively, of the similarity score distribution in that alignment. The similarity threshold is placed between these two boundaries on the logarithmic scale, at one tenth of their difference above the lower boundary (the similarity at P80).
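
The interpolation can be sketched numerically as follows. The sketch assumes base-10 logarithms and that the interpolated value is mapped back from the logarithmic scale by exponentiation; the function name ``similarity_threshold`` is hypothetical and the code is not taken from trimAl.

```python
import math


def similarity_threshold(sim20, sim80):
    """Interpolate on a log10 scale, one tenth of the way from P80 toward P20."""
    p20 = math.log10(sim20)  # upper boundary (similarity at percentile 20)
    p80 = math.log10(sim80)  # lower boundary (similarity at percentile 80)
    log_threshold = p80 + (p20 - p80) / 10
    return 10 ** log_threshold  # back to the similarity scale (assumed)
```

Under these assumptions the result is a geometric interpolation, equal to ``sim80 * (sim20 / sim80) ** 0.1``, so it always lies between the two boundary values and closer to the lower one.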
120+
121+
This method of setting the similarity threshold has demonstrated optimal performance in our benchmarks. The lower and upper boundaries ensure that the 20% most conserved columns in the alignment are preserved, while the 20% most dissimilar columns are discarded.
122+
123+
The specific similarity threshold will lie between these boundaries depending on the distribution of similarity scores in the alignments. Alignments with steep similarity score curves and significant differences between the most similar and dissimilar columns will set more columns below the threshold. Conversely, alignments with more columns having scores similar to the most-conserved fraction will apply more relaxed cutoffs. However, the removal of a specific column will depend on its context.
124+
125+
126+
Once trimAl has calculated the residue similarity cutoff, the following steps are taken:
127+
128+
1. The :ref:`gappyout method <gappyout_method>` is applied to identify columns that would be deleted with that method.
129+
2. Residues below the similarity cutoff are marked.
130+
3. After applying these filters, trimAl recovers (unmarks) columns that have not passed the gap and/or similarity thresholds but where three of the four most immediate neighboring columns (two on each side) have passed them.
131+
4. Finally, in a last step, trimAl removes all columns that do not fall within a block of at least five consecutive columns unmarked for deletion.
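
Steps 3 and 4 can be sketched on a per-column boolean mask, where ``True`` means the column passed both filters. This is a minimal illustration under two assumptions: neighbors are evaluated on the mask as it was before any recovery, and columns near the alignment ends simply have fewer neighbors. The function names are hypothetical.

```python
def recover_columns(passed):
    """Step 3: unmark a failed column if at least 3 of its 4 nearest neighbors passed."""
    n = len(passed)
    recovered = list(passed)
    for i in range(n):
        if passed[i]:
            continue
        neighbours = [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < n]
        if sum(passed[j] for j in neighbours) >= 3:
            recovered[i] = True
    return recovered


def filter_short_blocks(kept, min_block=5):
    """Step 4: keep only the columns inside runs of at least min_block kept columns."""
    result = [False] * len(kept)
    start = None
    # A trailing False sentinel closes a run that reaches the last column.
    for i, flag in enumerate(list(kept) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_block:
                result[start:i] = [True] * (i - start)
            start = None
    return result
```

Passing a different ``min_block`` value reuses the same block filter for other block sizes.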
132+
133+
134+
.. _strict-example-figure:
135+
.. figure:: _static/strict_example.png
136+
:name: strict-example
137+
:width: 500px
138+
:align: center
139+
:alt: Strict example MSA
140+
141+
An example of an alignment trimmed with the strict method. We have used the same alignment as in :numref:`gappyout-example-figure`. Conserved (grey) and trimmed (white) columns are indicated.
142+
143+
144+
Strictplus method
145+
------------------------
146+
This approach is very similar to the strict method. However, the final step of the algorithm is slightly different. In this case, the block size is defined as 1% of the alignment size with a minimum value of 3 and a maximum size of 12.
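
The block size rule can be written as a simple clamp. The sketch below assumes that 1% of the alignment length is truncated to an integer, which the text does not specify; the function name ``strictplus_block_size`` is hypothetical.

```python
def strictplus_block_size(alignment_length):
    """1% of the alignment length, limited to the range 3..12."""
    one_percent = alignment_length // 100  # truncation is an assumption
    return max(3, min(12, one_percent))
```

Under this assumption an alignment of 800 columns gets a block size of 8, and any alignment of 1200 columns or more gets the maximum of 12.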
147+
148+
This method is optimized for neighbor joining phylogenetic tree reconstruction.
149+
150+
.. _strictplus-example-figure:
151+
.. figure:: _static/strictplus_example.png
152+
:name: strictplus-example
153+
:width: 500px
154+
:align: center
155+
:alt: Strictplus example MSA
156+
157+
An example of an alignment trimmed with the strictplus method. In this case, the block size has automatically been set to 12 because the alignment length is greater than 1200 residues. The same alignment as in :numref:`gappyout-example-figure` and :numref:`strict-example-figure` has been used.
158+
159+
160+
Automated1 method
161+
------------------------
162+
Based on our own benchmarks with simulated alignments (see :doc:`benchmarking <benchmarking>`) we have designed a heuristic approach, denoted as automated1, to determine the optimal automatic method for trimming a given alignment. This heuristic is specifically fine-tuned for trimming alignments intended for maximum likelihood phylogenetic analyses.
163+
164+
Making use of a decision tree (:numref:`automated1-figure`), this heuristic dynamically selects between the :ref:`gappyout <gappyout_method>` and :ref:`strict <strict_method>` methods. In making this choice, trimAl considers the average identity score among all the sequences in the alignment, the average identity score for each most similar pair of sequences in the alignment, as well as the number of sequences in the alignment. We have observed that all these variables were important in deciding which method would provide the highest improvement on a given alignment.
165+
166+
.. _automated1-figure:
167+
.. figure:: _static/automated1_tree.png
168+
:name: automated1
169+
:width: 500px
170+
:align: center
171+
:alt: Automated1 decision tree
172+
173+
A decision tree for the heuristic method automated1. trimAl uses strict (light blue) or gappyout (light grey) methods depending on 1) the average identity score (Avg. identity score) among the sequences in the alignment, 2) the number of sequences in the alignment and 3) the average identity score (max Identity Score) computed from the maximum identity score for each sequence in the alignment. We use light yellow color to highlight the decisions in the tree.
174+

docs/source/conf.py

+40
@@ -0,0 +1,40 @@
1+
# Configuration file for the Sphinx documentation builder.
2+
#
3+
# For the full list of built-in configuration values, see the documentation:
4+
# https://www.sphinx-doc.org/en/master/usage/configuration.html
5+
6+
# -- Project information -----------------------------------------------------
7+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
8+
9+
project = 'trimAl'
10+
copyright = '2024, trimAl team'
11+
author = 'trimAl team'
12+
release = '1.5.0'
13+
version = release
14+
15+
# -- General configuration ---------------------------------------------------
16+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
17+
18+
extensions = [ 'sphinxcontrib.email', 'sphinx_copybutton']
19+
20+
templates_path = ['_templates']
21+
exclude_patterns = []
22+
23+
24+
25+
# -- Options for HTML output -------------------------------------------------
26+
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
27+
28+
#html_theme = 'alabaster'
29+
html_theme = 'sphinx_rtd_theme'
30+
html_static_path = ['_static']
31+
32+
html_title = 'trimAl'
33+
34+
html_css_files = ["https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css", "trimal_doc.css"]
35+
36+
copybutton_prompt_text = "$ "
37+
html_extra_path=['googlef52b8b3434a2466e.html']
38+
39+
numfig = True
40+
language = 'en'
+1
@@ -0,0 +1 @@
1+
google-site-verification: googlef52b8b3434a2466e.html

docs/source/index.rst

+69
@@ -0,0 +1,69 @@
1+
2+
.. raw:: html
3+
4+
<a href="https://github.com/inab/trimal">
5+
<i class="fa fa-github" style="text-decoration: none;"></i>
6+
</a>
7+
8+
.. toctree::
9+
:hidden:
10+
:maxdepth: 3
11+
12+
installation
13+
usage
14+
scores
15+
algorithms
16+
benchmarking
17+
18+
19+
Welcome to trimAl's documentation!
20+
21+
trimAl
22+
==================
23+
This is trimAl's information page. You can also find information related to readAl, an MSA format converter.
24+
trimAl is a tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment.
25+
26+
trimAl can consider several parameters, alone or in multiple combinations, in order to select the most-reliable positions in the alignment.
27+
These include the proportion of sequences with a gap, the level of residue similarity and, if several alignments for the same set of sequences are provided, the consistency level of columns among alignments.
28+
Moreover, trimAl allows the user to manually select a set of columns and sequences to be removed from the alignment.
29+
30+
trimAl implements a series of automated algorithms that search for optimal trimming thresholds based on inherent characteristics of the input alignment, so that the signal-to-noise ratio increases after the alignment trimming phase.
31+
32+
Among its additional features, trimAl can output the complementary alignment (the columns that were trimmed), compute statistics from the alignment, select the output file format, and produce a summary of the trimming in HTML and SVG formats, among many other options.
33+
34+
35+
36+
Publications
37+
============
38+
- `trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses <https://doi.org/10.1093/bioinformatics/btp348>`_
39+
- `trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses (pdf) <https://academic.oup.com/bioinformatics/article-pdf/25/15/1972/48994574/bioinformatics_25_15_1972.pdf>`_
40+
- `Supplementary material (pdf) <_static/supplementary_material.pdf>`_
41+
42+
Citing this tool
43+
==================
44+
::
45+
46+
Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics (Oxford, England), 25(15), 1972–1973. https://doi.org/10.1093/bioinformatics/btp348
47+
48+
49+
License
50+
========
51+
This program is free software: you can redistribute it and/or modify
52+
it under the terms of the GNU General Public License as published by
53+
the Free Software Foundation, the last available version.
54+
55+
Contact
56+
========
57+
If you have any questions or problems, feel free to open a `GitHub issue <https://github.com/inab/trimal/issues>`_ in the repository.
58+
59+
- :email:`Nicolás Díaz Roussel <[email protected]>`
60+
- :email:`Salvador Capella Gutierrez <[email protected]>`
61+
62+
63+
Development team
64+
================
65+
- Salvador Capella Gutiérrez
66+
- Toni Gabaldón
67+
- Víctor Fernández Rodríguez
68+
- Jose M. Silla Martínez (former)
69+
