@@ -8,11 +8,11 @@ LZ-ANI is a key component of [Vclust](https://github.com/refresh-bio/vclust), a
LZ-ANI offers six similarity measures between two genomic sequences:

- **ANI**: The number of identical bases across local alignments divided by the total length of the alignments.
- - **Global ANI (gANI)**: The number of identical bases across local alignments divided by the length of the query/target genome.
- - **Total ANI (tANI)**: The number of identical bases between query-target and target-query genomes divided by the sum length of both genomes.
- - **Coverage (alignment fraction)**: The proportion of the query sequence aligned with the target sequence.
+ - **Global ANI (gANI)**: The number of identical bases across local alignments divided by the length of the query/reference genome.
+ - **Total ANI (tANI)**: The number of identical bases between query-reference and reference-query genomes divided by the sum length of both genomes.
+ - **Coverage (alignment fraction)**: The proportion of the query/reference sequence aligned with the reference/query sequence.
- **Number of local alignments**: The count of individual alignments found between the sequences.
- - **Ratio between query and target genome lengths**: A measure comparing the lengths of the two genomes.
+ - **Ratio between query and reference genome lengths**: A measure comparing the lengths of the two genomes.

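The definitions above can be written out as plain arithmetic. The following Python sketch is illustrative only: the function `similarity_measures` and its input totals are hypothetical names chosen for this example, not part of LZ-ANI, and the exact quantities LZ-ANI counts internally may differ. The length ratio follows the "shorter over longer" definition used in the updated column table.

```python
def similarity_measures(ident_qr, aligned_q, ident_rq, aligned_r, qlen, rlen):
    """Compute the ratio-based measures for one query -> reference comparison.

    All inputs are hypothetical totals used only for illustration:
      ident_qr  - identical bases over local alignments, query vs. reference
      ident_rq  - identical bases over local alignments, reference vs. query
      aligned_q - aligned bases in the query (total alignment length)
      aligned_r - aligned bases in the reference
      qlen/rlen - lengths of the query and reference genomes
    """
    return {
        "ani": ident_qr / aligned_q,                      # identity within alignments
        "gani": ident_qr / qlen,                          # identity over the whole query
        "tani": (ident_qr + ident_rq) / (qlen + rlen),    # both directions, both genomes
        "qcov": aligned_q / qlen,                         # aligned fraction of the query
        "rcov": aligned_r / rlen,                         # aligned fraction of the reference
        "len_ratio": min(qlen, rlen) / max(qlen, rlen),   # shorter over longer
    }


measures = similarity_measures(
    ident_qr=900, aligned_q=1000, ident_rq=880, aligned_r=1100, qlen=2000, rlen=2200
)
for name, value in measures.items():
    print(f"{name}\t{value:.6f}")
```

Under these definitions, gANI equals ANI multiplied by query coverage, which is why a pair can have high ANI but low gANI when only a small part of the query aligns.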
## Installation
@@ -78,6 +78,7 @@ Currently, LZ-ANI operates exclusively in the all2all mode, where sequence simil
* `lite=idx1,idx2,tani,gani,ani,cov,num_alns,len_ratio`
* `standard=idx1,idx2,id1,id2,tani,gani,ani,cov,num_alns,len_ratio`
* `(default: standard)`
+ * `--out-alignment <file_name>` &mdash; output file name for alignments (optional)
* `--out-filter <par> <float>` &mdash; store only results with `<par>` (can be: `tani`, `gani`, `ani`, `cov`) at least `<float>`; can be used multiple times

#### LZ-parsing options:
@@ -118,101 +119,95 @@ LZ-ANI creates two TSV files: one contains ANI values for genome pairs, and the
./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv
```

- For brevity, only the first 15 lines of output are shown:
+ For brevity, only the first 12 lines of output are shown:

```
- id1 id2 tani gani ani cov len_ratio
- NC_025457.alt2 NC_005091.alt2 0.013765 0.011564 0.577882 0.020011 1.007347
- NC_005091.alt2 NC_025457.alt2 0.013765 0.015982 0.575792 0.027757 0.992706
- NC_025457.alt2 NC_005091.alt1 0.014603 0.013995 0.565491 0.024749 1.116770
- NC_005091.alt1 NC_025457.alt2 0.014603 0.015282 0.555345 0.027517 0.895440
- NC_025457.alt2 NC_005091.ref 0.014644 0.012671 0.576596 0.021975 1.116770
- NC_005091.ref NC_025457.alt2 0.014644 0.016848 0.569077 0.029606 0.895440
- NC_025457.alt2 NC_002486.alt 0.022687 0.018328 0.604938 0.030297 1.405995
- NC_002486.alt NC_025457.alt2 0.022687 0.028815 0.594216 0.048492 0.711240
- NC_025457.alt2 NC_002486.ref 0.020692 0.017268 0.604474 0.028567 1.405995
- NC_002486.ref NC_025457.alt2 0.020692 0.025506 0.609424 0.041853 0.711240
- NC_025457.alt2 NC_025457.ref 0.752589 0.658220 0.910059 0.723272 1.504290
- NC_025457.ref NC_025457.alt2 0.752589 0.894547 0.915166 0.977470 0.664765
- NC_025457.alt2 NC_025457.alt1 0.595191 0.502322 0.895679 0.560829 1.562460
- NC_025457.alt1 NC_025457.alt2 0.595191 0.740296 0.909148 0.814275 0.640016
- NC_025457.alt2 NC_010807.alt2 0.027875 0.022115 0.570567 0.038760 1.582148
+ qidx ridx query reference tani gani ani qcov rcov num_alns len_ratio
+ 9 8 NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 0.997608 60 0.9836
+ 8 9 NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 0.973177 67 0.9836
+ 10 8 NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 0.999901 34 0.9571
+ 8 10 NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 0.999923 36 0.9571
+ 11 8 NC_010807.ref NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 1.000000 14 0.9571
+ 8 11 NC_010807.alt2 NC_010807.ref 0.989807 0.990063 0.990063 1.000000 0.999923 14 0.9571
+ 10 9 NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 0.972575 71 0.9730
+ 9 10 NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 0.998686 70 0.9730
+ 11 9 NC_010807.ref NC_010807.alt3 0.983839 0.997166 0.997217 0.999948 0.974230 52 0.9730
+ 9 11 NC_010807.alt3 NC_010807.ref 0.983839 0.970871 0.996552 0.974230 0.999948 52 0.9730
+ 11 10 NC_010807.ref NC_010807.alt1 0.997462 0.997475 0.997475 1.000000 1.000000 23 1.0000
+ 10 11 NC_010807.alt1 NC_010807.ref 0.997462 0.997449 0.997449 1.000000 1.000000 23 1.0000
```

### Output format

The `--out-format` option provides three predefined output formats: `standard`, `lite`, and `complete`.

- | Field | Standard | Lite | Complete | Description |
+ | Column | Standard | Lite | Complete | Description |
| --- | :---: | :---: | :---: | --- |
- | idx1 | + | + | + | index of sequence 1 |
- | idx2 | + | + | + | index of sequence 2 |
- | id1 | + | - | + | identifier (name) of sequence 1 |
- | id2 | + | - | + | identifier (name) of sequence 2 |
+ | qidx | + | + | + | Index of query sequence |
+ | ridx | + | + | + | Index of reference sequence |
+ | query | + | - | + | Identifier (name) of query sequence |
+ | reference | + | - | + | Identifier (name) of reference sequence |
| tani | + | + | + | Total ANI [0-1] |
| gani | + | + | + | Global ANI [0-1] |
| ani | + | + | + | ANI [0-1] |
- | cov | + | + | + | Coverage (alignment fraction) [0-1] |
- | num_alns | + | + | + | Number of alignments |
- | len_ratio | + | + | + | Length ratio between sequence 1 and sequence 2 |
- | len1 | - | - | + | Length of sequence 1 |
- | len2 | - | - | + | Length of sequence 2 |
+ | qcov | + | + | + | Query coverage (aligned fraction) [0-1] |
+ | rcov | + | + | + | Reference coverage (aligned fraction) [0-1] |
+ | num_alns | + | + | + | Number of local alignments |
+ | len_ratio | + | + | + | Length ratio between shorter and longer sequence [0-1] |
+ | qlen | - | - | + | Query sequence length |
+ | rlen | - | - | + | Reference sequence length |
| nt_match | - | - | + | Number of matching nucleotides across alignments |
| nt_mismatch | - | - | + | Number of mismatching nucleotides across alignments |


In addition, the `--out-format` option accepts a comma-separated list of column names to select which columns appear in the tab-separated values (TSV) output:

``` bash
- ./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-format id1,id2,ani,cov
+ ./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-format query,reference,ani,qcov,rcov
```
```
- id1 id2 ani cov
- NC_025457.alt2 NC_005091.alt2 0.577882 0.020011
- NC_005091.alt2 NC_025457.alt2 0.575792 0.027757
- NC_025457.alt2 NC_005091.alt1 0.565491 0.024749
- NC_005091.alt1 NC_025457.alt2 0.555345 0.027517
- NC_025457.alt2 NC_005091.ref 0.576596 0.021975
- NC_005091.ref NC_025457.alt2 0.569077 0.029606
- NC_025457.alt2 NC_002486.alt 0.604938 0.030297
- NC_002486.alt NC_025457.alt2 0.594216 0.048492
- NC_025457.alt2 NC_002486.ref 0.604474 0.028567
- NC_002486.ref NC_025457.alt2 0.609424 0.041853
- NC_025457.alt2 NC_025457.ref 0.910059 0.723272
- NC_025457.ref NC_025457.alt2 0.915166 0.977470
- NC_025457.alt2 NC_025457.alt1 0.895679 0.560829
- NC_025457.alt1 NC_025457.alt2 0.909148 0.814275
- NC_025457.alt2 NC_010807.alt2 0.570567 0.038760
+ query reference ani qcov rcov
+ NC_010807.alt2 NC_025457.alt2 0.572519 0.0646036 0.0387601
+ NC_025457.alt2 NC_010807.alt2 0.570567 0.0387601 0.0646036
+ NC_010807.alt3 NC_025457.alt2 0.586745 0.0514402 0.0354560
+ NC_025457.alt2 NC_010807.alt3 0.565714 0.0354560 0.0514402
+ NC_010807.alt1 NC_025457.alt2 0.577825 0.0604148 0.0394770
+ NC_025457.alt2 NC_010807.alt1 0.568496 0.0394770 0.0604148
+ NC_010807.ref NC_025457.alt2 0.57375 0.0618318 0.0395705
+ NC_025457.alt2 NC_010807.ref 0.567546 0.0395705 0.0618318
+ NC_005091.alt1 NC_005091.alt2 0.937913 0.996571 0.996907
+ NC_005091.alt2 NC_005091.alt1 0.940487 0.996907 0.996571
+ NC_005091.ref NC_005091.alt2 0.964911 0.999495 0.999859
+ NC_005091.alt2 NC_005091.ref 0.968125 0.999859 0.999495
+ NC_002486.alt NC_005091.alt2 0.558574 0.0129065 0.00871326
...
```
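Because the output is a TSV file with a header row, it can be read by column name in downstream scripts. The Python sketch below is an illustration, not part of LZ-ANI: the helper `read_pairs` is a hypothetical name, the three rows are copied from the sample output above, and tab separation is assumed from the TSV format.

```python
import csv
import io

# Three rows copied from the sample output above; tab separation is assumed.
SAMPLE = (
    "query\treference\tani\tqcov\trcov\n"
    "NC_010807.alt2\tNC_025457.alt2\t0.572519\t0.0646036\t0.0387601\n"
    "NC_025457.alt2\tNC_010807.alt2\t0.570567\t0.0387601\t0.0646036\n"
    "NC_005091.alt1\tNC_005091.alt2\t0.937913\t0.996571\t0.996907\n"
)


def read_pairs(handle):
    """Yield one dict per genome pair, converting numeric columns to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for key in ("ani", "qcov", "rcov"):
            row[key] = float(row[key])
        yield row


pairs = list(read_pairs(io.StringIO(SAMPLE)))
best = max(pairs, key=lambda pair: pair["ani"])
print(best["query"], best["reference"], best["ani"])
```

To read a real result file, pass `open("ani.tsv", newline="")` instead of the in-memory sample and adjust the list of numeric columns to match the chosen `--out-format`.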

### Output filtering

- The `--out-filter` option allows you to filter the output by setting minimum similarity thresholds, enabling you to report only those genome pairs that meet the specified criteria, thus significantly reducing the output TSV file size. For example, the following command outputs only genome pairs with ANI ≥ 0.95 and coverage ≥ 0.85:
+ The `--out-filter` option sets minimum similarity thresholds so that only genome pairs meeting every specified criterion are reported, which can significantly reduce the size of the output TSV file. For example, the following command outputs only genome pairs with ANI ≥ 0.95 and query coverage ≥ 0.85:

``` bash
- ./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-filter ani 0.95 --out-filter cov 0.85
+ ./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-filter ani 0.95 --out-filter qcov 0.85
```
```
- id1 id2 tani gani ani cov len_ratio
- NC_005091.alt2 NC_005091.ref 0.966298 0.967989 0.968125 0.999859 1.108624
- NC_005091.ref NC_005091.alt2 0.966298 0.964424 0.964911 0.999495 0.902019
- NC_005091.alt1 NC_005091.ref 0.970072 0.970151 0.971368 0.998747 1.000000
- NC_005091.ref NC_005091.alt1 0.970072 0.969994 0.971245 0.998712 1.000000
- NC_002486.alt NC_002486.ref 1.000000 1.000000 1.000000 1.000000 1.000000
- NC_002486.ref NC_002486.alt 1.000000 1.000000 1.000000 1.000000 1.000000
- NC_025457.alt1 NC_025457.ref 0.809496 0.845785 0.985613 0.858131 0.962770
- NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 1.016645
- NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 0.983627
- NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 1.044828
- NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 0.957095
- NC_010807.alt2 NC_010807.ref 0.989807 0.990063 0.990063 1.000000 1.044828
- NC_010807.ref NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 0.957095
- NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 1.027721
- NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 0.973026
+ qidx ridx query reference tani gani ani qcov num_alns len_ratio
+ 7 6 NC_025457.alt1 NC_025457.ref 0.809496 0.845785 0.985613 0.858131 123 0.9628
+ 9 8 NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 60 0.9836
+ 8 9 NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 67 0.9836
+ 10 8 NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 34 0.9571
+ 8 10 NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 36 0.9571
+ 11 8 NC_010807.ref NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 14 0.9571
+ 8 11 NC_010807.alt2 NC_010807.ref 0.989807 0.990063 0.990063 1 14 0.9571
+ 10 9 NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 71 0.9730
+ 9 10 NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 70 0.9730
+ 11 9 NC_010807.ref NC_010807.alt3 0.983839 0.997166 0.997217 0.999948 52 0.9730
+ 9 11 NC_010807.alt3 NC_010807.ref 0.983839 0.970871 0.996552 0.974230 52 0.9730
+ 11 10 NC_010807.ref NC_010807.alt1 0.997462 0.997475 0.997475 1 23 1
+ 10 11 NC_010807.alt1 NC_010807.ref 0.997462 0.997449 0.997449 1 23 1
...
```
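The same "at least" thresholds can also be applied after the fact to an unfiltered result file. The following Python sketch shows that logic under stated assumptions: `passes_filters` is a hypothetical helper, not LZ-ANI's implementation, and the three rows are taken from the sample outputs in this document.

```python
def passes_filters(row, minimums):
    """Return True if every named column in `row` is at least its minimum."""
    return all(float(row[col]) >= threshold for col, threshold in minimums.items())


# Mirrors: --out-filter ani 0.95 --out-filter qcov 0.85
minimums = {"ani": 0.95, "qcov": 0.85}

# Rows copied from the sample outputs in this document (values kept as strings,
# as they would be when read from a TSV file).
rows = [
    {"query": "NC_025457.alt1", "reference": "NC_025457.ref", "ani": "0.985613", "qcov": "0.858131"},
    {"query": "NC_005091.alt1", "reference": "NC_005091.alt2", "ani": "0.937913", "qcov": "0.996571"},
    {"query": "NC_010807.alt2", "reference": "NC_025457.alt2", "ani": "0.572519", "qcov": "0.0646036"},
]

kept = [row for row in rows if passes_filters(row, minimums)]
for row in kept:
    print(row["query"], row["reference"], row["ani"], row["qcov"])
```

Filtering inside LZ-ANI with `--out-filter` avoids writing the rejected pairs at all, so a post-hoc filter like this is mainly useful for trying out different thresholds on an existing file.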
@@ -250,6 +245,47 @@ kmer-db distance ani-shorter -sparse -above 0.7 all2all.txt
mv all2all.txt fltr.txt
```

+ ### Alignments
+
+ LZ-ANI can output alignment details in a separate TSV file. This output format is similar to the BLASTn tabular output and includes information on each local alignment between two genomes, such as the coordinates in both the query and reference sequences, strand orientation, the number of matched and mismatched nucleotides, and the percentage of sequence identity.
+
+ ``` bash
+ ./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-alignment ani.aln.tsv
+ ```
+
+ Sample output:
+
+ ```
+ query reference pident alnlen qstart qend rstart rend nt_match nt_mismatch
+ NC_025457.alt2 NC_025457.ref 89.2893 999 22119 23117 14207 15163 892 107
+ NC_025457.alt2 NC_025457.ref 89.8305 826 3373 4198 2202 3020 742 84
+ NC_025457.alt2 NC_025457.ref 91.0804 796 41697 42492 27680 28475 725 71
+ NC_025457.alt2 NC_025457.ref 87.2483 745 38039 38783 24969 25688 650 95
+ NC_025457.alt2 NC_025457.ref 89.8860 702 7269 7970 5077 5778 631 71
+ NC_025457.alt2 NC_025457.ref 93.2081 692 62572 63263 41329 42020 645 47
+ NC_025457.alt2 NC_025457.ref 90.9565 575 31121 31695 20438 21003 523 52
+ NC_025457.alt2 NC_025457.ref 90.6195 565 11476 12040 7999 8563 512 53
+ NC_025457.alt2 NC_025457.ref 91.6211 549 10905 11453 7455 8003 503 46
+ NC_025457.alt2 NC_025457.ref 86.7041 534 29624 30157 19067 19586 463 71
+ NC_025457.alt2 NC_025457.ref 93.5673 513 10149 10661 6915 7427 480 33
+ NC_025457.alt2 NC_025457.ref 89.3701 508 34017 34524 22188 22695 454 54
+ NC_025457.alt2 NC_025457.ref 88.0240 501 18330 18830 11549 12049 441 60
+ ```
+
+ | Column | Description |
+ | --- | --- |
+ | query | Identifier (name) of query sequence |
+ | reference | Identifier (name) of reference sequence |
+ | pident | Percent identity of local alignment |
+ | alnlen | Alignment length |
+ | qstart | Start of alignment in query |
+ | qend | End of alignment in query |
+ | rstart | Start of alignment in reference |
+ | rend | End of alignment in reference |
+ | nt_match | Number of matched (identical) nucleotides |
+ | nt_mismatch | Number of mismatching nucleotides |
+
+

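The alignment columns are related to one another in a simple way in the sample rows. The Python sketch below checks three relations on the first three sample rows: `pident` equals `100 * nt_match / alnlen`, `nt_match + nt_mismatch` equals `alnlen`, and `qend - qstart + 1` equals `alnlen`. These are observations about the sample, not documented guarantees of LZ-ANI, and the helper `percent_identity` is a hypothetical name.

```python
# (alnlen, qstart, qend, nt_match, nt_mismatch, pident) from the first three sample rows
SAMPLE_ALIGNMENTS = [
    (999, 22119, 23117, 892, 107, 89.2893),
    (826, 3373, 4198, 742, 84, 89.8305),
    (796, 41697, 42492, 725, 71, 91.0804),
]


def percent_identity(nt_match, alnlen):
    """Percent identity as matched nucleotides over alignment length."""
    return 100.0 * nt_match / alnlen


checks = []
for alnlen, qstart, qend, nt_match, nt_mismatch, pident in SAMPLE_ALIGNMENTS:
    checks.append({
        # Reported pident agrees with the recomputed value to four decimals.
        "pident_ok": abs(percent_identity(nt_match, alnlen) - pident) < 1e-4,
        # Matches and mismatches account for the whole alignment length.
        "counts_ok": nt_match + nt_mismatch == alnlen,
        # Query coordinates behave as an inclusive interval of length alnlen.
        "span_ok": qend - qstart + 1 == alnlen,
    })

for check in checks:
    print(check)
```

The reference span (`rend - rstart + 1`) is shorter than `alnlen` in the first sample row, so the same interval check is deliberately not applied to the reference coordinates.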
## Further clustering

The LZ-ANI output files, [ani.tsv](./example/output/ani.tsv) and [ani.ids.tsv](./example/output/ani.ids.tsv), can be used as input for clustering with [Clusty](https://github.com/refresh-bio/clusty). Clustering can use one of the similarity measures (e.g., `tani`, `ani`), with the user specifying the minimum similarity threshold for connecting genomes.
@@ -264,7 +300,7 @@ Clusty can also apply additional thresholds for various similarity measures. If
``` bash
# Cluster genomes based on ANI, connecting them only if ANI ≥ 95% and query coverage ≥ 85%.
- clusty --objects-file example/output/ani.ids.tsv --algo complete --distance-col ani --similarity --numeric-ids --min ani 0.95 --min cov 0.85 example/output/ani.tsv clusters.txt
+ clusty --objects-file example/output/ani.ids.tsv --algo complete --distance-col ani --similarity --numeric-ids --min ani 0.95 --min qcov 0.85 example/output/ani.tsv clusters.txt
```
## Cite