@@ -57,14 +57,14 @@ The below is an example of output figures of wheat (ABD, 1n=3x=21):
![wheat](example_data/wheat_figures.png)
**Figure. Phased subgenomes of allohexaploid bread wheat genome.** Colors are unified with each subgenome in subplots `B-F`, i.e. the same color means the same subgenome.
* (**A**) The histogram of differential k-mers among homoeologous chromosome sets.
- * (**B**) Heatmap and clustering of differential k-mers. The x-axis, k-mers; y-axis, chromosomes.
+ * (**B**) Heatmap and clustering of differential k-mers. The x-axis shows differential k-mers and the y-axis shows chromosomes. The vertical color bar indicates the subgenome to which each chromosome is assigned; the horizontal color bar indicates the subgenome to which each k-mer is specific (blank for non-specific k-mers).
* (**C**) Principal component analysis (PCA) of differential k-mers.
- * (**D**) Chromosomal characteristics. Rings from outer to inner:
- - (**1**) Karyotypes of subgenome assignments by a k-Means algorithm.
- - (**2**) Significant enrichment of subgenome-specific k-mers.
+ * (**D**) Chromosomal characteristics (window size: 1 Mb). Rings from outer to inner:
+ - (**1**) Subgenome assignments by a k-Means algorithm.
+ - (**2**) Significant enrichment of subgenome-specific k-mers (blank for non-enriched windows).
- (**3**) Normalized proportion of subgenome-specific k-mers.
- - (**4-6**) Density distribution of each subgenome-specific k-mer set.
- - (**7**) Density distribution of subgenome-specific LTR-RTs and other LTR-RTs (the outermost, in grey).
+ - (**4-6**) Density distribution (count) of each subgenome-specific k-mer set.
+ - (**7**) Density distribution (count) of subgenome-specific LTR-RTs and other LTR-RTs (the outermost, in grey).
- (**8**) Homoeologous blocks of each homoeologous chromosome set.
* (**E**) Insertion time of subgenome-specific LTR-RTs.
* (**F**) A phylogenetic tree of 1,000 randomly subsampled LTR/Gypsy elements.
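
Panel E reports insertion times of subgenome-specific LTR-RTs, and the `-mu` option sets the substitution rate used for that estimate. The following is a minimal sketch of how such an age is commonly derived, not necessarily SubPhaser's exact procedure: the divergence `K` between the two LTRs of one element is converted to time with `T = K / (2 * mu)`. The Jukes-Cantor correction and the function names are assumptions made for illustration; only the default `mu = 1.3e-08` is taken from the usage text.

```python
import math


def jukes_cantor_distance(ltr5: str, ltr3: str) -> float:
    """JC69 distance between two pre-aligned LTR sequences.

    Hypothetical helper: columns with gaps or ambiguous bases are skipped.
    The distance model is an assumption, not taken from SubPhaser.
    """
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites in the alignment")
    # Proportion of mismatched sites among comparable columns.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age(distance: float, mu: float = 1.3e-8) -> float:
    """Insertion age in years, T = K / (2 * mu).

    The factor 2 reflects that both LTRs accumulate substitutions
    independently after insertion.
    """
    return distance / (2 * mu)


if __name__ == "__main__":
    five_prime = "ACGTACGTAC" * 10
    three_prime = "ACGTACGTAC" * 9 + "ACGTACGTAT"  # one mismatch in 100 sites
    k = jukes_cantor_distance(five_prime, three_prime)
    print(f"K = {k:.5f}, age = {insertion_age(k):,.0f} years")
```

With the default rate, a divergence of 0.026 corresponds to about one million years, so a larger `-mu` shifts the whole age distribution toward younger insertions.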
@@ -137,6 +137,7 @@ phase-results/
├── k15_q200_f2.chrom-subgenome.tsv # subgenome assignments and bootstrap values
├── k15_q200_f2.sig.kmer-subgenome.tsv # subgenome-specific kmers
├── k15_q200_f2.bin.enrich # subgenome-specific enrichments by genome window/bin
+ ├── k15_q200_f2.bin.group # grouped bins by potential exchanges based on enrichments
├── k15_q200_f2.ltr.enrich # subgenome-specific LTR-RTs
├── k15_q200_f2.ltr.insert.pdf # density plot of insertion age of subgenome-specific LTR-RTs
├── k15_q200_f2.ltr.insert.R # R script for the density plot
@@ -164,28 +165,32 @@ tmp/
```
usage: subphaser [-h] -i GENOME [GENOME ...] -c CFGFILE [CFGFILE ...]
[-labels LABEL [LABEL ...]] [-no_label]
- [-target FILE] [-sep STR]
+ [-target FILE] [-sg_assigned FILE] [-sep STR]
[-custom_features FASTA [FASTA ...]] [-pre STR]
[-o DIR] [-tmpdir DIR] [-k INT] [-f FLOAT] [-q INT]
[-baseline BASELINE] [-lower_count INT]
[-min_prop FLOAT] [-max_freq INT] [-max_prop FLOAT]
[-low_mem] [-by_count] [-re_filter] [-nsg INT]
[-replicates INT] [-jackknife FLOAT]
- [-max_pval FLOAT] [-figfmt {pdf,png}]
+ [-max_pval FLOAT]
+ [-test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}]
+ [-figfmt {pdf,png}]
[-heatmap_colors COLOR [COLOR ...]]
- [-heatmap_options STR] [-disable_ltr]
+ [-heatmap_options STR] [-just_core] [-disable_ltr]
[-ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]]
[-ltr_finder_options STR] [-ltr_harvest_options STR]
[-tesorter_options STR] [-all_ltr] [-intact_ltr]
- [-shared_ltr] [-mu FLOAT] [-disable_ltrtree]
- [-subsample INT]
+ [-exclude_exchanges] [-shared_ltr] [-mu FLOAT]
+ [-disable_ltrtree] [-subsample INT]
[-ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]]
[-trimal_options STR]
[-tree_method {iqtree,FastTree}] [-tree_options STR]
[-ggtree_options STR] [-disable_circos]
[-window_size INT] [-disable_blocks] [-aligner PROG]
- [-aligner_options STR] [-min_block INT] [-p INT]
- [-max_memory MEM] [-cleanup] [-overwrite] [-v]
+ [-aligner_options STR] [-min_block INT]
+ [-alt_cfgs CFGFILE [CFGFILE ...]] [-chr_ordered FILE]
+ [-p INT] [-max_memory MEM] [-cleanup] [-overwrite]
+ [-v]

Phase and visualize subgenomes of an allopolyploid or hybrid based on the repetitive kmers.
@@ -198,19 +203,22 @@ Input:
-i GENOME [GENOME ...], -genomes GENOME [GENOME ...]
Input genome sequences in fasta format [required]
-c CFGFILE [CFGFILE ...], -sg_cfgs CFGFILE [CFGFILE ...]
- Subgenomes config file (one homoeologous group per
+ Subgenomes config file (one homologous group per
line); this chromosome set is for identifying
differential kmers [required]
-labels LABEL [LABEL ...]
For multiple genomes, provide prefix labels for each
genome sequence to avoid conflicts among chromosome id
[default: '1-, 2-, ..., n-']
-no_label Do not use default prefix labels for genome sequences
- as there is no conflict among chromosome id [default:
- False]
+ as there is no conflict among chromosome id
+ [default=False]
-target FILE Target chromosomes to output; id mapping is allowed;
this chromosome set is for cluster and phase [default:
the same chromosome set as `-sg_cfgs`]
+ -sg_assigned FILE Provide subgenome assignments to skip k-means
+ clustering and to identify subgenome-specific features
+ [default=None]
-sep STR Separator for chromosome ID [default="|"]
-custom_features FASTA [FASTA ...]
Custom features in fasta format to enrich subgenome-
@@ -243,7 +251,7 @@ Kmer:
[default=None]
-low_mem Low MEMory but slower [default: True if genome size >
3G, else False]
- -by_count Calculate fold by count instead of by propor
+ -by_count Calculate fold by count instead of by proportion
[default=False]
-re_filter Re-filter with subset of chromosomes (subgenome
assignments are expected to change) [default=False]
@@ -253,68 +261,73 @@ Cluster:
-nsg INT Number of subgenomes (>1) [default: auto]
-replicates INT Number of replicates for bootstrap [default=1000]
- -jackknife FLOAT Percent of kmers to resample for bootstrap
+ -jackknife FLOAT Percent of kmers to resample for each bootstrap
[default=50]
-max_pval FLOAT Maximum P value for all hypothesis tests
[default=0.05]
+ -test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}
+ The test method to identify differential
+ kmers [default=ttest_ind]
-figfmt {pdf,png} Format of figures [default=pdf]
-heatmap_colors COLOR [COLOR ...]
- Color panel (2 or 3 colors) for heatmap plot
- [default=('green', 'black', 'red')]
+ Color panel (2 or 3 colors) for heatmap plot [default:
+ ('green', 'black', 'red')]
-heatmap_options STR Options for heatmap plot (see more in R shell with
`?heatmap.2` of `gplots` package) [default="Rowv=T,Col
v=T,scale='col',dendrogram='row',labCol=F,trace='none'
,key=T,key.title=NA,density.info='density',main=NA,xla
- b=NA,margins=c(4,8)"]
+ b='Differential kmers',margins=c(2.5,12)"]
+ -just_core Exit after the core phasing module
+ [default=False]

LTR:
Options for LTR analyses

-disable_ltr Disable this step (this step is time-consuming for
large genome) [default=False]
-ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]
- Programs to detect LTR-RTs [default=['ltr_harvest',
- 'ltr_finder']]
+ Programs to detect LTR-RTs [default=['ltr_harvest']]
-ltr_finder_options STR
Options for `ltr_finder` to identify LTR-RTs (see more
- with `ltr_finder -h`) [default="-w 2 -D 20000 -d 1000
- -L 7000 -l 100 -p 20 -C -M 0.6 "]
+ with `ltr_finder -h`) [default="-w 2 -D 15000 -d 1000
+ -L 7000 -l 100 -p 20 -C -M 0.8 "]
294
-ltr_harvest_options STR
283
295
Options for `gt ltrharvest` to identify LTR-RTs (see
284
296
more with `gt ltrharvest -help`) [default="-seqids yes
285
- -similar 60 -vic 10 -seed 20 -minlenltr 100 -maxlenltr
286
- 7000 -maxdistltr 20000 -mindistltr 1000 -mintsd 4
287
- -maxtsd 20"]
297
+ -similar 80 -vic 10 -seed 20 -minlenltr 100 -maxlenltr
298
+ 7000 -mintsd 4 -maxtsd 6"]
288
299
-tesorter_options STR
289
300
Options for `TEsorter` to classify LTR-RTs (see more
- with `TEsorter -h`) [default="-db rexdb-plant -dp2"]
- -all_ltr Use all LTR identified by `-ltr_detectors` (more LTRs
- but slower) [default: only use LTR as classified by
- `TEsorter`]
- -intact_ltr Use completed LTR as classified by `TEsorter` (less
- LTRs but faster) [default: the same as `-all_ltr`]
- -shared_ltr Identify shared LTRs among subgenomes (experimental)
- [default=False]
+ with `TEsorter -h`) [default="-db rexdb -dp2"]
+ -all_ltr Use all LTR-RTs identified by `-ltr_detectors` (more
+ LTR-RTs but slower) [default: only use LTR as
+ classified by `TEsorter`]
+ -intact_ltr Use completed LTR-RTs classified by `TEsorter` (less
+ LTR-RTs but faster) [default: the same as `-all_ltr`]
+ -exclude_exchanges Exclude potential exchanged LTRs for insertion age
+ estimation and phylogenetic trees [default=False]
+ -shared_ltr Identify shared LTR-RTs among subgenomes
+ (experimental) [default=False]
-mu FLOAT Substitution rate per year in the intergenic region,
for estimating age of LTR insertion [default=1.3e-08]
-disable_ltrtree Disable subgenome-specific LTR tree (this step is
- time-consuming when subgenome-specific LTRs are too
+ time-consuming when subgenome-specific LTR-RTs are too
many, so `-subsample` is enabled by default)
[default=False]
- -subsample INT Subsample LTRs to avoid too many to construct a tree
- [default=1000] (0 to disable)
+ -subsample INT Subsample LTR-RTs to avoid too many to construct a
+ tree [default=1000] (0 to disable)
-ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]
Domains for LTR tree (Note: for domains identified by
`TEsorter`, PROT (rexdb) = AP (gydb), RH (rexdb) =
- RNaseH (gydb)) [default=['INT', 'RT', 'RH']]
+ RNaseH (gydb)) [default: ['INT', 'RT', 'RH']]
-trimal_options STR Options for `trimal` to trim alignment (see more with
`trimal -h`) [default="-automated1"]
-tree_method {iqtree,FastTree}
Programs to construct phylogenetic trees
- [default=iqtree]
+ [default=FastTree]
-tree_options STR Options for `-tree_method` to construct phylogenetic
trees (see more with `iqtree -h` or `FastTree
- -expert`) [default="-mset JTT "]
+ -expert`) [default=""]
-ggtree_options STR Options for `ggtree` to show phylogenetic trees (see
more from `https://yulab-smu.top/treedata-book`)
[default="branch.length='none', layout='circular'"]
@@ -324,17 +337,22 @@ Circos:
-disable_circos Disable this step [default=False]
-window_size INT Window size (bp) for circos plot [default=1000000]
- -disable_blocks Disable to plot homoeologous blocks [default=False]
- -aligner PROG Programs to identify homoeologous blocks
+ -disable_blocks Disable to plot homologous blocks [default=False]
+ -aligner PROG Programs to identify homologous blocks
[default=minimap2]
-aligner_options STR Options for `-aligner` to align chromosome sequences
[default="-x asm20 -n 10"]
-min_block INT Minimum block size (bp) to show [default=100000]
+ -alt_cfgs CFGFILE [CFGFILE ...]
+ An alternative config file for identifying homologous
+ blocks [default=None]
+ -chr_ordered FILE Provide a chromosome order to plot circos
+ [default=None]

Other options:
-p INT, -ncpu INT Maximum number of processors to use [default=32]
-max_memory MEM Maximum memory to use where limiting can be enabled.
- [default=65.1G]
+ [default=65.2G]
-cleanup Remove the temporary directory [default=False]
-overwrite Overwrite even if check point files existed
[default=False]