Commit b40a91e

Merge pull request #114 from shenwei356/reformat2
v0.19.0
2 parents c2e0c7b + 3b6229d commit b40a91e

5 files changed

+116
-28
lines changed

README.md

+7-7
@@ -53,7 +53,7 @@ Related projects:
5353
- **Versatile commands**
5454
- [Usage and examples](http://bioinf.shenwei.me/taxonkit/usage/)
5555
- Featured command: [tracking monthly changelog of all TaxIds](https://github.com/shenwei356/taxid-changelog)
56-
- Featured command: [reformating lineage into format of seven-level ("superkingdom/kingdom, phylum, class, order, family, genus, species"](https://bioinf.shenwei.me/taxonkit/usage/#reformat)
56+
- Featured command: [reformating lineage into format of seven-level ("superkingdom/kingdom, phylum, class, order, family, genus, species"](https://bioinf.shenwei.me/taxonkit/usage/#reformat), and [even all possible ranks](https://bioinf.shenwei.me/taxonkit/usage/#reformat2)
5757
- Featured command: [filtering taxiDs by a rank range](http://bioinf.shenwei.me/taxonkit/usage/#filter), e.g., at or below genus rank.
5858
- Featured command: [**Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV**](https://bioinf.shenwei.me/taxonkit/usage/#create-taxdump)
5959

@@ -64,7 +64,7 @@ Subcommand |F
6464
[`list`](https://bioinf.shenwei.me/taxonkit/usage/#list) |List taxonomic subtrees (TaxIds) bellow given TaxIds
6565
[`lineage`](https://bioinf.shenwei.me/taxonkit/usage/#lineage) |Query taxonomic lineage of given TaxIds
6666
[`reformat`](https://bioinf.shenwei.me/taxonkit/usage/#reformat) |Reformat lineage in canonical ranks
67-
[`reformat2`](https://bioinf.shenwei.me/taxonkit/usage/#reformat2) |Reformat lineage in chosen ranks, allowing more ranks than 'reformat'
67+
[`reformat2`](https://bioinf.shenwei.me/taxonkit/usage/#reformat2)<sup>*</sup>|Reformat lineage in chosen ranks, allowing more ranks than 'reformat'
6868
[`name2taxid`](https://bioinf.shenwei.me/taxonkit/usage/#name2taxid) |Convert taxon names to TaxIds
6969
[`filter`](https://bioinf.shenwei.me/taxonkit/usage/#filter) |Filter TaxIds by taxonomic rank range
7070
[`lca`](https://bioinf.shenwei.me/taxonkit/usage/#lca) |Compute lowest common ancestor (LCA) for TaxIds
@@ -79,23 +79,23 @@ Note: <sup>*</sup>New commands since the publication.
7979

8080
## Benchmark
8181

82-
1. Getting complete lineage for given TaxIds
82+
1. Getting complete lineage for given TaxIds (this plot is very old).
8383

8484
<img src="bench/bench.get_lineage.reformat.tsv.png" alt="" width="600" align="center" />
8585

8686
Versions: ETE=3.1.2, taxopy=0.5.0 ([faster since 0.6.0](https://github.com/shenwei356/taxonkit/issues/47)), TaxonKit=0.7.2.
8787

8888
## Dataset
8989

90-
1. Download and uncompress `taxdump.tar.gz`: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
90+
1. Download and uncompress `taxdump.tar.gz`: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
9191
2. Copy `names.dmp`, `nodes.dmp`, `delnodes.dmp` and `merged.dmp` to data directory: `$HOME/.taxonkit`,
9292
e.g., `/home/shenwei/.taxonkit` ,
9393
3. Optionally copy to some other directories, and later you can refer to using flag `--data-dir`,
9494
or environment variable `TAXONKIT_DB`.
9595

9696
All-in-one command:
9797

98-
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
98+
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
9999
tar -zxvf taxdump.tar.gz
100100

101101
mkdir -p $HOME/.taxonkit
@@ -141,9 +141,9 @@ And then:
141141

142142
1. [Install go](https://go.dev/doc/install)
143143

144-
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz
144+
wget https://go.dev/dl/go1.24.1.linux-amd64.tar.gz
145145

146-
tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/
146+
tar -zxf go1.24.1.linux-amd64.tar.gz -C $HOME/
147147

148148
# or
149149
# echo "export PATH=$PATH:$HOME/go/bin" >> ~/.bashrc

doc/docs/download.md

+17-13
@@ -6,12 +6,11 @@
66

77
## Current Version
88

9-
- [TaxonKit v0.18.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.18.0)
10-
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.18.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.18.0)
9+
- [TaxonKit v0.19.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.19.0)
10+
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.19.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.19.0)
11+
- new command `taxonkit reformat2`: Reformat lineage in chosen ranks, allowing more ranks than 'reformat'
1112
- `taxonkit reformat`:
12-
- Add a placeholder for rank "realm", "{r}", which is common in Virus taxonomy like [ictv](https://github.com/shenwei356/ictv-taxdump). [#102](https://github.com/shenwei356/taxonkit/issues/102)
13-
- `taxonkit name2taxid`:
14-
- Show warning for names with multiple taxids. [#103](https://github.com/shenwei356/taxonkit/issues/103)
13+
- Fix `-T/--trim` which did not work for `-r/--miss-rank-repl`. [#106](https://github.com/shenwei356/taxonkit/issues/106)
1514

1615
### Please cite
1716

@@ -28,11 +27,11 @@
2827

2928
OS |Arch |File, 中国镜像 |Download Count
3029
:------|:---------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
31-
Linux |**64-bit**|[**taxonkit_linux_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_linux_amd64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_linux_amd64.tar.gz)
32-
Linux |**arm64** |[**taxonkit_linux_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_linux_arm64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_linux_arm64.tar.gz)
33-
macOS |**64-bit**|[**taxonkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_darwin_amd64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_darwin_amd64.tar.gz)
34-
macOS |**arm64** |[**taxonkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_darwin_arm64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_darwin_arm64.tar.gz)
35-
Windows|**64-bit**|[**taxonkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_windows_amd64.exe.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.18.0/taxonkit_windows_amd64.exe.tar.gz)
30+
Linux |**64-bit**|[**taxonkit_linux_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_linux_amd64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_linux_amd64.tar.gz)
31+
Linux |**arm64** |[**taxonkit_linux_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_linux_arm64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_linux_arm64.tar.gz)
32+
macOS |**64-bit**|[**taxonkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_darwin_amd64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_darwin_amd64.tar.gz)
33+
macOS |**arm64** |[**taxonkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_darwin_arm64.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_darwin_arm64.tar.gz)
34+
Windows|**64-bit**|[**taxonkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_windows_amd64.exe.tar.gz),<br/> [中国镜像](http://app.shenwei.me/data/taxonkit/taxonkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/taxonkit/latest/taxonkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/taxonkit/releases/download/v0.19.0/taxonkit_windows_amd64.exe.tar.gz)
3635

3736
## Installation
3837

@@ -72,9 +71,9 @@ And then:
7271

7372
1. [Install go](https://go.dev/doc/install)
7473

75-
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz
74+
wget https://go.dev/dl/go1.24.1.linux-amd64.tar.gz
7675

77-
tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/
76+
tar -zxf go1.24.1.linux-amd64.tar.gz -C $HOME/
7877

7978
# or
8079
# echo "export PATH=$PATH:$HOME/go/bin" >> ~/.bashrc
@@ -153,7 +152,12 @@ All-in-one command:
153152

154153
## Release history
155154

156-
155+
- [TaxonKit v0.18.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.18.0)
156+
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.18.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.18.0)
157+
- `taxonkit reformat`:
158+
- Add a placeholder for rank "realm", "{r}", which is common in Virus taxonomy like [ictv](https://github.com/shenwei356/ictv-taxdump). [#102](https://github.com/shenwei356/taxonkit/issues/102)
159+
- `taxonkit name2taxid`:
160+
- Show warning for names with multiple taxids. [#103](https://github.com/shenwei356/taxonkit/issues/103)
157161
- [TaxonKit v0.17.0](https://github.com/shenwei356/taxonkit/releases/tag/v0.17.0)
158162
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/taxonkit/v0.17.0/total.svg)](https://github.com/shenwei356/taxonkit/releases/tag/v0.17.0)
159163
- `taxonkit filter`:

doc/docs/tutorial.md

+84
@@ -11,6 +11,7 @@
1111
- [Making nr blastdb for specific taxids](#making-nr-blastdb-for-specific-taxids)
1212
- [Summaries of taxonomy data](#summaries-of-taxonomy-data)
1313
- [Merging GTDB and NCBI taxonomy](#merging-gtdb-and-ncbi-taxonomy)
14+
- [Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDs](#filtering-or-subsetting-taxdmp-files-to-make-a-custom-taxdmp-with-given-taxids)
1415

1516
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
1617

@@ -915,6 +916,89 @@ Some tests:
915916
1187493883 genus Escherichia
916917
1945799576 species Escherichia coli
917918

919+
## Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDs
920+
921+
> You want to create a smaller version of the official NCBI taxonomy taxdmp filtered or subset to just the lineages of certain species, for purposes such as creating small test data for testing of tools using taxdmp files.
922+
>
923+
> https://github.com/shenwei356/taxonkit/issues/112
924+
925+
Step 1: preparing taxids in the subset tree
926+
927+
# here, only keep nodes at the rank of species
928+
taxonkit list --ids 707,9606 -I "" \
929+
| taxonkit filter -E species \
930+
| taxonkit lineage -t \
931+
| cut -f 3 \
932+
| sed -s 's/;/\n/g' \
933+
> taxids.txt
934+
935+
# the root node
936+
echo 1 >> taxids.txt
937+
938+
Step 2: extracting data of needed nodes
939+
940+
mkdir subset
941+
942+
grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/nodes.dmp > subset/nodes.dmp
943+
grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/names.dmp > subset/names.dmp
944+
945+
touch subset/delnodes.dmp subset/merged.dmp
946+
947+
948+
Checking it. Since there are only two leaves here, we just dump the whole tree
949+
950+
$ wc -l subset/*.dmp
951+
0 subset/delnodes.dmp
952+
0 subset/merged.dmp
953+
144 subset/names.dmp
954+
39 subset/nodes.dmp
955+
183 total
956+
957+
$ taxonkit list --ids 1 --data-dir subset/ -nr
958+
1 [no rank] root
959+
131567 [no rank] cellular organisms
960+
2 [superkingdom] Bacteria
961+
1224 [phylum] Pseudomonadota
962+
1236 [class] Gammaproteobacteria
963+
135623 [order] Vibrionales
964+
641 [family] Vibrionaceae
965+
662 [genus] Vibrio
966+
28174 [species] Vibrio ordalii
967+
2759 [superkingdom] Eukaryota
968+
33154 [clade] Opisthokonta
969+
33208 [kingdom] Metazoa
970+
6072 [clade] Eumetazoa
971+
33213 [clade] Bilateria
972+
33511 [clade] Deuterostomia
973+
7711 [phylum] Chordata
974+
89593 [subphylum] Craniata
975+
7742 [clade] Vertebrata
976+
7776 [clade] Gnathostomata
977+
117570 [clade] Teleostomi
978+
117571 [clade] Euteleostomi
979+
8287 [superclass] Sarcopterygii
980+
1338369 [clade] Dipnotetrapodomorpha
981+
32523 [clade] Tetrapoda
982+
32524 [clade] Amniota
983+
40674 [class] Mammalia
984+
32525 [clade] Theria
985+
9347 [clade] Eutheria
986+
1437010 [clade] Boreoeutheria
987+
314146 [superorder] Euarchontoglires
988+
9443 [order] Primates
989+
376913 [suborder] Haplorrhini
990+
314293 [infraorder] Simiiformes
991+
9526 [parvorder] Catarrhini
992+
314295 [superfamily] Hominoidea
993+
9604 [family] Hominidae
994+
207598 [subfamily] Homininae
995+
9605 [genus] Homo
996+
9606 [species] Homo sapiens
997+
998+
999+
$ echo 28174 | taxonkit lineage -nr --data-dir subset/
1000+
28174 cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;Vibrio ordalii Vibrio ordalii species
1001+
9181002

9191003
<div id="disqus_thread"></div>
9201004
<script>

doc/docs/usage.md

+4-4
@@ -1029,7 +1029,7 @@ Input:
10291029
10301030
- List of TaxIds, one record per line.
10311031
- Or tab-delimited format.
1032-
Plese specify the TaxId field with flag -I/--taxid-field (default 1)
1032+
Please specify the TaxId field with flag -I/--taxid-field (default 1)
10331033
- Supporting (gzipped) file or STDIN.
10341034
10351035
Output:
@@ -1040,7 +1040,7 @@ Output:
10401040
10411041
Output format:
10421042
1043-
1. it can contains some escape charactors like "\t".
1043+
1. it can contain some escape characters like "\t".
10441044
2. For subspecies nodes, the rank might be "subpecies", "strain", or "no rank".
10451045
You can use "|" to set multiple ranks, and the first valid one will be outputted.
10461046
For example,
@@ -1060,7 +1060,7 @@ Differences from 'taxonkit reformat':
10601060
- do not automatically add prefixes, but you can set in the format
10611061
10621062
Usage:
1063-
taxonkit reformat2 [flags]
1063+
taxonkit reformat2 [flags]
10641064
10651065
Flags:
10661066
-f, --format string output format, placeholders of rank are needed (default
@@ -1073,7 +1073,7 @@ Flags:
10731073
-t, --show-lineage-taxids show corresponding taxids of reformated lineage
10741074
-I, --taxid-field int field index of taxid. input data should be tab-separated. it overrides
10751075
-i/--lineage-field (default 1)
1076-
-T, --trim do not replace missing ranks lower than the rank of current node
1076+
-T, --trim do not replace missing ranks lower than the rank of the current node
10771077
10781078
```
10791079

taxonkit/cmd/reformat2.go

+4-4
@@ -43,18 +43,18 @@ Input:
4343
4444
- List of TaxIds, one record per line.
4545
- Or tab-delimited format.
46-
Plese specify the TaxId field with flag -I/--taxid-field (default 1)
46+
Please specify the TaxId field with flag -I/--taxid-field (default 1)
4747
- Supporting (gzipped) file or STDIN.
4848
4949
Output:
5050
5151
1. Input line data.
5252
2. Reformated lineage.
5353
3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids)
54-
54+
5555
Output format:
5656
57-
1. it can contains some escape charactors like "\t".
57+
1. it can contain some escape characters like "\t".
5858
2. For subspecies nodes, the rank might be "subpecies", "strain", or "no rank".
5959
You can use "|" to set multiple ranks, and the first valid one will be outputted.
6060
For example,
@@ -310,7 +310,7 @@ func init() {
310310
reformat2Cmd.Flags().StringP("format", "f", "{superkingdom};{phylum};{class};{order};{family};{genus};{species}", "output format, placeholders of rank are needed")
311311
reformat2Cmd.Flags().StringP("miss-rank-repl", "r", "", `replacement string for missing rank`)
312312
reformat2Cmd.Flags().StringP("miss-taxid-repl", "R", "", `replacement string for missing taxid`)
313-
reformat2Cmd.Flags().BoolP("trim", "T", false, "do not replace missing ranks lower than the rank of current node")
313+
reformat2Cmd.Flags().BoolP("trim", "T", false, "do not replace missing ranks lower than the rank of the current node")
314314

315315
reformat2Cmd.Flags().IntP("taxid-field", "I", 1, "field index of taxid. input data should be tab-separated. it overrides -i/--lineage-field")
316316
reformat2Cmd.Flags().BoolP("show-lineage-taxids", "t", false, `show corresponding taxids of reformated lineage`)
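
Note on Step 2 of the tutorial section added in `doc/docs/tutorial.md`: the commit selects rows with `grep -w -f` and `^`-anchored patterns. As an alternative, the following is a minimal sketch that compares the TaxId column exactly with `awk`. It assumes that `taxids.txt` holds one TaxId per line and that the `.dmp` files use the NCBI layout with the TaxId as the first tab-separated field. `subset_dmp` is a hypothetical helper name used only for illustration; it is not a TaxonKit command.

```shell
# subset_dmp TAXIDS_FILE DMP_FILE
# Print the rows of DMP_FILE whose first tab-separated field (the TaxId)
# is listed in TAXIDS_FILE. Hypothetical helper, not part of TaxonKit.
subset_dmp() {
    # While reading the first file (NR == FNR), remember every TaxId.
    # While reading the second file, print rows whose first field was remembered.
    awk -F'\t' 'NR == FNR { keep[$1]; next } $1 in keep' "$1" "$2"
}

# Possible usage, assuming taxids.txt from Step 1 and the default data directory:
#   mkdir -p subset
#   subset_dmp taxids.txt ~/.taxonkit/nodes.dmp > subset/nodes.dmp
#   subset_dmp taxids.txt ~/.taxonkit/names.dmp > subset/names.dmp
```

Because the comparison is an exact string match on the first field, a TaxId such as 9606 cannot select rows for 96061 or 19606, and the order of rows in the source file is preserved.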
