Skip to content

Commit 7a7378a

Browse files
committed
updating documentation
1 parent 30858e5 commit 7a7378a

File tree

1 file changed

+98
-54
lines changed

1 file changed

+98
-54
lines changed

README.Rmd

Lines changed: 98 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -1,53 +1,73 @@
11
---
22
title: "README"
33
output: github_document
4+
editor_options:
5+
markdown:
6+
wrap: sentence
47
---
58

69
```{r setup, include=FALSE}
710
knitr::opts_chunk$set(echo = TRUE)
811
```
912

1013
# clustuneR
11-
### Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
1214

13-
clustuneR builds clusters from edge lists or phylogenetic trees, allowing users to choose between multiple cluster-building algorithms implemented in the package.
14-
These algorithms can be further augmented through the selection of parameters, such as a required similarity for cluster formation, or a required level of certainty.
15-
The package also takes in meta-data associated with sequences such as a known collection date or subtype/variant classification.
16-
These can also allow users to identify cluster-level characteristics, such as the range of collection dates or the most common subtype/variant within a cluster.
17-
18-
If a subset of sequences are specified as "New", then clustuneR simulates cluster growth by building clusters in two stages:
19-
first clusters are built from sequences which are not specified as new, then the new sequences are added to clusters.
20-
Depending on the clustering method used, this second step may include compromises to insure that new sequences do not retroactively change the membership of clusters.
21-
For example, if a single new sequence forms a cluster with two, previously separate clusters, then those two clusters would have ambiguous growth.
22-
Pairing cluster-level meta-data, with the growth of clusters is a common goal in research and clustuneR contains some functions to help test predictive models based on cluster data.
23-
Furthermore, clustuneR facilitates the assignment of multiple cluster sets from the same data using different methods and parameters.
24-
Pairing these with the effectiveness of growth models can be useful in method/parameter selection.
15+
### Optimizating genetic clustering methods on the performance of predictive growth models.
2516

17+
A genetic cluster is a grouping of sequences that are markedly more similar to each other than to other sequences in the data set.
18+
Genetic clustering has many applications in biology, such as defining taxonomic groups.
19+
In the molecular epidemiology of infectious diseases, it can be used to characterize the transmission of a pathogen through the population.
20+
For instance, a cluster of sequences can represent a recent transmission outbreak, especially for rapidly-evolving pathogens.
2621

22+
Most clustering methods require the user to select one or more criteria defining groups.
23+
**clustuneR** provides a statistical framework to select optimal clustering criteria, based on the premise that the most effective clustering should maximize our ability to predict the distribution of new cases among clusters.
2724

2825
### Installation
2926

30-
> Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux systems.
27+
> ⚠️ Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux and macOS systems.
28+
29+
#### Download the package
3130

3231
If you have the [`git`](https://git-scm.com/) version control system installed on your computer, you can clone the repository by navigating to a location of your filesystem where the package will be copied, and then running
33-
```
32+
33+
```
3434
git clone https://github.com/PoonLab/clustuneR.git
3535
```
3636

37-
If you do not have `git` installed, then you can download the most recent (developmental version) package as a ZIP archive at this link:
38-
https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip
37+
If you do not have `git` installed, then you can download the most recent (developmental version) package as a ZIP archive at this link: <https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip>
3938

40-
or from the Releases page:
41-
https://github.com/PoonLab/clustuneR/releases
39+
or from the Releases page: <https://github.com/PoonLab/clustuneR/releases>
4240

4341
If you have downloaded a `.zip` or `.tar.gz` archive, you can use `unzip` or `tar -zvxf` on the command line, or double-click on the archive file in your desktop environment.
4442

43+
#### Running macOS binaries
44+
45+
macOS will prevent you from running the *pplacer* and *guppy* binaries that are distributed with this package.
46+
To bypass this safety mechanism, you will need to follow these steps:
47+
48+
1. Use Finder or Terminal to navigate to the `inst` folder in the package directory.
49+
2. Attempt to execute the `pplacer.Darwin` binary. If using Finder, double-click on the `pplacer` file, which spawns a Terminal window running the binary. If using Terminal, enter the command `./pplacer.Darwin`. Your system should display a pop-up with the message `"pplacer" cannot be opened because the developed cannot be verified`. Click on the *Cancel* button to dismiss the pop-up.
50+
3. Open the System Settings app and click on the Privacy & Security tab. In one of the panels, you should see the following label: `"pplacer.Darwin" was blocked from use because it is not from an identified developer`. Click on the *Allow Anyway* button.
51+
4. Repeat step 2. The pop-up message should now be changed to `macOS cannot verify the developer of "pplacer.Darwin". Are you sure you want to open it?` Click on the *Open* button. Your Terminal window should now be updated with the following text:\
52+
`Warning: pplacer couldn't find any sequences to place. Please supply an alignment with sequences to place as an argument at the end of the command line.`\
53+
This means the program is running properly.
54+
5. Repeat steps 2-4 by substituting `guppy.Darwin` for `pplacer.Darwin`.
55+
56+
You can find similar instructions on the Apple website at <https://support.apple.com/en-ca/HT202491>
57+
58+
If you do not want to trust the binaries in this package distribution, you can download the macOS binaries directly from the Matsen lab [GitHub release page](https://github.com/matsen/pplacer/releases/tag/v1.1.alpha17) or compile them from source yourself.
59+
60+
#### Package installation
61+
4562
Use `cd clustuneR` to enter the package directory and run the following command to install the package into R:
46-
```
63+
64+
```
4765
R CMD INSTALL .
4866
```
67+
4968
You should see something like this on your console:
50-
```
69+
70+
```
5171
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
5272
* installing *source* package ‘clustuneR’ ...
5373
** using staged installation
@@ -65,15 +85,32 @@ You should see something like this on your console:
6585
* DONE (clustuneR)
6686
```
6787

88+
The process will pause at `moving datasets to lazyloadDB` because there are several large binary files (`pplacer` and `guppy`) that are included with this package distribution.
6889

6990
## Usage
7091

92+
**clustuneR** can optimize clustering methods based on either pairwise genetic distances (graph-based clustering) or a phylogenetic tree (subtree-based clustering).
93+
At minimum, you will need to start with an alignment of genetic sequences and the respective sample collection dates as metadata.
94+
95+
The general workflow is:
96+
97+
1. Partition the sequences into subsets of known and new cases, based on collection dates.
98+
2. Generate sets of clusters from the known cases under varying clustering criteria.
99+
3. Obtain the distribution of new cases among clusters under the different criteria as a measure of cluster growth.
100+
4. Fit a null model of cluster growth as a count outcome predicted by cluster size.
101+
5. Fit an alternate model of cluster growth incorporating additional metadata, e.g., sampling dates.
102+
6. Generate a ∆AIC profile comparing the fits of these models under varying clustering criteria.
71103

72104
### Graph-based clustering
73105

74-
There are numerous data structures which we can use as a basis for clustering.
75-
clustuneR also contains functions to support graph-based clustering.
76-
To utilize these, we start by reading in an aligment, extracting metadata and determining the year that would define the partition of new sequences.
106+
A graph consists of a set of nodes and edges.
107+
Each node represents an infection.
108+
An edge between nodes indicates that the genetic similarity of the respective infections falls below some threshold.
109+
Conventionally, we interpret each connected component of the graph as a cluster.
110+
A connected component is a group of nodes such that (1) every node can be reached from another node through a path of edges, and (1) there are no edges to nodes outside of the group.
111+
Varying the threshold for edges yields different sets of clusters.
112+
113+
In the following example, we start by reading in a sequence alignment, extracting metadata from the sequence labels, and identifying a subset of new sequences:
77114

78115
```{r}
79116
require(clustuneR)
@@ -84,32 +121,28 @@ seq.info <- parse.headers(names(seqs), sep="_",
84121
var.names=c('accession', 'coldate', 'subtype'),
85122
var.transformations=c(as.character, as.Date, as.factor)
86123
)
124+
seq.info$colyear <- year(seq.info$coldate)
87125
88126
# determine newest year for growth
89-
max.year <- max(year(seq.info$coldate))
90-
which.new <- which(year(seq.info$coldate) == max.year)
127+
which.new <- which(seq.info$colyear == max(seq.info$colyear))
91128
```
92129

93-
Once we have a set of sequences read into R and we've determined which of them
94-
are "new", we can calculate the pairwise distances between them and use this as
95-
the basis for a graph.
96-
In this graph, vertices represent sequences and edges are
97-
weighted based on genetic distance beetween them.
98-
Clusters can be defined as the
99-
connected components remaining after edges over a genetic distance threshold are
100-
removed.
101-
Much like the method described above, we can obtain a range of cluster
102-
sets based corresponding to a range of genetic distance thresholds, at which point
103-
the same poisson regression analysis can be preformed.
104-
105-
``` r
106-
# load genetic distances (run `tn93 -t 0.04 -o na.tn93.csv na.fasta`)
130+
You may already have these metadata in the form of a tabular data set (*i.e.*, a CSV file), in which case you can simply load these metadata as a data frame.
131+
132+
Next, we need to load a list of edges, where each row specifies two node labels and a distance.
133+
These data can be generated from a sequence alignment using the program [TN93](https://github.com/veg/tn93).
134+
135+
```{r}
136+
# load genetic distances (run `tn93 -t 1 -o na.tn93.csv na.fasta`)
107137
edge.info <- read.csv("data/na.tn93.csv")
108138
109-
obj <- create.graph(seq.info, edge.info, which.new)
139+
obj <- read.edges(edge.info, seq.info, which.new)
110140
111141
# generate cluster sets under varying parameter settings
112-
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(dist.thresh=x)})
142+
cutoffs <- seq(0, 0.04, length.out=50)
143+
param.list <- lapply(1:50, function(i) {
144+
list(dist.thresh=cutoffs[i], setID=i, time.var="colyear")
145+
})
113146
cluster.sets <- multi.cluster(obj, param.list, component.cluster)
114147
115148
res <- fit.analysis(cluster.sets, predictive.models=p.models,
@@ -123,12 +156,12 @@ plot(cutoffs, delta.AIC, type='l', col='cadetblue', lwd=2)
123156
abline(h=0, lty=2)
124157
```
125158

126-
127159
### Building a tree
128160

129161
We start with a multiple sequence alignment of sequences that are labelled with sample collection dates.
130162
An example of anonymized public domain HIV-1 sequences from a study based in northern Alberta (Canada) is provided in `data/na.fasta`.
131163
First, we use an R script to exclude the sequences collected in the most recent year:
164+
132165
```{r message=FALSE}
133166
require(clustuneR)
134167
require(ape)
@@ -147,17 +180,19 @@ write.FASTA(old.seqs, file="data/na-old.fasta")
147180
```
148181

149182
Next, we use IQ-TREE to reconstruct a maximum likelihood tree relating the "old" sequences:
150-
```console
183+
184+
``` console
151185
iqtree -bb 1000 -m GTR -nstop 200 -s na-old.fasta
152186
```
187+
153188
Note we've specified the generalized time reversible model of nucleotide substitution to bypass the model selection stage.
154189
Even so, this is a time-consuming step - to speed things up, we've provided IQ-TREE output files at `data/na.nwk` and `data/na.log`.
155190

156-
157191
### Grafting new sequences
158192

159193
Next, we import both the sequence alignment and the ML tree into R.
160194
We will use `clustuneR` to graft the sequences from the most recent year using the program `pplacer` and the output files from IQ-TREE.
195+
161196
```{r warning=FALSE}
162197
phy <- ape::read.tree("data/na.nwk")
163198
@@ -168,6 +203,7 @@ phy.extend <- extend.tree(phy, seq.info, seqs, mc.cores=4, log.file="data/na.log
168203
### Finding the optimal threshold
169204

170205
Next, we want to configure `clustuneR` to fit two Poisson regression models to the distribution of new cases among clusters, for a range of genetic distance thresholds:
206+
171207
```{r}
172208
# generate cluster sets under varying parameter settings
173209
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) list(branch.thresh=x, boot.thresh=0.95))
@@ -193,6 +229,7 @@ delta.AIC <- AICs$TimeModelAIC - AICs$NullModelAIC
193229
```
194230

195231
We can visualize the difference in AICs between models as a function of the distance threshold:
232+
196233
```{r}
197234
cutoffs <- sapply(param.list, function(x) x$branch.thresh)
198235
par(mar=c(5,5,1,1))
@@ -209,22 +246,29 @@ If every known infection is merged into a single giant cluster, then there is no
209246
If every infection each becomes a cluster of one, then there will be excessive information loss due to random variation in sampling dates.
210247
At the threshold that minimizes `delta.AIC`, the known infections are partitioned into clusters in such a way that minimizes the information loss associated with incorporating sample dates into the predictive model.
211248

212-
213249
## References
214250

215251
If you use `clustuneR` for your work, please cite one of the following references:
216252

217-
* Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection. Virus evolution. 2020 Jan;6(1):veaa011.
218-
219-
* Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications. PLOS Computational Biology. 2022 Nov 30;18(11):e1010745.
253+
- Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection.
254+
Virus evolution.
255+
2020 Jan;6(1):veaa011.
220256

221-
This package includes the binaries for pplacer and guppy (https://matsen.fhcrc.org/pplacer, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
257+
- Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications.
258+
PLOS Computational Biology.
259+
2022 Nov 30;18(11):e1010745.
222260

223-
* Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
261+
This package includes the binaries for pplacer and guppy (<https://matsen.fhcrc.org/pplacer>, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
224262

225-
As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set. These sequences were originally published in a study by Vrancken *et al.* (2017) and publicly accessible in the GenBank database under the PopSet accession `1033910942`.
263+
- Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
226264

227-
* Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic acids research. 2000 Jan 1;28(1):15-8.
265+
As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set.
266+
These sequences were originally published in a study by Vrancken *et al.* (2017) and publicly accessible in the GenBank database under the PopSet accession `1033910942`.
228267

229-
* Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.
268+
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank.
269+
Nucleic acids research.
270+
2000 Jan 1;28(1):15-8.
230271

272+
- Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data.
273+
Infection, Genetics and Evolution.
274+
2017 Aug 1;52:100-5.

0 commit comments

Comments
 (0)