File: README.Rmd
Lines changed: 98 additions & 54 deletions (+98, -54)
Original file line number
Diff line number
Diff line change
@@ -1,53 +1,73 @@
1
1
---
2
2
title: "README"
3
3
output: github_document
4
+
editor_options:
5
+
markdown:
6
+
wrap: sentence
4
7
---
5
8
6
9
```{r setup, include=FALSE}
7
10
knitr::opts_chunk$set(echo = TRUE)
8
11
```
9
12
10
13
# clustuneR
11
-
### Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
12
14
13
-
clustuneR builds clusters from edge lists or phylogenetic trees, allowing users to choose between multiple cluster-building algorithms implemented in the package.
14
-
These algorithms can be further augmented through the selection of parameters, such as a required similarity for cluster formation, or a required level of certainty.
15
-
The package also takes in meta-data associated with sequences such as a known collection date or subtype/variant classification.
16
-
These can also allow users to identify cluster-level characteristics, such as the range of collection dates or the most common subtype/variant within a cluster.
17
-
18
-
If a subset of sequences are specified as "New", then clustuneR simulates cluster growth by building clusters in two stages:
19
-
first clusters are built from sequences which are not specified as new, then the new sequences are added to clusters.
20
-
Depending on the clustering method used, this second step may include compromises to insure that new sequences do not retroactively change the membership of clusters.
21
-
For example, if a single new sequence forms a cluster with two, previously separate clusters, then those two clusters would have ambiguous growth.
22
-
Pairing cluster-level meta-data, with the growth of clusters is a common goal in research and clustuneR contains some functions to help test predictive models based on cluster data.
23
-
Furthermore, clustuneR facilitates the assignment of multiple cluster sets from the same data using different methods and parameters.
24
-
Pairing these with the effectiveness of growth models can be useful in method/parameter selection.
15
+
### Optimizing genetic clustering methods on the performance of predictive growth models.
25
16
17
+
A genetic cluster is a grouping of sequences that are markedly more similar to each other than to other sequences in the data set.
18
+
Genetic clustering has many applications in biology, such as defining taxonomic groups.
19
+
In the molecular epidemiology of infectious diseases, it can be used to characterize the transmission of a pathogen through the population.
20
+
For instance, a cluster of sequences can represent a recent transmission outbreak, especially for rapidly evolving pathogens.
26
21
22
+
Most clustering methods require the user to select one or more criteria defining groups.
23
+
**clustuneR** provides a statistical framework to select optimal clustering criteria, based on the premise that the most effective clustering should maximize our ability to predict the distribution of new cases among clusters.
27
24
28
25
### Installation
29
26
30
-
> Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux systems.
27
+
> ⚠️ Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux and macOS systems.
28
+
29
+
#### Download the package
31
30
32
31
If you have the [`git`](https://git-scm.com/) version control system installed on your computer, you can clone the repository by navigating to a location of your filesystem where the package will be copied, and then running
If you do not have `git` installed, then you can download the most recent (development) version of the package as a ZIP archive at this link: <https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip>
39
38
40
-
or from the Releases page:
41
-
https://github.com/PoonLab/clustuneR/releases
39
+
or from the Releases page: <https://github.com/PoonLab/clustuneR/releases>
42
40
43
41
If you have downloaded a `.zip` or `.tar.gz` archive, you can use `unzip` or `tar -zvxf` on the command line, or double-click on the archive file in your desktop environment.
44
42
43
+
#### Running macOS binaries
44
+
45
+
macOS will prevent you from running the *pplacer* and *guppy* binaries that are distributed with this package.
46
+
To bypass this safety mechanism, you will need to follow these steps:
47
+
48
+
1. Use Finder or Terminal to navigate to the `inst` folder in the package directory.
49
+
2. Attempt to execute the `pplacer.Darwin` binary. If using Finder, double-click on the `pplacer` file, which spawns a Terminal window running the binary. If using Terminal, enter the command `./pplacer.Darwin`. Your system should display a pop-up with the message `"pplacer" cannot be opened because the developer cannot be verified`. Click on the *Cancel* button to dismiss the pop-up.
50
+
3. Open the System Settings app and click on the Privacy & Security tab. In one of the panels, you should see the following label: `"pplacer.Darwin" was blocked from use because it is not from an identified developer`. Click on the *Allow Anyway* button.
51
+
4. Repeat step 2. The pop-up message should now be changed to `macOS cannot verify the developer of "pplacer.Darwin". Are you sure you want to open it?` Click on the *Open* button. Your Terminal window should now be updated with the following text:\
52
+
`Warning: pplacer couldn't find any sequences to place. Please supply an alignment with sequences to place as an argument at the end of the command line.`\
53
+
This means the program is running properly.
54
+
5. Repeat steps 2-4 by substituting `guppy.Darwin` for `pplacer.Darwin`.
55
+
56
+
You can find similar instructions on the Apple website at <https://support.apple.com/en-ca/HT202491>
57
+
58
+
If you do not want to trust the binaries in this package distribution, you can download the macOS binaries directly from the Matsen lab [GitHub release page](https://github.com/matsen/pplacer/releases/tag/v1.1.alpha17) or compile them from source yourself.
59
+
60
+
#### Package installation
61
+
45
62
Use `cd clustuneR` to enter the package directory and run the following command to install the package into R:
46
-
```
63
+
64
+
```
47
65
R CMD INSTALL .
48
66
```
67
+
49
68
You should see something like this on your console:
50
-
```
69
+
70
+
```
51
71
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
52
72
* installing *source* package ‘clustuneR’ ...
53
73
** using staged installation
@@ -65,15 +85,32 @@ You should see something like this on your console:
65
85
* DONE (clustuneR)
66
86
```
67
87
88
+
The process will pause at `moving datasets to lazyloadDB` because there are several large binary files (`pplacer` and `guppy`) that are included with this package distribution.
68
89
69
90
## Usage
70
91
92
+
**clustuneR** can optimize clustering methods based on either pairwise genetic distances (graph-based clustering) or a phylogenetic tree (subtree-based clustering).
93
+
At minimum, you will need to start with an alignment of genetic sequences and the respective sample collection dates as metadata.
94
+
95
+
The general workflow is:
96
+
97
+
1. Partition the sequences into subsets of known and new cases, based on collection dates.
98
+
2. Generate sets of clusters from the known cases under varying clustering criteria.
99
+
3. Obtain the distribution of new cases among clusters under the different criteria as a measure of cluster growth.
100
+
4. Fit a null model of cluster growth as a count outcome predicted by cluster size.
101
+
5. Fit an alternate model of cluster growth incorporating additional metadata, e.g., sampling dates.
102
+
6. Generate a ∆AIC profile comparing the fits of these models under varying clustering criteria.
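Steps 4 to 6 of the workflow above can be sketched with base R alone.
This is a minimal illustration and not the clustuneR interface: the cluster table, the `recency` covariate and all values are made up, and ∆AIC is assumed here to mean the AIC of the alternate model minus the AIC of the null model.

```r
# Toy cluster table (made-up values): one row per cluster of known cases.
# `growth` is the number of new cases that joined each cluster.
clusters <- data.frame(
  size    = c(1, 2, 3, 5, 8, 13),
  recency = c(-3, -1, -2, 0, 1, 1),  # hypothetical date-based covariate
  growth  = c(0, 0, 1, 1, 3, 5)
)

# Null model: cluster growth as a count outcome predicted by cluster size.
null_fit <- glm(growth ~ size, data = clusters, family = poisson)

# Alternate model: adds the date-based covariate.
alt_fit <- glm(growth ~ size + recency, data = clusters, family = poisson)

# Assumed sign convention: negative values favour the alternate model.
delta_aic <- AIC(alt_fit) - AIC(null_fit)
print(delta_aic)
```

Repeating this calculation for cluster sets built under different criteria gives one ∆AIC value per criterion, which together form the profile described in step 6.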
71
103
72
104
### Graph-based clustering
73
105
74
-
There are numerous data structures which we can use as a basis for clustering.
75
-
clustuneR also contains functions to support graph-based clustering.
76
-
To utilize these, we start by reading in an aligment, extracting metadata and determining the year that would define the partition of new sequences.
106
+
A graph consists of a set of nodes and edges.
107
+
Each node represents an infection.
108
+
An edge between nodes indicates that the genetic distance between the respective infections falls below some threshold.
109
+
Conventionally, we interpret each connected component of the graph as a cluster.
110
+
A connected component is a group of nodes such that (1) every node can be reached from every other node in the group through a path of edges, and (2) there are no edges to nodes outside of the group.
111
+
Varying the threshold for edges yields different sets of clusters.
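To make the effect of the threshold concrete, here is a small base R sketch that labels connected components for a toy graph.
The function `components()` is a hypothetical helper written for this illustration (it is not part of clustuneR), and the node labels and distances are made up.

```r
# Hypothetical helper: label the connected components of the graph formed
# by edges whose distance falls below `threshold` (simple union-find).
components <- function(nodes, edges, threshold) {
  parent <- setNames(nodes, nodes)  # every node starts as its own root
  find <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }
  keep <- edges[edges$distance < threshold, ]
  for (i in seq_len(nrow(keep))) {
    a <- find(keep$id1[i])
    b <- find(keep$id2[i])
    if (a != b) parent[[b]] <- a  # merge the two components
  }
  # Named vector mapping each node to the root label of its component.
  vapply(nodes, find, character(1))
}

# Made-up graph: a chain of five nodes with varying pairwise distances.
nodes <- c("A", "B", "C", "D", "E")
edges <- data.frame(
  id1 = c("A", "B", "C", "D"),
  id2 = c("B", "C", "D", "E"),
  distance = c(0.01, 0.02, 0.04, 0.01),
  stringsAsFactors = FALSE
)

# Number of clusters at three hypothetical thresholds.
for (th in c(0.015, 0.03, 0.05)) {
  labels <- components(nodes, edges, th)
  cat("threshold", th, ":", length(unique(labels)), "clusters\n")
}
```

A stricter (smaller) threshold removes edges and splits the graph into more, smaller clusters, while a more relaxed threshold merges them.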
112
+
113
+
In the following example, we start by reading in a sequence alignment, extracting metadata from the sequence labels, and identifying a subset of new sequences:
You may already have these metadata in the form of a tabular data set (*i.e.*, a CSV file), in which case you can simply load these metadata as a data frame.
131
+
132
+
Next, we need to load a list of edges, where each row specifies two node labels and a distance.
133
+
These data can be generated from a sequence alignment using the program [TN93](https://github.com/veg/tn93).
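As a rough illustration of the edge list format described above, the following sketch parses a small comma-separated table into a data frame.
The column names `ID1`, `ID2` and `Distance`, the sequence labels and the threshold are all assumptions made for this example, so check the header of your own TN93 output before relying on them.

```r
# Made-up edge list; the column names are assumed for illustration only.
csv_text <- "ID1,ID2,Distance
seqA,seqB,0.012
seqA,seqC,0.031
seqB,seqC,0.027"

edges <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Each row is one edge: two node labels and a genetic distance.
str(edges)

# Keep only the edges below a hypothetical distance threshold.
close_pairs <- edges[edges$Distance < 0.03, ]
print(close_pairs)
```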
Next, we want to configure `clustuneR` to fit two Poisson regression models to the distribution of new cases among clusters, for a range of genetic distance thresholds:
206
+
171
207
```{r}
172
208
# generate cluster sets under varying parameter settings
@@ -209,22 +246,29 @@ If every known infection is merged into a single giant cluster, then there is no
209
246
If every infection each becomes a cluster of one, then there will be excessive information loss due to random variation in sampling dates.
210
247
At the threshold that minimizes `delta.AIC`, the known infections are partitioned into clusters in a way that minimizes the information loss associated with incorporating sample dates into the predictive model.
211
248
212
-
213
249
## References
214
250
215
251
If you use `clustuneR` for your work, please cite one of the following references:
216
252
217
-
*Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection. Virus evolution. 2020 Jan;6(1):veaa011.
218
-
219
-
* Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications. PLOS Computational Biology. 2022 Nov 30;18(11):e1010745.
253
+
- Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection.
254
+
Virus evolution.
255
+
2020 Jan;6(1):veaa011.
220
256
221
-
This package includes the binaries for pplacer and guppy (https://matsen.fhcrc.org/pplacer, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
257
+
- Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications.
258
+
PLOS Computational Biology.
259
+
2022 Nov 30;18(11):e1010745.
222
260
223
-
* Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
261
+
This package includes the binaries for pplacer and guppy (<https://matsen.fhcrc.org/pplacer>, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
224
262
225
-
As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set. These sequences were originally published in a study by Vrancken *et al.* (2017) and publicly accessible in the GenBank database under the PopSet accession `1033910942`.
263
+
- Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
226
264
227
-
* Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic acids research. 2000 Jan 1;28(1):15-8.
265
+
As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set.
266
+
These sequences were originally published in a study by Vrancken *et al.* (2017) and are publicly accessible in the GenBank database under the PopSet accession `1033910942`.
228
267
229
-
* Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.
268
+
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank.
269
+
Nucleic acids research.
270
+
2000 Jan 1;28(1):15-8.
230
271
272
+
- Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data.