README.Rmd
+71-83
@@ -110,9 +110,9 @@ Conventionally, we interpret each connected component of the graph as a cluster.
110
110
A connected component is a group of nodes such that (1) every node can be reached from every other node in the group through a path of edges, and (2) there are no edges to nodes outside of the group.
111
111
Varying the threshold for edges yields different sets of clusters.
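
To illustrate how the threshold changes the clustering, here is a minimal base-R sketch. The edge list, the `get_components` helper and the rule that an edge is kept when its distance is at or below the cutoff are all assumptions made for this example; none of them are part of **clustuneR**.

```r
# Toy edge list: two node labels and a pairwise distance per row.
# Labels and distances are made up for illustration.
edges <- data.frame(
  ID1 = c("A", "B", "C", "D", "E"),
  ID2 = c("B", "C", "D", "E", "F"),
  Distance = c(0.005, 0.012, 0.030, 0.008, 0.020),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a clustuneR function): label the connected
# components that remain after dropping edges longer than `cutoff`.
get_components <- function(edges, cutoff) {
  nodes <- sort(unique(c(edges$ID1, edges$ID2)))
  comp <- setNames(seq_along(nodes), nodes)  # every node starts alone
  keep <- edges[edges$Distance <= cutoff, ]
  for (i in seq_len(nrow(keep))) {
    a <- comp[[keep$ID1[i]]]
    b <- comp[[keep$ID2[i]]]
    if (a != b) comp[comp == b] <- a  # merge the two components
  }
  comp
}

# Number of clusters at three thresholds
n_clusters <- sapply(c(0.01, 0.02, 0.05), function(h) {
  length(unique(get_components(edges, h)))
})
print(n_clusters)
```

Raising the cutoff admits more edges, so components merge and the number of clusters can only stay the same or shrink.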
112
112
113
-
In the following example, we start by reading in a sequence alignment, extracting metadata from the sequence labels, and identifying a subset of new sequences:
113
+
In the following example, we start by reading in a sequence alignment (a published set of anonymized HIV-1 sequences from Canada), extracting metadata from the sequence labels, and identifying a subset of new sequences:
You may already have these metadata in the form of a tabular data set (*e.g.*, a CSV file), in which case you can simply load these metadata as a data frame.
130
+
> You may already have these metadata in the form of a tabular data set (*e.g.*, a CSV file), in which case you can simply load these metadata as a data frame.
131
131
132
132
Next, we need to load a list of edges, where each row specifies two node labels and a distance.
133
133
These data can be generated from a sequence alignment using the program [TN93](https://github.com/veg/tn93).
134
+
The resulting output file is enormous (\>34MB), so we do not include it in this package!
By specifying a `time.var` argument in `param.list`, we are fitting a model to the distribution of sample collection years to predict edges between cases.
150
+
For a more detailed explanation of this method, please refer to the vignettes.
res <- fit.analysis(cluster.sets, models=pmods, transforms=ptrans)
161
+
gaic <- get.AIC(res, param.list)
157
162
```
158
163
159
-
### Building a tree
164
+
Here, `gaic` is a data frame that stores the key result of our analysis - the AIC values associated with the two models under varying clustering thresholds.
165
+
The optimal TN93 distance cutoff is identified by the greatest difference between the AICs of the alternative and null models, which we can visualize as a plot:
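
As a rough sketch of that selection step, the code below picks the cutoff from a made-up table. The column names (`cutoff`, `null.AIC`, `alt.AIC`, `delta.AIC`) and all values are assumptions for illustration; the actual layout of `gaic` may differ.

```r
# Hypothetical AIC table with one row per cutoff (values are made up)
aic_tab <- data.frame(
  cutoff = c(0.005, 0.010, 0.015, 0.020),
  null.AIC = c(410.2, 405.7, 398.1, 402.4),
  alt.AIC = c(408.9, 396.3, 391.0, 401.8)
)

# Difference between the alternative and null models; more negative
# values mean the alternative model improves more on the null.
aic_tab$delta.AIC <- aic_tab$alt.AIC - aic_tab$null.AIC

# Cutoff with the greatest improvement (lowest delta.AIC)
best <- aic_tab$cutoff[which.min(aic_tab$delta.AIC)]
print(best)
```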
160
166
161
-
We start with a multiple sequence alignment of sequences that are labelled with sample collection dates.
162
-
An example of anonymized public domain HIV-1 sequences from a study based in northern Alberta (Canada) is provided in `data/na.fasta`.
163
-
First, we use an R script to exclude the sequences collected in the most recent year:
Next, we use IQ-TREE to reconstruct a maximum likelihood tree relating the "old" sequences:
188
+
Next, we use a maximum likelihood program such as [IQ-TREE](http://www.iqtree.org/) to reconstruct a tree relating these "old" sequences:
183
189
184
190
```console
iqtree -bb 1000 -m GTR -nstop 200 -s na-old.fasta
```
187
193
188
-
Note we've specified the generalized time reversible model of nucleotide substitution to bypass the model selection stage.
189
-
Even so, this is a time-consuming step - to speed things up, we've provided IQ-TREE output files at `data/na.nwk` and `data/na.log`.
194
+
Note we've requested a specific model of nucleotide substitution (GTR) to bypass the model selection stage of this program.
195
+
Even so, this is a time-consuming step - to speed things up, we've provided these IQ-TREE output files at `data/na.nwk` and `data/na.log`.
190
196
191
-
### Grafting new sequences
197
+
> **clustuneR** uses a program (`pplacer`) that can work with the outputs of IQ-TREE, [FastTree](http://www.microbesonline.org/fasttree/) and [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/).
198
+
> You'll have to specify which ML tree reconstruction program you used in the next step.
192
199
193
-
Next, we import both the sequence alignment and the ML tree into R.
194
-
We will use `clustuneR` to graft the sequences from the most recent year using the program `pplacer` and the output files from IQ-TREE.
200
+
Assuming you've kept R running, our next step is to import the ML tree into R.
201
+
(If you quit R, you'll have to repeat the previous steps to import the alignment and parse headers.) We can then use `pplacer` to graft the "new" sequences onto this tree by maximum likelihood.
195
202
196
-
```{r warning=FALSE}
203
+
```{r warning=FALSE, eval=FALSE}
197
204
phy <- ape::read.tree("data/na.nwk")
198
-
199
-
# use pplacer to graft new sequences onto old tree
Next, we want to configure `clustuneR` to fit two Poisson regression models to the distribution of new cases among clusters, for a range of genetic distance thresholds:
213
+
We can reuse the `cutoffs` vector from the previous example to configure a new parameter list for generating different sets of clusters.
214
+
In this case, we have two criteria: (1) a threshold for the total branch length from each tip to the root of a subtree, and (2) the bootstrap support for the subtree:
206
215
207
216
```{r}
208
-
# generate cluster sets under varying parameter settings
The optimal distance threshold is associated with the lowest value of `delta.AIC`.
243
-
We expect that adding information on sample collection dates should improve our ability to predict where the next infections will occur.
244
-
However, this improvement will depend on how we have partitioned the database of known infections into clusters.
245
-
If every known infection is merged into a single giant cluster, then there is no meaningful way to predict where new cases will occur, since there is no variation for a sample of one cluster.
246
-
If each infection becomes a cluster of one, then there will be excessive information loss due to random variation in sampling dates.
247
-
At the threshold that minimizes `delta.AIC`, the known infections are partitioned into clusters in such a way that minimizes the information loss associated with incorporating sample dates into the predictive model.
248
-
249
244
## References
250
245
251
-
If you use `clustuneR` for your work, please cite one of the following references:
246
+
If you use **clustuneR** for your work, please cite one of the following references:
252
247
253
248
- Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection.
254
249
Virus evolution.
@@ -262,13 +257,6 @@ This package includes the binaries for pplacer and guppy (<https://matsen.fhcrc.
262
257
263
258
- Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
264
259
265
-
As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set.
266
-
These sequences were originally published in a study by Vrancken *et al.* (2017) and are publicly accessible in the GenBank database under the PopSet accession `1033910942`.
267
-
268
-
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank.
269
-
Nucleic acids research.
270
-
2000 Jan 1;28(1):15-8.
260
+
This package includes some anonymized HIV-1 sequences that were placed in the public domain in association with the following publication:
271
261
272
-
- Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data.
273
-
Infection, Genetics and Evolution.
274
-
2017 Aug 1;52:100-5.
262
+
- Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.