updating documentation

ArtPoon · ArtPoon · commit 7a7378a49cee · 2023-12-12T14:45:10.000-05:00
diff --git a/README.Rmd b/README.Rmd
@@ -1,53 +1,73 @@
 ---
 title: "README"
 output: github_document
+editor_options: 
+  markdown: 
+    wrap: sentence
 ---
 
 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 ```
 
 # clustuneR
-### Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
 
-clustuneR builds clusters from edge lists or phylogenetic trees, allowing users to choose between multiple cluster-building algorithms implemented in the package.
-These algorithms can be further augmented through the selection of parameters, such as a required similarity for cluster formation, or a required level of certainty.
-The package also takes in meta-data associated with sequences such as a known collection date or subtype/variant classification.
-These can also allow users to identify cluster-level characteristics, such as the range of collection dates or the most common subtype/variant within a cluster.
-
-If a subset of sequences are specified as "New", then clustuneR simulates cluster growth by building clusters in two stages:
-first clusters are built from sequences which are not specified as new, then the new sequences are added to clusters.
-Depending on the clustering method used, this second step may include compromises to insure that new sequences do not retroactively change the membership of clusters.
-For example, if a single new sequence forms a cluster with two, previously separate clusters, then those two clusters would have ambiguous growth.
-Pairing cluster-level meta-data, with the growth of clusters is a common goal in research and clustuneR contains some functions to help test predictive models based on cluster data.
-Furthermore, clustuneR facilitates the assignment of multiple cluster sets from the same data using different methods and parameters.
-Pairing these with the effectiveness of growth models can be useful in method/parameter selection.
+### Optimizating genetic clustering methods on the performance of predictive growth models.
 
+A genetic cluster is a grouping of sequences that are markedly more similar to each other than to other sequences in the data set.
+Genetic clustering has many applications in biology, such as defining taxonomic groups.
+In the molecular epidemiology of infectious diseases, it can be used to characterize the transmission of a pathogen through the population.
+For instance, a cluster of sequences can represent a recent transmission outbreak, especially for rapidly-evolving pathogens.
 
+Most clustering methods require the user to select one or more criteria defining groups.
+**clustuneR** provides a statistical framework to select optimal clustering criteria, based on the premise that the most effective clustering should maximize our ability to predict the distribution of new cases among clusters.
 
 ### Installation
 
-> Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux systems.
+> ⚠️ Because clustuneR uses [pplacer](https://github.com/matsen/pplacer/) to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux and macOS systems.
+
+#### Download the package
 
 If you have the [`git`](https://git-scm.com/) version control system installed on your computer, you can clone the repository by navigating to a location of your filesystem where the package will be copied, and then running
-```
+
+```         
 git clone https://github.com/PoonLab/clustuneR.git 
 ```
 
-If you do not have `git` installed, then you can download the most recent (developmental version) package as a ZIP archive at this link:
-https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip
+If you do not have `git` installed, then you can download the most recent (developmental version) package as a ZIP archive at this link: <https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip>
 
-or from the Releases page:
-https://github.com/PoonLab/clustuneR/releases
+or from the Releases page: <https://github.com/PoonLab/clustuneR/releases>
 
 If you have downloaded a `.zip` or `.tar.gz` archive, you can use `unzip` or `tar -zvxf` on the command line, or double-click on the archive file in your desktop environment.
 
+#### Running macOS binaries
+
+macOS will prevent you from running the *pplacer* and *guppy* binaries that are distributed with this package.
+To bypass this safety mechanism, you will need to follow these steps:
+
+1.  Use Finder or Terminal to navigate to the `inst` folder in the package directory.
+2.  Attempt to execute the `pplacer.Darwin` binary. If using Finder, double-click on the `pplacer` file, which spawns a Terminal window running the binary. If using Terminal, enter the command `./pplacer.Darwin`. Your system should display a pop-up with the message `"pplacer" cannot be opened because the developed cannot be verified`. Click on the *Cancel* button to dismiss the pop-up.
+3.  Open the System Settings app and click on the Privacy & Security tab. In one of the panels, you should see the following label: `"pplacer.Darwin" was blocked from use because it is not from an identified developer`. Click on the *Allow Anyway* button.
+4.  Repeat step 2. The pop-up message should now be changed to `macOS cannot verify the developer of "pplacer.Darwin". Are you sure you want to open it?` Click on the *Open* button. Your Terminal window should now be updated with the following text:\
+    `Warning: pplacer couldn't find any sequences to place. Please supply an alignment with sequences to place as an argument at the end of the command line.`\
+    This means the program is running properly.
+5.  Repeat steps 2-4 by substituting `guppy.Darwin` for `pplacer.Darwin`.
+
+You can find similar instructions on the Apple website at <https://support.apple.com/en-ca/HT202491>
+
+If you do not want to trust the binaries in this package distribution, you can download the macOS binaries directly from the Matsen lab [GitHub release page](https://github.com/matsen/pplacer/releases/tag/v1.1.alpha17) or compile them from source yourself.
+
+#### Package installation
+
 Use `cd clustuneR` to enter the package directory and run the following command to install the package into R:
-```
+
+```         
 R CMD INSTALL .
 ```
+
 You should see something like this on your console:
-```
+
+```         
 * installing to library ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
 * installing *source* package ‘clustuneR’ ...
 ** using staged installation
@@ -65,15 +85,32 @@ You should see something like this on your console:
 * DONE (clustuneR)
 ```
 
+The process will pause at `moving datasets to lazyloadDB` because there are several large binary files (`pplacer` and `guppy`) that are included with this package distribution.
 
 ## Usage
 
+**clustuneR** can optimize clustering methods based on either pairwise genetic distances (graph-based clustering) or a phylogenetic tree (subtree-based clustering).
+At minimum, you will need to start with an alignment of genetic sequences and the respective sample collection dates as metadata.
+
+The general workflow is:
+
+1.  Partition the sequences into subsets of known and new cases, based on collection dates.
+2.  Generate sets of clusters from the known cases under varying clustering criteria.
+3.  Obtain the distribution of new cases among clusters under the different criteria as a measure of cluster growth.
+4.  Fit a null model of cluster growth as a count outcome predicted by cluster size.
+5.  Fit an alternate model of cluster growth incorporating additional metadata, e.g., sampling dates.
+6.  Generate a ∆AIC profile comparing the fits of these models under varying clustering criteria.
 
 ### Graph-based clustering
 
-There are numerous data structures which we can use as a basis for clustering.
-clustuneR also contains functions to support graph-based clustering. 
-To utilize these, we start by reading in an aligment, extracting metadata and determining the year that would define the partition of new sequences.
+A graph consists of a set of nodes and edges.
+Each node represents an infection.
+An edge between nodes indicates that the genetic similarity of the respective infections falls below some threshold.
+Conventionally, we interpret each connected component of the graph as a cluster.
+A connected component is a group of nodes such that (1) every node can be reached from another node through a path of edges, and (1) there are no edges to nodes outside of the group.
+Varying the threshold for edges yields different sets of clusters.
+
+In the following example, we start by reading in a sequence alignment, extracting metadata from the sequence labels, and identifying a subset of new sequences:
 
 ```{r}
 require(clustuneR)
@@ -84,32 +121,28 @@ seq.info <- parse.headers(names(seqs), sep="_",
   var.names=c('accession', 'coldate', 'subtype'),
   var.transformations=c(as.character, as.Date, as.factor)
 )
+seq.info$colyear <- year(seq.info$coldate)
 
 # determine newest year for growth
-max.year <- max(year(seq.info$coldate))
-which.new <- which(year(seq.info$coldate) == max.year)
+which.new <- which(seq.info$colyear == max(seq.info$colyear))
 ```
 
-Once we have a set of sequences read into R and we've determined which of them 
-are "new", we can calculate the pairwise distances between them and use this as 
-the basis for a graph. 
-In this graph, vertices represent sequences and edges are 
-weighted based on genetic distance beetween them. 
-Clusters can be defined as the 
-connected components remaining after edges over a genetic distance threshold are 
-removed. 
-Much like the method described above, we can obtain a range of cluster
-sets based corresponding to a range of genetic distance thresholds, at which point 
-the same poisson regression analysis can be preformed. 
-
-``` r
-# load genetic distances (run `tn93 -t 0.04 -o na.tn93.csv na.fasta`)
+You may already have these metadata in the form of a tabular data set (*i.e.*, a CSV file), in which case you can simply load these metadata as a data frame.
+
+Next, we need to load a list of edges, where each row specifies two node labels and a distance.
+These data can be generated from a sequence alignment using the program [TN93](https://github.com/veg/tn93).
+
+```{r}
+# load genetic distances (run `tn93 -t 1 -o na.tn93.csv na.fasta`)
 edge.info <- read.csv("data/na.tn93.csv")
 
-obj <- create.graph(seq.info, edge.info, which.new)
+obj <- read.edges(edge.info, seq.info, which.new)
 
 # generate cluster sets under varying parameter settings
-param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(dist.thresh=x)})
+cutoffs <- seq(0, 0.04, length.out=50)
+param.list <- lapply(1:50, function(i) {
+  list(dist.thresh=cutoffs[i], setID=i, time.var="colyear") 
+  })
 cluster.sets <- multi.cluster(obj, param.list, component.cluster) 
 
 res <- fit.analysis(cluster.sets, predictive.models=p.models, 
@@ -123,12 +156,12 @@ plot(cutoffs, delta.AIC, type='l', col='cadetblue', lwd=2)
 abline(h=0, lty=2)
 ```
 
-
 ### Building a tree
 
 We start with a multiple sequence alignment of sequences that are labelled with sample collection dates.
 An example of anonymized public domain HIV-1 sequences from a study based in northern Alberta (Canada) is provided in `data/na.fasta`.
 First, we use an R script to exclude the sequences collected in the most recent year:
+
 ```{r message=FALSE}
 require(clustuneR)
 require(ape)
@@ -147,17 +180,19 @@ write.FASTA(old.seqs, file="data/na-old.fasta")
 ```
 
 Next, we use IQ-TREE to reconstruct a maximum likelihood tree relating the "old" sequences:
-```console
+
+``` console
 iqtree -bb 1000 -m GTR -nstop 200 -s na-old.fasta
 ```
+
 Note we've specified the generalized time reversible model of nucleotide substitution to bypass the model selection stage.
 Even so, this is a time-consuming step - to speed things up, we've provided IQ-TREE output files at `data/na.nwk` and `data/na.log`.
 
-
 ### Grafting new sequences
 
 Next, we import both the sequence alignment and the ML tree into R.
 We will use `clustuneR` to graft the sequences from the most recent year using the program `pplacer` and the output files from IQ-TREE.
+
 ```{r warning=FALSE}
 phy <- ape::read.tree("data/na.nwk")
 
@@ -168,6 +203,7 @@ phy.extend <- extend.tree(phy, seq.info, seqs, mc.cores=4, log.file="data/na.log
 ### Finding the optimal threshold
 
 Next, we want to configure `clustuneR` to fit two Poisson regression models to the distribution of new cases among clusters, for a range of genetic distance thresholds:
+
 ```{r}
 # generate cluster sets under varying parameter settings
 param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) list(branch.thresh=x, boot.thresh=0.95))
@@ -193,6 +229,7 @@ delta.AIC <- AICs$TimeModelAIC - AICs$NullModelAIC
 ```
 
 We can visualize the difference in AICs between models as a function of the distance threshold:
+
 ```{r}
 cutoffs <- sapply(param.list, function(x) x$branch.thresh)
 par(mar=c(5,5,1,1))
@@ -209,22 +246,29 @@ If every known infection is merged into a single giant cluster, then there is no
 If every infection each becomes a cluster of one, then there will be excessive information loss due to random variation in sampling dates.
 At the threshold that minimizes `delta.AIC`, the known infections are partitioned into clusters in such a way that minimizes the information loss associated with incorporating sample dates into the predictive model.
 
-
 ## References
 
 If you use `clustuneR` for your work, please cite one of the following references:
 
-* Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection. Virus evolution. 2020 Jan;6(1):veaa011.
-
-* Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications. PLOS Computational Biology. 2022 Nov 30;18(11):e1010745.
+-   Chato C, Kalish ML, Poon AF. Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection.
+    Virus evolution.
+    2020 Jan;6(1):veaa011.
 
-This package includes the binaries for pplacer and guppy (https://matsen.fhcrc.org/pplacer, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively. 
+-   Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, Poon AF. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications.
+    PLOS Computational Biology.
+    2022 Nov 30;18(11):e1010745.
 
-* Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
+This package includes the binaries for pplacer and guppy (<https://matsen.fhcrc.org/pplacer>, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
 
-As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set. These sequences were originally published in a study by Vrancken *et al.* (2017) and publicly accessible in the GenBank database under the PopSet accession `1033910942`.
+-   Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
 
-* Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic acids research. 2000 Jan 1;28(1):15-8.
+As an example, this package includes a subset of a larger published HIV-1 *pol* sequence data set.
+These sequences were originally published in a study by Vrancken *et al.* (2017) and publicly accessible in the GenBank database under the PopSet accession `1033910942`.
 
-* Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.
+-   Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank.
+    Nucleic acids research.
+    2000 Jan 1;28(1):15-8.
 
+-   Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data.
+    Infection, Genetics and Evolution.
+    2017 Aug 1;52:100-5.