Commit 899a799

committed
Once Over
1 parent 941a221 commit 899a799

36 files changed

+474
-813
lines changed

.Rhistory

+28-510
Large diffs are not rendered by default.

DESCRIPTION

+17-12
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,22 @@
1-
Package: MountainPlot
1+
Package: clustuneR
22
Title: Molecular clustering implementation and predictive optimization
3-
Version: 0.0.0.9000
3+
Version: 1.0
44
Authors@R:
5-
person(given = "Connor", family = "Chato", role = c("aut", "cre"), email = "[email protected]")
6-
Description: This package can take identify clusters from the ape package's
7-
implementation of tree and sequence data. Multiple common definitions of sequence
8-
based clusters are implemented as functions, with a cluster neatly standardized
9-
as a row in a data.table object. This also offers some ability to run and test
10-
predictive models on clustered data sets, tracking the effect of known variables
11-
within clusters (ex. time) on outcomes such as cluster growth over time. Several
12-
built in functions handle the measurement and definition of cluster growth for
13-
this purpose. Optimal clusters(built under a certain set of parameters) can be
14-
identified based on predictive model performance.
5+
c(person("Connor", "Chato", email = "[email protected]",
6+
role = c("aut", "cre")),
7+
person("Art", "Poon", role="ths"))
8+
Description: clustuneR builds clusters from inputted sequence alignments and/or
9+
phylogenetic trees, allowing users to choose between multiple cluster-building
10+
algorithms implemented in the package and tune clustering parameters to produce
11+
informative clusters. The package also takes in meta-data associated with sequences
12+
such as a known collection date or subtype/variant classification. Cluster-level
13+
characteristics, such as the range of collection dates or the most common
14+
subtype/variant within a cluster can also be identified from these.
15+
If a subset of sequences is specified as "New", then clustuneR simulates cluster
16+
growth by building clusters in two stages: first clusters are built from sequences
17+
which are not specified as new, then the new sequences are added to clusters.
18+
Predictive models can then be tested on cluster-level attributes and validated
19+
with growth outcomes, to measure how informative a cluster set is.
1520
License: `use_gpl3_license()`
1621
Encoding: UTF-8
1722
Imports:

R/analysis.R

+39-25
@@ -1,14 +1,19 @@
11
#' Multiple clusters from a parameter set
22
#'
3-
#' Runs a given clustering method over a range of parameters values to output
4-
#' a range of cluster sets corresponding to different
5-
#'
6-
#' @param cluster.method: A given clustering function such as step.cluster() which produces a set of clusters
7-
#' @param param.list: A named list of parameter sets, which can act as inputs to cluster.method.
8-
#' @param rangeID: A unique identifier for the set of rows generated by this run.
9-
#' If this output is bound to other cluster ranges in a larger analysis, this can disambiguate
3+
#' Runs a given clustering method (a passed function) over a range of parameter
4+
#' values (a list, each entry a named list of parameters for the function).
5+
#' Collects the results into a single data.table, with cluster set IDs
6+
#' indicating the parameter set used to define each cluster set and a unique cluster range ID.
7+
#'
8+
#' @param cluster.method: A given clustering function such as step.cluster() which
9+
#' produces a set of clusters.
10+
#' @param param.list: A list, each entry a named list of parameter sets which can
11+
#' act as inputs to the cluster.method. These include values such as trees and graphs,
12+
#' as well as criteria for clustering such as boot.thresh or dist.thresh.
13+
#' @param rangeID: A unique identifier for the set of rows generated by this run
14+
#' if this output is bound to other cluster ranges in a larger analysis.
1015
#' @param mc.cores: A parallel option to increase run speed.
11-
#' @return A data.table with parameter sets and cluster IDs specified
16+
#' @return A data.table of clusters. Multiple cluster sets are collected into a range.
1217
#' @export
1318
#' @example examples/multi.cluster_ex.R
1419
multi.cluster <- function(cluster.method, param.list, mc.cores = 1, rangeID = 0) {
@@ -28,24 +33,31 @@ multi.cluster <- function(cluster.method, param.list, mc.cores = 1, rangeID = 0)
2833

2934
#' Predictive analysis on clusters
3035
#'
31-
#' Fits a predictive model of some outcome (by default, cluster growth) to sets of cluster data.
32-
#' These fits are recorded for each use of the predictive model on a given cluster set
36+
#' Fits a predictive model of some outcome (by default, cluster growth) to some
37+
#' cluster-level variable (by default, cluster size). This fit is done for each
38+
#' cluster set. Multiple models can be inputted as a named list of functions taking
39+
#' in cluster data (see example)
3340
#'
34-
#' @param cluster.data: Inputted set(s) of clusters May or may not be sorted into ranges
41+
#' @param cluster.data: Inputted set(s) of clusters. Possibly multiple ranges
3542
#' @param mc.cores: A parallel option to increase run speed
36-
#' @param predictor.transformations: A named list of transformation functions for each predictor variable.
37-
#' This name should correspond to a column from the cluster.data, which will be taken as input for the function.
38-
#' for example list("CollectionDate"=mean), would change the collection date column to a vector of means
39-
#' instead of a list collection date vectors
40-
#' @param predictive.models: A named list of functions, each of which applies a model to inputted data (x). See default null for example.
41-
#' @return A data.table of analysis results. Several important summary values such as null and full AIC are proposed here.
43+
#' @param predictor.transformations: A named list of transformation functions for
44+
#' each predictor variable (ex. list("Data"=sum)). Because clustered meta data takes
45+
#' the form of a list, these functions are often necessary to obtain a single,
46+
#' cluster-level variable.
47+
#' @param predictive.models: A named list of functions, each of which applies a
48+
#' model to inputted cluster data (x). The default is a "NullModel" example, in which
49+
#' Growth is predicted only by cluster size.
50+
#' @return A data.table of analysis results. Model fits are stored as entries in
51+
#' the rows of a data.table. The column specifying setID is retained, as is the
52+
#' range ID and the parameters used to create the cluster.
4253
#' @export
4354
#' @example examples/fit.analysis_ex.R
4455
fit.analysis <- function(cluster.data, mc.cores = 1, predictor.transformations = list(),
4556
predictive.models = list(
4657
"NullModel" = function(x){
4758
glm(Growth~Size, data=x, family="poisson")
48-
})) {
59+
})) {
60+
4961
# Check inputs
5062
predictors <- names(predictor.transformations)
5163
mod.names <- names(predictive.models)
@@ -69,7 +81,6 @@ fit.analysis <- function(cluster.data, mc.cores = 1, predictor.transformations =
6981
})]
7082
}
7183

72-
7384
# Obtain fit data for each cluster set
7485
cluster.analysis <- dplyr::bind_rows(
7586
parallel::mclapply(setIDs, function(id) {
@@ -88,20 +99,23 @@ fit.analysis <- function(cluster.data, mc.cores = 1, predictor.transformations =
8899

89100
#'Get AIC values from an analysis
90101
#'
91-
#'Takes a cluster.analysis and extracts AIC values from columns containing model fits.
92-
#'Model fit columns are automatically identified
102+
#'Takes a cluster.analysis and extracts AIC values from columns containing model
103+
#'fits. Fit columns are automatically identified
93104
#'
94-
#'@param cluster.analysis: A data.table from some predictive growth model analysis generated by fit.analysis()
95-
#'@return The AIC data for all columns containing fit objects
105+
#'@param cluster.analysis: A data.table from some predictive growth model analysis
106+
#'generated by fit.analysis()
107+
#'@return The AIC data for all columns containing fit objects. The column specifying
108+
#'setID is retained
96109
#'@export
97110
#'@example examples/get.AIC_ex.R
98111
get.AIC <- function(cluster.analysis){
99112

100113
#Identify models
101-
which.models <- sapply(cluster.analysis[1,], function(x){any(attr(x[[1]], "class")%in%c("lm", "glm"))})
114+
which.models <- sapply(cluster.analysis[1,],
115+
function(x){any(attr(x[[1]], "class")%in%c("lm", "glm"))})
102116
which.models <- which(which.models)
103117
if(length(which.models)==0) {
104-
stop("No models in the data set provided")
118+
stop("No fits in the data set provided")
105119
}
106120
model.fits <- cluster.analysis[,.SD, .SDcols = which.models]
107121
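The fit-column detection that get.AIC performs can be illustrated in isolation. The following is a minimal sketch, not the package's implementation: the toy table, the `is.fit` helper name, and the use of `inherits()` in place of the `attr(x[[1]], "class")` check are assumptions made for illustration.

```r
library(data.table)

# Hypothetical toy cluster data: two cluster sets, five clusters each
toy <- data.table(
  SetID  = rep(1:2, each = 5),
  Size   = c(2, 3, 5, 8, 13, 1, 2, 4, 6, 9),
  Growth = c(0, 1, 1, 3, 4, 0, 0, 1, 2, 2)
)

# One row per cluster set; each fit is stored as an entry in a list column
analysis <- toy[, .(NullModel = list(glm(Growth ~ Size, data = .SD, family = "poisson"))),
                by = SetID]

# A column counts as a fit column if it is a list whose first entry is an lm/glm
is.fit <- sapply(analysis, function(col) {
  is.list(col) && inherits(col[[1]], c("lm", "glm"))
})
fit.cols <- names(which(is.fit))
if (length(fit.cols) == 0) stop("No fits in the data set provided")

# Extract one AIC value per fit, keeping SetID alongside
aic <- analysis[, lapply(.SD, function(col) sapply(col, AIC)), .SDcols = fit.cols]
aic[, SetID := analysis$SetID]
print(aic)
```

Storing fits in a list column keeps one row per cluster set, so the AIC table lines up with SetID without a join.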

R/data.R

+69-44
@@ -1,49 +1,62 @@
1-
#'An alignment of HIV1, subtype B sequences
1+
#' An alignment of HIV1, subtype B sequences
22
#'
3-
#'A dataset containing 10 HIV1, subtype B polymerase sequences collected in Northern Alberta Canada.
4-
#'This a 10 sequence sample from popset# 1033910942 on NCBI's genbank Archive
3+
#' A dataset containing 10 HIV1, subtype B polymerase sequences collected in Northern
4+
#' Alberta, Canada. This is a 10-sequence sample from popset# 1033910942 on NCBI's GenBank
5+
#' Archive. The sequence headers also include meta-data for sequences.
56
#'
6-
#' @format An ape DNA object: 10 DNA sequences in binary format stored in a list. All sequences of same length: 1017
7+
#' @format An ape DNA object: 10 DNA sequences in binary format stored in a list.
8+
#' All sequences of same length: 1017
79
#' @source \url{ https://www.ncbi.nlm.nih.gov/popset?DbFrom=nuccore&Cmd=Link&LinkName=nuccore_popset&IdsFromResult=1033912042 }
810
"alignment.ex"
911

10-
#'An example set of sequence meta.data corresponding to alignment.ex
12+
#' An example set of sequence meta.data corresponding to alignment.ex
1113
#'
12-
#'A dataset describing 10 different HIV1 pol sequences collected in Northern Alberta Canada.
14+
#' Built from alignment.ex, the example 10 sequence alignment using pull.headers.
15+
#' The date of each sequence's collection, its unique GenBank accession ID, and
16+
#' sequence subtype are referenced within the header
1317
#'
14-
#' @format A data.table object with 9 variables:
18+
#' @format A data.table object with 4 variables:
1519
#' \describe{
1620
#' \item{ID}{Accession IDs (characters) of sequences}
1721
#' \item{CollectionDate}{Collection date of sequences. Full dates given as yyyy-mm-dd}
18-
#' \item{Subtype}{Subtypes (factors) within a cluster}
19-
#' \item{Header}{The original headers from the alignement. This matches meta data to sequences}
22+
#' \item{Subtype}{Subtypes (factors) of sequences}
23+
#' \item{Header}{The original headers from the alignment. This matches meta
24+
#' data to sequences in the original alignment}
2025
#' }
2126
"seq.info.ex"
2227

23-
#'An example set of clusters, built using component.cluster
28+
#' An example set of clusters, built using component.cluster
2429
#'
25-
#'A dataset describing 5 different clusters. Their member headers are listed, as well as the growth they experienced
26-
#'(ie. the number of new sequences forming clusters with old sequences.). See component.cluster for further information on
27-
#'how these were assigned based on graph.ex as an input
30+
#' A dataset describing 5 different clusters. The headers (from alignment.ex) and
31+
#' associated meta data (from seq.info.ex) of cluster members are captured, as well
32+
#' as several cluster-level traits, such as growth and size. See component.cluster
33+
#' for further information on how these were assigned based on graph.ex as an input.
2834
#'
2935
#' @format A data.table object with 9 variables:
3036
#' \describe{
31-
#' \item{ClusterID}{ The unique identifier number for this cluster. A numberic}
32-
#' \item{ID}{A list of vectors, each containing the accession IDs (characters) of sequences within a cluster}
33-
#' \item{CollectionDate}{A list of vectors, each containing the collection date of sequences within a cluster}
34-
#' \item{Subtype}{A list of vectors, each containing the subtypes (factors) within a cluster}
35-
#' \item{Header}{A list of vectors, each containing the original headers from the alignement used to build this set of clusters}
36-
#' \item{Size}{The original size of this cluster before being updated with new cases. This simply the number of sequences within the cluster}
37+
#' \item{ClusterID}{ The unique identifier number for this cluster. A numeric}
38+
#' \item{ID}{A list of vectors, each containing the accession IDs (characters)
39+
#' of sequences within a cluster}
40+
#' \item{CollectionDate}{A list of vectors, each containing the collection date
41+
#' of sequences within a cluster}
42+
#' \item{Subtype}{A list of vectors, each containing the subtypes (factors) within
43+
#' a cluster}
44+
#' \item{Header}{A list of vectors, each containing the original headers from the
45+
#' alignment used to build this set of clusters}
46+
#' \item{Size}{The original size of this cluster before being updated with new cases.
47+
#' This is simply the number of sequences within the cluster}
3748
#' \item{Growth}{The growth of the cluster after new cases are added}
38-
#' \item{DistThresh}{The pairwise distance threshold used to create this complete set of clusters. Corresponds to a setID as an input parameter}
49+
#' \item{DistThresh}{The pairwise distance threshold used to create this complete
50+
#' set of clusters. Corresponds to a setID as an input parameter}
3951
#' \item{SetID}{The unique identifier for this set of clusters. A numeric}
4052
#' }
4153
"cluster.ex"
4254

43-
#'An example graph, built based on pairwise TN93 distances
55+
#' An example graph, built based on pairwise TN93 distances
4456
#'
45-
#'This implementation of a graph is a list, describing a set of sequences and the distances between them.
46-
#'See create.graph for more information on how this graph was created using alignment.ex as input
57+
#' This implementation of a graph is a list, describing a set of sequences and the
58+
#' distances between them. See create.graph for more information on how this graph
59+
#' was created using alignment.ex as input
4760
#'
4861
#' @format A list of 3 variables
4962
#' \describe{
@@ -56,39 +69,51 @@
5669
#' }
5770
"graph.ex"
5871

59-
#'A tree built based on a subset of alignment.ex
72+
#' A tree built based on a subset of alignment.ex
6073
#'
61-
#'This is a maximum likelyhood tree built using IQ-TREE with model selection and 1000 parametric bootstraps.
62-
#'The log information for this tree is stored in data/IQTREE_log_ex.txt. A subset of six older sequences
63-
#'(collected before January 1st 2012) from alignment.ex was used to construct this tree
74+
#' A maximum likelihood tree built using IQ-TREE with model selection and 1000
75+
#' parametric bootstraps. The log information for this tree is stored in data/IQTREE_log_ex.txt.
76+
#' A subset of six older sequences (collected before January 1st 2012) from alignment.ex
77+
#' was used to construct this tree.
6478
#'
6579
#'
66-
#' @format An unrooted, phylogenetic tree with 6 tips and 4 internal nodes. Node labels represent certainty
67-
#' See ape's implementation of phylogenetic tree objects for information about tags within this object
80+
#' @format An unrooted, phylogenetic tree with 6 tips and 4 internal nodes.
81+
#' Node labels represent certainty. See ape's implementation of phylogenetic tree
82+
#' objects for information about tags within this object
6883
"old.tree.ex"
6984

70-
#'A tree built from alignment.ex
85+
#' A tree built from alignment.ex
7186
#'
72-
#'This is a maximum likelihood tree built using IQ-TREE with automatic model selection and 1000 parametric bootstraps.
87+
#' This is a maximum likelihood tree built using IQ-TREE with automatic model
88+
#' selection and 1000 parametric bootstraps. In contrast to old.tree.ex, this is a
89+
#' complete tree containing all sequences in alignment.ex.
7390
#'
74-
#' @format An unrooted, phylogenetic tree with 10 tips and 8 internal nodes. Node labels represent certainty
75-
#' See ape's implementation of phylogenetic tree objects for information about tags within this object
91+
#' @format An unrooted, phylogenetic tree with 10 tips and 8 internal nodes. Node
92+
#' labels represent certainty. See ape's implementation of phylogenetic tree objects
93+
#' for information about tags within this object.
7694
"full.tree.ex"
7795

78-
#'An extension of an ape tree object which can be used to create clusters
96+
#' An extension of an ape tree object which can be used to create clusters
7997
#'
80-
#'This is a maximum likelihood tree built using IQ-TREE with automatic model selection and 1000 parametric bootstraps.
81-
#'Additional functions within tree.setup.R were used to annotate information useful for clustering
98+
#' An extension of old.tree.ex, a maximum likelihood tree built using IQ-TREE with automatic
99+
#' model selection and 1000 parametric bootstraps. Growth information and additional
100+
#' information useful for cluster identification were added by extend.tree.
82101
#'
83-
#' @format A , phylogenetic tree with 6 tips and 4 internal nodes. Node labels represent certainty
84-
#' See ape's implementation of phylogenetic tree objects for information about tags within this object.
85-
#' In addition, there are 4 new objects created by functions within tree.setup.R
102+
#' @format A phylogenetic tree with 6 tips and 4 internal nodes. Node labels represent
103+
#' certainty. See ape's implementation of phylogenetic tree objects for information
104+
#' about tags within this object. In addition, there are 4 new objects created by
105+
#' functions within tree.setup.R
86106
#' \describe{
87107
#' \item{seq.info}{ See seq.info.ex, a data.table containing sequence meta data}
88-
#' \item{node.info}{ Grouping of the meta.data present in seq.info assigned to various nodes in the tree,
89-
#' coupled with information important to clustering, such as mean divergence from root, or node certainty }
90-
#' \item{path.info}{ Information regarding the path of edges from tips to the root of the tree.
91-
#' This is also necessary for some clustering algorithms, specifically step.cluster}
92-
#' \item{growth.info}{ a data.table pairing new sequences, to a single node in the tree based on placements assigned by guppy and pplacer.}
108+
#' \item{node.info}{ Grouping of the meta.data present in seq.info assigned to
109+
#' various nodes in the tree, coupled with information important to clustering,
110+
#' such as mean divergence from root, or node certainty }
111+
#' \item{path.info}{ Information regarding the path of edges from tips to the root
112+
#' of the tree. This is also necessary for some clustering algorithms, specifically
113+
#' step.cluster}
114+
#' \item{growth.info}{ A data.table pairing each new sequence to a single node in the
115+
#' tree based on placements assigned by guppy and pplacer. The certainty of this placement,
116+
#' the terminal branch length, the neighbour, and the branch length from the new internal
117+
#' node to the new neighbour are also described}
93118
#' }
94119
"extended.tree.ex"

R/generate.data.R

+2-2
@@ -1,7 +1,7 @@
11
#'Generate data found in /data folder
22
#'
3-
#'This is partially intended as example use code, however may also act as secondary,
4-
#'informal testing in the development cycle and as a tool to update data quickly if required.
3+
#'This is partially intended as example use code, but may also act as informal
4+
#'testing in the development cycle and as a tool to update data quickly if required.
55
generate.all <- function() {
66
generate.seq.info()
77
generate.graph()

R/graph.clustering.R

+10-8
@@ -1,16 +1,18 @@
11
#' Create clusters based on the components of a graph
22
#'
3-
#' This uses a homogenization algorithm to identify disconnected components in a graph.
4-
#' Edges are filtered away using a distance threshold to result in components.
3+
#' Edges above a distance threshold are filtered away to break up the completely
4+
#' connected graph, such that only edges between similar sequences remain.
55
#'
6-
#' @param g: The input graph, annotated with vertex and edge information
7-
#' @param dist.thresh: The maximum distance defining which edges are filtered
8-
#' @param setID: If several different parameter ranges are used, the setID can identify them
9-
#' @return A data table which represents cluster information. This includes growth info
10-
#' Because data.tables are being used, this prevents original values being reassigned via pointer
6+
#' @param g: The input graph, annotated with vertex, edge, and growth resolution
7+
#' information
8+
#' @param dist.thresh: The maximum distance defining which edges are filtered.
9+
#' A higher distance threshold implies a larger average cluster size
10+
#' @param setID: A numeric identifier for this cluster set.
11+
#' @return A set of clusters as a data.table. See example cluster.ex object
12+
#' documentation for an example of clustered sequence data + meta data
1113
#' @export
1214
#' @example examples/component.cluster_ex.R
13-
component.cluster <- function(g, dist.thresh = 0.007, setID = 0) {
15+
component.cluster <- function(g, dist.thresh = 0, setID = 0) {
1416

1517
# Filter edges above the distance threshold and prepare for component finding algorithm
1618
# All edges from a new sequence are filtered except for their "growth-resolved" edge
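
The effect of dist.thresh on component clustering can be shown with a small base R sketch. This is an illustration under stated assumptions, not the body of component.cluster: the distance matrix, the `components` helper, and the choice to keep edges at or below the threshold are all hypothetical, and the growth-resolution step for new sequences is left out.

```r
# Hypothetical pairwise distances between five sequences (symmetric, zero diagonal)
ids <- c("A", "B", "C", "D", "E")
d <- matrix(1, 5, 5, dimnames = list(ids, ids))
diag(d) <- 0
d["A", "B"] <- d["B", "A"] <- 0.005
d["B", "C"] <- d["C", "B"] <- 0.006
d["D", "E"] <- d["E", "D"] <- 0.010

# Label connected components after dropping edges above dist.thresh
# (assumption: an edge exactly at the threshold is kept)
components <- function(d, dist.thresh) {
  adj <- d <= dist.thresh
  n <- nrow(d)
  comp <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(comp[start])) next
    current <- current + 1L
    queue <- start
    # Breadth-first search over the remaining edges
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      if (!is.na(comp[v])) next
      comp[v] <- current
      queue <- c(queue, which(adj[v, ] & is.na(comp)))
    }
  }
  setNames(comp, rownames(d))
}

print(components(d, 0))      # every sequence is its own cluster
print(components(d, 0.007))  # A, B and C merge; D and E stay apart
print(components(d, 0.02))   # D and E merge as well
```

With a threshold of 0, only identical sequences would share a cluster, which is why a higher threshold implies a larger average cluster size.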
