-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.Rmd
297 lines (228 loc) · 18.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
title: "<img src='man/figures/scLANE_logo.png' align='right' height='20%' width='20%'/>"
output:
github_document:
toc: true
toc_depth: 2
fig_width: 9
fig_height: 6
dev: png
always_allow_html: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(warning = FALSE,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
fig.retina = TRUE,
fig.width = 6,
fig.height = 4)
```
<!-- badges: start -->
[](https://github.com/jr-leary7/scLANE/actions/workflows/R-CMD-check.yaml)
[](https://github.com/jr-leary7/scLANE/actions/workflows/bioc-check.yaml)


[](https://codecov.io/gh/jr-leary7/scLANE)
[](https://www.codefactor.io/repository/github/jr-leary7/sclane)
[](https://doi.org/10.1101/2023.12.19.572477)
[](https://opensource.org/licenses/MIT)
<!-- badges: end -->
# Installation
You can install the most recent version of `scLANE` using:
```{r install-scLANE, eval=FALSE}
remotes::install_github("jr-leary7/scLANE")
```
# Model structure
The `scLANE` package enables users to accurately determine differential expression of genes over pseudotime or latent time, and to characterize gene's dynamics using interpretable model coefficients. `scLANE` builds upon the `marge` modeling framework([GitHub](https://github.com/JakubStats/marge), [paper](https://doi.org/10.1080/10618600.2017.1360780)), allowing users to characterize their trajectory's effects on gene expression using negative binomial GLMs, GEEs, or GLMMs depending on the experimental design & biological questions of interest. This modeling framework is an extension of the [Multivariate Adapative Regression Splines (MARS)](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline) method, which builds nonlinear models out of piecewise linear components. `scLANE` is agnostic with respect to the ordering estimation method used, and can be implemented downstream of any pseudotime or RNA velocity method.
A quickstart guide on how to use `scLANE` with simulated data continues below, and a more detailed vignette showcasing its performance on real data can be found [here](https://jr-leary7.github.io/quarto-site/tutorials/scLANE_Trajectory_DE.html).
# Usage
Our method relies on a relatively simple test in order to define whether a given gene is differentially expressed (or "dynamic") over the provided trajectory. While the exact structure of the test differs by model mode, the concept is the same: the spline-based NB GLM / GEE / GLMM is treated as the alternate model, and a null model is fit using the corresponding model mode. If the GLM mode is used, then the null model is simply an intercept-only NB GLM; the GEE mode fits an intercept-only model with the same working correlation structure as the alternate model, and if the GLMM mode is used then the null model is an intercept-only model with random intercepts for each subject. The alternate hypothesis is that at least one of the estimated coefficients is significantly different from zero. We predict a given gene to be dynamic if the adjusted *p*-value of the test is less than the default $\alpha = 0.01$ threshold, and classify it as static otherwise.
## Libraries
```{r libraries, results='hide', message=FALSE, warning=FALSE}
library(dplyr)
library(scLANE)
library(ggplot2)
library(SingleCellExperiment)
```
## Input data
We read a previously-simulated dataset comprised of cells from 3 subjects exhibiting a homogeneous trajectory structure from [the Zenodo repository](https://doi.org/10.5281/zenodo.8433077). The underlying true pseudotime values are stored in the `colData` slot of the `SingleCellExperiment` object under the name **cell_time_normed**.
```{r sim, results='hide'}
sim_data <- readRDS(url("https://zenodo.org/records/8433077/files/scLANE_sim_data.Rds"))
```
The PCA embedding shows us a pretty simple trajectory that's strongly correlated with the first principal component.
```{r plot-sims-pt, results='hold', message=FALSE}
data.frame(sim_data@int_colData$reducedDims@listData$PCA[, 1:2]) %>%
mutate(pseudotime = sim_data$cell_time_normed) %>%
ggplot(aes(x = PC1, y = PC2, color = pseudotime)) +
geom_point(size = 2, alpha = 0.75, stroke = 0) +
scale_color_gradientn(colors = viridisLite::plasma(n = 20)) +
labs(x = "PC 1", y = "PC 2", color = "Pseudotime") +
theme_scLANE(umap = TRUE)
```
We also see that the data are not clustered by subject, which indicates that gene dynamics are mostly homogeneous across subjects.
```{r plot-sims-subj, results='hold', message=FALSE}
data.frame(sim_data@int_colData$reducedDims@listData$PCA[, 1:2]) %>%
mutate(subject = sim_data$subject) %>%
ggplot(aes(x = PC1, y = PC2, color = subject)) +
geom_point(size = 2, alpha = 0.75, stroke = 0) +
labs(x = "PC 1", y = "PC 2", color = "Subject ID") +
theme_scLANE(umap = TRUE)
```
## Trajectory DE testing
Since we have multi-subject data, we can use any of the three model modes to run our DE testing. We'll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dataframe containing our pseudotime / latent time cellular ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it's usually unnecessary to fit a model for every single gene in a dataset, as trajectories are usually estimated using a subset of the entire set of genes (usually a few thousand most highly variable genes). For the purpose of demonstration, we'll select 50 genes each from the dynamic and non-dynamic populations.
**Note:** In this case we're working with a single pseudotime lineage, though in real datasets several lineages often exist; in order to fit models for a subset of lineages simply remove the corresponding columns from the cell ordering dataframe passed as input to `testDynamic()`.
```{r prep-data}
set.seed(312)
gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50),
sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50))
pt_df <- data.frame(PT = sim_data$cell_time_normed)
cell_offset <- createCellOffset(sim_data)
```
### GLM mode
Running `testDynamic()` provides us with a nested list containing model output & DE test results for each gene over each pseudotime / latent time lineage. In this case, since we have a true cell ordering we only have one lineage. Parallel processing is turned on by default, and we use 6 cores here to speed up runtime.
```{r glm-models}
scLANE_models_glm <- testDynamic(sim_data,
pt = pt_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.cores = 6L,
verbose = FALSE)
```
After the function finishes running, we use `getResultsDE()` to generate a sorted table of DE test results, with one row for each gene & lineage. The GLM mode uses a simple likelihood ratio test to compare the null & alternate models, with the test statistic assumed to be asymptotically Chi-squared distributed.
```{r glm-results}
scLANE_res_glm <- getResultsDE(scLANE_models_glm)
select(scLANE_res_glm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>%
slice_sample(n = 5) %>%
knitr::kable(format = "pipe",
digits = 3,
col.names = c("Gene", "Lineage", "LRT stat.", "P-value", "Adj. p-value", "Predicted dynamic status"))
```
### GEE mode
The function call is essentially the same when using the GLM mode, with the exception of needing to provide a sorted vector of subject IDs & a desired correlation structure. We also need to flip the `is.gee` flag in order to indicate that we'd like to fit estimating equations models (instead of mixed models). Since fitting GEEs is more computationally complex than fitting GLMs, DE testing with the GEE mode takes a bit longer. Using more cores and / or running the tests on an HPC cluster speeds things up considerably.
```{r gee-models}
scLANE_models_gee <- testDynamic(sim_data,
pt = pt_df,
genes = gene_sample,
size.factor.offset = cell_offset,
is.gee = TRUE,
id.vec = sim_data$subject,
cor.structure = "ar1",
n.cores = 6L,
verbose = FALSE)
```
We again generate the table of DE test results. The variance of the estimated coefficients is determined using the sandwich estimator, and a Wald test is used to compare the null & alternate models.
```{r gee-results}
scLANE_res_gee <- getResultsDE(scLANE_models_gee)
select(scLANE_res_gee, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>%
slice_sample(n = 5) %>%
knitr::kable("pipe",
digits = 3,
col.names = c("Gene", "Lineage", "Wald stat.", "P-value", "Adj. p-value", "Predicted dynamic status"))
```
### GLMM mode
We re-run the DE tests a final time using the GLMM mode. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects belong to groups like Treatment & Control, and you expect the Treatment group to experience a different progression through the biological process. Executing the function with the GLMM mode differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure.
```{r glmm-models}
scLANE_models_glmm <- testDynamic(sim_data,
pt = pt_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.potential.basis.fns = 3,
is.glmm = TRUE,
id.vec = sim_data$subject,
n.cores = 6L,
verbose = FALSE)
```
**Note:** The GLMM mode is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly.
Like the GLM mode, the GLMM mode uses a likelihood ratio test to compare the null & alternate models.
```{r glmm-results}
scLANE_res_glmm <- getResultsDE(scLANE_models_glmm)
select(scLANE_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>%
slice_sample(n = 5) %>%
knitr::kable("pipe",
digits = 3,
col.names = c("Gene", "Lineage", "LRT stat.", "P-value", "Adj. p-value", "Predicted dynamic status"))
```
## Downstream analysis & visualization
### Model comparison
We can use the `plotModels()` to visually compare different types of models. It takes as input the results from `testDynamic()`, as well as a few specifications for which models & lineages should be plotted. While more complex visualizations can be created from our model output, this function gives us a good first glance at which models fit the underlying trend the best. Here we show the output generated using the GLM mode, split by model type. The intercept-only model shows the null hypothesis against which the scLANE model is compared using the likelihood ratio test and the GLM displays the inadequacy of monotonic modeling architectures for nonlinear dynamics. A GAM shows essentially the same trend as the `scLANE` model, though the fitted trend from `scLANE` is more interpretable.
```{r plot-models-glm}
plotModels(scLANE_models_glm,
gene = "MPG",
pt = pt_df,
expr.mat = sim_data,
size.factor.offset = cell_offset,
plot.null = TRUE,
plot.glm = TRUE,
plot.gam = TRUE,
plot.scLANE = TRUE)
```
When plotting the models generated using the GLMM mode, we split by lineage & color the points by subject ID instead of by lineage. The gene in question highlights the utility of the scLANE model, since the gene dynamics differ significantly by subject.
```{r plot-models-glmm, fig.width=9, fig.height=9}
plotModels(scLANE_models_glmm,
gene = "FLOT2",
pt = pt_df,
expr.mat = sim_data,
size.factor.offset = cell_offset,
id.vec = sim_data$subject,
is.glmm = TRUE,
plot.glm = TRUE,
plot.gam = TRUE,
plot.scLANE = TRUE)
```
### Coefficient summaries
A key feature of `scLANE` is the ability to obtain a quantitative, interpretable coefficient for the effect of pseudotime on gene expression. This functionality is currently available for the GLM & GEE frameworks, and each coefficient carries the [interpretation of a generalized linear model](https://stats.oarc.ucla.edu/r/dae/negative-binomial-regression/).
```{r}
scLANE_models_glm[["JARID2"]]$Lineage_A$Gene_Dynamics %>%
knitr::kable("pipe",
digits = 2,
col.names = c("Gene", "Lineage", "Breakpoint", "First Slope", "Second Slope", "First Trend", "Second Trend"))
```
Coefficients can also be plotted like so:
```{r plot-model-coefs}
plotModelCoefs(scLANE_models_glm,
gene = "JARID2",
pt = pt_df,
expr.mat = sim_data,
size.factor.offset = cell_offset)
```
### Knot distribution
Lastly, we can pull the locations in pseudotime of all the knots fitted by `scLANE`. Visualizing this distribution gives us some idea of where transcriptional switches are occurring in the set of genes classified as dynamic.
```{r, plot-knot-dist, message=FALSE, warning=FALSE}
dyn_genes <- filter(scLANE_res_glm, Gene_Dynamic_Overall == 1) %>%
pull(Gene)
knot_dist <- getKnotDist(scLANE_models_glm, dyn.genes = dyn_genes)
ggplot(knot_dist, aes(x = knot)) +
geom_histogram(aes(y = after_stat(density)),
color = "black",
fill = "white",
linewidth = 0.5) +
geom_density(color = "forestgreen",
fill = "forestgreen",
alpha = 0.5,
linewidth = 0.75) +
labs(x = "Knot Location", y = "Density") +
theme_scLANE()
```
### Smoothed dynamics matrix
We can extract matrix of the fitted values for each dynamic gene using the `smoothedCountsMatrix()` function.
```{r}
smoothed_dynamics <- smoothedCountsMatrix(scLANE_models_glm,
size.factor.offset = cell_offset,
pt = pt_df,
genes = dyn_genes)
```
The smoothed dynamics can then be used to generate expression cascade heatmaps, cluster genes, etc. For more information on downstream analysis of gene dynamics, see [the corresponding vignette](https://jr-leary7.github.io/quarto-site/tutorials/scLANE_Trajectory_DE.html#downstream-analysis).
# Conclusions & best practices
In general, starting with the GLM mode is probably your best bet unless you have a strong prior belief that expression trends will differ significantly between subjects. If that is the case, you should use the GEE mode if you're interested in population-level estimates, but are worried about wrongly predicting differential expression when differences in expression are actually caused by inter-subject variation. If you're interested in generating subject-specific estimates then the GLMM mode should be used; take care when interpreting the fixed vs. random effects though, and consult a biostatistician if necessary.
If you have a large dataset (10,000+ cells), you should start with the GLM mode, since standard error estimates don't differ much between modeling methods given high enough *n*. In addition, running the tests on an HPC cluster with 12+ CPUs and 64+ GB of RAM will help your computations to complete swiftly. Datasets with smaller numbers of cells or fewer genes of interest may be easily analyzed in an R session on a local machine.
# Contact information
This package is developed & maintained by Jack Leary. Feel free to reach out by [opening an issue](https://github.com/jr-leary7/scLANE/issues) or by email ([email protected]) if more detailed assistance is needed.
# References
1. Bacher, R. *et al*. [Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization](https://doi.org/10.1093/nar/gkab1071). *Nucleic Acids Research* (2021).
2. Warton, D. & J. Stoklosa. [A generalized estimating equation approach to multivariate adaptive regression splines](https://doi.org/10.1080/10618600.2017.1360780). *Journal of Computational and Graphical Statistics* (2018).
3. Nelder, J. & R. Wedderburn. [Generalized linear models](https://doi.org/10.2307/2344614). *Journal of the Royal Statistical Society* (1972).
4. Liang, K. & S. Zeger. [Longitudinal data analysis using generalized linear models](https://doi.org/10.1093/biomet/73.1.13). *Biometrika* (1986).
5. Laird, N. & J. Ware. [Random-effects models for longitudinal data](https://doi.org/10.2307/2529876). *Biometrics* (1988).