forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path21-pseudotime.Rmd
378 lines (314 loc) · 13.7 KB
/
21-pseudotime.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
---
# knit: bookdown::preview_chapter
output: html_document
---
## Pseudotime analysis
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center")
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(TSCAN)
library(M3Drop)
library(monocle)
library(destiny)
library(SLICER)
set.seed(1)
```
In many situations, one is studying a process where cells change
continuously. This includes, for example, many differentiation processes
taking place during development: following a stimulus, cells
will change from one cell-type to another. Ideally, we would like to
monitor the expression levels of an individual cell over
time. Unfortunately, such monitoring is not possible with scRNA-seq
since the cell is lysed (destroyed) when the RNA is extracted.
Instead, we must sample at multiple time-points and obtain snapshots
of the gene expression profiles. Since some of the cells will proceed
faster along the differentiation than others, each snapshot may
contain cells at varying points along the developmental
progression. We use statistical methods to order the cells along one
or more trajectories which represent the underlying developmental
trajectories, this ordering is referred to as "pseudotime".
In this chapter we will consider four different tools: Monocle, TSCAN,
destiny and SLICER for ordering cells according to their pseudotime
development. To illustrate the methods we will be using a dataset on
mouse embryonic development [@Deng2014-mx]. The dataset consists of
268 cells from 10 different time-points of early mouse development. In this case, there is no need for pseudotime alignment since the cell labels provide information about the development trajectory. Thus, the labels allow us to establish a ground truth so that we can evaluate and compare the different methods.
A recent review by Cannoodt et al provides a detailed summary of the
various computational methods for trajectory inference from
single-cell transcriptomics [@Cannoodt2016-uj]. They discuss several
tools, but unfortunately for our purposes many of these tools do not
have complete or well-maintained implementations, and/or are not
implemented in R.
Cannoodt et al cover:
* SCUBA - Matlab implementation
* Wanderlust - Matlab (and requires registration to even download)
* Wishbone - Python
* SLICER - R, but package only available on Github
* SCOUP - C++ command line tool
* Waterfall - R, but one R script in supplement
* Mpath - R pkg, but available as tar.gz on Github; function
documentation but no vignette/workflow
* Monocle - Bioconductor package
* TSCAN - Bioconductor package
Unfortunately only two tools discussed (Monocle and TSCAN) meet the
gold standard of open-source software hosted in a reputable repository.
The following figures from the paper summarise some of the features of
the various tools.
```{r pseudotime-methods-description, out.width = '90%', fig.cap="Descriptions of trajectory inference methods for single-cell transcriptomics data (Fig. 2 from Cannoodt et al, 2016).", echo=FALSE}
knitr::include_graphics("figures/cannoodt_pseudotime_properties.png")
```
```{r pseudotime-methods, out.width = '90%', fig.cap="Characterization of trajectory inference methods for single-cell transcriptomics data (Fig. 3 from Cannoodt et al, 2016).", echo=FALSE}
knitr::include_graphics("figures/cannoodt_pseudotime_methods.png")
```
### TSCAN
TSCAN combines clustering with pseudotime analysis. First it clusters the cells using `mclust`,
which is based on a mixture of normal distributions. Then it builds a minimum spanning tree to connect the clusters. The branch of this tree that connects the largest number of clusters is the main branch which is used to determine pseudotime.
First we will try to use all genes to order the cells.
```{r tscan-all-genes}
deng_counts <- readRDS("deng/deng-reads.rds")
cellLabels <- colData(deng_counts)$cell_type2
deng <- logcounts(deng_counts)
colnames(deng) <- cellLabels
procdeng <- TSCAN::preprocess(deng)
colnames(procdeng) <- 1:ncol(deng)
dengclust <- TSCAN::exprmclust(procdeng, clusternum = 10)
TSCAN::plotmclust(dengclust)
dengorderTSCAN <- TSCAN::TSCANorder(dengclust, orderonly = F)
pseudotime_order_tscan <- as.character(dengorderTSCAN$sample_name)
```
We can also examine which timepoints have been assigned to each state:
```{r tscan-vs-truth}
cellLabels[dengclust$clusterid == 10]
colours <- rainbow(n = 10) # red = early, violet = late
tmp <-
factor(
cellLabels[as.numeric(pseudotime_order_tscan)],
levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
"16cell", "earlyblast", "midblast", "lateblast")
)
plot(
as.numeric(tmp),
xlab = "Pseudotime Order",
ylab = "Timepoint",
col = colours[tmp],
pch = 16
)
```
__Exercise 1__ Compare results for different numbers of clusters (`clusternum`).
### monocle
Monocle skips the clustering stage of TSCAN and directly builds a
minimum spanning tree on a reduced dimension representation of the
cells to connect all cells. Monocle then identifies the longest path
in this tree to determine pseudotime. If the data contains diverging
trajectories (i.e. one cell type differentiates into two different
cell-types), monocle can identify these. Each of the resulting forked paths is
defined as a separate cell state.
Unfortunately, Monocle does not work when all the genes are used, so
we must carry out feature selection. First, we use M3Drop:
```{r m3d-select-genes}
m3dGenes <- as.character(
M3DropFeatureSelection(counts(deng_counts))$Gene
)
d <- deng[which(rownames(deng) %in% m3dGenes), ]
d <- d[!duplicated(rownames(d)), ]
```
Now run monocle:
```{r monocle-all-genes, message=FALSE, warning=FALSE}
colnames(d) <- 1:ncol(d)
geneNames <- rownames(d)
rownames(d) <- 1:nrow(d)
pd <- data.frame(timepoint = cellLabels)
pd <- new("AnnotatedDataFrame", data=pd)
fd <- data.frame(gene_short_name = geneNames)
fd <- new("AnnotatedDataFrame", data=fd)
dCellData <- newCellDataSet(d, phenoData = pd, featureData = fd, expressionFamily = tobit())
dCellData <- setOrderingFilter(dCellData, which(geneNames %in% m3dGenes))
dCellData <- estimateSizeFactors(dCellData)
dCellDataSet <- reduceDimension(dCellData, pseudo_expr = 1)
dCellDataSet <- orderCells(dCellDataSet, reverse = FALSE)
plot_cell_trajectory(dCellDataSet)
# Store the ordering
pseudotime_monocle <-
data.frame(
Timepoint = phenoData(dCellDataSet)$timepoint,
pseudotime = phenoData(dCellDataSet)$Pseudotime,
State=phenoData(dCellDataSet)$State
)
rownames(pseudotime_monocle) <- 1:ncol(d)
pseudotime_order_monocle <-
rownames(pseudotime_monocle[order(pseudotime_monocle$pseudotime), ])
```
We can again compare the inferred pseudotime to the known sampling timepoints.
```{r monocle-vs-truth}
monocle_time_point <- factor(
pseudotime_monocle$Timepoint,
levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
"16cell", "earlyblast", "midblast", "lateblast")
)
plot(
pseudotime_monocle$pseudotime,
monocle_time_point,
xlab = "Pseudotime",
ylab = "Timepoint",
col = colours[monocle_time_point],
pch = 16
)
```
### Diffusion maps
[Diffusion maps](https://en.wikipedia.org/wiki/Diffusion_map) were introduced by [Ronald Coifman and Stephane Lafon](http://www.sciencedirect.com/science/article/pii/S1063520306000546), and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data.
[Haghverdi et al](http://biorxiv.org/content/biorxiv/early/2015/08/04/023309.full.pdf) have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called [destiny](http://bioconductor.org/packages/destiny).
```{r destiny-deng}
dm <- DiffusionMap(t(log2(deng + 1)))
tmp <- factor(
colnames(deng),
levels = c(
"early2cell",
"mid2cell",
"late2cell",
"4cell",
"8cell",
"16cell",
"earlyblast",
"midblast",
"lateblast"
)
)
plot(
eigenvectors(dm)[,1],
eigenvectors(dm)[,2],
xlab = "Diffusion component 1",
ylab = "Diffusion component 2",
col = colours[tmp],
pch = 16
)
```
Like the other methods, destiny does a good job at ordering the early time-points, but it is unable to distinguish the later ones.
__Exercise 2__ Do you get a better resolution between the later time points by considering additional eigenvectors?
__Exercise 3__ How does the ordering change if you only use the genes identified by M3Drop?
### SLICER
The SLICER method is an algorithm for constructing trajectories that
describe gene expression changes during a sequential biological
process, just as Monocle and TSCAN are. SLICER is designed to capture
highly nonlinear gene expression changes, automatically select genes
related to the process, and detect multiple branch and loop features
in the trajectory [@Welch2016-jr]. The SLICER R package is available
from its [GitHub repository](https://github.com/jw156605/SLICER) and
can be installed from there using the `devtools` package.
We use the `select_genes` function in SLICER to automatically select
the genes to use in builing the cell trajectory. The function uses
"neighbourhood variance" to identify genes that vary smoothly, rather
than fluctuating randomly, across the set of cells. Following this, we
determine which value of "k" (number of nearest neighbours) yields an embedding that
most resembles a trajectory. Then we estimate the [locally linear
embedding](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction) of the cells.
```{r slicer-analyis, message=FALSE, warning=FALSE}
require("lle")
slicer_genes <- select_genes(t(deng))
k <- select_k(t(deng[slicer_genes,]), kmin = 30, kmax=60)
slicer_traj_lle <- lle(t(deng[slicer_genes,]), m = 2, k)$Y
plot(slicer_traj_lle, xlab = "LLE Comp 1", ylab = "LLE Comp 2",
main = "Locally linear embedding of cells from SLICER",
col=colours[tmp], pch=16)
```
With the locally linear embedding computed we can construct a
k-nearest neighbour graph that is fully connected. This plot displays
a (yellow) circle for each cell, with the cell ID number overlaid in
blue. Here we show the graph computed using 10 nearest
neighbours. Here, SLICER appears to detect one major trajectory with
one branch.
```{r slicer-build-graph}
slicer_traj_graph <- conn_knn_graph(slicer_traj_lle, 10)
plot(slicer_traj_graph, main = "Fully connected kNN graph from SLICER")
```
From this graph we can identify "extreme" cells that are candidates
for start/end cells in the trajectory.
```{r slicer}
ends <- find_extreme_cells(slicer_traj_graph, slicer_traj_lle)
start <- ends[1]
```
Having defined a start cell we can order the cells in the estimated pseudotime.
```{r}
pseudotime_order_slicer <- cell_order(slicer_traj_graph, start)
branches <- assign_branches(slicer_traj_graph, start)
pseudotime_slicer <-
data.frame(
Timepoint = cellLabels,
pseudotime = NA,
State = branches
)
pseudotime_slicer$pseudotime[pseudotime_order_slicer] <-
1:length(pseudotime_order_slicer)
```
We can again compare the inferred pseudotime to the known sampling
timepoints. SLICER does not provide a pseudotime value per se, just an
ordering of cells.
```{r slicer-vs-truth}
slicer_time_point <- factor(
pseudotime_slicer$Timepoint,
levels = c("early2cell", "mid2cell", "late2cell", "4cell", "8cell",
"16cell", "earlyblast", "midblast", "lateblast")
)
plot(
pseudotime_slicer$pseudotime,
slicer_time_point,
xlab = "Pseudotime",
ylab = "Timepoint",
col = colours[slicer_time_point],
pch = 16
)
```
Like the previous method, SLICER here provides a good ordering for the
early time points and struggles for later time points.
__Exercise 4__ How do the results change for different k? (e.g. k = 5) What about changing the number of nearest neighbours in
the call to `conn_knn_graph`?
__Exercise 5__ How does the ordering change if you use a different set
of genes from those chosen by SLICER (e.g. the genes identified by M3Drop)?
### Comparison of the methods
How do the trajectories inferred by TSCAN and Monocle compare?
```{r tscan-monocle-compare}
matched_ordering <-
match(
pseudotime_order_tscan,
pseudotime_order_monocle
)
timepoint_ordered <-
monocle_time_point[order(pseudotime_monocle$pseudotime)]
plot(
matched_ordering,
xlab = "Monocle Order",
ylab = "TSCAN Order",
col = colours[timepoint_ordered],
pch = 16
)
```
__Exercise 6__: Compare destiny and SLICER to TSCAN and Monocle.
### Expression of genes through time
Each package also enables the visualization of expression through pseudotime. Following individual genes is very helpful for identifying genes that play an important role in the differentiation process. We illustrate the procedure using the `Rhoa` gene.
__TSCAN__
```{r Rhoa-tscan}
colnames(deng) <- 1:ncol(deng)
TSCAN::singlegeneplot(
deng[rownames(deng) == "Rhoa", ],
dengorderTSCAN
)
```
__Monocle__
```{r Rhoa-monocle}
monocle::plot_genes_in_pseudotime(
dCellDataSet[fData(dCellDataSet)$gene == "Rhoa",],
color_by = "timepoint"
)
```
Of course, pseudotime values computed with any method can be added to
the `colData` slot of an `SCE` object. Having done that, the full
plotting capabilities of the `scater` package can be used to
investigate relationships between gene expression, cell populations
and pseudotime. This would be particularly useful for the SLICER
results, as SLICER does not provide plotting functions.
__Exercise 7__: Repeat the exercise using a subset of the genes, e.g. the set of highly variable genes that can be obtained using M3Drop::Brennecke_getVariableGenes
### sessionInfo()
```{r echo=FALSE}
sessionInfo()
```