-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdatavis_example.Rmd
237 lines (180 loc) · 8.75 KB
/
datavis_example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "Data Visualization in R"
author: "CodeRATS"
date: "1/29/2020"
output:
html_document:
toc: true
number_sections: true
fig_width: 7
fig_height: 5
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
echo = TRUE)
```
## Setup
Download [R](https://www.r-project.org/) and [RStudio](https://rstudio.com/products/rstudio/download/) (the free version works perfectly well).
Open the project folder, and click the .Rproj file which will install the packages necessary to run this notebook.
## Introduction to ggplot2
To get started with ggplot, we will use the anscombe data, which contains four sets which have the same statistical properties (mean, variance, etc.), yet are visually distinguishable.
**Note:** The `::` operator allows us to explicitly call functions from a particular package. We use this convention here only for clarity, but it is equally valid to load the package first with `library()` and then use the function without reference to the package.
```{r}
library(magrittr) # allows us to use the %>% operator
tidy_anscombe <- anscombe %>%
dplyr::mutate(row = dplyr::row_number()) %>%
tidyr::pivot_longer(-row, names_to = 'variable_set', values_to = 'value') %>%
tidyr::separate(variable_set, c('variable', 'set'), sep = 1) %>%
tidyr::pivot_wider(names_from = 'variable', values_from = 'value')
tidy_anscombe %>%
dplyr::group_by(set) %>%
dplyr::summarise(mean(x),
mean(y),
sd(x),
sd(y))
```
## ggplot2 and the grammar of graphics
We will use the grammar of graphics package, *ggplot2* in this tutorial. From the ggplot2 [documentation](https://ggplot2.tidyverse.org/):
> It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of
> visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic
> mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like
> scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like
> coord_flip()).
```{r}
library(ggplot2) # allows us to use the grammar of ggplot2
ggplot(tidy_anscombe) +
aes(x = x, y = y, color = as.factor(set)) +
geom_point()
```
```{r}
ggplot(tidy_anscombe) +
aes(x = x, y = y, color = as.factor(set)) +
geom_point() +
scale_color_brewer(palette = 'Paired')
```
```{r}
ggplot(tidy_anscombe) +
aes(x = x, y = y, color = as.factor(set)) +
geom_point() +
scale_color_brewer(palette = 'Paired') +
facet_wrap(.~set)
```
## DepMap Data
In order to pracitice visualizing data with thousands of observations while also investigating a biologically relevant question, we will use data from the Broad's [Cancer Dependency Map](https://depmap.org/portal/), a compendium of genome-wide CRISPR screens across hundreds of cancer cell lines, to identify genetic vulnerabilities.
### Preparing data
Here we use the packages, *readr*, *dplyr* and *tidyr* to prepare data for analysis. This is not the focus of this session, but Hadley Wickham's [R for Data Science](https://r4ds.had.co.nz/introduction.html) is an excellent reference.
```{r}
gene_effect = readr::read_csv(here::here('example_data',
'Achilles_gene_effect.csv.zip')) # rely on local filepaths with here
cell_line_info = readr::read_csv(here::here('example_data', 'sample_info.csv.zip'),
col_types = list(additional_info = readr::col_character()))
tidy_gene_effects = gene_effect %>%
dplyr::rename('DepMap_ID' = 'X1') %>%
tidyr::pivot_longer(-DepMap_ID, names_to = 'Gene (ID)', values_to = 'CERES score')
joined_effects_cell_info = dplyr::inner_join(tidy_gene_effects, cell_line_info)
```
```{r}
head(joined_effects_cell_info)
```
### CERES Score
CERES score is a measure of a gene essentiality in a cell line from a CRISPR-Cas9 genome-wide screen. A score less than or equal to -1 indicates a gene is essential, whereas a score close to 0 indicates little to no phenotypic effect. See [DepMap](https://depmap.org/ceres/) for more details.
```{r}
ggplot(joined_effects_cell_info) +
aes(x = `CERES score`) +
geom_density() +
geom_vline(xintercept = -1)
```
### Leukemia dependencies
With this datset we can ask - which genes are selectively essential in leukemia cell lines? Identifying these selective vulnerabilities can help accelerate the development of precision treatment.
To answer this question, we take the median CERES score for Leukemia cell lines vs the median score for all other cell lines. Taking the difference between these scores, we see that MYB is the top differential dependency.
```{r}
leukemia_dependency = joined_effects_cell_info %>%
dplyr::mutate(type = ifelse(disease == 'Leukemia', 'Leukemia', 'other')) %>%
dplyr::group_by(`Gene (ID)`, type) %>%
dplyr::summarise(median_effect = median(`CERES score`, na.rm = T)) %>%
tidyr::pivot_wider(names_from = type, values_from = median_effect) %>%
dplyr::ungroup() %>%
dplyr::mutate(delta = Leukemia - other) %>%
dplyr::arrange(delta)
head(leukemia_dependency)
```
#### Scatterplot
Using ggplot2 we'll plot the median CERES score of non-Leukemia vs Leukemia cell lines. Here we can see the leukemia and non-leukemia CERES scores are well correlated, although there are some outliers.
```{r}
ggplot(leukemia_dependency) +
aes(x = Leukemia, y = other) +
coord_equal() +
geom_point() +
geom_abline(color = 'grey50')
```
Often, it is helpful to know the density of points in an dense scatterplot. We can do this using
**Option 1:** ggplot's internal `stat_density2d` function
```{r}
ggplot(leukemia_dependency) +
aes(x = Leukemia, y = other) +
coord_equal() +
geom_point() +
stat_density2d(n = 50) +
geom_abline(color = 'grey50') +
ggtitle('Using stat_density2d')
```
**Option 2:** the external package *ggpointdensity* to overlay color representing density
```{r}
ggplot(leukemia_dependency) +
aes(x = Leukemia, y = other) +
coord_equal() +
ggpointdensity::geom_pointdensity() +
scale_color_viridis_c() +
geom_abline(color = 'grey50') +
ggtitle('Using ggpointdensity')
```
Suppose we want to label the top hits in our scatterplot. ggplot2's `geom_text` allows us label points, but these can be overlapping and difficult to read. On the other hand, `geom_text_repel` from *ggrepel* repels overlapping labels away from one another.
```{r}
top_dependencies <- leukemia_dependency %>%
dplyr::top_n(3, -delta)
labeled_dependency <- leukemia_dependency %>%
dplyr::mutate(label = ifelse(`Gene (ID)` %in% top_dependencies$`Gene (ID)`,
`Gene (ID)`, ''))
ggplot(labeled_dependency) +
aes(x = Leukemia, y = other, label = label) +
coord_equal() +
ggpointdensity::geom_pointdensity() +
scale_color_viridis_c() +
geom_abline(color = 'grey50') +
ggrepel::geom_text_repel(size = 3, box.padding = 0.5, min.segment.length = 0)
```
#### Exercise
What are the top 3 buffering dependencies (i.e. greatest positive difference)
for Leukemia cell lines? Can you label them in a scatterplot?
#### Boxplots
To gain an understanding for another type of plot at our disposal, we will use boxplots to see if there are any Leukemia sublineages which are enriched for our top genes. We can visualize these in two different ways
**Option 1:** faceting. Create a subplot for each gene separtely. Note that scales along the y axis are cohesive accross facets.
We see that MYB is essential for all specified sublineages, whereas CBFB is variable in CML lines and TYMS is variable essential in AML lines.
```{r}
top_leukemia_genes <- joined_effects_cell_info %>%
dplyr::filter(`Gene (ID)` %in% unique(top_dependencies$`Gene (ID)`)) %>%
dplyr::mutate(leukemia_subtype =ifelse(disease == 'Leukemia',
lineage_subtype, 'non-Leukemia'),
leukemia_subtype = forcats::fct_reorder(leukemia_subtype, `CERES score`)) # allows us to reorder the subtypes by CERES score
ggplot(top_leukemia_genes) +
aes(x = leukemia_subtype, y = `CERES score`) +
geom_boxplot() +
facet_wrap(vars(`Gene (ID)`)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
```
**Option 2:** Color by gene.
Visualizing this way, we can see CBFB is less essential than MYB accross leukemia subtypes.
```{r}
ggplot(top_leukemia_genes) +
aes(x = leukemia_subtype, y = `CERES score`, color = `Gene (ID)`) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
scale_color_brewer(palette = 'Set2')
```
#### Exercise
Are there any subtypes which act differently for the top buffering genes?
## References
* [ggplot2 cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
* [R for Data Science](https://r4ds.had.co.nz/introduction.html)