datavis_example.Rmd

---
title: "Data Visualization in R"
author: "CodeRATS"
date: "1/29/2020"
output: 
  html_document:
    toc: true
    number_sections: true
    fig_width: 7
    fig_height: 5
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE)
```

## Setup 

Download [R](https://www.r-project.org/) and [RStudio](https://rstudio.com/products/rstudio/download/) (the free version works perfectly well). 
Open the project folder, and click the .Rproj file which will install the packages necessary to run this notebook. 


## Introduction to ggplot2

To get started with ggplot, we will use the anscombe data, which contains four sets which have the same  statistical properties (mean, variance, etc.), yet are visually distinguishable.   

**Note:** The `::` operator allows us to explicitly call functions from a particular package. We use this convention here only for clarity, but it is equally valid to load the package first with `library()` and then use the function without reference to the package.

```{r}
library(magrittr) # allows us to use the %>% operator

tidy_anscombe <- anscombe %>%
  dplyr::mutate(row = dplyr::row_number()) %>%
  tidyr::pivot_longer(-row, names_to = 'variable_set', values_to = 'value') %>%
  tidyr::separate(variable_set, c('variable', 'set'), sep = 1) %>%
  tidyr::pivot_wider(names_from = 'variable', values_from = 'value')

tidy_anscombe %>%
  dplyr::group_by(set) %>%
  dplyr::summarise(mean(x), 
                   mean(y), 
                   sd(x), 
                   sd(y))
```

## ggplot2 and the grammar of graphics

We will use the grammar of graphics package, *ggplot2* in this tutorial. From the ggplot2 [documentation](https://ggplot2.tidyverse.org/):

> It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of
> visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic
> mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like
> scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like
> coord_flip()).

```{r}
library(ggplot2) # allows us to use the grammar of ggplot2
ggplot(tidy_anscombe) +
  aes(x = x, y = y, color = as.factor(set)) +
  geom_point()
```

```{r}
ggplot(tidy_anscombe) +
  aes(x = x, y = y, color = as.factor(set)) +
  geom_point() +
  scale_color_brewer(palette = 'Paired')
```


```{r}
ggplot(tidy_anscombe) +
  aes(x = x, y = y, color = as.factor(set)) +
  geom_point() +
  scale_color_brewer(palette = 'Paired') +
  facet_wrap(.~set)
```

## DepMap Data

In order to pracitice visualizing data with thousands of observations while also investigating a biologically relevant question, we will use data from the Broad's [Cancer Dependency Map](https://depmap.org/portal/), a compendium of genome-wide CRISPR screens across hundreds of cancer cell lines, to identify genetic vulnerabilities.

### Preparing data

Here we use the packages, *readr*, *dplyr* and *tidyr* to prepare data for analysis. This is not the focus of this session, but Hadley Wickham's [R for Data Science](https://r4ds.had.co.nz/introduction.html) is an excellent reference.

```{r}
gene_effect = readr::read_csv(here::here('example_data', 
                                         'Achilles_gene_effect.csv.zip')) # rely on local filepaths with here
cell_line_info = readr::read_csv(here::here('example_data', 'sample_info.csv.zip'),
                                 col_types = list(additional_info = readr::col_character()))
tidy_gene_effects = gene_effect %>%
  dplyr::rename('DepMap_ID' = 'X1') %>%
  tidyr::pivot_longer(-DepMap_ID, names_to = 'Gene (ID)', values_to = 'CERES score')
joined_effects_cell_info = dplyr::inner_join(tidy_gene_effects, cell_line_info)
```

```{r}
head(joined_effects_cell_info)
```

### CERES Score

CERES score is a measure of a gene essentiality in a cell line from a CRISPR-Cas9 genome-wide screen. A score less than or equal to -1 indicates a gene is essential, whereas a score close to 0 indicates little to no phenotypic effect. See [DepMap](https://depmap.org/ceres/) for more details. 

```{r}
ggplot(joined_effects_cell_info) +
  aes(x = `CERES score`) +
  geom_density() +
  geom_vline(xintercept = -1)
```


### Leukemia dependencies

With this datset we can ask - which genes are selectively essential in leukemia cell lines? Identifying these selective vulnerabilities can help accelerate the development of precision treatment. 

To answer this question, we take the median CERES score for Leukemia cell lines vs the median score for all other cell lines. Taking the difference between these scores, we see that MYB is the top differential dependency.

```{r}
leukemia_dependency = joined_effects_cell_info %>%
  dplyr::mutate(type = ifelse(disease == 'Leukemia', 'Leukemia', 'other')) %>%
  dplyr::group_by(`Gene (ID)`, type) %>%
  dplyr::summarise(median_effect = median(`CERES score`, na.rm = T)) %>%
  tidyr::pivot_wider(names_from = type, values_from = median_effect) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(delta = Leukemia - other) %>%
  dplyr::arrange(delta)
head(leukemia_dependency)
```

#### Scatterplot

Using ggplot2 we'll plot the median CERES score of non-Leukemia vs Leukemia cell lines. Here we can see the leukemia and non-leukemia CERES scores are well correlated, although there are some outliers.

```{r}
ggplot(leukemia_dependency) +
  aes(x = Leukemia, y = other) +
  coord_equal() +
  geom_point() +
  geom_abline(color = 'grey50')
```

Often, it is helpful to know the density of points in an dense scatterplot. We can do this using 

**Option 1:** ggplot's internal `stat_density2d` function

```{r}
ggplot(leukemia_dependency) +
  aes(x = Leukemia, y = other) +
  coord_equal() +
  geom_point() +
  stat_density2d(n = 50) +
  geom_abline(color = 'grey50') +
  ggtitle('Using stat_density2d')
```

**Option 2:** the external package *ggpointdensity* to overlay color representing density

```{r}
ggplot(leukemia_dependency) +
  aes(x = Leukemia, y = other) +
  coord_equal() +
  ggpointdensity::geom_pointdensity() +
  scale_color_viridis_c() +
  geom_abline(color = 'grey50') +
  ggtitle('Using ggpointdensity')
```

Suppose we want to label the top hits in our scatterplot. ggplot2's `geom_text` allows us label points, but these can be overlapping and difficult to read. On the other hand, `geom_text_repel` from *ggrepel* repels overlapping labels away from one another.

```{r}
top_dependencies <- leukemia_dependency %>%
  dplyr::top_n(3, -delta)
labeled_dependency <- leukemia_dependency %>%
  dplyr::mutate(label = ifelse(`Gene (ID)` %in% top_dependencies$`Gene (ID)`, 
                               `Gene (ID)`, ''))
ggplot(labeled_dependency) +
  aes(x = Leukemia, y = other, label = label) +
  coord_equal() +
  ggpointdensity::geom_pointdensity() +
  scale_color_viridis_c() +
  geom_abline(color = 'grey50') +
  ggrepel::geom_text_repel(size = 3, box.padding = 0.5, min.segment.length = 0)
```

#### Exercise

What are the top 3 buffering dependencies (i.e. greatest positive difference) 
for Leukemia cell lines? Can you label them in a scatterplot?

#### Boxplots

To gain an understanding for another type of plot at our disposal, we will use boxplots to see if there are any Leukemia sublineages which are enriched for our top genes. We can visualize these in two different ways

**Option 1:** faceting. Create a subplot for each gene separtely. Note that scales along the y axis are cohesive accross facets. 

We see that MYB is essential for all specified sublineages, whereas CBFB is variable in CML lines and TYMS is variable essential in AML lines.

```{r}
top_leukemia_genes <- joined_effects_cell_info %>% 
  dplyr::filter(`Gene (ID)` %in% unique(top_dependencies$`Gene (ID)`)) %>%
  dplyr::mutate(leukemia_subtype =ifelse(disease == 'Leukemia', 
                                         lineage_subtype, 'non-Leukemia'),
                leukemia_subtype = forcats::fct_reorder(leukemia_subtype, `CERES score`)) # allows us to reorder the subtypes by CERES score
ggplot(top_leukemia_genes) +
  aes(x = leukemia_subtype, y = `CERES score`) +
  geom_boxplot() +
  facet_wrap(vars(`Gene (ID)`)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) 
```

**Option 2:** Color by gene.

Visualizing this way, we can see CBFB is less essential than MYB accross leukemia subtypes. 

```{r}
ggplot(top_leukemia_genes) +
  aes(x = leukemia_subtype, y = `CERES score`, color = `Gene (ID)`) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  scale_color_brewer(palette = 'Set2')
```

#### Exercise 

Are there any subtypes which act differently for the top buffering genes?

## References

* [ggplot2 cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
* [R for Data Science](https://r4ds.had.co.nz/introduction.html)