---
output: html_document
---

## Imputation

```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align="center", echo=FALSE)
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(scImpute)
library(SC3)
library(scater)
library(SingleCellExperiment)
set.seed(1234567)
```


As discussed previously, one of the main challenges when analyzing scRNA-seq data is the presence of zeros, or dropouts. The dropouts are assumed to have arisen for three possible reasons:

* The gene was not expressed in the cell and hence there are no transcripts to sequence
* The gene was expressed, but for some reason the transcripts were lost somewhere prior to sequencing
* The gene was expressed and transcripts were captured and turned into cDNA, but the sequencing depth was not sufficient to produce any reads.

Thus, dropouts could be the result of experimental shortcomings, and if this is the case then we would like to provide computational corrections. One possible solution is to impute the dropouts in the expression matrix. To be able to impute gene expression values, one must have an underlying model. However, since we do not know which dropout events are technical artefacts and which correspond to the transcript being truly absent, imputation is a difficult challenge.
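
The following is a minimal simulation sketch of why technical and biological zeros are hard to tell apart. It is not part of scImpute or MAGIC: the capture probability `capture_prob`, the Poisson expression model and all variable names are assumptions chosen purely for illustration.

```{r, echo=TRUE, eval=FALSE}
# Toy simulation: every value below is an illustrative assumption.
set.seed(42)
n_genes <- 1000
capture_prob <- 0.1  # hypothetical fraction of transcripts captured and sequenced

# True transcript counts: about 30% of genes are not expressed (biological zeros),
# the remaining genes carry at least one transcript.
expressed <- runif(n_genes) > 0.3
true_counts <- ifelse(expressed, 1 + rpois(n_genes, lambda = 4), 0)

# Each transcript is observed independently with probability capture_prob.
observed <- rbinom(n_genes, size = true_counts, prob = capture_prob)

# Cross-tabulate observed zeros against the true state of each gene.
zero_origin <- table(
    observed_zero = observed == 0,
    truly_absent = true_counts == 0
)
zero_origin
```

In the resulting table the observed zeros fall into both columns: some belong to genes that were truly absent and others to expressed genes whose transcripts were lost. From the observed counts alone the two cases look identical, which is why an imputation method needs a model to decide which zeros to fill in.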

To the best of our knowledge, there are currently two different imputation methods available: MAGIC [@Van_Dijk2017-bh] and scImpute [@Li2017-tz]. Since [MAGIC](https://github.com/pkathail/magic) is only available for Python or Matlab, we focus on [scImpute](https://github.com/Vivianstats/scImpute) in this chapter and only briefly show how MAGIC can be run from R through its command-line script.

### scImpute

To test `scImpute`, we use the default parameters and we apply it to the Deng dataset that we have worked with before. scImpute takes a .csv or .txt file as an input:

```{r}
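# Load the Deng dataset and write the raw counts (genes x cells) to a csv file
deng <- readRDS("deng/deng-reads.rds")
write.csv(counts(deng), "deng.csv")
# Run scImpute; the imputed counts are written to scimpute_count.txt in out_dir
scimpute(
    count_path = "deng.csv",
    infile = "csv",
    outfile = "txt",
    out_dir = "./",
    Kcluster = 10,  # number of cell subpopulations assumed by scImpute
    ncores = 2      # number of cores used for parallel computation
)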
```
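
As a small sketch of the file layout involved, the toy example below writes a made-up count matrix with `write.csv()` and reads it back. It uses only base R; the gene and cell names are invented, and the statement that scImpute expects genes in rows and cells in columns reflects our reading of its documentation rather than something verified here.

```{r, echo=TRUE, eval=FALSE}
# Toy count matrix: 3 genes (rows) x 2 cells (columns); all names are made up.
toy_counts <- matrix(
    c(0, 5, 2, 0, 1, 7),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# write.csv() stores the row names in the first column
# and the column names in the header row.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_counts, toy_path)

# Reading the file back with row.names = 1 recovers the genes x cells layout.
toy_back <- as.matrix(read.csv(toy_path, row.names = 1))
dim(toy_back)
rownames(toy_back)
```

Checking the dimensions and row names of a file in this way before a long imputation run is a cheap guard against accidentally passing a transposed matrix.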

Now we can compare the results with the original data using a PCA plot:

```{r}
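# Read the imputed counts written by scImpute
res <- read.table("scimpute_count.txt")
# Drop the column names from the file; cell annotations are taken from colData(deng)
colnames(res) <- NULL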
res <- SingleCellExperiment(
    assays = list(logcounts = log2(as.matrix(res) + 1)), 
    colData = colData(deng)
)
plotPCA(
    res, 
    colour_by = "cell_type2"
)
```
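
For readers who want to see what the log-transform and PCA steps amount to, here is a simplified base R sketch on a made-up matrix. It is not how `plotPCA` is implemented (which may, for example, select a subset of genes and scale them first); the matrix `pca_counts` and its values are invented for illustration.

```{r, echo=TRUE, eval=FALSE}
# Toy count matrix: 4 genes x 6 cells, with two groups of three cells each.
pca_counts <- matrix(
    c(10, 0, 3, 1,
      12, 1, 2, 0,
       9, 0, 4, 1,
       0, 8, 1, 5,
       1, 9, 0, 6,
       0, 7, 2, 4),
    nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Same transformation as used for the imputed matrix: log2(count + 1).
pca_logcounts <- log2(pca_counts + 1)

# prcomp() expects observations (cells) in rows, so the matrix is transposed.
toy_pca <- prcomp(t(pca_logcounts), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component.
round(toy_pca$sdev^2 / sum(toy_pca$sdev^2), 2)

# Coordinates of the cells on the first two components.
toy_pca$x[, 1:2]
```

In this toy example the first component separates the two groups of cells. Running the same steps on a raw and an imputed matrix and comparing the coordinates is one simple way to see how imputation changes the structure of the data.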

Compare this result to the original data in Fig. X, Chapter Y. What are the most significant differences?

To evaluate the impact of the imputation, we use `SC3` to cluster the imputed matrix:
```{r}
res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
res <- sc3(res, ks = 10, gene_filter = FALSE)
sc3_plot_consensus(
    res, 
    k = 10, 
    show_pdata = "cell_type2"
)
plotPCA(
    res, 
    colour_by = "sc3_10_clusters"
)
```

Based on the PCA and the clustering results, do you think that imputation is a good idea for the Deng dataset?
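
One way to put a number on this question is to compare the cluster assignments with the known cell labels using the adjusted Rand index (ARI). The sketch below is a minimal base R version written for illustration; `adjusted_rand_index` is our own helper and not an `SC3` function, the labels are made up, and in practice a tested implementation such as `adjustedRandIndex` from the `mclust` package is preferable.

```{r, echo=TRUE, eval=FALSE}
# Minimal adjusted Rand index between two label vectors of equal length.
# Note: the degenerate case where both labelings put all cells in a single
# cluster gives 0/0 and is not handled here.
adjusted_rand_index <- function(labels_a, labels_b) {
    stopifnot(length(labels_a) == length(labels_b))
    tab <- table(labels_a, labels_b)
    # Number of cell pairs grouped together in both labelings
    sum_cells <- sum(choose(tab, 2))
    # Number of cell pairs grouped together in each labeling separately
    sum_rows <- sum(choose(rowSums(tab), 2))
    sum_cols <- sum(choose(colSums(tab), 2))
    # Agreement expected by chance and the maximum attainable agreement
    expected <- sum_rows * sum_cols / choose(length(labels_a), 2)
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

# Made-up labels for six cells
known <- c("typeA", "typeA", "typeB", "typeB", "typeC", "typeC")
perfect <- c(1, 1, 2, 2, 3, 3)
mixed <- c(1, 2, 1, 2, 3, 3)

adjusted_rand_index(known, perfect)  # 1: identical partitions
adjusted_rand_index(known, mixed)    # about 0.17: only typeC is recovered
```

Computing the ARI between `cell_type2` and the `SC3` clusters, once for the original and once for the imputed data, would give a direct comparison of how imputation affects the recovery of the known cell types.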

### MAGIC

```{r}
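# Run the MAGIC command-line script on the count matrix written above
# (requires the Python version of MAGIC to be installed)
system("MAGIC.py -d deng.csv -o MAGIC_count.csv --cell-axis columns -l 1 --pca-non-random csv")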
```

```{r}
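res <- read.csv("MAGIC_count.csv", header = TRUE)
# The first column holds the row names; move it out of the data
rownames(res) <- res[,1]
res <- res[,-1]
# Transpose so that genes are in rows and cells in columns
res <- t(res)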
res <- SingleCellExperiment(
    assays = list(logcounts = res), 
    colData = colData(deng)
)
plotPCA(
    res, 
    colour_by = "cell_type2"
)
```

Compare this result to the original data in Fig. X, Chapter Y. What are the most significant differences?

To evaluate the impact of the imputation, we use `SC3` to cluster the imputed matrix:
```{r, eval = FALSE}
res <- sc3_estimate_k(res)
metadata(res)$sc3$k_estimation
res <- sc3(res, ks = 10, gene_filter = FALSE)
sc3_plot_consensus(
    res, 
    k = 10, 
    show_pdata = "cell_type2"
)
plotPCA(
    res, 
    colour_by = "sc3_10_clusters"
)
```

### sessionInfo()

```{r echo=FALSE}
sessionInfo()
```