All analysis was conducted in R version 3.4.0 using the following script. Computations were performed on a MacBook Pro with 16GB of RAM and an i7 quad-core processor.
If you haven't already installed the bioplotr
package, you'll need to do so to reproduce some of the figures below.
# Install bioplotr
devtools::install_github('dswatson/bioplotr')
# Load libraries
library(data.table)
library(tximport)
library(edgeR)
library(limma)
library(ggplot2)
library(bioplotr)
library(ggsci)
library(qvalue)
library(dplyr)
Following RNA-seqencing of all samples, reads were pseudo-aligned using kallisto (Bray et al., 2016). Our reference genome was GRCh38.p2, Ensembl version 79. Transcript-level reads were aggregated to gene-level using the tximport
package (Soneson et al., 2015). We add a subject-tissue interaction term to the clinical data file for later use.
# Load data
clin <- fread('./Data/Clinical.csv') %>%
mutate(Subject.Tissue = paste(Subject, Tissue, sep = '.'))
t2g <- fread('./Data/Hs79.t2g.csv')
files <- file.path('./Data/RawCounts', clin$Sample, 'abundance.tsv')
txi <- tximport(files, type = 'kallisto', tx2gene = t2g, importer = fread,
countsFromAbundance = 'lengthScaledTPM')
We modify genetic and clinical data prior to modelling.
Before conducting EDA or differential expression analysis, we remove probes with less than one count per million (CPM) in at least nine libraries. This ensures that every probe is expressed in at least one of our nine replicates per subject (three tissue types observed at three timepoints each). This threshold follows the filtering guidelines of Robinson et al. (2010). Counts are then TMM normalised prior to modelling (Robinson & Oshlack, 2010). See the extensive edgeR
package vignette for more details.
# Filter genes
keep <- rowSums(cpm(txi$counts) > 1) >= 9
y <- DGEList(txi$counts[keep, ])
y <- calcNormFactors(y)
colnames(y) <- clin$Sample
# Check dimensionality
dim(y)
## [1] 19304 89
The analysis will proceed with 19,304 probes.
To get a quick sense for the clinical data, we create scatterplots of baseline vs. 12 week PASI and baseline vs. delta PASI.
# Create per-subject data frame
df <- clin %>% distinct(Subject, PASI_wk00, PASI_wk12, DeltaPASI)
# Regress 12 week PASI on baseline
m <- lm(PASI_wk12 ~ 0 + PASI_wk00, data = df)
# Plot results
ggplot(df, aes(PASI_wk00, PASI_wk12, label = Subject)) +
geom_text() +
geom_abline(intercept = 0, slope = 1) +
geom_abline(intercept = 0, slope = coef(m), colour = 'blue') +
xlim(0, 40) + ylim(0, 40) +
labs(title = 'Baseline vs. 12 Week PASI',
x = 'Baseline PASI',
y = '12 Week PASI') +
theme_bw() +
theme(plot.title = element_text(hjust = .5))
The black line has an intercept of 0 and a slope of 1, representing zero improvement over the course of treatment. We find that only one subject, S09, falls to the left of this line. The blue line has an intercept of 0 and a slope corresponding to the line of best fit through the data points. To obtain a least squares estimate of the average percent change in PASI observed in our study, we calculate the difference in slope between the black and blue lines.
as.numeric(1 - coef(m))
## [1] 0.6917699
Subjects in our study experienced a 69% reduction in PASI scores over the course of treatment, on average. It's evident from this plot, however, that there is considerable variance around that mean. S09 in particular appears to be an outlier. Note that this patient is the only one in the study who got worse between baseline and week 12. It is important that our model capture genomic signatures of non-response, but the distance between this patient's outcome and the nearest fellow non-responder is disconcertingly large. As we see in the following plot, S09 singlehandedly creates the impression of a relationship between baseline and outcome statistics.
# Regress delta PASI on baseline, with and without S09
m1 <- lm(DeltaPASI ~ PASI_wk00, data = df)
m2 <- lm(DeltaPASI ~ PASI_wk00, data = filter(df, Subject != 'S09'))
# Plot baseline vs. delta PASI
ggplot(df, aes(PASI_wk00, DeltaPASI, label = Subject)) +
geom_text() +
geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2]) +
geom_abline(intercept = coef(m2)[1], slope = coef(m2)[2], colour = 'blue') +
labs(title = 'Baseline vs. Delta PASI',
x = 'Baseline PASI',
y = 'Delta PASI') +
theme_bw() +
theme(plot.title = element_text(hjust = .5))
The black line represents the regression of delta PASI on baseline for all subjects in the study; the blue line represents the same model, excluding S09. Note how the latter line is practically flat, while the former has a clear positive slope. Correlations between these variables are heavily affected by the presence or absence of this single data point.
cor(df$PASI_wk00, df$DeltaPASI)
## [1] 0.2612898
cor(df$PASI_wk00[df$Subject != 'S09'], df$DeltaPASI[df$Subject != 'S09'])
## [1] -0.007712975
Early attempts at this analysis found that subject 9 was driving differential expression results due to the patient's high leverage in least squares regression modelling. Rather than remove the patient altogether, we elect to winsorise the distribution, replacing S09's delta PASI score with a lower limit defined as two median absolute deviations below the median of all delta PASI scores.
# Create function
winsorise <- function(x, multiple = 2) {
y <- x - median(x)
lim <- mad(y, center = 0) * multiple
y[y > lim] <- lim
y[y < -lim] <- -lim
y <- y + median(x)
return(y)
}
# Check to see which patient outcomes are changed by winsorisation
(df <- clin %>%
distinct(Subject, DeltaPASI) %>%
mutate(Winsorised = winsorise(DeltaPASI)))
Subject | DeltaPASI | Winsorised |
---|---|---|
S01 | 0.5172414 | 0.5172414 |
S02 | 0.7346154 | 0.7346154 |
S03 | 0.6736111 | 0.6736111 |
S04 | 0.5023697 | 0.5023697 |
S05 | 0.6948229 | 0.6948229 |
S07 | 0.8009709 | 0.8009709 |
S08 | 0.9100000 | 0.9100000 |
S09 | -0.0468750 | 0.3677652 |
S10 | 0.8459120 | 0.8459120 |
S11 | 0.7222222 | 0.7222222 |
This confirms that only subject 9 meets our winsorisation threshold. The patient's results are still by far the most extreme in the data, but have now been robustified for linear modelling.
# Winsorise the distribution
clin <- clin %>%
group_by(Tissue, Time) %>%
mutate(DeltaPASI = winsorise(DeltaPASI)) %>%
ungroup()
All EDA is conducted on the log2-CPM scale of the normalised count matrix.
mat <- cpm(y, log = TRUE, prior.count = 1)
We begin by examining the data's mean-variance trend, as the shape of this curve is essential for the voom
weights we intend to apply before testing for differential expression.
plot_mv(mat, trans = 'sqrt')
This plot looks about right for these data.
While a mean-variance plot tells us something about the distribution of counts by gene, a density plot helps visualise the distribution of counts by sample. I've chosen to colour the densities by tissue type, since this is evidently the greatest source of variation in the data.
plot_density(mat, group = list('Tissue' = clin$Tissue))
We find here that blood samples take a unique shape, while skin samples are generally more alike. Still, nonlesional tissue appears to have a slightly higher peak than lesional tissue. There are no clear outliers in this figure, but we cannot make a conclusive judgment about this without further exploration.
We build a sample similarity matrix by calculating the pairwise Euclidean distance between all samples in the data. This matrix is then visualised as a heatmap and used to generate a hierarchical clustering dendrogram.
plot_similarity(mat, group = list('Tissue' = clin$Tissue))
The dendrogram has perfectly separated blood from skin, which is reassuring. Hierarchical clustering has had more trouble distinguishing between lesional and nonlesional samples, but this appears to be the result of strong intra-subject effects. A paired test would almost certainly find large differential expression between these two types of skin samples.
One final, popular method for visualising the variance of a high-dimensional dataset along just two axes is principal component analysis (PCA).
plot_pca(mat, group = list('Tissue' = clin$Tissue))
This plot represents perhaps the clearest possible summary of the findings from the last few figures. The first principal component, which captures over 70% of the variance in these data, perfectly separates blood from skin samples. The second principal component, which accounts for a little over 9% of the data variance, separates lesional from nonlesional tissue, albeit with one stray data point corresponding to S03's week 12 observation. While this subject did not achieve PASI 75, he did come quite close. Transcriptomic evidence suggests he may have been on his way to positive biologic response, given the proximity of his final lesional sample to nonlesional tissue.
We examine between-subject variation across tissue types and time points by fitting a linear model, shrinking residual variance using empirical Bayes methods, and extracting the appropriate contrasts (Smyth, 2004).
# Build design matrix
des <- model.matrix(~ 0 + Tissue:Time + Tissue:Time:DeltaPASI, data = clin)
colnames(des)[10:18] <- paste(unique(clin$Tissue), rep(unique(clin$Time), each = 3),
'Response', sep = '.')
This design matrix has 18 columns. The first 9 represent every tissue-time combination, effectively providing intercepts for each of the 9 sub-models of which the model is composed. The latter 9 are three-way interactions between tissue type, time point, and delta PASI, which we use to measure drug response. The coefficients for these features represent the slopes of each linear sub-model. We rename these variables for more convenient reference later on.
With 89 samples in our study collected from different tissues at various times, it's distinctly possible that there's significant variation in quality across libraries. The authors of the limma
package originally introduced an arrayWeights
function for microarray data so users could empirically estimate sample weights to incorporate in heteroscedastic linear model fits (Ritchie et al., 2006). Shortly after introducing the voom
transformation, which allows RNA-seq count matrices to be modeled in the limma
pipeline (Law et al., 2014), the package authors extended Ritchie et al.'s sample weighting procedure to RNA-seq data (Liu et al., 2015). We can check for variability in library quality by plotting the results of a preliminary estimation of voom
sample weights.
# Estimate weights
v <- voomWithQualityWeights(y, des)
# Tidy data
df <- data_frame(Sample = paste0('S', seq_len(89)),
Tissue = clin$Tissue,
Weight = as.numeric(v$sample.weights)) %>%
mutate(Sample = factor(Sample, levels = unique(Sample)))
# Plot results
ggplot(df, aes(Sample, Weight, fill = Tissue)) +
geom_bar(stat = 'identity') +
scale_fill_npg() +
geom_hline(yintercept = 1, linetype = 'dashed') +
labs(title = 'Library Quality by Tissue') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = .5),
legend.justification = c(0.01, 0.99),
legend.position = c(0.01, 0.99))
Library quality does seem to vary across samples, but it's not entirely clear whether that's a function of tissue, time, the interaction between them, or perhaps even subject. To find out, we run a series of F-tests. The first three are repeated measures ANOVAs, the latter a simple one-way ANOVA. (Technically, we should remove subject 11 from the repeated measures ANOVAs since this patient's week 12 lesional sample is NA; in practice, it makes no difference here.)
df <- df %>%
mutate(Subject = clin$Subject,
Time = clin$Time)
# Do sample weights vary significantly by tissue?
anova(lm(Weight ~ Subject + Tissue, data = df),
lm(Weight ~ Subject, data = df))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
77 | 6.811668 | NA | NA | NA | NA |
79 | 8.182313 | -2 | -1.370644 | 7.746971 | 0.0008601 |
# By time?
anova(lm(Weight ~ Subject + Time, data = df),
lm(Weight ~ Subject, data = df))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
77 | 7.949054 | NA | NA | NA | NA |
79 | 8.182313 | -2 | -0.2332586 | 1.129752 | 0.3284098 |
# By tissue-time?
anova(lm(Weight ~ Subject + Tissue:Time, data = df),
lm(Weight ~ Subject, data = df))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
71 | 6.210653 | NA | NA | NA | NA |
79 | 8.182313 | -8 | -1.97166 | 2.817494 | 0.0090095 |
# By subject?
anova(lm(Weight ~ Subject, data = df),
lm(Weight ~ 1, data = df))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
79 | 8.182313 | NA | NA | NA | NA |
88 | 9.540588 | -9 | -1.358276 | 1.457124 | 0.1787251 |
It appears from these F-tests that tissue type is the main driver of variation in library quality. This is true even though the design matrix passed to voomWithQualityWeights
includes a Tissue:Time interaction term, meaning residual sample variance is only calculated in comparison to other libraries from the same tissue-time. Because the variability is considerable - the minimum and maximum weights differ by a factor of five - we elect to build a heteroscedastic model that will take sample weights into account.
To account for the intra-subject correlations inherent to our study's repeated measures design, we use the duplicateCorrelation
function. This approximates a mixed model in which a blocking variable, in this case the interaction between subject and tissue type, becomes a random effect (Smyth, 2005). Following the advice of the package authors, we estimate voom
weights and block correlations twice each.
icc <- duplicateCorrelation(v, des, block = clin$Subject.Tissue)
v <- voomWithQualityWeights(y, des, correlation = icc$cor,
block = clin$Subject.Tissue)
icc <- duplicateCorrelation(v, des, block = clin$Subject.Tissue)
It's worth checking to see how high the intra-class correlation is just to confirm it's a positive real number on (0, 1).
icc$cor
## [1] 0.3210134
That value seems plausible, and confirms that we are right to use this approach. The positive correlation of expression values within each subject-tissue violates the assumption of independence upon which fixed effect models are based.
With all these pieces in place, all that's left to do is fit the genewise regressions. We write a custom function to export results.
res <- function(coef) {
topTable(fit, coef = coef, number = Inf, sort.by = 'none') %>%
rename(AvgExpr = AveExpr,
p.value = P.Value) %>%
mutate(q.value = qvalue(p.value)$qvalues,
Gene = rownames(v)) %>%
arrange(p.value) %>%
select(Gene, AvgExpr, logFC, p.value, q.value) %>%
fwrite(paste0('./Results/Response/mRNA/', coef, '.txt'), sep = '\t')
}
We now fit the model and export results for each tissue-time. We use a robust empirical Bayes shrinkage procedure to mitigate the impact of hypervariable genes (Phipson et al., 2016).
fit <- lmFit(v, des, correlation = icc$cor, block = clin$Subject.Tissue)
fit <- eBayes(fit)
for (j in colnames(des)[10:18]) res(j)
In addition to exporting the results file, we also save the final voom-transformed matrix and model object to disk for later use. This will save considerable time when clustering and plotting figures.
saveRDS(v$E, './Data/mat_mRNA.rds')
saveRDS(fit, './Data/fit_mRNA.rds')
To get an overview of how many genes are declared differentially expressed across all our contrasts at 10% FDR, we build the following summary table.
(df <- clin %>%
distinct(Tissue, Time) %>%
rowwise() %>%
mutate(DEgenes = fread(paste0('./Results/Response/mRNA/',
Tissue, '_', Time, '.Response.txt')) %>%
summarise(sum(q.value <= 0.1)) %>%
as.numeric()) %>%
as.data.frame())
Tissue | Time | DEgenes |
---|---|---|
Blood | wk00 | 8 |
Blood | wk01 | 25 |
Blood | wk12 | 46 |
Lesional | wk00 | 2 |
Lesional | wk01 | 14 |
Lesional | wk12 | 5 |
Nonlesional | wk00 | 12 |
Nonlesional | wk01 | 113 |
Nonlesional | wk12 | 8 |
It appears there are signs of differential expression at each tissue-time, but the signal is somewhat uneven. To get a better sense for these results, it helps to visualise them as a bar plot.
ggplot(df, aes(Time, DEgenes, fill = Tissue)) +
geom_bar(stat = 'identity', position = 'dodge') +
scale_fill_npg() +
labs(title = 'Differential Expression by Tissue and Time',
y = 'Differentially Expressed Genes (10% FDR)') +
theme_bw() +
theme(plot.title = element_text(hjust = .5),
legend.justification = c(0.99, 0.99),
legend.position = c(0.99, 0.99))
We find here that while transcriptomic signatures of response seem to get stronger in blood as time goes on, they peak in lesional and nonlesional skin at week 1. Nonlesional skin in particular shows an especially strong signal after one week's exposure to etanercept. We are especially interested in identifying predictive biomarkers, however, which is why in what follows we will focus on baseline measures.