-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSWATH-limma.Rmd
207 lines (150 loc) · 7.71 KB
/
SWATH-limma.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
title: "Differentially expressed protein analysis of SWATH data using the LIMMA package"
output: html_notebook
---
This worflow shows how to process data from SWATH-MS protein quantification, in order to study differential expression of the proteins between the different conditions.
The data in this example come from : Bjelosevic S, Pascovici D, Ping H, et.al. Quantitative age-specific variability of plasma proteins in healthy neonates, children and adults. Molecular & Cellular Proteomics (March 23 2017). Four conditions are compared each conditions have ten replicates :
-neonates
-child less than one year old
-child between 1 and five years old
-adults
146 proteins were identified and quantified.
Here we gonna use the limma package to perform the normalization of the data and the differential expression annalysis.
Limma is a Bioconductor package which uses linear model for analysing experiments and the assessment of differential expression.
#1. loading and visualizing the data
```{r}
data<-read.table("SWATH.example.data.csv", sep=",",header=T, row.names = 1)
data
boxplot(data, ylab="log2(Intensity)", xlab="samples", main = "distribution of Intensity")
hist(as.numeric(unlist(data)), main = "Histogram of Intensity distribution", xlab = "Intensity")
```
#3. Normalization
##3.1 design and contrast matrix
In order to perform the normalization with the limma package we need to define the design matrix generated with model.matrix(), which identify which sample belong to which condition.
```{r}
#for this specific data :
exp.design <- data.frame(samples = colnames(data), condition = 1) ## first we define a matrix containing the experiment design information by associating a sample name with the condition name
exp.design$condition[1:10] = "control"
exp.design$condition[11:20] = "lessone"
exp.design$condition[21:30] = "onetofive"
exp.design$condition[31:40] = "adult"
design <- model.matrix(~0 + exp.design$condition, data = exp.design) ## model.matrix() use the experiment design to generate a matrix.
colnames(design) <- sort(unique(exp.design$condition))
row.names(design) <- exp.design$samples
as.data.frame(design)
```
We also need to create a contrast matrix generated with the limma function makeContrast(). This matrix allows to define the comparisons between the differents conditions to perform.
```{r}
contrast <- makeContrasts(adult-control, lessone-control, onetofive-control, levels=design)
as.data.frame(contrast)
```
##3.2 applying limma
Now, we can normalise the dataset using the following commands. The calcNormFactors(), calculates the normalization factors to scale the library sizes.
The limma package (since version 3.16.0) offers the voom function that will normalise read counts and apply a linear model to the normalised data before computing moderated t-statistics of differential expression.
The returned data object consists of a few attributes, which you can check using names(y), one of which is the normalised expression (y$E) values in log2 scale.
```{r}
library(limma)
library(edgeR)
dge <- DGEList(data)
dge <- calcNormFactors(dge)
y <- voom(dge, design)
norm.data <- y$E
as.data.frame(norm.data)
boxplot(norm.data, ylab="normalized Intensity", xlab="samples", main = "distribution of the normalized Intensity")
hist(norm.data, main = "Histogram of Intensity distribution", xlab = "Intensity")
```
#4. Differential expression annalysis
##4.1 fitting the model
To fit the model, use the lmFit() function, which takes in the normalised data object and the design matrix:
```{r}
fit <- lmFit(y, design)
```
Refit the model using the comparisons defined:
```{r}
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)
```
##4.2 Extracting the results
The topTable function summarises the output from limma in a table format.
```{r}
adult.vs.c <- topTable(fit2, coef = "adult - control", n = nrow(fit2))
lessone.vs.control <- topTable(fit2, coef = "lessone - control", n = nrow(fit2))
onetofive.vs.control <- topTable(fit2, coef = "onetofive - control", n = nrow(fit2))
adult.vs.c ; lessone.vs.control ; onetofive.vs.control
```
We keep only the values of interest like the p value, the adjusted p value and the fold change to create the volcano plot. The results for the different pairwise comparison are stored in a list of dataframe.
```{r}
adult.vs.c$protein <- row.names(adult.vs.c)
row.names(adult.vs.c) <- NULL
adult.vs.c <- adult.vs.c[, c(7, 4, 5, 1)]
lessone.vs.control$protein <- row.names(lessone.vs.control)
row.names(lessone.vs.control) <- NULL
lessone.vs.control <- lessone.vs.control[, c(7, 4, 5, 1)]
onetofive.vs.control$protein <- row.names(onetofive.vs.control)
row.names(onetofive.vs.control) <- NULL
onetofive.vs.control <- onetofive.vs.control[, c(7, 4, 5, 1)]
Results <- list(adult.vs.c, lessone.vs.control, onetofive.vs.control)
```
##4.2 Differentially expressed proteins
A protein is considered differentially expressed if its p.value and fold change are superior to a determined threshold. The fold change represent how much a protein is differentially expressed between two conditions and the p value allows to evaluate the statistical significance of this difference.
```{r}
thresh_fc <- 0.5
thresh_p <- 0.05
deProt <- Results
for(i in 1:length(Results)){
fc = as.data.frame(deProt[[i]])[,4]
p = as.data.frame(deProt[[i]])[,3]
dt <-as.data.frame(deProt[[i]])
deProt[[i]] <- dt[which(p<=thresh_p & abs(fc)>=thresh_fc),]
print(as.data.frame(deProt[[i]]))
}
```
##4.3 Volcano Plot
For each pairwise comparison in the results list a volcano plot is created, volcano plots are the best way to visualise the p value and the fold change at the same time.
```{r}
library(ggplot2)
library(ggrepel)
tresh_fc = 0.5 ## Fold change threshold
tresh_p = 0.05 ## p.value treshold
for(i in 1:length(Results)){
plotTitle <- substr(colnames(Results[[i]])[2], 9 ,nchar(colnames(Results[[i]])[2]))
values <- as.data.frame(Results[[i]])
forplot <- data.frame(x=as.numeric(values[,4]), y=-log10(values[,3]), id=as.character(values[,1]))
tmp <- forplot[as.numeric(forplot$y)>=-log10(tresh_fc) & abs(forplot$x)>tresh_fc,]
p <- ggplot(forplot) + geom_point(aes(x, y , color = ifelse(y>=-log10(tresh_p) & abs(x)>=tresh_fc, "not signi", "FC")),show.legend = F) +
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(data = subset(forplot, abs(forplot$x)>=tresh_fc & forplot$y>=-log10(tresh_p)),
aes(x,y,label = id),
size = 2) +
geom_vline(xintercept = tresh_fc ) +
geom_vline(xintercept = -tresh_fc) +
geom_hline(yintercept = -log10(tresh_p)) +
labs(title = plotTitle,x="log2(Fold-change)", y="-log10(P.Value)") + theme_bw()
print(p)
}
```
```{r}
res <- vector()
for(i in 1:length(deProt)){
res <- append(res, as.character(deProt[[i]][,1]))
}
res <- unique(res)
print(res)
saveRDS(res, "resLIMMA.rds")
```
```{r}
library(VennDiagram)
grid.newpage()
draw.triple.venn(area1 = length(deProt[[1]][,1]),
area2 = length(deProt[[2]][,1]),
area3 = length(deProt[[3]][,1]),
n12 = length(intersect(deProt[[1]][,1],deProt[[2]][,1])),
n23 = length(intersect(deProt[[2]][,1],deProt[[3]][,1])),
n13 = length(intersect(deProt[[1]][,1],deProt[[3]][,1])),
n123 = length(Reduce(intersect, list(deProt[[1]][,1],deProt[[2]][,1],deProt[[3]][,1]))),
category = c("a-c","l-c","o-c"),
lty = "blank", fill = c("skyblue", "pink1", "mediumorchid"))
```
# ```{r}
# boxplot(apply(norm.data,1,sd),apply(scale(log2(data)),1,sd), log(apply(data,1,sd)),main = "comparison of the standard deviation \n of the differents normalization techniques", xlab="technique", ylab="standard deviation",names=c("voom", "center and scaling", "original data"))
# ```