## A Cluster Performance Experiment ##
### Introduction ###
The elapsed time depends on many factors, which presents an opportunity to optimize the
computation even further by making the best choice for each factor. Our approach to this optimization
is to run statistically designed experiments. The experiment involves a number of factors. There are
two different types of elapsed time, so there is one computation-type categorical factor.
Three statistical factors, `n`, `m`, and `v`, are the kernel factors of our experiment.
There are also two Hadoop HDFS factors, two Hadoop MapReduce factors, and two hardware factors, for a
total of 10 factors. We also run replicates for each combination of these factors. For more
details, you can check the paper(link), which describes the whole experimental process. In this tutorial,
we mainly focus on the computation-type categorical factor `type` and the three statistical factors `n`,
`m`, and `v`.
We will vary the values of `n`, `m`, `v` and set 3 replicates for the experiment.
- n : 21, 23, 25, 27
- m : 8, 9, 10, ..., 16
- v : 4, 5, 6
- type : "O", "T", two levels
- rep : 3
Firstly, we will run a serial computation in R, that is, with no parallel computing. By contrast,
another designed experiment with the DeltaRho computation system will then be carried out to assess the
computational ability of DeltaRho.
### Serial Computation in R ###
Let's see what we can do without the DeltaRho computation environment, running this experiment in
R alone. We start with a single run to get a basic idea of the time cost.
```{r, eval=FALSE,tidy=FALSE}
m <- 2^23                 # number of observations
v <- 5                    # log2 number of variables
p <- 2^v - 1              # number of predictors
## generate p numeric predictors and a binary response
value <- matrix(c(rnorm(m*p), sample(c(0,1), m, replace=TRUE)), ncol=p+1)
## elapsed time of the logistic regression fit
L <- system.time(glm.fit(value[,1:p], value[,p+1], family=binomial())$coef)[3]
rm(value)
L
```
We illustrate the serial computation in R by setting the number of observations $m = 2^{23}$ and the
log2 number of variables $v = 5$. With 8 bytes per value, the size of the dataset is
$2^{23}\cdot 2^5 \cdot 2^3=2GB$. Since this serial computation is done entirely in R, there is no
reading time: we can analyze the dataset directly after generating it. In this case, the elapsed time
of interest is the second elapsed time `L`, the analysis time.
```
367.719
```
The total elapsed time is $L = 367.719$ seconds, so it takes about 6.13 minutes to
finish the analysis part. That's not bad. What if we enlarge the value of `v` to 6? Then the size
of the new dataset would be $2^{23}\cdot 2^6 \cdot 2^3=4GB$. We can re-run this test.
```{r, eval=FALSE,tidy=FALSE}
v <- 6
p <- 2^v -1
value <- matrix(c(rnorm(m*p), sample(c(0,1), m, replace=TRUE)), ncol=p+1)
L <- system.time(glm.fit(value[,1:p],value[,p+1],family=binomial())$coef)[3]
rm(value)
L
```
```
962.317
```
The total elapsed time `L` is 962.317 seconds, which is approximately 16.04 minutes. So `L` increases
by a factor of about 2.6 while the size of the dataset only doubles. That is getting expensive. Let's
keep going by changing the value of `m`.
```{r, eval=FALSE,tidy=FALSE}
m <- 2^25
v <- 5
p <- 2^v -1
value <- matrix(c(rnorm(m*p), sample(c(0,1), m, replace=TRUE)), ncol=p+1)
L <- system.time(glm.fit(value[,1:p],value[,p+1],family=binomial())$coef)[3]
rm(value)
L
```
```
1815.752
```
The number of rows is $2^{25}$ and the number of columns is $2^5$, so the size of the new dataset
is $2^{25}\cdot 2^5 \cdot 2^3=8GB$. The elapsed time `L` is 1815.752 seconds, which is about 30.3
minutes. What if we keep increasing the size of the data frame by setting $m = 2^{27}$ and $v=4$? In
this case, the size of the dataset is 16GB, and we actually get an error message in the
analysis step:
```
Error: cannot allocate vector of size 15.0 Gb
Timing stopped at: 237.267 102.842 416.989
```
The same thing happens when we set $m = 2^{25}$ and $v=6$, in which case the dataset size is also
16GB. So serial computation in R has serious limitations for large-data analysis, namely the time
cost and the memory limit.
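As a quick sanity check on where these limits come from, here is a small helper of our own (not part
of the experiment code) that computes the in-memory size of the generated matrix for a given `m` and
`v`, assuming 8 bytes per double:

```{r, eval=FALSE, tidy=FALSE}
## in-memory size in GB of an m x 2^v matrix of doubles (8 bytes per value)
data.size.gb <- function(m, v) m * 2^v * 8 / 2^30

data.size.gb(2^23, 5)   #  2 GB: fits, analyzed in about 6 minutes
data.size.gb(2^23, 6)   #  4 GB: fits, analyzed in about 16 minutes
data.size.gb(2^25, 5)   #  8 GB: fits, analyzed in about 30 minutes
data.size.gb(2^27, 4)   # 16 GB: allocation fails on this machine
```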
#### Conclusion and Plot ####
- For $v=4$, the largest value of `n` for which serial computation in R works is 25.
- For $v=5$, the largest value of `n` for which serial computation in R works is 25.
- For $v=6$, the largest value of `n` for which serial computation in R works is 23.

[Plot. Elapsed time plots against n](./plots/serial_v_n.pdf).
Now it is time for us to experience the computation ability of `RHIPE`.
### A Designed Experiment Using RHIPE ###
We have shown how to do a single run of the performance test, so we
can easily generalize it to the whole cluster performance experiment.
#### Dataset Generation Example ####
```{r,eval=FALSE,tidy=FALSE}
n.vec <- c(21, 23, 25, 27)   # log2 total number of observations
m.vec <- seq(8, 16, by=1)    # log2 number of observations per subset
v.vec <- 4:6                 # log2 number of variables
p.vec <- 2^v.vec - 1         # number of predictors
rep.vec <- 1:3               # 3 replicates per factor combination
dir.exp <- "/ln/song273/tmp/multi.factor"
```
First, we specify some parameters at the front end of R. The `rep.vec` indicates that for each combination
of `m` and `v` we have 3 replicates. The `dir.exp` specifies the directory for this experiment on HDFS.
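As a quick check of the size of the design (this snippet is illustrative and not part of the
experiment code), we can enumerate all factor combinations at the front end:

```{r, eval=FALSE, tidy=FALSE}
## enumerate every (n, m, v, type, rep) combination of the designed experiment
design <- expand.grid(n = n.vec, m = m.vec, v = v.vec,
                      type = c("O", "T"), rep = rep.vec)
nrow(design)   # 4 * 9 * 3 * 2 * 3 = 648 runs, matching the timing data frame below
```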
##### Map and Execution Function #####
```{r, eval=FALSE,tidy=FALSE}
for (n in n.vec) {
  for (m in m.vec) {
    for (v in v.vec) {
      p = 2^v - 1
      ## output location of this MapReduce job on HDFS (matches the timing step below)
      dir.dm = paste(dir.exp, "/dm/", 'n', n, 'v', v, 'm', m, sep="")
      map1   # map expression from the dataset generation part of the single-run example
      mr1    # execution function (MapReduce job) from the single-run example
    }
  }
}
```
Here `map1` and `mr1` are, respectively, the map function and the execution function we wrote in the
dataset generation part of the single-run example; you can copy the code from that section.
For each fixed combination of `n`, `m`, and `v` there is a single run, so the idea is to put the
single-run code into a for-loop, and the nested for-loops launch a number of MapReduce jobs. The
`dir.dm` specifies the output location for each MapReduce job.
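For reference, here is a minimal sketch of what `map1` and `mr1` might look like. This is only an
illustration: assume the actual definitions are the ones from the single-run section, that `m`, `p`,
and `n` are made available to the map tasks as they are there, and that the `rhwatch()` arguments shown
(lapply-style `input`, `mapred`, `readback`) follow the usual RHIPE pattern but may differ in detail.

```{r, eval=FALSE, tidy=FALSE}
## hypothetical sketch: each map key generates one subset of 2^m rows and p predictors
map1 <- expression({
  lapply(seq_along(map.keys), function(r) {
    value <- matrix(c(rnorm(2^m * p), sample(c(0, 1), 2^m, replace = TRUE)),
                    ncol = p + 1)
    rhcollect(map.keys[[r]], value)   # key = subset index, value = subset matrix
  })
})
## hypothetical sketch: map-only job writing the subsets to dir.dm on HDFS
mr1 <- rhwatch(
  map      = map1,
  input    = c(2^(n - m), 10),                # lapply-style input: 2^(n-m) keys, one per subset
  output   = dir.dm,
  mapred   = list(mapred.reduce.tasks = 0),   # no reduce step
  readback = FALSE                            # leave the subsets on HDFS
)
```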
Note that we generate the subsets only once; in the timing part that follows, we read the
subsets from HDFS to the front end of R 3 times.
##### Elapsed Time Measurement #####
```{r, eval=FALSE,tidy=FALSE}
timing <- data.frame()   # collects one row of results per run
for (n in n.vec) {
  for (rep in rep.vec) {
    ## timing for O
    type = "O"
    for (m in m.vec) {
      for (v in v.vec) {
        dir.dm = paste(dir.exp, "/dm/", 'n', n, 'v', v, "m", m, sep="")
        dir.nf = paste(dir.exp, "/nf/", "rep", rep, 'n', n, 'v', v, "m", m, sep="")
        map2   # map function specified before
        mr2    # execution function reading from dir.dm and writing to dir.nf
        t = as.numeric(system.time({rhex(mr2, async=FALSE)})[3])
        t = data.frame(rep=rep, n=n, m=m, v=v, type=type, t=t)
        timing = rbind(timing, t)
      }
    }
    ## timing for T
    type = "T"
    for (m in m.vec) {
      for (v in v.vec) {
        dir.dm = paste(dir.exp, "/dm/", 'n', n, 'v', v, "m", m, sep="")
        dir.gf = paste(dir.exp, "/gf/", "rep", rep, 'n', n, 'v', v, "m", m, sep="")
        map3     # map function specified before
        reduce3  # reduce function specified before
        mr3      # execution function reading from dir.dm and writing to dir.gf
        t = as.numeric(system.time({rhex(mr3, async=FALSE)})[3])
        t = data.frame(rep=rep, n=n, m=m, v=v, type=type, t=t)
        timing = rbind(timing, t)
      }
    }
  }
}
```
The first for-loop runs over the different values of `n`. For each fixed value of `n`, the second
for-loop runs over the replicates. For each fixed value of `n` and `rep`, the third and fourth
for-loops run over the different combinations of `m` and `v`. So there is a single run of the
performance test for every fixed `n`, `rep`, `m`, and `v`. `dir.dm` and `dir.nf` respectively specify
the input and output locations for the execution function `mr2`, and `dir.dm` and `dir.gf`
respectively specify the input and output locations for the execution function `mr3`. The `map2`,
`map3`, and `reduce3` are the map and reduce functions we specified before.
Finally, we collect the results in a data frame called `timing`.
#### Results of the Performance Experiment ####
```{r, eval=FALSE,tidy=FALSE}
timing
```
```
rep n m v type t
1 1 21 8 4 O 20.644
2 1 21 8 5 O 19.601
3 1 21 8 6 O 21.530
4 1 21 9 4 O 19.581
5 1 21 9 5 O 19.521
6 1 21 9 6 O 21.493
7 1 21 10 4 O 19.571
8 1 21 10 5 O 19.514
9 1 21 10 6 O 22.500
10 1 21 11 4 O 19.499
11 1 21 11 5 O 22.517
12 1 21 11 6 O 21.513
13 1 21 12 4 O 20.011
14 1 21 12 5 O 18.531
15 1 21 12 6 O 22.004
16 1 21 13 4 O 19.623
17 1 21 13 5 O 19.497
18 1 21 13 6 O 22.654
19 1 21 14 4 O 19.473
20 1 21 14 5 O 19.472
...
```
The `timing` data frame has 648 rows and 6 columns: `n` has 4 different values, `type` has two
levels, `m` has 9 different values, `v` has 3 different values, and for each combination of the
factors we have 3 replicates. So we have $4\cdot 2\cdot 9\cdot 3\cdot 3=648$ rows.
#### Save the Results Locally ####
Because the data frame `timing` is not very large, we can easily download it to
our local laptop.
```{r, eval=FALSE, tidy=FALSE }
save(timing, file=paste("timing", ".RData", sep=""))
```
First, we saved `timing` to the current R working directory on the server where we initiated
R, using the `save` function. Next we copy this file from the server to our laptop or
desktop computer. We assume your laptop is running a Linux OS, so from now on we will be
working in the R session running on your local laptop.
```{r, eval=FALSE, tidy=FALSE }
system("scp [email protected]:timing.RData /home/song273/elapsedtime/")
```
The Linux command `scp` copies the `timing.RData` file from the remote cluster to the laptop
directory `/home/song273/elapsedtime/`. Then we can analyze the data on our local computer, which
gives us more freedom in the analysis; for example, we don't need to worry about abrupt network failures.
### Visualize the Results ###
Here we use the R function `xyplot` from the `lattice` package to visualize the results.
```{r, eval=FALSE, tidy=FALSE }
library(lattice)
load("timing.RData")
```
First, we load the `lattice` library and load the .RData file into our session. Now the data
frame `timing` is available in our workspace.
```{r, eval=FALSE, tidy=FALSE }
v = c("v=4","v=5","v=6")
label = levels(as.factor(v))
xyplot(log2(t)~timing$m|n*type,
data = timing,
type ="p",
groups = as.factor(timing$v),
aspect = 2,
col = 1:3,
pch = 1,
key = list(type = c("p"),
text = list(label = label, cex = 1.2),
lines = list(col = 1:3, lwd = 1.5, pch=1),
column = 3,
space = "top"),
xlab = "Log Number of Observations per Subset (log base 2 number)",
ylab = "Log Time (log base 2 sec)"
)
```
This gives us the first plot, elapsed time against `m`. The R code to generate the other plots is
essentially the same; only the formula and grouping change, as sketched below.
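For example, here is a sketch, with hypothetical grouping and label choices, of how the call changes
for the second plot, elapsed time against `v`:

```{r, eval=FALSE, tidy=FALSE}
## sketch: same call as above, but plot against v and group by m
xyplot(log2(t) ~ v | factor(n) * factor(type),
       data = timing,
       groups = factor(m),
       type = "p",
       aspect = 2,
       xlab = "Log Number of Variables (log base 2 number)",
       ylab = "Log Time (log base 2 sec)")
```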
- The first plot describes the relationship between the two kinds of elapsed time and the log subset size `m`.
[Plot1. Elapsed time plots against m](./plots/Elapsed_Time_m_v.pdf).
As we can see from the first plot, for every fixed value of `n` and `v`, the total elapsed time
"T" first decreases and then increases as the value of `m` grows. The three different
curves in each panel show the same drop-then-rise pattern.
- The second plot describes the relationship between the two kinds of elapsed time and the log number of
variables `v`.
[Plot2. Elapsed time plots against v](./plots/Elapsed_Time_v_n.pdf).
As we can see from Plot 2, for a fixed value of `m`, the value of log2(elapsed time) increases
roughly linearly with `v` in each panel of the plot.
- The third plot describes the relationship between the two kinds of elapsed time and the log number of observations `n`.
[Plot3. Elapsed time plots against n](./plots/Elapsed_Time_n_v.pdf).
- The fourth plot describes the relationship between the two kinds of elapsed time and the 3 replicates.
[Plot4. Elapsed time plots against Replicates](./plots/Elapsed_Time_rep_v.pdf).
This plot helps us see whether replication has a blocking effect on the elapsed time. Because we
read the subsets from HDFS to the front end of R in the first replicate, R and the operating system
may cache these files in memory, so in the following replicates the subsets might be read from
memory instead of from HDFS, which would affect our elapsed-time experiment.
From Plot 4, we can see that replication does not have a blocking effect; a quick numerical check
is sketched below.
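As a rough numerical complement to Plot 4 (our own check, not part of the original experiment code),
we can compare the mean elapsed time per replicate:

```{r, eval=FALSE, tidy=FALSE}
## mean elapsed time by replicate and computation type; if caching across
## replicates mattered, later replicates would be systematically faster
aggregate(t ~ rep + type, data = timing, FUN = mean)
```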
### Summary ###
We contrast the resulsts in the Serial Computation in R and the Divide and Recombine Computation
with `RHIPE`.
[Plot. Elapsed time plots against v](./plots/contrast_D_S.pdf).
For each combination of `n` and `v`, we pick the optimal `m` in the Divide and Recombine
computation, that is, the one that gives the minimum elapsed time "T". In this plot, we contrast this
minimum elapsed time "T" from Divide and Recombine with the elapsed time "T" from the serial
computation in R. The categorical factor `compute` has two levels, 'D' and 'S': 'D' stands for the
Divide and Recombine computation and 'S' stands for the serial computation. A sketch of how the
optimal `m` can be extracted is shown below.
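A minimal sketch of one way to extract that minimum, assuming the total elapsed time of interest is in
the `type == "T"` rows of `timing`:

```{r, eval=FALSE, tidy=FALSE}
## minimum Divide and Recombine elapsed time over m (and replicates) for each (n, v) combination
best <- aggregate(t ~ n + v, data = subset(timing, type == "T"), FUN = min)
best
```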
As we can see from the plot, the Divide and Recombine computation with RHIPE has a clear advantage
over serial computation in R, both in computation speed and in the size of data it can handle.