---
title: "Forecasting EUR/USD Volatility Using Supervised Learning"
subtitle: "Capstone Project – HarvardX Data Science Program"
author: "[Christian Satzky](https://www.linkedin.com/in/christian-satzky/)"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    keep_tex: yes
    number_sections: true
urlcolor: blue
bibliography: references.bib
nocite: '@*'
---
\newpage
\tableofcontents
\newpage
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
tidy=TRUE,
warning = FALSE, message = FALSE)
library(kableExtra)
library(gridExtra)
```
# Acknowledgement
This analysis is submitted in partial fulfillment of the 'Capstone' module in the [HarvardX Data Science](https://www.edx.org/professional-certificate/harvardx-data-science) program, an online learning initiative by Harvard University. The programming parts are written in R [@R-base].
All code is _fully reproducible_ and all steps that led to the results are documented below. For the complete code, please refer to the *.Rmd file directly. In the rendered PDF version, most code chunks are not displayed, for an improved reading experience.
# Introduction and Objective
This project deals with predicting EUR/USD volatility using supervised learning. For those unfamiliar with the topic of financial volatility and/or its measurement, I am providing [an excursion in the appendix][Appendix 1: Financial Volatility] which you may like to read before continuing with this paper.
Accurate forecasting of volatility is crucial for the pricing of financial derivatives, portfolio allocation and effective risk management. The goal of this project is to find machine-learning methods that are best suited for volatility forecasting.
For this project, [SHARELAB LIMITED](https://sharelab.com) provided data containing daily _realized volatility_ (RV) of EUR/USD currency quotes. The data consists of a `training` set (data from 1/1/2010 to 12/31/2019), and a `validation` set (data from 1/1/2020 to 7/31/2020). The objective of this project is to predict $\textrm{RV}_{t+1}$ (i.e. EUR/USD realized volatility for date $t + 1$) as accurately as possible on the pseudo out-of-sample `validation` set.
Any analysis and modeling is strictly performed on the `training` dataset, while the `validation` subset is used only to assess the final model's performance per machine-learning method considered. The performance measure is the root mean square error (RMSE).
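For $N$ observations in the evaluation set, the RMSE is computed as

$$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(\widehat{RV}_{t+1} - RV_{t+1}\right)^2}$$

where $\widehat{RV}_{t+1}$ denotes a model's prediction of $RV_{t+1}$.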
With consent of SHARELAB LIMITED, the data is publicly available in the [*.Rdata file on GitHub](https://github.com/csatzky/forecasting-realized-volatility-using-supervised-learning/blob/main/data/EURUSD_realized_volatility.RData).
# Executive Summary
The dataset contains future realized volatility $RV_{t+1}$ (y-variable) and predictors that are aggregates of _past_ volatility. Data exploration reveals that (1) the relationship between the $X$ variables and $RV_{t+1}$ [appears to be well approximated by a straight line][Relationship Across Variables] and (2) RV seems to exhibit [significant dependence with regard to the day of the week][Stratification by Day of the Week].
When it comes to modeling, four general methods are considered: Linear regression, tree ensembles, support vector machine and k-nearest neighbor. The best performing model variants for each category are:
* Linear regression: ['Weekday effect model'][Weekday Effect Model]
* Tree ensemble: [Gradient boosted trees][Gradient Boosted Trees (XGBoost)]
* Support vector machine (SVM): [SVM with linear kernel][Linear Kernel]
* [K-nearest neighbor][K Nearest Neighbor] (only one model considered)
Of all methods and models considered, the gradient boosted tree algorithm performs best with an RMSE of `r round(0.0015951,4)` on the `validation` set. However, the `validation` set includes a period of market turmoil due to the Covid-19 pandemic that greatly impacted markets in March 2020. [Rigorous analysis of the results][Predictions During Pre-Crisis Period (January 2020)] shows that:
* During _market turmoil_, the gradient boosted trees (GBT) algorithm performs best
* During _normal_ times, the support vector machine (SVM) algorithm with linear kernel performs best
The linear [weekday effect model][Weekday Effect Model] performs second best in each scenario and appears to be most robust to changing market conditions.
Notably, [in the appendix][Appendix 2: GARCH(1,1) Volatility Forecast], the industry-prevalent GARCH(1,1) model is found to be outperformed by all models considered.
# Modeling Approach
To prevent potential overtraining, the `training` dataset is first split into `train` and `test` subsets. The `train` subset is explored using summary statistics and data visualization. In the process, some variables are transformed and additional predictors are created.
Secondly, different machine-learning methods are applied to estimate the conditional expectation $E[{\textrm{RV}_{t+1}|X}]$. For each method, all model variants and all parameter tuning are developed using the `train` and `test` sets only.
Finally, using the RMSE information on the `test` subset, _one_ model _per algorithm_ is chosen to be fit on the complete `training` set and evaluated on the `validation` set. Considering the RMSE performance on the `validation` set of each method's final model, the suitability of different algorithms to forecast volatility for the EUR/USD currency is discussed.
```{r load-dataset, include = FALSE}
# load EUR/USD volatility dataset sponsored by sharelab.com
# NOTE: all code is designed to run on R 3.6.X (Windows machine)
# code runs successfully on personal computer (Intel Core i7-7500U, 16GB RAM)
# increase memory limit on Windows machine
invisible(memory.limit(size = 35000))
# load/install libraries
if(!require(gridExtra)) install.packages("gridExtra", repos = "http://cran.us.r-project.org")
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate", repos = "http://cran.us.r-project.org")
if(!require(ggthemes)) install.packages("ggthemes", repos = "http://cran.us.r-project.org")
if(!require(fastDummies)) install.packages("fastDummies", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(kernlab)) install.packages("kernlab", repos = "http://cran.us.r-project.org")
if(!require(xgboost)) install.packages("xgboost", repos = "http://cran.us.r-project.org")
library(gridExtra)
library(knitr)
library(tidyverse)
library(caret)
library(data.table)
library(lubridate)
library(ggthemes)
library(fastDummies)
library(randomForest)
library(kernlab)
library(xgboost)
# download dataset from GitHub, load data and delete file from hard disk drive (all relative paths)
url <- "https://raw.githubusercontent.com/csatzky/forecasting-realized-volatility-using-supervised-learning/main/data/EURUSD_realized_volatility.RData"
temp_name <- tempfile()
download.file(url, temp_name)
load(temp_name)
file.remove(temp_name) # remove temporary file
```
# Variables in the EUR/USD Dataset
Prior to data exploration, I am giving an overview below of the meaning, calculation and interpretation of the variables provided in the EUR/USD realized volatility dataset.
```{r variable-info, echo=TRUE}
# show variable names
names(training)
```
There are 5 variables provided:
1. 'date'
2. $RV_{t+1}$ ('rv_t1')
3. $RV^d$ ('rv_d')
4. $RV^w$ ('rv_w')
5. $RV^m$ ('rv_m')
$RV$ is short for 'realized volatility'. Following Corsi (2009), $RV$ for the EUR/USD exchange rate is calculated by SHARELAB LIMITED as follows:
$$RV^d = \sqrt{\sum_{m=0}^{M-1} r^2_{t-\frac{m}{M}}}$$
where:
$RV^d$: is the one-day realized volatility for the EUR/USD quote
$r_t$: is the intradaily return of the EUR/USD quote, $r_t = \ln \frac{P_t}{P_{t-\frac{1}{M}}}$
$M$: is the number of equally spaced returns per day, here $M=288$
$\ln$: is the natural logarithm
$P_t$: is the EUR/USD quote at time $t$
As seen above, the daily realized variance is the sum of _intradaily_, continuously compounded, squared returns, and the daily realized volatility $RV^d$ is its square root. Note that there are $M=288$ returns per day, implying that intradaily returns are observed at the 5-minute interval over the 24-hour EUR/USD trading day.
Both $RV^w$ and $RV^m$ are rolling, historical averages of $RV^d$. More precisely, $RV^w$ is the _weekly_ average of $RV^d$ and $RV^m$ is the _monthly_ average of $RV^d$. Note that for the EUR/USD currency pair, there are 5 trading days per week and roughly 22 trading days a month.
$$RV^w = \frac{1}{5} \sum_{j=0}^{4} RV^d_{t-j}$$
$$RV^m = \frac{1}{22} \sum_{j=0}^{21} RV^d_{t-j}$$
where:
$t$: current trading day (as shown in the variable 'date')
$t-j$: $j^{th}$ trading day before $t$
Note that $RV_{t+1}$ is the 'Y' variable we are trying to predict. $RV_{t+1}$ is the one-day ahead future value of daily realized volatility, $RV^d$.
In summary, in this project we are trying to predict the next trading day's EUR/USD volatility, $RV_{t+1}$, given the historical volatility measures $RV^d$, $RV^w$ and $RV^m$. For more information, see Corsi (2009).
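To make these definitions concrete, the sketch below shows how $RV^d$, $RV^w$ and $RV^m$ could be computed from a series of 5-minute EUR/USD quotes. The simulated quotes and all object names are illustrative assumptions on my part; this is not SHARELAB LIMITED's production code, and the chunk is not evaluated.

```{r rv-computation-sketch, eval=FALSE, echo=TRUE}
# illustrative sketch: computing RV^d, RV^w and RV^m from simulated 5-minute quotes
library(data.table)
set.seed(42)
M <- 288 # 5-minute intervals per 24-hour trading day
days <- as.Date("2020-01-01") + 0:29
quotes <- data.table(
  date = rep(days, each = M),
  price = 1.10 * exp(cumsum(rnorm(length(days) * M, sd = 0.0002)))
)
# daily realized volatility: square root of the sum of squared intradaily log returns
# (within each day, diff() yields M-1 returns; the overnight return is ignored here)
rv <- quotes[, .(rv_d = sqrt(sum(diff(log(price))^2))), by = date]
# rolling 5-day (weekly) and 22-day (monthly) averages of RV^d
roll_mean <- function(x, n) {
  sapply(seq_along(x), function(i) if (i < n) NA_real_ else mean(x[(i - n + 1):i]))
}
rv[, rv_w := roll_mean(rv_d, 5)]
rv[, rv_m := roll_mean(rv_d, 22)]
# the target rv_t1 is RV^d of the following trading day
rv[, rv_t1 := shift(rv_d, type = "lead")]
head(rv)
```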
# Data Exploration
Data exploration and any modeling is strictly performed on the `train` set, which contains 90% of the observations given in the `training` dataset.
```{r create-train-test-sets, echo=TRUE}
# split 'training' into 'train' and 'test' sets
# order 'training' set by date ascending
training <- training[order(date),]
# pick 10% of (future) time-series for 'test' set
dates_test_set <- tail(training$date, round(0.1*nrow(training)))
test <- training[date %in% dates_test_set,]
train <- training[!date %in% dates_test_set,]
```
## Summary Statistics
First, I am describing the provided variables for the `train` partition using summary statistics.
```{r summary-stats}
# retrieve variable names, classes and summary statistics
var_info <- rbind(
class = sapply(train, class),
min = sapply(train, FUN = function(x){
round(min(x),5)
}),
max = sapply(train, FUN = function(x){
round(max(x),5)
}),
mean = sapply(train, FUN = function(x){
round(mean(x),5)
}),
median = sapply(train, FUN = function(x){
round(median(x),5)
}),
sd = sapply(train, FUN = function(x){
round(sd(x),5)
})
)
# format 'date'-column
var_info[2:3,1] <- as.character(as_date(as.numeric(var_info[2:3,1])))
var_info[4:6,1] <- rep("",3) # mean/median/sd not meaningful for dates
# format var names
colnames(var_info) <- c("date", "$\\textrm{RV}_{t+1}$", "$RV^d$", "$RV^w$", "$RV^m$")
# display latex-style table
kable(var_info, escape = FALSE, booktabs = TRUE) %>%
kable_styling(position = "center")
```
From the table above, we observe:
* the time-series for the `train` partition spans from 1 January 2010 to 31 December 2018
* the `date` variable is correctly formatted as `Date` and the other variables are all `numeric`
* all variables are positive (the lowest value of $RV^d$ is 0.00003). This is expected as volatility is proxied by the standard deviation of the return time-series. Therefore, volatility must be greater than 0 as long as the EUR/USD returns are not constant
* the mean value is approximately the same across daily, weekly and monthly aggregations of $RV$
* across all $RV$ variables, the median value is slightly lower than the mean, suggesting that the distribution of each measure of $RV$ is skewed to the right (see the sketch after this list)
* note that $sd(RV_{t+1}) = sd(RV^d) > sd(RV^w) > sd(RV^m)$. This is expected, as the weekly and monthly measures average $RV^d$ over more observations; the averaging smooths out day-to-day noise, so $RV^d$ is noisier than $RV^w$ and $RV^m$
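The right-skew observation in the list above can be sanity-checked with a moment-based sample skewness. The short chunk below is a sketch for illustration only and is not evaluated in the report.

```{r skewness-check, eval=FALSE, echo=TRUE}
# quick check: positive values indicate right-skewed distributions
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(train[, .(rv_t1, rv_d, rv_w, rv_m)], sample_skewness)
```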
```{r plot-distributions}
## plot empirical densities
# rv_t1
p1 <- train %>%
ggplot(aes(x=rv_t1)) +
ggtitle(bquote("Empirical Density of" ~ RV[t+1])) +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10)) +
geom_density()
# rv_d
p2 <- train %>%
ggplot(aes(x=rv_d)) +
ggtitle(bquote("Empirical Density of" ~ RV^d)) +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10)) +
geom_density()
# rv_w
p3 <- train %>%
ggplot(aes(x=rv_w)) +
ggtitle(bquote("Empirical Density of" ~ RV^w)) +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10)) +
geom_density()
# rv_m
p4 <- train %>%
ggplot(aes(x=rv_m)) +
ggtitle(bquote("Empirical Density of" ~ RV^m)) +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10)) +
geom_density()
grid.arrange(p1, p2, p3, p4, ncol=2)
```
The empirical densities are in line with the fact that $RV^w$ and $RV^m$ are cumulative aggregations of $RV^d$. Rare, extreme events with high volatility have greater persistence on weekly and even more so on monthly measures of volatility.
## Relationship Across Variables
As $RV^w$ has overlapping information with $RV^d$ and $RV^m$ has overlapping information with both $RV^w$ and $RV^d$, it is straightforward to assume that all X variables are positively correlated. This is confirmed by the correlation matrix below:
```{r correlation-analysis, echo=TRUE}
# compute Pearson correlation of features
corr <- round(cor(train[,.(rv_t1,rv_d,rv_w,rv_m)], method="pearson"),2)
var_names <- c("$RV_{t+1}$", "$RV^d$", "$RV^w$", "$RV^m$")
row.names(corr) <- var_names
colnames(corr) <- var_names
# display latex-style table
kable(corr, escape = FALSE, booktabs = TRUE) %>%
kable_styling(position = "center")
```
Looking at the scatter plots of the $X$ variables vs. $RV_{t+1}$ provides greater detail regarding the variables' relationship:
```{r scatter-plots}
p1 <- train %>%
ggplot(aes(x=rv_d, y=rv_t1)) +
geom_point(alpha=0.3) +
ggtitle(bquote("Relationship between" ~ RV^d ~ "and" ~ RV[t+1])) +
xlab(bquote(RV^d)) +
ylab(bquote(RV[t+1])) +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10),
axis.title=element_text(size=9, family="sans"))
p2 <- train %>%
ggplot(aes(x=rv_w, y=rv_t1)) +
geom_point(alpha=0.3) +
ggtitle(bquote("Relationship between" ~ RV^w ~ "and" ~ RV[t+1])) +
xlab(bquote(RV^w)) +
ylab(bquote(RV[t+1])) +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10),
axis.title=element_text(size=9, family="sans"))
p3 <- train %>%
ggplot(aes(x=rv_m, y=rv_t1)) +
geom_point(alpha=0.3) +
ggtitle(bquote("Relationship between" ~ RV^m ~ "and" ~ RV[t+1])) +
xlab(bquote(RV^m)) +
ylab(bquote(RV[t+1])) +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10),
axis.title=element_text(size=9, family="sans"))
p4 <- train %>%
ggplot(aes(x=rv_m, y=rv_t1)) +
geom_point(alpha=0.3) +
geom_smooth() +
ggtitle(bquote(RV^m ~ "vs." ~ RV[t+1] ~ "(LOESS)")) +
xlab(bquote(RV^m)) +
ylab(bquote(RV[t+1])) +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(size=10),
axis.title=element_text(size=9, family="sans"))
grid.arrange(p1, p2, p3, p4, ncol=2)
```
As seen in the plots above, the relationship between the independent variables and $RV_{t+1}$ appears to be well approximated by a straight line. Hence, a linear model to estimate $E[{\textrm{RV}_{t+1}|X}]$ might perform well.
## Stratification by Day of the Week
The EUR/USD currency pair is traded from Mondays to Fridays. When information accumulates _over the weekends_, it can only impact volatility on the following Monday. Likewise, if market participants are required to close their positions _before_ the start of the weekend, this might impact Friday's volatility. Hence, it is sensible to investigate volatility dependencies with respect to days of the week.
First, the variable `date` is used to extract `weekday`.
```{r extract-weekday, echo=TRUE}
# extract 'weekday' from 'date' variable
train[, weekday := weekdays(date)]
```
Secondly, I am plotting 95% confidence intervals for the _mean value_ of $RV^d$ per `weekday`. Note that stratifying $RV^d$ is equivalent to stratifying $RV_{t+1}$, but for the latter the `weekday` is shifted by one day.
```{r plot-cis-per-weekday, out.width = '75%', fig.align='center'}
# plot CIs of average daily volatility per 'weekday'
train %>%
group_by(weekday) %>%
mutate(mean_rv_d = mean(rv_d)) %>%
mutate(ci = sd(rv_d)/sqrt(nrow(train)) * qnorm(0.975)) %>%
mutate(weekday = factor(weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))) %>%
ggplot(aes(x=weekday, y=mean_rv_d)) +
theme(axis.text.x = element_text(angle=-90, hjust=0, vjust=0.2)) +
geom_errorbar(aes(ymax = mean_rv_d + ci, ymin = mean_rv_d - ci)) +
geom_hline(yintercept=mean(train$rv_d), linetype="dashed", color = "red") +
ggtitle(bquote("95% Confidence Intervals for Mean" ~ RV^d ~ "per Weekday")) +
xlab("") +
ylab("") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
In the above plot, the dashed red line marks the overall mean of $RV^d$. Surprisingly, the 95% confidence interval for $\overline{RV^d}$ is _lowest_ on Mondays. It might be the case that on average, information flow on weekends is modest and risk-averse market participants close positions on Fridays, when volatility is _greater_ than on the average weekday.
Note that _none_ of the 95% confidence intervals overlaps with the overall average value of `r round(mean(train$rv_d),5)`. This is reasonable evidence of variability across weekdays, and hence I will include the newly created `weekday` variable in the modeling via dummy variables.
$$D^i =
\begin{cases}
1, & \textrm{if}\ \textrm{weekday} = i \\
0, & \textrm{otherwise}
\end{cases}$$
where:
$D$: Dummy variable, $D : \Omega \to \{0,1\}$
$i$: Value of `weekday`, $i \in \{\textrm{Monday}, \textrm{Tuesday}, \textrm{Wednesday}, \textrm{Thursday}, \textrm{Friday}\}$
To avoid perfect multicollinearity, it is necessary to only create dummy variables for $N-1$ categories of `weekday`. Hence, I am only including dummy variables for Monday to Thursday. The library `fastDummies` [@fastdummies] provides a convenient function for dummy variable extraction.
```{r create-dummy-var, echo=TRUE}
# create dummy variables on `train` set
train <- dummy_cols(.data = train, select_columns = "weekday")
# exclude Friday's dummy variable
train[, weekday_Friday := NULL]
```
# Modeling
Now that the data is sufficiently explored and additional features have been created, in this section I am fitting different models on the `train` set and evaluating their RMSE on the `test` set.
## The Benchmark Model
The benchmark model is simply the average value of $RV_{t+1}$, without using any of the other variables. It can be directly evaluated on the `test` set.
$$RV_{t+1} = \mu + \epsilon_{t+1}$$
```{r benchmark-model}
# calculate mean rv_t1
benchmark_model <- mean(train$rv_t1)
# create function to calculate RMSE
get_rmse <- function(y_hat, y){
round(sqrt(mean((y-y_hat)**2)),7)
}
# compute RMSE on test set
rmse_benchmark <- get_rmse(benchmark_model, test$rv_t1)
```
The benchmark model achieves an RMSE of `r round(rmse_benchmark,5)` on the `test` set.
## Linear Model
Other features in the dataset are $RV^d$, $RV^w$, $RV^m$ and the newly created dummy variables $D^i$. In the following, I am incorporating these variables using different model variants.
### HAR-RV Model
For the first linear model variant, I am including all features except for the `weekday` dummy variables. This model is equivalent to Corsi's (2009) Heterogeneous Autoregressive model of Realized Volatility (HAR-RV).
$$RV_{t+1} = \alpha + \beta^d RV^d + \beta^wRV^w + \beta^mRV^m + \epsilon_{t+1}$$
```{r linear-model-har-rv}
# fit linear model using all RV measures
fit_har_rv <- lm(rv_t1 ~ rv_d + rv_w + rv_m, data=train)
# print summary
kable(summary(fit_har_rv)$coef, escape = TRUE, booktabs=TRUE) %>%
kable_styling(position = "center")
```
As seen above, all coefficient estimates in the HAR-RV model are significantly different from zero. Also note that all coefficients are positive, which is in line with what was discovered in the [Relationship Across Variables] section. Below, I am summarizing the model's RMSE performance on the `test` set.
```{r performance-summary-har-rv}
# use fitted model to predict RV_t1 on 'test' set
y_hat <- predict(fit_har_rv, newdata=test)
# obtain RMSE on test set
rmse_har_rv <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
models <- data.frame(Model=c("Benchmark", "HAR-RV"),
Variables = c("$\\mu$","$\\alpha, RV^d, RV^w, RV^m$"),
RMSE = c(rmse_benchmark, rmse_har_rv))
names(models)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models[2,3] <- paste0("\\textbf{",models[2,3],"}")
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 2)
```
The HAR-RV model greatly outperforms the benchmark model. The RMSE on the `test` set improves by `r 100*(-round((rmse_har_rv/rmse_benchmark -1),4))`%.
### Weekday Effect Model
For the second linear model I am adding _all_ features, including dummy variables $D^i$ for the newly created `weekday` variable. More specifically, for 5 workdays, 4 dummy variables are introduced to capture time dependencies across weekdays as shown in the [data exploration section][Stratification by Day of the Week].
$$RV_{t+1} = \alpha + \beta^d RV^d + \beta^wRV^w + \beta^mRV^m + \sum \beta_i D_i + \epsilon_{t+1}$$
where $D$ is a dummy variable, and $i \in [\textrm{Monday, Tuesday, Wednesday, Thursday}]$.
```{r linear-model-weekday-effect}
# fit linear model using all features
fit_weekday_linear <- lm(rv_t1 ~ rv_d + rv_w + rv_m +
weekday_Monday +
weekday_Tuesday +
weekday_Wednesday +
weekday_Thursday, data=train)
# print summary
kable(summary(fit_weekday_linear)$coef, escape = TRUE, booktabs=TRUE) %>%
kable_styling(position = "center")
```
Each dummy variable has a parameter value that is significantly different from zero ($\textrm{pval} < 0.05$). Note that after accounting for the variability across `weekday`, the intercept of the model, $\alpha$, is _not_ significantly different from zero.
```{r performance-weekday-effect}
# create dummy variables on 'test' set
test[, weekday := weekdays(date)]
test <- dummy_cols(.data = test, select_columns = "weekday")
test <- test[, weekday_Friday := NULL]
# use fitted model to predict RV_t1 on 'test' set
y_hat <- predict(fit_weekday_linear, newdata=test)
# obtain RMSE on test set
rmse_weekday_effect <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "Weekday Effect",
Variables = "$\\alpha, RV^d, RV^w, RV^m, \\sum D_i$",
RMSE = rmse_weekday_effect)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
# highlight best performing model
models[2,3] <- rmse_har_rv
models[3,3] <- paste0("\\textbf{",models[3,3],"}")
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3)
```
As summarized in the table above, the weekday effect model improves RMSE performance in comparison to the HAR-RV model by `r 100*(-round((rmse_weekday_effect/rmse_har_rv -1),4))`%.
## Non-Linear Methods
From this section onward, I am considering more 'versatile' methods as opposed to linear regression. The advantage is that non-parametric, machine-learning models impose "no or very limited" assumptions on the data [@whitley] and offer greater flexibility in the functional forms they can capture.
As noted by @iriz, the `caret`-package [@kuhn] provides a convenient way to train machine-learning algorithms using cross-validation (CV). Furthermore, the package standardizes R functions across several libraries. In the following, I am using `caret` to tune and fit more flexible algorithms.
## Tree Ensembles
In this section I am considering tree ensembles, specifically random forests (RF) and gradient boosted trees (GBT).
### Random Forest
The random forest algorithm is a method that fits many decision trees with randomly selected predictors and aggregates the resulting trees to form a prediction (bootstrap aggregating). This procedure reduces the instability of single regression trees [@iriz].
The call `getModelInfo('rf')$rf$parameters` reveals that the random forest algorithm of the package `randomForest` [@rf] has only one tuning parameter, `mtry`. As mentioned above, for each decision tree that is fitted, random forest _randomly_ selects the predictors to be used. `mtry` sets the _number_ of features to be picked at random for each tree. As there are 7 independent variables available, it is most sensible to consider values `mtry`$< 7$. Using 10-fold cross-validation, I am tuning `mtry` on the `train` set, considering values `mtry`$\in \{1,2, ..., 6\}$.
The random forest implementation used here is by Breiman et al. (2018) and is available in the `randomForest` library.
```{r random-forest-tuning, out.width = '75%', fig.align='center'}
# select all 7 features
train_data <- train[,.(rv_t1, rv_d, rv_w, rv_m,
weekday_Monday,
weekday_Tuesday,
weekday_Wednesday,
weekday_Thursday)]
# tune 'mtry' parameter using 10-fold CV
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
tuning_result <- train(rv_t1 ~ ., method = "rf", data = train_data,
tuneGrid = data.frame(mtry = 1:6),
trControl = trainControl(method="cv", number=10))
tuning_result$results %>%
ggplot(aes(x=mtry, y=RMSE)) +
geom_point() +
ggtitle("Random Forest: Tuning Number of Predictors ('mtry')") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
Observe that the random forest model performs best for `mtry`$=2$. Using this parameter value, I am fitting the random forest model on the `train` set and summarizing the RMSE performance of its predictions on the `test` set.
```{r performance-random-forest}
# use fitted model (mtry=2) to predict RV_t1 on 'test' set
y_hat <- predict(tuning_result, newdata=test)
# obtain RMSE on test set
rmse_rf <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "Random Forest",
Variables = "$RV^d, RV^w, RV^m, \\sum D_i$; mtry = 2",
RMSE = rmse_rf)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,4)
```
The well-tuned random forest model performs much better than the benchmark model, but fails to achieve an RMSE close to the performance of the linear regression variants. As noted in the [data exploration section][Relationship Across Variables], the relationship of $X$ vs. $RV_{t+1}$ appears to be well-approximated by a straight line. There seem to be limited non-linear dependencies that random forest could take advantage of.
### Gradient Boosted Trees (XGBoost)
While the random forest algorithm aims at _reducing variance_ by aggregating many fully grown decision trees, gradient boosting aims at _reducing the bias_ of an ensemble of shallow trees.
The [XGBoost package](https://cran.r-project.org/web/packages/xgboost/xgboost.pdf) [@xgboost] has the following tuning parameters:
* `eta`: Shrinkage of feature weights. Default value $0.3$, $\textrm{eta} \in [0,1]$
* `lambda` ($\lambda$): L2 regularization term on weights, default value $1$
* `alpha` ($\alpha$): L1 regularization term on weights, default value $0$
* `nrounds`: number of boosting rounds, with typical values between 50 and 150
Going forward, I am accepting the default values for `eta` and `nrounds` and tune the $\alpha$ and $\lambda$ parameters.
```{r tune-xgbLinear, out.width = '75%', fig.align='center'}
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
tuning_result <- train(rv_t1 ~ .,
data=train_data,
method="xgbLinear",
metric="RMSE",
preProc=c("center", "scale"),
objective="reg:squarederror",
trControl=trainControl(method="cv",
number=5,
search = "grid"),
tuneGrid = data.frame(lambda=seq(0.0024,0.025,length.out=50),
alpha=seq(0.0024,0.025,length.out=50),
nrounds=seq(150,150,length.out=50),
eta=seq(0.3,0.3,length.out=50)))
tuning_result$results %>%
mutate(nrounds = as.factor(nrounds)) %>%
ggplot(aes(x=lambda, y=RMSE, color=alpha)) +
geom_point() +
ggtitle(bquote("xgbLinear: Tuning" ~ lambda ~ "and" ~ alpha ~ "Parameters")) +
theme_bw() +
xlab(bquote(lambda)) +
labs(color = bquote(alpha)) +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
As seen in the plot above, the lowest cross-validated RMSE on the `train` set is achieved for $\lambda = \alpha = 0.019$. Analogously to the previous section, I am applying the model to the out-of-sample `test` set.
```{r test-RMSE-xgbLinear}
# use best trained model to predict RV_t1 on 'test' set
y_hat <- predict(tuning_result, newdata=test)
# obtain RMSE on test set
rmse_xgbLinear <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "Gradient Boosted Trees",
Variables = "$RV^d, RV^w, RV^m, \\sum D_i; \\alpha = \\lambda = 0.019$",
RMSE = rmse_xgbLinear)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,5)
```
Gradient boosting significantly outperforms the random forest in terms of the `test` set RMSE. Still, it fails to match the performance of the linear model variants.
## Support Vector Machine (SVM)
The idea of the support vector machine algorithm is to focus on the points that are hardest to separate (i.e. the points closest to one another) and to place the separating hyperplane with the largest possible margin between them. With this separation in place, points that lie further away are more likely to be handled correctly.
Applying the SVM algorithm to many predictors can be achieved efficiently by using a kernel function. Using the `kernlab` library [@svm], I am considering a linear kernel function and the more flexible radial basis kernel.
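For orientation, a direct call to `kernlab::ksvm` for support vector regression might look like the sketch below. In this report all tuning is done through `caret`; the `eps-svr` type with its default `epsilon` and the example parameter values are assumptions for illustration only, and the chunk is not evaluated.

```{r ksvm-sketch, eval=FALSE, echo=TRUE}
# illustrative sketch: calling kernlab directly for support vector regression
library(kernlab)
# linear kernel ('vanilladot'); C controls the cost/margin trade-off
fit_lin <- ksvm(rv_t1 ~ ., data = train_data, type = "eps-svr",
                kernel = "vanilladot", C = 1, scaled = TRUE)
# radial basis kernel ('rbfdot'); sigma is the kernel width
fit_rbf <- ksvm(rv_t1 ~ ., data = train_data, type = "eps-svr",
                kernel = "rbfdot", kpar = list(sigma = 0.005), C = 3, scaled = TRUE)
predict(fit_lin, newdata = test) # one-day-ahead RV predictions
```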
### Linear Kernel
For the SVM algorithm with linear kernel, there is only one parameter to tune, i.e. `C` (cost). The cost parameter determines the margin of the separating hyperplane. A low value of `C` is associated with a large margin, a high value is associated with a small margin. Going forward, I am considering values `C`$\in[0.5, 1.2]$.
```{r tune-svm-linear, out.width = '75%', fig.align='center'}
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
tuning_result <- train(rv_t1 ~ .,
data=train_data,
method="svmLinear",
metric="RMSE",
preProc=c("center", "scale"),
trControl=trainControl(method="cv",
number=25,
search = "grid"),
tuneGrid = data.frame(C=seq(0.5,1.2,length.out=20)))
# note: above warning messages are expected and there is NO error
tuning_result$results %>%
# mutate(nrounds = as.factor(nrounds)) %>%
ggplot(aes(x=C, y=RMSE)) +
geom_point() +
ggtitle("SVM: Tuning C Parameter") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
According to the 25-fold cross validation procedure on the `train` set, a parameter value around $0.7$ seems to be the optimal choice (`C`$=0.6842$).
```{r test-RMSE-SVM-linear}
# use best trained model to predict RV_t1 on 'test' set
y_hat <- predict(tuning_result, newdata=test)
# obtain RMSE on test set
rmse_svm_linear <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "SVM Linear Kernel",
Variables = "$RV^d, RV^w, RV^m, \\sum D_i; C = 0.6842$",
RMSE = rmse_svm_linear)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
# highlight best performing model
models[3,3] <- rmse_weekday_effect
models[6,3] <- paste0("\\textbf{",models[6,3],"}")
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,5) %>%
pack_rows("Support Vector Machine",6,6)
```
So far, the SVM with linear kernel achieves the best RMSE performance on the out-of-sample `test` set. It beats the second best weekday effect model's RMSE by `r round(100*(0.0008568/0.0008178 -1),2)`%.
### Radial Basis Function
Below, I am tuning an SVM with the more flexible radial basis kernel. The parameters to be tuned are `C` and `sigma` ($\sigma$). For the tuning grid, I am considering values `C`$\in[3, 3.6]$ and $\sigma\in[0.003, 0.007]$.
```{r tune-svm-radial-basis, out.width = '75%', fig.align='center'}
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
tuning_result <- train(rv_t1 ~ .,
data=train_data,
method="svmRadial",
metric="RMSE",
preProc=c("center", "scale"),
trControl=trainControl(method="cv",
number=10,
search = "grid"),
tuneGrid = data.frame(C=seq(3,3.6,length.out=15),
sigma=seq(0.003,0.007,length.out=15)))
# note: above warning messages are expected and there is NO error
tuning_result$results %>%
# mutate(nrounds = as.factor(nrounds)) %>%
ggplot(aes(x=C, y=RMSE, color=sigma)) +
geom_point() +
ggtitle(bquote("SVM: Tuning C and" ~ sigma ~ "Parameters")) +
theme_bw() +
labs(color = bquote(sigma)) +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
The optimal parameter values according to 10-fold cross-validation on the `train` set are `C`$= 3.3$ and $\sigma= 0.005$.
```{r test-RMSE-SVM-radial}
# use best trained model to predict RV_t1 on 'test' set
y_hat <- predict(tuning_result, newdata=test)
# obtain RMSE on test set
rmse_svm_radial <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "SVM Radial Basis",
Variables = "$RV^d, RV^w, RV^m, \\sum D_i; C = 3.3, \\sigma = 0.005$",
RMSE = rmse_svm_radial)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,5) %>%
pack_rows("Support Vector Machine",6,7)
```
Even after extensive tuning, the radial basis function kernel of the SVM algorithm fails to outperform the linear kernel in the out-of-sample application. This is consistent with the discovery that [less flexible methods might be superior for the dataset at hand][Relationship Across Variables].
## K Nearest Neighbor
The k-nearest-neighbor algorithm predicts a data point from the outcomes of the $k$ _closest_ training points. For regression, the mean outcome of the $k$ closest points is assigned. For the tuning parameter `k` I am considering values `k`$\in[20,36]$.
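Before tuning, the mechanics can be illustrated independently of `caret`: the base-R sketch below predicts a single query point by averaging the outcomes of its $k$ nearest neighbors (Euclidean distance on centered and scaled features). The helper `knn_predict_one` is a hypothetical function for illustration only, and the chunk is not evaluated.

```{r knn-mechanics-sketch, eval=FALSE, echo=TRUE}
# conceptual sketch: k-nearest-neighbor regression for one query point
knn_predict_one <- function(x_query, X_train, y_train, k = 29) {
  # center and scale features using the training means and standard deviations
  mu <- colMeans(X_train)
  sigma <- apply(X_train, 2, sd)
  Xs <- scale(X_train, center = mu, scale = sigma)
  xq <- (x_query - mu) / sigma
  # Euclidean distance of the query point to every training point
  d <- sqrt(rowSums(sweep(Xs, 2, xq)^2))
  # prediction: mean outcome of the k closest training points
  mean(y_train[order(d)[1:k]])
}
X <- as.matrix(train_data[, !"rv_t1"])
knn_predict_one(X[1, ], X, train_data$rv_t1, k = 29)
```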
```{r tune-knn-models, out.width = '75%', fig.align='center'}
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
tuning_result <- train(rv_t1 ~ .,
data=train_data,
method="knn",
metric="RMSE",
preProc=c("center", "scale"),
trControl=trainControl(method="cv",
number=20,
search = "grid"),
tuneGrid = data.frame(k=20:36))
# note: above warning messages are expected and there is NO error
tuning_result$results %>%
ggplot(aes(x=k, y=RMSE)) +
geom_point() +
ggtitle("KNN: Tuning the 'k' Parameter") +
theme_bw() +
theme(panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.title=element_text(size=15, family="sans"),
axis.text=element_text(size=13, family="sans"),
plot.title=element_text(size=15, family="sans"))
```
According to 20-fold cross validation on the `train` set, the optimal value for `k` is 29.
```{r test-RMSE-knn}
# use best trained model to predict RV_t1 on 'test' set
y_hat <- predict(tuning_result, newdata=test)
# obtain RMSE on test set
rmse_knn <- get_rmse(y_hat, test$rv_t1)
# summarize models' performance
new_model <- data.frame(Model= "KNN",
Variables = "$RV^d, RV^w, RV^m, \\sum D_i; k = 29$",
RMSE = rmse_knn)
names(new_model)[3] <- "$\\text{RMSE}_{\\text{test}}$"
models <- rbind(models, new_model)
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,5) %>%
pack_rows("Support Vector Machine",6,7) %>%
pack_rows("K Nearest Neighbor",8,8)
```
When it comes to out-of-sample performance on the `test` set, the KNN algorithm performs worst. In comparison to the support vector machine with linear kernel the RMSE performance is `r round(100*(0.0009446/0.0008178 -1),2)`% worse.
# Validation Set Performance
The table above summarizes the `test` set's RMSE performance of all models considered. The best performing models _per algorithm_ are:
* Linear model: Weekday effect model, $\textrm{RMSE}_{\textrm{test}} = 0.000857$
* Tree ensembles: Gradient boosted trees, $\textrm{RMSE}_{\textrm{test}} = 0.000897$
* Support vector machine: SVM with linear kernel, $\bf{\textrm{RMSE}_{\textrm{test}} = 0.000818}$
* K Nearest Neighbor (only one model considered), $\textrm{RMSE}_{\textrm{test}} = 0.000945$
Going forward, these four models are fit to the _complete_ `training` set and evaluated on the `validation` set. For tuning on the `training` set, for each model, I am considering a grid range of $\pm 50\%$ of the previously obtained parameter values:
```{r parameter-ranges}
# summarize tuning parameter grid search values
tbl <- data.frame(Model = c("Weekday Effect", "Gradient Boosted Trees", "SVM (Linear Kernel)", "KNN"),
Parameters = c("", "$\\alpha = \\lambda = 0.019465$", "$C = 0.684211$", "$k = 29$"),
Range = c("", "$[0.0097325, 0.0291975]$", "$[0.342105, 1.026315]$", "$[15, 44]$"))
# format tbl header
names(tbl) <- c("Model", "$\\textrm{Parameter}_{\\textrm{train}}$", "$\\pm 50\\%\\textrm{Parameter Range}_\\textrm{Validation}$")
# output latex-style table
kable(tbl, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center")
```
For each model with parameters to be trained, the above parameter range $[a,b]$ is considered using 10-fold cross validation and a grid length of 15. This can be achieved using the `seq()`-function, e.g. for the SVM (linear kernel) `seq(0.342105, 1.026315, length.out = 15)`.
```{r train-best-models-per-method}
# extract 'weekday' variable for 'training' and 'validation' sets
training[, weekday := weekdays(date)]
validation[, weekday := weekdays(date)]
# create dummy variables for 'training' and 'validation' sets
training <- dummy_cols(.data = training, select_columns = "weekday")
validation <- dummy_cols(.data = validation, select_columns = "weekday")
# select features
training_data <- training[,.(rv_t1, rv_d, rv_w, rv_m,
weekday_Monday,
weekday_Tuesday,
weekday_Wednesday,
weekday_Thursday)]
# 1: Linear Model ('Weekday Effect')
fit_lm <- lm(rv_t1 ~., data=training_data) # fit on 'training'
y_hat_lm <- predict(fit_lm, newdata = validation) # predict RV_{t+1} on 'validation'
rmse_lm <- get_rmse(y_hat = y_hat_lm, y = validation$rv_t1) # get rmse
# 2: Tree Ensembles ('Gradient Boosted Trees')
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
fit_te <- train(rv_t1 ~ .,
data=training_data,
method="xgbLinear",
metric="RMSE",
preProc=c("center", "scale"),
objective="reg:squarederror",
trControl=trainControl(search = "grid"),
tuneGrid = data.frame(lambda=seq(0.0097325, 0.0291975, length.out=15), # +/- 50% previously tuned parameter
alpha=seq(0.0097325, 0.0291975, length.out=15),
nrounds=seq(150, 150, length.out=15),
eta=seq(0.3, 0.3, length.out=15)))
y_hat_te <- predict(fit_te, newdata = validation) # predict RV_{t+1} on 'validation'
rmse_te <- get_rmse(y_hat = y_hat_te, y = validation$rv_t1)
# 3: Support Vector Machine ('Linear Kernel')
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
fit_svm <- train(rv_t1 ~ .,
data=training_data,
method="svmLinear",
metric="RMSE",
preProc=c("center", "scale"),
trControl=trainControl(search = "grid"),
tuneGrid = data.frame(C = seq(0.342105, 1.026315,length.out=15))) # +/- 50% previously tuned parameter
y_hat_svm <- predict(fit_svm, newdata = validation) # predict RV_{t+1} on 'validation'
rmse_svm <- get_rmse(y_hat = y_hat_svm, y = validation$rv_t1) # get rmse
# 4: KNN
set.seed(1, sample.kind="Rounding") # for reproducibility (R 3.6.x)
fit_knn <- train(rv_t1 ~ .,
data=training_data,
method="knn",
metric="RMSE",
preProc=c("center", "scale"),
trControl=trainControl(search = "grid"),
tuneGrid = data.frame(k=seq(15,44,2))) # +/- 50% previously tuned parameter
y_hat_knn <- predict(fit_knn, newdata = validation) # predict RV_{t+1} on 'validation'
rmse_knn <- get_rmse(y_hat = y_hat_knn, y = validation$rv_t1) # get rmse
```
```{r summarize-results}
# summarize performance of models
models <- cbind(models,RMSE_validation=c("","",rmse_lm,"",rmse_te, rmse_svm, "", rmse_knn))
names(models)[4] <- "$\\text{RMSE}_{\\text{validation}}$"
# highlight best models' performance
models[,4] <- as.character(models[,4])
models[5,4] <- paste0("\\textbf{",models[5,4],"}")
# print latex-style table
kable(models, escape = FALSE, booktabs=TRUE) %>%
kable_styling(position = "center") %>%
pack_rows("Linear Model", 2, 3) %>%
pack_rows("Tree Ensembles",4,5) %>%
pack_rows("Support Vector Machine",6,7) %>%
pack_rows("K Nearest Neighbor",8,8)
```
At first glance, of all methods and models considered, the gradient boosted tree algorithm performs best on the out-of-sample `validation` set. Note that on the first pseudo out-of-sample `test` set, the support vector machine with linear kernel performed best. In the following, I am analyzing the results in greater detail.
```{r plot-predicted-vs-actual}
# gather actual/predicted values
tbl <- data.table(date=validation$date,
actual=validation$rv_t1,
linear=y_hat_lm,
GBT=y_hat_te,
SVM=y_hat_svm,
KNN=y_hat_knn)
# rename variables
names(tbl)[2:6] <- c("Actual", "Linear (Weekday Effect)", "GBT", "SVM (Linear Kernel)", "KNN")
# transform 'tbl' to 'tidy' format
tbl <- tbl %>%
pivot_longer(!date, names_to="model", values_to="prediction") %>%
mutate(model = factor(model, levels = c("Actual", "Linear (Weekday Effect)", "GBT", "SVM (Linear Kernel)", "KNN")))
# plot actual vs. predicted
tbl %>%
ggplot(aes(x=date, y=prediction, color=model)) +
geom_line() +
scale_colour_wsj("colors6", "") + theme_wsj(color = "white") +
theme(axis.title=element_text(size=12, family="sans"),
axis.text=element_text(size=10, family="sans"),
plot.title=element_text(size=12, family="sans"),
plot.subtitle = element_text(size=10, family="sans")) +
ylab(bquote(RV[t+1])) +
xlab("") +
ggtitle("Actual vs. Predicted EUR/USD Volatility",
subtitle="Out-of-Sample Validation Set, Jan-Jul 2020")
```
In the plot above, the red line shows the actual volatility, $RV_{t+1}$, and the other lines are the different models' predictions, $\widehat{RV_{t+1}}$. The plot shows that the first few months of 2020 seem 'stable' and from March onward there is elevated EUR/USD volatility with large differences between actual and predicted values. This is due to market turmoil during the start of the Covid-19 pandemic.
To further investigate how the predictions of the considered models differ, I am comparing the predictions during 'normal' times (January 2020) to periods of market turmoil (March 2020).
\pagebreak
## Predictions During Pre-Crisis Period (January 2020)
```{r actual-vs-predictions-normal}
# filter for "January"
tbl_jan <- tbl %>%
filter(month(date)==1)
# 1. actual vs. linear model
p1 <- tbl_jan %>%
filter(model %in% c("Actual", "Linear (Weekday Effect)")) %>%