---
title: ' Project `r params$proj_number`: `r params$proj_title`'
subtitle: 'DS 5494 - Statistical Data Mining II'
author:
- William Ofosu Agyapong^[[email protected]]
- University of Texas at El Paso (UTEP)
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
bookdown::pdf_document2:
fig_caption: true
latex_engine: xelatex
number_sections: true
toc: true
toc_depth: 4
header-includes:
- \usepackage{amsmath}
- \usepackage{amssymb}
- \usepackage{amsfonts}
- \usepackage{amsthm}
- \usepackage{fancyhdr}
- \pagestyle{fancy}
- \fancyhf{}
- \rhead{William O. Agyapong}
- \lhead{Project `r params$proj_number` -- `r params$proj_title`}
- \cfoot{\thepage}
- \usepackage{algorithm}
- \usepackage[noend]{algpseudocode}
geometry: margin = 0.8in
fontsize: 10pt
params:
proj_number: VI
proj_title: GAM, MARS, and PPR
---
```{r setup, include=FALSE}
# Set global options for output rendering
knitr::opts_chunk$set(eval = T, echo = T, warning = F, message = F, fig.pos = "H", out.extra = "", fig.align = "center")
#----------------- Load required packages
library(dplyr)
library(ggthemes)
library(ggplot2)
library(tidyr)
library(knitr)
library(patchwork) # interface for laying out plots from ggplot2
options(kableExtra.auto_format = F) # disable kableExtra automatic global options
library(kableExtra)
#----------------- set the current working directory to this path
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
#----------------- Set default rounding to 4 decimal places
options(digits = 4)
#----------------- Set default ggplot theme
# theme_set(theme_fivethirtyeight())
theme_set(theme_bw())
```
<!-- QUESTION ONE: WHAT -->
<!-- \noindent\rule{17.5cm}{0.8pt} -->
\newpage
# Introduction
In this project, we consider several models, including generalized additive models (GAM), multivariate adaptive regression splines (MARS), and projection pursuit regression (PPR). We aim, among other things, to **obtain the model with the best predictive power and to identify the important drivers of employee turnover or retention.**
We use a human resources data set on employee retention from a Kaggle data analytics competition. The data set contains 14,999 observations on 10 variables, as shown in Table 1.
## Data Description
```{r}
data.frame(
rbind(
c("satisfaction_level", "Satisfaction Level"),
c("last_evaluation","Last evaluation"),
c("number_project", "Number of projects"),
c("average_montly_hours", "Average monthly hours"),
c("time_spend_company ", "Time spent at the company"),
c("Work_accident ", "Whether they have had a work accident"),
c("left ", "Whether the employee has left"),
c("promotion_last_5years", "Whether had a promotion in the last 5 years"),
c("sales ", "Departments (column sales)"),
c("salary ", "Salary")
)
) %>%
  kable(caption = "Variables and their meanings", booktabs = T,
col.names = c("Variable name", "Description")) %>%
kable_styling(latex_options = c("HOLD_position")) %>%
kable_classic()
```
- **For the binary variables `left`, `Work_accident`, and `promotion_last_5years`, 0 and 1 should be interpreted as "No" and "Yes", respectively**.
## Data Preparation
```{r}
# bring in the data
hr <- read.csv("HR_comma_sep.csv")
# dim(hr)
# head(hr)
# names(hr)
# 1. change the categorical variable salary to ordinal
# 2. change name for variable sales to department.
# 3. make left variable a factor variable
hr_new <- hr %>%
mutate(salary = factor(salary, levels = c("low","medium","high"),ordered = T),
left = as.factor(left)) %>%
rename(department = sales)
# str(hr_new)
```
```{r}
# get data types
output <- NULL
for(i in seq_along(hr_new)) {
output <- rbind(output, c(names(hr_new)[i],
paste(class(hr_new[[i]]), collapse = " "),
length(unique(hr_new[[i]]))
)
)
}
output <- as.data.frame(output)
# checking for missing values
output %>% left_join(
naniar::miss_var_summary(hr_new), by=c("V1"="variable")) %>%
kable(booktabs = T, linesep="", align = "lcc",
col.names = c("Variable name", "Type", "levels", "Number missing", "Percent missing"),
cap = "Data types and amount of missing values in the HR data") %>%
kable_styling(latex_options = c("HOLD_position"))
```
**We do not have any missing values in the data.**
From the output above, it can be seen that among the predictors, 2 are continuous and 5 are integer counts, of which 3 (number_project, average_montly_hours, and time_spend_company) can reasonably be treated as continuous; the remaining 2, together with department and salary, will be treated as categorical variables.
# Exploratory Data Analysis (EDA)
In this section, we explore the underlying data set with the hope of discovering any interesting patterns or insights to aid our understanding of the data.
```{r}
# prepare data for EDA
hr_eda <- hr_new %>%
mutate(left = as.factor(ifelse(left == 0, "No", "Yes")),
Work_accident = as.factor(ifelse(Work_accident == 0, "No", "Yes")),
promotion_last_5years =
as.factor(ifelse(promotion_last_5years==0, "NO", "Yes")))
```
## Part (a): How does satisfaction level relate to number of projects?
```{r fig.cap="Relationship between satisfaction level and the number of projects."}
ggplot(hr_eda, aes(number_project, satisfaction_level, color=left)) +
geom_point(alpha=0.7) +
scale_color_brewer(palette = "Set1") +
labs(x="Number project", y="Satisfaction level",
title = "How does satisfaction level relate to number of projects?")
```
In general, we do not observe any clear increasing or decreasing pattern between satisfaction level and the number of projects; that is, there does not appear to be a linear relationship between the two variables. However, satisfaction level does decrease with increasing number of projects beyond the $5^{th}$ project, but only for employees who left, indicating some kind of effect between the two variables.
As expected, employees who did not leave during the review period recorded the highest satisfaction levels. However, employees who worked on the largest number of projects (7) tended to have low satisfaction levels, and all of those employees left the company.
## Part (b): Association between variables
Since the data contain different types of variables, both continuous and categorical, the Goodman and Kruskal tau measure was used to investigate the associations among the variables instead of the Pearson correlation. Another benefit of this measure is that it is sensitive to directional associations. Applying the `GKtauDataframe()` function to the HR data set yielded the association plot shown below, which brings to light some interesting details.
```{r GKtau-assocplot, fig.cap="Association between variables using the Goodman and Kruskal tau measure."}
# install.packages("GoodmanKruskal")
library(GoodmanKruskal)
# data1<- diabetes %>% select(class)
dat <- GKtauDataframe(hr_new)
plot(dat, colorPlot=T)
```
We observe from the $6^{th}$ column of Figure \@ref(fig:GKtau-assocplot) that ***satisfaction level***, ***last evaluation***, ***number of projects***, ***average monthly hours***, and ***time spent at the company*** are moderately associated with employee turnover (left). The reverse is not true, which reflects the asymmetric nature of the Goodman and Kruskal tau measure. This suggests that these variables can explain variation in employee turnover; in particular, satisfaction level and number of projects, with the highest values of **0.53** and **0.36** respectively, are likely to be the most influential in the modelling phase of the project.
Generally, there is no high association among the predictors, suggesting that multicollinearity will not be an issue for us. It is interesting to note that only satisfaction level and number of projects show some association, with a tau value of **0.18** measuring the influence of satisfaction level on the number of projects employees work on; the association in the reverse direction is weaker, as the pairwise check sketched below illustrates.
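As a hedged illustration of this asymmetry, the pairwise `GKtau()` function from the same package reports the tau value in both directions for a single pair of variables; the variable pair used here is only an example.
```{r}
# Minimal sketch (assumes the GoodmanKruskal package loaded above): GKtau()
# returns the forward and reverse tau values for one pair of variables,
# making the directional (asymmetric) nature of the measure explicit.
GKtau(hr_new$satisfaction_level, hr_new$number_project)
```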
## Other findings
### Distribution of Target variable
```{r}
library(scales)
hr_eda %>%
group_by(left) %>%
summarise(n=n()) %>%
mutate(pct = n/sum(n),
lbl = percent(pct)) %>%
ggplot(aes(left, pct,fill=left)) +
geom_bar(stat = "identity",position = "dodge",alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
geom_text(aes(label = lbl), size=3, color = "white",
position = position_stack(vjust = 0.5)) +
labs(y="Percent", title = "") +
theme(legend.position = "none")
```
It can be observed that about 76% of employees stayed while 24% of employees left the company.
### Distribution of continuous predictors by employee turnover
```{r boxplots, fig.cap="Distribution of continuous predictors by employee turnover (left)"}
hr_eda %>%
dplyr::select(-c(salary, department, Work_accident,promotion_last_5years)) %>%
pivot_longer(-left, names_to = "variable", values_to = "value") %>%
ggplot(aes(left, value, fill = left)) +
geom_boxplot(alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
facet_wrap(vars(variable), scales = "free") +
labs(x="", y="") +theme_bw() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
```
From Figure \@ref(fig:boxplots), the following observations can be made:
- Except for time spent at the company, there is high variability in the measures for employees who left the company during the period under consideration compared to those who remained as indicated by the sizes of the box plots.
- On average, employees who left recorded higher values for average monthly working hours, last evaluation, and time spent at the company and low satisfaction levels.
- The longest-serving employees have spent up to 10 years at the company, suggesting that the company is relatively young. Employees who left typically spent about 4 years at the company, while active employees have typically stayed about 3 years.
### Distribution of categorical predictors by employee turnover
```{r dept-bar, fig.cap="Distribution of department by employee turnover (left)"}
plt1 <- ggplot(data = hr_eda, aes(x = department, fill=left)) +
geom_bar(position = "dodge", alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
labs(x = "",y = "Number of employees",
title = "Department versus employee turnover") +
theme(axis.text.x = element_text(angle = 45),
legend.position = "right", plot.title = element_text(hjust = .5))
plt2 <- ggplot(data = hr_eda, aes(x = salary, fill=left)) +
geom_bar(position = "dodge", alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
labs(x = "", y = "Number of employees",
title = "Salary") + theme(legend.position = "none", plot.title = element_text(hjust = .5))
plt3 <- ggplot(data = hr_eda, aes(x = factor(Work_accident), fill=left)) +
geom_bar(position = "dodge", alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
labs(x = "", y = "",
title = "Work accident") +
theme(legend.position = "bottom",
plot.title = element_text(hjust = .5))
plt4 <- ggplot(data = hr_eda, aes(x = factor(promotion_last_5years), fill=left)) +
geom_bar(position = "dodge", alpha=0.7) +
scale_fill_brewer(palette = "Set1") +
labs(x = "", y = "",
title = "Promotion in last\n 5 years") +
theme(legend.position = "none", plot.title = element_text(hjust = .5))
plt1
```
- Sales, technical, and support were the top three departments for employee turnover, while the management department had the least. This is probably because those three departments also have the largest numbers of employees.
```{r cat-pred-bars, fig.cap="Distribution of other categorical predictors by employee turnover (left)"}
(plt2 + plt3 + plt4)
```
Figure \@ref(fig:cat-pred-bars) suggests that:
- Most employees receive low to medium salaries, and the majority of employees who left the company belong to this same salary group, while few employees with high salaries left. This suggests that salary level is likely to influence employee turnover.
- The majority of employees have never had a work-related accident.
- Only a small proportion of employees got promoted in the last 5 years during the review period.
# Data Partitioning: Randomly splitting data into training and test sets
For the purpose of model validation, we randomly partitioned the data $D$ into a training set $D_1$ and a test set $D_2$ with an approximate 2:1 ratio on the sample size. The output below shows that 9999 and 5000 observations were allocated to the training and test sets, respectively. A seed of ***9940*** was used to ensure reproducibility of results affected by the random splitting of the data.
```{r}
set.seed(9940)
ratio <- 2/3
train_ind <- sample(1:NROW(hr_new), size = NROW(hr_new)*ratio)
train_set <- hr_new[train_ind, ]
test_set <- hr_new[-train_ind, ]
dim(train_set); dim(test_set)
```
# Logistic Regression: A baseline classifier
A regularized logistic regression model with the LASSO penalty, tuned via 10-fold cross-validation, was fitted to the training set $D_1$.
```{r}
library(glmnet)
# select target and create the design matrix
y <- train_set$left
X <- model.matrix(~ . -left, data = train_set)
# fit the model
set.seed(125)
cv.lasso <- cv.glmnet(X, y, nfolds = 10, family="binomial", alpha=1, lambda.min=.0001,
thresh = 1e-07, nlambda=500, standardize=T, maxit=2000, type.measure = "deviance")
# best tuning parameter
best.lambda <- cv.lasso$lambda.1se
plot(cv.lasso)
```
The best tuning parameter $\lambda$ was obtained as `r best.lambda`, the largest value of $\lambda$ for which the cross-validated error is within 1 standard error of the minimum. From the graph above we observe that 13 terms are selected with this choice of $\lambda$.
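As a quick, hedged check on this count, the coefficients extracted at `lambda.1se` can be tallied directly; nonzero entries (excluding the intercept) correspond to the retained terms.
```{r}
# Hedged sketch: count nonzero coefficients at lambda.1se (intercept excluded).
coef.1se <- coef(cv.lasso, s = "lambda.1se")
sum(coef.1se[-1] != 0)
```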
```{r}
fit.lasso <- glmnet(x=X, y=y, family="binomial", alpha = 1, lambda=best.lambda, standardize = T, thresh = 1e-07, maxit=2000)
fit.lasso$beta
```
Applying a zero (0) cutoff on the absolute values of the coefficients as a selection criterion, all 9 predictors are retained in our final model. It is worth noting that although the coefficients associated with some levels of `department` and `salary` are zero (0), the nonzero ones justify their inclusion in the model. We therefore derive our final logistic regression model as follows:
```{r}
logistic <- glm(left ~ ., data = train_set, family = "binomial")
# summary(logistic)
logistic %>% broom::tidy(conf.int=T,conf.level=0.95) %>%
kable(booktabs=T, linesep="", caption = "Parameter estimates for the Logistic model")%>%
kable_styling(latex_options =c("HOLD_position"))
```
From the above table of results, there is enough evidence at the 5% significance level to conclude that all the predictors are statistically significant in the model, as evidenced by the small p-values. Even though most of the levels of the department variable did not show significance individually, we consider the variable as a whole to be significant; a whole-term test is sketched below.
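One hedged way to back up this judgment is a likelihood-ratio test that drops each term (for example, the entire `department` factor) from the model in turn.
```{r}
# Hedged sketch: likelihood-ratio tests for dropping each whole term from the
# logistic model; the department factor is tested as a single block here.
drop1(logistic, test = "LRT") %>%
  kable(booktabs = TRUE, linesep = "",
        caption = "Likelihood-ratio tests for dropping each term") %>%
  kable_styling(latex_options = c("HOLD_position"))
```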
The signs of the coefficients show that most of the predictors, including satisfaction level and number of projects, have a negative effect on turnover. From the odds table below, the estimated odds ratio for satisfaction level is $e^{-4.1856} = 0.0152$, meaning that for every unit increase in satisfaction level, the odds of an employee leaving are multiplied by 0.0152 (i.e., decrease sharply), holding all other factors constant. Similar interpretations can be made for the other variables.
```{r eval=T, echo=T}
exp(cbind(OR = coef(logistic), confint(logistic))) %>%
kable(booktabs=T, linesep="", caption =
"Odds ratio based on parameter estimates from the logistic model") %>%
kable_styling(latex_options =c("HOLD_position"))
```
## Applying the final model to the test data and presenting the ROC curve
```{r}
library(verification)
library(cvAUC)
phat.logit <- predict(logistic, newdata = test_set, type="response") # predicted probabilities
# a custom function for computing and plotting ROC curve
roc_curve <- function (phat, main = "", col="blue", roc_val=F) {
yobs <- as.integer(as.character(test_set$left))
AUC <- ci.cvAUC(predictions=phat, labels=yobs, folds=1:NROW(test_set),
confidence=0.95)
if(roc_val) return(AUC$cvAUC)
auc.ci <- round(AUC$ci, digits=3) # confidence interval for cross-validated
# Area Under the ROC Curve
mod <- verify(obs=yobs, pred=phat)
roc.plot(mod, plot.thres = NULL, main=main)
text(x=0.6, y=0.16, paste("Area under ROC =", round(AUC$cvAUC, digits=3),
"with 95% CI (", auc.ci[1], ",", auc.ci[2], ")",
sep=" "), col=col, cex=.9)
}
roc_curve(phat.logit, main="ROC curve for final logistic regression")
```
The area under the ROC curve is obtained as **0.819**, which indicates relatively good predictive performance. With 95% confidence, the AUC is estimated to lie between **0.806** and **0.833**.
# Random Forest (RF): Another baseline model
```{r}
library(randomForest)
fit.rf <- randomForest(left ~., data=train_set,importance=T, ntree=500)
print(fit.rf)
```
## Partial dependence plots
```{r partial-rf, fig.cap="Partial dependence plots for continuous predictors"}
# partial dependence plot
par(mfrow=c(2,3))
partialPlot(fit.rf, pred.data = train_set, x.var = satisfaction_level, rug = T,
main = "", xlab = "Satisfaction level")
partialPlot(fit.rf, pred.data = train_set, x.var = number_project, rug = T,
main = "", xlab = "Number of projects")
partialPlot(fit.rf, pred.data = train_set, x.var = last_evaluation, rug = T,
main = "", xlab = "Last evaluation")
partialPlot(fit.rf, pred.data = train_set, x.var = time_spend_company, rug = T,
main = "", xlab = "Time spent")
partialPlot(fit.rf, pred.data = train_set, x.var = average_montly_hours, rug = T,
main = "")
```
We observe strong non-linear patterns in all the plots in Figure \@ref(fig:partial-rf), signifying that a linear model such as logistic regression is not adequate for modeling the relationship between employee turnover (left) and the variables shown in the plots.
## Variable importance rankings
```{r}
varImpPlot(fit.rf, main = "Variable importance ranking from RF", cex=0.89)
```
According to the mean decrease accuracy, the top four variables are `satisfaction level`, `number of projects`, `last evaluation`, and `average monthly hours`. Satisfaction level and number of projects happen to be the two most important determinants of employee turnover or retention. Based on the mean decrease GINI, satisfaction level remains the most influential variable.
## ROC curve based on the predictions on the test set
```{r}
# get predicted probabilities
phat.rf <- predict(fit.rf, newdata=test_set, type="prob")[, 2]
# create ROC curve
roc_curve(phat.rf, main="ROC curve for the RF model")
```
The AUC value of **0.992** shows that the random forest model has very high predictive performance for the problem at hand.
# Generalized Additive Model (GAM)
A stepwise selection procedure with AIC was employed to select the best-fitting GAM. The `scope` argument of `step.Gam()` allowed the smoothing terms to be chosen adaptively in the backfitting algorithm by specifying, for each predictor, whether it may be dropped, enter linearly, or enter as a smooth function estimated non-parametrically via a smoothing spline or a loess smoother.
```{r}
library(gam)
# Create a GAM object for use in the stepwise selection. Note that the smoothing
# terms will be specified later in the selection stage following.
fit.gam <- gam( left ~ satisfaction_level + number_project + time_spend_company +
department + last_evaluation + average_montly_hours + Work_accident +
promotion_last_5years + salary , family = binomial,
data=train_set, trace=T, control = gam.control(epsilon=1e-04, bf.epsilon = 1e-04,
maxit=50, bf.maxit = 50))
#--- perform a stepwise selection
# register parallel backend for parallel execution
require(doMC)
registerDoMC(cores = (detectCores()-8))
fit.step.gam <- step.Gam(fit.gam, scope=list(
"satisfaction_level"=~1 + satisfaction_level + lo(satisfaction_level) +
s(satisfaction_level),
"last_evaluation"=~1+ last_evaluation + lo(last_evaluation)+ s(last_evaluation),
"number_project"=~1 + number_project + lo(number_project) + s(number_project),
"average_montly_hours"=~1 + average_montly_hours + lo(average_montly_hours) +
s(average_montly_hours),
"time_spend_company"=~1 + time_spend_company + lo(time_spend_company) +
s(time_spend_company)),
scale =2, steps=1000, parallel=T, direction="both", trace = F)
summary(fit.step.gam)
```
The results from the "Anova for Nonparametric Effects" table indicate that the loess smoother is appropriate for the nonparametric smoothing term in `number of projects`. On the other hand, smoothing splines were chosen for `satisfaction level`, `last evaluation`, `average monthly hours`, and `time spent at the company` as the best smoothing functions, each with 3 degrees of freedom.
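For reference, the formula of the model selected by `step.Gam()` can be printed directly; this is just a small verification sketch and should match the description above.
```{r}
# Print the formula of the stepwise-selected GAM for reference.
formula(fit.step.gam)
```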
## ROC curve based on predictions on the test set
```{r}
# get predicted probabilities
phat.gam <- predict(fit.step.gam, newdata=test_set, type="response", se.fit=F)
roc_curve(phat.gam, main="ROC curve for the best GAM model")
```
From the plot, with an AUC of **0.957**, the final GAM performed well in predicting the probability of an employee leaving or remaining in the company on the test data.
## Plots of the functional forms for continuous predictors
```{r}
par(mfrow=c(2,3))
plot(fit.step.gam, se=T)
```
The functional forms for the continuous predictors show non-linear relationships, indicating the inadequacy of linear models for our classification problem.
# Multivariate Adaptive Regression Splines (MARS)
We allowed interactions of degree up to 3 by setting `degree = 3`.
```{r}
library(earth)
fit.mars <- earth(left ~ ., data=train_set, degree=3,
glm=list(family=binomial(link = "logit")))
fit.mars
summary(fit.mars) %>% .$coefficients %>% head(10)
```
## Variable importance ranking
```{r mars, fig.cap="Variable importance based on impact to GCV as predictors are added to the model"}
library(vip)
# generating variable importance plot
vip(fit.mars, num_features = 10, aesthetics = list(fill="dodgerblue",
color="dodgerblue")) +
ggtitle("Variable importance (GCV) ranking from MARS") +
scale_fill_brewer(palette = "Set1") +
theme(plot.title = element_text(hjust = .5))
```
We see here that the top four important variables are `satisfaction level`, `number of projects`, `time spent at the company`, and `last evaluation`, signifying that these variables played a major role in predicting employee turnover or retention. Once again, satisfaction level and number of projects are the two most important variables.
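As a hedged cross-check on the `vip` ranking, `earth`'s own `evimp()` summary reports variable importance under the number-of-subsets, GCV, and RSS criteria.
```{r}
# Hedged cross-check: earth's built-in variable importance summary.
evimp(fit.mars)
```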
## Partial dependence plots
```{r}
library(pdp)
# partial dependence plot
(partial(fit.mars, pred.var = "satisfaction_level", grid.resolution = 10)%>%autoplot() +
partial(fit.mars, pred.var = "number_project", grid.resolution = 10)%>%autoplot()) /
(partial(fit.mars, pred.var = "last_evaluation", grid.resolution = 10)%>%autoplot() +
partial(fit.mars, pred.var = "time_spend_company", grid.resolution = 10)%>%autoplot())
```
These plots help us assess the marginal effect of each variable on predicting employee turnover. For example, higher satisfaction levels and time spent have a negative effect on the likelihood of employee turnover.
```{r}
# get predicted probabilities
phat.mars <- predict(fit.mars, newdata=test_set, type="response")
# create ROC curve
roc_curve(phat.mars, main="ROC curve from the best MARS model")
```
With an AUC of **0.978**, it is clear that the MARS model performed well in predicting the likelihood of employee turnover on the test data.
# Projection Pursuit Regression
```{r}
train_set2 <- train_set %>%
mutate(left = as.integer(as.character(left)))
fit.ppr <- ppr(left ~ ., sm.method = "supsmu",
data = train_set2, nterms = 2, max.terms = 12, bass=3)
# summary(fit.ppr)
fit2.ppr <- update(fit.ppr, bass=5, nterms=4)
summary(fit2.ppr)
```
## ROC curve based on predictions on the test set
```{r}
# get predicted probabilities
phat.ppr <- predict(fit2.ppr, newdata=test_set)
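# PPR is fit by least squares on the 0/1 response, so raw predictions may fall
# outside [0, 1]; min-max rescale them to [0, 1] so they can serve as
# probability-like scores for the ROC curve below.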
phat.ppr <- scale(phat.ppr,center = min(phat.ppr),
scale = max(phat.ppr)-min(phat.ppr))
# create ROC curve
roc_curve(phat.ppr, main="ROC curve from the PPR model")
```
The AUC obtained for the PPR model is **0.972**, again indicating very good predictive performance.
# Summary of results
```{r}
roc_values <- c(roc_curve(phat.logit, roc_val = T),
roc_curve(phat.rf, roc_val = T),
roc_curve(phat.gam, roc_val = T),
roc_curve(phat.mars, roc_val = T),
roc_curve(phat.ppr, roc_val = T)
)
data.frame("Model"= c("Logistic (LASSO)","Random Forest","GAM","MARS","PPR"), "AUC"= roc_values) %>%
kable(booktabs=T, align = "lc")%>%
kable_styling(latex_options =c("HOLD_position"))
```
Among the five supervised learning approaches considered, the random forest (RF) model yielded the best predictive performance, providing the largest AUC for determining the likelihood of employee turnover or retention in the company. Among GAM, MARS, and PPR, MARS is the best-performing model, followed closely by PPR. The regularized logistic regression model performed poorly relative to the other methods in terms of predictive performance; however, its results are highly interpretable compared to the others. For instance, using the odds ratios obtained from the logistic coefficients, we can explain exactly how a particular predictor influences employee retention.
Overall, satisfaction level and number of projects emerged as the top two predictors of an employee's turnover or retention. Therefore, it is recommended that employers take these factors seriously in their bid to reduce turnover or improve retention.