---
title: "Motor Trend MPG Regression Model Analysis"
author: "Trent Parkinson"
date: "January 15, 2018"
output: pdf_document
header-includes:
- \usepackage{caption}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
\captionsetup[table]{labelformat=empty}
## Overview
This project uses the `mtcars` data set to examine how miles per gallon (MPG) is affected by different variables, specifically the effect that automatic and manual transmissions have on MPG. It addresses two questions:

- Is an automatic or manual transmission better for MPG?
- How large is the MPG difference between automatic and manual transmissions?
## Setting up environment
Loading the libraries needed for data handling, plotting, and model selection, then reading the `mtcars` data set and keeping an unmodified copy of it (`mtcars_num`) made with `copy()` from `data.table`.
```{r, message=FALSE}
library(data.table)
library(ggplot2)
library(leaps)
library(printr)
data("mtcars")
# Keep an unmodified, all-numeric copy before columns are converted to factors
mtcars_num <- copy(mtcars)
```
## Data Structure
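Viewing the first rows of `mtcars` and the class of each variable.
```{r}
head(mtcars)
# sapply() checks each column's class; apply() would first coerce the data frame to a matrix
as.data.frame(t(sapply(mtcars, class)))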
```
## Data Processing
Changing categorical variables to factors. In `mtcars`, `am` is coded `0` for automatic and `1` for manual, so it is relabeled to `Automatic` and `Manual`.
```{r}
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
```
## Visualizations
Plotting the miles per gallon (MPG) for automatic and manual transmissions.
```{r, fig.height= 4}
plot1 <- ggplot(mtcars, aes(x = am, y = mpg)) +
  geom_boxplot(aes(fill = am)) +
  xlab("Transmission") +
  ylab("MPG") +
  theme(legend.position = "none")
plot1
```
## Analysis
The boxplot suggests that MPG differs noticeably between the two transmission types. A t-test checks whether the difference in group means is statistically significant.
```{r}
auto_vs_manu_ttest <- t.test(mpg ~ am, mtcars)
auto_vs_manu_ttest
```
The t-test rejects the null hypothesis that the difference in means is zero, with a p-value of $0.0014$. Mean MPG therefore differs between transmission types, with manual transmissions having the higher mean.
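The quantities behind that test can be pulled out directly. The chunk below is a minimal sketch, not part of the original analysis; `mt`, `group_means`, and `welch` are illustrative names, and it works on a fresh copy of the data so it does not depend on earlier chunks.

```{r}
# Illustrative sketch: names here are not used elsewhere in the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Mean MPG within each transmission group
group_means <- tapply(mt$mpg, mt$am, mean)
group_means

# t.test() defaults to the Welch test, which does not assume equal variances
welch <- t.test(mpg ~ am, data = mt)

# 95% confidence interval for mean(Automatic) - mean(Manual)
welch$conf.int

# Difference in means, manual minus automatic
unname(group_means["Manual"] - group_means["Automatic"])
```

Because the formula interface orders the groups by factor level, the interval is for automatic minus manual, so a fully negative interval corresponds to higher MPG for manual cars.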
## Linear Regression Fitting
Since the project is trying to quantify the difference in MPG between automatic and manual transmissions, the best starting place is a simple linear model with MPG as the dependent variable and transmission type as the only regressor.
```{r}
basic_fit <- lm(mpg ~ am, mtcars)
summary(basic_fit)$coefficients
summary(basic_fit)$r.squared
```
The basic linear model with `am` as the only regressor explains only $36\%$ of the variation, so it is not a very good model. Improving on it gets tricky once more than one variable is involved, because regressors can be correlated not only with the response but also with each other. Adding a variable that is highly correlated with those already in the model could help the prediction, but it could also hurt it.
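To make the correlation concern concrete, the chunk below is a small sketch (not part of the original analysis) that looks at how a few candidate regressors correlate with both `mpg` and `am`. It uses the numeric version of the data; `mt_num` and `cors` are illustrative names, and the choice of columns is only an example.

```{r}
# Illustrative sketch on the original, all-numeric data
mt_num <- datasets::mtcars

# Correlation of selected regressors with the response (mpg) and with am
cors <- cor(mt_num)[c("mpg", "am"), c("wt", "qsec", "hp", "cyl")]
round(cors, 2)
```

A regressor that is strongly correlated with both `mpg` and `am` can change the estimated transmission effect considerably once it enters the model.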
Two methods are used here to choose regressors. Stepwise regression uses AIC to add or drop terms until it settles on a model. Best subsets regression goes through all possible models with the specified regressors and picks the best one according to different criteria.
```{r}
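# Full model with every available regressor
everything_fit <- lm(mpg ~ ., mtcars)
# Stepwise selection by AIC, adding or dropping one term at a time
step_fit <- step(everything_fit, direction = "both", trace = FALSE)
# Best subsets: the best model of each size, up to nvmax terms
best_subset <- regsubsets(mpg ~ ., mtcars, nvmax = 25)
best_subset_summary <- summary(best_subset)
# Model size preferred by each criterion
adjr2 <- which.max(best_subset_summary$adjr2)
cp <- which.min(best_subset_summary$cp)
bic <- which.min(best_subset_summary$bic)
# Terms included in the models chosen by adjusted R^2 and Cp
best_set <- best_subset_summary$outmat[c(adjr2, cp), ]
best_set[, 1:13]
# Refit the two best-subsets candidates with lm()
sub3_fit <- lm(mpg ~ am + wt + qsec, mtcars)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, mtcars)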
```
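The selection criteria used above can also be computed by hand for a few nested models, which shows how they trade fit against complexity. This is a sketch under assumptions: the three candidate formulas and the names `mt`, `candidates`, and `criteria` are chosen for illustration and are not the models the search procedures necessarily visit.

```{r}
# Illustrative sketch: a fresh copy of the data with am as a factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# A few hand-picked nested candidates (hypothetical choice for illustration)
candidates <- list(
  am_only    = lm(mpg ~ am, data = mt),
  am_wt      = lm(mpg ~ am + wt, data = mt),
  am_wt_qsec = lm(mpg ~ am + wt + qsec, data = mt)
)

# Lower AIC/BIC is better; higher adjusted R^2 is better
criteria <- data.frame(
  aic    = sapply(candidates, AIC),
  bic    = sapply(candidates, BIC),
  adj_r2 = sapply(candidates, function(fit) summary(fit)$adj.r.squared)
)
criteria
```

BIC penalizes each extra coefficient more heavily than AIC at this sample size, which is one reason the criteria can disagree about the best model.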
## Model Selection
Stepwise regression gave us one candidate model, and best subsets gave us two more. Mallows's $C_p$ and BIC both selected model three (`sub3_fit`) as the best, while model five (`sub5_fit`) has the best adjusted $R^2$. The code below collects the adjusted $R^2$ of each model along with the p-value of the transmission coefficient. Since the goal of the project is to quantify the effect of transmission type on MPG, the best model should give a reliable estimate of this coefficient as well as explain the variance well.
```{r}
models <- c("mpg ~ am + wt + qsec", "mpg ~ am + wt + cyl + hp", "mpg ~ am + wt + cyl + hp + vs")
# Adjusted R^2 for each candidate model
adj_r_squared <- round(c(summary(sub3_fit)$adj.r.squared,
                         summary(step_fit)$adj.r.squared,
                         summary(sub5_fit)$adj.r.squared), 4)
# p-value of the amManual coefficient (column 4 of the coefficient table)
amManual_Pvalues <- round(c(summary(sub3_fit)$coefficients["amManual", 4],
                            summary(step_fit)$coefficients["amManual", 4],
                            summary(sub5_fit)$coefficients["amManual", 4]), 4)
# data.frame() keeps the numeric columns numeric; cbind() would coerce them to character
results <- data.frame(models, adj_r_squared, amManual_Pvalues)
results
```
## Checking Model
The only model with a p-value for transmission type below $5\%$ is `mpg ~ am + wt + qsec`. It doesn't have the highest adjusted $R^2$, but it's not much lower than the other two models.
```{r}
summary(sub3_fit)
```
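A confidence interval gives a fuller picture of the transmission coefficient than the p-value alone. The chunk below is a minimal sketch that refits the same formula on a fresh copy of the data; `mt`, `fit`, and `am_ci` are illustrative names.

```{r}
# Illustrative sketch: refit the chosen formula on a fresh copy of the data
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + wt + qsec, data = mt)

# 95% confidence interval for the manual-vs-automatic coefficient
am_ci <- confint(fit, "amManual", level = 0.95)
am_ci
```

An interval whose lower end sits close to zero indicates that the direction of the effect is supported but its size is estimated imprecisely.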
Everything so far looks solid, but let's make sure this model fits our data well by printing the diagnostic plots.
```{r}
par(mfrow = c(2,2))
plot(sub3_fit, col = "blue", lwd = 2)
```
- Residuals vs Fitted: The points are scattered fairly randomly, but there may be a slight non-linear pattern.
- Normal Q-Q: The residuals look approximately normal. They deviate slightly from the diagonal but follow it fairly closely.
- Scale-Location: The upward-sloping line is worrisome, since it means the residuals spread slightly wider as the fitted values increase.
- Residuals vs Leverage: No high leverage points.
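The visual checks above can be complemented with a few numeric ones. The chunk below is a sketch only, refitting the same formula on a fresh copy of the data; `mt`, `fit`, and `lev` are illustrative names, and the cutoff of twice the mean hat value is a common rule of thumb, not a threshold taken from this analysis.

```{r}
# Illustrative sketch: refit the chosen formula on a fresh copy of the data
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + wt + qsec, data = mt)

# Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(fit))

# Leverage: list cars whose hat value exceeds twice the average (rule of thumb)
lev <- hatvalues(fit)
lev[lev > 2 * mean(lev)]

# Influence: the three largest Cook's distances
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```

Any cars flagged this way are worth a closer look rather than automatic removal, especially with only 32 observations.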
## Conclusions
The best transmission type for MPG appears to be the manual transmission. This is supported by the t-test as well as by our final linear model. Holding weight and quarter-mile time constant, a manual transmission is associated with an increase of about 2.94 MPG over an automatic, as can be seen in the best model's `amManual` coefficient.
The model fits reasonably well, with $p < 0.05$ and $R^2 = 0.85$, but the diagnostic plots did warn us that something may be missing from our model. I believe these trends are most likely due to the small sample size, with little overlap in the values of `wt` and `qsec`.