---
title: "Module 3 Assignment 2"
author: "Ellen Bledsoe, Lily McMullen"
date: '2022-10-27'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Assignment Details
### Purpose
The goal of this assignment is to assess your ability to compare means numerically, visually, and statistically.
### Task
Write R code which produces the correct answers and correctly interpret the results of visualizations and statistical tests.
### Criteria for Success
- Code is within the provided code chunks
- Code is commented with brief descriptions of what the code does
- Code chunks run without errors
- Code produces the correct result
- Code that produces the correct answer will receive full credit
- Code attempts with logical direction will receive partial credit
- Written answers address the questions in sufficient detail
### Due Date
November 8 at midnight MST
# Assignment Questions
In this assignment, we're going to explore another data set on wind turbines that generate a significant portion of the energy for us down here in Antarctica.
### Set-Up
Let's load the `tidyverse` and read in the data set. Call the data `turbines`.
```{r}
library(tidyverse)
turbines <- read_csv("wind_turbines.csv")
```
1. Explore the data set, either through the environment or through code. Answer the following questions (2 points):
a. How many turbine makers are there?
There are two turbine makers.
a. What does each row of data represent?
Each row of data represents a different turbine.
```{r}
head(turbines)
# optional; only if you want space for coding
```
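As a minimal sketch of how both questions could be answered with code rather than by eye, the block below counts distinct makers and rows on a small made-up data frame. The `toy` data frame and its values are invented for illustration; only the column names mirror the ones used later in this assignment. It uses base R so it runs without any packages.

```r
# Hypothetical stand-in for the real data; the values are made up.
toy <- data.frame(
  manufacturer = c("Maker A", "Maker A", "Maker B", "Maker B", "Maker B"),
  wind_speed   = c(30.1, 28.4, 31.0, 29.7, 30.5)
)

# Number of distinct manufacturers
n_makers <- length(unique(toy$manufacturer))

# Number of rows recorded for each manufacturer
rows_per_maker <- table(toy$manufacturer)

# Total number of rows in the data set
n_rows <- nrow(toy)

print(n_makers)
print(rows_per_maker)
print(n_rows)
```

Applied to `turbines`, the same three calls would report the number of makers and how many rows belong to each one.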
### Numeric
2. Generate a summary of the data set that calculates the mean wind speed and mean power output for each wind turbine company. (2 points)
```{r}
# Mean wind speed and mean power output for each manufacturer
turbine_summary <- turbines %>%
  group_by(manufacturer) %>%
  summarize(wind_speed_mean = mean(wind_speed),
            power_output_mean = mean(power_output))
head(turbine_summary)
```
### Visual
3. Create a density plot for the power output variable. (3 points)
- be sure to have a density plot for each turbine producer; the color and the fill should be determined by the maker of the turbine
- add in vertical lines for the mean values in the same color as the turbine makers
- make sure the x-axis, y-axis, and legend labels are capitalized and easier to understand (power output is measured in kilowatt-hours, or kWh)
- use the `theme_classic()` function
```{r}
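# Density of power output, colored and filled by manufacturer
ggplot(turbines, aes(power_output, color = manufacturer, fill = manufacturer)) +
  geom_density(alpha = 0.5) +
  # Dashed vertical line at each manufacturer's mean power output
  geom_vline(data = turbine_summary,
             aes(xintercept = power_output_mean, color = manufacturer),
             linetype = "dashed") +
  labs(x = "Power Output (kWh)",
       y = "Density",
       color = "Manufacturer",
       fill = "Manufacturer") +
  # Apply the complete theme first so it does not reset the legend position
  theme_classic() +
  theme(legend.position = "top")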
```
4. Generate a box-and-whisker plot using `ggplot2` that compares the wind speed between different turbine makers (3 points).
The plot should:
- have capitalized and more descriptive axis labels (hint: wind speed is measured in kilometers per hour---km/hr)
- show raw data points in addition to the boxes. The points should be jittered.
- use the `theme_classic()` function
```{r}
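# Box plot of wind speed by manufacturer with jittered raw data points
ggplot(turbines, aes(manufacturer, wind_speed)) +
  # Hide boxplot outliers so they are not drawn twice alongside the jittered points
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.5, width = 0.1) +
  labs(x = "Manufacturer",
       y = "Wind Speed (km/hr)") +
  theme_classic()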
```
### Statistic
5. Write a null hypothesis and an alternative hypothesis for the question we are asking and that we will be using statistics to answer. (2 points)
**Null Hypothesis** (H~0~): The means for power output and wind speed of the turbines from both Turbo Turbines and Windmill Inc will be equal.\
**Alternative Hypothesis** (H~A~): The means for power output and wind speed of the turbines from Turbo Turbines will differ from the means from Windmill Inc.
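Written with symbols, and assuming the two-sided test that `t.test()` runs by default, the hypotheses for a single variable can be sketched as follows. The subscripts $T$ and $W$ are shorthand introduced here for Turbo Turbines and Windmill Inc; $\mu$ is a population mean, $\bar{x}$ a sample mean, $s^2$ a sample variance, and $n$ a sample size.

$$
H_0: \mu_T = \mu_W \qquad H_A: \mu_T \neq \mu_W
$$

The default two-sample test in R is Welch's t-test, whose test statistic is:

$$
t = \frac{\bar{x}_T - \bar{x}_W}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_W^2}{n_W}}}
$$

The same pair of hypotheses is tested separately for power output and for wind speed.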
6. Based on the mean values in the `turbine_summary` data frame and the plots you've created above, predict the outcome of each t-test (graded for completion, not accuracy). Explain your reasoning (1-2 sentences for each t-test is fine). (2 points)
*Answer:*
*I predict that the p-value for power output by turbine maker will be below 0.05, meaning there is a statistically significant difference between makers in mean power output. This is because the dashed mean lines are clearly separated in the density plot.*
*I predict that the p-value for wind speed by turbine maker will be above 0.05, meaning there is no statistically significant difference between makers in mean wind speed. This is because the wind speeds for the two makers look very similar in the box plot.*
7. Perform a t-test on the power output by turbine maker. (1 point)
```{r}
t.test(data = turbines, power_output ~ manufacturer)
```
8. In 2-3 sentences, interpret the output from question 7. Focus on what the p-value is in reference to the cutoff of 0.05, what that means, and whether that means we accept or reject the null hypothesis. (2 points)
*Answer: The p-value for this test (power output by manufacturer) is less than 0.05. This means there is a statistically significant difference in mean power output between the manufacturers, so we reject the null hypothesis.*
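The comparison against the 0.05 cutoff can also be made in code. The sketch below is illustrative only: it simulates two groups with clearly different means (the `toy` data frame, its column names, and its values are invented, not the turbine data), runs the same kind of test, and pulls the p-value out of the result object.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated data, not the real turbine measurements
toy <- data.frame(
  maker  = rep(c("Maker A", "Maker B"), each = 30),
  output = c(rnorm(30, mean = 100, sd = 5), rnorm(30, mean = 120, sd = 5))
)

# Welch two-sample t-test (the default for t.test with a formula)
result <- t.test(output ~ maker, data = toy)

# Extract the p-value and compare it with the significance cutoff
p_value <- result$p.value
alpha <- 0.05
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```

When `reject_null` is `TRUE`, the null hypothesis of equal means is rejected; when it is `FALSE`, we fail to reject it.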
9. Perform another t-test, this time on the wind_speed variable by manufacturer. (1 point)
```{r}
t.test(data = turbines, wind_speed ~ manufacturer)
```
10. In 2-3 sentences, interpret the output from question 9 (focus on the same ideas as question 8). (2 points)
*Answer: The p-value for wind speed by manufacturer is above 0.05. This means there is no statistically significant difference in mean wind speed between the manufacturers, so we fail to reject the null hypothesis.*