Commit 0c8055e

added HW3
1 parent 6bcbac7 commit 0c8055e

1 file changed

+297
-0
lines changed

homework/HW3/HW3.Rmd

+297
@@ -0,0 +1,297 @@
1+
---
2+
title: "Homework 3: Is Donald Trump going to win the Republican nomination?"
3+
output: html_document
4+
---
5+
6+
**This homework is due Tuesday March 8, 2016 at 8PM EST. When complete, submit your code in an R Markdown file and the knitted HTML via GitHub.**
7+
8+
# Motivation
9+
10+
In 2012, Nate Silver and other data scientists [predicted the outcome of each state correctly](http://mashable.com/2012/11/07/nate-silver-wins/#2WkAUaXCVaqw).
11+
They did this by aggregating data from many polls to create more precise
12+
estimates than any single poll can provide.
13+
14+
In this homework, we will try to predict the results of the Democratic
15+
and Republican primaries by studying the performance of polls in
16+
elections that already occurred and then aggregating results.
17+
18+
19+
# Problem 1
20+
21+
The first step in our analysis will be to wrangle the data in a way
22+
that will simplify the analysis. Ultimately, we want a table of results
23+
with each poll represented by a row and including results for each
24+
candidate as well as information about the poll such as name and date.
25+
26+
## Problem 1A
27+
28+
Install and load the `pollstR` package. This package provides functions
29+
to access data in the Huffington Post's database. Read the help file
30+
for the `pollstr_polls()` function and write a function that reads
31+
**all** the polls related to the Republican primaries. Name the object
32+
`race2016`. Hint: Visit
33+
[this webpage](http://elections.huffingtonpost.com/pollster/api)
34+
to select the right `topic` and make sure to change the `max_pages` argument.
35+
36+
37+
```{r, echo=FALSE, cache=TRUE, warning=FALSE, message=FALSE}
38+
##Your code here
39+
40+
```
41+
42+
## Problem 1B
43+
44+
Examine and familiarize yourself with the `race2016` object. Note
45+
that the `questions` component has a table with election results.
46+
Look at the `topic` component of the `questions` component. Create a new
47+
table with only the results from the `2016-president-gop-primary` topic
48+
and only state (or territory) polls, excluding national polls. Hint: create
49+
a new object called `results` with the table of results and
50+
use `dplyr`. How many rows are we left with?
51+
52+
```{r}
53+
##Your code here
54+
55+
```
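As a rough illustration of the filtering step, here is a minimal sketch on a toy table. The column names `topic` and `state`, and the use of `"US"` to mark national polls, are assumptions about the `questions` component; check them against the real object before relying on them.

```r
library(dplyr)

# Toy stand-in for race2016$questions. The column names and the "US" code
# for national polls are assumptions, not confirmed by the assignment.
questions <- data.frame(
  id     = c(1, 1, 2, 3, 4),
  topic  = c("2016-president-gop-primary", "2016-president-gop-primary",
             "2016-president-dem-primary", "2016-president-gop-primary",
             "2016-president-gop-primary"),
  state  = c("IA", "IA", "IA", "US", "NH"),
  choice = c("Trump", "Rubio", "Clinton", "Trump", "Trump"),
  value  = c(28, 17, 50, 35, 31),
  stringsAsFactors = FALSE
)

# Keep only GOP primary rows and drop national polls.
results <- questions %>%
  filter(topic == "2016-president-gop-primary", state != "US")

nrow(results)
```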
56+
57+
58+
## Problem 1C
59+
60+
In Problem 1B, we created a table called `results` with over 4000 rows.
61+
Does this mean that we have data for 4000 polls? How many polls
62+
did we actually have?
63+
Hint: look at the `id` column and use the `group_by` command.
64+
65+
```{r}
66+
##Your code here
67+
68+
```
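The hint can be sketched on toy data as follows. Because each poll contributes one row per candidate, counting rows overstates the number of polls; grouping by `id` collapses them. The table below is made up for illustration.

```r
library(dplyr)

# Toy results table: three polls, several candidate rows per poll.
results <- data.frame(
  id     = c(101, 101, 101, 102, 102, 103),
  choice = c("Trump", "Rubio", "Cruz", "Trump", "Rubio", "Trump"),
  value  = c(30, 20, 18, 28, 22, 35),
  stringsAsFactors = FALSE
)

# One row per poll id, with the number of candidate rows it contributed.
rows_per_poll <- results %>%
  group_by(id) %>%
  summarize(n_choices = n())

n_polls <- nrow(rows_per_poll)
n_polls
```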
69+
70+
71+
## Problem 1D
72+
73+
Look at the first row of your `results` table.
74+
What date was this poll conducted?
75+
Hint: Use the `polls` component of the `race2016` object to find the date.
76+
77+
```{r}
78+
##Your code here
79+
80+
```
81+
82+
## Problem 1E
83+
84+
Now examine the candidates in the "choices" column included in the `results` table.
85+
Hint: use the `table()` function. Note that there are several choices that
86+
are not going to be informative. For example, we have candidates that have
87+
dropped out. We also have entries such as `No one`, `No One` and
88+
`No Preference`. Filter the `results` table to include only Rubio and Trump.
89+
90+
```{r}
91+
##Your code here
92+
93+
```
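A minimal sketch of tabulating and then filtering the choices, using a made-up table. The column name `choice` is an assumption about how the candidate column is named in `results`.

```r
library(dplyr)

# Toy results table with some uninformative choices mixed in.
results <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 2),
  choice = c("Trump", "Rubio", "No one", "Trump", "Rubio", "No One", "Undecided"),
  value  = c(30, 20, 5, 28, 22, 4, 10),
  stringsAsFactors = FALSE
)

# Inspect how often each choice appears.
choice_counts <- table(results$choice)
choice_counts

# Keep only the two candidates of interest.
results <- results %>%
  filter(choice %in% c("Rubio", "Trump"))
```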
94+
95+
## Problem 1F
96+
97+
In our `results` table, we have one row for each candidate in each poll.
98+
Transform the `results` table to have one row for each poll and one column
99+
each for Rubio and Trump. Next, create a column called `diff` with the
100+
difference between Trump and Rubio (Trump minus Rubio). Hint: Remove the `first_name` and
101+
`last_name` columns then use the `tidyr` function `spread()`.
102+
103+
104+
```{r}
105+
##Your code here
106+
107+
```
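The reshaping described above can be sketched on toy data like this. The columns `choice` and `value` are assumed names for the candidate and percentage columns; the numbers are invented.

```r
library(dplyr)
library(tidyr)

# Toy long-format table: one row per candidate per poll.
results <- data.frame(
  id         = c(1, 1, 2, 2),
  state      = c("IA", "IA", "NH", "NH"),
  choice     = c("Trump", "Rubio", "Trump", "Rubio"),
  value      = c(28, 17, 31, 14),
  first_name = c("Donald", "Marco", "Donald", "Marco"),
  last_name  = c("Trump", "Rubio", "Trump", "Rubio"),
  stringsAsFactors = FALSE
)

# Drop the name columns first: they differ between the two candidate rows
# of a poll and would stop spread() from collapsing them into a single row.
wide <- results %>%
  select(-first_name, -last_name) %>%
  spread(key = choice, value = value) %>%
  mutate(diff = Trump - Rubio)

wide
```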
108+
109+
## Problem 1G
110+
111+
For each poll in the `results` table, we want to know the start date and the
112+
end date of the poll along with the pollster name and the type of poll it was.
113+
Hint: This information is in the `polls` component of `race2016`.
114+
You can select the relevant columns then use the `id` column to join the
115+
tables. One of the `join` functions in `dplyr` will do the trick.
116+
117+
```{r}
118+
##Your code here
119+
120+
```
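A minimal sketch of the join on toy tables. The column names in the toy `polls` table (`pollster`, `start_date`, `end_date`, `method`) are assumptions about the `polls` component, and the pollster names are made up.

```r
library(dplyr)

# Toy wide results table: one row per poll.
results <- data.frame(id = c(1, 2, 3), diff = c(11, 17, 5))

# Toy stand-in for race2016$polls; column names are assumed.
polls <- data.frame(
  id         = c(1, 2, 3, 4),
  pollster   = c("House A", "House B", "House A/House C", "House D"),
  start_date = as.Date(c("2016-01-20", "2016-01-25", "2016-01-28", "2016-02-01")),
  end_date   = as.Date(c("2016-01-23", "2016-01-27", "2016-01-31", "2016-02-03")),
  method     = c("Phone", "Internet", "Phone", "Phone"),
  stringsAsFactors = FALSE
)

# left_join keeps every row of results and attaches the matching poll info.
results <- results %>%
  left_join(select(polls, id, pollster, start_date, end_date, method),
            by = "id")

results
```

A `left_join` is used here so that no poll in `results` is dropped if its `id` happens to be missing from `polls`; an `inner_join` would silently discard such rows.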
121+
122+
123+
## Problem 1H
124+
125+
Study the type of values in the `pollster` column. Notice that you
126+
have many different values but that certain names commonly appear
127+
in these values. For example, consider the name "NBC" in the `pollster`
128+
column. NBC here is the Survey House. Use a join function again to add the survey
129+
house to the `results` table. Rename the column `house`.
130+
Hint: `race2016$survey_house` has the information you need.
131+
132+
```{r}
133+
##Your code here
134+
135+
```
136+
137+
138+
# Problem 2
139+
140+
We now have a table with all the information we need. We will now use
141+
the results from Iowa, New Hampshire, Nevada and South Carolina
142+
to determine how to create a prediction for upcoming primaries.
143+
144+
## Problem 2A
145+
146+
Use an internet search to determine the results for the Iowa,
147+
New Hampshire, Nevada and South Carolina primaries for the top three
148+
candidates. Create a table called `actual` with this information.
149+
Also, create a column with the actual election difference.
150+
Use a join function to add this information to our `results` table.
151+
152+
153+
```{r}
154+
##Your code here
155+
156+
```
157+
158+
## Problem 2B
159+
160+
Create boxplots of the poll results for Trump in Iowa stratified by
161+
the pollster survey house for polls having more than 4 total results.
162+
Add a horizontal line with the actual results.
163+
Hint: Use the `group_by`, `mutate`, `filter` and `ungroup` functions in
164+
`dplyr` for the filtering step.
165+
166+
```{r}
167+
##Your code here
168+
169+
```
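The filtering step in the hint can be sketched as follows, on made-up Iowa numbers with a hypothetical `house` column. Only the filtering is shown; the boxplot itself is left out.

```r
library(dplyr)

# Toy Iowa polls: house A has 5 polls, B has 2, C has 6.
iowa <- data.frame(
  house = c(rep("A", 5), rep("B", 2), rep("C", 6)),
  Trump = c(25, 27, 28, 26, 30, 22, 24, 31, 29, 28, 27, 30, 26),
  stringsAsFactors = FALSE
)

# Count polls per house, keep houses with more than 4, then drop the grouping
# so later steps operate on the whole table again.
iowa_filtered <- iowa %>%
  group_by(house) %>%
  mutate(n_polls = n()) %>%
  filter(n_polls > 4) %>%
  ungroup()

unique(iowa_filtered$house)
```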
170+
171+
## Problem 2C
172+
173+
Using the poll results for Trump in Iowa,
174+
compute the standard deviation for the results from each pollster house
175+
for polls having more than 4 total results.
176+
Then, study the typical standard deviation sizes used in
177+
these polls. Create a new table with two columns: the observed
178+
standard deviation and the standard deviations that theory predicts.
179+
For the prediction you have several observations. Pick the smallest
180+
one. Which is larger, the observed or the theoretical?
181+
182+
```{r}
183+
##Your code here
184+
185+
```
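One possible reading of the comparison is sketched below for a single survey house, using invented numbers. The theoretical standard deviation assumes simple random sampling, so a poll of size $N$ estimating a proportion $p$ has standard deviation $\sqrt{p(1-p)/N}$. The sample sizes and poll values are hypothetical.

```r
# Toy Trump percentages from one house and the (hypothetical) sample sizes.
trump <- c(24, 28, 31, 22, 27, 30)
N     <- c(400, 600, 900, 500, 450, 800)

# Observed spread across polls, in percentage points.
observed_sd <- sd(trump)

# Theoretical spread from sampling error alone, in percentage points.
# Each sample size gives one prediction; we keep the smallest.
p_hat <- mean(trump) / 100
theoretical_sd <- min(100 * sqrt(p_hat * (1 - p_hat) / N))

data.frame(observed = observed_sd, theoretical = theoretical_sd)
```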
186+
187+
## Problem 2D
188+
189+
Now using the data from Problem 2C, plot the individual values
190+
against the time the poll was taken (use the `end_date`).
191+
Repeat this for each of the four states. Use color to denote pollster house.
192+
Using this plot, explain why the theory does not match the observed results.
193+
194+
```{r}
195+
##Your code here
196+
197+
```
198+
199+
## Problem 2E
200+
201+
Consider the Trump - Rubio difference. For each poll in IA, NH, SC and NV,
202+
compute the error between the prediction and actual election results.
203+
Use exploratory data analysis to get an idea of how time and pollster
204+
impact accuracy.
205+
206+
```{r}
207+
##Your code here
208+
209+
```
210+
211+
212+
## Problem 2F
213+
214+
For polls from IA, NH, and SC, aggregate all polls from within 1 week of the
215+
election (use the `start_date` to determine cutoff) to provide a
216+
95% confidence interval for the difference between Trump and Rubio.
217+
Compare the following two approaches:
218+
(1) the method that assumes that all variance comes from sampling error
219+
and (2) the approach that estimates variance empirically.
220+
221+
```{r}
222+
##Your code here
223+
224+
```
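The two approaches can be sketched side by side on made-up polls for one state. Approach (1) is interpreted here as pooling all respondents and using the multinomial variance of a difference of two proportions, $(p_1 + p_2 - (p_1 - p_2)^2)/N$; that interpretation, the column `N` and the numbers are all assumptions.

```r
# Toy polls from the final week in one state (percentages and sample sizes).
polls <- data.frame(
  Trump = c(30, 34, 28, 32, 36),
  Rubio = c(20, 20, 20, 20, 20),
  N     = c(400, 500, 600, 450, 550)
)
polls$diff <- polls$Trump - polls$Rubio
z <- qnorm(0.975)

# (1) Sampling error only: pool respondents across polls.
p1 <- sum(polls$Trump * polls$N) / sum(polls$N) / 100
p2 <- sum(polls$Rubio * polls$N) / sum(polls$N) / 100
d_pooled  <- p1 - p2
se_theory <- sqrt((p1 + p2 - d_pooled^2) / sum(polls$N))
ci_theory <- 100 * (d_pooled + c(-1, 1) * z * se_theory)

# (2) Empirical: treat each poll's difference as one observation.
d_bar        <- mean(polls$diff)
se_empirical <- sd(polls$diff) / sqrt(nrow(polls))
ci_empirical <- d_bar + c(-1, 1) * z * se_empirical

rbind(theory = ci_theory, empirical = ci_empirical)
```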
225+
226+
227+
# Problem 3
228+
229+
Before seeing any polls my _prior belief_ is that Rubio will beat
230+
Trump in Florida. If I were to quantify this belief I would say that
231+
the distribution of the `Trump` - `Rubio` difference is normal with mean
232+
$\mu=-20$ percent and standard deviation $\tau=10$.
233+
Let's call the difference $\theta$. Then
234+
235+
$$
236+
\theta \sim N( \mu, \tau)
237+
$$
238+
239+
## Problem 3A
240+
241+
Under my prior belief, what is the chance that Trump would beat Rubio in Florida?
242+
243+
```{r}
244+
##Your code here
245+
246+
```
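Trump beats Rubio when $\theta > 0$, so under the prior the question reduces to a normal tail probability. A minimal sketch using the prior parameters stated above:

```r
mu  <- -20  # prior mean of the Trump - Rubio difference (percent)
tau <- 10   # prior standard deviation

# P(theta > 0) under theta ~ N(mu, tau)
prior_prob <- 1 - pnorm(0, mean = mu, sd = tau)
prior_prob
```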
247+
248+
## Problem 3B
249+
250+
Consider the latest 25 Florida polls. Assume the poll results for the
251+
difference are normally distributed with mean $\theta$ and standard
252+
deviation $\sigma$. Provide an estimate for $\theta$ and an estimate
253+
of the standard deviation $\sigma$.
254+
255+
```{r}
256+
##Your code here
257+
258+
```
259+
260+
$$ \hat{\theta} \sim N( \theta, \sigma/ \sqrt{25})$$
261+
262+
Now use the Central Limit Theorem to construct a confidence interval.
263+
264+
```{r}
265+
##Your code here
266+
267+
```
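A minimal sketch of the estimate and the interval, using 25 invented poll differences in place of the real Florida data:

```r
# Hypothetical Trump - Rubio differences from 25 polls (percentage points).
diffs <- seq(8, 20, length.out = 25)
n <- length(diffs)

theta_hat <- mean(diffs)  # estimate of theta
sigma_hat <- sd(diffs)    # estimate of sigma

# 95% confidence interval: theta_hat is approximately N(theta, sigma / sqrt(n)).
ci <- theta_hat + c(-1, 1) * qnorm(0.975) * sigma_hat / sqrt(n)
ci
```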
268+
269+
## Problem 3C
270+
271+
Combine these two results to provide the mean and standard deviation of
272+
a posterior distribution for $\theta$.
273+
274+
```{r}
275+
##Your code here
276+
277+
```
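For a normal prior and a normal sampling distribution, the posterior is also normal, with a mean that is a weighted average of the prior mean and the poll average. A minimal sketch, where `theta_hat` and `sigma_hat` are hypothetical stand-ins for the Problem 3B estimates:

```r
mu  <- -20  # prior mean
tau <- 10   # prior standard deviation

theta_hat <- 14  # hypothetical poll average
sigma_hat <- 5   # hypothetical poll-to-poll standard deviation
n <- 25

se <- sigma_hat / sqrt(n)  # standard error of the poll average

# Weight on the prior: small when the data are precise relative to the prior.
B <- se^2 / (se^2 + tau^2)

post_mean <- B * mu + (1 - B) * theta_hat
post_sd   <- sqrt(1 / (1 / se^2 + 1 / tau^2))

c(mean = post_mean, sd = post_sd)
```

With these numbers the data dominate: the standard error is much smaller than $\tau$, so the posterior mean sits close to the poll average rather than the prior mean.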
278+
279+
## Problem 3D
280+
281+
Use the result from Problem 3C to provide your estimate of the probability of
282+
Trump beating Rubio in Florida.
283+
284+
```{r}
285+
##Your code here
286+
287+
```
288+
289+
290+
# Problem 4
291+
292+
Use the poll data as well as the results from Super Tuesday (March 1st) and other election results that happen before the deadline to make predictions for each remaining primary. Then use these results to estimate the probability of Trump winning the Republican nomination. Justify your answer with figures, statistical arguments, and Monte Carlo simulations.
293+
294+
It will help to learn about how delegates are assigned. Here is [the manual](http://www.scribd.com/doc/294928557/2016-Presidential-Nominating-Process-Book-version-2-0-Dec-2015-pdf).
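To give a flavor of the Monte Carlo part, here is a deliberately simplified sketch. Everything in it is hypothetical: the three states, their posterior means and standard deviations, the delegate counts, the delegates already won, independence between states, and winner-take-all allocation (real allocation rules vary by state, as the manual explains). Only the 1,237-delegate threshold is the actual 2016 requirement.

```r
set.seed(1)

# Hypothetical remaining contests: posterior mean and SD of Trump's margin
# over his closest rival (percentage points) and delegates at stake.
states <- data.frame(
  state     = c("A", "B", "C"),
  post_mean = c(10, -2, 5),
  post_sd   = c(5, 6, 4),
  delegates = c(99, 66, 58)
)
already_won <- 1100  # hypothetical delegates secured so far
needed      <- 1237  # delegates required for the nomination

n_sim <- 10000

# One column per simulation, one row per state; states are simulated
# independently, which is an assumption.
sim_margin <- matrix(
  rnorm(n_sim * nrow(states), mean = states$post_mean, sd = states$post_sd),
  nrow = nrow(states)
)

# Winner-take-all (an assumption): a positive margin wins all delegates.
sim_delegates <- already_won + colSums((sim_margin > 0) * states$delegates)

prob_nomination <- mean(sim_delegates >= needed)
prob_nomination
```

A fuller analysis would replace the winner-take-all rule with each state's actual allocation rule and allow errors to be correlated across states.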
295+
296+
297+
