---
title: "Problem Set / Data Exercise Example"
author: "Devin Judge-Lord"
date: \today
output: pdf_document 
header-includes:  ## Add any LaTeX packages you need (or use a preamble/template)
    - \usepackage{setspace} ## spacing text 
---

```{r setup, include=FALSE}
## Sets defaults for R chunks
knitr::opts_chunk$set(echo = TRUE, # echo = TRUE means that your code will show
                      warning=FALSE,
                      message=FALSE,
                      # fig.path='Figs/', ## where to save figures
                      fig.height = 3,
                      fig.width = 3,
                      fig.align = 'center')

## Add any R packages you require. 
## Here are some we will use in 811:
requires <- c("tidyverse", # tidyverse includes dplyr and ggplot2
              "magrittr",
              "foreign",
              "readstata13",
              "here")

## Install any you don't have
to_install <- !(requires %in% rownames(installed.packages()))
if (any(to_install)) {
  install.packages(requires[to_install], repos = "https://cloud.r-project.org/")
}

## Load all required R packages
library(tidyverse)
library(ggplot2); theme_set(theme_bw())
library(magrittr)
library(here)
```
<!-- The above header sets everything up. -->

<!-- The below is just example content, edit/delete as needed. -->

<!-- NOTE: Just like LaTeX, Markdown is plain text. To use LaTeX syntax: $LaTeX syntax$ -->
Imagine that you are provided a sample of data and asked to estimate the linear regression model $y_i = \alpha + \beta x_i + \epsilon_i$ (or, in equivalent notation, $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$).

Let us say that these data contain 20 observations for two variables:

`Leg_Act` $\in [-20, 40]$ is the legislative activity of state assembly members, where -20 is no significant legislative activity and 40 is the maximum level of activity. This is the dependent variable, $Y$, with each observation being a $y_i$.

`terms` is the number of terms in office. This is your explanatory variable, $X$, with each observation being an $x_i$.

You have a number of tasks:

1. Plot the dependent variable against the explanatory variable.

2. Estimate the parameters $\alpha$ and $\beta$.

3. Compute the residuals: the difference between the observed values of the dependent variable and the predicted values from the estimated linear model (i.e., the vertical distance of each observed $y_i$ from the regression line).

4. Plot the residuals against the explanatory variable.

5. Correlate the observed values of the dependent variable $Y$ (the vector of each $y_i$) with the predicted values $\hat{Y}$.

6. Compare the square of this correlation (between the observed values of $Y$ and predicted $\hat{Y}$) to the model $R^2$.

7. Test the null hypothesis that $\beta = 0$ against an alternative that $\beta \neq 0$.

8. Write a paragraph (double-spaced) interpreting the parameters and explaining the results of your hypothesis test.  
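
For reference, tasks 2, 3, and 6 rest on the standard ordinary least squares formulas. The following is a brief sketch in the notation above; the hats (for estimates), the bars (for sample means), and $e_i$ (for residuals) are notation added here for illustration.

$$
\hat{\beta} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
$$

$$
\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i, \qquad
e_i = y_i - \hat{y}_i, \qquad
R^2 = 1 - \frac{\sum_{i} e_i^2}{\sum_{i}(y_i - \bar{y})^2}
$$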

**But**, for whatever reason, you want to do your problem set in R. [R Markdown](http://rmarkdown.rstudio.com) offers an easy way to do this without cutting and pasting. If you accidentally regressed X on Y rather than Y on X, fix the model and **pow**, your plots and estimates cited in your discussion are instantly corrected.

- [Here is the RMarkdown template](https://github.com/judgelord/PS811/raw/master/example_problem_set.Rmd) that made this pdf. Save it as a .Rmd file. 

- [Here is a pdf about writing in RMarkdown](https://github.com/judgelord/PS811/raw/master/example_notes.pdf)

**But** the data are in STATA!?! No problem. R can read .dta files.

In STATA, save the data generated by the `PS813_EX1` function with your seed:
```{}
net install PS813_EX1, from(https://faculty.polisci.wisc.edu/weimer/)

PS813_EX1 yourseed

save "EX1.dta"
```

Alternatively, run STATA in a chunk (R Markdown supports [many languages](https://bookdown.org/yihui/rmarkdown/language-engines.html)!). First [install Statamarkdown](https://www.ssc.wisc.edu/~hemken/Stataworkshops/Stata%20and%20R%20Markdown/InstallingStatamarkdown.html). Then, add a STATA setup chunk (just like our R setup chunk above) that allows STATA chunks: [Instructions here.](https://www.ssc.wisc.edu/~hemken/Stataworkshops/Stata%20and%20R%20Markdown/randstata.html)

Then load the saved data into R with the `readstata13` package:

**Note: R is looking for "EX1.dta" in a folder called "data" wherever this .Rmd file is saved.**
```{r data}
## Load your data, defining an R object called "d"
d <- readstata13::read.dta13(here("data/EX1.dta"))
glimpse(d)
```

```{r if_data_fail, echo=FALSE}
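## Fall back to placeholder data if loading the data failed.
## exists() avoids an "object not found" error when d was never created.
if(!exists("d") || is.null(d)){
  d <- data.frame("terms" = 0, "Leg_Act" = 0 )
  print("data/EX1.dta NOT FOUND in a folder called data where this .Rmd file is saved")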
  } 
```
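
As a quick check that R can read `.dta` files without STATA installed, here is a minimal sketch that writes a tiny made-up data frame to a temporary `.dta` file and reads it back. The numbers, `toy`, and `toy_path` are hypothetical and only for illustration. It uses the `foreign` package, whose `read.dta()` only supports older STATA file formats (through version 12), which is presumably why this template uses `readstata13` for the real data.

```{r sketch_dta_roundtrip}
## Sketch only: made-up numbers, not the PS813_EX1 data
toy <- data.frame(terms = c(1, 2, 3, 4),
                  Leg_Act = c(-5, 3, 10, 22))

## Write to a temporary .dta file, then read it back into R
toy_path <- tempfile(fileext = ".dta")
foreign::write.dta(toy, toy_path)
toy_back <- foreign::read.dta(toy_path)

## Inspect what came back
str(toy_back)
```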

Now on to the tasks: 
\newpage
<!-- Obviously, delete the above before you turn this in -->

In STATA, generate data with the `PS813_EX1` function:
```{}
net install PS813_EX1, from(https://faculty.polisci.wisc.edu/weimer/)

PS813_EX1 yourseed

save "EX1.dta"
```

# 1. A plot of Legislative Activity by Terms in Office

```{r plot_variables}
## STATA: plot Leg_Act terms
## R: 
ggplot(d, aes(y = Leg_Act, x = terms)) + 
  geom_point()
```

```{r correlation_variables}
## STATA: corr Leg_Act terms
## R: 
corXY <- cor(d$Leg_Act, d$terms)
corXY
```
The correlation between Legislative Activity and Terms in Office is `r corXY`.

# 2. Estimating linear regression
```{r regression}
## STATA: regress Leg_Act terms
## R: 
model <- lm(Leg_Act ~ terms, data = d) # the formula refers to columns of d
# summary(model)
alpha <- model$coefficients[1]
beta <- model$coefficients[2]
```
<!-- We can print R objects right in the text by typing "r object" inside grave accents (backticks) -->
Regression coefficients: $\alpha$ = `r alpha` and $\beta$ = `r beta`
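
To see where these estimates come from, the sketch below computes the closed-form OLS estimates by hand and compares them with `lm()`. It runs on simulated stand-in data, not the `PS813_EX1` data: the object names (`sim`, `fit`, `alpha_hand`, `beta_hand`) and the data-generating values are assumptions chosen only for illustration.

```{r sketch_ols_by_hand}
## Sketch only: simulated stand-in data (not the PS813_EX1 data)
set.seed(811)
sim <- data.frame(terms = sample(1:10, 20, replace = TRUE))
sim$Leg_Act <- -10 + 3 * sim$terms + rnorm(20, sd = 5)

## Closed-form OLS estimates: slope = cov(x, y) / var(x)
beta_hand  <- cov(sim$terms, sim$Leg_Act) / var(sim$terms)
alpha_hand <- mean(sim$Leg_Act) - beta_hand * mean(sim$terms)

## The same estimates from lm()
fit <- lm(Leg_Act ~ terms, data = sim)

## Rows should agree up to floating-point error
rbind(by_hand = c(alpha_hand, beta_hand),
      lm      = unname(coef(fit)))
```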

# 3. Computing residuals
```{r residuals}
## STATA: predict p_Leg_Act
## R:
d$p_Leg_Act <- predict(model)

## STATA: generate resid = Leg_Act - p_Leg_Act
## R:
d$resid <- d$Leg_Act - d$p_Leg_Act
```
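
As a cross-check, R can also return residuals directly with `residuals()`. The sketch below, again on simulated stand-in data with hypothetical object names (`sim`, `fit`, `resid_hand`), confirms that observed minus predicted matches `residuals()` and that OLS residuals sum to roughly zero when the model includes an intercept.

```{r sketch_residuals}
## Sketch only: simulated stand-in data (not the PS813_EX1 data)
set.seed(811)
sim <- data.frame(terms = sample(1:10, 20, replace = TRUE))
sim$Leg_Act <- -10 + 3 * sim$terms + rnorm(20, sd = 5)
fit <- lm(Leg_Act ~ terms, data = sim)

## Residuals by hand: observed minus predicted
resid_hand <- sim$Leg_Act - predict(fit)

## Compare with the built-in extractor
all.equal(unname(resid_hand), unname(residuals(fit)))

## With an intercept, residuals sum to zero (up to rounding error)
sum(residuals(fit))
```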

# 4. Plot of Residuals
```{r plot_residuals, fig.width = 5}
## STATA:  plot resid terms
## R:
ggplot(d) +
  aes(y = resid, x = terms) + # "aesthetics"
  geom_point() + # a layer of points
  ## to show how residuals are the distance between an observation and the regression line:
  geom_hline(yintercept = 0) +
  geom_col(alpha = .1, width = .1, position = "dodge") +
  ## + labels:
  labs(title = "Residuals (Observed - Predicted Legislative Activity)",
       x = "Terms in Office",
       y = "Residuals")
```

# 5. $Cor(Y,\hat{Y})$
```{r correlation_observed_predicted}
## STATA: corr Leg_Act p_Leg_Act
## R:
correlation <- cor(d$Leg_Act, d$p_Leg_Act)
```
$Cor(Y,\hat{Y})$ = `r correlation`

# 6. $Cor(Y,\hat{Y})^2$ vs. $R^2$.
```{r correlation_vs_R2}
## STATA: generate r2 =r(rho)*r(rho)
## R: 
r2 <- summary(model)$r.squared
```
$R^2$ = `r r2`
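
To make the comparison in this task explicit, the sketch below squares the correlation between observed and fitted values and sets it beside the model $R^2$. In a bivariate regression with an intercept the two should match. It uses simulated stand-in data, and the names `sim`, `fit`, `cor_sq`, and `r_sq` are assumptions for illustration only.

```{r sketch_cor_squared}
## Sketch only: simulated stand-in data (not the PS813_EX1 data)
set.seed(811)
sim <- data.frame(terms = sample(1:10, 20, replace = TRUE))
sim$Leg_Act <- -10 + 3 * sim$terms + rnorm(20, sd = 5)
fit <- lm(Leg_Act ~ terms, data = sim)

## Squared correlation between observed and fitted values
cor_sq <- cor(sim$Leg_Act, fitted(fit))^2

## R-squared reported by the model summary
r_sq <- summary(fit)$r.squared

## Side by side, then a tolerance-based equality check
c(cor_squared = cor_sq, r_squared = r_sq)
all.equal(cor_sq, r_sq)
```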

# 7. Hypothesis test
\Large

Lorem ipsum $\beta = 0$

Lorem ipsum $\beta \neq 0$
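
One possible way to carry out this test in R is sketched below: read the $t$ statistic and two-sided $p$-value for the slope from `summary()`, then reproduce them by hand. The data are simulated stand-ins, and the object names (`sim`, `fit`, `coefs`, `t_hand`, `p_hand`) are assumptions for illustration only.

```{r sketch_t_test}
## Sketch only: simulated stand-in data (not the PS813_EX1 data)
set.seed(811)
sim <- data.frame(terms = sample(1:10, 20, replace = TRUE))
sim$Leg_Act <- -10 + 3 * sim$terms + rnorm(20, sd = 5)
fit <- lm(Leg_Act ~ terms, data = sim)

## Coefficient table: estimate, standard error, t value, p-value
coefs  <- summary(fit)$coefficients
t_stat <- coefs["terms", "t value"]
p_val  <- coefs["terms", "Pr(>|t|)"]

## The same test by hand: t = estimate / standard error,
## with a two-sided p-value from the t distribution on n - 2 degrees of freedom
t_hand <- coefs["terms", "Estimate"] / coefs["terms", "Std. Error"]
p_hand <- 2 * pt(abs(t_hand), df = df.residual(fit), lower.tail = FALSE)

c(t_summary = t_stat, t_by_hand = t_hand)
c(p_summary = p_val, p_by_hand = p_hand)
```

The null hypothesis $\beta = 0$ is rejected at the 5% level when the $p$-value is below 0.05.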

# 8. Discussion
<!-- If printing assignments, it is nice to use \large or \Large text -->
\Large
\doublespacing
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.