-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCODE Marriage Ages Bayesian Analysis.Rmd
195 lines (145 loc) · 6.2 KB
/
CODE Marriage Ages Bayesian Analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: "Age and Marriage Bayesian Analysis"
author: "Rory Quinlan"
output: html_document
---
```{r, warning=F}
library(MASS)
library(ggplot2)
library(kableExtra)
```
```{r}
agehw <- read.table("C:\\Users\\roryq\\Downloads\\agehw.dat", header = T)
nrow(agehw)
```
### Descriptive Statistics
```{r}
# Table Descriptive stat
`Mean Age of Wives`<-c(mean(agehw$agew))
`Mean Age of Husbands`<-c(mean(agehw$ageh))
table<-rbind(`Mean Age of Wives`,`Mean Age of Husbands`)
t(table) %>%
kbl() %>%
kable_styling()
```
```{r}
`Standard Deviation for Age of Wives`<-c(sd(agehw$agew))
`Standard Deviation for Age of Husbands`<-c(sd(agehw$ageh))
table<-rbind(`Standard Deviation for Age of Wives`,`Standard Deviation for Age of Husbands`)
t(table) %>%
kbl() %>%
kable_styling()
```
```{r}
# Table covariance
cov(agehw) %>%
kbl() %>%
kable_styling()
```
```{r}
# Visualization of relationship between husbands and wives
plot(x=agehw$ageh, y= agehw$agew, pch=21, xlab="Age of Husband", ylab="Age of Wife", main="Relationship Between Age of Husbands and Wives (years)", col="black",bg="lightblue")
```
### Bayesian Analysis
```{r}
# Generate predictive data set (size 100) from average ages (theta) and cov matrix (sigma)
n = 100
s = 10
# Mean of ages
mu0 <- c(42,42)
L0 <- matrix(c(441, 330.75, 330.75, 441), nrow = 2, ncol = 2)
# Mean ages follow multivariate norm
theta <- mvrnorm(s, mu0, L0)
# sample sigmas following inverse wishart distribution
sigmas <- list()
for(i in 1:s){
sigma <- solve(rWishart(1, 4, solve(L0))[,,1])
sigmas[[i]] <- sigma
}
# Sample age from multivariate normal distribution
data <- data.frame(h_age= c(), w_age = c(), dataset= c())
for(i in 1:s){
y <- mvrnorm(100, mu0, L0)
new <- data.frame(h_age = y[,1], w_age = y[,2], dataset =i)
data <- rbind(data, new)
}
```
```{r}
# Plot the 10 predictive datasets on scatter plot to see if it matches the trend of the original and confirm our selection of prior (inverse whishart for sigma and multivariate normal for theta)
ggplot(data = data, aes(x = h_age, y = w_age)) + geom_point(color='darkblue') + facet_wrap(~dataset)+ labs(x=" Age of Husband", y="Age of Wife", title="Age of Husband vs Wife for Predictive Data")
```
+ There appears to be a moderately strong positive correlation between age of spouses for each of the predictive data sets. This is the expected relationship and reflects the one we observed from the real data. The similarity confirms our selections of priors for the mean and covariance of ages.
```{r}
# MCMC Approximations
#prior
mu0 <- c(42,42)
nu0 <- 4
L0 <- S0 <- matrix(c(150, 112.5, 112.5, 150), nrow = 2, ncol = 2)
ybar <- apply(agehw, 2, mean)
Sigma <- cov(agehw)
n <- dim(agehw)[1]
THETA<-SIGMA <- NULL
set.seed(10000)
for(s in 1:10000){
#Update theta
Ln <- solve(solve(L0) + n * solve(Sigma))
mun <- Ln %*% (solve(L0) %*% mu0 + n * solve(Sigma) %*% ybar)
theta <- mvrnorm(1, mun, Ln)
#Update Simga
Sn <- S0 + (t(agehw) - c(theta)) %*% t( t(agehw) - c(theta))
Sigma <- solve( rWishart(1, nu0 + n, solve(Sn))[,,1])
# save results
THETA <- rbind(THETA, theta) ; SIGMA <- rbind(SIGMA, c(Sigma))
}
```
```{r}
# Make MCMC correlations into dataframe and plot
cov <- SIGMA[,2]
var_h <- SIGMA[,1]
var_w <- SIGMA[,4]
corr <- cov/sqrt(var_h*var_w)
corr <- data.frame(corr)
# Plot Sampled Correlation
ggplot(data = corr, aes(x = corr)) + geom_density()+labs(title="Distribution of Correlation Samples")
```
```{r}
# Print Confidence intervals for descriptive statistics
`Average Age for Husband CI`<- c(quantile(THETA[,1],.05),quantile(THETA[,1],.95))
`Average Age of Wives CI`<- c(quantile(THETA[,2],.05),quantile(THETA[,2],.95))
`Correlation Coefficient CI`<-c(quantile(corr[,1], c(.05, .95)))
bt<- as.data.frame(rbind(`Average Age of Wives CI`,`Average Age for Husband CI`,`Correlation Coefficient CI`))
t(bt) %>%
kbl() %>%
kable_styling()
```
### Frequentist Analysis
```{r}
result <- t.test(agehw$agew)
# Extract the confidence interval
`Avg Age of Wives CI` <- result$conf.int
result1 <- t.test(agehw$ageh)
# Extract the confidence interval
`Avg Age of Husbands CI` <- result1$conf.int
# Correlations
`fcor` <- cor.test(~ ageh + agew, data = agehw)
`correlation Coefficient CI`<-`fcor`$conf.int
ft<-as.data.frame(rbind(`Avg Age of Wives CI`, `Avg Age of Husbands CI`,`correlation Coefficient CI`))
t(ft) %>%
kbl() %>%
kable_styling()
```
### Results
```{r}
# Improvement with bayesian analysis
Improve<-as.data.frame(cbind(t(bt),t(ft)))
Improve$Improvement_Correlation<-abs(Improve$`Correlation Coefficient CI`-Improve$`correlation Coefficient CI`)
Improve$Wife_Improvement<- abs(Improve$`Average Age of Wives CI`-Improve$`Avg Age of Wives CI`)
Improve$Husband_Improvement<- abs(Improve$`Average Age for Husband CI`-Improve$`Avg Age of Husbands CI`)
Improve[,7:9] %>%
kbl() %>%
kable_styling()
```
The limited sample size and no prior information about the variables in question population leads to a relatively wide confidence interval using standard frequentist approaches. Using the bayesian approach, the confidence interval for average age of husbands and wives was reduced by over a year. And the confidence interval for the correlation coefficient was reduced by 2%, compared to the frequentist.
After visualizing the relationship we formed priors for the 2 variables we are interested in predicting. Correlation is 2 dimensional so we formulated it following an inverse wishart distribution, and the mean age for spouses follows a multivariate normal distribution.
Once we established our priors, we can generate more data by taking random samples of these distributions. We confirm our priors and data validity by comparing simulated data scatter plots to the actual data scatter plots. The moderately strong positive correlation holds in each predictive data set to the original.
After confirming distributions, we use MCMC approximation to estimate the mean correlation, age of husband, and age of wife. With the prior information about the distribution of the data we are able to create new confidence intervals for average ages, and correlations.