Chapter3.qmd

---
title: "Chapter3"
format: 
  html:
    code-fold: false
    code-tools: true
editor: source
author: Lilla Gurtner
editor_options: 
  chunk_output_type: console
---

```{r setup, echo = F, message = F, error=F, include = F}

# knitr global options ----
knitr::opts_chunk$set(fig.pos = 'H',
                      echo = T,
                      message = F,
                      warning = F, 
                      dpi = 600,
                      fig.align = "center", 
                      fig.asp = 0.62) # golden ratio

library(tidyverse)
library(tidybayes) # nice ploting
library(brms)
library(wesanderson)

theme_set(theme_tidybayes())

```


The Easy problems use the samples from the posterior distribution for the globe tossing example. This code will give you a specific set of samples, so that you can check your answers exactly. Use the values in samples to answer the questions that follow.


```{r data4easy}
# hypothezised_p_of_water is a sequence of all possible values from percentage of water on planet
n_water = 6
n_tosses = 9

d_e <- tibble(hypothezised_p_of_water = seq(from = 0, to = 1, length.out = 1000),
           prior  = 1) |> # define prior
    mutate(likelihood = dbinom(n_water, size = n_tosses, prob = hypothezised_p_of_water)) |>  # compute likelihood at each value in grid
    mutate(unstd_posterior = likelihood * prior) |> # compute product of likelihood and prior
    mutate(posterior = unstd_posterior / sum(unstd_posterior))   # standardize the posterior, so it sums to 1


n_samples <- 10000
set.seed(100)

samples_e <- d_e |> 
  sample_n(size = n_samples, weight = posterior, replace = T) # sampling entire lines from d_e. the "posterior" argument gives every line in d_e the probability of being chosen for the samples tibble


d_e |> ggplot(aes(x = hypothezised_p_of_water, y = posterior)) + geom_line()

samples_e |> ggplot(aes(x = hypothezised_p_of_water, y = posterior)) + geom_line() # samples for the easy questions

```


The two distributions look similar. However, as `n_samples` becomes small, the shape of samples$posterior becomes less smooth.

# 3E1. 
How much posterior probability lies below p = 0.2? 
```{r}
samples_e |> 
  filter(hypothezised_p_of_water < 0.2) |>  # all the p's < 0.2
  summarise(perc_p = n()/n_samples) # get percentage of all p's < 0.2
```


# 3E2. 
How much posterior probability lies above p = 0.8? 

```{r}
samples_e |> 
  filter(hypothezised_p_of_water > 0.8) |>  
  summarise(perc_p = n()/n_samples)

```

# 3E3. 
How much posterior probability lies between p = 0.2 and p = 0.8? 
```{r}

samples_e |> 
  filter(hypothezised_p_of_water < 0.8 & hypothezised_p_of_water > 0.2) |> 
  summarise(perc_p = n()/n_samples)

# or:
samples_e |> 
  filter(between(hypothezised_p_of_water, 0.2, 0.8)) |> 
  summarise(perc_p = n()/n_samples)

```


# 3E4. 
20% of the posterior probability lies below which value of p? 
```{r}
quantile(samples_e$hypothezised_p_of_water, probs = 0.2)
```


#3E5. 
20% of the posterior probability lies above which value of p? 
```{r}

quantile(samples_e$hypothezised_p_of_water, probs = 0.8)
```


# 3E6. 
Which values of p contain the narrowest interval equal to 66% of the posterior probability? 

```{r}
mode_hdi(samples_e$hypothezised_p_of_water, .width = 0.66)
```


# 3E7. 
Which values of p contain 66% of the posterior probability, assuming equal posterior probability both below and above the interval?

```{r}
# 33% of the posterior must be outside, 17.5% on each side of the distribution
quantile(samples_e$hypothezised_p_of_water, probs = c(0.175, .825))
```


# 3M1.
Suppose the globe tossing data had turned out to be 8 water in 15 tosses. Construct the posterior distribution, using grid approximation. Use the same flat prior as before. 

```{r}
# hypothezised_p_of_water is a sequence of all possible values from percentage of water on planet
n_water = 8
n_tosses = 15

d_m <- tibble(hypothezised_p_of_water = seq(from = 0, to = 1, length.out = 1000),
           prior  = 1) |> # define prior
    mutate(likelihood = dbinom(n_water, size = n_tosses, prob = hypothezised_p_of_water)) |>  # compute likelihood at each value in grid
    mutate(unstd_posterior = likelihood * prior) |> # compute product of likelihood and prior
    mutate(posterior = unstd_posterior / sum(unstd_posterior))   # standardize the posterior, so it sums to 1

d_m |> ggplot(aes(x = hypothezised_p_of_water, y = posterior)) + geom_line()
```


# 3M2. 
Draw 10,000 samples from the grid approximation from above. Then use the samples to calculate the 90% HPDI for p. 

```{r}
n_samples <- 10000
set.seed(100)

samples_m <- d_m %>% 
  sample_n(size = n_samples, weight = posterior, replace = T)


mode_hdi(samples_m$hypothezised_p_of_water, .probs = c(0.05, 0.95))

samples_m |> ggplot(aes(x = hypothezised_p_of_water, y = posterior)) + geom_line()

```

# 3M3. 
Construct a posterior predictive check for this model and data. This means simulate the distribution of samples, averaging over the posterior uncertainty in p. What is the probability of observing 8 water in 15 tosses?

=> here, we use code from Solomon Kurz' translation of the textbook, [section 3.3.2](https://bookdown.org/content/4857/sampling-the-imaginary.html#sampling-to-simulate-prediction)

```{r}
n_dummy_water <- tibble(n_waters = rbinom(n_samples, # same size in samples_m
                                          size = n_tosses, # 15
                                          prob = samples_m$hypothezised_p_of_water)) # create 10000 simulated observed water in n_tosses, with the actual percentage of water coming from the p's sampled proportionally from the posterior 


# how many of the dummy-data results show 8 waters in 15 tosses?
mean(n_dummy_water$n_waters == 8)


n_dummy_water %>% 
  ggplot(aes(x = n_waters)) +
  geom_histogram(binwidth = 1, center = 0,
                 color = "grey92") +
  scale_x_continuous("dummy water count") +
  ylab("frequency") 

```

# 3M4. 
Using the posterior distribution constructed from the new (8/15) data, now calculate the probability of observing 6 water in 9 tosses.


```{r}
n_dummy_water <- tibble(n_waters = rbinom(n_samples, # same size in samples_m
                                          size = 9, # 15
                                          prob = samples_m$hypothezised_p_of_water)) # create 10000 simulated observed water in n_tosses, with the actual percentage of water coming from the p's sampled proportionally from the posterior 


# how many of the dummy-data results show 6 waters in 9 tosses, but taking the posterior from the 8/15 data?
mean(n_dummy_water$n_waters == 6)


n_dummy_water %>% 
  ggplot(aes(x = n_waters)) +
  geom_histogram(binwidth = 1, center = 0,
                 color = "grey92") +
  scale_x_continuous("dummy water count") +
  ylab("frequency") 
```

# 3M5
Start over at 3M1, but now use a prior that is zero below p = 0.5 and a constant above p = 0.5. This corresponds to prior information that a majority of the Earth’s surface is water. Repeat each problem above and compare the inferences (using both priors) to the true value p = 0.7.

```{r 3m5}
n_water = 8
n_tosses = 15

d_m2 <- tibble(hypothezised_p_of_water = seq(from = 0, to = 1, length.out = 1000),
           prior  = case_when(hypothezised_p_of_water < 0.5 ~ 0, 
                              hypothezised_p_of_water >= 0.5 ~ 0.1)) %>%                                       # define prior
    mutate(likelihood = dbinom(n_water, size = n_tosses, prob = hypothezised_p_of_water)) %>%  # compute likelihood at each value in grid
    mutate(unstd_posterior = likelihood * prior) %>%             # compute product of likelihood and prior
    mutate(posterior = unstd_posterior / sum(unstd_posterior))   # standardize the posterior, so it sums to 1

n_samples <- 10000
set.seed(3)

samples_m2 <- d_m2 %>% 
  sample_n(size = n_samples, weight = posterior, replace = T)

# re-doing also the easy ones, for the sake of completeness
#: post-prob < 0.2
mean(samples_m2$hypothezised_p_of_water < 0.2)

mean(samples_m2$hypothezised_p_of_water > 0.8)

mean(between(samples_m2$hypothezised_p_of_water, left = 0.2, right = 0.8))

quantile(samples_m2$hypothezised_p_of_water, probs = 0.2)
mode_hdi(samples_m2$hypothezised_p_of_water, .width = .66)

quantile(samples_m2$hypothezised_p_of_water, probs = c(0.175, 0.825))

# Draw 10,000 samples from the grid approximation from above. Then use the sample to calculate the 90% HPDI for p.

mode_hdi(samples_m2$hypothezised_p_of_water, .width = 0.9)

# Construct a posterior predictive check for this model and data. The means simulate the distribution of samples, averaging over the posterior uncertainty in p. What is the probability of observing 8 water in 15 tosses?

dummy_tosses <- rbinom(1000, 15, prob = samples_m2$hypothezised_p_of_water)

mean(dummy_tosses == 8)


# Using the posterior distribution constructed from the new (8/15) data, now calculate the probability of observing 6 water in 9 tosses.

dummy_tosses <- rbinom(1000, 9, prob = samples_m2$hypothezised_p_of_water)

mean(dummy_tosses == 6)

d_m2 |>  ggplot(aes(x = hypothezised_p_of_water, y= posterior))  + geom_line()
hist(samples_m2$hypothezised_p_of_water)

```

# 3M6. 

Suppose you want to estimate the Earth's proportion of water very precisely. Specifically, you want the 99% percentile interval of the posterior distribution of p to be only 0.05 wide. This means the distance between the upper and lower bound of the interval should be 0.05. How many times will you have to toss the globe to do this?

```{r}
# strategy: trying out until the mode_hdi is as small as required
#generating the "real" data
n_tosses <- 2201 # play around with this number
n_water <- rbinom(1, n_tosses, prob = 0.7) # fake observations of earth

# getting posterior => best guess of proportion
d_m3 <- tibble(hypothezised_p_of_water = seq(from = 0, to = 1, length.out =1000), 
            prior = 1, 
            likelyhood = dbinom(n_water, n_tosses, prob = hypothezised_p_of_water), 
            posterior = likelyhood*prior / (sum(likelyhood*prior)))

# sampling from the posterior to simulate redoing the experiment 10000 times
samples_m3 <- sample(d_m3$hypothezised_p_of_water,size = 10000, prob = d_m3$posterior, replace = T )


# what is the 99% hdi?
a <- mode_hdi(samples_m3, .width = 0.99)
a$ymax - a$ymin
```


# 3H1.
Using grid approximation, compute the posterior distribution for the probability of a birth being a boy. Assume a uniform prior probability. Which parameter value maximizes the posterior probability?

```{r hard data}
birth1 <- c(1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0, 0,0,0,1,1,1,0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0, 1,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0, 1,0,1,1,1,0,1,1,1,1) 

birth2 <- c(0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0, 1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1, 0,0,0,1,1,1,0,0,0,0)

all_births <- c(birth1, birth2)

```


```{r}

n_births <- length(all_births)
n_boys <- sum(birth1) + sum(birth2)

d_h <- tibble(p_boy_hypothetical = seq(from = 0, 
                                     to  = 1,
                                     length.out = 1000), 
            prior_boy = 0.01, 
            likelyhood = dbinom(x = n_boys, 
                                size = n_births, 
                                prob = p_boy_hypothetical), 
            post_unst = likelyhood * prior_boy, 
            standartization_term = sum(likelyhood * prior_boy),
            posterior = post_unst / standartization_term)


# Which parameter value maximizes the posterior probability?
d_h |> filter(posterior == max(d_h$posterior))


d_h |> pivot_longer(names_to = "value_name", values_to = "value", 2:6) |> 
  ggplot(aes(x = p_boy_hypothetical, y = value, color = value_name)) + geom_point()+
  scale_color_manual(values = wes_palette("Darjeeling2", n = 5, type = "discrete"))
```

# 3H2.
Using the sample function, draw 10,000 random parameter values from the posterior distribution you calculated above. Use these samples to estimate the 50%, 89%, and 97% highest posterior density intervals.

```{r}
samples_h <- sample(d_h$p_boy_hypothetical, size = 1e4, prob = d_h$posterior, replace = T)

mode_hdi(samples_h, .width = c(0.5, 0.89, 0.97))
```


#3H3. 
Use rbinom to simulate 10,000 replicates of 200 births. You should end up with 10,000 numbers, each one a count of boys out of 200 births. Compare the distribution of predicted numbers of boys to the actual count in the data (111 boys out of 200 births). There are many good ways to visualize the simulations, but the dens command (part of the rethinking package) is probably the easiest way in this case. Does it look like the model fits the data well? That is, does the distribution of predictions include the actual observation as a central, likely outcome?

```{r}
#simulating 10'000 cohorts 200 births to get the number of births from the posterior of estimations of the most likely true percentage of boys
dummy_data_n_boys<- rbinom(1e5, size = 200, prob = samples_h)

break_func <- function(x) {
  length(seq(min(x), max(x), by = 1)) + 1
}

ggplot() + 
  stat_histinterval(aes(x = dummy_data_n_boys), .width = c(0.66, 0.89), breaks = break_func) +
  geom_vline(aes(xintercept = n_boys), linetype = "dashed", color = "red") +
  labs(x = "Number of Boys", y = "Density")
```

# 3H4.

Now compare 10,000 counts of boys from 100 simulated first borns only to the number of boys in the first births, birth1. How does the model look in this light?

```{r}

dummy_data_n_boys<- rbinom(1e5, size = 100, prob = samples_h)

break_func <- function(x) {
  length(seq(min(x), max(x), by = 1)) + 1
}

ggplot() + stat_histinterval(aes(x = dummy_data_n_boys), .width = c(0.66, 0.89), breaks = break_func) +
  geom_vline(aes(xintercept = sum(birth1)), linetype = "dashed", color = "red") +
  labs(x = "Number of Boys", y = "Density")

```


# 3H5.

The model assumes that sex of first and second births are independent. To check this assumption, focus now on second births that followed female first borns. Compare 10,000 simulated counts of boys *(coming from the original model which assumes independence)* to only those second births that followed girls. To do this correctly, you need to count the number of first borns who were girls and simulate that many births, 10,000 times. Compare the counts of boys in your simulations to the actual observed count of boys following girls. How does the model look in this light? Any guesses what is going on in these data?

```{r}
dummy_data_n_boys<- rbinom(1e5, size = 100, prob = samples_h)

n_girl1 <- sum(birth1 == 0)

dummy_data_boys_after_girls <- rbinom(n = 1e5, size = n_girl1, prob = samples_h)

births_data <- tibble(birth1 = birth1, 
                      birth2 = birth2) |> 
  mutate(boy_after_girl = case_when(birth1 == 0 & birth2 == 1 ~ 1, 
                                    .default = 0))

true_n_boys_after_girl <- sum(births_data$boy_after_girl)

ggplot() +
   geom_histogram(aes(x = dummy_data_boys_after_girls))  + 
  geom_vline(xintercept = true_n_boys_after_girl)

```

Conclusion: The model, assuming independence (i.e. drawing from the posterior of all the births `sample_h`) would predict much fewer boys in the second births. So either having a girl alters something biological such that conception of boys becomes more likely, or families who have already a girls are more likely to abort female fetuses and keep the male ones more often.