wordfish_bund.Rmd

---
title: "wordfish_bund"
author: "Anton Könneke"
output:
  html_document:
    df_print: paged
---

This is an analysis of party positions over time that relies on party's press 
releases and wordfish.

# Data preperation
I use the press releases of the bundestag factions of CDU, Greens, SPD and AfD for now. 
All press releases were scraped from the respective websites and are available in this repo. 
Before I conduct the analyses, stopwords, author's names and partynames should be
deleted. The documents need to be harmonised with regards to covariates, mainly dates.

```{r setup}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(pacman)
p_load(tidyverse, quanteda, quanteda.textstats, readtext, quanteda.textmodels, ggforce, grid)
press_releases <- readRDS("texts/press_releases.rds")
```

# Frequency
Lets start by simply examining the activity of both parties over time:
```{r frequency}
party_colors <- 
  scale_color_manual("Party",
                     values = c("afd" = "#009ee0", "cdu/csu" =  "black", "spd" = "red",
                                "greens" = "forestgreen", left = "magenta",
                                "fdp" = "yellow3"))

press_releases <- press_releases %>% 
  mutate(month = lubridate::floor_date(date, "month"))
press_releases %>% 
  filter(month >= lubridate::ymd("2020-10-01")) %>% 
  summarise(n = n(), .by = c(month, party)) %>% 
  ggplot(aes(month, n, color = party))+
  geom_line(size = 1)+
  party_colors+
  theme_bw()
```

# Preprocessing
```{r clean}
press_corp <- corpus(press_releases, text_field = "alltext")
press_tok <-
  tokens(
    press_corp,
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbol = TRUE,
    remove_separators = TRUE,
    split_hyphens = T,
    remove_url = T
  ) %>% 
  tokens_tolower()
press_tok <- tokens_remove(press_tok, pattern = c(stopwords("de")))

# remove names from BT-Members
mdbs <- btmembers::import_members()
mdbs19_20wp <- mdbs$wp %>% 
  filter(wp %in% c(19, 20)) %>% pull(id)

mdb_names <- mdbs$namen %>%
  filter(id %in% mdbs19_20wp) %>%
  mutate(name = paste(vorname, nachname, sep = " ")) %>% 
  pull(name) 

mdb_names <- tokens(paste(mdb_names, collapse = " "),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbol = TRUE,
    remove_separators = TRUE
  ) %>% tokens_tolower() %>% 
  as.character()

press_tok <- tokens_remove(press_tok, mdb_names)
press_tok <- press_tok %>% 
  tokens_remove(mdb_names) %>% 
  tokens_remove(c("cdu", "afd", "spd", "grüne", "fraktion", "sprecher", "fdp", "linke"))

```

# Wordfish
```{r wordfish}
# press_dfmat <- dfm_subset(press_dfmat, date > as.Date("2021-10-01"))
press_dfmat$party_mon = paste(press_dfmat$party, press_dfmat$month, sep = "_")
press_dfmat_oppo <- dfm_subset(press_dfmat, party %in% c("cdu/csu", "afd", "left"))

press_dfmat_oppo <- press_dfmat_oppo %>% 
              dfm_trim(min_termfreq = 5, termfreq_type = "count",
                       max_docfreq = 0.5, docfreq_type = "prop")

wf <- quanteda.textmodels::textmodel_wordfish(
  x = dfm_group(press_dfmat_oppo, groups = party_mon),
  dir = c(1,2))

quanteda.textplots::textplot_scale1d(wf, margin = "documents")
quanteda.textplots::textplot_scale1d(wf, margin = "features",
                                     highlighted = c("impfzwang", "muezzins",
                                                     "masseneinwanderung",
                                                     "steuerwettbewerb",
                                                     "schutzmaßnahmen",
                                                     "asyl",
                                                     "antifa",
                                                     "cancel"))


tibble(
  psi = wf$psi,
  beta = wf$beta,
  features = wf$features) %>% 
  arrange((beta))

tibble(
  doc = wf$docs,
  theta  = wf$theta,
  se = wf$se.theta,
  date = wf$x@docvars$month,
  party = wf$x@docvars$party
) %>% 
  filter(date > as.Date("2020-01-01")) %>% 
  mutate(hi = theta + (se*1.96),
         lo = theta - (se*1.96)) %>% 
  ggplot(aes(date, theta, shape = party, color = party,
             ymin = lo, ymax = hi))+
  geom_pointrange()+
  geom_smooth(se = F)+
  theme_bw()+
  # annotate(
  #   "curve",
  #   xend = as.Date("2021-12-17"),
  #   yend = -0.25,
  #   x = as.Date("2022-07-10"),
  #   y = 0.25,
  #   curvature = .5,
  #   arrow = arrow()
  # ) +
  # annotate(
  #   "text",
  #   x = as.Date("2022-07-10"),
  #   y = 0.25,
  #   label  = "Friedrich Merz\nwird CDU-Vorsitzender",
  #   hjust = "left",
  #   vjust = "bottom"
  # ) +
  party_colors
```