---
title: "Text mining with tidy data principles"
author: "Julia Silge"
output:
  learnr::tutorial:
    allow_skip: true
runtime: shiny_prerendered
description: "Learn how to use tidytext!"
---
<style>
@import url('https://fonts.googleapis.com/css2?family=Roboto+Mono&family=Open+Sans+Condensed&family=Opens+Sans:wght@300;400&display=swap');
</style>
```{r custom-css, echo=FALSE}
shiny::includeCSS("css/custom_css.css")
```
```{r setup, include=FALSE}
library(learnr)
library(tidyverse)
library(tidytext)
library(stopwords)
library(ggrepel)
library(scales)
library(plotly)
library(here)
library(gradethis)
knitr::opts_chunk$set(echo = FALSE, exercise.checker = gradethis::grade_learnr)
theme_set(theme_light())
ted_talks <- read_rds(here::here("data", "ted_talks.rds"))
tidy_talks <- ted_talks %>% unnest_tokens(word, text)
shakespeare <- read_rds(here::here("data", "shakespeare.rds"))
tidy_shakespeare <- shakespeare %>%
group_by(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
nyt_headlines <- read_rds(here::here("data", "nyt_headlines.rds"))
tidy_nyt <- nyt_headlines %>%
mutate(id = row_number()) %>%
unnest_tokens(word, headline)
song_lyrics <- read_rds(here::here("data", "song_lyrics.rds"))
tidy_lyrics <- song_lyrics %>% unnest_tokens(word, lyrics)
```
## 1. Introduction

Text data sets are diverse and ubiquitous, and **tidy data principles** provide an approach to make text mining easier, more effective, and consistent with tools already in wide use. In this tutorial, you will develop your text mining skills using the [tidytext](https://juliasilge.github.io/tidytext/) package in R, along with other [tidyverse](https://www.tidyverse.org/) tools. You will apply these skills in several case studies, which will allow you to:
- practice important data handling skills,
- learn about the ways text analysis can be applied, and
- extract relevant insights from real-world data.
### Working through this tutorial
Throughout this tutorial, you will see code exercises that look like this:
```{r library-tidytext, exercise=TRUE}
# load the tidytext package
```
```{r library-tidytext-solution}
# load the tidytext package
library(tidytext)
```
```{r library-tidytext-check}
grade_code("Be sure to click \"Submit Answer \" on exercises throughout the tutorial because there are hints, answers, and other content available to you after you submit.")
```
You can type in these code exercises. **Give it a try now!** If you mess up, click "Start Over" to get back to the original state. Use the "Run Code" button to see what happens, and click on "Solution" to check out the solution.
In the exercise above, type `library(tidytext)` and click "Submit Answer".
This tutorial is organized into **four case studies**, each with its own data set:
- transcripts of TED talks
- a collection of comedies and tragedies by Shakespeare
- one month of newspaper headlines
- song lyrics spanning five decades
These case studies demonstrate how you can use text analysis techniques with diverse kinds of text!
### Prerequisites
To get the most from this tutorial, you should have some familiarity with R and [tidyverse](https://www.tidyverse.org/) functions like those from dplyr and ggplot2. If you have read [*R for Data Science*](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund, you are good to go!
## 2. Thank you for coming to my TED talk {data-progressive=TRUE}
<iframe src="https://giphy.com/embed/495RS6uWPjq008HACu" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/usnationalarchives-applause-hollywood-audience-495RS6uWPjq008HACu">via GIPHY</a></p>
The first case study of this tutorial uses a data set of TED talks created by Katherine M. Kinnaird and John Laudun for their paper ["TED Talks as Data"](https://doi.org/10.22148/16.042). The specific talks we are using are from the main TED event, and the data set was curated in the summer of 2018.
There are two main pieces of R software we will use in our text analysis work throughout this tutorial, the [tidyverse](https://www.tidyverse.org/) metapackage and [tidytext](https://juliasilge.github.io/tidytext/). To clarify for yourself what tools you are using, load the two packages below (first tidyverse, and then tidytext) by replacing the `___` with the package names.
```{r library-tidyverse, exercise=TRUE}
# load the tidyverse and tidytext packages
library(___)
library(___)
```
```{r library-tidyverse-solution}
# load the tidyverse and tidytext packages
library(tidyverse)
library(tidytext)
```
```{r library-tidyverse-check}
grade_code("The tidyverse metapackage contains dplyr, ggplot2, tidyr, and other packages for data science, and tidytext contains functions for text analysis using tidy data principles.")
```
### TED talk transcripts
The TED talk transcripts are available to you in a dataframe called `ted_talks`. There are three variables in this data set:
- `talk_id`: the identifier from the TED website for this particular talk
- `text`: the text of this TED talk
- `speaker`: the main or first listed speaker (some TED talks have more than one speaker)
```{r ted-talks, exercise=TRUE}
# glimpse `ted_talks` to see what is in the data set
glimpse(___)
```
```{r ted-talks-solution}
# glimpse `ted_talks` to see what is in the data set
glimpse(ted_talks)
```
```{r ted-talks-check}
grade_code()
```
### How to tidy text data
The `text` data is currently in a dataframe, but it is not **tidy** in the sense of being compatible with tidy tools. We need to transform it so that it is in a different format, with **one observation per row**.
When we do text analysis, the observations we are interested in aren't the whole talks at once, but rather individual _tokens_. A **token** is a meaningful unit of text for analysis; in many cases, this just means a single word. The process of **tokenization** identifies and breaks apart text into individual tokens. You can use [tidytext’s `unnest_tokens()` function](https://juliasilge.github.io/tidytext/reference/unnest_tokens.html) to accomplish all of this at once, both the tidying and the tokenization.
```{r tidy-talks, exercise=TRUE}
tidy_talks <- ted_talks %>%
___(word, text)
```
```{r tidy-talks-solution}
tidy_talks <- ted_talks %>%
unnest_tokens(word, text)
```
```{r tidy-talks-check}
grade_code("Think about the syntax of this function a bit, because you'll be using it over and over again. We piped in the original, non-tidy data set. We gave an argument that we want to tokenize _into_ and an argument that we are tokenizing _from_.")
```
### Tidy TED talks
The `unnest_tokens()` function transforms non-tidy text data into tidy text data. It takes three arguments:
- the input dataframe that contains your text (often you will use the pipe `%>%` to send this argument to `unnest_tokens()`),
- the output column that you want to unnest *to*, and
- the input column that you want to unnest *from*.
```{r display-tidy-ted, echo=TRUE}
tidy_talks
```
What did `unnest_tokens()` do here? Instead of having `r nrow(ted_talks)` rows and reading each talk across the line in `text`, we now have `r scales::comma(nrow(tidy_talks))` rows and can read each talk _down_ the column in `word`. We have **tokenized** and **tidied** the text, and the function has applied a few other transformations along the way:
- Other columns have been retained.
- Punctuation has been stripped out.
- Words have been converted to lower-case.
These are defaults in the function that can be changed, if not appropriate to your analysis.
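For example, here is a minimal sketch (not one of the graded exercises) of how you could override those defaults to keep capitalization and punctuation:
```{r unnest-defaults-sketch, echo=TRUE, eval=FALSE}
# keep the original case and punctuation instead of the defaults;
# `to_lower` is an argument of unnest_tokens() and `strip_punct`
# is passed along to the underlying word tokenizer
ted_talks %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
```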
### Tokenize to bigrams
We said before that tokenization is the process of identifying and breaking apart text into **tokens**, meaningful units of text; those meaningful units of text are most often single words in text analysis but they do not have to be! We can move beyond single words to other kinds of tokens, like **n-grams**. An n-gram is a consecutive sequence of `n` words. Let's look at these TED talks and tokenize to bigrams, n-grams of order 2.
- Use the same function for tokenizing and tidying to create the bigrams.
```{r ted-bigram, exercise=TRUE}
ted_bigrams <- ted_talks %>%
___(bigram, text, token = "ngrams", n = 2)
ted_bigrams
```
```{r ted-bigram-solution}
ted_bigrams <- ted_talks %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
ted_bigrams
```
```{r ted-bigram-check}
grade_code("The bigrams slide along the text to create overlapping sequences of two words. Notice what did and did not change from tokenizing to words. We used the `token` argument to specify a non-default tokenizer, and we specified the n-gram order of 2.")
```
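The same pattern extends to longer n-grams; here is a quick sketch (not a graded exercise) for trigrams:
```{r trigram-sketch, echo=TRUE, eval=FALSE}
# n-grams of order 3: overlapping sequences of three consecutive words
ted_talks %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```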
### Most common TED talk words
Let's go back to single words. Now that our data is in a tidy format, a whole world of analysis opportunity has opened up for us. We can start by computing term frequencies in just one line. What are the **most common words** in these TED talks?
- Use `count()` to find the most common words.
```{r top-ted, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
___(word, sort = TRUE)
```
```{r top-ted-solution}
tidy_talks %>%
count(word, sort = TRUE)
```
```{r top-ted-check}
grade_code(correct = "That was just one line, but also these are not very interesting words.")
```
### Removing stop words
Words like "the", "and", and "to" that aren't very interesting for a text analysis are called **stop words**. Often the best choice is to remove them. The tidytext package provides access to stop word lexicons, with a default list and then other options and other languages.
- First, run the code the way it is.
- Next, try out the `language` argument, which takes two-letter language abbreviations like `"es"`.
```{r display-stop-words, exercise=TRUE}
get_stopwords()
```
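As a sketch of what those options look like (feel free to paste these into the exercise above), `get_stopwords()` takes both a `language` and a `source` argument:
```{r stopwords-options-sketch, echo=TRUE, eval=FALSE}
# Spanish stop words from the default "snowball" source
get_stopwords(language = "es")
# a different English lexicon, the "smart" source
get_stopwords(source = "smart")
```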
When text data is in a tidy format, stop words can be removed using an [`anti_join()`](https://dplyr.tidyverse.org/reference/filter-joins.html). This filtering join removes the rows of the left-hand data set (our tidy text) that have a match in the right-hand side (the stop word list), keeping only the rows that do not match.
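Here is a tiny sketch with a made-up tibble, just to show what the filtering join does:
```{r anti-join-sketch, echo=TRUE, eval=FALSE}
# "the" and "and" appear in the stop word list, so they are dropped;
# "chimpanzees" and "forest" do not, so they are kept
tibble(word = c("the", "chimpanzees", "and", "forest")) %>%
  anti_join(get_stopwords(), by = "word")
```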
- Within the `anti_join()`, add the call to `get_stopwords()`.
- Add the arguments to `count()` so we count up words and sort them with the largest groups at the top.
```{r stop-word, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
anti_join(___) %>%
count(___)
```
```{r stop-word-solution}
tidy_talks %>%
anti_join(get_stopwords()) %>%
count(word, sort = TRUE)
```
```{r stop-word-check}
grade_code(correct = "These are now more interesting words and are starting to show the focus of TED talks.")
```
### Visualize top words
Because we are using tidyverse tools, we can fluently pipe from the kind of code we just wrote straight to ggplot2 functions. One of the significant benefits of using tidy data principles is consistency with widely-used tools that are broadly supported.
- Remove stop words (the default list) via an `anti_join()`.
- Create a plot with `n` on the x-axis and `word` on the y-axis.
```{r stop-word-viz, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
# remove stop words
___ %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 20) %>%
mutate(word = reorder(word, n)) %>%
# put `n` on the x-axis and `word` on the y-axis
ggplot(aes(___, ___)) +
geom_col()
```
```{r stop-word-viz-solution}
tidy_talks %>%
# remove stop words
anti_join(get_stopwords()) %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 20) %>%
mutate(word = reorder(word, n)) %>%
# put `n` on the x-axis and `word` on the y-axis
ggplot(aes(n, word)) +
geom_col()
```
```{r stop-word-viz-check}
grade_code(correct = "The word \"laughter\" is so high because the transcripts include it when the audience laughed.")
```
### Compare TED talk vocabularies
One of my favorite approaches to text analysis is to compare how different people or groups use language. There are lots of different ways to do this, but you can start with plain old word counts! Let's look at two TED talk speakers, [Jane Goodall](https://en.wikipedia.org/wiki/Jane_Goodall) and [Temple Grandin](https://en.wikipedia.org/wiki/Temple_Grandin), and count up the words they used in their TED talks.
*If you want to explore other speakers, switch in different speakers' names from the data set and click "Run Code" after finishing the exercise.*
- Use `filter()` to keep only the words spoken by Jane Goodall and Temple Grandin.
- Remove the default list of stop words.
- Use `count()` with two arguments to count up the term frequencies by `speaker` and `word`. (These first three steps could actually be completed in any order but this makes most sense to me.)
- Come back and `filter()` again to only keep words spoken at least 10 times by both women.
The function [`pivot_wider()` from tidyr](https://tidyr.tidyverse.org/reference/pivot_wider.html) pivots the long, tidy dataframe to a wide dataframe so we can more easily compare the two speakers' word counts.
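If `pivot_wider()` is new to you, here is a minimal sketch with made-up speakers and counts (not part of the exercise):
```{r pivot-wider-sketch, echo=TRUE, eval=FALSE}
# long, tidy counts for two hypothetical speakers...
tibble(speaker = c("A", "A", "B"),
       word = c("people", "animals", "people"),
       n = c(3, 2, 5)) %>%
  # ...become one row per word, with one column of counts per speaker;
  # `values_fill = 0` fills in words a speaker never used
  pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```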
```{r ted-final-pivot, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
___(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
# remove stop words
___ %>%
# count with two arguments
___ %>%
group_by(word) %>%
___(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```
```{r ted-final-pivot-solution}
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
# remove stop words
anti_join(get_stopwords()) %>%
# count with two arguments
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```
```{r ted-final-pivot-check}
grade_code(correct = "Using tidyverse functions lets us flexibly approach many kinds of text analysis questions.")
```
### Visualize vocabulary comparison
It's interesting to look at the numbers themselves, but this would be better as a **visualization**. We can put one speaker on the x-axis and the other on the y-axis to compare their word counts. Words that appear close to the `slope = 1` line are used about evenly by both speakers (since the talks were about the same length), and words far away from the line are used more by one speaker than the other.
- Load the [ggrepel](https://ggrepel.slowkow.com/) package, for better plotting of text labels.
- Use `geom_text_repel()` to place the text annotations on the plot.
```{r ted-final-viz, fig.width=6, fig.height=6, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
library(___)
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
anti_join(get_stopwords()) %>%
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0) %>%
ggplot(aes(`Jane Goodall`, `Temple Grandin`)) +
geom_abline(color = "gray50", size = 1.2, alpha = 0.8, lty = 2) +
# use the special ggrepel geom for nicer text plotting
___(aes(label = word)) +
coord_fixed()
```
```{r ted-final-viz-solution}
library(ggrepel)
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
anti_join(get_stopwords()) %>%
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0) %>%
ggplot(aes(`Jane Goodall`, `Temple Grandin`)) +
geom_abline(color = "gray50", size = 1.2, alpha = 0.8, lty = 2) +
# use the special ggrepel geom for nicer text plotting
geom_text_repel(aes(label = word)) +
coord_fixed()
```
```{r ted-final-viz-check}
grade_code(correct = "Jane Goodall spoke about chimpanzees while Temple Grandin spoke about autism, but both women spoke about people and being different.")
```
## 3. Shakespeare gets sentimental {data-progressive=TRUE}
> To tidy or not to tidy, that is the question.
<iframe src="https://giphy.com/embed/10pVeIaRAbekUw" width="480" height="336" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/pri-shakespeare-studio-360-10pVeIaRAbekUw">via GIPHY</a></p>
In the first case study of this tutorial, you learned about how to transform text data into a tidy format, with one token per row, and how to summarize and visualize the results. Our second case study uses a data set of classic plays by Shakespeare, made available by [Project Gutenberg](https://www.gutenberg.org/). We will again tidy the text, and we will learn how to implement **sentiment analysis**.
The data set is available for you to explore as a tibble called `shakespeare`, and there are three columns.
- The first column gives you the `title` of the play, such as _Much Ado about Nothing_.
- The next gives you the `genre` of play, either tragedy or comedy.
- The last column gives you `text` from that play, one line at a time.
### The game's afoot!
Modify the following code to find out how many lines of tragedy vs. comedy text we have:
```{r count-genre, exercise=TRUE, exercise.eval=TRUE}
shakespeare %>%
count(title)
```
```{r count-genre-solution}
shakespeare %>%
count(genre)
```
```{r count-genre-check}
grade_code("There are six plays in this data set, and we have slightly more total lines of text from the tragedies than from the comedies.")
```
### To be, or not to be
Now use `count()` with **two** variables (separate by a comma) so you can see which titles are tragedies and which are comedies:
```{r count-title-genre, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
shakespeare %>%
count(___)
```
```{r count-title-genre-solution}
shakespeare %>%
count(genre, title)
```
```{r count-title-genre-check}
grade_code(correct = "We have three comedies, and three tragedies.")
```
### Unnesting from text to word
The `shakespeare` data set is not yet compatible with tidy tools. You need to first break the text into individual tokens, like you learned in the first case study. There are **three** verbs to add to this analysis pipeline:
- group by `title`, so that you can add a line number to count up the lines spoken in each play
- tokenize and tidy the text using `unnest_tokens()`
- count the words so you can see which are most common in these Shakespeare plays
```{r tidy-shakespeare, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare <- shakespeare %>%
___(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
___(word, text)
tidy_shakespeare %>%
# count to find out how many times each word is used
___(word, sort = TRUE)
```
```{r tidy-shakespeare-solution}
tidy_shakespeare <- shakespeare %>%
group_by(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
tidy_shakespeare %>%
# count to find out how many times each word is used
count(word, sort = TRUE)
```
```{r tidy-shakespeare-check}
grade_code(correct = 'Notice how the most common words in the data frame are words like "the", "and", and "i" that are not particularly meaningful or specific to these plays. We could remove these stop words like we did in the first case study.')
```
### Sentiment lexicons
Sentiment analysis is a way to measure the attitudes and opinions expressed in text, and can be approached in multiple ways. A common approach is to use **sentiment lexicons**, lists of words that have been curated and scored in some way. Lexicons are typically created by NLP researchers, and some have licenses that restrict their use, for example in commercial settings. The `"bing"` lexicon of Hu and Liu (2004) is a general-purpose English lexicon (which can be used in commercial settings with attribution) that categorizes words as either positive or negative.
```{r lexicon, exercise=TRUE}
get_sentiments("bing")
```
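Since the lexicon is just a tidy data frame, you can explore it with tools you already know. For example, a quick ungraded sketch to see how many words fall into each category:
```{r bing-counts-sketch, echo=TRUE, eval=FALSE}
# count how many words the lexicon labels as positive vs. negative;
# the "bing" lexicon has roughly 2,000 positive and 4,800 negative words
get_sentiments("bing") %>%
  count(sentiment)
```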
Sentiment lexicons include many words, but some words are unlikely to be found in a sentiment lexicon or dictionary.
```{r sentiment-quiz}
question('Which of the following words is most unlikely to be found in a sentiment lexicon?',
answer("pessimism"),
answer("and", correct = TRUE),
answer("peace"),
answer("merry"),
answer("repulsion"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. Sentiment lexicons include words that express emotions and opinions, like this one. Which of these words is the most neutral?",
correct = "Correct! A word like \"and\" is neutral and unlikely to be included in a sentiment lexicon."
)
```
### Sentiment analysis via an `inner_join()`
When text data is in a tidy data structure like the one we are working with in this tutorial, with **one word per row**, you can perform sentiment analysis using an inner join, like the joins you may have worked with in SQL or in [dplyr](https://dplyr.tidyverse.org/). Remember that the sentiment lexicon you just saw is also a data frame with one word per row. The result of an inner join between the two data sets gives you all the words that are in both the text you are analyzing and the lexicon.
<div class = "row">
<div class = "col-md-6">
Let's say you have the following text data set, available in your environment as `text`:
```{r}
tibble(word = c("chocolate", "makes", "me", "happy",
"but", "this", "movie", "is", "sad"))
```
</div>
<div class = "col-md-6">
And the following sentiment lexicon, in your environment as `lexicon`:
```{r}
tibble(word = c("happy", "sad", "chocolate", "broccoli"))
```
</div>
</div>
```{r chocolate-quiz}
question('Which word will be in the output of an inner join between this text data set and this sentiment lexicon?',
answer("movie"),
answer("chocolate", correct = TRUE),
answer("broccoli"),
answer("me"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. This word will not be in the output of an inner join.",
correct = "Correct! The word \"chocolate\" is in both the text data set and the lexicon."
)
```
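In code, the join described above is just one line. A minimal sketch using the `text` and `lexicon` tibbles shown (not one of the graded exercises):
```{r inner-join-sketch, echo=TRUE, eval=FALSE}
# keep only rows whose `word` appears in both data sets:
# "chocolate", "happy", and "sad" survive the join
text %>%
  inner_join(lexicon, by = "word")
```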
Using tidy data principles for sentiment analysis is convenient not only for implementing the analysis itself, but for understanding the results in detail. Let's walk through what that can look like with the Shakespearean plays.
### Sentiment analysis of Shakespeare
After transforming the text of these Shakespearean plays to a tidy text data set in a previous exercise, the resulting data frame `tidy_shakespeare` is ready for sentiment analysis. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.
- Use the correct kind of join to implement sentiment analysis.
- Add the `"bing"` lexicon as the argument to the join function.
- Find how many positive and negative words each play has by using two arguments in `count()`.
```{r shakespeare-sentiment, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
shakespeare_sentiment <- tidy_shakespeare %>%
# implement sentiment analysis with the "bing" lexicon
___(get_sentiments(___))
shakespeare_sentiment %>%
# find how many positive/negative words each play has
___(___)
```
```{r shakespeare-sentiment-solution}
shakespeare_sentiment <- tidy_shakespeare %>%
# implement sentiment analysis with the "bing" lexicon
inner_join(get_sentiments("bing"))
shakespeare_sentiment %>%
# find how many positive/negative words each play has
count(title, sentiment)
```
```{r shakespeare-sentiment-check}
grade_code(correct = 'When text is in a tidy data structure, sentiment analysis can be implemented with an inner join.')
```
### Tragedy or comedy?
> But soft! What light through yonder window breaks? It is the number of sentiment words you just calculated, and you are on the path to comparing which plays have proportionally more positive or negative words.
Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?
**First,** calculate how many negative and positive words each play used.
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find the number of words for each combination of `title`, `genre,` and `sentiment`.
**Next,** find the percentage of negative words for each play.
- Group by the titles of the plays.
- Find the total number of words in each play using `sum()`.
- Calculate a `percent` for each play that is the number of words of each sentiment divided by the total words in that play.
- Filter the results for only `"negative"` sentiment.
```{r shakespeare-percent, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
sentiment_counts <- tidy_shakespeare %>%
# implement sentiment analysis using the "bing" lexicon
___ %>%
# count the number of words by title, genre, and sentiment
___
sentiment_counts %>%
group_by(___) %>%
# find the total number of words in each play
mutate(total = ___,
percent = n / total) %>%
# filter the results for only negative sentiment
filter(___) %>%
arrange(percent)
```
```{r shakespeare-percent-solution}
sentiment_counts <- tidy_shakespeare %>%
# implement sentiment analysis using the "bing" lexicon
inner_join(get_sentiments("bing")) %>%
# count the number of words by title, genre, and sentiment
count(title, genre, sentiment)
sentiment_counts %>%
group_by(title) %>%
# find the total number of words in each play
mutate(total = sum(n),
percent = n / total) %>%
# filter the results for only negative sentiment
filter(sentiment == "negative") %>%
arrange(percent)
```
```{r shakespeare-percent-check}
grade_code(correct = "Looking at the `percent` column of your output, you can see that tragedies do in fact have a higher percentage of negative words.")
```
### Most common positive and negative words
You found in the previous exercise that Shakespeare's tragedies use proportionally more negative words than the comedies. Now you can explore which specific words are driving these sentiment scores. Which are the most common **positive** and **negative** words in these plays?
There are **three** steps in the code in this exercise. The first step counts how many times each word is used, the second step takes the top 10 most-used positive and negative words, and the final step makes a plot to visualize this result.
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find word counts by `sentiment`.
- Group by `sentiment` so you can take the top 10 words in each sentiment.
- Notice what the line `mutate(word = reorder(word, n))` does; it converts `word` from a character that would be plotted in alphabetical order to a factor that will be plotted in order of `n` (see the small sketch below).
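A tiny sketch with made-up words and counts shows what `reorder()` does:
```{r reorder-sketch, echo=TRUE, eval=FALSE}
# as a character column, `word` would plot alphabetically;
# reorder() turns it into a factor whose levels follow `n`,
# so the bars in the plot end up sorted by count
tibble(word = c("kiwi", "banana", "apple"), n = c(2, 5, 3)) %>%
  mutate(word = reorder(word, n))
```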
Now you can make a visualization of `top_words` using [ggplot2](https://ggplot2.tidyverse.org/) to see these results.
- Put `n` on the x-axis and `word` on the y-axis.
- Use `geom_col()` to make a bar chart.
```{r shakespeare-common-plot, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
word_counts <- tidy_shakespeare %>%
___ %>%
# count by word and sentiment
___
top_words <- word_counts %>%
# group by sentiment
___ %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n))
ggplot(___, aes(___, ___, fill = sentiment)) +
___(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free")
```
```{r shakespeare-common-plot-solution}
word_counts <- tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
# count by word and sentiment
count(word, sentiment)
top_words <- word_counts %>%
# group by sentiment
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n))
ggplot(top_words, aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free")
```
```{r shakespeare-common-plot-check}
grade_code(correct = "Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare's time? Do you see a word that the lexicon has misidentified?")
```
### Which word was misidentified?
In the last exercise, you found the top 10 words that contributed most to negative sentiment in these Shakespearean plays, but lexicons are not always foolproof tools for use with all kinds of text.
```{r shakespeare-quiz}
question('Which of those top 10 "negative" words used by Shakespeare was misidentified as negative by the sentiment lexicon?',
answer("death"),
answer("wilt", correct = TRUE),
answer("poor"),
answer("fear"),
answer("die"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. That is a negative word correctly identified by the sentiment lexicon. Which one is an archaic version of a different word used today?",
correct = "The word \"wilt\" was used differently in Shakespeare's time and was not negative; the lexicon has misidentified it. For example, from *Romeo and Juliet*, \"For thou wilt lie upon the wings of night\". It is important to explore the details of how words were scored when performing sentiment analyses."
)
```
### Sentiment changes through a play
In the last set of exercises in this case study, you will examine how sentiment changes through the narrative arcs of these Shakespearean plays. We will start by implementing sentiment analysis using `inner_join()`, and then use `count()` with **four** arguments:
- `title`,
- `genre`,
- an `index` that will section together lines of the play, and
- `sentiment`.
After these lines of code, you will have the number of positive and negative words used in each `index`-ed section of the play. These sections will be 70 lines long in your analysis here.
_You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines works well for these plays._
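To see what the integer division does, here is a tiny sketch (not a graded exercise) with a handful of made-up line numbers:
```{r integer-division-sketch, echo=TRUE, eval=FALSE}
# lines 1-69 land in chunk 0, lines 70-139 in chunk 1, and so on
c(1, 69, 70, 139, 140) %/% 70
#> [1] 0 0 1 1 2
```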
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find the number of words for each sentiment used in each play in sections, using four arguments.
- The first argument for `count()` maps to the plays themselves.
- The second argument keeps track of whether the play is a comedy or tragedy.
- The third argument is defined by you; call it `index` and set it equal to `linenumber %/% 70`. This `index` makes chunks of text that are 70 lines long using integer division (`%/%`).
- The fourth argument maps to the different `sentiment` categories.
```{r shakespeare-count-four, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare %>%
# implement sentiment analysis using "bing" lexicon
___ %>%
# count using four arguments
___
```
```{r shakespeare-count-four-solution}
tidy_shakespeare %>%
# implement sentiment analysis using "bing" lexicon
inner_join(get_sentiments("bing")) %>%
# count using four arguments
count(title, genre, index = linenumber %/% 70, sentiment)
```
```{r shakespeare-count-four-check}
grade_code(correct = "This is the first step in looking at narrative arcs. Once more unto the breach, dear friends!")
```
### Visualizing narrative arcs
Now you will build on the code from the previous exercise and continue to move forward to see how sentiment changes through these Shakespearean plays.
- Use `pivot_wider()` from [tidyr](https://tidyr.tidyverse.org/) to pivot `sentiment` and `n` to multiple columns.
- Make a new column using `mutate()` that computes the net sentiment by subtracting `negative` sentiment from `positive`.
- Put `index` on the x-axis, `sentiment` on the y-axis, and use `genre` for `fill`.
- Use `geom_col()` to make a bar chart.
- Call `facet_wrap()` to make a separate panel for each title.
***SPOILER ALERT:*** *It doesn't look like Romeo and Juliet are going to have a happy ending!!!*
```{r shakespeare-arcs, out.width='100%', fig.width=8, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, genre, index = linenumber %/% 70, sentiment) %>%
# pivot sentiment and n wider
___(names_from = sentiment, values_from = n, values_fill = 0) %>%
# use mutate to compute net sentiment
___(sentiment = ___ - ___) %>%
# put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
ggplot(aes(___)) +
# make a bar chart with geom_col()
___ +
# make small multiples for each title with facet_wrap()
___(~ title, scales = "free_x")
```
```{r shakespeare-arcs-solution}
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, genre, index = linenumber %/% 70, sentiment) %>%
# pivot sentiment and n wider
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
# use mutate to compute net sentiment
mutate(sentiment = positive - negative) %>%
# put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
ggplot(aes(index, sentiment, fill = genre)) +
# make a bar chart with geom_col()
geom_col() +
# make small multiples for each title with facet_wrap()
facet_wrap(~ title, scales = "free_x")
```
```{r shakespeare-arcs-check}
grade_code(correct = "These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment overall than the tragedies.")
```
## 4. Newspaper headlines {data-progressive=TRUE}
<iframe src="https://giphy.com/embed/7IW6Jnw29TYmgkuu3M" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/usnationalarchives-paper-newspaper-printing-press-7IW6Jnw29TYmgkuu3M">via GIPHY</a></p>
You have already learned a lot in the first two case studies of this tutorial, from how to transform text to a tidy format to how to use sentiment lexicons. In this third case study, we will learn some new ways to analyze text, using a data set of newspaper headlines. These particular headlines are from the *New York Times* during November 2020.
[Data provided by The New York Times](https://developer.nytimes.com/)
The *New York Times* is a busy paper; there were over 5,000 headlines during this period! The data set we'll be working with has two variables:
- `headline`, the text of the headline as published by the NYT
- `section`, the main section of the paper that the article with this headline was published in
Headlines are interesting for many reasons. One important one is that news organizations in general and the *New York Times* in particular have come under criticism for their practices in crafting headlines, from excessive use of passive voice when reporting killings by the police to how outright lies by politicians are handled in headlines. Text data is often fraught and complex, and that is certainly true for newspaper headlines!
Use `count()` with `sort = TRUE` so you can see how many different values for `section` there are in this data set from November 2020 and which had the most headlines.
```{r count-section, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
nyt_headlines %>%
___
```
```{r count-section-solution}
nyt_headlines %>%
count(section, sort = TRUE)
```
```{r count-section-check}
grade_code("November 2020 was a busy month in the United States! The next most used categories in this data set are headlines about world news and the opinion page.")
```
### Tidying newspaper headlines
This data set contains over 5,000 newspaper headlines from November 2020. The headlines are all in one column, so they are not yet in a tidy format, ready for our tidy set of tools. It's time for you to tidy these headlines!
- Pipe the `nyt_headlines` object to the next line.
- Use `unnest_tokens()` to tokenize and tidy the `headline` column into a new `word` column.
We also have added a headline `id` column via `mutate()` so we can keep track of which word belongs to which headline.
```{r tidy-headlines, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
# pipe `nyt_headlines` to the next line
tidy_nyt <- ___ %>%
mutate(id = row_number()) %>%
# transform the `headline` column to a `word` column
___
tidy_nyt
```
```{r tidy-headlines-solution}
# pipe `nyt_headlines` to the next line
tidy_nyt <- nyt_headlines %>%
mutate(id = row_number()) %>%
# transform the `headline` column to a `word` column
unnest_tokens(word, headline)
tidy_nyt
```
```{r tidy-headlines-check}
grade_code("This is the third time you have tidied a text data set in this tutorial. Notice that you use whatever column name the text happens to be in. You also can give the output whatever name is most convenient for you; it does not necessarily have to be `word`.")
```
### Most common headline words
What are the most common words in these headlines? Analyzing headlines is not quite like analyzing other kinds of text, but to be honest, that turns out to be true a lot of the time. Many text data sets have **specific characteristics** (document length, vocabulary, semantic patterns, etc.) that make them stand out compared to other data sets, sometimes even when you look at just the most common words.
- Use `count()` to tally up the words.
- Set `sort = TRUE` to order by the count, instead of alphabetically.
```{r top-nyt, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_nyt %>%
___
```
```{r top-nyt-solution}
tidy_nyt %>%
count(word, sort = TRUE)
```
```{r top-nyt-check}
grade_code(correct = "The word \"election\" is used more in this data set than the word \"and\"! This is pretty wild.")
```
### Removing stop words
The overall top words tell us a lot about this corpus of headlines, but let's remove the default list of stop words to get a better idea of what was being reported at the NYT in November 2020. Remember from the first case study that you remove stop words using an `anti_join()`.
```{r nyt-stop-words, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_nyt %>%
___ %>%
count(word, sort = TRUE)
```
```{r nyt-stop-words-solution}
tidy_nyt %>%
anti_join(get_stopwords()) %>%
count(word, sort = TRUE)
```
```{r nyt-stop-words-check}
grade_code(correct = "In November 2020, election results were in headlines at a much higher rate than coronavirus.")
```
### What are the newspaper sections about?
A central question in text mining is how to quantify what different documents or categories of documents are about. One approach to measuring this is to use _tf-idf_, term frequency-inverse document frequency. A word's **term frequency** is how frequently it occurs in a document. A word's **inverse document frequency** is a weight that decreases for commonly used words and increases for words that are rarely used in a collection of documents:
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
If you multiply the two together, you get tf-idf, a statistic intended to measure how important a word is to a document in a collection (or corpus) of documents. The [`bind_tf_idf()` function from tidytext](https://juliasilge.github.io/tidytext/reference/bind_tf_idf.html) can compute tf-idf for text data in a tidy format.
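As a quick worked example with made-up numbers (not from the headline data): suppose a corpus of 5,000 headlines and a 10-word headline in which a word appears once, while that word shows up in only 5 headlines overall.
```{r tf-idf-arithmetic-sketch, echo=TRUE, eval=FALSE}
tf <- 1 / 10          # term frequency: 1 use out of 10 words in the headline
idf <- log(5000 / 5)  # inverse document frequency: ln(1000), about 6.9
tf * idf              # tf-idf of about 0.69; words rare across documents score higher
```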
Let's look at this in the context of the NYT headlines. In our case study here, each headline is a document, and the corpus is the whole month of headlines.
- Start by counting up the words in the `tidy_nyt` data, using three arguments: `section` (to keep track of the newspaper sections), `id` for each headline, and `word`.
- Specify the appropriate three arguments for `bind_tf_idf()`:
- The first is the token or term, `word`, and the second is the document, the headline `id`.
- The last argument is the column that contains the document-term counts, which is `n`, the output of `count()`.
Because of the line `arrange(-tf_idf)`, the output shows the _highest_ tf-idf words in the whole data set of headlines, the most distinctive words found.
```{r nyt-tf-idf, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
nyt_tf_idf <- tidy_nyt %>%
# count with three arguments
___ %>%
# there are also three arguments for `bind_tf_idf()`
bind_tf_idf(___) %>%
arrange(-tf_idf)
nyt_tf_idf
```
```{r nyt-tf-idf-solution}
nyt_tf_idf <- tidy_nyt %>%
# count with three arguments
count(section, id, word) %>%
# there are also three arguments for `bind_tf_idf()`
bind_tf_idf(word, id, n) %>%
arrange(-tf_idf)
nyt_tf_idf
```
```{r nyt-tf-idf-check}
grade_code(correct = "The highest tf-idf words also have both `n` and term frequency equal to 1, meaning that these were headlines that only had a single word in them. Not all the headlines in this data set are what we would think of as traditional headlines.")
```
### Visualize tf-idf
Let's use visualization to understand the high tf-idf words better. Let's connect the highest tf-idf words for each headline to the **sections** of the newspaper and look at the top 10 highest tf-idf headline words for four different newspaper sections.
The following code looks at the food, opinion, and technology sections, along with [The Upshot](https://www.nytimes.com/section/upshot).
_Once you complete the code, you can switch these out for any of the other sections you found in the first exercise in this case study (just use "Run Code" instead of "Submit Answer")._
- Put `tf_idf` on the x-axis, and `word` on the y-axis.
- Use `facet_wrap()` to make small multiples for each newspaper section.
```{r nyt-tf-idf-prep}
nyt_tf_idf <- tidy_nyt %>%
count(section, id, word) %>%
bind_tf_idf(word, id, n) %>%
arrange(-tf_idf)
```
```{r nyt-tf-idf-viz, out.width='100%', fig.height=6, fig.width=9, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE, exercise.setup = "nyt-tf-idf-prep"}
sections <- c("Food", "Opinion", "Technology", "The Upshot")
nyt_tf_idf %>%
filter(!near(tf, 1)) %>%
filter(section %in% sections) %>%
group_by(section) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
# put `tf_idf` on the x-axis, `word` on the y-axis
ggplot(aes(___, fill = section)) +
geom_col(show.legend = FALSE) +
# make small multiples
___(~ section, ncol = 2, scales = "free")
```
```{r nyt-tf-idf-viz-solution}
sections <- c("Food", "Opinion", "Technology", "The Upshot")
nyt_tf_idf %>%
filter(!near(tf, 1)) %>%
filter(section %in% sections) %>%
group_by(section) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
# put `tf_idf` on the x-axis, `word` on the y-axis
ggplot(aes(tf_idf, word, fill = section)) +
geom_col(show.legend = FALSE) +
# make small multiples
facet_wrap(~ section, ncol = 2, scales = "free")
```
```{r nyt-tf-idf-viz-check}
grade_code(correct = "Words about IPOs were more characteristic of the technology section while words about stews and pasta were more characteristic of the food section. Also notice that we filtered out those headlines with one word.")
```
### Tokenize headlines to bigrams