---
title: "Text mining with tidy data principles"
author: "Julia Silge"
output:
  learnr::tutorial:
    allow_skip: true
runtime: shiny_prerendered
description: "Learn how to use tidytext!"
---
<style>
@import url('https://fonts.googleapis.com/css2?family=Roboto+Mono&family=Open+Sans+Condensed&family=Opens+Sans:wght@300;400&display=swap');
</style>
```{r custom-css, echo=FALSE}
shiny::includeCSS("css/custom_css.css")
```
```{r setup, include=FALSE}
library(learnr)
library(tidyverse)
library(tidytext)
library(stopwords)
library(ggrepel)
library(scales)
library(plotly)
library(here)
library(gradethis)
knitr::opts_chunk$set(echo = FALSE, exercise.checker = gradethis::grade_learnr)
theme_set(theme_light())
ted_talks <- read_rds(here::here("data", "ted_talks.rds"))
tidy_talks <- ted_talks %>% unnest_tokens(word, text)
shakespeare <- read_rds(here::here("data", "shakespeare.rds"))
tidy_shakespeare <- shakespeare %>%
group_by(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
nyt_headlines <- read_rds(here::here("data", "nyt_headlines.rds"))
tidy_nyt <- nyt_headlines %>%
mutate(id = row_number()) %>%
unnest_tokens(word, headline)
song_lyrics <- read_rds(here::here("data", "song_lyrics.rds"))
tidy_lyrics <- song_lyrics %>% unnest_tokens(word, lyrics)
```
## 1. Introduction

Text data sets are diverse and ubiquitous, and **tidy data principles** provide an approach to make text mining easier, more effective, and consistent with tools already in wide use. In this tutorial, you will develop your text mining skills using the [tidytext](https://juliasilge.github.io/tidytext/) package in R, along with other [tidyverse](https://www.tidyverse.org/) tools. You will apply these skills in several case studies, which will allow you to:
- practice important data handling skills,
- learn about the ways text analysis can be applied, and
- extract relevant insights from real-world data.
### Working through this tutorial
Throughout this tutorial, you will see code exercises that look like this:
```{r library-tidytext, exercise=TRUE}
# load the tidytext package
```
```{r library-tidytext-solution}
# load the tidytext package
library(tidytext)
```
```{r library-tidytext-check}
grade_code("Be sure to click \"Submit Answer \" on exercises throughout the tutorial because there are hints, answers, and other content available to you after you submit.")
```
You can type in these code exercises. **Give it a try now!** If you mess up, click "Start Over" to get back to the original state. Use the "Run Code" button to see what happens, and click on "Solution" to check out the solution.
In the exercise above, type `library(tidytext)` and click "Submit Answer".
This tutorial is organized into **four case studies**, each with its own data set:
- transcripts of TED talks
- a collection of comedies and tragedies by Shakespeare
- one month of newspaper headlines
- song lyrics spanning five decades
These case studies demonstrate how you can use text analysis techniques with diverse kinds of text!
### Prerequisites
To get the most from this tutorial, you should have some familiarity with R and [tidyverse](https://www.tidyverse.org/) functions like those from dplyr and ggplot2. If you have read [*R for Data Science*](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund, you are good to go!
## 2. Thank you for coming to my TED talk {data-progressive=TRUE}
<iframe src="https://giphy.com/embed/495RS6uWPjq008HACu" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/usnationalarchives-applause-hollywood-audience-495RS6uWPjq008HACu">via GIPHY</a></p>
The first case study of this tutorial uses a data set of TED talks created by Katherine M. Kinnaird and John Laudun for their paper ["TED Talks as Data"](https://doi.org/10.22148/16.042). The specific talks we are using are from the main TED event, and the data set was curated in the summer of 2018.
There are two main pieces of R software we will use in our text analysis work throughout this tutorial, the [tidyverse](https://www.tidyverse.org/) metapackage and [tidytext](https://juliasilge.github.io/tidytext/). To clarify for yourself what tools you are using, load the two packages below (first tidyverse, and then tidytext) by replacing the `___` with the package names.
```{r library-tidyverse, exercise=TRUE}
# load the tidyverse and tidytext packages
library(___)
library(___)
```
```{r library-tidyverse-solution}
# load the tidyverse and tidytext packages
library(tidyverse)
library(tidytext)
```
```{r library-tidyverse-check}
grade_code("The tidyverse metapackage contains dplyr, ggplot2, tidyr, and other packages for data science, and tidytext contains functions for text analysis using tidy data principles.")
```
### TED talk transcripts
The TED talk transcripts are available to you in a dataframe called `ted_talks`. There are three variables in this data set:
- `talk_id`: the identifier from the TED website for this particular talk
- `text`: the text of this TED talk
- `speaker`: the main or first listed speaker (some TED talks have more than one speaker)
```{r ted-talks, exercise=TRUE}
# glimpse `ted_talks` to see what is in the data set
glimpse(___)
```
```{r ted-talks-solution}
# glimpse `ted_talks` to see what is in the data set
glimpse(ted_talks)
```
```{r ted-talks-check}
grade_code()
```
### How to tidy text data
The `text` data is currently in a dataframe, but it is not **tidy** in the sense of being compatible with tidy tools. We need to transform it so that it is in a different format, with **one observation per row**.
When we do text analysis, the observations we are interested in aren't the whole talks at once, but rather individual _tokens_. A **token** is a meaningful unit of text for analysis; in many cases, this just means a single word. The process of **tokenization** identifies and breaks apart text into individual tokens. You can use [tidytext’s `unnest_tokens()` function](https://juliasilge.github.io/tidytext/reference/unnest_tokens.html) to accomplish all of this at once, both the tidying and the tokenization.
```{r tidy-talks, exercise=TRUE}
tidy_talks <- ted_talks %>%
___(word, text)
```
```{r tidy-talks-solution}
tidy_talks <- ted_talks %>%
unnest_tokens(word, text)
```
```{r tidy-talks-check}
grade_code("Think about the syntax of this function a bit, because you'll be using it over and over again. We piped in the original, non-tidy data set. We gave an argument that we want to tokenize _into_ and an argument that we are tokenizing _from_.")
```
### Tidy TED talks
The `unnest_tokens()` function transforms non-tidy text data into tidy text data. It takes three arguments:
- the input dataframe that contains your text (often you will use the pipe `%>%` to send this argument to `unnest_tokens()`),
- the output column that you want to unnest *to*, and
- the input column that you want to unnest *from*.
```{r display-tidy-ted, echo=TRUE}
tidy_talks
```
What did `unnest_tokens()` do here? Instead of having `r nrow(ted_talks)` rows and reading each talk across the line in `text`, we now have `r scales::comma(nrow(tidy_talks))` rows and can read each talk _down_ the column in `word`. We have **tokenized** and **tidied** the text, and the function has applied a few other transformations along the way:
- Other columns have been retained.
- Punctuation has been stripped out.
- Words have been converted to lower-case.
These are defaults in the function that can be changed, if not appropriate to your analysis.
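For example, here is a minimal sketch (not one of the graded exercises) of how you could override those defaults to keep capitalization and punctuation:
```{r unnest-defaults-sketch, echo=TRUE, eval=FALSE}
# keep the original case and punctuation instead of the defaults;
# `to_lower` is an argument of unnest_tokens() and `strip_punct`
# is passed along to the underlying word tokenizer
ted_talks %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
```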
### Tokenize to bigrams
We said before that tokenization is the process of identifying and breaking apart text into **tokens**, meaningful units of text; those meaningful units of text are most often single words in text analysis but they do not have to be! We can move beyond single words to other kinds of tokens, like **n-grams**. An n-gram is a consecutive sequence of `n` words. Let's look at these TED talks and tokenize to bigrams, n-grams of order 2.
- Use the same function for tokenizing and tidying to create the bigrams.
```{r ted-bigram, exercise=TRUE}
ted_bigrams <- ted_talks %>%
___(bigram, text, token = "ngrams", n = 2)
ted_bigrams
```
```{r ted-bigram-solution}
ted_bigrams <- ted_talks %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
ted_bigrams
```
```{r ted-bigram-check}
grade_code("The bigrams slide along the text to create overlapping sequences of two words. Notice what did and did not change from tokenizing to words. We used the `token` argument to specify a non-default tokenizer, and we specified the n-gram order of 2.")
```
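The same pattern extends to longer n-grams; here is a quick sketch (not a graded exercise) for trigrams:
```{r trigram-sketch, echo=TRUE, eval=FALSE}
# n-grams of order 3: overlapping sequences of three consecutive words
ted_talks %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```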
### Most common TED talk words
Let's go back to single words. Now that our data is in a tidy format, a whole world of analysis opportunity has opened up for us. We can start by computing term frequencies in just one line. What are the **most common words** in these TED talks?
- Use `count()` to find the most common words.
```{r top-ted, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
___(word, sort = TRUE)
```
```{r top-ted-solution}
tidy_talks %>%
count(word, sort = TRUE)
```
```{r top-ted-check}
grade_code(correct = "That was just one line, but also these are not very interesting words.")
```
### Removing stop words
Words like "the", "and", and "to" that aren't very interesting for a text analysis are called **stop words**. Often the best choice is to remove them. The tidytext package provides access to stop word lexicons, with a default list and then other options and other languages.
- First, run the code the way it is.
- Next, try out the `language` argument, which takes two-letter language abbreviations like `"es"`.
```{r display-stop-words, exercise=TRUE}
get_stopwords()
```
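As a sketch of what those options look like (feel free to paste these into the exercise above), `get_stopwords()` takes both a `language` and a `source` argument:
```{r stopwords-options-sketch, echo=TRUE, eval=FALSE}
# Spanish stop words from the default "snowball" source
get_stopwords(language = "es")
# a different English lexicon, the "smart" source
get_stopwords(source = "smart")
```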
When text data is in a tidy format, stop words can be removed using an [`anti_join()`](https://dplyr.tidyverse.org/reference/filter-joins.html). This filtering join removes the rows of the left-hand data set (our tidy text) that have a match in the right-hand side (the stop word list), keeping only the rows that do not match.
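Here is a tiny sketch with a made-up tibble, just to show what the filtering join does:
```{r anti-join-sketch, echo=TRUE, eval=FALSE}
# "the" and "and" appear in the stop word list, so they are dropped;
# "chimpanzees" and "forest" do not, so they are kept
tibble(word = c("the", "chimpanzees", "and", "forest")) %>%
  anti_join(get_stopwords(), by = "word")
```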
- Within the `anti_join()`, add the call to `get_stopwords()`.
- Add the arguments to `count()` so we count up words and sort them with the largest groups at the top.
```{r stop-word, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
anti_join(___) %>%
count(___)
```
```{r stop-word-solution}
tidy_talks %>%
anti_join(get_stopwords()) %>%
count(word, sort = TRUE)
```
```{r stop-word-check}
grade_code(correct = "These are now more interesting words and are starting to show the focus of TED talks.")
```
### Visualize top words
Because we are using tidyverse tools, we can fluently pipe from the kind of code we just wrote straight to ggplot2 functions. One of the significant benefits of using tidy data principles is consistency with widely-used tools that are broadly supported.
- Remove stop words (the default list) via an `anti_join()`.
- Create a plot with `n` on the x-axis and `word` on the y-axis.
```{r stop-word-viz, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
# remove stop words
___ %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 20) %>%
mutate(word = reorder(word, n)) %>%
# put `n` on the x-axis and `word` on the y-axis
ggplot(aes(___, ___)) +
geom_col()
```
```{r stop-word-viz-solution}
tidy_talks %>%
# remove stop words
anti_join(get_stopwords()) %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 20) %>%
mutate(word = reorder(word, n)) %>%
# put `n` on the x-axis and `word` on the y-axis
ggplot(aes(n, word)) +
geom_col()
```
```{r stop-word-viz-check}
grade_code(correct = "The word \"laughter\" is so high because the transcripts include it when the audience laughed.")
```
### Compare TED talk vocabularies
One of my favorite approaches to text analysis is to compare how different people or groups use language. There are lots of different ways to do this, but you can start with plain old word counts! Let's look at two TED talk speakers, [Jane Goodall](https://en.wikipedia.org/wiki/Jane_Goodall) and [Temple Grandin](https://en.wikipedia.org/wiki/Temple_Grandin), and count up the words they used in their TED talks.
*If you want to explore other speakers, switch in different speakers' names from the data set and click "Run Code" after finishing the exercise.*
- Use `filter()` to keep only the words spoken by Jane Goodall and Temple Grandin.
- Remove the default list of stop words.
- Use `count()` with two arguments to count up the term frequencies by `speaker` and `word`. (These first three steps could actually be completed in any order but this makes most sense to me.)
- Come back and `filter()` again to only keep words spoken at least 10 times by both women.
The function [`pivot_wider()` from tidyr](https://tidyr.tidyverse.org/reference/pivot_wider.html) pivots the long, tidy dataframe to a wide dataframe so we can more easily compare the two speakers' word counts.
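If `pivot_wider()` is new to you, here is a minimal sketch with made-up speakers and counts (not part of the exercise):
```{r pivot-wider-sketch, echo=TRUE, eval=FALSE}
# long, tidy counts for two hypothetical speakers...
tibble(speaker = c("A", "A", "B"),
       word = c("people", "animals", "people"),
       n = c(3, 2, 5)) %>%
  # ...become one row per word, with one column of counts per speaker;
  # `values_fill = 0` fills in words a speaker never used
  pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```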
```{r ted-final-pivot, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_talks %>%
___(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
# remove stop words
___ %>%
# count with two arguments
___ %>%
group_by(word) %>%
___(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```
```{r ted-final-pivot-solution}
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
# remove stop words
anti_join(get_stopwords()) %>%
# count with two arguments
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0)
```
```{r ted-final-pivot-check}
grade_code(correct = "Using tidyverse functions lets us flexibly approach many kinds of text analysis questions.")
```
### Visualize vocabulary comparison
It's interesting to look at the numbers themselves, but this would be better as a **visualization**. We can put one speaker on the x-axis and the other on the y-axis to compare their word counts. Words that appear close to the `slope = 1` line are used about evenly by both speakers (since the talks were about the same length), and words far away from the line are used more by one speaker than the other.
- Load the [ggrepel](https://ggrepel.slowkow.com/) package, for better plotting of text labels.
- Use `geom_text_repel()` to place the text annotations on the plot.
```{r ted-final-viz, fig.width=6, fig.height=6, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
library(___)
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
anti_join(get_stopwords()) %>%
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0) %>%
ggplot(aes(`Jane Goodall`, `Temple Grandin`)) +
geom_abline(color = "gray50", size = 1.2, alpha = 0.8, lty = 2) +
# use the special ggrepel geom for nicer text plotting
___(aes(label = word)) +
coord_fixed()
```
```{r ted-final-viz-solution}
library(ggrepel)
tidy_talks %>%
filter(speaker %in% c("Jane Goodall", "Temple Grandin")) %>%
anti_join(get_stopwords()) %>%
count(speaker, word) %>%
group_by(word) %>%
filter(sum(n) > 10) %>%
ungroup() %>%
pivot_wider(names_from = "speaker", values_from = "n", values_fill = 0) %>%
ggplot(aes(`Jane Goodall`, `Temple Grandin`)) +
geom_abline(color = "gray50", size = 1.2, alpha = 0.8, lty = 2) +
# use the special ggrepel geom for nicer text plotting
geom_text_repel(aes(label = word)) +
coord_fixed()
```
```{r ted-final-viz-check}
grade_code(correct = "Jane Goodall spoke about chimpanzees while Temple Grandin spoke about autism, but both women spoke about people and being different.")
```
## 3. Shakespeare gets sentimental {data-progressive=TRUE}
> To tidy or not to tidy, that is the question.
<iframe src="https://giphy.com/embed/10pVeIaRAbekUw" width="480" height="336" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/pri-shakespeare-studio-360-10pVeIaRAbekUw">via GIPHY</a></p>
In the first case study of this tutorial, you learned about how to transform text data into a tidy format, with one token per row, and how to summarize and visualize the results. Our second case study uses a data set of classic plays by Shakespeare, made available by [Project Gutenberg](https://www.gutenberg.org/). We will again tidy the text, and we will learn how to implement **sentiment analysis**.
The data set is available for you to explore as a tibble called `shakespeare`, and there are three columns.
- The first column gives you the `title` of the play, such as _Much Ado about Nothing_.
- The next gives you the `genre` of play, either tragedy or comedy.
- The last column gives you `text` from that play, one line at a time.
### The game's afoot!
Modify the following code to find out how many lines of tragedy vs. comedy text we have:
```{r count-genre, exercise=TRUE, exercise.eval=TRUE}
shakespeare %>%
count(title)
```
```{r count-genre-solution}
shakespeare %>%
count(genre)
```
```{r count-genre-check}
grade_code("There are six plays in this data set, and we have slightly more total lines of text from the tragedies than from the comedies.")
```
### To be, or not to be
Now use `count()` with **two** variables (separate by a comma) so you can see which titles are tragedies and which are comedies:
```{r count-title-genre, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
shakespeare %>%
count(___)
```
```{r count-title-genre-solution}
shakespeare %>%
count(genre, title)
```
```{r count-title-genre-check}
grade_code(correct = "We have three comedies, and three tragedies.")
```
### Unnesting from text to word
The `shakespeare` data set is not yet compatible with tidy tools. You need to first break the text into individual tokens, like you learned in the first case study. There are **three** verbs to add to this analysis pipeline:
- group by `title`, so that you can add a line number to count up the lines spoken in each play
- tokenize and tidy the text using `unnest_tokens()`
- count the words so you can see which are most common in these Shakespeare plays
```{r tidy-shakespeare, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare <- shakespeare %>%
___(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
___(word, text)
tidy_shakespeare %>%
# count to find out how many times each word is used
___(word, sort = TRUE)
```
```{r tidy-shakespeare-solution}
tidy_shakespeare <- shakespeare %>%
group_by(title) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
tidy_shakespeare %>%
# count to find out how many times each word is used
count(word, sort = TRUE)
```
```{r tidy-shakespeare-check}
grade_code(correct = 'Notice how the most common words in the data frame are words like "the", "and", and "i" that are not particularly meaningful or specific to these plays. We could remove these stop words like we did in the first case study.')
```
### Sentiment lexicons
Sentiment analysis is a way to measure the attitudes and opinions expressed in text, and can be approached in multiple ways. A common approach is to use **sentiment lexicons**, lists of words that have been curated and scored in some way. Lexicons are typically created by NLP researchers, and some have licenses that restrict their use, for example in commercial settings. The `"bing"` lexicon of Hu and Liu (2004) is a general-purpose English lexicon (which can be used in commercial settings with attribution) that categorizes words as either positive or negative.
```{r lexicon, exercise=TRUE}
get_sentiments("bing")
```
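Since the lexicon is just a tidy data frame, you can explore it with tools you already know. For example, a quick ungraded sketch to see how many words fall into each category:
```{r bing-counts-sketch, echo=TRUE, eval=FALSE}
# count how many words the lexicon labels as positive vs. negative;
# the "bing" lexicon has roughly 2,000 positive and 4,800 negative words
get_sentiments("bing") %>%
  count(sentiment)
```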
Sentiment lexicons include many words, but some words are unlikely to be found in a sentiment lexicon or dictionary.
```{r sentiment-quiz}
question('Which of the following words is most unlikely to be found in a sentiment lexicon?',
answer("pessimism"),
answer("and", correct = TRUE),
answer("peace"),
answer("merry"),
answer("repulsion"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. Sentiment lexicons include words that express emotions and opinions, like this one. Which of these words is the most neutral?",
correct = "Correct! A word like \"and\" is neutral and unlikely to be included in a sentiment lexicon."
)
```
### Sentiment analysis via an `inner_join()`
When text data is in a tidy data structure like the one we are working with in this tutorial, with **one word per row**, you can perform sentiment analysis using an inner join, like the joins you may have worked with in SQL or in [dplyr](https://dplyr.tidyverse.org/). Remember that the sentiment lexicon you just saw is also a data frame with one word per row. The result of an inner join between the two data sets gives you all the words that are in both the text you are analyzing and the lexicon.
<div class = "row">
<div class = "col-md-6">
Let's say you have the following text data set, available in your environment as `text`:
```{r}
tibble(word = c("chocolate", "makes", "me", "happy",
"but", "this", "movie", "is", "sad"))
```
</div>
<div class = "col-md-6">
And the following sentiment lexicon, in your environment as `lexicon`:
```{r}
tibble(word = c("happy", "sad", "chocolate", "broccoli"))
```
</div>
</div>
```{r chocolate-quiz}
question('Which word will be in the output of an inner join between this text data set and this sentiment lexicon?',
answer("movie"),
answer("chocolate", correct = TRUE),
answer("broccoli"),
answer("me"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. This word will not be in the output of an inner join.",
correct = "Correct! The word \"chocolate\" is in both the text data set and the lexicon."
)
```
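In code, the join described above is just one line. A minimal sketch using the `text` and `lexicon` tibbles shown (not one of the graded exercises):
```{r inner-join-sketch, echo=TRUE, eval=FALSE}
# keep only rows whose `word` appears in both data sets:
# "chocolate", "happy", and "sad" survive the join
text %>%
  inner_join(lexicon, by = "word")
```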
Using tidy data principles for sentiment analysis is convenient not only for implementing the analysis itself, but for understanding the results in detail. Let's walk through what that can look like with the Shakespearean plays.
### Sentiment analysis of Shakespeare
After transforming the text of these Shakespearean plays to a tidy text data set in a previous exercise, the resulting data frame `tidy_shakespeare` is ready for sentiment analysis. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.
- Use the correct kind of join to implement sentiment analysis.
- Add the `"bing"` lexicon as the argument to the join function.
- Find how many positive and negative words each play has by using two arguments in `count()`.
```{r shakespeare-sentiment, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
shakespeare_sentiment <- tidy_shakespeare %>%
# implement sentiment analysis with the "bing" lexicon
___(get_sentiments(___))
shakespeare_sentiment %>%
# find how many positive/negative words each play has
___(___)
```
```{r shakespeare-sentiment-solution}
shakespeare_sentiment <- tidy_shakespeare %>%
# implement sentiment analysis with the "bing" lexicon
inner_join(get_sentiments("bing"))
shakespeare_sentiment %>%
# find how many positive/negative words each play has
count(title, sentiment)
```
```{r shakespeare-sentiment-check}
grade_code(correct = 'When text is in a tidy data structure, sentiment analysis can be implemented with an inner join.')
```
### Tragedy or comedy?
> But soft! What light through yonder window breaks? It is the number of sentiment words you just calculated, and you are on the path to comparing which plays have proportionally more positive or negative words.
Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?
**First,** calculate how many negative and positive words each play used.
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find the number of words for each combination of `title`, `genre,` and `sentiment`.
**Next,** find the percentage of negative words for each play.
- Group by the titles of the plays.
- Find the total number of words in each play using `sum()`.
- Calculate a `percent` for each play that is the number of words of each sentiment divided by the total words in that play.
- Filter the results for only `"negative"` sentiment.
```{r shakespeare-percent, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
sentiment_counts <- tidy_shakespeare %>%
# implement sentiment analysis using the "bing" lexicon
___ %>%
# count the number of words by title, genre, and sentiment
___
sentiment_counts %>%
group_by(___) %>%
# find the total number of words in each play
mutate(total = ___,
percent = n / total) %>%
# filter the results for only negative sentiment
filter(___) %>%
arrange(percent)
```
```{r shakespeare-percent-solution}
sentiment_counts <- tidy_shakespeare %>%
# implement sentiment analysis using the "bing" lexicon
inner_join(get_sentiments("bing")) %>%
# count the number of words by title, genre, and sentiment
count(title, genre, sentiment)
sentiment_counts %>%
group_by(title) %>%
# find the total number of words in each play
mutate(total = sum(n),
percent = n / total) %>%
# filter the results for only negative sentiment
filter(sentiment == "negative") %>%
arrange(percent)
```
```{r shakespeare-percent-check}
grade_code(correct = "Looking at the `percent` column of your output, you can see that tragedies do in fact have a higher percentage of negative words.")
```
### Most common positive and negative words
You found in the previous exercise that Shakespeare's tragedies use proportionally more negative words than the comedies. Now you can explore which specific words are driving these sentiment scores. Which are the most common **positive** and **negative** words in these plays?
There are **three** steps in the code in this exercise. The first step counts how many times each word is used, the second step takes the top 10 most-used positive and negative words, and the final step makes a plot to visualize this result.
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find word counts by `sentiment`.
- Group by `sentiment` so you can take the top 10 words in each sentiment.
- Notice what the line `mutate(word = reorder(word, n))` does; it converts `word` from a character that would be plotted in alphabetical order to a factor that will be plotted in order of `n` (see the small sketch below).
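A tiny sketch with made-up words and counts shows what `reorder()` does:
```{r reorder-sketch, echo=TRUE, eval=FALSE}
# as a character column, `word` would plot alphabetically;
# reorder() turns it into a factor whose levels follow `n`,
# so the bars in the plot end up sorted by count
tibble(word = c("kiwi", "banana", "apple"), n = c(2, 5, 3)) %>%
  mutate(word = reorder(word, n))
```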
Now you can make a visualization of `top_words` using [ggplot2](https://ggplot2.tidyverse.org/) to see these results.
- Put `n` on the x-axis and `word` on the y-axis.
- Use `geom_col()` to make a bar chart.
```{r shakespeare-common-plot, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
word_counts <- tidy_shakespeare %>%
___ %>%
# count by word and sentiment
___
top_words <- word_counts %>%
# group by sentiment
___ %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n))
ggplot(___, aes(___, ___, fill = sentiment)) +
___(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free")
```
```{r shakespeare-common-plot-solution}
word_counts <- tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
# count by word and sentiment
count(word, sentiment)
top_words <- word_counts %>%
# group by sentiment
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n))
ggplot(top_words, aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free")
```
```{r shakespeare-common-plot-check}
grade_code(correct = "Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare's time? Do you see a word that the lexicon has misidentified?")
```
### Which word was misidentified?
In the last exercise, you found the top 10 words that contributed most to negative sentiment in these Shakespearean plays, but lexicons are not always foolproof tools for use with all kinds of text.
```{r shakespeare-quiz}
question('Which of those top 10 "negative" words used by Shakespeare was misidentified as negative by the sentiment lexicon?',
answer("death"),
answer("wilt", correct = TRUE),
answer("poor"),
answer("fear"),
answer("die"),
allow_retry = TRUE,
random_answer_order = TRUE,
incorrect = "Incorrect. That is a negative word correctly identified by the sentiment lexicon. Which one is an archaic version of a different word used today?",
correct = "The word \"wilt\" was used differently in Shakespeare's time and was not negative; the lexicon has misidentified it. For example, from *Romeo and Juliet*, \"For thou wilt lie upon the wings of night\". It is important to explore the details of how words were scored when performing sentiment analyses."
)
```
### Sentiment changes through a play
In the last set of exercises in this case study, you will examine how sentiment changes through the narrative arcs of these Shakespearean plays. We will start by implementing sentiment analysis using `inner_join()`, and then use `count()` with **four** arguments:
- `title`,
- `genre`,
- an `index` that will section together lines of the play, and
- `sentiment`.
After these lines of code, you will have the number of positive and negative words used in each `index`-ed section of the play. These sections will be 70 lines long in your analysis here.
_You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines works well for these plays._
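To see what the integer division does, here is a tiny sketch (not a graded exercise) with a handful of made-up line numbers:
```{r integer-division-sketch, echo=TRUE, eval=FALSE}
# lines 1-69 land in chunk 0, lines 70-139 in chunk 1, and so on
c(1, 69, 70, 139, 140) %/% 70
#> [1] 0 0 1 1 2
```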
- Implement sentiment analysis using the `"bing"` lexicon.
- Use `count()` to find the number of words for each sentiment used in each play in sections, using four arguments.
- The first argument for `count()` maps to the plays themselves.
- The second argument keeps track of whether the play is a comedy or tragedy.
- The third argument is defined by you; call it `index` and set it equal to `linenumber %/% 70`. This `index` makes chunks of text that are 70 lines long using integer division (`%/%`).
- The fourth argument maps to the different `sentiment` categories.
```{r shakespeare-count-four, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare %>%
# implement sentiment analysis using "bing" lexicon
___ %>%
# count using four arguments
___
```
```{r shakespeare-count-four-solution}
tidy_shakespeare %>%
# implement sentiment analysis using "bing" lexicon
inner_join(get_sentiments("bing")) %>%
# count using four arguments
count(title, genre, index = linenumber %/% 70, sentiment)
```
```{r shakespeare-count-four-check}
grade_code(correct = "This is the first step in looking at narrative arcs. Once more unto the breach, dear friends!")
```
### Visualizing narrative arcs
Now you will build on the code from the previous exercise and continue to move forward to see how sentiment changes through these Shakespearean plays.
- Use `pivot_wider()` from [tidyr](https://tidyr.tidyverse.org/) to pivot `sentiment` and `n` to multiple columns.
- Make a new column using `mutate()` that computes the net sentiment by subtracting `negative` sentiment from `positive`.
- Put `index` on the x-axis, `sentiment` on the y-axis, and use `genre` for `fill`.
- Use `geom_col()` to make a bar chart.
- Call `facet_wrap()` to make a separate panel for each title.
***SPOILER ALERT:*** *It doesn't look like Romeo and Juliet are going to have a happy ending!!!*
```{r shakespeare-arcs, out.width='100%', fig.width=8, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, genre, index = linenumber %/% 70, sentiment) %>%
# pivot sentiment and n wider
___(names_from = sentiment, values_from = n, values_fill = 0) %>%
# use mutate to compute net sentiment
___(sentiment = ___ - ___) %>%
# put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
ggplot(aes(___)) +
# make a bar chart with geom_col()
___ +
# make small multiples for each title with facet_wrap()
___(~ title, scales = "free_x")
```
```{r shakespeare-arcs-solution}
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, genre, index = linenumber %/% 70, sentiment) %>%
# pivot sentiment and n wider
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
# use mutate to compute net sentiment
mutate(sentiment = positive - negative) %>%
# put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
ggplot(aes(index, sentiment, fill = genre)) +
# make a bar chart with geom_col()
geom_col() +
# make small multiples for each title with facet_wrap()
facet_wrap(~ title, scales = "free_x")
```
```{r shakespeare-arcs-check}
grade_code(correct = "These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment overall than the tragedies.")
```
## 4. Newspaper headlines {data-progressive=TRUE}
<iframe src="https://giphy.com/embed/7IW6Jnw29TYmgkuu3M" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/usnationalarchives-paper-newspaper-printing-press-7IW6Jnw29TYmgkuu3M">via GIPHY</a></p>
You have already learned a lot in the first two case studies of this tutorial, from how to transform text to a tidy format to how to use sentiment lexicons. In this third case study, we will learn some new ways to analyze text, using a data set of newspaper headlines. These particular headlines are from the *New York Times* during November 2020.
[Data provided by The New York Times](https://developer.nytimes.com/)
The *New York Times* is a busy paper; there were over 5,000 headlines during this period! The data set we'll be working with has two variables:
- `headline`, the text of the headline as published by the NYT
- `section`, the main section of the paper that the article with this headline was published in
Headlines are interesting for many reasons. One important one is that news organizations in general and the *New York Times* in particular have come under criticism for their practices in crafting headlines, from excessive use of passive voice when reporting killings by the police to how outright lies by politicians are handled in headlines. Text data is often fraught and complex, and that is certainly true for newspaper headlines!
Use `count()` with `sort = TRUE` so you can see how many different values for `section` there are in this data set from November 2020 and which had the most headlines.
```{r count-section, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
nyt_headlines %>%
___
```
```{r count-section-solution}
nyt_headlines %>%
count(section, sort = TRUE)
```
```{r count-section-check}
grade_code("November 2020 was a busy month in the United States! The next most used categories in this data set are headlines about world news and the opinion page.")
```
### Tidying newspaper headlines
This data set contains over 5,000 newspaper headlines from November 2020. The headlines are all in one column, so they are not yet in a tidy format, ready for our tidy set of tools. It's time for you to tidy these headlines!
- Pipe the `nyt_headlines` object to the next line.
- Use `unnest_tokens()` to tokenize and tidy the `headline` column into a new `word` column.
We also have added a headline `id` column via `mutate()` so we can keep track of which word belongs to which headline.
```{r tidy-headlines, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
# pipe `nyt_headlines` to the next line
tidy_nyt <- ___ %>%
mutate(id = row_number()) %>%
# transform the `headline` column to a `word` column
___
tidy_nyt
```
```{r tidy-headlines-solution}
# pipe `nyt_headlines` to the next line
tidy_nyt <- nyt_headlines %>%
mutate(id = row_number()) %>%
# transform the `headline` column to a `word` column
unnest_tokens(word, headline)
tidy_nyt
```
```{r tidy-headlines-check}
grade_code("This is the third time you have tidied a text data set in this tutorial. Notice that you use whatever column name the text happens to be in. You also can give the output whatever name is most convenient for you; it does not necessarily have to be `word`.")
```
### Most common headline words
What are the most common words in these headlines? Analyzing headlines is not quite like analyzing other kinds of text, but to be honest, that turns out to be true a lot of the time. Many text data sets have **specific characteristics** (document length, vocabulary, semantic patterns, etc.) that make them stand out compared to other data sets, sometimes even when you look at just the most common words.
- Use `count()` to tally up the words.
- Set `sort = TRUE` to order by the count, instead of alphabetically.
```{r top-nyt, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_nyt %>%
___
```
```{r top-nyt-solution}
tidy_nyt %>%
count(word, sort = TRUE)
```
```{r top-nyt-check}
grade_code(correct = "The word \"election\" is used more in this data set than the word \"and\"! This is pretty wild.")
```
### Removing stop words
The overall top words tell us a lot about this corpus of headlines, but let's remove the default list of stop words to get a better idea of what was being reported at the NYT in November 2020. Remember from the first case study that you remove stop words using an `anti_join()`.
```{r nyt-stop-words, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
tidy_nyt %>%
___ %>%
count(word, sort = TRUE)
```
```{r nyt-stop-words-solution}
tidy_nyt %>%
anti_join(get_stopwords()) %>%
count(word, sort = TRUE)
```
```{r nyt-stop-words-check}
grade_code(correct = "In November 2020, election results were in headlines at a much higher rate than coronavirus.")
```
### What are the newspaper sections about?
A central question in text mining is how to quantify what different documents or categories of documents are about. One approach to measuring this is to use _tf-idf_, term frequency-inverse document frequency. A word's **term frequency** is how frequently it occurs in a document. A word's **inverse document frequency** is a weight that decreases for commonly used words and increases for words that are rarely used in a collection of documents:
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
If you multiply the two together, you get tf-idf, a statistic intended to measure how important a word is to a document in a collection (or corpus) of documents. The [`bind_tf_idf()` function from tidytext](https://juliasilge.github.io/tidytext/reference/bind_tf_idf.html) can compute tf-idf for text data in a tidy format.
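As a quick worked example with made-up numbers (not from the headline data): suppose a corpus of 5,000 headlines and a 10-word headline in which a word appears once, while that word shows up in only 5 headlines overall.
```{r tf-idf-arithmetic-sketch, echo=TRUE, eval=FALSE}
tf <- 1 / 10          # term frequency: 1 use out of 10 words in the headline
idf <- log(5000 / 5)  # inverse document frequency: ln(1000), about 6.9
tf * idf              # tf-idf of about 0.69; words rare across documents score higher
```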
Let's look at this in the context of the NYT headlines. In our case study here, each headline is a document, and the corpus is the whole month of headlines.
- Start by counting up the words in the `tidy_nyt` data, using three arguments: `section` (to keep track of the newspaper sections), `id` for each headline, and `word`.
- Specify the appropriate three arguments for `bind_tf_idf()`:
- The first is the token or term, `word`, and the second is the document, the headline `id`.
- The last argument is the column that contains the document-term counts, which is `n`, the output of `count()`.
Because of the line `arrange(-tf_idf)`, the output shows the _highest_ tf-idf words in the whole data set of headlines, the most distinctive words found.
```{r nyt-tf-idf, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE}
nyt_tf_idf <- tidy_nyt %>%
# count with three arguments
___ %>%
# there are also three arguments for `bind_tf_idf()`
bind_tf_idf(___) %>%
arrange(-tf_idf)
nyt_tf_idf
```
```{r nyt-tf-idf-solution}
nyt_tf_idf <- tidy_nyt %>%
# count with three arguments
count(section, id, word) %>%
# there are also three arguments for `bind_tf_idf()`
bind_tf_idf(word, id, n) %>%
arrange(-tf_idf)
nyt_tf_idf
```
```{r nyt-tf-idf-check}
grade_code(correct = "The highest tf-idf words also have both `n` and term frequency equal to 1, meaning that these were headlines that only had a single word in them. Not all the headlines in this data set are what we would think of as traditional headlines.")
```
### Visualize tf-idf
Let's use visualization to understand the high tf-idf words better. Let's connect the highest tf-idf words for each headline to the **sections** of the newspaper and look at the top 10 highest tf-idf headline words for four different newspaper sections.
The following code looks at the food, opinion, and technology sections, along with [The Upshot](https://www.nytimes.com/section/upshot).
_Once you complete the code, you can switch these out for any of the other sections you found in the first exercise in this case study (just use "Run Code" instead of "Submit Answer")._
- Put `tf_idf` on the x-axis, and `word` on the y-axis.
- Use `facet_wrap()` to make small multiples for each newspaper section.
```{r nyt-tf-idf-prep}
nyt_tf_idf <- tidy_nyt %>%
count(section, id, word) %>%
bind_tf_idf(word, id, n) %>%
arrange(-tf_idf)
```
```{r nyt-tf-idf-viz, out.width='100%', fig.height=6, fig.width=9, exercise=TRUE, warning=FALSE, message=FALSE, error=FALSE, exercise.setup = "nyt-tf-idf-prep"}
sections <- c("Food", "Opinion", "Technology", "The Upshot")
nyt_tf_idf %>%
filter(!near(tf, 1)) %>%
filter(section %in% sections) %>%
group_by(section) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
# put `tf_idf` on the x-axis, `word` on the y-axis
ggplot(aes(___, fill = section)) +
geom_col(show.legend = FALSE) +
# make small multiples
___(~ section, ncol = 2, scales = "free")
```
```{r nyt-tf-idf-viz-solution}
sections <- c("Food", "Opinion", "Technology", "The Upshot")
nyt_tf_idf %>%
filter(!near(tf, 1)) %>%
filter(section %in% sections) %>%
group_by(section) %>%
slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
# put `tf_idf` on the x-axis, `word` on the y-axis
ggplot(aes(tf_idf, word, fill = section)) +
geom_col(show.legend = FALSE) +
# make small multiples
facet_wrap(~ section, ncol = 2, scales = "free")
```
```{r nyt-tf-idf-viz-check}
grade_code(correct = "Words about IPOs were more characteristic of the technology section while words about stews and pasta were more characteristic of the food section. Also notice that we filtered out those headlines with one word.")
```
### Tokenize headlines to bigrams