
Commit 0e0fd04

add tutorial materials
1 parent 61c51b2 commit 0e0fd04

6 files changed, 1,884 additions and 1 deletion


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 .DS_Store
 *.pyc
 extras/
+*.tpl

README.md

Lines changed: 46 additions & 1 deletion
@@ -4,7 +4,7 @@ Presented by [Kevin Markham](http://www.dataschool.io/about/) at PyCon 2016 (Por
 
 ### Welcome!
 
-This repository will contain the data files and the notebooks/scripts that you will need for the tutorial. I will email enrolled students as soon as those files have been added to the repository.
+This repository contains the data files and the notebooks/scripts that you will need for the tutorial.
 
 A detailed description of the tutorial is below, including a list of **required software** and **knowledge prerequisites**. If you need a refresher on any of the prerequisite material, I have listed my recommended resources.
 
@@ -54,3 +54,48 @@ In this tutorial, we'll answer all of those questions, and more! We'll start by
 ### About the Instructor
 
 Kevin Markham is the founder of [Data School](http://www.dataschool.io/) and the former lead instructor for [General Assembly's Data Science course](https://github.com/justmarkham/DAT8) in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) for Kaggle. He has a degree in Computer Engineering from Vanderbilt University.
+
+### Tutorial Introduction
+
+* Required files for today:
+    * Clone or download this repository: [http://bit.ly/pycon2016](http://bit.ly/pycon2016)
+    * IPython/Jupyter notebooks ([tutorial.ipynb](tutorial.ipynb), [exercise.ipynb](exercise.ipynb)) or Python scripts ([tutorial.py](tutorial.py), [exercise.py](exercise.py))
+    * Datasets in the `data` subdirectory ([sms.tsv](data/sms.tsv), [yelp.csv](data/yelp.csv))
+* Required software for today:
+    * [scikit-learn](http://scikit-learn.org/stable/install.html) and [pandas](http://pandas.pydata.org/pandas-docs/stable/install.html) (and their dependencies)
+    * [Anaconda distribution of Python](https://www.continuum.io/downloads) is an easy way to install both of these
+    * Both Python 2 and 3 are welcome
+    * Flash drives are available with Anaconda installers and tutorial files
+* About me:
+    * Founder of Data School: [blog](http://www.dataschool.io/), [YouTube](https://youtube.com/user/dataschool)
+    * Twitter: [@justmarkham](https://twitter.com/justmarkham)
+
+* How the tutorial will work
+* What we'll be learning today
+* What I expect you already know
+* Agenda
+
+### Related Resources
+
+**Text classification:**
+* Read Paul Graham's classic post, [A Plan for Spam](http://www.paulgraham.com/spam.html), for an overview of a basic text classification system using a Bayesian approach. (He also wrote a [follow-up post](http://www.paulgraham.com/better.html) about how he improved his spam filter.)
+* Coursera's Natural Language Processing (NLP) course has [video lectures](https://class.coursera.org/nlp/lecture) on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) used in all of the videos.)
+* [Automatically Categorizing Yelp Businesses](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
+* [How to Read the Mind of a Supreme Court Justice](http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/) discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining [how it works](https://sciencecowboy.wordpress.com/2015/03/05/predicting-the-supreme-court-from-oral-arguments/), and the [Python code](https://github.com/nasrallah/CourtCast) is available on GitHub.)
+* [Identifying Humorous Cartoon Captions](http://www.cs.huji.ac.il/~dshahaf/pHumor.pdf) is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
+* In this [PyData video](https://www.youtube.com/watch?v=y3ZTKFZ-1QQ) (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.
+
+**Naive Bayes and logistic regression:**
+* Read this brief Quora post on [airport security](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt) for an intuitive explanation of how Naive Bayes classification works.
+* For a longer introduction to Naive Bayes, read Sebastian Raschka's article on [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html). As well, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
+* My [guide to an in-depth understanding of logistic regression](http://www.dataschool.io/guide-to-logistic-regression/) includes a lesson notebook and a curated list of resources for going deeper into this topic.
+* [Comparison of Machine Learning Models](https://github.com/justmarkham/DAT8/blob/master/other/model_comparison.md) lists the advantages and disadvantages of Naive Bayes, logistic regression, and other classification and regression models.
+
+**scikit-learn:**
+* The scikit-learn user guide includes an excellent section on [text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) that includes many details not covered in today's tutorial.
+* The user guide also describes the [performance trade-offs](http://scikit-learn.org/stable/modules/computational_performance.html#influence-of-the-input-data-representation) involved when choosing between sparse and dense input data representations.
+* To learn more about evaluating classification models, watch video #9 from my [scikit-learn video series](https://github.com/justmarkham/scikit-learn-videos) (or just read the associated [notebook](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb)).
+
+**pandas:**
+* Here are my [top 8 resources for learning data analysis with pandas](http://www.dataschool.io/best-python-pandas-resources/).
+* As well, I have a new [pandas Q&A video series](http://www.dataschool.io/easier-data-analysis-with-pandas/) targeted at beginners that includes two new videos every week.
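A quick way to confirm that the required software listed under "Tutorial Introduction" above is installed is to check the package versions from Python. This is only a sanity-check snippet; the exact version numbers will vary by machine:

    # confirm that scikit-learn and pandas are importable, and show their versions
    import sklearn
    import pandas

    print('scikit-learn version: ' + sklearn.__version__)
    print('pandas version: ' + pandas.__version__)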

exercise.ipynb

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Tutorial Exercise: Yelp reviews"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "\n",
+    "This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.\n",
+    "\n",
+    "**Description of the data:**\n",
+    "\n",
+    "- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.\n",
+    "- Each observation (row) in this dataset is a review of a particular business by a particular user.\n",
+    "- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.\n",
+    "- The **text** column is the text of the review.\n",
+    "\n",
+    "**Goal:** Predict the star rating of a review using **only** the review text.\n",
+    "\n",
+    "**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 1\n",
+    "\n",
+    "Read **`yelp.csv`** into a pandas DataFrame and examine it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 2\n",
+    "\n",
+    "Create a new DataFrame that only contains the **5-star** and **1-star** reviews.\n",
+    "\n",
+    "- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 3\n",
+    "\n",
+    "Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.\n",
+    "\n",
+    "- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 4\n",
+    "\n",
+    "Use CountVectorizer to create **document-term matrices** from X_train and X_test."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 5\n",
+    "\n",
+    "Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.\n",
+    "\n",
+    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 6 (Challenge)\n",
+    "\n",
+    "Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.\n",
+    "\n",
+    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 7 (Challenge)\n",
+    "\n",
+    "Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?\n",
+    "\n",
+    "- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of \"false positives\" and \"false negatives\".\n",
+    "- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the \"positive class\"?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 8 (Challenge)\n",
+    "\n",
+    "Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.\n",
+    "\n",
+    "- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 9 (Challenge)\n",
+    "\n",
+    "Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.\n",
+    "\n",
+    "Here are the steps:\n",
+    "\n",
+    "- Define X and y using the original DataFrame. (y should contain 5 different classes.)\n",
+    "- Split X and y into training and testing sets.\n",
+    "- Create document-term matrices using CountVectorizer.\n",
+    "- Calculate the testing accuracy of a Multinomial Naive Bayes model.\n",
+    "- Compare the testing accuracy with the null accuracy, and comment on the results.\n",
+    "- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)\n",
+    "- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}

exercise.py

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+# # Tutorial Exercise: Yelp reviews
+
+# ## Introduction
+#
+# This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.
+#
+# **Description of the data:**
+#
+# - **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
+# - Each observation (row) in this dataset is a review of a particular business by a particular user.
+# - The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.
+# - The **text** column is the text of the review.
+#
+# **Goal:** Predict the star rating of a review using **only** the review text.
+#
+# **Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
+
+# ## Task 1
+#
+# Read **`yelp.csv`** into a pandas DataFrame and examine it.
+
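A minimal sketch of one way to approach Task 1 (the DataFrame name `yelp` and the relative path are illustrative choices, not required by the exercise):

    import pandas as pd

    # read yelp.csv into a DataFrame using a path relative to the repository root
    yelp = pd.read_csv('data/yelp.csv')

    # examine the shape and the first few rows
    print(yelp.shape)
    print(yelp.head())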
+# ## Task 2
+#
+# Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
+#
+# - **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
+
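One possible approach to Task 2, continuing from the `yelp` DataFrame defined in the previous sketch (`yelp_best_worst` is an illustrative name):

    # keep only the rows whose star rating is 5 or 1
    yelp_best_worst = yelp[yelp.stars.isin([5, 1])]

    # confirm that only two classes remain
    print(yelp_best_worst.shape)
    print(yelp_best_worst.stars.value_counts())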
+# ## Task 3
+#
+# Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
+#
+# - **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
+
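A sketch of Task 3, defining X as a Series per the hint. This assumes a recent scikit-learn; older releases import `train_test_split` from `sklearn.cross_validation` instead of `sklearn.model_selection`:

    from sklearn.model_selection import train_test_split

    # use the review text as the only feature and the star rating as the response
    X = yelp_best_worst.text
    y = yelp_best_worst.stars

    # split into training and testing sets (random_state is only for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)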
+# ## Task 4
+#
+# Use CountVectorizer to create **document-term matrices** from X_train and X_test.
+
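A sketch of Task 4: fit the vectorizer on the training text only, then transform both sets (the `_dtm` suffix for "document-term matrix" is an illustrative naming choice):

    from sklearn.feature_extraction.text import CountVectorizer

    # learn the training vocabulary and build the training document-term matrix
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)

    # build the testing document-term matrix using the same vocabulary
    X_test_dtm = vect.transform(X_test)

    print(X_train_dtm.shape)
    print(X_test_dtm.shape)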
+# ## Task 5
+#
+# Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
+#
+# - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
+
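A sketch of Task 5, assuming the objects created in the previous sketches:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn import metrics

    # train a multinomial Naive Bayes model and predict the testing set
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)

    # classification accuracy and confusion matrix
    print(metrics.accuracy_score(y_test, y_pred_class))
    print(metrics.confusion_matrix(y_test, y_pred_class))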
+# ## Task 6 (Challenge)
+#
+# Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
+#
+# - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
+
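One way to compute the null accuracy for Task 6; the explicit float cast keeps the division correct under Python 2 as well as Python 3:

    # null accuracy: the frequency of the most common class in the testing set
    null_accuracy = y_test.value_counts().max() / float(len(y_test))
    print(null_accuracy)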
+# ## Task 7 (Challenge)
+#
+# Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
+#
+# - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
+# - **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
+
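A sketch for Task 7. Treating 5 stars as the positive class (worth confirming against your own confusion matrix), a false positive is a 1-star review predicted as 5-star, and a false negative is the reverse:

    # false positives: actual 1-star reviews that the model predicted as 5-star
    print(X_test[(y_test == 1) & (y_pred_class == 5)].head(10))

    # false negatives: actual 5-star reviews that the model predicted as 1-star
    print(X_test[(y_test == 5) & (y_pred_class == 1)].head(10))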
+# ## Task 8 (Challenge)
+#
+# Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
+#
+# - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
+
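A rough sketch for Task 8 using the attributes named in the hint. Ranking tokens by the ratio of their per-class frequencies is only one reasonable heuristic, and `get_feature_names()` was renamed to `get_feature_names_out()` in newer scikit-learn releases:

    import pandas as pd

    # vocabulary learned by CountVectorizer
    tokens = vect.get_feature_names()

    # rows of feature_count_ follow nb.classes_, which is [1, 5] here
    one_star_counts = nb.feature_count_[0, :]
    five_star_counts = nb.feature_count_[1, :]

    # convert raw counts to class frequencies (+1 avoids division by zero), then rank by ratio
    token_freq = pd.DataFrame({
        'token': tokens,
        'one_star': (one_star_counts + 1) / nb.class_count_[0],
        'five_star': (five_star_counts + 1) / nb.class_count_[1],
    })
    token_freq['five_star_ratio'] = token_freq.five_star / token_freq.one_star

    print(token_freq.sort_values('five_star_ratio', ascending=False).head(10))  # most predictive of 5 stars
    print(token_freq.sort_values('five_star_ratio', ascending=True).head(10))   # most predictive of 1 star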
+# ## Task 9 (Challenge)
+#
+# Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.
+#
+# Here are the steps:
+#
+# - Define X and y using the original DataFrame. (y should contain 5 different classes.)
+# - Split X and y into training and testing sets.
+# - Create document-term matrices using CountVectorizer.
+# - Calculate the testing accuracy of a Multinomial Naive Bayes model.
+# - Compare the testing accuracy with the null accuracy, and comment on the results.
+# - Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
+# - Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
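Finally, a compressed sketch of the Task 9 workflow on all five classes, reusing the pattern from the earlier sketches and assuming the `yelp` DataFrame from the Task 1 sketch:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn import metrics

    # define X and y from the original DataFrame, so y now contains 5 classes
    X = yelp.text
    y = yelp.stars
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # document-term matrices and a multinomial Naive Bayes model, as before
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)

    # compare testing accuracy to null accuracy, then inspect the detailed metrics
    print(metrics.accuracy_score(y_test, y_pred_class))
    print(y_test.value_counts().max() / float(len(y_test)))
    print(metrics.confusion_matrix(y_test, y_pred_class))
    print(metrics.classification_report(y_test, y_pred_class))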
