|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "## USA Baby Names 1880-2016\n", |
| 8 | + "\n", |
| 9 | + "The United States Social Security Administration maintains an interesting data set of (almost) all names given to babies born in the United States, by sex and year, going back to 1880. This data set is available at [https://www.ssa.gov/oact/babynames/limits.html](https://www.ssa.gov/oact/babynames/limits.html)\n", |
| 10 | + "\n", |
| 11 | + "It is fun to explore, and we'll use it as the basis of a simple data analysis project. The end goal is a script that, when called, outputs a plot of a single name's popularity over time.\n", |
| 12 | + "\n", |
| 13 | + "To start, we will assume that this dataset has already been downloaded and unzipped." |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "code", |
| 18 | + "execution_count": null, |
| 19 | + "metadata": { |
| 20 | + "collapsed": true |
| 21 | + }, |
| 22 | + "outputs": [], |
| 23 | + "source": [ |
| 24 | + "import pandas as pd\n", |
| 25 | + "import matplotlib.pyplot as plt" |
| 26 | + ] |
| 27 | + }, |
| 28 | + { |
| 29 | + "cell_type": "code", |
| 30 | + "execution_count": null, |
| 31 | + "metadata": { |
| 32 | + "collapsed": true |
| 33 | + }, |
| 34 | + "outputs": [], |
| 35 | + "source": [ |
| 36 | + "%matplotlib inline" |
| 37 | + ] |
| 38 | + }, |
| 39 | + { |
| 40 | + "cell_type": "code", |
| 41 | + "execution_count": null, |
| 42 | + "metadata": { |
| 43 | + "collapsed": true |
| 44 | + }, |
| 45 | + "outputs": [], |
| 46 | + "source": [ |
| 47 | + "# Set up some variables for use later\n", |
| 48 | + "dataset_path = 'data\\\\names'\n", |
| 49 | + "begin_year = 1880\n", |
| 50 | + "end_year = 2016" |
| 51 | + ] |
| 52 | + }, |
| 53 | + { |
| 54 | + "cell_type": "markdown", |
| 55 | + "metadata": {}, |
| 56 | + "source": [ |
| 57 | + "Let's first examine the data files to see what we're working with. Note that the `dir` and `type` commands used below are Windows commands; on macOS or Linux the equivalents are `ls` and `cat`." |
| 58 | + ] |
| 59 | + }, |
| 60 | + { |
| 61 | + "cell_type": "code", |
| 62 | + "execution_count": null, |
| 63 | + "metadata": {}, |
| 64 | + "outputs": [], |
| 65 | + "source": [ |
| 66 | + "!dir $dataset_path" |
| 67 | + ] |
| 68 | + }, |
| 69 | + { |
| 70 | + "cell_type": "code", |
| 71 | + "execution_count": null, |
| 72 | + "metadata": {}, |
| 73 | + "outputs": [], |
| 74 | + "source": [ |
| 75 | + "# Read a single file into a Python variable and show the first five lines\n", |
| 76 | + "sample = !type $dataset_path\\\\yob1880.txt\n", |
| 77 | + "sample[:5]" |
| 78 | + ] |
| 79 | + }, |
| 80 | + { |
| 81 | + "cell_type": "markdown", |
| 82 | + "metadata": {}, |
| 83 | + "source": [ |
| 84 | + "We will need a function to read in all of these files one by one and combine them into a single dataframe." |
| 85 | + ] |
| 86 | + }, |
| 87 | + { |
| 88 | + "cell_type": "code", |
| 89 | + "execution_count": null, |
| 90 | + "metadata": {}, |
| 91 | + "outputs": [], |
| 92 | + "source": [ |
| 93 | + "def create_dataframe():\n", |
| 94 | + " columns = ('name', 'sex', 'births')\n", |
| 95 | + " pieces = []\n", |
| 96 | + " for year in range(begin_year, end_year + 1):\n", |
| 97 | + " filename = '%s/yob%d.txt' % (dataset_path, year)\n", |
| 98 | + " piece = pd.read_csv(filename, names=columns)\n", |
| 99 | + " piece['year'] = year\n", |
| 100 | + " pieces.append(piece)\n", |
| 101 | + " \n", |
| 102 | + " return pd.concat(pieces, ignore_index=True)" |
| 103 | + ] |
| 104 | + }, |
| 105 | + { |
| 106 | + "cell_type": "code", |
| 107 | + "execution_count": null, |
| 108 | + "metadata": {}, |
| 109 | + "outputs": [], |
| 110 | + "source": [ |
| 111 | + "# Now call our new function to load the dataset into a DataFrame.\n", |
| 112 | + "df = create_dataframe()\n", |
| 113 | + "df.head()" |
| 114 | + ] |
| 115 | + }, |
| 116 | + { |
| 117 | + "cell_type": "code", |
| 118 | + "execution_count": null, |
| 119 | + "metadata": {}, |
| 120 | + "outputs": [], |
| 121 | + "source": [ |
| 122 | + "# How many records do we have?\n", |
| 123 | + "len(df)" |
| 124 | + ] |
| 125 | + }, |
| 126 | + { |
| 127 | + "cell_type": "markdown", |
| 128 | + "metadata": {}, |
| 129 | + "source": [ |
| 130 | + "Now that we have the data in a dataframe, we want to move the year and sex columns into the index, leaving only columns for name and birth count. We can use the `set_index` method of the dataframe for this." |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "code", |
| 135 | + "execution_count": null, |
| 136 | + "metadata": {}, |
| 137 | + "outputs": [], |
| 138 | + "source": [ |
| 139 | + "df = df.set_index(keys=['year', 'sex'])\n", |
| 140 | + "df.head()" |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "metadata": {}, |
| 146 | + "source": [ |
| 147 | + "Now we need a function that, given a name and a sex, returns a series containing the number of births by year." |
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "code", |
| 152 | + "execution_count": null, |
| 153 | + "metadata": { |
| 154 | + "collapsed": true |
| 155 | + }, |
| 156 | + "outputs": [], |
| 157 | + "source": [ |
| 158 | + "def get_births_series(df, name, sex):\n", |
| 159 | + " single_sex_df = df.xs(sex, level='sex')\n", |
| 160 | + " return single_sex_df[single_sex_df.name == name]['births']" |
| 161 | + ] |
| 162 | + }, |
| 163 | + { |
| 164 | + "cell_type": "code", |
| 165 | + "execution_count": null, |
| 166 | + "metadata": {}, |
| 167 | + "outputs": [], |
| 168 | + "source": [ |
| 169 | + "matthews = get_births_series(df, 'Matthew', 'M')\n", |
| 170 | + "matthews.head()" |
| 171 | + ] |
| 172 | + }, |
| 173 | + { |
| 174 | + "cell_type": "code", |
| 175 | + "execution_count": null, |
| 176 | + "metadata": {}, |
| 177 | + "outputs": [], |
| 178 | + "source": [ |
| 179 | + "plt.style.use('seaborn')\n", |
| 180 | + "matthews.plot(title='Annual count of births for name %s' % 'Matthew')" |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | + "cell_type": "markdown", |
| 185 | + "metadata": {}, |
| 186 | + "source": [ |
| 187 | + "Finally, we need one last function that plots the series and returns the resulting figure. It does just the bare minimum for now." |
| 188 | + ] |
| 189 | + }, |
| 190 | + { |
| 191 | + "cell_type": "code", |
| 192 | + "execution_count": null, |
| 193 | + "metadata": {}, |
| 194 | + "outputs": [], |
| 195 | + "source": [ |
| 196 | + "def create_births_figure(s, sex, name):\n", |
| 197 | + " plt.style.use('seaborn')\n", |
| 198 | + " sex_full = 'female'\n", |
| 199 | + " if sex == 'M':\n", |
| 200 | + " sex_full = 'male'\n", |
| 201 | + " plot = s.plot(title='Annual count of US %s births for name %s' % (sex_full, name))\n", |
| 202 | + " return plot.get_figure()" |
| 203 | + ] |
| 204 | + }, |
| 205 | + { |
| 206 | + "cell_type": "code", |
| 207 | + "execution_count": null, |
| 208 | + "metadata": {}, |
| 209 | + "outputs": [], |
| 210 | + "source": [ |
| 211 | + "fig = create_births_figure(matthews, 'M', 'Matthew')" |
| 212 | + ] |
| 213 | + }, |
| 214 | + { |
| 215 | + "cell_type": "code", |
| 216 | + "execution_count": null, |
| 217 | + "metadata": { |
| 218 | + "collapsed": true |
| 219 | + }, |
| 220 | + "outputs": [], |
| 221 | + "source": [] |
| 222 | + } |
| 223 | + ], |
| 224 | + "metadata": { |
| 225 | + "kernelspec": { |
| 226 | + "display_name": "Python 3", |
| 227 | + "language": "python", |
| 228 | + "name": "python3" |
| 229 | + }, |
| 230 | + "language_info": { |
| 231 | + "codemirror_mode": { |
| 232 | + "name": "ipython", |
| 233 | + "version": 3 |
| 234 | + }, |
| 235 | + "file_extension": ".py", |
| 236 | + "mimetype": "text/x-python", |
| 237 | + "name": "python", |
| 238 | + "nbconvert_exporter": "python", |
| 239 | + "pygments_lexer": "ipython3", |
| 240 | + "version": "3.6.1" |
| 241 | + } |
| 242 | + }, |
| 243 | + "nbformat": 4, |
| 244 | + "nbformat_minor": 2 |
| 245 | +} |
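
The loader in this notebook joins a Windows-style `data\names` directory to each file name with a forward slash, which works on Windows but ties the notebook to that platform. Below is a minimal sketch of a portable alternative, assuming the files keep the `yobYYYY.txt` naming used in the notebook. The helper names `yob_path` and `yob_paths` are hypothetical and not part of the original code.

```python
from pathlib import Path


def yob_path(dataset_dir, year):
    """Return the path of the data file for one birth year (hypothetical helper)."""
    # Path joins components with the separator of the current OS,
    # so no backslash or forward slash is hard-coded.
    return Path(dataset_dir) / f"yob{year}.txt"


def yob_paths(dataset_dir, begin_year, end_year):
    """Return one path per year, with both end years included (hypothetical helper)."""
    return [yob_path(dataset_dir, year) for year in range(begin_year, end_year + 1)]


# Directory layout assumed from the notebook's dataset_path variable.
paths = yob_paths(Path("data") / "names", 1880, 2016)
print(len(paths))      # number of yearly files the loader would read
print(paths[0].name)   # file name for the first year
```

Each resulting path can be passed directly to `pd.read_csv`, which accepts path objects, so `create_dataframe` would only need its `filename` line changed.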