Commit 306de58

Updates for 2019 course
added seaborn for plotting, mapping and apply functions, and extended the example project
1 parent a82ab50 commit 306de58

3 files changed

+202
-272
lines changed

Example Project - Baby Names.ipynb

Lines changed: 86 additions & 35 deletions
Original file line number | Diff line number | Diff line change
@@ -27,16 +27,8 @@
2727
"import platform # some of the subsequent code depends on operating system\n",
2828
"\n",
2929
"import pandas as pd\n",
30-
"import matplotlib.pyplot as plt"
31-
]
32-
},
33-
{
34-
"cell_type": "code",
35-
"execution_count": null,
36-
"metadata": {},
37-
"outputs": [],
38-
"source": [
39-
"%matplotlib inline"
30+
"import matplotlib.pyplot as plt\n",
31+
"import seaborn as sns"
4032
]
4133
},
4234
{
@@ -147,100 +139,159 @@
147139
"df.head()"
148140
]
149141
},
142+
{
143+
"cell_type": "markdown",
144+
"metadata": {},
145+
"source": [
146+
"Now let's explore this data a little. First, how many records do we have?"
147+
]
148+
},
150149
{
151150
"cell_type": "code",
152151
"execution_count": null,
153152
"metadata": {},
154153
"outputs": [],
154+
"source": []
155+
},
156+
{
157+
"cell_type": "markdown",
158+
"metadata": {},
155159
"source": [
156-
"# How many records do we have?\n",
157-
"len(df)"
160+
"Now let's look at a specific name. Let's make a new dataframe that includes only your name and look at the first 5 rows."
158161
]
159162
},
163+
{
164+
"cell_type": "code",
165+
"execution_count": null,
166+
"metadata": {
167+
"scrolled": true
168+
},
169+
"outputs": [],
170+
"source": []
171+
},
160172
{
161173
"cell_type": "markdown",
162174
"metadata": {},
163175
"source": [
164-
"Now that we have the data in a dataframe, we want to move the year and sex columns into the index, leaving only columns for name and birth count. We can use the `set_index` method of the dataframe for this."
176+
"Let's now look at some stats for your name."
165177
]
166178
},
167179
{
168180
"cell_type": "code",
169181
"execution_count": null,
170182
"metadata": {},
171183
"outputs": [],
172-
"source": [
173-
"df = df.set_index(keys=['year', 'sex'])\n",
174-
"df.head()"
175-
]
184+
"source": []
176185
},
177186
{
178187
"cell_type": "markdown",
179188
"metadata": {},
180189
"source": [
181-
"Now we need a function that, given a name and a sex, returns a series containing the number of births by year."
190+
"When was your name at peak popularity?"
182191
]
183192
},
184193
{
185194
"cell_type": "code",
186195
"execution_count": null,
187196
"metadata": {},
188197
"outputs": [],
198+
"source": []
199+
},
200+
{
201+
"cell_type": "markdown",
202+
"metadata": {},
189203
"source": [
190-
"def get_births_series(df, name, sex):\n",
191-
" single_sex_df = df.xs(sex, level='sex')\n",
192-
" return single_sex_df[single_sex_df.name == name]['births']"
204+
"How can we convert the raw birth numbers into the percent of births that year? Let's make a new column for that."
193205
]
194206
},
195207
{
196208
"cell_type": "code",
197209
"execution_count": null,
198210
"metadata": {},
199211
"outputs": [],
212+
"source": []
213+
},
214+
{
215+
"cell_type": "markdown",
216+
"metadata": {},
200217
"source": [
201-
"matthews = get_births_series(df, 'Matthew', 'M')\n",
202-
"matthews.head()"
218+
"Wow, some of these percentages are really small. Why don't we change it to the number of births of a given name per million births that year?"
203219
]
204220
},
205221
{
206222
"cell_type": "code",
207223
"execution_count": null,
208224
"metadata": {},
209225
"outputs": [],
226+
"source": []
227+
},
228+
{
229+
"cell_type": "markdown",
230+
"metadata": {},
210231
"source": [
211-
"plt.style.use('seaborn')\n",
212-
"matthews.plot(title='Annual count of births for name %s' % 'Matthew')"
232+
"Why don't we make a graph of how common your name is over the years?"
213233
]
214234
},
235+
{
236+
"cell_type": "code",
237+
"execution_count": null,
238+
"metadata": {},
239+
"outputs": [],
240+
"source": []
241+
},
215242
{
216243
"cell_type": "markdown",
217244
"metadata": {},
218245
"source": [
219-
"Now one last function to output a plot of the series. Just the bare minimum for now."
246+
"If your name is like mine, there is actually a bunch of shading indicating variance. Why would that be?\n",
247+
"\n",
248+
"\n",
249+
"It's because this data is also split on gender, so a name can be listed twice in the same year, once for each gender. The gender split could be interesting though, so let's look at it graphically."
220250
]
221251
},
222252
{
223253
"cell_type": "code",
224254
"execution_count": null,
225255
"metadata": {},
226256
"outputs": [],
257+
"source": []
258+
},
259+
{
260+
"cell_type": "markdown",
261+
"metadata": {},
262+
"source": [
263+
"There is actually a really good breakdown of different name trends by Tim Urban at https://waitbutwhy.com/2013/12/how-to-name-baby.html\n",
264+
"\n",
265+
"So let's look quickly at a couple of the interesting trends he found, using our code."
266+
]
267+
},
268+
{
269+
"cell_type": "markdown",
270+
"metadata": {},
227271
"source": [
228-
"def create_births_figure(s, sex, name):\n",
229-
" plt.style.use('seaborn')\n",
230-
" sex_full = 'female'\n",
231-
" if sex == 'M':\n",
232-
" sex_full = 'male'\n",
233-
" plot = s.plot(title='Annual count of US %s births for name %s' % (sex_full, name))\n",
234-
" return plot.get_figure()"
272+
"### Name Fads\n",
273+
"\n",
274+
"A name fad is when a specific name gets really popular for a specific generation, so that a person's age can be reasonably guessed from their name alone.\n",
275+
"\n",
276+
"Check out Jennifer, Ashley, or Shirley for some examples."
235277
]
236278
},
237279
{
238280
"cell_type": "code",
239281
"execution_count": null,
240282
"metadata": {},
241283
"outputs": [],
284+
"source": []
285+
},
286+
{
287+
"cell_type": "markdown",
288+
"metadata": {},
242289
"source": [
243-
"fig = create_births_figure(matthews, 'M', 'Matthew')"
290+
"### Gender Takeovers\n",
291+
"\n",
292+
"Sometimes a name that is uncommon but used solely for one gender becomes extremely popular for the other gender, to the point that the original gender stops using it.\n",
293+
"\n",
294+
"Check out Lynn or Aubrey."
244295
]
245296
},
246297
{
@@ -267,7 +318,7 @@
267318
"name": "python",
268319
"nbconvert_exporter": "python",
269320
"pygments_lexer": "ipython3",
270-
"version": "3.6.5"
321+
"version": "3.7.3"
271322
}
272323
},
273324
"nbformat": 4,
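
Note on the empty exercise cells in this notebook: the markdown prompts ask for a per-name dataframe, a "per million births" column, and the year of peak popularity. The following is a minimal sketch of one way to fill them in, not the course's own solution. It assumes the dataframe has the columns `year`, `sex`, `name`, and `births` (the names used by the removed `set_index` and `get_births_series` code), and it uses a tiny made-up sample instead of the real data file. The name "Jordan" and all numbers are illustrative only.

```python
import pandas as pd

# Tiny made-up sample with the same columns as the notebook's dataframe
# (year, sex, name, births). The numbers are illustrative, not real data.
df = pd.DataFrame({
    "year":   [1990, 1990, 1990, 1991, 1991, 1991],
    "sex":    ["F", "M", "M", "F", "M", "M"],
    "name":   ["Jordan", "Jordan", "Matthew", "Jordan", "Jordan", "Matthew"],
    "births": [1000, 3000, 6000, 3000, 2000, 5000],
})

# Total births recorded in each year, repeated on every row of that year.
births_in_year = df.groupby("year")["births"].transform("sum")

# Births for this name (and sex) per million births that year.
df["per_million"] = df["births"] / births_in_year * 1_000_000

# A new dataframe that includes only one name.
jordan = df[df["name"] == "Jordan"]

# The name appears twice per year here, once for each sex.
rows_per_year = jordan.groupby("year").size()

# Year of peak popularity, summing the counts for both sexes.
peak_year = jordan.groupby("year")["births"].sum().idxmax()

print(jordan.head())
print(rows_per_year)
print(peak_year)
```

The two rows per year are what the "shading" prompt refers to: a seaborn line plot aggregates repeated observations at the same x value and draws a band around the aggregate, so plotting `jordan` by year without separating the sexes would show that band.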

Part 1 - Basics.ipynb

Lines changed: 34 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
"\n",
1111
"It is well suited to handle \"tabular\" data (that might be found in a spreadsheet), time series data, or pretty much anything you care to put in a matrix with rows and named columns.\n",
1212
"\n",
13-
"It contains two primary data structures, the `Series` (1-dimensional) and the `DataFrame` (2-dimensional) as well as a host of convenience methods for loading and plotting data.\n",
13+
"It contains two primary data structures, the `Series` (1-dimensional) and the `DataFrame` (2-dimensional) as well as a host of convenience methods for loading and working with data.\n",
1414
"\n",
1515
"The main thing that makes pandas pandas is that all data is *intrinsically aligned*. That means each data structure, `DataFrame` or `Series` has something called an **Index** that links data values with a label. That link will always be there (unless you explicitly break or change it) and it's what allows pandas to quickly and efficiently \"do the right thing\" when working with data."
1616
]
@@ -38,7 +38,9 @@
3838
{
3939
"cell_type": "code",
4040
"execution_count": null,
41-
"metadata": {},
41+
"metadata": {
42+
"scrolled": true
43+
},
4244
"outputs": [],
4345
"source": [
4446
"data = pd.Series([0.1, 0.2, 0.3, 0.4])\n",
@@ -58,15 +60,7 @@
5860
"metadata": {},
5961
"outputs": [],
6062
"source": [
61-
"data.values"
62-
]
63-
},
64-
{
65-
"cell_type": "code",
66-
"execution_count": null,
67-
"metadata": {},
68-
"outputs": [],
69-
"source": [
63+
"print(data.values)\n",
7064
"type(data.values)"
7165
]
7266
},
@@ -143,7 +137,17 @@
143137
"source": [
144138
"# Item access works just like before, with square brackets, \n",
145139
"# even though the index values are strings\n",
146-
"data['a']"
140+
"data['a']\n"
141+
]
142+
},
143+
{
144+
"cell_type": "code",
145+
"execution_count": null,
146+
"metadata": {},
147+
"outputs": [],
148+
"source": [
149+
"# once you have string labels, you can also access values as attributes (assuming the label has no spaces and does not clash with a Series method)\n",
150+
"data.a"
147151
]
148152
},
149153
{
@@ -200,7 +204,7 @@
200204
"cell_type": "markdown",
201205
"metadata": {},
202206
"source": [
203-
"Above we see the critical difference between numpy arrays, which are always ordered sequentially and have an implicit integer index, and `Series` objects, which have an index that maps *labels* to *values*.\n",
207+
"Remember that the `values` attribute (`data.values`) returns the data as a numpy array. That means any indexing on it follows the numpy rules (based on position), not the pandas rules (based on index labels).\n",
204208
"\n",
205209
"`Series` are in fact a cross between a numpy array and a python dictionary. You can think of them as a dictionary with *typed* keys and *typed* values."
206210
]
@@ -211,6 +215,7 @@
211215
"metadata": {},
212216
"outputs": [],
213217
"source": [
218+
"# in fact it is easy to convert a dictionary into a series\n",
214219
"max_depths_dict = {\n",
215220
" 'Erie': 64,\n",
216221
" 'Huron': 229,\n",
@@ -252,13 +257,6 @@
252257
"max_depths.index"
253258
]
254259
},
255-
{
256-
"cell_type": "markdown",
257-
"metadata": {},
258-
"source": [
259-
"We can think of an `Index` as an *immutable*, n-dimensional array. "
260-
]
261-
},
262260
{
263261
"cell_type": "markdown",
264262
"metadata": {},
@@ -327,6 +325,16 @@
327325
"max_depths.mean()"
328326
]
329327
},
328+
{
329+
"cell_type": "code",
330+
"execution_count": null,
331+
"metadata": {},
332+
"outputs": [],
333+
"source": [
334+
"# and if you just want a bunch of standard summary stats\n",
335+
"max_depths.describe()"
336+
]
337+
},
330338
{
331339
"cell_type": "markdown",
332340
"metadata": {},
@@ -743,8 +751,10 @@
743751
"# There is a potential problem with non-sequential integer indexes:\n",
744752
"data_implicit = pd.Series([100, 200, 300, 400])\n",
745753
"data_explicit = pd.Series([100, 200, 300, 400], index=[4, 9, 8, 1])\n",
754+
"print('data_implicit')\n",
746755
"print(data_implicit)\n",
747756
"print()\n",
757+
"print('data_explicit')\n",
748758
"print(data_explicit)"
749759
]
750760
},
@@ -1020,7 +1030,7 @@
10201030
"metadata": {},
10211031
"outputs": [],
10221032
"source": [
1023-
"max_depths.max()"
1033+
"lakes['Max Depth (m)'].max()"
10241034
]
10251035
},
10261036
{
@@ -1036,7 +1046,7 @@
10361046
"metadata": {},
10371047
"outputs": [],
10381048
"source": [
1039-
"max_depths.sort_values(ascending=False).head(2)"
1049+
"lakes['Max Depth (m)'].sort_values(ascending=False).head(2)"
10401050
]
10411051
},
10421052
{
@@ -1106,7 +1116,7 @@
11061116
"metadata": {},
11071117
"outputs": [],
11081118
"source": [
1109-
"df.head(10)"
1119+
"df.head(5)"
11101120
]
11111121
},
11121122
{
@@ -1220,13 +1230,6 @@
12201230
"df_no_nans = df.dropna(axis=0, how=\"any\")\n",
12211231
"df_no_nans.head()"
12221232
]
1223-
},
1224-
{
1225-
"cell_type": "code",
1226-
"execution_count": null,
1227-
"metadata": {},
1228-
"outputs": [],
1229-
"source": []
12301233
}
12311234
],
12321235
"metadata": {
@@ -1245,7 +1248,7 @@
12451248
"name": "python",
12461249
"nbconvert_exporter": "python",
12471250
"pygments_lexer": "ipython3",
1248-
"version": "3.6.5"
1251+
"version": "3.7.3"
12491252
}
12501253
},
12511254
"nbformat": 4,
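
Note on the indexing changes in this notebook: the reworded markdown cell contrasts positional indexing on `data.values` with label-based indexing on the `Series`, and a new cell adds attribute access (`data.a`). The sketch below illustrates both, reusing the `data_explicit` series from the diff. The `.loc` and `.iloc` accessors and the `labeled` series are not part of the lines shown here; they are added only to make the label/position distinction explicit.

```python
import pandas as pd

# Same series as in the notebook: integer labels that are not sequential.
data_explicit = pd.Series([100, 200, 300, 400], index=[4, 9, 8, 1])

# Square brackets on a Series with an integer index look up the LABEL 4.
by_label = data_explicit[4]

# .values is a numpy array, so indexing it is purely positional.
by_position = data_explicit.values[3]

# .loc (label) and .iloc (position) say which rule is meant.
loc_value = data_explicit.loc[4]
iloc_value = data_explicit.iloc[3]

# Attribute access works for string labels that are valid identifiers.
# `labeled` is a small illustrative series, not one from the notebook.
labeled = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"])
same_value = labeled.a == labeled["a"]

print(by_label, by_position, loc_value, iloc_value, same_value)
```

Square-bracket access stays correct when a label contains spaces or matches a method name such as `mean`, which is why it is usually the safer default over attribute access.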
