Updates for 2019 course

zanderman12 · web-flow · commit 306de5879c43 · 2019-07-12T17:45:36.000-05:00
added seaborn for plotting, mapping and apply functions, and extended the example project
diff --git a/Example Project - Baby Names.ipynb b/Example Project - Baby Names.ipynb
@@ -27,16 +27,8 @@
     "import platform # some of the subsequent code depends on operating system\n",
     "\n",
     "import pandas as pd\n",
-    "import matplotlib.pyplot as plt"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%matplotlib inline"
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns"
    ]
   },
   {
@@ -147,100 +139,159 @@
     "df.head()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's explore this data a little. First, how many records do we have?"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "# How many records do we have?\n",
-    "len(df)"
+    "Now let's look at a specific name. Make a new dataframe that includes only your name and look at the first 5 rows."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now that we have the data in a dataframe, we want to move the year and sex columns into the index, leaving only columns for name and birth count. We can use the `set_index` method of the dataframe for this."
+    "Let's now look at some summary statistics for your name."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "df = df.set_index(keys=['year', 'sex'])\n",
-    "df.head()"
-   ]
+   "source": []
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now we need a function that, given a name and a sex, returns a series containing the number of births by year."
+    "When was your name at peak popularity?"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "def get_births_series(df, name, sex):\n",
-    "    single_sex_df = df.xs(sex, level='sex')\n",
-    "    return single_sex_df[single_sex_df.name == name]['births']"
+    "How can we convert the raw birth numbers into a percentage of all births that year? Let's make a new column for that."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "matthews = get_births_series(df, 'Matthew', 'M')\n",
-    "matthews.head()"
+    "Wow, some of these percentages are really small. Why don't we change the column to the number of births of a given name per million births that year?"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "plt.style.use('seaborn')\n",
-    "matthews.plot(title='Annual count of births for name %s' % 'Matthew')"
+    "Why don't we make a graph of how common your name is over the years?"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now one last function to output a plot of the series. Just the bare minimum for now."
+    "If your name is like mine, the line in the graph is surrounded by a shaded band indicating variance. Why would that be?\n",
+    "\n",
+    "\n",
+    "It's because this data is also split by gender, so a name can be listed twice in the same year, once for each gender. The gender split could be interesting though, so let's look at it graphically."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is actually a really good breakdown of different name trends by Tim Urban at https://waitbutwhy.com/2013/12/how-to-name-baby.html\n",
+    "\n",
+    "So let's use our code to take a quick look at a couple of the interesting trends he found."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "def create_births_figure(s, sex, name):\n",
-    "    plt.style.use('seaborn')\n",
-    "    sex_full = 'female'\n",
-    "    if sex == 'M':\n",
-    "        sex_full = 'male'\n",
-    "    plot = s.plot(title='Annual count of US %s births for name %s' % (sex_full, name))\n",
-    "    return plot.get_figure()"
+    "### Name Fads\n",
+    "\n",
+    "A name fad is when a specific name gets really popular for a specific generation, so that a person's age can be reasonably guessed from their name alone.\n",
+    "\n",
+    "Check out Jennifer, Ashley, or Shirley for some examples."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
    "source": [
-    "fig = create_births_figure(matthews, 'M', 'Matthew')"
+    "### Gender Takeovers\n",
+    "\n",
+    "Sometimes a name that is uncommon but used solely for one gender becomes extremely popular for the other gender, to the point that it stops being used for the original gender.\n",
+    "\n",
+    "Check out Lynn or Aubrey."
    ]
   },
   {
@@ -267,7 +318,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.7.3"
   }
  },
  "nbformat": 4,
diff --git a/Part 1 - Basics.ipynb b/Part 1 - Basics.ipynb
@@ -10,7 +10,7 @@
     "\n",
     "It is well suited to handle \"tabular\" data (that might be found in a spreadsheet), time series data, or pretty much anything you care to put in a matrix with rows and named columns.\n",
     "\n",
-    "It contains two primary data structures, the `Series` (1-dimensional) and the `DataFrame` (2-dimensional) as well as a host of convenience methods for loading and plotting data.\n",
+    "It contains two primary data structures, the `Series` (1-dimensional) and the `DataFrame` (2-dimensional) as well as a host of convenience methods for loading and working with data.\n",
     "\n",
     "The main thing that makes pandas pandas is that all data is *intrinsically aligned*. That means each data structure, `DataFrame` or `Series` has something called an **Index** that links data values with a label. That link will always be there (unless you explicitly break or change it) and it's what allows pandas to quickly and efficiently \"do the right thing\" when working with data."
    ]
@@ -38,7 +38,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "scrolled": true
+   },
    "outputs": [],
    "source": [
     "data = pd.Series([0.1, 0.2, 0.3, 0.4])\n",
@@ -58,15 +60,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "data.values"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "print(data.values)\n",
     "type(data.values)"
    ]
   },
@@ -143,7 +137,17 @@
    "source": [
     "# Item access works just like before, with square brackets, \n",
     "# even though the index values are strings\n",
-    "data['a']"
+    "data['a']\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# once you have string labels, you can also access values as attributes\n",
+    "# (assuming the label is a valid Python identifier and does not clash with a Series method)\n",
+    "data.a"
    ]
   },
   {
@@ -200,7 +204,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Above we see the critical difference between numpy arrays, which are always ordered sequentially and have an implicit integer index, and `Series` objects, which have an index that maps *labels* to *values*.\n",
+    "Remember that the `values` attribute (`data.values`) returns the underlying data as a numpy array. That means any indexing on it follows the numpy rules (which are based on position), not the pandas rules (which are based on index labels).\n",
     "\n",
     "`Series` are in fact a cross between a numpy array and a python dictionary. You can think of them as a dictionary with *typed* keys and *typed* values."
    ]
@@ -211,6 +215,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# in fact it is easy to convert a dictionary into a series\n",
     "max_depths_dict = {\n",
     "    'Erie': 64,\n",
     "    'Huron': 229,\n",
@@ -252,13 +257,6 @@
     "max_depths.index"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can think of an `Index` as an *immutable*, n-dimensional array. "
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -327,6 +325,16 @@
     "max_depths.mean()"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# and if you just want a bunch of standard summary statistics in one call\n",
+    "max_depths.describe()"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -743,8 +751,10 @@
     "# There is a potential problem with non-sequential integer indexes:\n",
     "data_implicit = pd.Series([100, 200, 300, 400])\n",
     "data_explicit = pd.Series([100, 200, 300, 400], index=[4, 9, 8, 1])\n",
+    "print('data_implicit')\n",
     "print(data_implicit)\n",
     "print()\n",
+    "print('data_explicit')\n",
     "print(data_explicit)"
    ]
   },
@@ -1020,7 +1030,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "max_depths.max()"
+    "lakes['Max Depth (m)'].max()"
    ]
   },
   {
@@ -1036,7 +1046,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "max_depths.sort_values(ascending=False).head(2)"
+    "lakes['Max Depth (m)'].sort_values(ascending=False).head(2)"
    ]
   },
   {
@@ -1106,7 +1116,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df.head(10)"
+    "df.head(5)"
    ]
   },
   {
@@ -1220,13 +1230,6 @@
     "df_no_nans = df.dropna(axis=0, how=\"any\")\n",
     "df_no_nans.head()"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
@@ -1245,7 +1248,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.5"
+   "version": "3.7.3"
   }
  },
  "nbformat": 4,
diff --git a/Part 2 - Grouping, Plotting, & Merging.ipynb b/Part 2 - Grouping, Plotting, & Merging.ipynb