diff --git a/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb b/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb
new file mode 100644
index 00000000..b549c8aa
--- /dev/null
+++ b/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb
@@ -0,0 +1,3211 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "view-in-github"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "myiDKzkSIAUs"
+ },
+ "source": [
+ "Copyright 2025 Google LLC.\n",
+ "SPDX-License-Identifier: Apache-2.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sbPgBMt01mSB"
+ },
+ "source": [
+ "# Regression: Evaluation and Interpretation\n",
+ "In the [previous notebook](https://github.com/datacommonsorg/api-python/blob/master/notebooks/v2/intro_data_science/Regression_Basics_and_Prediction.ipynb), we saw how powerful regression can be as a tool for prediction. In this Colab, we'll take that exploration one step further: what can regression models tell us about the statistical relationships between variables?\n",
+ "\n",
+ "In particular, this Colab takes a more rigorous statistical approach to regression. We'll look at how to evaluate and interpret our regression models using statistical methods.\n",
+ "\n",
+ "## Learning objectives:\n",
+ "* Hypothesis testing with regression\n",
+ "* Regression tables\n",
+ "* Pearson correlation coefficient, $r$\n",
+ "* $R^2$ and adjusted $R^2$\n",
+ "* Interpreting weights and intercepts\n",
+ "* How correlated variables affect models\n",
+ "---\n",
+ "**Need extra help?**\n",
+ "\n",
+ "If you're new to Google Colab, take a look at [this getting started tutorial](https://colab.research.google.com/notebooks/intro.ipynb).\n",
+ "\n",
+ "To build more familiarity with the Data Commons API, check out these [Data Commons tutorials](https://docs.datacommons.org/api/python/v2/tutorials.md).\n",
+ "\n",
+ "And for help with Pandas and manipulating data frames, take a look at the [Pandas documentation](https://pandas.pydata.org/docs/reference/index.html).\n",
+ "\n",
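+ "We'll be using the statsmodels and scikit-learn libraries for implementing and evaluating our models today. Documentation can be found [here for statsmodels](https://www.statsmodels.org/stable/index.html) and [here for scikit-learn](https://scikit-learn.org/stable/modules/classes.html).\n",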
+ "\n",
+ "As usual, if you have any other questions, please reach out to your course staff!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gnoowEYUIQS-"
+ },
+ "source": [
+ "## Getting set up\n",
+ "\n",
+ "\n",
+ "Run the following code boxes to load the Python libraries and data we'll be using today."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YkuB0EIS59qX"
+ },
+ "outputs": [],
+ "source": [
+ "# Setup/Imports\n",
+ "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "u2oFQ7-v8sxY"
+ },
+ "outputs": [],
+ "source": [
+ "# Data Commons Python and Pandas APIs\n",
+ "from datacommons_client.client import DataCommonsClient\n",
+ "client = DataCommonsClient(api_key=\"your API key\")\n",
+ "\n",
+ "# For manipulating data\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# For implementing models and evaluation methods\n",
+ "from sklearn import linear_model\n",
+ "from sklearn.metrics import r2_score, mean_squared_error\n",
+ "from statsmodels import api as sm\n",
+ "\n",
+ "\n",
+ "# For plotting/printing\n",
+ "from matplotlib import pyplot as plt\n",
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WQlAj-wYjI9L"
+ },
+ "source": [
+ "### The data\n",
+ "\n",
+ "In this assignment, we'll return to the scenario we started in the previous notebook. As a refresher, we'll be exploring how obesity rates vary with different health and lifestyle factors across US cities.\n",
+ "\n",
+ "Our data science question: **What can we learn about the relationship of those health and lifestyle factors to obesity rates?**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 492
+ },
+ "id": "VuA3xgSQXhK6",
+ "outputId": "6c43f4e1-6424-4058-f681-6cf31b6fb05a"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "summary": "{\n \"name\": \"df\",\n \"rows\": 498,\n \"fields\": [\n {\n \"column\": \"place\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 498,\n \"samples\": [\n \"geoId/5363000\",\n \"geoId/0639892\",\n \"geoId/1714351\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"City Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 475,\n \"samples\": [\n \"Memphis\",\n \"Plano\",\n \"Avondale\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 471628.1194557695,\n \"min\": 76212.0,\n \"max\": 8258035.0,\n \"num_unique_values\": 498,\n \"samples\": [\n 755078.0,\n 78135.0,\n 81004.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.3675302626575006,\n \"min\": 14.1,\n \"max\": 48.9,\n \"num_unique_values\": 220,\n \"samples\": [\n 33.4,\n 41.6,\n 23.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.79401923130349,\n \"min\": 11.2,\n \"max\": 41.8,\n \"num_unique_values\": 209,\n \"samples\": [\n 25.0,\n 39.5,\n 18.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.3313227959895455,\n \"min\": 24.9,\n \"max\": 49.5,\n \"num_unique_values\": 166,\n \"samples\": [\n 42.5,\n 25.6,\n 28.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.56550336745893,\n \"min\": 21.3,\n \"max\": 45.7,\n \"num_unique_values\": 170,\n \"samples\": [\n 40.3,\n 25.6,\n 41.2\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.089525781869934,\n \"min\": 24.6,\n \"max\": 35.6,\n \"num_unique_values\": 95,\n \"samples\": [\n 25.9,\n 34.1,\n 27.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.1019573952670365,\n \"min\": 11.5,\n \"max\": 23.3,\n \"num_unique_values\": 103,\n \"samples\": [\n 15.8,\n 16.3,\n 12.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
+ "type": "dataframe",
+ "variable_name": "df"
+ },
+ "text/plain": [
+ "variable City Name Count_Person Percent_Person_Obesity \\\n",
+ "place \n",
+ "geoId/0103076 Auburn 82025.0 33.0 \n",
+ "geoId/0107000 Birmingham 196644.0 44.9 \n",
+ "geoId/0135896 Hoover 92448.0 32.5 \n",
+ "geoId/0137000 Huntsville 225564.0 37.5 \n",
+ "geoId/0150000 Mobile 182595.0 44.2 \n",
+ "... ... ... ... \n",
+ "geoId/5531000 Green Bay 105744.0 38.9 \n",
+ "geoId/5539225 Kenosha 98211.0 43.7 \n",
+ "geoId/5548000 Madison 280305.0 32.1 \n",
+ "geoId/5553000 Milwaukee 561385.0 43.4 \n",
+ "geoId/5566000 Racine 76602.0 42.9 \n",
+ "\n",
+ "variable Percent_Person_PhysicalInactivity \\\n",
+ "place \n",
+ "geoId/0103076 23.6 \n",
+ "geoId/0107000 32.9 \n",
+ "geoId/0135896 19.7 \n",
+ "geoId/0137000 24.0 \n",
+ "geoId/0150000 28.7 \n",
+ "... ... \n",
+ "geoId/5531000 26.7 \n",
+ "geoId/5539225 23.8 \n",
+ "geoId/5548000 18.7 \n",
+ "geoId/5553000 28.8 \n",
+ "geoId/5566000 28.2 \n",
+ "\n",
+ "variable Percent_Person_SleepLessThan7Hours \\\n",
+ "place \n",
+ "geoId/0103076 36.0 \n",
+ "geoId/0107000 42.9 \n",
+ "geoId/0135896 33.6 \n",
+ "geoId/0137000 40.0 \n",
+ "geoId/0150000 43.4 \n",
+ "... ... \n",
+ "geoId/5531000 33.1 \n",
+ "geoId/5539225 36.6 \n",
+ "geoId/5548000 29.9 \n",
+ "geoId/5553000 40.0 \n",
+ "geoId/5566000 39.2 \n",
+ "\n",
+ "variable Percent_Person_WithHighBloodPressure \\\n",
+ "place \n",
+ "geoId/0103076 34.3 \n",
+ "geoId/0107000 45.0 \n",
+ "geoId/0135896 32.6 \n",
+ "geoId/0137000 36.5 \n",
+ "geoId/0150000 39.8 \n",
+ "... ... \n",
+ "geoId/5531000 28.1 \n",
+ "geoId/5539225 29.9 \n",
+ "geoId/5548000 26.6 \n",
+ "geoId/5553000 36.7 \n",
+ "geoId/5566000 32.3 \n",
+ "\n",
+ "variable Percent_Person_WithHighCholesterol \\\n",
+ "place \n",
+ "geoId/0103076 30.6 \n",
+ "geoId/0107000 31.6 \n",
+ "geoId/0135896 31.0 \n",
+ "geoId/0137000 31.6 \n",
+ "geoId/0150000 32.5 \n",
+ "... ... \n",
+ "geoId/5531000 30.7 \n",
+ "geoId/5539225 30.0 \n",
+ "geoId/5548000 28.5 \n",
+ "geoId/5553000 30.1 \n",
+ "geoId/5566000 32.0 \n",
+ "\n",
+ "variable Percent_Person_WithMentalHealthNotGood \n",
+ "place \n",
+ "geoId/0103076 17.8 \n",
+ "geoId/0107000 19.7 \n",
+ "geoId/0135896 15.4 \n",
+ "geoId/0137000 18.0 \n",
+ "geoId/0150000 19.9 \n",
+ "... ... \n",
+ "geoId/5531000 17.9 \n",
+ "geoId/5539225 18.6 \n",
+ "geoId/5548000 15.6 \n",
+ "geoId/5553000 19.0 \n",
+ "geoId/5566000 18.4 \n",
+ "\n",
+ "[498 rows x 8 columns]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Load the data we'll be using\n",
+ "\n",
+ "# Fetch the population of the US cities\n",
+ "city_pop = client.observation.fetch_observations_by_entity_type(\n",
+ " date=\"latest\",\n",
+ " parent_entity=\"country/USA\",\n",
+ " entity_type=\"City\",\n",
+ " variable_dcids=\"Count_Person\",\n",
+ " filter_facet_ids=\"2176550201\" # USCensusPEP_Annual_Population\n",
+ ").byVariable[\"Count_Person\"].byEntity\n",
+ "city_pop_dict = {\n",
+ " city: data[\"orderedFacets\"][0].observations[0].value\n",
+ " for city, data in city_pop.items()\n",
+ " }\n",
+ "\n",
+ "# Filter to the 500 most populous cities\n",
+ "cities = [\n",
+ " item[0]\n",
+ " for item in sorted(\n",
+ " city_pop_dict.items(),\n",
+ " key=lambda item: item[1],\n",
+ " reverse=True)[:500]\n",
+ " ]\n",
+ "\n",
+ "# We've compiled a list of some nice Data Commons Statistical Variables\n",
+ "# to use as features for you\n",
+ "stat_vars_to_query = [\n",
+ " \"Count_Person\",\n",
+ " \"Percent_Person_PhysicalInactivity\",\n",
+ " \"Percent_Person_SleepLessThan7Hours\",\n",
+ " \"Percent_Person_WithHighBloodPressure\",\n",
+ " \"Percent_Person_WithMentalHealthNotGood\",\n",
+ " \"Percent_Person_WithHighCholesterol\",\n",
+ " \"Percent_Person_Obesity\"\n",
+ "\n",
+ "]\n",
+ "\n",
+ "# Query Data Commons for the data\n",
+ "raw_features_df = client.observations_dataframe(\n",
+ " variable_dcids=stat_vars_to_query,\n",
+ " date=\"latest\",\n",
+ " entity_dcids=cities)\n",
+ "\n",
+ "# Filter to highest ranked facet for each entity and variable\n",
+ "df = raw_features_df.copy(deep=True)\n",
+ "df = df.groupby([\"entity\", \"entity_name\", \"variable\"]).first().reset_index()\n",
+ "\n",
+ "# Select required columns and pivot by variable\n",
+ "df = df[[\"entity\", \"entity_name\", \"variable\", \"value\"]]\n",
+ "df = df.pivot(index=[\"entity\", \"entity_name\"], columns=\"variable\", values=\"value\")\n",
+ "df = df.dropna()\n",
+ "\n",
+ "# Rename columns and order alphabetically\n",
+ "df = df.reset_index()\n",
+ "df.rename(columns={\"entity\":\"place\", \"entity_name\": \"City Name\"}, inplace=True)\n",
+ "df.set_index(\"place\", inplace=True)\n",
+ "df = df.reindex(sorted(df.columns), axis=1)\n",
+ "\n",
+ "# Display results\n",
+ "display(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TbvPpqYDiGmY"
+ },
+ "source": [
+ "### The model\n",
+ "\n",
+ "Run the following code box to fit an [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression model to our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "V9x5v10LwYZG"
+ },
+ "outputs": [],
+ "source": [
+ "# Fit a regression model\n",
+ "dep_var = \"Percent_Person_Obesity\"\n",
+ "y = df[dep_var].to_numpy().reshape(-1, 1)\n",
+ "x = df.loc[:, ~df.columns.isin([dep_var, \"City Name\"])]\n",
+ "x = sm.add_constant(x)\n",
+ "\n",
+ "\n",
+ "model = sm.OLS(y, x)\n",
+ "results = model.fit()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_h32-nQChkE6"
+ },
+ "source": [
+ "## 0) Regression tables\n",
+ "\n",
+ "When performing regression analyses, statistical packages will usually provide a _**regression table**_, which summarizes the results of the analysis.\n",
+ "\n",
+ "Run the following code box to display the regression table for the model we just fit. In this Colab, we'll go over some of the statistics included in the table.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "KvCfUveghpcJ",
+ "outputId": "be61942c-174a-405d-b317-43439e1363b0"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " OLS Regression Results \n",
+ "==============================================================================\n",
+ "Dep. Variable: y R-squared: 0.758\n",
+ "Model: OLS Adj. R-squared: 0.755\n",
+ "Method: Least Squares F-statistic: 256.0\n",
+ "Date: Thu, 22 May 2025 Prob (F-statistic): 1.18e-147\n",
+ "Time: 16:02:06 Log-Likelihood: -1275.0\n",
+ "No. Observations: 498 AIC: 2564.\n",
+ "Df Residuals: 491 BIC: 2593.\n",
+ "Df Model: 6 \n",
+ "Covariance Type: nonrobust \n",
+ "==========================================================================================================\n",
+ " coef std err t P>|t| [0.025 0.975]\n",
+ "----------------------------------------------------------------------------------------------------------\n",
+ "const -0.1937 2.656 -0.073 0.942 -5.412 5.024\n",
+ "Count_Person -7.183e-07 3.02e-07 -2.381 0.018 -1.31e-06 -1.25e-07\n",
+ "Percent_Person_PhysicalInactivity 0.3053 0.046 6.577 0.000 0.214 0.396\n",
+ "Percent_Person_SleepLessThan7Hours -0.1246 0.057 -2.192 0.029 -0.236 -0.013\n",
+ "Percent_Person_WithHighBloodPressure 0.7572 0.054 14.062 0.000 0.651 0.863\n",
+ "Percent_Person_WithHighCholesterol -0.1352 0.077 -1.756 0.080 -0.286 0.016\n",
+ "Percent_Person_WithMentalHealthNotGood 0.6901 0.103 6.704 0.000 0.488 0.892\n",
+ "==============================================================================\n",
+ "Omnibus: 2.835 Durbin-Watson: 1.396\n",
+ "Prob(Omnibus): 0.242 Jarque-Bera (JB): 2.651\n",
+ "Skew: 0.133 Prob(JB): 0.266\n",
+ "Kurtosis: 3.239 Cond. No. 9.89e+06\n",
+ "==============================================================================\n",
+ "\n",
+ "Notes:\n",
+ "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
+ "[2] The condition number is large, 9.89e+06. This might indicate that there are\n",
+ "strong multicollinearity or other numerical problems.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(results.summary())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aFi0LWX0OlwA"
+ },
+ "source": [
+ "## 1) Hypothesis testing\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DrYP7lZ9cyHL"
+ },
+ "source": [
+ "### 1.1) Null hypotheses\n",
+ "\n",
+ "When performing statistical analyses, one usually starts by stating the null hypothesis. For regression models, the null hypothesis for each variable is typically that its coefficient equals zero.\n",
+ "\n",
+ "**1.1)** Write out the null hypotheses for each of our independent variables."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wAOpaTGXeiQb"
+ },
+ "source": [
+ "### 1.2) T-test\n",
+ "\n",
+ "So how do we test our null hypotheses? We use the [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Slope_of_a_regression_line).\n",
+ "\n",
+ "Take a look at the regression table above to answer the following questions:\n",
+ "\n",
+ "**1.2A)** According to the t-test, which variables are statistically significant?\n",
+ "\n",
+ "**1.2B)** For variables that are not statistically significant, should we keep them in our model? Why or why not?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "exV8u07Bek4z"
+ },
+ "source": [
+ "### 1.3) F-test\n",
+ "\n",
+ "Beyond testing the significance of each variable individually, we can also test the significance of our model overall using the [F-test](https://en.wikipedia.org/wiki/F-test#Regression_problems). In particular, the F-test compares our model to one without predictors (i.e., just an intercept). In other words, does our model do significantly better than just predicting the mean?\n",
+ "\n",
+ "Again, use the regression table above to answer the following questions:\n",
+ "\n",
+ "**1.3A)** What is the null hypothesis for the F-test?\n",
+ "\n",
+ "**1.3B)** Can we reject the null hypothesis for our model?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gpk0RO17VJXz"
+ },
+ "source": [
+ "## 2) Statistical measures"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RS-U1bdrl-c2"
+ },
+ "source": [
+ "### 2.1) Correlation coefficient $r$\n",
+ "\n",
+ "We can quantify how predictive one variable is of another using a _correlation coefficient_, a number that represents the degree to which two variables have a statistical relationship. The most common correlation coefficient is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), also known as _Pearson's r_, which measures the strength and direction of the linear relationship between two variables.\n",
+ "\n",
+ "Mathematically, the correlation coefficient is defined as:\n",
+ "$$ r = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_i (x_i - \\bar{x})^2}\\sqrt{\\sum_i (y_i - \\bar{y})^2}}\n",
+ "$$\n",
+ "\n",
+ "where $x$ and $y$ are the two variables.\n",
+ "\n",
+ "Those of you with a statistics background might recognize this as the ratio of the covariance of $x$ and $y$ to the product of their standard deviations.\n",
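+ "\n",
+ "As a small worked illustration (the numbers here are made up for this example and are not drawn from our dataset), take $x = (1, 2, 3)$ and $y = (1, 3, 2)$, so that $\\bar{x} = \\bar{y} = 2$. Plugging the deviations from the mean into the definition gives:\n",
+ "\n",
+ "$$ r = \\frac{(-1)(-1) + (0)(1) + (1)(0)}{\\sqrt{(-1)^2 + 0^2 + 1^2}\\sqrt{(-1)^2 + 1^2 + 0^2}} = \\frac{1}{\\sqrt{2}\\sqrt{2}} = 0.5\n",
+ "$$\n",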
+ "\n",
+ "**2.1A)** Either using the mathematical definition or by exploring with code, explain what the correlation coefficient would be in the following cases:\n",
+ "\n",
+ "A) $x = y$\n",
+ "\n",
+ "B) $x = -y$\n",
+ "\n",
+ "C) $x$ and $y$ are both normally distributed variables with mean 0 and variance 1, randomly sampled independently from each other."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "id": "FwEnQEWjMQv5",
+ "outputId": "9803eadf-dc00-4c69-b337-ef0b5ed92fad"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ },
+ "text/plain": [
+ "'\\nOptional cell for 2.1A\\n'"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\"\"\n",
+ "Optional cell for 2.1A\n",
+ "\"\"\"\n",
+ "\n",
+ "# Hint: Try writing code to generate values for x and y, then either write or import\n",
+ "# a function to calculate the correlation coefficient\n",
+ "\n",
+ "# Your code here"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mPxYc8tNMq52"
+ },
+ "source": [
+ "Now run the following code box to use the Pandas `.corr()` method to calculate the correlation coefficient between each pair of our variables. Note that Pandas outputs the results as a matrix."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 320
+ },
+ "id": "TKrIjyt657ir",
+ "outputId": "80ea1cfc-18ec-49bd-8c21-5a0e33502bdf"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "summary": "{\n \"name\": \"df[stat_vars_to_query]\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"variable\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Count_Person\",\n \"Percent_Person_PhysicalInactivity\",\n \"Percent_Person_WithHighCholesterol\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3692768160533883,\n \"min\": -0.032606435112257866,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 1.0,\n 0.05966842357472978,\n 0.04809963828980299\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3039891315555219,\n \"min\": 0.05966842357472978,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.05966842357472978,\n 1.0,\n 0.4366429779497951\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3018122804213848,\n \"min\": 0.07380691021769277,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.07380691021769277,\n 0.7788343765257514,\n 0.3694331027620301\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.32460607938114927,\n \"min\": 0.025619158392611367,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.025619158392611367,\n 0.7446432625492557,\n 0.381625664582876\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.34286073197902867,\n \"min\": -0.006579247299365092,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 
-0.006579247299365092,\n 0.7007758800234068,\n 0.21400402260098858\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.29745289241877604,\n \"min\": 0.04809963828980299,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.04809963828980299,\n 0.4366429779497951,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3527430213065325,\n \"min\": -0.032606435112257866,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n -0.032606435112257866,\n 0.7531559280354309,\n 0.29900147207085953\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
+ "type": "dataframe"
+ },
+ "text/plain": [
+ "variable Count_Person \\\n",
+ "variable \n",
+ "Count_Person 1.000000 \n",
+ "Percent_Person_PhysicalInactivity 0.059668 \n",
+ "Percent_Person_SleepLessThan7Hours 0.073807 \n",
+ "Percent_Person_WithHighBloodPressure 0.025619 \n",
+ "Percent_Person_WithMentalHealthNotGood -0.006579 \n",
+ "Percent_Person_WithHighCholesterol 0.048100 \n",
+ "Percent_Person_Obesity -0.032606 \n",
+ "\n",
+ "variable Percent_Person_PhysicalInactivity \\\n",
+ "variable \n",
+ "Count_Person 0.059668 \n",
+ "Percent_Person_PhysicalInactivity 1.000000 \n",
+ "Percent_Person_SleepLessThan7Hours 0.778834 \n",
+ "Percent_Person_WithHighBloodPressure 0.744643 \n",
+ "Percent_Person_WithMentalHealthNotGood 0.700776 \n",
+ "Percent_Person_WithHighCholesterol 0.436643 \n",
+ "Percent_Person_Obesity 0.753156 \n",
+ "\n",
+ "variable Percent_Person_SleepLessThan7Hours \\\n",
+ "variable \n",
+ "Count_Person 0.073807 \n",
+ "Percent_Person_PhysicalInactivity 0.778834 \n",
+ "Percent_Person_SleepLessThan7Hours 1.000000 \n",
+ "Percent_Person_WithHighBloodPressure 0.745474 \n",
+ "Percent_Person_WithMentalHealthNotGood 0.619343 \n",
+ "Percent_Person_WithHighCholesterol 0.369433 \n",
+ "Percent_Person_Obesity 0.657111 \n",
+ "\n",
+ "variable Percent_Person_WithHighBloodPressure \\\n",
+ "variable \n",
+ "Count_Person 0.025619 \n",
+ "Percent_Person_PhysicalInactivity 0.744643 \n",
+ "Percent_Person_SleepLessThan7Hours 0.745474 \n",
+ "Percent_Person_WithHighBloodPressure 1.000000 \n",
+ "Percent_Person_WithMentalHealthNotGood 0.690294 \n",
+ "Percent_Person_WithHighCholesterol 0.381626 \n",
+ "Percent_Person_Obesity 0.825544 \n",
+ "\n",
+ "variable Percent_Person_WithMentalHealthNotGood \\\n",
+ "variable \n",
+ "Count_Person -0.006579 \n",
+ "Percent_Person_PhysicalInactivity 0.700776 \n",
+ "Percent_Person_SleepLessThan7Hours 0.619343 \n",
+ "Percent_Person_WithHighBloodPressure 0.690294 \n",
+ "Percent_Person_WithMentalHealthNotGood 1.000000 \n",
+ "Percent_Person_WithHighCholesterol 0.214004 \n",
+ "Percent_Person_Obesity 0.735612 \n",
+ "\n",
+ "variable Percent_Person_WithHighCholesterol \\\n",
+ "variable \n",
+ "Count_Person 0.048100 \n",
+ "Percent_Person_PhysicalInactivity 0.436643 \n",
+ "Percent_Person_SleepLessThan7Hours 0.369433 \n",
+ "Percent_Person_WithHighBloodPressure 0.381626 \n",
+ "Percent_Person_WithMentalHealthNotGood 0.214004 \n",
+ "Percent_Person_WithHighCholesterol 1.000000 \n",
+ "Percent_Person_Obesity 0.299001 \n",
+ "\n",
+ "variable Percent_Person_Obesity \n",
+ "variable \n",
+ "Count_Person -0.032606 \n",
+ "Percent_Person_PhysicalInactivity 0.753156 \n",
+ "Percent_Person_SleepLessThan7Hours 0.657111 \n",
+ "Percent_Person_WithHighBloodPressure 0.825544 \n",
+ "Percent_Person_WithMentalHealthNotGood 0.735612 \n",
+ "Percent_Person_WithHighCholesterol 0.299001 \n",
+ "Percent_Person_Obesity 1.000000 "
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# calculate correlation\n",
+ "df[stat_vars_to_query].corr()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XHsUlBdXNC1X"
+ },
+ "source": [
+ "\n",
+ "**2.1B)** Explain why the diagonals of the matrix have the value 1.\n",
+ "\n",
+ "**2.1C)** What is the correlation coefficient between `Count_Person` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between population and obesity rate?\n",
+ "\n",
+ "**2.1D)** What is the correlation coefficient between `Percent_Person_PhysicalInactivity` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between physical inactivity and obesity rate?\n",
+ "\n",
+ "**2.1E)** In general, would you prefer a regression model to include features that correlate strongly with the dependent variable, or features with no correlation?\n",
+ "\n",
+ "**2.1F)** You find a new feature with correlation coefficient $r=-0.97$ between it and obesity rates. Would it be a good idea to add this new feature to your model?\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HXG9__t8YAqy"
+ },
+ "source": [
+ "### 2.2) $R^2$ score\n",
+ "\n",
+ "To quantify how predictive a linear regression model is overall, we can use the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), $R^2$ (pronounced \"R squared\").\n",
+ "\n",
+ "Mathematically, the $R^2$ score is defined as:\n",
+ "\n",
+ "$$S_{residuals} = \\sum_i{(y_i - f_i)^2} \\\\\n",
+ "S_{total} = \\sum_i{(y_i - \\bar{y})^2}\\\\\n",
+ "R^2 = 1 - \\frac{S_{residuals}}{S_{total}}$$\n",
+ "\n",
+ "where the $y_i$ are the actual dependent variable values, the $f_i$ are the predicted dependent variable values, and $\\bar{y}$ is the mean of the $y_i$.\n",
+ "\n",
+ "Conceptually, the $R^2$ score is a measure of explained variance. If $R^2=0.75$, then 75% of the variance in the dependent variable is accounted for by our model, while the remaining 25% is not.\n",
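+ "\n",
+ "As a quick illustration of the formula (using made-up numbers, not values from our dataset), suppose we have three data points with actual values $y = (1, 2, 3)$ and predictions $f = (1.5, 2, 2.5)$. Then $\\bar{y} = 2$ and:\n",
+ "\n",
+ "$$S_{residuals} = (1 - 1.5)^2 + (2 - 2)^2 + (3 - 2.5)^2 = 0.5 \\\\\n",
+ "S_{total} = (1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2 = 2 \\\\\n",
+ "R^2 = 1 - \\frac{0.5}{2} = 0.75$$\n",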
+ "\n",
+ "**2.2A)** Based on the mathematical definition, what is the range of possible values for $R^2$?\n",
+ "\n",
+ "**2.2B)** Come up with a situation (e.g. what would the data look like) where:\n",
+ "\n",
+ "A) $R^2 = 1.0$\n",
+ "\n",
+ "B) $R^2 = 0.0$\n",
+ "\n",
+ "Let's now analyze what the $R^2$ value is for our model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "-rnvtExD5_U1",
+ "outputId": "51ee0f86-162f-4542-eeb6-809deb556b88"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Model R^2 = 0.7577718062114178\n"
+ ]
+ }
+ ],
+ "source": [
+ "# calculate R^2\n",
+ "print(\"Model R^2 =\", results.rsquared)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "k_L-1iVbXKtc"
+ },
+ "source": [
+ "**2.2C)** Is the model's $R^2$ a \"good\" score?\n",
+ "\n",
+ "**2.2D)** Can you think of any ways we can change our model that would improve the $R^2$ score?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "t_Eieuedizkv"
+ },
+ "source": [
+ "### 2.3) Adjusted $R^2$\n",
+ "\n",
+ "There's an issue with $R^2$ scores that one needs to be aware of when working with multiple independent variables: namely, that the number of independent variables used can affect the $R^2$ score.\n",
+ "\n",
+ "Let's see this in practice by creating a new dataframe with 100 extra dummy variables (randomly sampled from a normal distribution with mean 0 and variance 1) tacked on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 510
+ },
+ "id": "iF9B9dPJ1P8G",
+ "outputId": "66d7dae3-11cb-4b46-a15e-846812c9f5b7"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "variable_name": "df_padded"
+ },
+ "text/plain": [
+ " City Name Count_Person Percent_Person_Obesity \\\n",
+ "place \n",
+ "geoId/0103076 Auburn 82025.0 33.0 \n",
+ "geoId/0107000 Birmingham 196644.0 44.9 \n",
+ "geoId/0135896 Hoover 92448.0 32.5 \n",
+ "geoId/0137000 Huntsville 225564.0 37.5 \n",
+ "geoId/0150000 Mobile 182595.0 44.2 \n",
+ "... ... ... ... \n",
+ "geoId/5531000 Green Bay 105744.0 38.9 \n",
+ "geoId/5539225 Kenosha 98211.0 43.7 \n",
+ "geoId/5548000 Madison 280305.0 32.1 \n",
+ "geoId/5553000 Milwaukee 561385.0 43.4 \n",
+ "geoId/5566000 Racine 76602.0 42.9 \n",
+ "\n",
+ " Percent_Person_PhysicalInactivity \\\n",
+ "place \n",
+ "geoId/0103076 23.6 \n",
+ "geoId/0107000 32.9 \n",
+ "geoId/0135896 19.7 \n",
+ "geoId/0137000 24.0 \n",
+ "geoId/0150000 28.7 \n",
+ "... ... \n",
+ "geoId/5531000 26.7 \n",
+ "geoId/5539225 23.8 \n",
+ "geoId/5548000 18.7 \n",
+ "geoId/5553000 28.8 \n",
+ "geoId/5566000 28.2 \n",
+ "\n",
+ " Percent_Person_SleepLessThan7Hours \\\n",
+ "place \n",
+ "geoId/0103076 36.0 \n",
+ "geoId/0107000 42.9 \n",
+ "geoId/0135896 33.6 \n",
+ "geoId/0137000 40.0 \n",
+ "geoId/0150000 43.4 \n",
+ "... ... \n",
+ "geoId/5531000 33.1 \n",
+ "geoId/5539225 36.6 \n",
+ "geoId/5548000 29.9 \n",
+ "geoId/5553000 40.0 \n",
+ "geoId/5566000 39.2 \n",
+ "\n",
+ " Percent_Person_WithHighBloodPressure \\\n",
+ "place \n",
+ "geoId/0103076 34.3 \n",
+ "geoId/0107000 45.0 \n",
+ "geoId/0135896 32.6 \n",
+ "geoId/0137000 36.5 \n",
+ "geoId/0150000 39.8 \n",
+ "... ... \n",
+ "geoId/5531000 28.1 \n",
+ "geoId/5539225 29.9 \n",
+ "geoId/5548000 26.6 \n",
+ "geoId/5553000 36.7 \n",
+ "geoId/5566000 32.3 \n",
+ "\n",
+ " Percent_Person_WithHighCholesterol \\\n",
+ "place \n",
+ "geoId/0103076 30.6 \n",
+ "geoId/0107000 31.6 \n",
+ "geoId/0135896 31.0 \n",
+ "geoId/0137000 31.6 \n",
+ "geoId/0150000 32.5 \n",
+ "... ... \n",
+ "geoId/5531000 30.7 \n",
+ "geoId/5539225 30.0 \n",
+ "geoId/5548000 28.5 \n",
+ "geoId/5553000 30.1 \n",
+ "geoId/5566000 32.0 \n",
+ "\n",
+ " Percent_Person_WithMentalHealthNotGood Random Variable 0 \\\n",
+ "place \n",
+ "geoId/0103076 17.8 -1.564312 \n",
+ "geoId/0107000 19.7 -0.580159 \n",
+ "geoId/0135896 15.4 -0.322616 \n",
+ "geoId/0137000 18.0 0.768514 \n",
+ "geoId/0150000 19.9 0.207217 \n",
+ "... ... ... \n",
+ "geoId/5531000 17.9 -0.104983 \n",
+ "geoId/5539225 18.6 -0.355349 \n",
+ "geoId/5548000 15.6 -0.648119 \n",
+ "geoId/5553000 19.0 -0.154089 \n",
+ "geoId/5566000 18.4 1.638922 \n",
+ "\n",
+ " Random Variable 1 ... Random Variable 90 Random Variable 91 \\\n",
+ "place ... \n",
+ "geoId/0103076 0.288515 ... 0.762521 0.251051 \n",
+ "geoId/0107000 0.849181 ... 0.215843 1.553184 \n",
+ "geoId/0135896 -1.748737 ... 2.036116 0.993741 \n",
+ "geoId/0137000 -0.534476 ... 0.950064 0.730344 \n",
+ "geoId/0150000 1.028760 ... -0.775507 1.338210 \n",
+ "... ... ... ... ... \n",
+ "geoId/5531000 -0.856795 ... -0.945322 -0.219595 \n",
+ "geoId/5539225 0.348573 ... -0.789575 0.590118 \n",
+ "geoId/5548000 0.025662 ... -0.151965 0.835380 \n",
+ "geoId/5553000 -0.339432 ... 2.255458 1.357828 \n",
+ "geoId/5566000 -0.543906 ... 0.370553 -0.606273 \n",
+ "\n",
+ " Random Variable 92 Random Variable 93 Random Variable 94 \\\n",
+ "place \n",
+ "geoId/0103076 -0.697129 -1.697195 0.399706 \n",
+ "geoId/0107000 -1.766115 1.152941 0.712426 \n",
+ "geoId/0135896 -1.786077 -0.264808 -1.922278 \n",
+ "geoId/0137000 0.007471 3.514180 0.145648 \n",
+ "geoId/0150000 -0.395432 -0.830337 -0.558512 \n",
+ "... ... ... ... \n",
+ "geoId/5531000 -2.113165 0.614379 0.110795 \n",
+ "geoId/5539225 -0.193587 0.502188 0.124404 \n",
+ "geoId/5548000 -1.381286 0.303114 0.540398 \n",
+ "geoId/5553000 0.692794 0.924034 0.951688 \n",
+ "geoId/5566000 1.066660 0.022132 0.039135 \n",
+ "\n",
+ " Random Variable 95 Random Variable 96 Random Variable 97 \\\n",
+ "place \n",
+ "geoId/0103076 -0.557155 0.444760 1.787642 \n",
+ "geoId/0107000 0.936660 0.576485 -0.127241 \n",
+ "geoId/0135896 -1.227397 -1.723762 0.847944 \n",
+ "geoId/0137000 -1.254448 0.275048 -1.241024 \n",
+ "geoId/0150000 -0.367606 -1.049303 -3.161325 \n",
+ "... ... ... ... \n",
+ "geoId/5531000 -0.250010 0.926896 -0.526254 \n",
+ "geoId/5539225 -0.376209 -0.331331 0.697165 \n",
+ "geoId/5548000 -0.359988 0.007904 0.010788 \n",
+ "geoId/5553000 -0.071096 0.097582 0.952135 \n",
+ "geoId/5566000 1.102639 -0.438601 -1.744647 \n",
+ "\n",
+ " Random Variable 98 Random Variable 99 \n",
+ "place \n",
+ "geoId/0103076 0.340410 1.535658 \n",
+ "geoId/0107000 -0.543845 1.536037 \n",
+ "geoId/0135896 -0.446194 -0.320127 \n",
+ "geoId/0137000 -0.163577 0.376057 \n",
+ "geoId/0150000 -0.586668 0.934307 \n",
+ "... ... ... \n",
+ "geoId/5531000 -0.359181 -1.424956 \n",
+ "geoId/5539225 1.029427 -1.143744 \n",
+ "geoId/5548000 -0.276071 0.979319 \n",
+ "geoId/5553000 -1.019633 -0.778193 \n",
+ "geoId/5566000 1.245214 2.216294 \n",
+ "\n",
+ "[498 rows x 108 columns]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Pad our dataframe with more random variables\n",
+ "num_rows = len(df.index)\n",
+ "num_new_columns = 100\n",
+ "random_data = np.random.normal(loc=0, scale=1, size=(num_rows, num_new_columns))\n",
+ "new_column_names = [f\"Random Variable {i}\" for i in range(num_new_columns)]\n",
+ "random_data_df = pd.DataFrame(\n",
+ " random_data,\n",
+ " columns=new_column_names,\n",
+ " index=df.index\n",
+ ")\n",
+ "df_padded = pd.concat([df, random_data_df], axis=1)\n",
+ "display(df_padded)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Q5f22xUmvoN_"
+ },
+ "source": [
+ "Now let's fit a new model to the data and compare $R^2$ scores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "rn57oEF82dju",
+ "outputId": "27dd8be5-fae5-45b4-a31f-5858a087d3d5"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Original Model R^2 = 0.7577718062114178\n",
+ "Padded Model R^2 = 0.7988444670439291\n"
+ ]
+ }
+ ],
+ "source": [
+ "# New R^2\n",
+ "y_padded = df_padded[dep_var].to_numpy().reshape(-1, 1)\n",
+ "x_padded = df_padded.loc[:, ~df_padded.columns.isin([dep_var, \"City Name\"])]\n",
+ "x_padded = sm.add_constant(x_padded)\n",
+ "\n",
+ "padded_model = sm.OLS(y_padded, x_padded)\n",
+ "padded_results = padded_model.fit()\n",
+ "\n",
+ "print(\"Original Model R^2 =\", results.rsquared)\n",
+ "print(\"Padded Model R^2 =\", padded_results.rsquared)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "j-j4IbOtwFj8"
+ },
+ "source": [
+ "**2.3A)** Which model had a better $R^2$ score?\n",
+ "\n",
+ "**2.3B)** Think about the variables used in each model. Should one model be much more predictive than another?\n",
+ "\n",
+ "**2.3C)** In general, how would you expect $R^2$ to change as we increase the number of independent variables?\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2Ipg_orhxOF_"
+ },
+ "source": [
+ "So how do we fix this? We can adjust our $R^2$ metric to account for the number of variables. The most common way to define the _**adjusted $R^2$**_ score is as follows:\n",
+ "\n",
+ "$$R^{2}_{adj}=1-(1-R^{2})\\frac{n-1}{n-p-1}$$\n",
+ "\n",
+ "where $n$ is the number of data points and $p$ is the number of independent variables.\n",
+ "\n",
+ "Now let's compare the adjusted $R^2$ of our models."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "7pZ9_NmZisGi",
+ "outputId": "bfa5cddd-dddf-45c9-8082-ab58dbe5c286"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Original Model Adjusted R^2 = 0.7548117875500502\n",
+ "Padded Model Adjusted R^2 = 0.7443112535059662\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Adjusted R^2\n",
+ "print(\"Original Model Adjusted R^2 =\", results.rsquared_adj)\n",
+ "print(\"Padded Model Adjusted R^2 =\", padded_results.rsquared_adj)"
+ ]
+ },
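+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a sanity check, the adjusted scores printed above can be reproduced by hand from the formula. This sketch assumes $n = 498$ (the number of rows in our dataframe), $p = 6$ independent variables for the original model, and $p = 106$ for the padded model (the original 6 plus the 100 random variables). All values are rounded.\n",
+ "\n",
+ "$$R^{2}_{adj,\\,original} \\approx 1 - (1 - 0.75777)\\frac{498 - 1}{498 - 6 - 1} \\approx 0.7548 \\\\\n",
+ "R^{2}_{adj,\\,padded} \\approx 1 - (1 - 0.79884)\\frac{498 - 1}{498 - 106 - 1} \\approx 0.7443$$"
+ ]
+ },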
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qU9VwLsNHcKD"
+ },
+ "source": [
+ "**2.3D)** Which model had a better adjusted $R^2$ score?\n",
+ "\n",
+ "**2.3E)** When would you prefer to use adjusted $R^2$ over $R^2$ to evaluate model fit?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1tiopX7PWHiu"
+ },
+ "source": [
+ "## 3) Interpreting regression models\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qC7aC0y-O3_D"
+ },
+ "source": [
+ "### 3.1) Analyzing weights and intercepts\n",
+ "The parameters of the regression model itself can also yield important insights.\n",
+ "\n",
+ "Run the following code box to display the weights and intercept of our original model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 304
+ },
+ "id": "_y0xeWysPIm6",
+ "outputId": "6ede53a7-bbc0-474e-bc4a-4567735c8d75"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "const -0.19367\n",
+ "Count_Person -0.00000\n",
+ "Percent_Person_PhysicalInactivity 0.30528\n",
+ "Percent_Person_SleepLessThan7Hours -0.12455\n",
+ "Percent_Person_WithHighBloodPressure 0.75717\n",
+ "Percent_Person_WithHighCholesterol -0.13520\n",
+ "Percent_Person_WithMentalHealthNotGood 0.69012\n",
+ "dtype: float64"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Display weights/coefficients\n",
+ "display(results.params.round(5))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dvpGBohWPymA"
+ },
+ "source": [
+ "**3.1A)** What is the intercept of our model? What are its units?\n",
+ "\n",
+ "**3.1B)** What are the units on each of the model weights (aka coefficients)?\n",
+ "\n",
+ "**3.1C)** Which variables matter most to our model?\n",
+ "\n",
+ "**3.1D)** In words, describe what a weight/coefficient in a linear regression means.\n",
+ "\n",
+ "**3.1E)** Our model is used to generate a predicted obesity rate for a fictional city named Dataopolis. If we increased `Percent_Person_WithMentalHealthNotGood` for Dataopolis by 1 unit, _while keeping the values for all remaining variables constant_, by how much would we expect our predicted obesity rate to change?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2ZoaATgWh-fR"
+ },
+ "source": [
+ "### 3.2) The effect of correlated variables\n",
+ "\n",
+ "When interpreting weights, watch out for independent variables that are highly correlated with each other.\n",
+ "\n",
+ "Let's illustrate why this might be a problem by adding a variable that is correlated with one of the existing variables."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 845
+ },
+ "id": "uP4XtXkfLB1U",
+ "outputId": "f55a1573-1c38-49ad-ea26-9675b434e7aa"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "New dataframe to fit:\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "summary": "{\n \"name\": \"correlated_df\",\n \"rows\": 498,\n \"fields\": [\n {\n \"column\": \"place\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 498,\n \"samples\": [\n \"geoId/5363000\",\n \"geoId/0639892\",\n \"geoId/1714351\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"City Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 475,\n \"samples\": [\n \"Memphis\",\n \"Plano\",\n \"Avondale\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 471628.1194557695,\n \"min\": 76212.0,\n \"max\": 8258035.0,\n \"num_unique_values\": 498,\n \"samples\": [\n 755078.0,\n 78135.0,\n 81004.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.3675302626575006,\n \"min\": 14.1,\n \"max\": 48.9,\n \"num_unique_values\": 220,\n \"samples\": [\n 33.4,\n 41.6,\n 23.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.79401923130349,\n \"min\": 11.2,\n \"max\": 41.8,\n \"num_unique_values\": 209,\n \"samples\": [\n 25.0,\n 39.5,\n 18.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.3313227959895455,\n \"min\": 24.9,\n \"max\": 49.5,\n \"num_unique_values\": 166,\n \"samples\": [\n 42.5,\n 25.6,\n 28.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.56550336745893,\n \"min\": 21.3,\n \"max\": 45.7,\n \"num_unique_values\": 170,\n \"samples\": [\n 40.3,\n 25.6,\n 41.2\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.089525781869934,\n \"min\": 24.6,\n \"max\": 35.6,\n \"num_unique_values\": 95,\n \"samples\": [\n 25.9,\n 34.1,\n 27.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.1019573952670365,\n \"min\": 11.5,\n \"max\": 23.3,\n \"num_unique_values\": 103,\n \"samples\": [\n 15.8,\n 16.3,\n 12.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Correlated Variable\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.2936580170071315,\n \"min\": 9.748686615105605,\n \"max\": 23.243226154344537,\n \"num_unique_values\": 498,\n \"samples\": [\n 16.440967097355323,\n 17.12077873386613,\n 17.01999484946955\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
+ "type": "dataframe",
+ "variable_name": "correlated_df"
+ },
+ "text/plain": [
+ "variable City Name Count_Person Percent_Person_Obesity \\\n",
+ "place \n",
+ "geoId/0103076 Auburn 82025.0 33.0 \n",
+ "geoId/0107000 Birmingham 196644.0 44.9 \n",
+ "geoId/0135896 Hoover 92448.0 32.5 \n",
+ "geoId/0137000 Huntsville 225564.0 37.5 \n",
+ "geoId/0150000 Mobile 182595.0 44.2 \n",
+ "... ... ... ... \n",
+ "geoId/5531000 Green Bay 105744.0 38.9 \n",
+ "geoId/5539225 Kenosha 98211.0 43.7 \n",
+ "geoId/5548000 Madison 280305.0 32.1 \n",
+ "geoId/5553000 Milwaukee 561385.0 43.4 \n",
+ "geoId/5566000 Racine 76602.0 42.9 \n",
+ "\n",
+ "variable Percent_Person_PhysicalInactivity \\\n",
+ "place \n",
+ "geoId/0103076 23.6 \n",
+ "geoId/0107000 32.9 \n",
+ "geoId/0135896 19.7 \n",
+ "geoId/0137000 24.0 \n",
+ "geoId/0150000 28.7 \n",
+ "... ... \n",
+ "geoId/5531000 26.7 \n",
+ "geoId/5539225 23.8 \n",
+ "geoId/5548000 18.7 \n",
+ "geoId/5553000 28.8 \n",
+ "geoId/5566000 28.2 \n",
+ "\n",
+ "variable Percent_Person_SleepLessThan7Hours \\\n",
+ "place \n",
+ "geoId/0103076 36.0 \n",
+ "geoId/0107000 42.9 \n",
+ "geoId/0135896 33.6 \n",
+ "geoId/0137000 40.0 \n",
+ "geoId/0150000 43.4 \n",
+ "... ... \n",
+ "geoId/5531000 33.1 \n",
+ "geoId/5539225 36.6 \n",
+ "geoId/5548000 29.9 \n",
+ "geoId/5553000 40.0 \n",
+ "geoId/5566000 39.2 \n",
+ "\n",
+ "variable Percent_Person_WithHighBloodPressure \\\n",
+ "place \n",
+ "geoId/0103076 34.3 \n",
+ "geoId/0107000 45.0 \n",
+ "geoId/0135896 32.6 \n",
+ "geoId/0137000 36.5 \n",
+ "geoId/0150000 39.8 \n",
+ "... ... \n",
+ "geoId/5531000 28.1 \n",
+ "geoId/5539225 29.9 \n",
+ "geoId/5548000 26.6 \n",
+ "geoId/5553000 36.7 \n",
+ "geoId/5566000 32.3 \n",
+ "\n",
+ "variable Percent_Person_WithHighCholesterol \\\n",
+ "place \n",
+ "geoId/0103076 30.6 \n",
+ "geoId/0107000 31.6 \n",
+ "geoId/0135896 31.0 \n",
+ "geoId/0137000 31.6 \n",
+ "geoId/0150000 32.5 \n",
+ "... ... \n",
+ "geoId/5531000 30.7 \n",
+ "geoId/5539225 30.0 \n",
+ "geoId/5548000 28.5 \n",
+ "geoId/5553000 30.1 \n",
+ "geoId/5566000 32.0 \n",
+ "\n",
+ "variable Percent_Person_WithMentalHealthNotGood Correlated Variable \n",
+ "place \n",
+ "geoId/0103076 17.8 18.761300 \n",
+ "geoId/0107000 19.7 17.655787 \n",
+ "geoId/0135896 15.4 14.736255 \n",
+ "geoId/0137000 18.0 16.549451 \n",
+ "geoId/0150000 19.9 20.277958 \n",
+ "... ... ... \n",
+ "geoId/5531000 17.9 18.645080 \n",
+ "geoId/5539225 18.6 17.067335 \n",
+ "geoId/5548000 15.6 15.665917 \n",
+ "geoId/5553000 19.0 19.073143 \n",
+ "geoId/5566000 18.4 17.106196 \n",
+ "\n",
+ "[498 rows x 9 columns]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Correlated Model Weights and Intercept:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "const -0.28192\n",
+ "Count_Person -0.00000\n",
+ "Percent_Person_PhysicalInactivity 0.30604\n",
+ "Percent_Person_SleepLessThan7Hours -0.12529\n",
+ "Percent_Person_WithHighBloodPressure 0.75756\n",
+ "Percent_Person_WithHighCholesterol -0.13345\n",
+ "Percent_Person_WithMentalHealthNotGood 0.55372\n",
+ "Correlated Variable 0.13921\n",
+ "dtype: float64"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# New variable correlated with Percent_Person_WithMentalHealthNotGood\n",
+ "correlated_df = df.copy()\n",
+ "target_var = \"Percent_Person_WithMentalHealthNotGood\"\n",
+ "noise = np.random.normal(size=(len(correlated_df.index),))\n",
+ "correlated_df[\"Correlated Variable\"] = correlated_df[target_var] + noise\n",
+ "\n",
+ "# show new data frame\n",
+ "print(\"New dataframe to fit:\")\n",
+ "display(correlated_df)\n",
+ "\n",
+ "# Create a new model\n",
+ "y_corr = correlated_df[dep_var].to_numpy().reshape(-1, 1)\n",
+ "x_corr = correlated_df.loc[:, ~correlated_df.columns.isin([dep_var, \"City Name\"])]\n",
+ "x_corr = sm.add_constant(x_corr)\n",
+ "\n",
+ "correlated_model = sm.OLS(y_corr, x_corr)\n",
+ "correlated_results = correlated_model.fit()\n",
+ "\n",
+ "print(\"Correlated Model Weights and Intercept:\")\n",
+ "display(correlated_results.params.round(5))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HEHJWPxibiY3"
+ },
+ "source": [
+ "**3.2A)** Compare the weights of the correlated model with the weights of our original model. What happened to the weight corresponding to `Percent_Person_WithMentalHealthNotGood`?\n",
+ "\n",
+ "**3.2B)** Thinking back to your answers for Q3.1C-E, how might correlated variables affect the interpretation of model weights?"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "include_colab_link": true,
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}