diff --git a/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb b/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb new file mode 100644 index 00000000..b549c8aa --- /dev/null +++ b/notebooks/v2/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb @@ -0,0 +1,3211 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "myiDKzkSIAUs" + }, + "source": [ + "Copyright 2025 Google LLC.\n", + "SPDX-License-Identifier: Apache-2.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sbPgBMt01mSB" + }, + "source": [ + "# Regression: Evaluation and Interpretation\n", + "In the [previous notebook](https://github.com/datacommonsorg/api-python/blob/master/notebooks/v2/intro_data_science/Regression_Basics_and_Prediction.ipynb), we saw how powerful regression can be as a tool for prediction. In this Colab, we'll take that exploration one step further: what can regression models tell us about the statistical relationships between variables?\n", + "\n", + "In particular, this colab will take a more rigorous statistical approach to regressions. 
We'll look at how to evaluate and interpret our regression models using statistical methods.\n", + "\n", + "## Learning objectives:\n", + "* Hypothesis testing with regression\n", + "* Regression tables\n", + "* Pearson correlation coefficient, $r$\n", + "* $R^2$ and adjusted $R^2$\n", + "* Interpreting weights and intercepts\n", + "* How correlated variables affect models\n", + "---\n", + "**Need extra help?**\n", + "\n", + "If you're new to Google Colab, take a look at [this getting started tutorial](https://colab.research.google.com/notebooks/intro.ipynb).\n", + "\n", + "To build more familiarity with the Data Commons API, check out these [Data Commons tutorials](https://docs.datacommons.org/api/python/v2/tutorials.md).\n", + "\n", + "And for help with Pandas and manipulating data frames, take a look at the [Pandas documentation](https://pandas.pydata.org/docs/reference/index.html).\n", + "\n", + "We'll be using the scikit-learn library for implementing our models today. Documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html).\n", + "\n", + "As usual, if you have any other questions, please reach out to your course staff!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gnoowEYUIQS-" + }, + "source": [ + "## Getting set up\n", + "\n", + "\n", + "Run the following code boxes to load the Python libraries and data we'll be using today." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YkuB0EIS59qX" + }, + "outputs": [], + "source": [ + "# Setup/Imports\n", + "!pip install \"datacommons-client[Pandas]\" --upgrade --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u2oFQ7-v8sxY" + }, + "outputs": [], + "source": [ + "# Data Commons Python and Pandas APIs\n", + "from datacommons_client.client import DataCommonsClient\n", + "client = DataCommonsClient(api_key=\"your API key\")\n", + "\n", + "# For manipulating data\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# For implementing models and evaluation methods\n", + "from sklearn import linear_model\n", + "from sklearn.metrics import r2_score, mean_squared_error\n", + "from statsmodels import api as sm\n", + "\n", + "\n", + "# For plotting/printing\n", + "from matplotlib import pyplot as plt\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WQlAj-wYjI9L" + }, + "source": [ + "### The data\n", + "\n", + "In this assignment, we'll be returning to the scenario we started in the previous notebook. 
As a refresher, we'll be exploring how obesity rates vary with different health or societal factors across US cities.\n", + "\n", + "Our data science question: **What can we learn about the relationship of those health and lifestyle factors to obesity rates?**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 492 + }, + "id": "VuA3xgSQXhK6", + "outputId": "6c43f4e1-6424-4058-f681-6cf31b6fb05a" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"df\",\n \"rows\": 498,\n \"fields\": [\n {\n \"column\": \"place\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 498,\n \"samples\": [\n \"geoId/5363000\",\n \"geoId/0639892\",\n \"geoId/1714351\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"City Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 475,\n \"samples\": [\n \"Memphis\",\n \"Plano\",\n \"Avondale\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 471628.1194557695,\n \"min\": 76212.0,\n \"max\": 8258035.0,\n \"num_unique_values\": 498,\n \"samples\": [\n 755078.0,\n 78135.0,\n 81004.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.3675302626575006,\n \"min\": 14.1,\n \"max\": 48.9,\n \"num_unique_values\": 220,\n \"samples\": [\n 33.4,\n 41.6,\n 23.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.79401923130349,\n \"min\": 11.2,\n \"max\": 41.8,\n \"num_unique_values\": 209,\n \"samples\": [\n 25.0,\n 39.5,\n 18.4\n ],\n \"semantic_type\": \"\",\n 
\"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.3313227959895455,\n \"min\": 24.9,\n \"max\": 49.5,\n \"num_unique_values\": 166,\n \"samples\": [\n 42.5,\n 25.6,\n 28.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.56550336745893,\n \"min\": 21.3,\n \"max\": 45.7,\n \"num_unique_values\": 170,\n \"samples\": [\n 40.3,\n 25.6,\n 41.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.089525781869934,\n \"min\": 24.6,\n \"max\": 35.6,\n \"num_unique_values\": 95,\n \"samples\": [\n 25.9,\n 34.1,\n 27.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.1019573952670365,\n \"min\": 11.5,\n \"max\": 23.3,\n \"num_unique_values\": 103,\n \"samples\": [\n 15.8,\n 16.3,\n 12.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "df" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + "variable City Name Count_Person Percent_Person_Obesity \\\n", + "place \n", + "geoId/0103076 Auburn 82025.0 33.0 \n", + "geoId/0107000 Birmingham 196644.0 44.9 \n", + "geoId/0135896 Hoover 92448.0 32.5 \n", + "geoId/0137000 Huntsville 225564.0 37.5 \n", + "geoId/0150000 Mobile 182595.0 44.2 \n", + "... ... ... ... \n", + "geoId/5531000 Green Bay 105744.0 38.9 \n", + "geoId/5539225 Kenosha 98211.0 43.7 \n", + "geoId/5548000 Madison 280305.0 32.1 \n", + "geoId/5553000 Milwaukee 561385.0 43.4 \n", + "geoId/5566000 Racine 76602.0 42.9 \n", + "\n", + "variable Percent_Person_PhysicalInactivity \\\n", + "place \n", + "geoId/0103076 23.6 \n", + "geoId/0107000 32.9 \n", + "geoId/0135896 19.7 \n", + "geoId/0137000 24.0 \n", + "geoId/0150000 28.7 \n", + "... ... \n", + "geoId/5531000 26.7 \n", + "geoId/5539225 23.8 \n", + "geoId/5548000 18.7 \n", + "geoId/5553000 28.8 \n", + "geoId/5566000 28.2 \n", + "\n", + "variable Percent_Person_SleepLessThan7Hours \\\n", + "place \n", + "geoId/0103076 36.0 \n", + "geoId/0107000 42.9 \n", + "geoId/0135896 33.6 \n", + "geoId/0137000 40.0 \n", + "geoId/0150000 43.4 \n", + "... ... \n", + "geoId/5531000 33.1 \n", + "geoId/5539225 36.6 \n", + "geoId/5548000 29.9 \n", + "geoId/5553000 40.0 \n", + "geoId/5566000 39.2 \n", + "\n", + "variable Percent_Person_WithHighBloodPressure \\\n", + "place \n", + "geoId/0103076 34.3 \n", + "geoId/0107000 45.0 \n", + "geoId/0135896 32.6 \n", + "geoId/0137000 36.5 \n", + "geoId/0150000 39.8 \n", + "... ... \n", + "geoId/5531000 28.1 \n", + "geoId/5539225 29.9 \n", + "geoId/5548000 26.6 \n", + "geoId/5553000 36.7 \n", + "geoId/5566000 32.3 \n", + "\n", + "variable Percent_Person_WithHighCholesterol \\\n", + "place \n", + "geoId/0103076 30.6 \n", + "geoId/0107000 31.6 \n", + "geoId/0135896 31.0 \n", + "geoId/0137000 31.6 \n", + "geoId/0150000 32.5 \n", + "... ... 
\n", + "geoId/5531000 30.7 \n", + "geoId/5539225 30.0 \n", + "geoId/5548000 28.5 \n", + "geoId/5553000 30.1 \n", + "geoId/5566000 32.0 \n", + "\n", + "variable Percent_Person_WithMentalHealthNotGood \n", + "place \n", + "geoId/0103076 17.8 \n", + "geoId/0107000 19.7 \n", + "geoId/0135896 15.4 \n", + "geoId/0137000 18.0 \n", + "geoId/0150000 19.9 \n", + "... ... \n", + "geoId/5531000 17.9 \n", + "geoId/5539225 18.6 \n", + "geoId/5548000 15.6 \n", + "geoId/5553000 19.0 \n", + "geoId/5566000 18.4 \n", + "\n", + "[498 rows x 8 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Load the data we'll be using\n", + "\n", + "# Fetch the population of the US cities\n", + "city_pop = client.observation.fetch_observations_by_entity_type(\n", + " date=\"latest\",\n", + " parent_entity=\"country/USA\",\n", + " entity_type=\"City\",\n", + " variable_dcids=\"Count_Person\",\n", + " filter_facet_ids=\"2176550201\" # USCensusPEP_Annual_Population\n", + ").byVariable[\"Count_Person\"].byEntity\n", + "city_pop_dict = {\n", + " city: data[\"orderedFacets\"][0].observations[0].value\n", + " for city, data in city_pop.items()\n", + " }\n", + "\n", + "# Filter to the top 500 cities\n", + "cities = [\n", + " item[0]\n", + " for item in sorted(\n", + " city_pop_dict.items(),\n", + " key=lambda item: item[1],\n", + " reverse=True)[:500]\n", + " ]\n", + "\n", + "# We've compiled a list of some nice Data Commons Statistical Variables\n", + "# to use as features for you\n", + "stat_vars_to_query = [\n", + " \"Count_Person\",\n", + " \"Percent_Person_PhysicalInactivity\",\n", + " \"Percent_Person_SleepLessThan7Hours\",\n", + " \"Percent_Person_WithHighBloodPressure\",\n", + " \"Percent_Person_WithMentalHealthNotGood\",\n", + " \"Percent_Person_WithHighCholesterol\",\n", + " \"Percent_Person_Obesity\"\n", + "\n", + "]\n", + "\n", + "# Query Data Commons for the data\n", + "raw_features_df = client.observations_dataframe(\n", + " 
 variable_dcids=stat_vars_to_query,\n", + " date=\"latest\",\n", + " entity_dcids=cities)\n", + "\n", + "# Filter to highest ranked facet for each entity and variable\n", + "df = raw_features_df.copy(deep=True)\n", + "df = df.groupby([\"entity\", \"entity_name\", \"variable\"]).first().reset_index()\n", + "\n", + "# Select required columns and pivot by variable\n", + "df = df[[\"entity\", \"entity_name\", \"variable\", \"value\"]]\n", + "df = df.pivot(index=[\"entity\", \"entity_name\"], columns=\"variable\", values=\"value\")\n", + "df = df.dropna()\n", + "\n", + "# Rename columns and order alphabetically\n", + "df = df.reset_index()\n", + "df.rename(columns={\"entity\":\"place\", \"entity_name\": \"City Name\"}, inplace=True)\n", + "df.set_index(\"place\", inplace=True)\n", + "df = df.reindex(sorted(df.columns), axis=1)\n", + "\n", + "# Display results\n", + "display(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TbvPpqYDiGmY" + }, + "source": [ + "### The model\n", + "\n", + "Run the following code box to fit an [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression model to our data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V9x5v10LwYZG" + }, + "outputs": [], + "source": [ + "# Fit a regression model\n", + "dep_var = \"Percent_Person_Obesity\"\n", + "y = df[dep_var].to_numpy().reshape(-1, 1)\n", + "x = df.loc[:, ~df.columns.isin([dep_var, \"City Name\"])]\n", + "x = sm.add_constant(x)\n", + "\n", + "\n", + "model = sm.OLS(y, x)\n", + "results = model.fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_h32-nQChkE6" + }, + "source": [ + "## 0) Regression tables\n", + "\n", + "When performing regression analyses, statistical packages will usually provide a _**regression table**_, which summarizes the results of the analysis.\n", + "\n", + "Run the following code box to display the regression table for our original model. 
In this Colab, we'll go over some of the statistics included in the table.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KvCfUveghpcJ", + "outputId": "be61942c-174a-405d-b317-43439e1363b0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " OLS Regression Results \n", + "==============================================================================\n", + "Dep. Variable: y R-squared: 0.758\n", + "Model: OLS Adj. R-squared: 0.755\n", + "Method: Least Squares F-statistic: 256.0\n", + "Date: Thu, 22 May 2025 Prob (F-statistic): 1.18e-147\n", + "Time: 16:02:06 Log-Likelihood: -1275.0\n", + "No. Observations: 498 AIC: 2564.\n", + "Df Residuals: 491 BIC: 2593.\n", + "Df Model: 6 \n", + "Covariance Type: nonrobust \n", + "==========================================================================================================\n", + " coef std err t P>|t| [0.025 0.975]\n", + "----------------------------------------------------------------------------------------------------------\n", + "const -0.1937 2.656 -0.073 0.942 -5.412 5.024\n", + "Count_Person -7.183e-07 3.02e-07 -2.381 0.018 -1.31e-06 -1.25e-07\n", + "Percent_Person_PhysicalInactivity 0.3053 0.046 6.577 0.000 0.214 0.396\n", + "Percent_Person_SleepLessThan7Hours -0.1246 0.057 -2.192 0.029 -0.236 -0.013\n", + "Percent_Person_WithHighBloodPressure 0.7572 0.054 14.062 0.000 0.651 0.863\n", + "Percent_Person_WithHighCholesterol -0.1352 0.077 -1.756 0.080 -0.286 0.016\n", + "Percent_Person_WithMentalHealthNotGood 0.6901 0.103 6.704 0.000 0.488 0.892\n", + "==============================================================================\n", + "Omnibus: 2.835 Durbin-Watson: 1.396\n", + "Prob(Omnibus): 0.242 Jarque-Bera (JB): 2.651\n", + "Skew: 0.133 Prob(JB): 0.266\n", + "Kurtosis: 3.239 Cond. No. 
 9.89e+06\n", + "==============================================================================\n", + "\n", + "Notes:\n", + "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", + "[2] The condition number is large, 9.89e+06. This might indicate that there are\n", + "strong multicollinearity or other numerical problems.\n" + ] + } + ], + "source": [ + "print(results.summary())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aFi0LWX0OlwA" + }, + "source": [ + "## 1) Hypothesis testing\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DrYP7lZ9cyHL" + }, + "source": [ + "### 1.1) Null hypotheses\n", + "\n", + "When performing statistical analyses, one usually starts with a statement of the null hypothesis. For regression models, the null hypothesis for each variable is typically that its coefficient equals zero.\n", + "\n", + "**1.1)** Write out the null hypotheses for each of our independent variables." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wAOpaTGXeiQb" + }, + "source": [ + "### 1.2) T-test\n", + "\n", + "So how do we test our null hypotheses? We use the [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Slope_of_a_regression_line).\n", + "\n", + "Take a look at the regression table above to answer the following questions:\n", + "\n", + "**1.2A)** According to the t-test, which variables are statistically significant?\n", + "\n", + "**1.2B)** For variables that are not statistically significant, should we keep them in our model? Why or why not?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "exV8u07Bek4z" + }, + "source": [ + "### 1.3) F-test\n", + "\n", + "Beyond testing the significance of our individual variables independently, we can also test the significance of our model overall using the [F-test](https://en.wikipedia.org/wiki/F-test#Regression_problems). 
 In particular, the F-test compares our model to one without predictors (i.e., just an intercept). In other words, can our model do statistically better than just predicting the mean?\n", + "\n", + "Again, use the regression table above to answer the following questions:\n", + "\n", + "**1.3A)** What is the null hypothesis for the F-test?\n", + "\n", + "**1.3B)** Can we reject the null hypothesis for our model?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gpk0RO17VJXz" + }, + "source": [ + "## 2) Statistical measures" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RS-U1bdrl-c2" + }, + "source": [ + "### 2.1) Correlation coefficient $r$\n", + "\n", + "We can quantify the predictiveness of variables using a _correlation coefficient_, a number that represents the degree to which two variables have a statistical relationship. The most commonly used correlation coefficient is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), also known as _Pearson's r_, which measures the strength and direction of the linear relationship between two variables.\n", + "\n", + "Mathematically, the correlation coefficient is defined as:\n", + "$$ r = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_i (x_i - \\bar{x})^2}\\sqrt{\\sum_i (y_i - \\bar{y})^2}}\n", + "$$\n", + "\n", + "where $x$ and $y$ are the two variables.\n", + "\n", + "Those of you with a statistics background might recognize this as the ratio of the covariance of $x$ and $y$ to the product of their standard deviations.\n", + "\n", + "**2.1A)** Either using the mathematical definition or by exploring with code, explain what the correlation coefficient would be in the following cases:\n", + "\n", + "A) $x = y$\n", + "\n", + "B) $x = -y$\n", + "\n", + "C) $x$ and $y$ are both normally distributed variables with mean 0 and variance 1, randomly sampled independently from each other." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "FwEnQEWjMQv5", + "outputId": "9803eadf-dc00-4c69-b337-ef0b5ed92fad" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'\\nOptional cell for 2.1A\\n'" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "Optional cell for 2.1A\n", + "\"\"\"\n", + "\n", + "# Hint: Try writing code to generate values for x and y, then either write or import\n", + "# a function to calculate the correlation coefficient\n", + "\n", + "# Your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mPxYc8tNMq52" + }, + "source": [ + "Now run the following code box to use the Pandas `.corr()` function to calculate the correlation coefficient between our variables. Note that pandas outputs the results as a matrix." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 320 + }, + "id": "TKrIjyt657ir", + "outputId": "80ea1cfc-18ec-49bd-8c21-5a0e33502bdf" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"df[stat_vars_to_query]\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"variable\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"Count_Person\",\n \"Percent_Person_PhysicalInactivity\",\n \"Percent_Person_WithHighCholesterol\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3692768160533883,\n \"min\": -0.032606435112257866,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 1.0,\n 0.05966842357472978,\n 0.04809963828980299\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3039891315555219,\n \"min\": 0.05966842357472978,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.05966842357472978,\n 1.0,\n 0.4366429779497951\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3018122804213848,\n \"min\": 0.07380691021769277,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.07380691021769277,\n 0.7788343765257514,\n 0.3694331027620301\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.32460607938114927,\n \"min\": 0.025619158392611367,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.025619158392611367,\n 
0.7446432625492557,\n 0.381625664582876\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.34286073197902867,\n \"min\": -0.006579247299365092,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n -0.006579247299365092,\n 0.7007758800234068,\n 0.21400402260098858\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.29745289241877604,\n \"min\": 0.04809963828980299,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.04809963828980299,\n 0.4366429779497951,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.3527430213065325,\n \"min\": -0.032606435112257866,\n \"max\": 1.0,\n \"num_unique_values\": 7,\n \"samples\": [\n -0.032606435112257866,\n 0.7531559280354309,\n 0.29900147207085953\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + "variable Count_Person \\\n", + "variable \n", + "Count_Person 1.000000 \n", + "Percent_Person_PhysicalInactivity 0.059668 \n", + "Percent_Person_SleepLessThan7Hours 0.073807 \n", + "Percent_Person_WithHighBloodPressure 0.025619 \n", + "Percent_Person_WithMentalHealthNotGood -0.006579 \n", + "Percent_Person_WithHighCholesterol 0.048100 \n", + "Percent_Person_Obesity -0.032606 \n", + "\n", + "variable Percent_Person_PhysicalInactivity \\\n", + "variable \n", + "Count_Person 0.059668 \n", + "Percent_Person_PhysicalInactivity 1.000000 \n", + "Percent_Person_SleepLessThan7Hours 0.778834 \n", + "Percent_Person_WithHighBloodPressure 0.744643 \n", + "Percent_Person_WithMentalHealthNotGood 0.700776 \n", + "Percent_Person_WithHighCholesterol 0.436643 \n", + "Percent_Person_Obesity 0.753156 \n", + "\n", + "variable Percent_Person_SleepLessThan7Hours \\\n", + "variable \n", + "Count_Person 0.073807 \n", + "Percent_Person_PhysicalInactivity 0.778834 \n", + "Percent_Person_SleepLessThan7Hours 1.000000 \n", + "Percent_Person_WithHighBloodPressure 0.745474 \n", + "Percent_Person_WithMentalHealthNotGood 0.619343 \n", + "Percent_Person_WithHighCholesterol 0.369433 \n", + "Percent_Person_Obesity 0.657111 \n", + "\n", + "variable Percent_Person_WithHighBloodPressure \\\n", + "variable \n", + "Count_Person 0.025619 \n", + "Percent_Person_PhysicalInactivity 0.744643 \n", + "Percent_Person_SleepLessThan7Hours 0.745474 \n", + "Percent_Person_WithHighBloodPressure 1.000000 \n", + "Percent_Person_WithMentalHealthNotGood 0.690294 \n", + "Percent_Person_WithHighCholesterol 0.381626 \n", + "Percent_Person_Obesity 0.825544 \n", + "\n", + "variable Percent_Person_WithMentalHealthNotGood \\\n", + "variable \n", + "Count_Person -0.006579 \n", + "Percent_Person_PhysicalInactivity 0.700776 \n", + "Percent_Person_SleepLessThan7Hours 0.619343 \n", + "Percent_Person_WithHighBloodPressure 0.690294 \n", + "Percent_Person_WithMentalHealthNotGood 1.000000 \n", + 
"Percent_Person_WithHighCholesterol 0.214004 \n", + "Percent_Person_Obesity 0.735612 \n", + "\n", + "variable Percent_Person_WithHighCholesterol \\\n", + "variable \n", + "Count_Person 0.048100 \n", + "Percent_Person_PhysicalInactivity 0.436643 \n", + "Percent_Person_SleepLessThan7Hours 0.369433 \n", + "Percent_Person_WithHighBloodPressure 0.381626 \n", + "Percent_Person_WithMentalHealthNotGood 0.214004 \n", + "Percent_Person_WithHighCholesterol 1.000000 \n", + "Percent_Person_Obesity 0.299001 \n", + "\n", + "variable Percent_Person_Obesity \n", + "variable \n", + "Count_Person -0.032606 \n", + "Percent_Person_PhysicalInactivity 0.753156 \n", + "Percent_Person_SleepLessThan7Hours 0.657111 \n", + "Percent_Person_WithHighBloodPressure 0.825544 \n", + "Percent_Person_WithMentalHealthNotGood 0.735612 \n", + "Percent_Person_WithHighCholesterol 0.299001 \n", + "Percent_Person_Obesity 1.000000 " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# calculate correlation\n", + "df[stat_vars_to_query].corr()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XHsUlBdXNC1X" + }, + "source": [ + "\n", + "**2.1B)** Explain why the diagonals of the matrix have the value 1.\n", + "\n", + "**2.1C)** What is the correlation coefficient between `Count_Person` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between population and obesity rate?\n", + "\n", + "**2.1D)** What is the correlation coefficient between `Percent_Person_PhysicalInactivity` and `Percent_Person_Obesity`? 
 What does the correlation coefficient imply about the relationship between physical inactivity and obesity rate?\n", + "\n", + "**2.1E)** In general, would you prefer a regression model to include features that correlate strongly with the dependent variable, or features with no correlation?\n", + "\n", + "**2.1F)** You find a new feature with correlation coefficient $r=-0.97$ between it and obesity rates. Would it be a good idea to add this new feature to your model?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HXG9__t8YAqy" + }, + "source": [ + "### 2.2) $R^2$ score\n", + "\n", + "To quantify how predictive a linear regression model is overall, we can use the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), $R^2$ (pronounced \"R squared\").\n", + "\n", + "Mathematically, the $R^2$ score is defined as:\n", + "\n", + "$$S_{residuals} = \\sum_i{(y_i - f_i)^2} \\\\\n", + "S_{total} = \\sum_i{(y_i - \\bar{y})^2}\\\\\n", + "R^2 = 1 - \\frac{S_{residuals}}{S_{total}}$$\n", + "\n", + "where the $y_i$ are the actual dependent variable values, the $f_i$ are the predicted dependent variable values, and $\\bar{y}$ is the average of the $y_i$.\n", + "\n", + "Conceptually, the $R^2$ score is a measure of explained variance. If $R^2=0.75$, that means that 75% of the variance in the dependent variable has been accounted for by our model, while the remaining 25% has not.\n", + "\n", + "**2.2A)** Based on the mathematical definition, what is the range of possible values for $R^2$?\n", + "\n", + "**2.2B)** Come up with a situation (e.g. what would the data look like) where:\n", + "\n", + "A) $R^2 = 1.0$\n", + "\n", + "B) $R^2 = 0.0$\n", + "\n", + "Let's now analyze what the $R^2$ value is for our model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-rnvtExD5_U1", + "outputId": "51ee0f86-162f-4542-eeb6-809deb556b88" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model R^2 = 0.7577718062114178\n" + ] + } + ], + "source": [ + "# calculate R^2\n", + "print(\"Model R^2 =\", results.rsquared)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_L-1iVbXKtc" + }, + "source": [ + "**2.2C)** Is the model's $R^2$ a \"good\" score?\n", + "\n", + "**2.2D)** Can you think of any ways we can change our model that would improve the $R^2$ score?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t_Eieuedizkv" + }, + "source": [ + "### 2.3) Adjusted $R^2$\n", + "\n", + "There's an issue with $R^2$ scores that one needs to be aware of when working with multiple independent variables: namely, that the number of independent variables used can affect the $R^2$ score.\n", + "\n", + "Let's see this in practice. Let's create a new dataframe with an extra 100 dummy variables (randomly sampled from a 0-mean 1-variance normal distribution) tacked on." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 510 + }, + "id": "iF9B9dPJ1P8G", + "outputId": "66d7dae3-11cb-4b46-a15e-846812c9f5b7" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "df_padded" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " City Name Count_Person Percent_Person_Obesity \\\n", + "place \n", + "geoId/0103076 Auburn 82025.0 33.0 \n", + "geoId/0107000 Birmingham 196644.0 44.9 \n", + "geoId/0135896 Hoover 92448.0 32.5 \n", + "geoId/0137000 Huntsville 225564.0 37.5 \n", + "geoId/0150000 Mobile 182595.0 44.2 \n", + "... ... ... ... \n", + "geoId/5531000 Green Bay 105744.0 38.9 \n", + "geoId/5539225 Kenosha 98211.0 43.7 \n", + "geoId/5548000 Madison 280305.0 32.1 \n", + "geoId/5553000 Milwaukee 561385.0 43.4 \n", + "geoId/5566000 Racine 76602.0 42.9 \n", + "\n", + " Percent_Person_PhysicalInactivity \\\n", + "place \n", + "geoId/0103076 23.6 \n", + "geoId/0107000 32.9 \n", + "geoId/0135896 19.7 \n", + "geoId/0137000 24.0 \n", + "geoId/0150000 28.7 \n", + "... ... \n", + "geoId/5531000 26.7 \n", + "geoId/5539225 23.8 \n", + "geoId/5548000 18.7 \n", + "geoId/5553000 28.8 \n", + "geoId/5566000 28.2 \n", + "\n", + " Percent_Person_SleepLessThan7Hours \\\n", + "place \n", + "geoId/0103076 36.0 \n", + "geoId/0107000 42.9 \n", + "geoId/0135896 33.6 \n", + "geoId/0137000 40.0 \n", + "geoId/0150000 43.4 \n", + "... ... \n", + "geoId/5531000 33.1 \n", + "geoId/5539225 36.6 \n", + "geoId/5548000 29.9 \n", + "geoId/5553000 40.0 \n", + "geoId/5566000 39.2 \n", + "\n", + " Percent_Person_WithHighBloodPressure \\\n", + "place \n", + "geoId/0103076 34.3 \n", + "geoId/0107000 45.0 \n", + "geoId/0135896 32.6 \n", + "geoId/0137000 36.5 \n", + "geoId/0150000 39.8 \n", + "... ... \n", + "geoId/5531000 28.1 \n", + "geoId/5539225 29.9 \n", + "geoId/5548000 26.6 \n", + "geoId/5553000 36.7 \n", + "geoId/5566000 32.3 \n", + "\n", + " Percent_Person_WithHighCholesterol \\\n", + "place \n", + "geoId/0103076 30.6 \n", + "geoId/0107000 31.6 \n", + "geoId/0135896 31.0 \n", + "geoId/0137000 31.6 \n", + "geoId/0150000 32.5 \n", + "... ... 
\n", + "geoId/5531000 30.7 \n", + "geoId/5539225 30.0 \n", + "geoId/5548000 28.5 \n", + "geoId/5553000 30.1 \n", + "geoId/5566000 32.0 \n", + "\n", + " Percent_Person_WithMentalHealthNotGood Random Variable 0 \\\n", + "place \n", + "geoId/0103076 17.8 -1.564312 \n", + "geoId/0107000 19.7 -0.580159 \n", + "geoId/0135896 15.4 -0.322616 \n", + "geoId/0137000 18.0 0.768514 \n", + "geoId/0150000 19.9 0.207217 \n", + "... ... ... \n", + "geoId/5531000 17.9 -0.104983 \n", + "geoId/5539225 18.6 -0.355349 \n", + "geoId/5548000 15.6 -0.648119 \n", + "geoId/5553000 19.0 -0.154089 \n", + "geoId/5566000 18.4 1.638922 \n", + "\n", + " Random Variable 1 ... Random Variable 90 Random Variable 91 \\\n", + "place ... \n", + "geoId/0103076 0.288515 ... 0.762521 0.251051 \n", + "geoId/0107000 0.849181 ... 0.215843 1.553184 \n", + "geoId/0135896 -1.748737 ... 2.036116 0.993741 \n", + "geoId/0137000 -0.534476 ... 0.950064 0.730344 \n", + "geoId/0150000 1.028760 ... -0.775507 1.338210 \n", + "... ... ... ... ... \n", + "geoId/5531000 -0.856795 ... -0.945322 -0.219595 \n", + "geoId/5539225 0.348573 ... -0.789575 0.590118 \n", + "geoId/5548000 0.025662 ... -0.151965 0.835380 \n", + "geoId/5553000 -0.339432 ... 2.255458 1.357828 \n", + "geoId/5566000 -0.543906 ... 0.370553 -0.606273 \n", + "\n", + " Random Variable 92 Random Variable 93 Random Variable 94 \\\n", + "place \n", + "geoId/0103076 -0.697129 -1.697195 0.399706 \n", + "geoId/0107000 -1.766115 1.152941 0.712426 \n", + "geoId/0135896 -1.786077 -0.264808 -1.922278 \n", + "geoId/0137000 0.007471 3.514180 0.145648 \n", + "geoId/0150000 -0.395432 -0.830337 -0.558512 \n", + "... ... ... ... 
\n", + "geoId/5531000 -2.113165 0.614379 0.110795 \n", + "geoId/5539225 -0.193587 0.502188 0.124404 \n", + "geoId/5548000 -1.381286 0.303114 0.540398 \n", + "geoId/5553000 0.692794 0.924034 0.951688 \n", + "geoId/5566000 1.066660 0.022132 0.039135 \n", + "\n", + " Random Variable 95 Random Variable 96 Random Variable 97 \\\n", + "place \n", + "geoId/0103076 -0.557155 0.444760 1.787642 \n", + "geoId/0107000 0.936660 0.576485 -0.127241 \n", + "geoId/0135896 -1.227397 -1.723762 0.847944 \n", + "geoId/0137000 -1.254448 0.275048 -1.241024 \n", + "geoId/0150000 -0.367606 -1.049303 -3.161325 \n", + "... ... ... ... \n", + "geoId/5531000 -0.250010 0.926896 -0.526254 \n", + "geoId/5539225 -0.376209 -0.331331 0.697165 \n", + "geoId/5548000 -0.359988 0.007904 0.010788 \n", + "geoId/5553000 -0.071096 0.097582 0.952135 \n", + "geoId/5566000 1.102639 -0.438601 -1.744647 \n", + "\n", + " Random Variable 98 Random Variable 99 \n", + "place \n", + "geoId/0103076 0.340410 1.535658 \n", + "geoId/0107000 -0.543845 1.536037 \n", + "geoId/0135896 -0.446194 -0.320127 \n", + "geoId/0137000 -0.163577 0.376057 \n", + "geoId/0150000 -0.586668 0.934307 \n", + "... ... ... 
\n", + "geoId/5531000 -0.359181 -1.424956 \n", + "geoId/5539225 1.029427 -1.143744 \n", + "geoId/5548000 -0.276071 0.979319 \n", + "geoId/5553000 -1.019633 -0.778193 \n", + "geoId/5566000 1.245214 2.216294 \n", + "\n", + "[498 rows x 108 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Pad our dataframe with more random variables\n", + "num_rows = len(df.index)\n", + "num_new_columns = 100\n", + "random_data = np.random.normal(loc=0, scale=1, size=(num_rows, num_new_columns))\n", + "new_column_names = [f\"Random Variable {i}\" for i in range(num_new_columns)]\n", + "random_data_df = pd.DataFrame(\n", + " random_data,\n", + " columns=new_column_names,\n", + " index=df.index\n", + ")\n", + "df_padded = pd.concat([df, random_data_df], axis=1)\n", + "display(df_padded)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q5f22xUmvoN_" + }, + "source": [ + "Now let's fit a new model to the data and compare $R^2$ scores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rn57oEF82dju", + "outputId": "27dd8be5-fae5-45b4-a31f-5858a087d3d5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original Model R^2 = 0.7577718062114178\n", + "Padded Model R^2 = 0.7988444670439291\n" + ] + } + ], + "source": [ + "# New R^2\n", + "y_padded = df_padded[dep_var].to_numpy().reshape(-1, 1)\n", + "x_padded = df_padded.loc[:, ~df_padded.columns.isin([dep_var, \"City Name\"])]\n", + "x_padded = sm.add_constant(x_padded)\n", + "\n", + "padded_model = sm.OLS(y_padded, x_padded)\n", + "padded_results = padded_model.fit()\n", + "\n", + "print(\"Original Model R^2 =\", results.rsquared)\n", + "print(\"Padded Model R^2 =\", padded_results.rsquared)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j-j4IbOtwFj8" + }, + "source": [ + "**2.3A)** Which model had a better $R^2$ score?\n", +
"\n", + "**2.3B)** Think about the variables used in each model. Should one model be much more predictive than another?\n", + "\n", + "**2.3C)** In general, how would you expect $R^2$ to change as we increase the number of independent variables?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2Ipg_orhxOF_" + }, + "source": [ + "So how do we fix this? We can adjust our $R^2$ metric to account for the number of variables. The most popular way to define the _**adjusted $R^2$**_ score is as follows:\n", + "\n", + "$$R^{2}_{adj}=1-(1-R^{2})\\frac{n-1}{n-p-1}$$\n", + "\n", + "where $n$ is the number of data points and $p$ is the number of independent variables.\n", + "\n", + "Now let's compare the adjusted $R^2$ of our models." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7pZ9_NmZisGi", + "outputId": "bfa5cddd-dddf-45c9-8082-ab58dbe5c286" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original Model Adjusted R^2 = 0.7548117875500502\n", + "Padded Model Adjusted R^2 = 0.7443112535059662\n" + ] + } + ], + "source": [ + "# Adjusted R^2\n", + "print(\"Original Model Adjusted R^2 =\", results.rsquared_adj)\n", + "print(\"Padded Model Adjusted R^2 =\", padded_results.rsquared_adj)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qU9VwLsNHcKD" + }, + "source": [ + "**2.3D)** Which model had a better adjusted $R^2$ score?\n", + "\n", + "**2.3E)** When would you prefer to use adjusted $R^2$ over $R^2$ to evaluate model fit?"
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1tiopX7PWHiu" + }, + "source": [ + "## 3) Interpreting regression models\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qC7aC0y-O3_D" + }, + "source": [ + "### 3.1) Analyzing weights and intercepts\n", + "The parameters of the regression model itself can also yield important insights.\n", + "\n", + "Run the following code box to display the weights and intercept of our original model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 304 + }, + "id": "_y0xeWysPIm6", + "outputId": "6ede53a7-bbc0-474e-bc4a-4567735c8d75" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "const -0.19367\n", + "Count_Person -0.00000\n", + "Percent_Person_PhysicalInactivity 0.30528\n", + "Percent_Person_SleepLessThan7Hours -0.12455\n", + "Percent_Person_WithHighBloodPressure 0.75717\n", + "Percent_Person_WithHighCholesterol -0.13520\n", + "Percent_Person_WithMentalHealthNotGood 0.69012\n", + "dtype: float64" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Display weights/coefficients\n", + "display(results.params.round(5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dvpGBohWPymA" + }, + "source": [ + "**3.1A)** What is the intercept of our model? What are its units?\n", + "\n", + "**3.1B)** What are the units on each of the model weights (aka coefficients)?\n", + "\n", + "**3.1C)** Which variables matter most to our model?\n", + "\n", + "**3.1D)** In words, describe what a weight/coefficient in a linear regression means.\n", + "\n", + "**3.1E)** Our model is used to generate a predicted obesity rate for a fictional city named Dataopolis. If we increased `Percent_Person_WithMentalHealthNotGood` for Dataopolis by 1 unit, _while keeping the values for all remaining variables constant_, by how much would we expect our predicted obesity rate to change?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2ZoaATgWh-fR" + }, + "source": [ + "### 3.2) The effect of correlated variables\n", + "\n", + "When interpreting weights, one thing to look out for is independent variables that are highly correlated with each other.\n", + "\n", + "Let's illustrate why this might be a problem by adding a variable that is correlated with one of the existing variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 845 + }, + "id": "uP4XtXkfLB1U", + "outputId": "f55a1573-1c38-49ad-ea26-9675b434e7aa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "New dataframe to fit:\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"correlated_df\",\n \"rows\": 498,\n \"fields\": [\n {\n \"column\": \"place\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 498,\n \"samples\": [\n \"geoId/5363000\",\n \"geoId/0639892\",\n \"geoId/1714351\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"City Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 475,\n \"samples\": [\n \"Memphis\",\n \"Plano\",\n \"Avondale\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count_Person\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 471628.1194557695,\n \"min\": 76212.0,\n \"max\": 8258035.0,\n \"num_unique_values\": 498,\n \"samples\": [\n 755078.0,\n 78135.0,\n 81004.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_Obesity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6.3675302626575006,\n \"min\": 14.1,\n \"max\": 48.9,\n \"num_unique_values\": 220,\n \"samples\": [\n 33.4,\n 41.6,\n 23.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n 
}\n },\n {\n \"column\": \"Percent_Person_PhysicalInactivity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.79401923130349,\n \"min\": 11.2,\n \"max\": 41.8,\n \"num_unique_values\": 209,\n \"samples\": [\n 25.0,\n 39.5,\n 18.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_SleepLessThan7Hours\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.3313227959895455,\n \"min\": 24.9,\n \"max\": 49.5,\n \"num_unique_values\": 166,\n \"samples\": [\n 42.5,\n 25.6,\n 28.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighBloodPressure\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4.56550336745893,\n \"min\": 21.3,\n \"max\": 45.7,\n \"num_unique_values\": 170,\n \"samples\": [\n 40.3,\n 25.6,\n 41.2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithHighCholesterol\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.089525781869934,\n \"min\": 24.6,\n \"max\": 35.6,\n \"num_unique_values\": 95,\n \"samples\": [\n 25.9,\n 34.1,\n 27.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Percent_Person_WithMentalHealthNotGood\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.1019573952670365,\n \"min\": 11.5,\n \"max\": 23.3,\n \"num_unique_values\": 103,\n \"samples\": [\n 15.8,\n 16.3,\n 12.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Correlated Variable\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.2936580170071315,\n \"min\": 9.748686615105605,\n \"max\": 23.243226154344537,\n \"num_unique_values\": 498,\n \"samples\": [\n 16.440967097355323,\n 17.12077873386613,\n 17.01999484946955\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "correlated_df" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + "variable City Name Count_Person Percent_Person_Obesity \\\n", + "place \n", + "geoId/0103076 Auburn 82025.0 33.0 \n", + "geoId/0107000 Birmingham 196644.0 44.9 \n", + "geoId/0135896 Hoover 92448.0 32.5 \n", + "geoId/0137000 Huntsville 225564.0 37.5 \n", + "geoId/0150000 Mobile 182595.0 44.2 \n", + "... ... ... ... \n", + "geoId/5531000 Green Bay 105744.0 38.9 \n", + "geoId/5539225 Kenosha 98211.0 43.7 \n", + "geoId/5548000 Madison 280305.0 32.1 \n", + "geoId/5553000 Milwaukee 561385.0 43.4 \n", + "geoId/5566000 Racine 76602.0 42.9 \n", + "\n", + "variable Percent_Person_PhysicalInactivity \\\n", + "place \n", + "geoId/0103076 23.6 \n", + "geoId/0107000 32.9 \n", + "geoId/0135896 19.7 \n", + "geoId/0137000 24.0 \n", + "geoId/0150000 28.7 \n", + "... ... \n", + "geoId/5531000 26.7 \n", + "geoId/5539225 23.8 \n", + "geoId/5548000 18.7 \n", + "geoId/5553000 28.8 \n", + "geoId/5566000 28.2 \n", + "\n", + "variable Percent_Person_SleepLessThan7Hours \\\n", + "place \n", + "geoId/0103076 36.0 \n", + "geoId/0107000 42.9 \n", + "geoId/0135896 33.6 \n", + "geoId/0137000 40.0 \n", + "geoId/0150000 43.4 \n", + "... ... \n", + "geoId/5531000 33.1 \n", + "geoId/5539225 36.6 \n", + "geoId/5548000 29.9 \n", + "geoId/5553000 40.0 \n", + "geoId/5566000 39.2 \n", + "\n", + "variable Percent_Person_WithHighBloodPressure \\\n", + "place \n", + "geoId/0103076 34.3 \n", + "geoId/0107000 45.0 \n", + "geoId/0135896 32.6 \n", + "geoId/0137000 36.5 \n", + "geoId/0150000 39.8 \n", + "... ... \n", + "geoId/5531000 28.1 \n", + "geoId/5539225 29.9 \n", + "geoId/5548000 26.6 \n", + "geoId/5553000 36.7 \n", + "geoId/5566000 32.3 \n", + "\n", + "variable Percent_Person_WithHighCholesterol \\\n", + "place \n", + "geoId/0103076 30.6 \n", + "geoId/0107000 31.6 \n", + "geoId/0135896 31.0 \n", + "geoId/0137000 31.6 \n", + "geoId/0150000 32.5 \n", + "... ... 
\n", + "geoId/5531000 30.7 \n", + "geoId/5539225 30.0 \n", + "geoId/5548000 28.5 \n", + "geoId/5553000 30.1 \n", + "geoId/5566000 32.0 \n", + "\n", + "variable Percent_Person_WithMentalHealthNotGood Correlated Variable \n", + "place \n", + "geoId/0103076 17.8 18.761300 \n", + "geoId/0107000 19.7 17.655787 \n", + "geoId/0135896 15.4 14.736255 \n", + "geoId/0137000 18.0 16.549451 \n", + "geoId/0150000 19.9 20.277958 \n", + "... ... ... \n", + "geoId/5531000 17.9 18.645080 \n", + "geoId/5539225 18.6 17.067335 \n", + "geoId/5548000 15.6 15.665917 \n", + "geoId/5553000 19.0 19.073143 \n", + "geoId/5566000 18.4 17.106196 \n", + "\n", + "[498 rows x 9 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Correlated Model Weights and Intercept:\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "const -0.28192\n", + "Count_Person -0.00000\n", + "Percent_Person_PhysicalInactivity 0.30604\n", + "Percent_Person_SleepLessThan7Hours -0.12529\n", + "Percent_Person_WithHighBloodPressure 0.75756\n", + "Percent_Person_WithHighCholesterol -0.13345\n", + "Percent_Person_WithMentalHealthNotGood 0.55372\n", + "Correlated Variable 0.13921\n", + "dtype: float64" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# New variable correlated with Percent_Person_WithMentalHealthNotGood\n", + "correlated_df = df.copy()\n", + "target_var = \"Percent_Person_WithMentalHealthNotGood\"\n", + "noise = np.random.normal(size=(len(correlated_df.index),))\n", + "correlated_df[\"Correlated Variable\"] = correlated_df[target_var] + noise\n", + "\n", + "# show new data frame\n", + "print(\"New dataframe to fit:\")\n", + "display(correlated_df)\n", + "\n", + "# Create a new model\n", + "y_corr = correlated_df[dep_var].to_numpy().reshape(-1, 1)\n", + "x_corr = correlated_df.loc[:, ~correlated_df.columns.isin([dep_var, \"City Name\"])]\n", + "x_corr = sm.add_constant(x_corr)\n", + "\n", + "correlated_model = sm.OLS(y_corr, x_corr)\n", + "correlated_results = correlated_model.fit()\n", + "\n", + "print(\"Correlated Model Weights and Intercept:\")\n", + "display(correlated_results.params.round(5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HEHJWPxibiY3" + }, + "source": [ + "**3.2A)** Compare the new weights of the correlated model with the weights of our original model. What happened to the weights corresponding to `Percent_Person_WithMentalHealthNotGood`?\n", + "\n", + "**3.2B)** Thinking back to your answers for Q3.1C-E, how might correlated variables affect the interpretation of model weights?" 
+ ] + } + ], + "metadata": { + "colab": { + "include_colab_link": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}