diff --git a/03_main.qmd b/03_main.qmd index c8dd874..3287753 100644 --- a/03_main.qmd +++ b/03_main.qmd @@ -2,6 +2,13 @@ ## Learning Objectives -- item 1 -- item 2 -- item 3 +- Perform linear regression with a **single predictor variable.** +- Estimate the **standard error** of regression coefficients. +- Evaluate the **goodness of fit** of a regression. +- Perform linear regression with **multiple predictor variables.** +- Evaluate the **relative importance of variables** in a multiple linear regression. +- Include **interaction effects** in a multiple linear regression. +- Perform linear regression with **qualitative predictor variables.** +- Model **non-linear relationships** using polynomial regression. +- Identify **non-linearity** in a data set. +- Compare and contrast linear regression with **KNN regression.** diff --git a/03_notes.qmd b/03_notes.qmd index b65567e..44703b0 100644 --- a/03_notes.qmd +++ b/03_notes.qmd @@ -1 +1,272 @@ # Notes {-} + +## Questions to Answer + +Recall the `Advertising` data from **Chapter 2**. Here are a few important questions that we might seek to address: + +1. **Is there a relationship between advertising budget and sales?** +2. **How strong is the relationship between advertising budget and sales?** Does knowledge of the advertising budget provide a lot of information about product sales? +3. **Which media are associated with sales?** +4. **How large is the association between each medium and sales?** For every dollar spent on advertising in a particular medium, by what amount will sales increase? +5. **How accurately can we predict future sales?** +6. **Is the relationship linear?** If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used. +7. 
**Is there synergy among the advertising media?** Or, in stats terms, is there an interaction effect? + +## Simple Linear Regression: Definition + +**Simple linear regression:** A very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor $X$. + +$$Y \approx \beta_{0} + \beta_{1}X$$ + +- Read "$\approx$" as *"is approximately modeled by."* +- $\beta_{0}$ = intercept +- $\beta_{1}$ = slope + +$$\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1}x$$ + +- $\hat{\beta}_{0}$ = our estimate of the intercept +- $\hat{\beta}_{1}$ = our estimate of the slope +- $x$ = a particular value of $X$ +- $\hat{y}$ = our prediction of $Y$ given $x$ +- The hat symbol denotes an estimated value. + +- Linear regression is a simple approach to supervised learning. + +## Simple Linear Regression: Visualization + +```{r} +#| label: fig3-1 +#| echo: false +#| fig-cap: For the `Advertising` data, the least squares fit for the regression of `sales` onto `TV` is shown. The fit is found by minimizing the residual sum of squares. Each grey line segment represents a residual. In this case a linear fit captures the essence of the relationship, although it overestimates the trend in the left of the plot. 
+#| out-width: 100% +knitr::include_graphics("images/fig3_1.jpg") +``` + +## Simple Linear Regression: Math + +- **RSS** = *residual sum of squares* + +$$\mathrm{RSS} = e^{2}_{1} + e^{2}_{2} + \ldots + e^{2}_{n}$$ + +$$\mathrm{RSS} = (y_{1} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{1})^{2} + (y_{2} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{2})^{2} + \ldots + (y_{n} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{n})^{2}$$ + +$$\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}$$ + +$$\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}$$ + +- $\bar{x}$, $\bar{y}$ = sample means of $x$ and $y$ + +### Visualization of Fit + +```{r} +#| label: fig3-2 +#| echo: false +#| fig-cap: Contour and three-dimensional plots of the RSS on the `Advertising` data, using `sales` as the response and `TV` as the predictor. The red dots correspond to the least squares estimates $\\hat\\beta_0$ and $\\hat\\beta_1$, given by (3.4). 
+#| out-width: 100% +knitr::include_graphics("images/fig3_2.jpg") +``` + +**Learning Objectives:** + +- Perform linear regression with a **single predictor variable.** + +## Assessing Accuracy of Coefficient Estimates + +$$Y = \beta_{0} + \beta_{1}X + \epsilon$$ + +- **RSE** = *residual standard error* +- An estimate of $\sigma$, the standard deviation of $\epsilon$ + +$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}$$ + +$$\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right],\ \ \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$ + +- **95% confidence interval:** a range of values such that with 95% probability, the range will contain the true unknown value of the parameter + - If we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter + +$$\hat\beta_1 \pm 2\ \cdot \ \mathrm{SE}(\hat\beta_1)$$ + +$$\hat\beta_0 \pm 2\ \cdot \ \mathrm{SE}(\hat\beta_0)$$ + +**Learning Objectives:** + +- Estimate the **standard error** of regression coefficients. + +## Assessing the Accuracy of the Model + +- **RSE** can be considered a measure of the *lack of fit* of the model. +- The *$R^2$* statistic (also called the coefficient of determination) provides an alternative measure of fit: the *proportion of variance explained*. It ranges from 0 to 1; what counts as a *good value* depends on the application. + +$$R^2 = 1 - \frac{RSS}{TSS}$$ + +where TSS is the *total sum of squares*: +$$TSS = \sum (y_i - \bar{y})^2$$ + +Quiz: Can *$R^2$* be negative? + +[Answer](https://www.graphpad.com/support/faq/how-can-rsup2sup-be-negative/) + +## Multiple Linear Regression + +**Multiple linear regression** extends simple linear regression to *p* predictors: + +$$Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \ldots + \beta_{p}X_p + \epsilon$$ + +- $\beta_{j}$ is the *average* effect on $Y$ of a one-unit increase in $X_{j}$, holding all other predictors fixed. 
+ +- The fit once again chooses the $\hat{\beta}_{j}$ that minimize the RSS. + +- The example in the book shows that although fitting *sales* against *newspaper* alone indicates a significant slope (0.055 ± 0.017), when you include *radio* in a multiple regression, *newspaper* no longer has a significant effect (-0.001 ± 0.006). + +### Important Questions + +1. *Is at least one of the predictors $X_1$, $X_2$, ... , $X_p$ useful in predicting +the response?* + + The F-statistic is close to 1 when there is no relationship, and greater than 1 otherwise. + +$$F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$ + +2. *Do all the predictors help to explain $Y$, or is only a subset of the +predictors useful?* + + p-values can help identify important predictors, but it is possible to be misled by them, especially with a large number of predictors. Variable selection methods include forward selection, backward selection, and mixed selection. The topic is continued in Chapter 6. + +3. *How well does the model fit the data?* + + **$R^2$** still gives the *proportion of the variance explained*, so look for values "close" to 1. One can also look at the **RSE**, which is generalized for multiple regression as: + +$$RSE = \sqrt{\frac{1}{n-p-1}RSS}$$ + +4. *Given a set of predictor values, what response value should we predict, +and how accurate is our prediction?* + + Three sources of uncertainty in predictions: + + * Uncertainty in the coefficient estimates $\hat{\beta}_j$ (reducible error) + * Model bias + * Irreducible error $\epsilon$ + +## Qualitative Predictors + +* Dummy variables: if there are $k$ levels, introduce $k-1$ dummy variables, each equal to one ("one-hot") when the underlying qualitative predictor takes that value. 
For example, if there are 3 levels, introduce two dummy variables and fit the model: + +$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$ + +| Qualitative Predictor | $x_{i1}$ | $x_{i2}$ | +| ---------------------- |:--------:|:--------:| +| level 0 (baseline) | 0 | 0 | +| level 1 | 1 | 0 | +| level 2 | 0 | 1 | + +* Coefficients are interpreted as the average effect relative to the baseline. + +* An alternative is to use index variables, with a separate intercept for each level: + +$$y_i = \begin{cases} \beta_{01} + \epsilon_i & \text{if level 0} \\ \beta_{02} + \epsilon_i & \text{if level 1} \\ \beta_{03} + \epsilon_i & \text{if level 2} \end{cases}$$ + +## Extensions + +- Interaction / Synergy effects + + Include a product term to account for synergy, where a change in one variable alters the association of $Y$ with another: + +$$Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \beta_{3}X_1 X_2 + \epsilon$$ + +- Non-linear relationships (e.g. polynomial fits) + +$$Y = \beta_{0} + \beta_{1}X + \beta_{2}X^2 + \ldots + \beta_{n}X^n + \epsilon$$ + +## Potential Problems + +1. *Non-linear relationships* + + Residual plots are a useful tool to see if any remaining trends exist. If so, consider fitting a transformation of the data. + +2. *Correlation of Error Terms* + + Linear regression assumes that the error terms $\epsilon_i$ are uncorrelated. Residuals may indicate that this is not correct (obvious *tracking* in the data). One could also look at the autocorrelation of the residuals. What to do about it? + +3. *Non-constant variance of error terms* + + Again, this can be revealed by examining the residuals. Consider a transformation of the response (such as $\log Y$ or $\sqrt{Y}$) to remove non-constant variance. The figure below shows residuals demonstrating non-constant variance, and shows this being mitigated to a great extent by log transforming the data. + +```{r} +#| label: fig3-11 +#| echo: false +#| fig-cap: Figure 3.11 +#| out-width: 100% +knitr::include_graphics("images/fig3_11.png") +``` + +4. 
*Outliers* + + - Outliers are points for which $y_i$ is far from the value predicted by the model (including irreducible error). See the point labeled '20' in figure 3.13. + - Detect outliers by plotting studentized residuals (residual $e_i$ divided by its estimated standard error) and look for residuals larger than 3 standard deviations in absolute value. + - An outlier may not affect the fit much but can have a dramatic effect on the **RSE**. + - Often outliers are mistakes in data collection and can be removed, but they could also indicate a deficient model. + +5. *High Leverage Points* + + - These are points with unusual values of $x_i$. An example is the point labeled '41' in figure 3.13. + - These points can have a large impact on the fit; in the example, including point 41 pulls the slope up significantly. + - Use the *leverage statistic* to identify high leverage points, which can be hard to spot in multiple regression. + +```{r} +#| label: fig3-13 +#| echo: false +#| fig-cap: Figure 3.13 +#| out-width: 100% +knitr::include_graphics("images/fig3_13.png") +``` + +6. *Collinearity* + + - Two or more predictor variables are closely related to one another. + - Simple collinearity can be identified by looking at correlations between predictors. + - Collinearity causes the standard errors to grow (and p-values to grow). + - It can often be dealt with by removing one of the highly correlated predictors or combining them. + - *Multicollinearity* (involving 3 or more predictors) is not so easy to identify. Use the *variance inflation factor*, the ratio of the variance of $\hat{\beta}_j$ when fitting the full model to the variance of $\hat{\beta}_j$ when fit on its own. It can be computed using the formula: + +$$VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}$$ + +where $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all the other predictors. + +## Answers to the Marketing Plan questions + +1. 
**Is there a relationship between advertising budget and sales?** + + Tool: Multiple regression; look at the F-statistic. + +2. **How strong is the relationship between advertising budget and sales?** + + Tool: **$R^2$** and **RSE** + +3. **Which media are associated with sales?** + + Tool: p-values for each predictor's *t-statistic*. Explored further in Chapter 6. + +4. **How large is the association between each medium and sales?** + + Tool: Confidence intervals on $\hat{\beta}_j$ + +5. **How accurately can we predict future sales?** + + Tool: Prediction intervals for an individual response, confidence intervals for the average response. + + +6. **Is the relationship linear?** + + Tool: Residual plots + +7. **Is there synergy among the advertising media?** + + Tool: Interaction terms and associated p-values. + +## Comparison of Linear Regression with K-Nearest Neighbors + +- This section examines the K-nearest neighbors (KNN) method (a non-parametric method). +- This is essentially a k-point moving average. +- This serves to illustrate the Bias-Variance trade-off nicely. + diff --git a/_freeze/03_notes/execute-results/html.json b/_freeze/03_notes/execute-results/html.json index de65636..09e50a6 100644 --- a/_freeze/03_notes/execute-results/html.json +++ b/_freeze/03_notes/execute-results/html.json @@ -1,11 +1,14 @@ { - "hash": "1157b4b6a66df780fe4afff52e109232", + "hash": "3e85649170309cf3a5e235e9d4c6c1b8", "result": { - "markdown": "# Data Structures and Sequences\n\n## Tuples\n\n![](https://pynative.com/wp-content/uploads/2021/02/python-tuple.jpg)\n\nA tuple is a fixed-length, immutable sequence of Python objects which, once assigned, cannot be changed. 
The easiest way to create one is with a comma-separated sequence of values wrapped in parentheses:\n\n::: {.cell execution_count=1}\n``` {.python .cell-code}\ntup = (4, 5, 6)\ntup\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```\n(4, 5, 6)\n```\n:::\n:::\n\n\nIn many contexts, the parentheses can be omitted\n\n::: {.cell execution_count=2}\n``` {.python .cell-code}\ntup = 4, 5, 6\ntup\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n(4, 5, 6)\n```\n:::\n:::\n\n\nYou can convert any sequence or iterator to a tuple by invoking\n\n::: {.cell execution_count=3}\n``` {.python .cell-code}\ntuple([4,0,2])\n\ntup = tuple('string')\n\ntup\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```\n('s', 't', 'r', 'i', 'n', 'g')\n```\n:::\n:::\n\n\nElements can be accessed with square brackets [] \n\nNote the zero indexing\n\n::: {.cell execution_count=4}\n``` {.python .cell-code}\ntup[0]\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```\n's'\n```\n:::\n:::\n\n\nTuples of tuples\n\n::: {.cell execution_count=5}\n``` {.python .cell-code}\nnested_tup = (4,5,6),(7,8)\n\nnested_tup\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```\n((4, 5, 6), (7, 8))\n```\n:::\n:::\n\n\n::: {.cell execution_count=6}\n``` {.python .cell-code}\nnested_tup[0]\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n(4, 5, 6)\n```\n:::\n:::\n\n\n::: {.cell execution_count=7}\n``` {.python .cell-code}\nnested_tup[1]\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```\n(7, 8)\n```\n:::\n:::\n\n\nWhile the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot:\n\n::: {.cell execution_count=8}\n``` {.python .cell-code}\ntup = tuple(['foo', [1, 2], True])\n\ntup[2]\n```\n\n::: {.cell-output .cell-output-display execution_count=8}\n```\nTrue\n```\n:::\n:::\n\n\n```{{python}}\n\ntup[2] = 
False\n\n```\n\n````\nTypeError Traceback (most recent call last)\nInput In [9], in ()\n----> 1 tup[2] = False\n\nTypeError: 'tuple' object does not support item assignment\n````\n\nIf an object inside a tuple is mutable, such as a list, you can modify it in place\n\n::: {.cell execution_count=9}\n``` {.python .cell-code}\ntup[1].append(3)\n\ntup\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```\n('foo', [1, 2, 3], True)\n```\n:::\n:::\n\n\nYou can concatenate tuples using the + operator to produce longer tuples:\n\n::: {.cell execution_count=10}\n``` {.python .cell-code}\n(4, None, 'foo') + (6, 0) + ('bar',)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```\n(4, None, 'foo', 6, 0, 'bar')\n```\n:::\n:::\n\n\n### Unpacking tuples\n\nIf you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign:\n\n::: {.cell execution_count=11}\n``` {.python .cell-code}\ntup = (4, 5, 6)\ntup\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```\n(4, 5, 6)\n```\n:::\n:::\n\n\n::: {.cell execution_count=12}\n``` {.python .cell-code}\na, b, c = tup\n\nc\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```\n6\n```\n:::\n:::\n\n\nEven sequences with nested tuples can be unpacked:\n\n::: {.cell execution_count=13}\n``` {.python .cell-code}\ntup = 4, 5, (6,7)\n\na, b, (c, d) = tup\n\nd\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```\n7\n```\n:::\n:::\n\n\nTo easily swap variable names\n\n::: {.cell execution_count=14}\n``` {.python .cell-code}\na, b = 1, 4\n\na\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```\n1\n```\n:::\n:::\n\n\n::: {.cell execution_count=15}\n``` {.python .cell-code}\nb\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```\n4\n```\n:::\n:::\n\n\n::: {.cell execution_count=16}\n``` 
{.python .cell-code}\nb, a = a, b\n\na\n```\n\n::: {.cell-output .cell-output-display execution_count=16}\n```\n4\n```\n:::\n:::\n\n\n::: {.cell execution_count=17}\n``` {.python .cell-code}\nb\n```\n\n::: {.cell-output .cell-output-display execution_count=17}\n```\n1\n```\n:::\n:::\n\n\nA common use of variable unpacking is iterating over sequences of tuples or lists\n\n::: {.cell execution_count=18}\n``` {.python .cell-code}\nseq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]\n\nseq\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```\n[(1, 2, 3), (4, 5, 6), (7, 8, 9)]\n```\n:::\n:::\n\n\n::: {.cell execution_count=19}\n``` {.python .cell-code}\nfor a, b, c in seq:\n print(f'a={a}, b={b}, c={c}')\n```\n\n::: {.cell-output .cell-output-stdout}\n```\na=1, b=2, c=3\na=4, b=5, c=6\na=7, b=8, c=9\n```\n:::\n:::\n\n\n`*rest` syntax for plucking elements\n\n::: {.cell execution_count=20}\n``` {.python .cell-code}\nvalues = 1,2,3,4,5\n\na, b, *rest = values\n\nrest\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```\n[3, 4, 5]\n```\n:::\n:::\n\n\n As a matter of convention, many Python programmers will use the underscore (_) for unwanted variables:\n\n::: {.cell execution_count=21}\n``` {.python .cell-code}\na, b, *_ = values\n```\n:::\n\n\n### Tuple methods\n\nSince the size and contents of a tuple cannot be modified, it is very light on instance methods. A particularly useful one (also available on lists) is `count`\n\n::: {.cell execution_count=22}\n``` {.python .cell-code}\na = (1,2,2,2,2,3,4,5,7,8,9)\n\na.count(2)\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```\n4\n```\n:::\n:::\n\n\n## List\n\n![](https://pynative.com/wp-content/uploads/2021/03/python-list.jpg)\n\nIn contrast with tuples, lists are variable length and their contents can be modified in place.\n\nLists are mutable. 
\n\nLists use `[]` square brackets or the `list` function\n\n::: {.cell execution_count=23}\n``` {.python .cell-code}\na_list = [2, 3, 7, None]\n\ntup = (\"foo\", \"bar\", \"baz\")\n\nb_list = list(tup)\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```\n['foo', 'bar', 'baz']\n```\n:::\n:::\n\n\n::: {.cell execution_count=24}\n``` {.python .cell-code}\nb_list[1] = \"peekaboo\"\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```\n['foo', 'peekaboo', 'baz']\n```\n:::\n:::\n\n\nLists and tuples are semantically similar (though tuples cannot be modified) and can be used interchangeably in many functions.\n\n::: {.cell execution_count=25}\n``` {.python .cell-code}\ngen = range(10)\n\ngen\n```\n\n::: {.cell-output .cell-output-display execution_count=25}\n```\nrange(0, 10)\n```\n:::\n:::\n\n\n::: {.cell execution_count=26}\n``` {.python .cell-code}\nlist(gen)\n```\n\n::: {.cell-output .cell-output-display execution_count=26}\n```\n[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n:::\n:::\n\n\n### Adding and removing list elements\n\nthe `append` method\n\n::: {.cell execution_count=27}\n``` {.python .cell-code}\nb_list.append(\"dwarf\")\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=27}\n```\n['foo', 'peekaboo', 'baz', 'dwarf']\n```\n:::\n:::\n\n\nthe `insert` method\n\n::: {.cell execution_count=28}\n``` {.python .cell-code}\nb_list.insert(1, \"red\")\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```\n['foo', 'red', 'peekaboo', 'baz', 'dwarf']\n```\n:::\n:::\n\n\n`insert` is computationally more expensive than `append`\n\nthe `pop` method, the inverse of `insert`\n\n::: {.cell execution_count=29}\n``` {.python .cell-code}\nb_list.pop(2)\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```\n'peekaboo'\n```\n:::\n:::\n\n\n::: {.cell execution_count=30}\n``` {.python .cell-code}\nb_list\n```\n\n::: {.cell-output .cell-output-display 
execution_count=30}\n```\n['foo', 'red', 'baz', 'dwarf']\n```\n:::\n:::\n\n\nthe `remove` method\n\n::: {.cell execution_count=31}\n``` {.python .cell-code}\nb_list.append(\"foo\")\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=31}\n```\n['foo', 'red', 'baz', 'dwarf', 'foo']\n```\n:::\n:::\n\n\n::: {.cell execution_count=32}\n``` {.python .cell-code}\nb_list.remove(\"foo\")\n\nb_list\n```\n\n::: {.cell-output .cell-output-display execution_count=32}\n```\n['red', 'baz', 'dwarf', 'foo']\n```\n:::\n:::\n\n\nCheck if a list contains a value using the `in` keyword:\n\n::: {.cell execution_count=33}\n``` {.python .cell-code}\n\"dwarf\" in b_list\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```\nTrue\n```\n:::\n:::\n\n\nThe keyword `not` can be used to negate an `in`\n\n::: {.cell execution_count=34}\n``` {.python .cell-code}\n\"dwarf\" not in b_list\n```\n\n::: {.cell-output .cell-output-display execution_count=34}\n```\nFalse\n```\n:::\n:::\n\n\n### Concatenating and combining lists\n\nsimilar with tuples, use `+` to concatenate\n\n::: {.cell execution_count=35}\n``` {.python .cell-code}\n[4, None, \"foo\"] + [7, 8, (2, 3)]\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```\n[4, None, 'foo', 7, 8, (2, 3)]\n```\n:::\n:::\n\n\nthe `extend` method\n\n::: {.cell execution_count=36}\n``` {.python .cell-code}\nx = [4, None, \"foo\"]\n\nx.extend([7,8,(2,3)])\n\nx\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```\n[4, None, 'foo', 7, 8, (2, 3)]\n```\n:::\n:::\n\n\nlist concatenation by addition is an expensive operation\n\nusing `extend` is preferable\n\n```{{python}}\neverything = []\nfor chunk in list_of_lists:\n everything.extend(chunk)\n\n```\n\nis generally faster than\n\n```{{python}}\n\neverything = []\nfor chunk in list_of_lists:\n everything = everything + chunk\n\n```\n\n### Sorting\n\nthe `sort` method\n\n::: {.cell execution_count=37}\n``` {.python .cell-code}\na = [7, 2, 5, 
1, 3]\n\na.sort()\n\na\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```\n[1, 2, 3, 5, 7]\n```\n:::\n:::\n\n\n`sort` options\n\n::: {.cell execution_count=38}\n``` {.python .cell-code}\nb = [\"saw\", \"small\", \"He\", \"foxes\", \"six\"]\n\nb.sort(key = len)\n\nb\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```\n['He', 'saw', 'six', 'small', 'foxes']\n```\n:::\n:::\n\n\n### Slicing\n\nSlicing semantics takes a bit of getting used to, especially if you’re coming from R or MATLAB.\n\nusing the indexing operator `[]`\n\n::: {.cell execution_count=39}\n``` {.python .cell-code}\nseq = [7, 2, 3, 7, 5, 6, 0, 1]\n\nseq[3:5]\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```\n[7, 5]\n```\n:::\n:::\n\n\nalso assigned with a sequence\n\n::: {.cell execution_count=40}\n``` {.python .cell-code}\nseq[3:5] = [6,3]\n\nseq\n```\n\n::: {.cell-output .cell-output-display execution_count=40}\n```\n[7, 2, 3, 6, 3, 6, 0, 1]\n```\n:::\n:::\n\n\nEither the `start` or `stop` can be omitted\n\n::: {.cell execution_count=41}\n``` {.python .cell-code}\nseq[:5]\n```\n\n::: {.cell-output .cell-output-display execution_count=41}\n```\n[7, 2, 3, 6, 3]\n```\n:::\n:::\n\n\n::: {.cell execution_count=42}\n``` {.python .cell-code}\nseq[3:]\n```\n\n::: {.cell-output .cell-output-display execution_count=42}\n```\n[6, 3, 6, 0, 1]\n```\n:::\n:::\n\n\nNegative indices slice the sequence relative to the end:\n\n::: {.cell execution_count=43}\n``` {.python .cell-code}\nseq[-4:]\n```\n\n::: {.cell-output .cell-output-display execution_count=43}\n```\n[3, 6, 0, 1]\n```\n:::\n:::\n\n\nA step can also be used after a second colon to, say, take every other element:\n\n::: {.cell execution_count=44}\n``` {.python .cell-code}\nseq[::2]\n```\n\n::: {.cell-output .cell-output-display execution_count=44}\n```\n[7, 3, 3, 0]\n```\n:::\n:::\n\n\nA clever use of this is to pass -1, which has the useful effect of reversing a list or tuple:\n\n::: {.cell 
execution_count=45}\n``` {.python .cell-code}\nseq[::-1]\n```\n\n::: {.cell-output .cell-output-display execution_count=45}\n```\n[1, 0, 6, 3, 6, 3, 2, 7]\n```\n:::\n:::\n\n\n## Dictionary\n\n![](https://pynative.com/wp-content/uploads/2021/02/dictionaries-in-python.jpg)\n\nThe dictionary or dict may be the most important built-in Python data structure. \n\nOne approach for creating a dictionary is to use curly braces {} and colons to separate keys and values:\n\n::: {.cell execution_count=46}\n``` {.python .cell-code}\nempty_dict = {}\n\nd1 = {\"a\": \"some value\", \"b\": [1, 2, 3, 4]}\n\nd1\n```\n\n::: {.cell-output .cell-output-display execution_count=46}\n```\n{'a': 'some value', 'b': [1, 2, 3, 4]}\n```\n:::\n:::\n\n\naccess, insert, or set elements \n\n::: {.cell execution_count=47}\n``` {.python .cell-code}\nd1[7] = \"an integer\"\n\nd1\n```\n\n::: {.cell-output .cell-output-display execution_count=47}\n```\n{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}\n```\n:::\n:::\n\n\nand as before\n\n::: {.cell execution_count=48}\n``` {.python .cell-code}\n\"b\" in d1\n```\n\n::: {.cell-output .cell-output-display execution_count=48}\n```\nTrue\n```\n:::\n:::\n\n\nthe `del` and `pop` methods\n\n::: {.cell execution_count=49}\n``` {.python .cell-code}\ndel d1[7]\n\nd1\n```\n\n::: {.cell-output .cell-output-display execution_count=49}\n```\n{'a': 'some value', 'b': [1, 2, 3, 4]}\n```\n:::\n:::\n\n\n::: {.cell execution_count=50}\n``` {.python .cell-code}\nret = d1.pop(\"a\")\n\nret\n```\n\n::: {.cell-output .cell-output-display execution_count=50}\n```\n'some value'\n```\n:::\n:::\n\n\nThe `keys` and `values` methods\n\n::: {.cell execution_count=51}\n``` {.python .cell-code}\nlist(d1.keys())\n```\n\n::: {.cell-output .cell-output-display execution_count=51}\n```\n['b']\n```\n:::\n:::\n\n\n::: {.cell execution_count=52}\n``` {.python .cell-code}\nlist(d1.values())\n```\n\n::: {.cell-output .cell-output-display execution_count=52}\n```\n[[1, 2, 3, 
4]]\n```\n:::\n:::\n\n\nthe `items` method\n\n::: {.cell execution_count=53}\n``` {.python .cell-code}\nlist(d1.items())\n```\n\n::: {.cell-output .cell-output-display execution_count=53}\n```\n[('b', [1, 2, 3, 4])]\n```\n:::\n:::\n\n\n the update method to merge one dictionary into another\n\n::: {.cell execution_count=54}\n``` {.python .cell-code}\nd1.update({\"b\": \"foo\", \"c\": 12})\n\nd1\n```\n\n::: {.cell-output .cell-output-display execution_count=54}\n```\n{'b': 'foo', 'c': 12}\n```\n:::\n:::\n\n\n ### Creating dictionaries from sequences\n\n::: {.cell execution_count=55}\n``` {.python .cell-code}\nlist(range(5))\n```\n\n::: {.cell-output .cell-output-display execution_count=55}\n```\n[0, 1, 2, 3, 4]\n```\n:::\n:::\n\n\n::: {.cell execution_count=56}\n``` {.python .cell-code}\ntuples = zip(range(5), reversed(range(5)))\n\ntuples\n\nmapping = dict(tuples)\n\nmapping\n```\n\n::: {.cell-output .cell-output-display execution_count=56}\n```\n{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}\n```\n:::\n:::\n\n\n### Default values\n\nimagine categorizing a list of words by their first letters as a dictionary of lists\n\n::: {.cell execution_count=57}\n``` {.python .cell-code}\nwords = [\"apple\", \"bat\", \"bar\", \"atom\", \"book\"]\n\nby_letter = {}\n\nfor word in words:\n letter = word[0]\n if letter not in by_letter:\n by_letter[letter] = [word]\n else:\n by_letter[letter].append(word)\n\nby_letter\n```\n\n::: {.cell-output .cell-output-display execution_count=57}\n```\n{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}\n```\n:::\n:::\n\n\nThe `setdefault` dictionary method can be used to simplify this workflow. 
The preceding for loop can be rewritten as:\n\n::: {.cell execution_count=58}\n``` {.python .cell-code}\nby_letter = {}\n\nfor word in words:\n letter = word[0]\n by_letter.setdefault(letter, []).append(word)\n\nby_letter\n```\n\n::: {.cell-output .cell-output-display execution_count=58}\n```\n{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}\n```\n:::\n:::\n\n\nThe built-in `collections`module has a useful class, `defaultdict`, which makes this even easier.\n\n::: {.cell execution_count=59}\n``` {.python .cell-code}\nfrom collections import defaultdict\n\nby_letter = defaultdict(list)\n\nfor word in words:\n by_letter[word[0]].append(word)\n\nby_letter\n```\n\n::: {.cell-output .cell-output-display execution_count=59}\n```\ndefaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})\n```\n:::\n:::\n\n\n### Valid dictionary key types\n\nkeys generally have to be immutable objects like scalars or tuples for *hashability*\n\nTo use a list as a key, one option is to convert it to a tuple, which can be hashed as long as its elements also can be:\n\n::: {.cell execution_count=60}\n``` {.python .cell-code}\nd = {}\n\nd[tuple([1,2,3])] = 5\n\nd\n```\n\n::: {.cell-output .cell-output-display execution_count=60}\n```\n{(1, 2, 3): 5}\n```\n:::\n:::\n\n\n## Set\n\n![](https://pynative.com/wp-content/uploads/2021/03/python-sets.jpg)\n\ncan be created in two ways: via the `set` function or via a `set literal` with curly braces:\n\n::: {.cell execution_count=61}\n``` {.python .cell-code}\nset([2, 2, 2, 1, 3, 3])\n\n{2,2,1,3,3}\n```\n\n::: {.cell-output .cell-output-display execution_count=61}\n```\n{1, 2, 3}\n```\n:::\n:::\n\n\nSets support mathematical set operations like union, intersection, difference, and symmetric difference.\n\nThe `union` of these two sets:\n\n::: {.cell execution_count=62}\n``` {.python .cell-code}\na = {1, 2, 3, 4, 5}\n\nb = {3, 4, 5, 6, 7, 8}\n\na.union(b)\n\na | b\n```\n\n::: {.cell-output .cell-output-display 
execution_count=62}\n```\n{1, 2, 3, 4, 5, 6, 7, 8}\n```\n:::\n:::\n\n\nThe `&`operator or the `intersection` method\n\n::: {.cell execution_count=63}\n``` {.python .cell-code}\na.intersection(b)\n\na & b\n```\n\n::: {.cell-output .cell-output-display execution_count=63}\n```\n{3, 4, 5}\n```\n:::\n:::\n\n\n[A table of commonly used `set` methods](https://wesmckinney.com/book/python-builtin.html#tbl-table_set_operations)\n\nAll of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result. For very large sets, this may be more efficient\n\n::: {.cell execution_count=64}\n``` {.python .cell-code}\nc = a.copy()\n\nc |= b\n\nc\n```\n\n::: {.cell-output .cell-output-display execution_count=64}\n```\n{1, 2, 3, 4, 5, 6, 7, 8}\n```\n:::\n:::\n\n\n::: {.cell execution_count=65}\n``` {.python .cell-code}\nd = a.copy()\n\nd &= b\n\nd\n```\n\n::: {.cell-output .cell-output-display execution_count=65}\n```\n{3, 4, 5}\n```\n:::\n:::\n\n\nset elements generally must be immutable, and they must be hashable\n\nyou can convert them to tuples\n\nYou can also check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set\n\n::: {.cell execution_count=66}\n``` {.python .cell-code}\na_set = {1, 2, 3, 4, 5}\n\n{1, 2, 3}.issubset(a_set)\n```\n\n::: {.cell-output .cell-output-display execution_count=66}\n```\nTrue\n```\n:::\n:::\n\n\n::: {.cell execution_count=67}\n``` {.python .cell-code}\na_set.issuperset({1, 2, 3})\n```\n\n::: {.cell-output .cell-output-display execution_count=67}\n```\nTrue\n```\n:::\n:::\n\n\n## Built-In Sequence Functions\n\n### enumerate\n\n`enumerate` returns a sequence of (i, value) tuples\n\n### sorted\n\n`sorted` returns a new sorted list \n\n::: {.cell execution_count=68}\n``` {.python .cell-code}\nsorted([7,1,2,9,3,6,5,0,22])\n```\n\n::: {.cell-output .cell-output-display execution_count=68}\n```\n[0, 1, 2, 3, 5, 6, 7, 9, 
22]\n```\n:::\n:::\n\n\n### zip\n\n`zip` “pairs” up the elements of a number of lists, tuples, or other sequences to create a list of tuples\n\n::: {.cell execution_count=69}\n``` {.python .cell-code}\nseq1 = [\"foo\", \"bar\", \"baz\"]\n\nseq2 = [\"one\", \"two\", \"three\"]\n\nzipped = zip(seq1, seq2)\n\nlist(zipped)\n```\n\n::: {.cell-output .cell-output-display execution_count=69}\n```\n[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]\n```\n:::\n:::\n\n\n`zip` can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence\n\n::: {.cell execution_count=70}\n``` {.python .cell-code}\nseq3 = [False, True]\n\nlist(zip(seq1, seq2, seq3))\n```\n\n::: {.cell-output .cell-output-display execution_count=70}\n```\n[('foo', 'one', False), ('bar', 'two', True)]\n```\n:::\n:::\n\n\nA common use of `zip` is simultaneously iterating over multiple sequences, possibly also combined with `enumerate`\n\n::: {.cell execution_count=71}\n``` {.python .cell-code}\nfor index, (a, b) in enumerate(zip(seq1, seq2)):\n print(f\"{index}: {a}, {b}\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0: foo, one\n1: bar, two\n2: baz, three\n```\n:::\n:::\n\n\n`reversed` iterates over the elements of a sequence in reverse order\n\n::: {.cell execution_count=72}\n``` {.python .cell-code}\nlist(reversed(range(10)))\n```\n\n::: {.cell-output .cell-output-display execution_count=72}\n```\n[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]\n```\n:::\n:::\n\n\n## List, Set, and Dictionary Comprehensions\n\n```\n[expr for value in collection if condition]\n```\n\nFor example, given a list of strings, we could filter out strings with length 2 or less and convert them to uppercase like this\n\n::: {.cell execution_count=73}\n``` {.python .cell-code}\nstrings = [\"a\", \"as\", \"bat\", \"car\", \"dove\", \"python\"]\n\n[x.upper() for x in strings if len(x) > 2]\n```\n\n::: {.cell-output .cell-output-display execution_count=73}\n```\n['BAT', 'CAR', 'DOVE', 
'PYTHON']\n```\n:::\n:::\n\n\nA dictionary comprehension looks like this\n\n```\ndict_comp = {key-expr: value-expr for value in collection\n if condition}\n```\n\nSuppose we wanted a set containing just the lengths of the strings contained in the collection\n\n::: {.cell execution_count=74}\n``` {.python .cell-code}\nunique_lengths = {len(x) for x in strings}\n\nunique_lengths\n```\n\n::: {.cell-output .cell-output-display execution_count=74}\n```\n{1, 2, 3, 4, 6}\n```\n:::\n:::\n\n\nwe could create a lookup map of these strings for their locations in the list\n\n::: {.cell execution_count=75}\n``` {.python .cell-code}\nloc_mapping = {value: index for index, value in enumerate(strings)}\n\nloc_mapping\n```\n\n::: {.cell-output .cell-output-display execution_count=75}\n```\n{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}\n```\n:::\n:::\n\n\n## Nested list comprehensions\n\nSuppose we have a list of lists containing some English and Spanish names. We want to get a single list containing all names with two or more a’s in them\n\n::: {.cell execution_count=76}\n``` {.python .cell-code}\nall_data = [[\"John\", \"Emily\", \"Michael\", \"Mary\", \"Steven\"],\n [\"Maria\", \"Juan\", \"Javier\", \"Natalia\", \"Pilar\"]]\n\nresult = [name for names in all_data for name in names\n if name.count(\"a\") >= 2]\n\nresult\n```\n\n::: {.cell-output .cell-output-display execution_count=76}\n```\n['Maria', 'Natalia']\n```\n:::\n:::\n\n\nHere is another example where we “flatten” a list of tuples of integers into a simple list of integers\n\n::: {.cell execution_count=77}\n``` {.python .cell-code}\nsome_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]\n\nflattened = [x for tup in some_tuples for x in tup]\n\nflattened\n```\n\n::: {.cell-output .cell-output-display execution_count=77}\n```\n[1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n:::\n:::\n\n\n# Functions\n\n![](https://miro.medium.com/max/1200/1*ZegxhR33NdeVRpBPYXnYYQ.gif)\n\n`Functions` are the primary and most important method of 
code organization and reuse in Python.\n\nThey are defined with the `def` keyword.\n\nEach function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. Here we define a function with an optional `z` argument that has the default value 1.5\n\n::: {.cell execution_count=78}\n``` {.python .cell-code}\ndef my_function(x, y, z=1.5):\n    return (x + y) * z\n\nmy_function(4, 25)\n```\n\n::: {.cell-output .cell-output-display execution_count=78}\n```\n43.5\n```\n:::\n:::\n\n\nThe main restriction on function arguments is that keyword arguments must follow the positional arguments.\n\n## Namespaces, Scope, and Local Functions\n\nA more descriptive name for a variable scope in Python is a *namespace*.\n\nConsider the following function\n\n::: {.cell execution_count=79}\n``` {.python .cell-code}\na = []\n\ndef func():\n    for i in range(5):\n        a.append(i)\n```\n:::\n\n\nBecause `a` is defined in the global namespace, each call to `func()` appends five more elements to the same list; `a` is not destroyed when the function exits. 
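A list created *inside* the function, by contrast, lives in the function's local namespace and disappears when the call returns. A minimal sketch (the `func_local` name is just for illustration):

```python
def func_local():
    # `a` here is local: it is created fresh on each call and
    # garbage-collected once the function returns
    a = []
    for i in range(5):
        a.append(i)
    return a

print(func_local())  # [0, 1, 2, 3, 4]
print(func_local())  # [0, 1, 2, 3, 4] -- no state carries over between calls
```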
\n\n::: {.cell execution_count=80}\n``` {.python .cell-code}\nfunc()\n\nfunc()\n\na\n```\n\n::: {.cell-output .cell-output-display execution_count=80}\n```\n[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]\n```\n:::\n:::\n\n\n## Returning Multiple Values\n\nWhat’s happening here is that the function is actually just returning one object, a tuple, which is then being unpacked into the result variables.\n\n::: {.cell execution_count=81}\n``` {.python .cell-code}\ndef f():\n    a = 5\n    b = 6\n    c = 7\n    return a, b, c\n\na, b, c = f()\n\na\n```\n\n::: {.cell-output .cell-output-display execution_count=81}\n```\n5\n```\n:::\n:::\n\n\n## Functions are Objects\n\nSuppose we were doing some data cleaning and needed to apply a bunch of transformations to the following list of strings:\n\n::: {.cell execution_count=82}\n``` {.python .cell-code}\nstates = [\" Alabama \", \"Georgia!\", \"Georgia\", \"georgia\", \"FlOrIda\",\n          \"south carolina##\", \"West virginia?\"]\n\nimport re\n\ndef clean_strings(strings):\n    result = []\n    for value in strings:\n        value = value.strip()\n        value = re.sub(\"[!#?]\", \"\", value)\n        value = value.title()\n        result.append(value)\n    return result\n\nclean_strings(states)\n```\n\n::: {.cell-output .cell-output-display execution_count=82}\n```\n['Alabama',\n 'Georgia',\n 'Georgia',\n 'Georgia',\n 'Florida',\n 'South Carolina',\n 'West Virginia']\n```\n:::\n:::\n\n\nAnother approach:\n\n::: {.cell execution_count=83}\n``` {.python .cell-code}\ndef remove_punctuation(value):\n    return re.sub(\"[!#?]\", \"\", value)\n\nclean_ops = [str.strip, remove_punctuation, str.title]\n\ndef clean_strings(strings, ops):\n    result = []\n    for value in strings:\n        for func in ops:\n            value = func(value)\n        result.append(value)\n    return result\n\nclean_strings(states, clean_ops)\n```\n\n::: {.cell-output .cell-output-display execution_count=83}\n```\n['Alabama',\n 'Georgia',\n 'Georgia',\n 'Georgia',\n 'Florida',\n 'South Carolina',\n 'West Virginia']\n```\n:::\n:::\n\n\nYou can use functions as 
arguments to other functions like the built-in `map` function\n\n::: {.cell execution_count=84}\n``` {.python .cell-code}\nfor x in map(remove_punctuation, states):\n print(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Alabama \nGeorgia\nGeorgia\ngeorgia\nFlOrIda\nsouth carolina\nWest virginia\n```\n:::\n:::\n\n\n## Anonymous Lambda Functions\n\n a way of writing functions consisting of a single statement\n\nsuppose you wanted to sort a collection of strings by the number of distinct letters in each string\n\n::: {.cell execution_count=85}\n``` {.python .cell-code}\nstrings = [\"foo\", \"card\", \"bar\", \"aaaaaaa\", \"ababdo\"]\n\nstrings.sort(key=lambda x: len(set(x)))\n\nstrings\n```\n\n::: {.cell-output .cell-output-display execution_count=85}\n```\n['aaaaaaa', 'foo', 'bar', 'card', 'ababdo']\n```\n:::\n:::\n\n\n# Generators\n\nMany objects in Python support iteration, such as over objects in a list or lines in a file. \n\n::: {.cell execution_count=86}\n``` {.python .cell-code}\nsome_dict = {\"a\": 1, \"b\": 2, \"c\": 3}\n\nfor key in some_dict:\n print(key)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\na\nb\nc\n```\n:::\n:::\n\n\nMost methods expecting a list or list-like object will also accept any iterable object. This includes built-in methods such as `min`, `max`, and `sum`, and type constructors like `list` and `tuple`\n\nA `generator` is a convenient way, similar to writing a normal function, to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. 
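This pause-and-resume behavior is driven by the iterator protocol: `iter` produces an iterator, and each `next` call advances it one step from where it left off. A small sketch using the dictionary from above:

```python
some_dict = {"a": 1, "b": 2, "c": 3}

# iter() returns an iterator over the dict's keys
it = iter(some_dict)
print(next(it))  # 'a'
print(next(it))  # 'b'

# the iterator resumes where it left off, so only 'c' remains
print(list(it))  # ['c']
```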
To create a generator, use the `yield` keyword instead of `return` in a function\n\n::: {.cell execution_count=87}\n``` {.python .cell-code}\ndef squares(n=10):\n    print(f\"Generating squares from 1 to {n ** 2}\")\n    for i in range(1, n + 1):\n        yield i ** 2\n\ngen = squares()\n\nfor x in gen:\n    print(x, end=\" \")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nGenerating squares from 1 to 100\n1 4 9 16 25 36 49 64 81 100 \n```\n:::\n:::\n\n\n> Since generators produce output one element at a time versus an entire list all at once, they can help your program use less memory.\n\n## Generator expressions\n\nThese are a generator analogue to list, dictionary, and set comprehensions. To create one, enclose what would otherwise be a list comprehension within parentheses instead of brackets:\n\n::: {.cell execution_count=88}\n``` {.python .cell-code}\ngen = (x ** 2 for x in range(100))\n\ngen\n```\n\n::: {.cell-output .cell-output-display execution_count=88}\n```\n<generator object <genexpr> at 0x7fa541620c80>\n```\n:::\n:::\n\n\nGenerator expressions can be used instead of list comprehensions as function arguments in some cases:\n\n::: {.cell execution_count=89}\n``` {.python .cell-code}\nsum(x ** 2 for x in range(100))\n```\n\n::: {.cell-output .cell-output-display execution_count=89}\n```\n328350\n```\n:::\n:::\n\n\n::: {.cell execution_count=90}\n``` {.python .cell-code}\ndict((i, i ** 2) for i in range(5))\n```\n\n::: {.cell-output .cell-output-display execution_count=90}\n```\n{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}\n```\n:::\n:::\n\n\n## itertools module\n\nThe `itertools` module has a collection of generators for many common data algorithms.\n\n`groupby` takes any sequence and a function, grouping consecutive elements in the sequence by the return value of the function\n\n::: {.cell execution_count=91}\n``` {.python .cell-code}\nimport itertools\n\ndef first_letter(x):\n    return x[0]\n\nnames = [\"Alan\", \"Adam\", \"Jackie\", \"Lily\", \"Katie\", \"Molly\"]\n\nfor letter, names in itertools.groupby(names, 
first_letter):\n print(letter, list(names))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nA ['Alan', 'Adam']\nJ ['Jackie']\nL ['Lily']\nK ['Katie']\nM ['Molly']\n```\n:::\n:::\n\n\n[Table of other itertools functions](https://wesmckinney.com/book/python-builtin.html#tbl-table_itertools)\n\n# Errors and Exception Handling\n\nHandling errors or exceptions gracefully is an important part of building robust programs\n\n::: {.cell execution_count=92}\n``` {.python .cell-code}\ndef attempt_float(x):\n try:\n return float(x)\n except:\n return x\n\nattempt_float(\"1.2345\")\n```\n\n::: {.cell-output .cell-output-display execution_count=92}\n```\n1.2345\n```\n:::\n:::\n\n\n::: {.cell execution_count=93}\n``` {.python .cell-code}\nattempt_float(\"something\")\n```\n\n::: {.cell-output .cell-output-display execution_count=93}\n```\n'something'\n```\n:::\n:::\n\n\nYou might want to suppress only ValueError, since a TypeError (the input was not a string or numeric value) might indicate a legitimate bug in your program. 
To do that, write the exception type after `except`:\n\n::: {.cell execution_count=94}\n``` {.python .cell-code}\ndef attempt_float(x):\n    try:\n        return float(x)\n    except ValueError:\n        return x\n```\n:::\n\n\n::: {.cell execution_count=95}\n``` {.python .cell-code}\nattempt_float((1, 2))\n```\n:::\n\n\n```\n---------------------------------------------------------------------------\nTypeError                                 Traceback (most recent call last)\nd:\\packages\\bookclub-py4da\\03_notes.qmd in <module>()\n----> 1 attempt_float((1, 2))\n\nInput In [114], in attempt_float(x)\n      1 def attempt_float(x):\n      2     try:\n----> 3         return float(x)\n      4     except ValueError:\n      5         return x\n\nTypeError: float() argument must be a string or a real number, not 'tuple'\n\n```\n\nYou can catch multiple exception types by writing a tuple of exception types instead (the parentheses are required):\n\n::: {.cell execution_count=96}\n``` {.python .cell-code}\ndef attempt_float(x):\n    try:\n        return float(x)\n    except (TypeError, ValueError):\n        return x\n\nattempt_float((1, 2))\n```\n\n::: {.cell-output .cell-output-display execution_count=96}\n```\n(1, 2)\n```\n:::\n:::\n\n\nIn some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether or not the code in the try block succeeds. To do this, use `finally`:\n\n::: {.cell execution_count=97}\n``` {.python .cell-code}\nf = open(path, mode=\"w\")\n\ntry:\n    write_to_file(f)\nfinally:\n    f.close()\n```\n:::\n\n\nHere, the file object `f` will always get closed. \n\nYou can have code that executes only if the `try:` block succeeds by using `else:`\n\n::: {.cell execution_count=98}\n``` {.python .cell-code}\nf = open(path, mode=\"w\")\n\ntry:\n    write_to_file(f)\nexcept:\n    print(\"Failed\")\nelse:\n    print(\"Succeeded\")\nfinally:\n    f.close()\n```\n:::\n\n\n## Exceptions in IPython\n\nIf an exception is raised while you are %run-ing a script or executing any statement, IPython will by default print a full call stack trace. 
Having additional context by itself is a big advantage over the standard Python interpreter\n\n# Files and the Operating System\n\nTo open a file for reading or writing, use the built-in open function with either a relative or absolute file path and an optional file encoding.\n\nWe can then treat the file object f like a list and iterate over the lines\n\n::: {.cell execution_count=99}\n``` {.python .cell-code}\npath = \"examples/segismundo.txt\"\n\nf = open(path, encoding=\"utf-8\")\n\nlines = [x.rstrip() for x in open(path, encoding=\"utf-8\")]\n\nlines\n```\n:::\n\n\nWhen you use open to create file objects, it is recommended to close the file\n\n::: {.cell execution_count=100}\n``` {.python .cell-code}\nf.close()\n```\n:::\n\n\nsome of the most commonly used methods are `read`, `seek`, and `tell`.\n\n`read(10)` returns 10 characters from the file\n\nthe `read` method advances the file object position by the number of bytes read\n\n`tell()` gives you the current position in the file\n\nTo get consistent behavior across platforms, it is best to pass an encoding (such as `encoding=\"utf-8\"`)\n\n`seek(3)` changes the file position to the indicated byte \n\nTo write text to a file, you can use the file’s `write` or `writelines` methods\n\n## Byte and Unicode with Files\n\nThe default behavior for Python files (whether readable or writable) is text mode, which means that you intend to work with Python strings (i.e., Unicode). \n\n", - "supporting": [ - "03_notes_files" + "markdown": "# Notes {-}\n\n## Questions to Answer\n\nRecall the `Advertising` data from **Chapter 2**. Here are a few important questions that we might seek to address:\n\n1. **Is there a relationship between advertising budget and sales?**\n2. **How strong is the relationship between advertising budget and sales?** Does knowledge of the advertising budget provide a lot of information about product sales?\n3. **Which media are associated with sales?**\n4. 
**How large is the association between each medium and sales?** For every dollar spent on advertising in a particular medium, by what amount will sales increase? \n5. **How accurately can we predict future sales?**\n6. **Is the relationship linear?** If there is approximately a straight-line relationship between advertising expenditure in the various media and sales, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.\n7. **Is there synergy among the advertising media?** Or, in stats terms, is there an interaction effect?\n\n## Simple Linear Regression: Definition\n\n**Simple linear regression:** Very straightforward approach to predicting response $Y$ on predictor $X$.\n\n\n$$Y \\approx \\beta_{0} + \\beta_{1}X$$\n\n\n- Read \"$\\approx$\" as *\"is approximately modeled by.\"*\n- $\\beta_{0}$ = intercept\n- $\\beta_{1}$ = slope\n\n\n$$\\hat{y} = \\hat{\\beta}_{0} + \\hat{\\beta}_{1}x$$\n\n\n- $\\hat{\\beta}_{0}$ = our approximation of intercept\n- $\\hat{\\beta}_{1}$ = our approximation of slope\n- $x$ = sample of $X$\n- $\\hat{y}$ = our prediction of $Y$ from $x$\n- hat symbol denotes \"estimated value\" \n\n- Linear regression is a simple approach to supervised learning\n\n## Simple Linear Regression: Visualization\n\n\n::: {.cell}\n::: {.cell-output-display}\n![For the `Advertising` data, the least squares fit for the regression of `sales` onto `TV` is shown. The fit is found by minimizing the residual sum of squares. Each grey line segment represents a residual. 
In this case a linear fit captures the essence of the relationship, although it overestimates the trend in the left of the plot.](images/fig3_1.jpg){width=100%}\n:::\n:::\n\n\n## Simple Linear Regression: Math\n\n- **RSS** = *residual sum of squares*\n\n\n$$\mathrm{RSS} = e^{2}_{1} + e^{2}_{2} + \ldots + e^{2}_{n}$$\n\n$$\mathrm{RSS} = (y_{1} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{1})^{2} + (y_{2} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{2})^{2} + \ldots + (y_{n} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{n})^{2}$$\n\n- The least squares estimates minimize the RSS:\n\n$$\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}$$\n\n$$\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}$$\n\n\n- $\bar{x}$, $\bar{y}$ = sample means of $x$ and $y$\n\n### Visualization of Fit\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Contour and three-dimensional plots of the RSS on the `Advertising` data, using `sales` as the response and `TV` as the predictor. 
The red dots correspond to the least squares estimates $\\\\hat\\\\beta_0$ and $\\\\hat\\\\beta_1$, given by (3.4).](images/fig3_2.jpg){width=100%}\n:::\n:::\n\n\n**Learning Objectives:**\n\n- Perform linear regression with a **single predictor variable.**\n\n## Assessing Accuracy of Coefficient Estimates\n\n\n$$Y = \\beta_{0} + \\beta_{1}X + \\epsilon$$\n\n\n- **RSE** = *residual standard error*\n- Estimate of $\\sigma$\n\n\n$$\\mathrm{RSE} = \\sqrt{\\frac{\\mathrm{RSS}}{n - 2}}$$\n\n$$\\mathrm{SE}(\\hat\\beta_0)^2 = \\sigma^2 \\left[\\frac{1}{n} + \\frac{\\bar{x}^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}\\right],\\ \\ \\mathrm{SE}(\\hat\\beta_1)^2 = \\frac{\\sigma^2}{\\sum_{i=1}^n (x_i - \\bar{x})^2}$$\n\n\n- **95% confidence interval:** a range of values such that with 95% probability, the range will contain the true unknown value of the parameter\n - If we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter\n\n\n$$\\hat\\beta_1 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_1)$$\n\n$$\\hat\\beta_0 \\pm 2\\ \\cdot \\ \\mathrm{SE}(\\hat\\beta_0)$$\n\n\n**Learning Objectives:**\n\n- Estimate the **standard error** of regression coefficients.\n\n## Assessing the Accuracy of the Model\n\n- **RSE** can be considered a measure of the *lack of fit* of the model. 
\n\n- *$R^2$* statistic (also called the coefficient of determination) provides an alternative in the form of a *proportion of the variance explained*; it ranges from 0 to 1, and a *good value* depends on the application.\n\n\n$$R^2 = 1 - \frac{RSS}{TSS}$$\n\n\nwhere TSS is the *total sum of squares*:\n\n$$TSS = \Sigma (y_i - \bar{y})^2$$\n\n\nQuiz: Can *$R^2$* be negative?\n\n[Answer](https://www.graphpad.com/support/faq/how-can-rsup2sup-be-negative/)\n\n## Multiple Linear Regression\n\n**Multiple linear regression** extends simple linear regression to *p* predictors:\n\n\n$$Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \ldots + \beta_{p}X_p + \epsilon$$\n\n\n- $\beta_{j}$ is the *average* effect on $Y$ from $X_{j}$, holding all other predictors fixed. \n\n- The fit once again chooses the $\beta_{j}$ that minimize the RSS.\n\n- The example in the book shows that although fitting *sales* against *newspaper* alone indicates a significant slope (0.055 ± 0.017), when you include *radio* in a multiple regression, *newspaper* no longer has any significant effect (-0.001 ± 0.006). \n\n### Important Questions\n\n1. *Is at least one of the predictors $X_1$, $X_2$, ... , $X_p$ useful in predicting\nthe response?*\n\n The F-statistic is close to 1 when there is no relationship, and greater than 1 otherwise.\n\n\n$$F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$\n\n\n2. *Do all the predictors help to explain $Y$, or is only a subset of the\npredictors useful?*\n\n p-values can help identify important predictors, but it is possible to be misled by them, especially with a large number of predictors. Variable selection methods include forward selection, backward selection, and mixed selection; the topic is continued in Chapter 6.\n\n3. *How well does the model fit the data?*\n\n **$R^2$** still gives the *proportion of the variance explained*, so look for values \"close\" to 1. Can also look at the **RSE**, which is generalized for multiple regression as:\n \n\n$$RSE = \sqrt{\frac{1}{n-p-1}RSS}$$\n\n\n4. 
*Given a set of predictor values, what response value should we predict,\nand how accurate is our prediction?* \n\n Three sources of uncertainty in predictions:\n \n * Uncertainty in the estimates of $\beta_i$\n * Model bias\n * Irreducible error $\epsilon$\n\n## Qualitative Predictors\n\n* Dummy variables: if there are $k$ levels, introduce $k-1$ dummy variables which are equal to one (\"one hot\") when the underlying qualitative predictor takes that value. For example, if there are 3 levels, introduce two new dummy variables and fit the model:\n\n\n$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$\n\n\n| Qualitative Predictor | $x_{i1}$ | $x_{i2}$ |\n| --------------------- |:--------:|:--------:|\n| level 0 (baseline) | 0 | 0 |\n| level 1 | 1 | 0 |\n| level 2 | 0 | 1 |\n\n* Coefficients are interpreted as the average effect relative to the baseline.\n\n* An alternative is to use index variables, with a different intercept for each level:\n\n\n$$y_i = \beta_{0k} + \epsilon_i,$$\n\nwhere $k$ is the level of observation $i$.\n\n## Extensions\n\n- Interaction / Synergy effects\n \n Include a product term to account for synergy, where a change in one variable changes the association of $Y$ with another:\n \n\n$$Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 + \beta_{3}X_1 X_2 + \epsilon$$\n\n\n- Non-linear relationships (e.g. polynomial fits)\n\n\n$$Y = \beta_{0} + \beta_{1}X + \beta_{2}X^2 + \ldots + \beta_{n}X^n + \epsilon$$\n\n\n## Potential Problems\n\n1. *Non-linear relationships* \n\n Residual plots are a useful tool to see if any remaining trends exist. If so, consider fitting a transformation of the data. \n \n2. *Correlation of Error Terms*\n\n Linear regression assumes that the error terms $\epsilon_i$ are uncorrelated. Residuals may indicate that this is not correct (obvious *tracking* in the data). One could also look at the autocorrelation of the residuals. What to do about it?\n \n3. 
*Non-constant variance of error terms*\n\n Again this can be revealed by examining the residuals. Consider transforming the response (e.g. $\log Y$ or $\sqrt{Y}$) to remove non-constant variance. The figure below shows residuals demonstrating non-constant variance, and shows this being mitigated to a great extent by log transforming the data.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.11](images/fig3_11.png){width=100%}\n:::\n:::\n\n\n4. *Outliers*\n\n - Outliers are points for which $y_i$ is far from the value predicted by the model (including irreducible error). See the point labeled '20' in Figure 3.13.\n - Detect outliers by plotting studentized residuals (residual $e_i$ divided by its estimated standard error) and look for residuals larger than 3 in absolute value.\n - An outlier may not affect the fit much but can have a dramatic effect on the **RSE**. \n - Often outliers are mistakes in data collection and can be removed, but they could also be an indicator of a deficient model. \n\n5. *High Leverage Points* \n\n - These are points with unusual values of $x_i$. An example is the point labeled '41' in Figure 3.13.\n - These points can have a large impact on the fit; in the example, including point 41 pulls the slope up significantly.\n - Use the *leverage statistic* to identify high leverage points, which can be hard to spot in multiple regression.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Figure 3.13](images/fig3_13.png){width=100%}\n:::\n:::\n\n\n6. *Collinearity*\n\n - Two or more predictor variables are closely related to one another.\n - Simple collinearity can be identified by looking at correlations between predictors. \n - Causes the standard errors to grow (and p-values to grow).\n - Often can be dealt with by removing one of the highly correlated predictors or combining them. \n - *Multicollinearity* (involving 3 or more predictors) is not so easy to identify. 
Use the *variance inflation factor* (VIF): the ratio of the variance of $\hat{\beta_j}$ when fitting the full model to the variance of $\hat{\beta_j}$ when fit on its own. It can be computed using the formula:\n \n\n$$VIF(\hat{\beta_j}) = \frac{1}{1-R^2_{X_j|X_{-j}}}$$\n\n\nwhere $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all the other predictors.\n\n## Answers to the Marketing Plan questions\n\n1. **Is there a relationship between advertising budget and sales?**\n\n Tool: Multiple regression; look at the F-statistic.\n\n2. **How strong is the relationship between advertising budget and sales?** \n\n Tool: **$R^2$** and **RSE**\n \n3. **Which media are associated with sales?**\n \n Tool: p-values for each predictor's *t-statistic*. Explored further in Chapter 6.\n\n4. **How large is the association between each medium and sales?**\n\n Tool: Confidence intervals on $\hat{\beta_j}$\n\n5. **How accurately can we predict future sales?**\n\n Tool: Prediction intervals for an individual response, confidence intervals for the average response.\n \n \n6. **Is the relationship linear?** \n\n Tool: Residual plots\n \n7. 
**Is there synergy among the advertising media?** \n\n Tool: Interaction terms and associated p-values.\n\n## Comparison of Linear Regression with K-Nearest Neighbors\n\n- This section examines the K-nearest neighbors (KNN) method (a non-parametric method).\n- This is essentially a k-point moving average.\n- This serves to illustrate the bias-variance trade-off nicely.\n\n",