Commit 0bb0893

[simple_linear_regression] Review pandas code, update spelling (american), and misc edits (#378)
* [simple_linear_regression] Review lecture pandas code, spelling with update to american spelling
* update data location
* update all fl to data_url
* TST: add label but no caption
* update numbered and captioned figures
* ensure only one figure is returned
* remove tip
1 parent a8316e4 commit 0bb0893

1 file changed: +80 −39 lines changed

Diff for: lectures/simple_linear_regression.md

@@ -57,12 +57,18 @@ df
 We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars (\$\'s)) and $x_i$ (degrees Celsius).
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot"
+    name: sales-v-temp1
+---
 ax = df.plot(
     x='X',
     y='Y',
     kind='scatter',
-    ylabel='Ice-Cream Sales ($\'s)',
-    xlabel='Degrees Celcius'
+    ylabel='Ice-cream sales ($\'s)',
+    xlabel='Degrees Celsius'
 )
 ```

@@ -83,9 +89,16 @@ df['Y_hat'] = α + β * df['X']
 ```
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit"
+    name: sales-v-temp2
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
 ```
 
 We can see that this model does a poor job of estimating the relationship.
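
Passing the same `ax` to both `DataFrame.plot` calls is what puts the scatter and the line on one set of axes, and `pandas` returns that `Axes` object, which is why the updated lines reassign `ax`; the closing `plt.show()` appears to be what "ensure only one figure is returned" refers to in the commit message. A self-contained sketch of the pattern, with toy data and illustrative values of α and β:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the lecture's ice-cream data
df = pd.DataFrame({'X': [32, 21, 24, 35, 10],
                   'Y': [70, 50, 60, 80, 30]})
df['Y_hat'] = 5 + 1.9 * df['X']  # illustrative guesses for α and β

# Both calls draw on the same Axes, so only one figure is produced;
# DataFrame.plot returns that Axes, hence the reassignment
fig, ax = plt.subplots()
ax = df.plot(x='X', y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X', y='Y_hat', kind='line', ax=ax)
plt.show()
```
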
@@ -98,9 +111,16 @@ df['Y_hat'] = α + β * df['X']
 ```
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit #2"
+    name: sales-v-temp3
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
 ```
 
 ```{code-cell} ipython3
@@ -109,12 +129,19 @@ df['Y_hat'] = α + β * df['X']
 ```
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Scatter plot with a line of fit #3"
+    name: sales-v-temp4
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+plt.show()
 ```
 
-However we need to think about formalising this guessing process by thinking of this problem as an optimization problem.
+However, we need to formalize this guessing process by treating it as an optimization problem.
 
 Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$, which we will call the residuals

@@ -134,13 +161,20 @@ df
 ```
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plot of the residuals"
+    name: plt-residuals
+---
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
-plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
+plt.show()
 ```
 
-The Ordinary Least Squares (OLS) method, as the name suggests, chooses $\alpha$ and $\beta$ in such a way that **minimises** the Sum of the Squared Residuals (SSR).
+The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ in such a way that **minimizes** the sum of the squared residuals (SSR).
 
 $$
 \min_{\alpha,\beta} \sum_{i=1}^{N}{\hat{e}_i^2} = \min_{\alpha,\beta} \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
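
This minimization has the standard closed-form solution $\hat{\beta} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, which the lecture derives later as {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta`. A minimal sketch of computing the estimates directly with `pandas` (the toy data is illustrative, not the lecture's dataset):

```python
import pandas as pd

# Illustrative stand-in for the lecture's ice-cream data
df = pd.DataFrame({'X': [32, 21, 24, 35, 10],
                   'Y': [70, 50, 60, 80, 30]})

x_bar, y_bar = df['X'].mean(), df['Y'].mean()

# Slope: covariance of X and Y over the variance of X
β = ((df['X'] - x_bar) * (df['Y'] - y_bar)).sum() / ((df['X'] - x_bar)**2).sum()

# Intercept: the fitted line passes through the point of means
α = y_bar - β * x_bar

print(α, β)
```
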
@@ -152,7 +186,7 @@ $$
 C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
 $$
 
-that we would like to minimise with parameters $\alpha$ and $\beta$.
+that we would like to minimize with parameters $\alpha$ and $\beta$.
 
 ## How does error change with respect to $\alpha$ and $\beta$

@@ -173,9 +207,15 @@ for β in np.arange(20,100,0.5):
     errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()
 ```
 
-Ploting the error
+Plotting the error
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plotting the error"
+    name: plt-errors
+---
 ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
 plt.axvline(β_optimal, color='r');
 ```
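
The `errors` dictionary and `α_optimal` are defined a few lines above this hunk and do not appear in the diff. A self-contained sketch of the same grid-search idea, holding the intercept fixed and scanning candidate slopes (the data and fixed intercept here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [32, 21, 24, 35, 10],
                   'Y': [70, 50, 60, 80, 30]})
α_optimal = 10.0  # illustrative; the lecture computes this from its data

# Total absolute error of the line α_optimal + β * X for each candidate slope β
errors = {}
for β in np.arange(20, 100, 0.5):
    errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()

# The candidate slope with the smallest total error
print(min(errors, key=errors.get))
```
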
@@ -188,9 +228,15 @@ for α in np.arange(-500,500,5):
     errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()
 ```
 
-Ploting the error
+Plotting the error
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "Plotting the error (2)"
+    name: plt-errors-2
+---
 ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
 plt.axvline(α_optimal, color='r');
 ```
@@ -322,22 +368,21 @@ print(α)
 Now we can plot the OLS solution
 
 ```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: "OLS line of best fit"
+    name: plt-ols
+---
 df['Y_hat'] = α + β * df['X']
 df['error'] = df['Y_hat'] - df['Y']
 
 fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
 plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
 ```
 
-:::{admonition} Why use OLS?
-TODO
-
-1. Discuss mathematical properties for why we have chosen OLS
-:::
-
-
 :::{exercise}
 :label: slr-ex1
@@ -347,7 +392,7 @@ Let's consider two economic variables GDP per capita and Life Expectancy.
 
 1. What do you think their relationship would be?
 2. Gather some data [from our world in data](https://ourworldindata.org)
-3. Use `pandas` to import the `csv` formated data and plot a few different countries of interest
+3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
 4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
 5. Plot the line of best fit found using OLS
 6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy
@@ -363,13 +408,13 @@ Let's consider two economic variables GDP per capita and Life Expectancy.
 <iframe src="https://ourworldindata.org/grapher/life-expectancy-vs-gdp-per-capita" loading="lazy" style="width: 100%; height: 600px; border: 0px none;"></iframe>
 :::
 
-You can download {download}`a copy of the data here <_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck
+You can download {download}`a copy of the data here <https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck
 
 **Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
 
 ```{code-cell} ipython3
-fl = "_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv" # TODO: Replace with GitHub link
-df = pd.read_csv(fl, nrows=10)
+data_url = "https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv"
+df = pd.read_csv(data_url, nrows=10)
 ```
 
 ```{code-cell} ipython3
@@ -386,7 +431,7 @@ So let's build a list of the columns we want to import
 
 ```{code-cell} ipython3
 cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita']
-df = pd.read_csv(fl, usecols=cols)
+df = pd.read_csv(data_url, usecols=cols)
 df
 ```
@@ -453,24 +498,20 @@ df = df[df.year == 2018].reset_index(drop=True).copy()
 ```
 
 ```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)",);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)",);
 ```
 
 This data shows a couple of interesting relationships.
 
 1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectancy
 2. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes
 
-Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables
-
-:::{tip}
-ln -> ln == elasticities
-:::
+Even though OLS is solving linear equations, one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables.
 
 By specifying `logx` you can plot the GDP per Capita data on a log scale
 
 ```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)", logx=True);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)", logx=True);
 ```
 
 As you can see from this transformation -- a linear model fits the shape of the data more closely.
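
The log transform itself is a one-liner; a sketch under the lecture's column names `gdppc` and `life_expectancy`, with illustrative numbers standing in for the Our World in Data file:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the lecture's 2018 cross-section
df = pd.DataFrame({'gdppc': [1000, 5000, 20000, 60000],
                   'life_expectancy': [55, 65, 75, 82]})

# A straight line in log GDP per capita can capture the curvature
# visible in the level-scale scatter plot
df['log_gdppc'] = np.log(df['gdppc'])

df.plot(x='log_gdppc', y='life_expectancy', kind='scatter',
        xlabel="Log GDP per capita", ylabel="Life expectancy (years)");
```
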
@@ -528,9 +569,9 @@ plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy
 :::{exercise}
 :label: slr-ex2
 
-Minimising the sum of squares is not the **only** way to generate the line of best fit.
+Minimizing the sum of squares is not the **only** way to generate the line of best fit.
 
-For example, we could also consider minimising the sum of the **absolute values**, that would give less weight to outliers.
+For example, we could also consider minimizing the sum of the **absolute values**, which would give less weight to outliers.
 
 Solve for $\alpha$ and $\beta$ using the least absolute values
 :::
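
One numerical route into this exercise is to hand the absolute-value objective to a general-purpose optimizer; Nelder-Mead is a reasonable choice because the objective is not differentiable everywhere. This is one possible approach, not the lecture's reference solution, and the data is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data; the exercise uses the lecture's dataset
x = np.array([32, 21, 24, 35, 10], dtype=float)
y = np.array([70, 50, 60, 80, 30], dtype=float)

def sum_abs_residuals(params):
    α, β = params
    return np.abs(y - α - β * x).sum()

# Gradient-free search copes with the kinks in the absolute-value objective
res = minimize(sum_abs_residuals, x0=[0.0, 1.0], method='Nelder-Mead')
α_lad, β_lad = res.x
print(α_lad, β_lad)
```
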
