Commit e70e509

Time series no train/test split

1 parent 7d094ab commit e70e509

File tree: 3 files changed, +379 −18 lines

docs/_quarto.yml

Lines changed: 3 additions & 3 deletions

```diff
@@ -289,9 +289,9 @@ website:
       href: notes/predictive-modeling/regression/seasonality.qmd
       text: "Seasonality Analysis"
-    - section:
-      href: notes/predictive-modeling/regression/autoregressive-models.qmd
-      text: "Autoregressive Models"
+    - section:
+      href: notes/predictive-modeling/regression/autoregressive-models.qmd
+      text: "Autoregressive Models"

     - section:
       href: notes/predictive-modeling/classification/index.qmd
```
Lines changed: 335 additions & 0 deletions

# Regression for Seasonality Analysis

Let's explore an example of how to use regression to identify cyclical patterns and perform seasonality analysis with time series data.

```{python}
#| echo: false

#import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)

from pandas import set_option
set_option('display.max_rows', 6)
```

## Data Loading

As an example time series dataset, let's consider U.S. employment levels over time (total nonfarm payroll), from the Federal Reserve Economic Data (FRED).

Fetching the data, going back as far as possible:

```{python}
from pandas_datareader import get_data_fred
from datetime import datetime

DATASET_NAME = "PAYNSA"
df = get_data_fred(DATASET_NAME, start=datetime(1900, 1, 1))
print(len(df))
df
```

:::{.callout-tip title="Data Source"}
Here is some more information about the ["PAYNSA" dataset](https://fred.stlouisfed.org/series/PAYNSA):

> All Employees: Total Nonfarm, commonly known as Total Nonfarm Payroll, is a measure of the number of U.S. workers in the economy that excludes proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed.
> This measure accounts for approximately 80 percent of the workers who contribute to Gross Domestic Product (GDP).
> This measure provides useful insights into the current economic situation because it can represent the number of jobs added or lost in an economy.
> Increases in employment might indicate that businesses are hiring which might also suggest that businesses are growing. Additionally, those who are newly employed have increased their personal incomes, which means (all else constant) their disposable incomes have also increased, thus fostering further economic expansion.
> Generally, the U.S. labor force and levels of employment and unemployment are subject to fluctuations due to seasonal changes in weather, major holidays, and the opening and closing of schools.
> The Bureau of Labor Statistics (BLS) adjusts the data to offset the seasonal effects to show non-seasonal changes: for example, women's participation in the labor force; or a general decline in the number of employees, a possible indication of a downturn in the economy. To closely examine seasonal and non-seasonal changes, the BLS releases two monthly statistical measures: the seasonally adjusted All Employees: Total Nonfarm (PAYEMS) and All Employees: Total Nonfarm (PAYNSA), which is not seasonally adjusted.
:::

Wrangling the data, including renaming columns and converting the date index to be datetime-aware, makes it easier to work with:

```{python}
from pandas import to_datetime

df.rename(columns={DATASET_NAME: "employment"}, inplace=True)
df.index.name = "date"
df.index = to_datetime(df.index)
df
```

## Data Exploration

Visualizing the data:

```{python}
import plotly.express as px

px.line(df, y="employment", height=450,
        title="US Employment by month (non-seasonally adjusted)",
        labels={"employment": "Employment (in thousands of persons)"},
)
```

**Cyclical Patterns**

Exploring cyclical patterns in the data:

```{python}
px.line(df[(df.index.year >= 1970) & (df.index.year <= 1980)], y="employment",
        title="US Employment by month (selected years)", height=450,
        labels={"employment": "Employment (in thousands)"},
)
```

:::{.callout-tip title="Interactive dataviz"}
Hover over the dataviz to see which month(s) typically have higher employment, and which month(s) typically have lower employment.
:::

**Trend Analysis**

Exploring trends:

```{python}
px.scatter(df, y="employment", height=450,
           title="US Employment by month (vs Trend)",
           labels={"employment": "Employment (in thousands)"},
           trendline="ols", trendline_color_override="red"
)
```

This looks like evidence of a possible linear relationship. Let's perform a more formal regression analysis.
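As a quick sanity check before fitting a formal model, we can measure the correlation between a linear time step and the series. A minimal sketch on a synthetic upward-trending series (a stand-in for the FRED data, which requires a network fetch; the trend and seasonal amplitudes are made up for illustration):

```{python}
import numpy as np
from pandas import Series, date_range

# synthetic monthly series: a steady upward trend plus a seasonal wave,
# standing in for the employment data fetched above:
dates = date_range("2000-01-01", periods=120, freq="MS")
steps = np.arange(120)
employment = Series(100_000 + 150 * steps + 500 * np.sin(steps * 2 * np.pi / 12),
                    index=dates, name="employment")

# Pearson correlation between a linear time step and the series:
time_step = Series(range(1, len(employment) + 1), index=employment.index)
r = employment.corr(time_step)
print(round(r, 3))  # close to 1 for a strongly linear trend
```

A correlation near ±1 suggests a linear trend term is worth including; seasonality shows up as the wiggle the trend does not explain.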
## Data Encoding

Because we need numeric features to perform a regression, we convert the dates to a linear time step of integers (after first sorting the data, for good measure):

```{python}
df.sort_values(by="date", ascending=True, inplace=True)

df["time_step"] = range(1, len(df) + 1)
df
```

We will use the numeric time step as our input variable (`x`) to predict the employment (`y`).

## Data Splitting

**X/Y Split**

Identifying dependent and independent variables:

```{python}
x = df[["time_step"]]

y = df["employment"]

print("X:", x.shape)
print("Y:", y.shape)
```

**Adding Constants**

We are going to use `statsmodels`, so we add a column of constant ones representing the intercept:

```{python}
import statsmodels.api as sm

# adding in a column of constants, as per the OLS docs:
x = sm.add_constant(x)
x.head()
```

**Train/Test Split**

One option would be to split the data randomly into training and testing datasets:

```{python}
#from sklearn.model_selection import train_test_split
#
#x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
#print("TRAIN:", x_train.shape, y_train.shape)
#print("TEST:", x_test.shape, y_test.shape)
```

Another option, better suited to time series, would be to split the data sequentially, where earlier data is used for training and more recent data is used for testing:

```{python}
#print(len(df))
#
#training_size = round(len(df) * .8)
#print(training_size)
#
#x_train = x.iloc[:training_size] # slice all before
#y_train = y.iloc[:training_size] # slice all before
#
#x_test = x.iloc[training_size:] # slice all after
#y_test = y.iloc[training_size:] # slice all after
#print("TRAIN:", x_train.shape)
#print("TEST:", x_test.shape)
```

For this example, however, we will not split the data, so that we can illustrate predictions over the entire time period.
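For reference, the sequential split described above can be sketched end-to-end on a small synthetic frame (`df_toy` and its values are illustrative, not the FRED data):

```{python}
from pandas import DataFrame

# synthetic stand-in for the time-indexed dataset:
df_toy = DataFrame({"time_step": range(1, 11), "employment": range(100, 110)})

# use the earliest 80% of rows for training, the most recent 20% for testing:
training_size = round(len(df_toy) * 0.8)

x_train = df_toy[["time_step"]].iloc[:training_size]  # slice all before
y_train = df_toy["employment"].iloc[:training_size]

x_test = df_toy[["time_step"]].iloc[training_size:]   # slice all after
y_test = df_toy["employment"].iloc[training_size:]

print("TRAIN:", x_train.shape)  # (8, 1)
print("TEST:", x_test.shape)    # (2, 1)
```

Because the slices preserve row order, the test set is always the most recent data, which avoids leaking future information into training.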
## Model Selection and Training

Training a linear regression model on the data:

```{python}
import statsmodels.api as sm

#model = sm.OLS(y_train, x_train, missing="drop")
model = sm.OLS(y, x, missing="drop")
print(type(model))

results = model.fit()
print(type(results))
```

Examining training results:

```{python}
print(results.summary())
```

```{python}
print(results.params)
print("------------")
print(f"y = {results.params['time_step'].round(3)}x + {results.params['const'].round(3)}")
```

Storing the predictions and residuals alongside the original data:

```{python}
df["prediction"] = results.fittedvalues
df["residual"] = results.resid
```
For reference, if we had split the data, we could merge the training set with its predictions and residuals like so:

```{python}
#from pandas import DataFrame
#
## get all rows from the original dataset that wound up in the training set:
#training_set = df.loc[x_train.index].copy()
#print(len(training_set))
#
## create a dataset for the predictions and the residuals:
#training_preds = DataFrame({
#    "prediction": results.fittedvalues,
#    "residual": results.resid
#})
#
## merge the training set with the results:
#training_set = training_set.merge(training_preds,
#    how="inner", left_index=True, right_index=True
#)
#
## calculate error for each datapoint:
#training_set
```

**Regression Trends**

Plotting the trend line:

```{python}
px.line(df, y=["employment", "prediction"], height=350,
        title="US Employment (monthly) vs linear trend",
        labels={"value": ""}
)
```

**Regression Residuals**

Plotting the residuals:

```{python}
px.line(df, y="residual",
        title="US Employment (monthly) vs linear trend residuals", height=350
)
```

#### Seasonality via Means of Periodic Residuals

Observe that there may be some cyclical patterns in the residuals. We can surface them by calculating periodic means:

```{python}
#| echo: false

from pandas import set_option
set_option('display.max_rows', 15)
```

```{python}
df = df.copy()
df["year"] = df.index.year
df["quarter"] = df.index.quarter
df["month"] = df.index.month
```

```{python}
df.groupby("quarter")["residual"].mean()
```

```{python}
df.groupby("month")["residual"].mean()
```

```{python}
#| echo: false

from pandas import set_option
set_option('display.max_rows', 6)
```
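On data with a known seasonal pattern, these periodic means recover that pattern exactly. A sketch with hypothetical monthly effects (the +100/−100 values are made up for illustration):

```{python}
from pandas import DataFrame, date_range

# synthetic residuals with a known monthly effect:
# +100 in June, -100 in January, 0 otherwise
dates = date_range("2010-01-01", periods=120, freq="MS")
effects = {1: -100.0, 6: 100.0}
toy = DataFrame({"residual": [effects.get(d.month, 0.0) for d in dates]},
                index=dates)
toy["month"] = toy.index.month

# the periodic means recover the injected effects:
monthly_means = toy.groupby("month")["residual"].mean()
print(monthly_means.loc[[1, 6]])
```

With noisy residuals, the means would approximate rather than exactly equal the underlying effects, but the same grouping logic applies.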
#### Seasonality via Regression on Periodic Residuals

Alternatively, we can regress the residuals on the month, to quantify each month's seasonal effect:

```{python}
# https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
# "one hot encode" the monthly values:
from pandas import get_dummies as one_hot_encode

x_monthly = one_hot_encode(df["month"])
x_monthly.columns = ["Jan", "Feb", "Mar", "Apr",
                     "May", "Jun", "Jul", "Aug",
                     "Sep", "Oct", "Nov", "Dec"]
x_monthly = x_monthly.astype(int)
x_monthly
```

Can we predict the residual (i.e. the degree to which an observation will be above or below trend) based on which month it is?

```{python}
y_monthly = df["residual"]

ols_monthly = sm.OLS(y_monthly, x_monthly)
print(type(ols_monthly))

results_monthly = ols_monthly.fit()
print(type(results_monthly))

print(results_monthly.summary())
```

Observe that the coefficients tell us how each month contributes towards the regression residuals; in other words, for each month, to what degree the model predicts we will be above or below trend.
