`README.md` (+8 -8)

> Rethinking machine learning pipelines a bit.

## What does `scikit-playtime` do?

I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B) but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.

Imagine that you are dealing with the titanic dataset.
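
A rough sketch of the kind of plain scikit-learn pipeline being described here (the exact code may differ; the `titanic.csv` file name and `survived` target column are assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed file name and target column; any titanic-style dataset works.
df = pd.read_csv("titanic.csv")
X, y = df.drop(columns=["survived"]), df["survived"]

# Pass the numeric columns through as-is, one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("passthrough", "passthrough", ["age", "fare", "sibsp", "parch"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex", "pclass"]),
])

pipe = make_pipeline(preprocess, LogisticRegression())
pipe.fit(X, y)
```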

This pipeline takes the **age**, **fare**, **sibsp** and **parch** features as-is. These features are already numeric, so they do not need to be changed. But the **sex** and **pclass** features are candidates to one-hot encode first. These are categorical features, so it helps to encode them as such.
The pipeline works, and it's fine, but you could wonder if this is *easy*. After all, you do need to know scikit-learn fairly well in order to build a pipeline this way and you may also need to appreciate Python. There's some nesting happening in here as well, so for a novice or somebody who just immediately wants to make a quick model ... there's some stuff that gets in the way. All of this is fine when you consider that scikit-learn needs to allow for elaborate pipelines ... but if you just want something dead simple ... then you may appreciate another syntax instead.

## Enter playtime.

Playtime offers an API that allows you to declare the aforementioned pipeline by doing this instead:
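
A sketch of what that declaration might look like (the `feats` and `onehot` helper names are assumptions here, in the spirit of the `bag_of_words` and `seasonal` helpers mentioned further down):

```python
from playtime import feats, onehot

# Numeric columns go in as-is, categorical columns get one-hot encoded.
formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass")
formula
```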
It's pretty much the same pipeline as before, but it's a lot easier to go ahead and declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.

This is what `scikit-playtime` is all about, but this is just the start of what it can do. If that sounds interesting, you can read more on the [documentation page](https://koaning.github.io/playtime/).

Alternatively, you may also explore this tool by installing it via:
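
Presumably something along these lines (the package name here is an assumption, taken from the project title):

```
python -m pip install scikit-playtime
```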

We gathered some fun datasets to play with. These datasets have been cleaned up to make it easy to get started, but their origins are documented here.

- `me-temperatures.csv` was originally found on [Kaggle](https://www.kaggle.com/datasets/shenba/time-series-datasets?select=daily-minimum-temperatures-in-me.csv)

`docs/index.md` (+17 -17)

> Rethinking machine learning pipelines a bit.

## What does `scikit-playtime` do?

I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B) but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.

Imagine that you are dealing with the titanic dataset.

This pipeline takes the **age**, **fare**, **sibsp** and **parch** features as-is. These features are already numeric, so they do not need to be changed. But the **sex** and **pclass** features are candidates to one-hot encode first. These are categorical features, so it helps to encode them as such.

Here's what the HTML render of the pipeline looks like.
The pipeline works, and it's fine, but you could wonder if this is *easy*. After all, you do need to know scikit-learn fairly well in order to build a pipeline this way and you may also need to appreciate Python. There's some nesting happening in here as well, so for a novice or somebody who just immediately wants to make a quick model ... there's some stuff that gets in the way. All of this is fine when you consider that scikit-learn needs to allow for elaborate pipelines ... but if you just want something dead simple ... then you may appreciate another syntax instead.

## Enter playtime.

Playtime offers an API that allows you to declare the aforementioned pipeline by doing this instead:

This `formula` object is just a pipeline object that can accumulate components.

```python
formula
```

It's pretty much the same pipeline, but it's a lot easier to go ahead and declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.

## Let's also do text.

Right now we're just exploring base features and one-hot encoding ... but why stop there? We can also encode the name of the passenger using a bag of words representation!
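
A sketch of how that might read (again, `feats` and `onehot` are assumed helper names; only `bag_of_words` is named in the text):

```python
from playtime import bag_of_words, feats, onehot

# Same formula as before, now with a bag of words encoding of the passenger name.
formula = feats("age", "fare", "sibsp", "parch") + onehot("sex", "pclass") + bag_of_words("name")
```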

Again, as a user you don't need to worry about the internals of the pipeline, you just declare how you want to model.

??? question "About that `bag_of_words` representation"

    The `CountVectorizer` in scikit-learn is great for making bag of words representations, but it assumes an iterable of texts as input. That means we can't just use the `SelectCols` object from `skrub`, because that will always return a dataframe, even if we only want a single column from it.

    Again, this is a detail that a modeller should not be concerned with, so `playtime` fixes this internally on your behalf. Part of this involves leveraging [narwhals](https://github.com/narwhals-dev/narwhals), which even allows us to support both polars and pandas in one go.
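
    To illustrate the mismatch in plain scikit-learn terms (this is not playtime's actual internals, just a sketch with an illustrative `name` column): you need something that pulls a single column out as a one-dimensional sequence of strings before `CountVectorizer` can consume it.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    # CountVectorizer wants an iterable of strings, not a dataframe,
    # so grab the single text column first (the "name" column is illustrative).
    grab_name = FunctionTransformer(lambda df: df["name"].astype(str))

    text_features = make_pipeline(grab_name, CountVectorizer())
    ```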

## Let's also do timeseries.

So far we've shown how you might use one hot encoded variables and bag of words representations to preprocess data for a machine learning use-case. This covers a lot of ground already, but why stop here?

We're still exploring all the ways that you might encode data, but just to give one more example, let's consider timeseries. We could generate some features that can help predict seasonal patterns. Internally we're using [this](https://www.youtube.com/watch?v=cEpiqu3QCW0&t=2s) technique, but again, here's all you need:

```python
from playtime import seasonal

formula = seasonal("timestamp", n_knots=12)
```

Again, this formula contains a pipeline that we can pass to a model.

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
import matplotlib.pylab as plt
import numpy as np
import pandas as pd

# Load data that has a timestamp column and a `y` target column
# (the file name is an assumption; any timestamped dataset with a `y` column works).
df = pd.read_csv("me-temperatures.csv")

# Assumption: the formula slots into make_pipeline like any other preprocessing step.
pipe = make_pipeline(formula, Ridge()).fit(df, df["y"])

# Plot the predictions against the observed values.
plt.plot(np.asarray(df["y"]), label="observed")
plt.plot(pipe.predict(df), label="predicted")
```