Commit 1d2051f: Merge pull request #11 from glemaitre/add_precommit_hook
MAINT add pre-commit hook
2 parents: fd93543 + cbe78d1

12 files changed: +133 −77 lines

.github/workflows/lint.yml (new file, +30 lines)

name: Run code format checks

on:
  push:
    branches:
      - "main"
  pull_request:
    branches:
      - '*'

jobs:
  run-pre-commit-checks:
    name: Run pre-commit checks
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v2
      - name: Set up Python 3.12
        run: uv python install 3.12
      - name: Install Venv
        run: uv venv --python 3.12
      - name: Linter
        run: |
          source .venv/bin/activate
          which python
          python --version
          uv pip install -e .[lint]
          pre-commit install && pre-commit run -v --all-files --show-diff-on-failure
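The same checks can be run locally before pushing. A sketch, assuming `uv` is installed and the package defines the `[lint]` extra that the workflow installs:

```shell
# Create a virtualenv, install the lint extras, and run every hook on all files,
# mirroring the Linter step of the workflow above.
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[lint]"
pre-commit install
pre-commit run -v --all-files --show-diff-on-failure
```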

.pre-commit-config.yaml (new file, +16 lines)

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.5.1
    hooks:
      - id: ruff
        args: ["--fix", "--output-format=full"]
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
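The first two hooks account for most of the rest of this commit: nearly every change below is a trailing-whitespace or end-of-file fix. Roughly, in Python (a sketch of the hooks' effect, not pre-commit's actual implementation):

```python
def normalize(text: str) -> str:
    """Approximate what the trailing-whitespace and end-of-file-fixer hooks do."""
    # trailing-whitespace: strip whitespace at the end of every line.
    lines = [line.rstrip() for line in text.split("\n")]
    # end-of-file-fixer: make the file end with exactly one newline.
    return "\n".join(lines).rstrip("\n") + "\n"

# e.g. the Makefile's last line, which previously lacked a final newline:
print(repr(normalize("twine upload dist/*")))
```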

LICENSE (+1 −1)

The only change strips trailing whitespace from the copyright line:

@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2024 vincent d warmerdam

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

Makefile (+1 −1)

The only change is a whitespace fix on the final line:

@@ -22,4 +22,4 @@ check: lint precommit test clean
pypi: clean
	python setup.py sdist
	python setup.py bdist_wheel --universal
	twine upload dist/*

README.md (+8 −8)

All eight changes strip trailing whitespace; the text itself is unchanged.

@@ -2,20 +2,20 @@

> Rethinking machine learning pipelines a bit.

## What does `scikit-playtime` do?

I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B), but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.

Imagine that you are dealing with the Titanic dataset.

```python
import pandas as pd

df = pd.read_csv("https://calmcode.io/static/data/titanic.csv")
df.head()
```

Here's what the dataset looks like.

| survived | pclass | name | sex | age | fare | sibsp | parch |
|-----------:|---------:|:----------------------------------------------------|:-------|------:|--------:|--------:|--------:|

@@ -41,11 +41,11 @@ pipe = make_union(
)
```

This pipeline takes the **age**, **fare**, **sibsp** and **parch** features as-is. These features are already numeric, so they do not need to be changed. But the **sex** and **pclass** features are candidates to one-hot encode first. These are categorical features, so it helps to encode them as such.

The pipeline works, and it's fine, but you could wonder if this is *easy*. After all, you do need to know scikit-learn fairly well in order to build a pipeline this way, and you may also need to appreciate Python. There's some nesting happening in here as well, so for a novice or somebody who just immediately wants to make a quick model ... there's some stuff that gets in the way. All of this is fine when you consider that scikit-learn needs to allow for elaborate pipelines ... but if you just want something dead simple ... then you may appreciate another syntax instead.

## Enter playtime.

Playtime offers an API that allows you to declare the aforementioned pipeline by doing this instead:

@@ -64,9 +64,9 @@ formula

![playtime](docs/imgs/pipe-demo.png)

It's pretty much the same pipeline as before, but it's a lot easier to go ahead and declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.

This is what `scikit-playtime` is all about, but this is just the start of what it can do. If that sounds interesting, you can read more on the [documentation page](https://koaning.github.io/playtime/).

Alternatively, you may also explore this tool by installing it via:

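The `make_union` pipeline that the README describes can also be sketched with scikit-learn's `make_column_transformer`. This is a hedged equivalent, not the README's exact code (which builds a union with `skrub`'s `SelectCols`):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Numeric features pass through untouched; categorical features get one-hot encoded.
pipe = make_column_transformer(
    ("passthrough", ["age", "fare", "sibsp", "parch"]),
    (OneHotEncoder(), ["sex", "pclass"]),
)

# Tiny stand-in for the titanic dataframe loaded earlier.
df = pd.DataFrame({
    "age": [22.0, 38.0], "fare": [7.25, 71.28],
    "sibsp": [1, 1], "parch": [0, 0],
    "sex": ["male", "female"], "pclass": [3, 1],
})
X = pipe.fit_transform(df)  # 4 passthrough columns + 4 one-hot columns
```

From here you would fit a classifier on `X`, or wrap `pipe` together with an estimator in `make_pipeline`, exactly as with the README's version.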
datasets/README.md (+2 −2)

Both changes strip trailing whitespace:

We gathered some fun datasets to play with. These datasets have been cleaned up
to make it easy to get started, but their origins are documented here.

- `me-temperatures.csv` was originally found on [Kaggle](https://www.kaggle.com/datasets/shenba/time-series-datasets?select=daily-minimum-temperatures-in-me.csv)

docs/index.md (+17 −17)

All seventeen changes strip trailing whitespace; the text itself is unchanged.

@@ -7,20 +7,20 @@

> Rethinking machine learning pipelines a bit.

## What does `scikit-playtime` do?

I was wondering if there might be an easier way to construct scikit-learn pipelines. Don't get me wrong, scikit-learn is amazing when you want elaborate pipelines (exhibit A, exhibit B), but maybe there is also a place for something more lightweight and playful. This library is all about exploring that.

Imagine that you are dealing with the Titanic dataset.

```python
import pandas as pd

df = pd.read_csv("https://calmcode.io/static/data/titanic.csv")
df.head()
```

Here's what the dataset looks like.

| survived | pclass | name | sex | age | fare | sibsp | parch |
|-----------:|---------:|:----------------------------------------------------|:-------|------:|--------:|--------:|--------:|

@@ -46,9 +46,9 @@ pipe = make_union(
)
```

This pipeline takes the **age**, **fare**, **sibsp** and **parch** features as-is. These features are already numeric, so they do not need to be changed. But the **sex** and **pclass** features are candidates to one-hot encode first. These are categorical features, so it helps to encode them as such.

Here's what the HTML render of the pipeline looks like.

[Inline CSS/HTML for the scikit-learn pipeline widget: unchanged context, omitted here]

@@ -475,7 +475,7 @@

The pipeline works, and it's fine, but you could wonder if this is *easy*. After all, you do need to know scikit-learn fairly well in order to build a pipeline this way, and you may also need to appreciate Python. There's some nesting happening in here as well, so for a novice or somebody who just immediately wants to make a quick model ... there's some stuff that gets in the way. All of this is fine when you consider that scikit-learn needs to allow for elaborate pipelines ... but if you just want something dead simple ... then you may appreciate another syntax instead.

## Enter playtime.

Playtime offers an API that allows you to declare the aforementioned pipeline by doing this instead:

@@ -491,12 +491,12 @@ This `formula` object is just a pipeline object that can accumulate components

formula
```

It's pretty much the same pipeline, but it's a lot easier to go ahead and declare. You're mostly dealing with column names and how to encode them, instead of thinking about how scikit-learn constructs a pipeline.

## Let's also do text.

Right now we're just exploring base features and one-hot encoding ... but why stop there? We can also encode the name of the passenger using a bag of words representation!

```python
from playtime import feats, onehot, bag_of_words
```

@@ -935,18 +935,18 @@

[HTML render of the fitted pipeline widget, showing SelectCols, OneHotEncoder, and a FunctionTransformer(column_pluck) + CountVectorizer branch: unchanged context, omitted here]

Again, as a user you don't need to worry about the internals of the pipeline, you just declare how you want to model.

??? question "About that `bag_of_words` representation"

    The `CountVectorizer` in scikit-learn is great for making bag of words representations, but it assumes an iterable of texts as input. That means we can't use the `SelectCols` object from `skrub`, because that will always return a dataframe, even if we only want a single column from it.

    Again, this is a detail that a modeller should not be concerned with, so `playtime` fixes this internally on your behalf. Part of this involves leveraging [narwhals](https://github.com/narwhals-dev/narwhals), which even allows us to support both polars and pandas in one go.
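The pattern in that admonition, pluck a single column out as an iterable of strings and feed it to `CountVectorizer`, can be sketched with plain scikit-learn. `column_pluck` here is a stand-in written for illustration, not playtime's internal helper:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

def column_pluck(df, column):
    # Return one dataframe column as a plain list of strings,
    # which is the iterable-of-texts input CountVectorizer expects.
    return df[column].tolist()

text_pipe = make_pipeline(
    FunctionTransformer(column_pluck, kw_args={"column": "name"}),
    CountVectorizer(),
)

df = pd.DataFrame({"name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"]})
X = text_pipe.fit_transform(df)  # sparse bag-of-words matrix
```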
## Let's also do timeseries.

So far we've shown how you might use one-hot encoded variables and bag of words representations to preprocess data for a machine learning use-case. This covers a lot of ground already, but why stop here?

We're still exploring all the ways that you might encode data, but just to give one more example, let's consider timeseries. We could generate some features that can help predict seasonal patterns. Internally we're using [this](https://www.youtube.com/watch?v=cEpiqu3QCW0&t=2s) technique, but again, here's all you need:

@@ -956,12 +956,12 @@ from playtime import seasonal

formula = seasonal("timestamp", n_knots=12)
```

Again, this formula contains a pipeline that we can pass to a model.

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
import matplotlib.pylab as plt
import numpy as np

# Load data that has a timestamp column and a `y` target column
```

@@ -1019,4 +1019,4 @@

<br><br><br>
</p>

<br>
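The `seasonal("timestamp", n_knots=12)` feature above builds basis functions internally (per the linked video). A rough stand-in using Fourier features, purely illustrative and not playtime's implementation:

```python
import numpy as np

def seasonal_features(day_of_year, n_terms=2):
    # sin/cos pairs at k cycles per year: a classic way to encode
    # yearly seasonality as smooth, periodic features for a linear model.
    t = 2 * np.pi * np.asarray(day_of_year, dtype=float) / 365.25
    cols = []
    for k in range(1, n_terms + 1):
        cols.append(np.sin(k * t))
        cols.append(np.cos(k * t))
    return np.stack(cols, axis=1)

# Four days of the year -> a (4, 2 * n_terms) feature matrix.
X = seasonal_features([0, 91, 182, 274], n_terms=2)
```

These features could then be fed to a `Ridge` model exactly as in the snippet above.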

mkdocs.yml (+1 −1)

The only change is a whitespace fix on the final line:

@@ -20,4 +20,4 @@ markdown_extensions:
  - attr_list
  - pymdownx.emoji:
      emoji_index: !!python/name:material.extensions.emoji.twemoji
      emoji_generator: !!python/name:material.extensions.emoji.to_svg
