Commit c9728c8

ogrisel, ArturoAmorQ, and SebastienMelo authored
Feature branch to update to 1.6 (#813)
* Feature branch to update to 1.6
* MNT Fix several FutureWarnings (#810)
* MTN Wrap up quiz sklearn 1.6 verification (#817)
* MAINT Use class_of_interest in DecisionBoundaryDisplay (#772)
* Resync everything

---------

Co-authored-by: Arturo Amor <[email protected]>
Co-authored-by: SebastienMelo <[email protected]>
1 parent 721cbe1 commit c9728c8

43 files changed (+282, −197 lines)

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 # exlude datasets and externals
 notebooks/datasets
 notebooks/joblib/
+wrap-up/

 # jupyter-book
 jupyter-book/_build

Makefile

Lines changed: 5 additions & 0 deletions

@@ -1,6 +1,7 @@
 PYTHON_SCRIPTS_DIR = python_scripts
 NOTEBOOKS_DIR = notebooks
 JUPYTER_BOOK_DIR = jupyter-book
+WRAP_UP_DIR = wrap-up
 JUPYTER_KERNEL := python3
 MINIMAL_NOTEBOOK_FILES = $(shell ls $(PYTHON_SCRIPTS_DIR)/*.py | perl -pe "s@$(PYTHON_SCRIPTS_DIR)@$(NOTEBOOKS_DIR)@" | perl -pe "s@\.py@.ipynb@")

@@ -37,6 +38,10 @@ quizzes:
 full-index:
 	python build_tools/generate-index.py

+run-code-in-wrap-up-quizzes:
+	python build_tools/generate-wrap-up.py $(GITLAB_REPO_JUPYTERBOOK_DIR) $(WRAP_UP_DIR)
+	jupytext --execute --to notebook $(WRAP_UP_DIR)/*.py
+
 $(JUPYTER_BOOK_DIR):
 	jupyter-book build $(JUPYTER_BOOK_DIR)
 	rm -rf $(JUPYTER_BOOK_DIR)/_build/html/{slides,figures} && cp -r slides figures $(JUPYTER_BOOK_DIR)/_build/html

build_tools/generate-wrap-up.py

Lines changed: 107 additions & 0 deletions

@@ -0,0 +1,107 @@
+import sys
+import os
+import glob
+
+
+def extract_python_code_blocks(md_file_path):
+    """
+    Extract Python code blocks from a markdown file.
+
+    Args:
+        md_file_path (str): Path to the markdown file
+
+    Returns:
+        list: List of extracted Python code blocks
+    """
+    code_blocks = []
+    in_python_block = False
+    current_block = []
+
+    with open(md_file_path, "r", encoding="utf-8") as file:
+        for line in file:
+            line = line.rstrip("\n")
+
+            if line.strip() == "```python":
+                in_python_block = True
+                current_block = []
+            elif line.strip() == "```" and in_python_block:
+                in_python_block = False
+                code_blocks.append("\n".join(current_block))
+            elif in_python_block:
+                current_block.append(line)
+
+    return code_blocks
+
+
+def write_jupyter_notebook_file(
+    code_blocks, output_file="notebook_from_md.py"
+):
+    """
+    Writes extracted code blocks to a Python file formatted as Jupyter notebook cells.
+
+    Args:
+        code_blocks (list): List of code blocks to write
+        output_file (str): Path to the output file
+    """
+    with open(output_file, "w", encoding="utf-8") as file:
+        file.write(
+            "# %% [markdown] \n # ## Notebook generated from Markdown file\n\n"
+        )
+
+        for i, block in enumerate(code_blocks, 1):
+            file.write(f"# %% [markdown]\n# ## Cell {i}\n\n# %%\n{block}\n\n")
+
+    print(
+        f"Successfully wrote {len(code_blocks)} code cells to"
+        f" {output_file}"
+    )
+
+
+def process_quiz_files(input_path, output_dir):
+    """
+    Process all wrap_up_quiz files in the input path and convert them to notebooks.
+
+    Args:
+        input_path (str): Path to look for wrap_up_quiz files in subfolders
+        output_dir (str): Directory to write the generated notebooks
+    """
+    # Create output directory if it doesn't exist
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+        print(f"Created output directory: {output_dir}")
+
+    # Find all files containing "wrap_up_quiz" in their name in the input path subfolders
+    quiz_files = glob.glob(
+        f"{input_path}/**/*wrap_up_quiz*.md", recursive=True
+    )
+
+    if not quiz_files:
+        print(f"No wrap_up_quiz.md files found in {input_path} subfolders.")
+        return
+
+    print(f"Found {len(quiz_files)} wrap_up_quiz files to process.")
+
+    # Process each file
+    for md_file_path in quiz_files:
+        print(f"\nProcessing: {md_file_path}")
+
+        # Extract code blocks
+        code_blocks = extract_python_code_blocks(md_file_path)
+
+        # Generate output filename
+        subfolder = md_file_path.split(os.sep)[3]  # Get subfolder name
+        output_file = os.path.join(output_dir, f"{subfolder}_wrap_up_quiz.py")
+
+        # Display results and write notebook file
+        if code_blocks:
+            print(f"Found {len(code_blocks)} Python code blocks")
+            write_jupyter_notebook_file(code_blocks, output_file=output_file)
+        else:
+            print(f"No Python code blocks found in {md_file_path}.")
+
+
+if __name__ == "__main__":
+    input_path = sys.argv[1]
+    output_dir = sys.argv[2]
+
+    process_quiz_files(input_path, output_dir)
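
For reference, running this script on a quiz file whose only fenced block contains print("hello") would emit a percent-format file along these lines (reconstructed by hand from the write calls above; the stray leading space on the second line mirrors the literal string in write_jupyter_notebook_file):

# %% [markdown] 
 # ## Notebook generated from Markdown file

# %% [markdown]
# ## Cell 1

# %%
print("hello")

The jupytext --execute --to notebook step added to the Makefile then runs each # %% cell and saves the executed result as an .ipynb file.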

notebooks/03_categorical_pipeline_ex_02.ipynb

Lines changed: 1 addition & 20 deletions

@@ -160,26 +160,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "### Analysis\n",
-     "\n",
-     "From an accuracy point of view, the result is almost exactly the same. The\n",
-     "reason is that `HistGradientBoostingClassifier` is expressive and robust\n",
-     "enough to deal with misleading ordering of integer coded categories (which was\n",
-     "not the case for linear models).\n",
-     "\n",
-     "However from a computation point of view, the training time is much longer:\n",
-     "this is caused by the fact that `OneHotEncoder` generates more features than\n",
-     "`OrdinalEncoder`; for each unique categorical value a column is created.\n",
-     "\n",
-     "Note that the current implementation `HistGradientBoostingClassifier` is still\n",
-     "incomplete, and once sparse representation are handled correctly, training\n",
-     "time might improve with such kinds of encodings.\n",
-     "\n",
-     "The main take away message is that arbitrary integer coding of categories is\n",
-     "perfectly fine for `HistGradientBoostingClassifier` and yields fast training\n",
-     "times.\n",
-     "\n",
-     "Which encoder should I use?\n",
+     "## Which encoder should I use?\n",
      "\n",
      "| | Meaningful order | Non-meaningful order |\n",
      "| ---------------- | ----------------------------- | -------------------- |\n",

notebooks/cross_validation_grouping.ipynb

Lines changed: 4 additions & 3 deletions

@@ -189,9 +189,10 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "If we read carefully, 13 writers wrote the digits of our dataset, accounting\n",
-     "for a total amount of 1797 samples. Thus, a writer wrote several times the\n",
-     "same numbers. Let's suppose that the writer samples are grouped. Subsequently,\n",
+     "If we read carefully, `load_digits` loads a copy of the **test set** of the\n",
+     "UCI ML hand-written digits dataset, which consists of 1797 images by\n",
+     "**13 different writers**. Thus, each writer wrote several times the same\n",
+     "numbers. Let's suppose the dataset is ordered by writer. Subsequently,\n",
      "not shuffling the data will keep all writer samples together either in the\n",
      "training or the testing sets. Mixing the data will break this structure, and\n",
      "therefore digits written by the same writer will be available in both the\n",

notebooks/datasets_bike_rides.ipynb

Lines changed: 1 addition & 1 deletion

@@ -271,7 +271,7 @@
     "metadata": {},
     "outputs": [],
     "source": [
-     "data_ride.resample(\"60S\").mean().plot()\n",
+     "data_ride.resample(\"60s\").mean().plot()\n",
      "plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
      "_ = plt.title(\"Sensor values for different cyclist measurements\")"
     ]
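
The "60S" to "60s" change tracks pandas, where uppercase second aliases are deprecated in favor of lowercase ones (uppercase "S" emits a FutureWarning from pandas 2.2 on). A standalone sketch on synthetic data, assuming pandas >= 2.2:

import pandas as pd

# Synthetic stand-in for the ride data: one sample every 10 seconds.
data_ride = pd.DataFrame(
    {"speed": range(12)},
    index=pd.date_range("2024-01-01", periods=12, freq="10s"),
)
# Lowercase "60s" is the non-deprecated alias for 60-second bins.
print(data_ride.resample("60s").mean())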

notebooks/ensemble_adaboost.ipynb

Lines changed: 1 addition & 1 deletion

@@ -271,7 +271,7 @@
     "\n",
     "estimator = DecisionTreeClassifier(max_depth=3, random_state=0)\n",
     "adaboost = AdaBoostClassifier(\n",
-     "    estimator=estimator, n_estimators=3, algorithm=\"SAMME\", random_state=0\n",
+     "    estimator=estimator, n_estimators=3, random_state=0\n",
     ")\n",
     "adaboost.fit(data, target)"
    ]
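
Dropping algorithm="SAMME" matches scikit-learn 1.6, where SAMME is the only remaining boosting scheme and passing the parameter explicitly triggers one of the FutureWarnings this commit fixes. A self-contained sketch (the notebook's own data and target come from earlier cells, so synthetic data stands in here):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

data, target = make_classification(random_state=0)
estimator = DecisionTreeClassifier(max_depth=3, random_state=0)
# No algorithm argument: SAMME is the behavior you get anyway in 1.6.
adaboost = AdaBoostClassifier(estimator=estimator, n_estimators=3, random_state=0)
adaboost.fit(data, target)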

notebooks/ensemble_ex_03.ipynb

Lines changed: 21 additions & 3 deletions

@@ -107,6 +107,24 @@
     "ensemble. However, the scores reach a plateau where adding new trees just\n",
     "makes fitting and scoring slower.\n",
     "\n",
+    "Now repeat the analysis for the gradient boosting model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "Gradient boosting models overfit when the number of trees is too large. To\n",
     "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",

@@ -115,9 +133,9 @@
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
-    "such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
-    "deterioration of the overall generalization performance."
+    "of trees is certainly too large as we have seen above. Change the parameter\n",
+    "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n",
+    "5 trees to avoid deterioration of the overall generalization performance."
    ]
   },
   {
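
The early-stopping behavior the new cells describe can be sketched as follows; a hedged example on synthetic data, not the exercise's official solution:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

data, target = make_regression(n_samples=500, noise=10, random_state=0)
# n_estimators=1_000 is deliberately too large; n_iter_no_change=5 holds out
# an internal validation set (validation_fraction, 10% by default) and stops
# once 5 consecutive extra trees fail to improve the validation loss.
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(data, target)
print(f"Trees actually fitted: {gbdt.n_estimators_}")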

notebooks/linear_models_regularization.ipynb

Lines changed: 3 additions & 3 deletions

@@ -618,7 +618,7 @@
     "ridge = make_pipeline(\n",
     "    MinMaxScaler(),\n",
     "    PolynomialFeatures(degree=2, include_bias=False),\n",
-    "    RidgeCV(alphas=alphas, store_cv_values=True),\n",
+    "    RidgeCV(alphas=alphas, store_cv_results=True),\n",
     ")"
    ]
   },

@@ -677,7 +677,7 @@
     "It indicates that our model is not overfitting.\n",
     "\n",
     "When fitting the ridge regressor, we also requested to store the error found\n",
-    "during cross-validation (by setting the parameter `store_cv_values=True`). We\n",
+    "during cross-validation (by setting the parameter `store_cv_results=True`). We\n",
     "can plot the mean squared error for the different `alphas` regularization\n",
     "strengths that we tried. The error bars represent one standard deviation of the\n",
     "average mean square error across folds for a given value of `alpha`."

@@ -690,7 +690,7 @@
     "outputs": [],
     "source": [
     "mse_alphas = [\n",
-    "    est[-1].cv_values_.mean(axis=0) for est in cv_results[\"estimator\"]\n",
+    "    est[-1].cv_results_.mean(axis=0) for est in cv_results[\"estimator\"]\n",
     "]\n",
     "cv_alphas = pd.DataFrame(mse_alphas, columns=alphas)\n",
     "cv_alphas = cv_alphas.aggregate([\"mean\", \"std\"]).T\n",

notebooks/parameter_tuning_grid_search.ipynb

Lines changed: 3 additions & 0 deletions

@@ -157,6 +157,9 @@
     "preprocessor = ColumnTransformer(\n",
     "    [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n",
     "    remainder=\"passthrough\",\n",
+    "    # Silence a deprecation warning in scikit-learn v1.6 related to how the\n",
+    "    # ColumnTransformer stores an attribute that we do not use in this notebook\n",
+    "    force_int_remainder_cols=False,\n",
     ")"
    ]
   },
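
The same force_int_remainder_cols=False argument is added to two more notebooks below. For context, a self-contained sketch of what it silences (assuming scikit-learn 1.5 or 1.6, where the parameter exists and its legacy default of True is deprecated):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})
preprocessor = ColumnTransformer(
    [("cat_preprocessor", OneHotEncoder(), ["color"])],
    remainder="passthrough",
    # Opt in to the future behavior (remainder columns tracked by name rather
    # than by integer position) so the FutureWarning is not raised.
    force_int_remainder_cols=False,
)
print(preprocessor.fit_transform(df))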

notebooks/parameter_tuning_nested.ipynb

Lines changed: 1 addition & 0 deletions

@@ -70,6 +70,7 @@
     "        (\"cat_preprocessor\", categorical_preprocessor, categorical_columns),\n",
     "    ],\n",
     "    remainder=\"passthrough\",\n",
+    "    force_int_remainder_cols=False,  # Silence a warning in scikit-learn v1.6.\n",
     ")"
    ]
   },

notebooks/parameter_tuning_randomized_search.ipynb

Lines changed: 1 addition & 0 deletions

@@ -121,6 +121,7 @@
     "preprocessor = ColumnTransformer(\n",
     "    [(\"cat_preprocessor\", categorical_preprocessor, categorical_columns)],\n",
     "    remainder=\"passthrough\",\n",
+    "    force_int_remainder_cols=False,  # Silence a warning in scikit-learn v1.6.\n",
     ")"
    ]
   },

notebooks/trees_ex_01.ipynb

Lines changed: 3 additions & 3 deletions

@@ -83,9 +83,9 @@
     "<div class=\"admonition warning alert alert-danger\">\n",
     "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
     "<p class=\"last\">At this time, it is not possible to use <tt class=\"docutils literal\"><span class=\"pre\">response_method=\"predict_proba\"</span></tt> for\n",
-    "multiclass problems. This is a planned feature for a future version of\n",
-    "scikit-learn. In the mean time, you can use <tt class=\"docutils literal\"><span class=\"pre\">response_method=\"predict\"</span></tt>\n",
-    "instead.</p>\n",
+    "multiclass problems on a single plot. This is a planned feature for a future\n",
+    "version of scikit-learn. In the mean time, you can use\n",
+    "<tt class=\"docutils literal\"><span class=\"pre\">response_method=\"predict\"</span></tt> instead.</p>\n",
     "</div>"
    ]
   },
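
The reworded warning ties in with the class_of_interest work named in the commit title: since scikit-learn 1.4, DecisionBoundaryDisplay can plot predict_proba for a multiclass problem one class at a time via class_of_interest, while response_method="predict" still covers all classes in a single plot. A minimal sketch on the iris data:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X = X[["petal length (cm)", "petal width (cm)"]]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# With more than two classes, predict_proba can only be drawn for one class
# at a time, hence class_of_interest; "predict" plots all classes at once.
DecisionBoundaryDisplay.from_estimator(
    tree, X, response_method="predict_proba", class_of_interest=2
)
plt.show()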
