Clarify encoding pipeline (#820)
* Update preprocessing.qmd

* Update preprocessing.qmd

---------

Co-authored-by: Sebastian Fischer <[email protected]>
mb706 and sebffischer authored Jan 31, 2025
1 parent 8609775 commit 1c124ab
Showing 1 changed file with 5 additions and 1 deletion.
book/chapters/chapter9/preprocessing.qmd (6 changes: 5 additions & 1 deletion)
@@ -20,7 +20,7 @@ In this book, preprocessing refers to everything that happens with *data* before
Another aspect of preprocessing is `r index('feature engineering', aside = TRUE)`, which covers all other transformations of data before it is fed to the machine learning model, including the creation of features from possibly unstructured data, such as written text, sequences or images.
The goal of feature engineering is to enable the data to be handled by a given learner, and/or to further improve predictive performance.
Note that feature engineering mostly helps simpler algorithms, while highly complex models usually gain less from it and require little data preparation to be trained.
-Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
+Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high-cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
Deep learning has shown promising results in automating feature engineering; however, its effectiveness depends on the complexity and nature of the data being processed, as well as the specific problem being addressed.
Typically, it works well for natural language processing and computer vision problems, while for standard tabular data, tree-based ensembles such as random forests or gradient boosting are often still superior (and easier to handle). However, tabular deep learning approaches are currently catching up quickly.
Hence, manual feature engineering is often still required, but `mlr3pipelines` can simplify the process as much as possible.
@@ -151,6 +151,10 @@ factor_pipeline =
affect_columns = selector_type("factor"), id = "binary_enc")
```

The order in which operations are performed matters here: `po("encodeimpact")` converts high-cardinality `factor` features into `numeric` features, so these are not affected by the `po("encode")` operators that come afterwards.
Therefore, the one-hot encoding PipeOp does not need to be configured explicitly *not* to affect high-cardinality features.
Likewise, by the time the treatment encoding PipeOp sees the data, all non-binary `factor` features have already been converted, so by default it only affects binary factors.
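
To illustrate this ordering, here is a minimal self-contained sketch; the toy data, the cardinality thresholds, and all object names are illustrative assumptions, not code from the book.

```r
library(mlr3)
library(mlr3pipelines)

# Toy data: a binary factor, a low-cardinality factor, and a
# high-cardinality factor (hypothetical, for illustration only).
set.seed(1)
dat = data.frame(
  bin  = factor(sample(c("a", "b"), 100, replace = TRUE)),
  low  = factor(sample(letters[1:4], 100, replace = TRUE)),
  high = factor(sample(sprintf("lvl%02d", 1:30), 100, replace = TRUE)),
  y    = factor(sample(c("yes", "no"), 100, replace = TRUE))
)
task = as_task_classif(dat, target = "y")

# Impact encoding runs first: `high` becomes numeric, so it is
# invisible to the `po("encode")` operators that follow.
graph = po("encodeimpact",
    affect_columns = selector_cardinality_greater_than(10),
    id = "high_card_enc") %>>%
  po("encode", method = "one-hot",
    affect_columns = selector_cardinality_greater_than(2),
    id = "low_card_enc") %>>%
  po("encode", method = "treatment",
    affect_columns = selector_type("factor"), id = "binary_enc")

# After training, every feature is numeric: `high` was impact-encoded,
# `low` one-hot encoded, and the binary `bin` treatment-encoded.
graph$train(task)[[1]]$feature_types
```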

Now we can apply this pipeline to our xgboost model to use it in a benchmark experiment; we also compare a simpler pipeline that only uses one-hot encoding to demonstrate performance differences resulting from different strategies.

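The actual benchmark chunk below is collapsed in this diff view. As a rough sketch of what such a comparison could look like, reusing `task` and `graph` from the sketch above (the learner, resampling, and measure choices are assumptions, not the book's code):

```r
library(mlr3learners)  # provides lrn("classif.xgboost"); needs the xgboost package

# Two graph learners: the full factor pipeline vs. plain one-hot encoding.
glrn_full   = as_learner(graph %>>% lrn("classif.xgboost"))
glrn_onehot = as_learner(po("encode", method = "one-hot") %>>% lrn("classif.xgboost"))

# Cross-validated comparison on the toy task from above.
design = benchmark_grid(
  tasks       = list(task),
  learners    = list(glrn_full, glrn_onehot),
  resamplings = list(rsmp("cv", folds = 3))
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.ce"))
```
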
```{r preprocessing-013, message=FALSE}
