Learn2Clean V2 is a modular Python framework designed to optimize data preparation pipelines using Deep Reinforcement Learning (DRL). It wraps standard data cleaning tasks (Imputation, Deduplication, Outlier Removal, Normalization) into Gymnasium environments, allowing agents (PPO, DQN) to learn optimal cleaning strategies automatically.
Learn2Clean V1 is a Python library for data preprocessing and cleaning based on Q-Learning, a model-free reinforcement learning technique. Given a dataset, an ML model, and a quality performance metric, it selects the optimal sequence of preparation tasks so that the quality of the ML model's result is maximized.
For more details about V1, please refer to the paper presented at The Web Conference 2019 and the related tutorial:

- Laure Berti-Equille. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In Proceedings of The Web Conference 2019, San Francisco, May 2019. (Preprint)
- Laure Berti-Equille. ML to Data Management: A Round Trip. Tutorial Part I, ICDE 2018. (Tutorial)
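The Q-Learning strategy behind V1 rests on the standard temporal-difference update rule, Q(s,a) ← Q(s,a) + α·(r + γ·max Q(s',·) − Q(s,a)). A minimal tabular illustration (the states and actions here are toy examples, not the library's internals):

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference update: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q

# Toy example: two pipeline states, two cleaning actions.
q = {"raw": {"impute": 0.0, "dedup": 0.0},
     "imputed": {"impute": 0.0, "dedup": 0.0}}
q = q_update(q, "raw", "impute", reward=1.0, next_state="imputed")
print(q["raw"]["impute"])  # 0.1
```

Repeating this update while exploring different task orderings is what lets the agent converge toward the pipeline that maximizes the downstream quality metric.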
- Hydra-Powered Configuration: Compose complex experiments (Datasets + Actions + Agents) using simple YAML files.
- Gymnasium Environments:
  - Sequential: step-by-step pipeline construction (MDP).
  - Permutation: combinatorial pipeline selection (Contextual Bandit).
- Deep RL Integration: Seamless compatibility with Stable Baselines3 (PPO, DQN).
- Comprehensive Benchmarking: Track data drift (Wasserstein distance), model performance, and pipeline quality using Weights & Biases.
- Universal Data Loaders: Native support for CSV, OpenML, Kaggle, and Hugging Face datasets.
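The data-drift tracking mentioned above compares a column's distribution before and after a cleaning step. A minimal illustration with SciPy's `wasserstein_distance` (the column values are made up for the example):

```python
from scipy.stats import wasserstein_distance

# Hypothetical 'age' column before and after an imputation step
# that fills one missing value with the mean (29.6).
before = [22.0, 38.0, 26.0, 35.0, 35.0]
after = [22.0, 38.0, 26.0, 35.0, 35.0, 29.6]

drift = wasserstein_distance(before, after)
print(f"Wasserstein drift introduced by imputation: {drift:.4f}")
```

A drift of zero means the cleaning step left the distribution untouched; larger values flag steps that reshape the data and deserve a closer look in the benchmark dashboards.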
Before starting, make sure you have:
- Python >=3.11, <3.14
- Poetry (Dependency Manager)
If Poetry is not yet installed, run:

```
pipx install poetry
```

or use the official installer:

```
curl -sSL https://install.python-poetry.org | python3 -
```

Alternatively, follow the official guide: https://python-poetry.org/docs/#installation
- Clone the repository:

```
git clone https://github.com/your-username/Learn2Clean.git
cd Learn2Clean
```

- Install dependencies via Poetry:

```
poetry install
```
This creates a virtual environment and installs all required libraries (Hydra, SB3, Pandas, Scikit-learn, etc.).
Learn2Clean uses Weights & Biases for experiment tracking, hyperparameter logging, and visualization of the Q-Learning optimization process. Follow these steps to get your environment ready.
Before running the project, you need a W&B account:
- Go to wandb.ai/site and click Sign Up.
- You can use your GitHub account, Google account, or an email address.
- Once logged in, go to your User Settings (scroll down to the "API keys" section) or visit wandb.ai/authorize.
- Copy your API Key; you will need it for the environment configuration.
To keep your API Key secure and avoid hardcoding secrets, we use a .env file.
- Create a file named `.env` in the root directory of the project:

```
touch .env
```

- Add your W&B API Key to the file:

```
WANDB_API_KEY=your_secret_api_key_here
```

You can log in to W&B using the CLI through Poetry:

```
poetry run wandb login
```

(If the `.env` file is set up correctly, W&B will automatically detect the key in many environments, but logging in manually ensures the session is active.)
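The `.env` convention itself is simple: one `KEY=value` pair per line. A stdlib-only sketch of how such a file is read into the process environment (in practice, a library like `python-dotenv` handles quoting and edge cases):

```python
import os

def load_env(text):
    """Parse KEY=value lines, skipping blanks and comments, into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore comments and empty lines
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

env = load_env("# W&B credentials\nWANDB_API_KEY=your_secret_api_key_here\n")
os.environ.update(env)  # make the key visible to wandb and child processes
print(env["WANDB_API_KEY"])
```

Keeping the key in `.env` (and `.env` in `.gitignore`) means it never lands in version control.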
The best way to learn the framework is to follow the 10-step tutorial series located in experiments/tutorials/.
Start the step-by-step guide here.
See the detailed documentation here.
| ID | Script | Description |
|---|---|---|
| 01 | `01_titanic_csv_dummy.py` | Hello World: Load a CSV and apply a single action. |
| 02 | `02_titanic_openml_dummy.py` | Hydra Basics: Swap datasets (OpenML) & override params via config. |
| 03 | `03_titanic_benchmark.py` | Action Space: Run every available cleaning tool on Titanic. |
| 04 | `04_titanic_wandb_benchmark.py` | Tracking: Log distance metrics (Wasserstein) to WandB. |
| 05 | `05_titanic_wandb_benchmark_full.py` | Deep Analysis: Generate impact heatmaps for all distance metrics. |
| 06 | `06_sequential_gymnasium_env.py` | RL Env: Interact with the SequentialCleaningEnv manually. |
| 07 | `07_permutation_space.py` | Math: Visualize the combinatorial explosion of pipelines. |
| 08 | `08_permutation_gymnasium_env.py` | Bandit Env: Interact with the PermutationsCleaningEnv. |
| 09 | `09_sequential_sb3_ppo.py` | Deep RL: Train a PPO Agent to build a cleaning pipeline. |
| 10 | `10_permutations_sb3_dqn.py` | Bandit RL: Train a DQN Agent to pick the best pipeline in one shot. |
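Tutorials 06 and 08 drive the environments through the standard Gymnasium API (`reset`/`step`). The interaction pattern, reduced to a dependency-free toy where each action is a cleaning step and the episode ends after a fixed pipeline length (class and action names here are illustrative, not the library's actual API):

```python
class ToySequentialCleaningEnv:
    """Toy MDP: build a pipeline step by step; reward favors non-repeated actions."""

    def __init__(self, actions=("impute", "dedup", "outlier", "normalize"), max_steps=3):
        self.actions = actions
        self.max_steps = max_steps

    def reset(self):
        self.pipeline = []
        return tuple(self.pipeline), {}  # (observation, info), as in Gymnasium

    def step(self, action_idx):
        action = self.actions[action_idx]
        reward = 1.0 if action not in self.pipeline else -1.0  # discourage repeats
        self.pipeline.append(action)
        terminated = len(self.pipeline) >= self.max_steps
        # Gymnasium's 5-tuple: obs, reward, terminated, truncated, info
        return tuple(self.pipeline), reward, terminated, False, {}

env = ToySequentialCleaningEnv()
obs, info = env.reset()
total = 0.0
for a in (0, 1, 2):  # impute -> dedup -> outlier
    obs, reward, terminated, truncated, info = env.step(a)
    total += reward
print(obs, total)  # ('impute', 'dedup', 'outlier') 3.0
```

The real environments return data statistics as observations and model-quality metrics as rewards, but the control flow an agent (or you, manually) follows is exactly this loop.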
Use `poetry run` to execute any script with the correct environment:

```
# Example: Train the PPO agent
poetry run python experiments/tutorials/09_sequential_sb3_ppo.py
```

Learn2Clean uses Hydra to manage configurations. All config files are in `experiments/configs/`.
You can override any parameter from the command line.
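Each override is just a dotted path into the composed config tree. A stdlib-only sketch of that mechanic (Hydra's real implementation adds type checking, interpolation, and config groups on top):

```python
def apply_override(cfg, override):
    """Apply a single 'a.b.c=value' override to a nested dict in place."""
    dotted, value = override.split("=", 1)
    keys = dotted.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Crude numeric coercion for the sketch; Hydra uses the config's declared types.
    node[keys[-1]] = float(value) if "." in value or value.isdigit() else value
    return cfg

cfg = {"agent": {"params": {"learning_rate": 0.0003}},
       "experiment": {"total_timesteps": 5000}}
apply_override(cfg, "agent.params.learning_rate=0.0005")
apply_override(cfg, "experiment.total_timesteps=10000")
print(cfg["agent"]["params"]["learning_rate"])  # 0.0005
```

The concrete command-line examples use exactly this dotted syntax.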
Example 1: Change the dataset to OpenML
```
poetry run python experiments/tutorials/01_titanic_csv_dummy.py dataset=openml
```

Example 2: Change the RL Agent hyperparameters
```
poetry run python experiments/tutorials/09_sequential_sb3_ppo.py \
  agent.params.learning_rate=0.0005 \
  experiment.total_timesteps=10000
```

```
Learn2Clean/
├── data/                      # Local datasets and documentation
│   ├── README.md
│   └── titanic.csv
│
├── docs/                      # Project documentation
│   ├── actions/               # Documentation for atomic actions
│   ├── tutorials/             # Markdown guides for the 10-step tutorial
│   │   ├── 01_foundations.md
│   │   ├── 02_benchmarking.md
│   │   ├── 03_environments.md
│   │   └── 04_reinforcement_learning.md
│   └── index.md
│
├── experiments/               # Experimentation & Tutorials Zone
│   ├── configs/               # Hydra Configuration Files (.yaml)
│   │   ├── action/            # Single action overrides
│   │   ├── actions/           # Action sets (cleaning, preparation...)
│   │   ├── agent/             # RL Agent params (PPO, DQN)
│   │   ├── dataset/           # Data source definitions
│   │   ├── distances/         # Metric definitions
│   │   ├── env/               # Environment settings
│   │   ├── experiment/        # Global experiment params
│   │   ├── hydra/             # Hydra logging & output settings
│   │   ├── paths/             # Project path management
│   │   ├── profiler/          # Data profiling settings
│   │   ├── tutorials/         # Specific configs for the 10 tutorials
│   │   ├── wandb/             # Weights & Biases settings
│   │   └── config.yaml        # Entry point config
│   │
│   ├── outputs/               # Artifacts generated by runs (Logs, Models)
│   ├── sandbox/               # Inspection scripts
│   ├── tools/                 # Helper scripts (WandB setup, Profiling, etc.)
│   └── tutorials/             # The 10 step-by-step executable scripts
│       ├── 01_titanic_csv_dummy.py
│       ├── 02_titanic_openml_dummy.py
│       ├── 03_titanic_benchmark.py
│       ├── 04_titanic_wandb_benchmark.py
│       ├── 05_titanic_wandb_benchmark_full.py
│       ├── 06_sequential_gymnasium_env.py
│       ├── 07_permutation_space.py
│       ├── 08_permutation_gymnasium_env.py
│       ├── 09_sequential_sb3_ppo.py
│       ├── 10_permutations_sb3_dqn.py
│       └── titanic_smart_reward.py
│
├── src/
│   └── learn2clean/           # Core Library Source Code
│       ├── actions/           # Atomic Action Implementations
│       │   ├── cleaning/      # Deduplication, Imputation, Outlier, Inconsistency
│       │   ├── preparation/   # Feature Selection, Scaling
│       │   └── data_frame_action.py
│       ├── agents/            # Agent logic placeholders
│       ├── configs/           # Structured Config Classes (Dataclasses)
│       ├── distance/          # Distance Metrics (Wasserstein, Skewness, etc.)
│       ├── envs/              # Gymnasium Environments
│       │   ├── permutations_cleaning_env.py
│       │   └── sequential_cleaning_env.py
│       ├── evaluation/        # ML Models for Reward Calculation
│       │   ├── classification/
│       │   ├── clustering/
│       │   └── regression/
│       ├── loaders/           # Data Loaders (CSV, OpenML, Kaggle, HF)
│       ├── observers/         # State Observers (Data Stats)
│       ├── rewards/           # Reward Functions
│       ├── spaces/            # Action Space Logic (Permutations)
│       ├── utils/             # Logging, Wrappers & Mixins
│       └── types.py           # Type definitions
│
├── tests/                     # Unit and Integration Tests
├── .gitignore
├── mypy.ini
├── poetry.lock
├── pyproject.toml
└── README.md
```
Learn2Clean supports multiple data sources out of the box. Configure them in experiments/configs/dataset/.
```yaml
# configs/dataset/titanic_csv.yaml
_target_: learn2clean.loaders.CSVLoader
file_path: ${hydra:runtime.cwd}/data/titanic.csv
```

```yaml
# configs/dataset/titanic_openml.yaml
_target_: learn2clean.loaders.OpenMLLoader
name: "titanic"
version: 1
```

Requires `~/.kaggle/kaggle.json` or environment variables.

```yaml
# configs/dataset/kaggle_custom.yaml
_target_: learn2clean.loaders.KaggleLoader
dataset_id: "zillow/zecon"
filename: "State_time_series.csv"
```

```yaml
# configs/dataset/hf_custom.yaml
_target_: learn2clean.loaders.HuggingFaceLoader
path: "julien-c/titanic-survival"
split: "train"
```
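Each loader config above follows Hydra's `_target_` convention: the key names a class to import, and the remaining keys become constructor arguments. A stdlib sketch of that mechanism, demonstrated with a standard-library class rather than learn2clean's loaders (Hydra's real `hydra.utils.instantiate` also handles recursion and interpolation):

```python
import importlib

def instantiate(config):
    """Import the class named by _target_ and call it with the remaining keys."""
    config = dict(config)  # copy so the caller's config is not mutated
    module_path, _, class_name = config.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**config)

# Same shape as the YAML configs above, using fractions.Fraction for illustration:
frac = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(frac)  # 3/4
```

This is why swapping `dataset=openml` for `dataset=titanic_csv` on the command line is enough to change the loader: only the `_target_` and its arguments differ.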
To ensure everything is working correctly, run the test suite:

```
poetry run pytest
```

To see coverage reports:

```
poetry run pytest --cov=learn2clean
```
Contributions are welcome!

- Fork the repository.
- Create your feature branch: `git checkout -b feature/AmazingFeature`
- Commit your changes: `git commit -m 'Add some AmazingFeature'`
- Push to the branch: `git push origin feature/AmazingFeature`
- Open a Pull Request.
Learn2Clean is licensed under the BSD 3-Clause "New" or "Revised" License.

