
Learn2Clean


Learn2Clean: Automated Data Cleaning with Reinforcement Learning

Learn2Clean V2 is a modular Python framework designed to optimize data preparation pipelines using Deep Reinforcement Learning (DRL). It wraps standard data cleaning tasks (Imputation, Deduplication, Outlier Removal, Normalization) into Gymnasium environments, allowing agents (PPO, DQN) to learn optimal cleaning strategies automatically.

Learn2Clean V1 is a Python library for data preprocessing and cleaning based on Q-Learning, a model-free reinforcement learning technique. Given a dataset, an ML model, and a quality metric, it selects the sequence of preparation tasks that maximizes the quality of the ML model's result.
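The Q-Learning idea can be illustrated with a stdlib-only toy: the state is the sequence of tasks applied so far, an action appends one of the remaining tasks, and the reward scores the finished sequence. Everything below (task names, reward values) is invented for illustration and is not Learn2Clean's API:

```python
import random

# Toy setting: order two cleaning tasks. The reward table is made up:
# it says imputing missing values before outlier removal works best.
TASKS = ("impute", "remove_outliers")
REWARD = {("impute", "remove_outliers"): 1.0,
          ("remove_outliers", "impute"): 0.2}

Q = {}                      # Q[(state, action)] -> estimated value
ALPHA, EPSILON = 0.5, 0.2   # learning rate, exploration rate
random.seed(0)

for _ in range(500):
    seq = []                                    # one episode: build a full sequence
    while len(seq) < len(TASKS):
        remaining = [t for t in TASKS if t not in seq]
        if random.random() < EPSILON:
            action = random.choice(remaining)   # explore
        else:                                   # exploit current estimates
            action = max(remaining, key=lambda t: Q.get((tuple(seq), t), 0.0))
        seq.append(action)
    reward = REWARD[tuple(seq)]
    # Monte-Carlo style update: pull every visited (state, action)
    # pair toward the final sequence reward.
    for i, action in enumerate(seq):
        key = (tuple(seq[:i]), action)
        Q[key] = Q.get(key, 0.0) + ALPHA * (reward - Q.get(key, 0.0))

best_first = max(TASKS, key=lambda t: Q.get(((), t), 0.0))
print("learned first task:", best_first)
```

After enough episodes the agent's Q-values rank "impute first" above the alternative, which is exactly the kind of ordering decision Learn2Clean automates at scale.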


For more details about V1, please refer to the paper presented at the Web Conf 2019 and the related tutorial.

  • Laure Berti-Equille. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. Proceedings of the Web Conf 2019, San Francisco, May 2019. Preprint

  • Laure Berti-Equille. ML to Data Management: A Round Trip. Tutorial Part I, ICDE 2018. Tutorial


Key Features of Learn2Clean V2

  • Hydra-Powered Configuration: Compose complex experiments (Datasets + Actions + Agents) using simple YAML files.
  • Gymnasium Environments:
    • Sequential: Step-by-step pipeline construction (MDP).
    • Permutation: Combinatorial pipeline selection (Contextual Bandit).
  • Deep RL Integration: Seamless compatibility with Stable Baselines3 (PPO, DQN).
  • Comprehensive Benchmarking: Track data drift (Wasserstein distance), model performance, and pipeline quality using Weights & Biases.
  • 🔌 Universal Data Loaders: Native support for CSV, OpenML, Kaggle, and Hugging Face datasets.
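As background for the drift tracking mentioned above: for two equal-size 1-D samples, the Wasserstein-1 distance reduces to the mean absolute difference of their sorted values. A stdlib sketch (Learn2Clean's own metrics live under src/learn2clean/distance/, this is not that implementation):

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size samples each point is matched to its order statistic,
    so the distance is the mean absolute difference of the sorted values.
    """
    assert len(xs) == len(ys), "this shortcut needs equal-size samples"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a sample by a constant c moves the distance by exactly c:
raw = [1.0, 2.0, 3.0, 4.0]
cleaned = [x + 0.5 for x in raw]
print(wasserstein_1d(raw, cleaned))  # 0.5
```

This is why the metric is a natural drift measure: aggressive cleaning steps that reshape the distribution show up directly in the distance.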

📦 Requirements

Before starting, make sure you have:

  • Python >=3.11, <3.14
  • Poetry (Dependency Manager)

Install Poetry

If Poetry is not yet installed, run:

pipx install poetry

Or with the official installer

curl -sSL https://install.python-poetry.org | python3 -

Or follow the official guide: https://python-poetry.org/docs/#installation


Project Setup

  1. Clone the repository:

    git clone https://github.com/your-username/Learn2Clean.git
    cd Learn2Clean
  2. Install dependencies via Poetry:

    poetry install

    This creates a virtual environment and installs all required libraries (Hydra, SB3, Pandas, Scikit-learn, etc.).

🚀 Weights & Biases (W&B) Setup Guide

Learn2Clean uses Weights & Biases for experiment tracking, hyperparameter logging, and visualization of the Q-Learning optimization process. Follow these steps to get your environment ready.

1. Create a W&B Account

Before running the project, you need a W&B account:

  1. Go to wandb.ai/site and click Sign Up.
  2. You can use your GitHub account, Google account, or an email address.
  3. Once logged in, go to your User Settings (scroll down to the "API keys" section) or visit wandb.ai/authorize.
  4. Copy your API Key; you will need it for the environment configuration.

2. Configure Environment Variables

To keep your API Key secure and avoid hardcoding secrets, we use a .env file.

  1. Create a file named .env in the root directory of the project:

    touch .env
  2. Add your W&B API Key to the file:

    WANDB_API_KEY=your_secret_api_key_here

3. Authentication & Initialization

You can log in to W&B using the CLI through Poetry:

poetry run wandb login

(If the .env is set up correctly, W&B will automatically detect the key in many environments, but manual login ensures the session is active.)
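If you ever need to load the key yourself (for example in a helper script before calling wandb.login), a minimal stdlib .env parser is enough; the python-dotenv package does the same job more robustly. A sketch, assuming plain KEY=VALUE lines as in the setup above:

```python
import os

def parse_dotenv(lines):
    """Parse KEY=VALUE lines; skips blanks and '#' comments (no quoting rules)."""
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def load_dotenv(path=".env"):
    """Read a .env file and export its variables into os.environ."""
    with open(path) as fh:
        env = parse_dotenv(fh)
    os.environ.update(env)
    return env

# load_dotenv()  # after this, W&B can pick WANDB_API_KEY up from the environment
```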

Tutorials

The best way to learn the framework is to follow the 10-step tutorial series located in experiments/tutorials/.

👉 Start the Step-by-Step Guide Here 👈

👉 See a detailed documentation Here 👈

| ID | Script | Description |
|----|--------|-------------|
| 01 | 01_titanic_csv_dummy.py | Hello World: Load a CSV and apply a single action. |
| 02 | 02_titanic_openml_dummy.py | Hydra Basics: Swap datasets (OpenML) & override params via config. |
| 03 | 03_titanic_benchmark.py | Action Space: Run every available cleaning tool on Titanic. |
| 04 | 04_titanic_wandb_benchmark.py | Tracking: Log distance metrics (Wasserstein) to WandB. |
| 05 | 05_titanic_wandb_benchmark_full.py | Deep Analysis: Generate impact heatmaps for all distance metrics. |
| 06 | 06_sequential_gymnasium_env.py | RL Env: Interact with the SequentialCleaningEnv manually. |
| 07 | 07_permutation_space.py | Math: Visualize the combinatorial explosion of pipelines. |
| 08 | 08_permutation_gymnasium_env.py | Bandit Env: Interact with the PermutationsCleaningEnv. |
| 09 | 09_sequential_sb3_ppo.py | Deep RL: Train a PPO Agent to build a cleaning pipeline. |
| 10 | 10_permutations_sb3_dqn.py | Bandit RL: Train a DQN Agent to pick the best pipeline in one shot. |
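Tutorial 07's combinatorial explosion is easy to quantify: with n available actions there are P(n, k) = n!/(n-k)! ordered pipelines that use k of them, so the whole space (pipelines of any length) is the sum over k. A quick stdlib check:

```python
from math import perm

def pipeline_count(n_actions):
    """Number of ordered pipelines drawn from n distinct actions
    (any length, including the empty pipeline): sum over k of P(n, k)."""
    return sum(perm(n_actions, k) for k in range(n_actions + 1))

for n in (3, 5, 8):
    print(n, pipeline_count(n))  # n=3 -> 16
```

Even a modest action catalog produces far too many candidate pipelines to enumerate, which is the motivation for letting an RL agent search the space instead.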

How to Run a Tutorial

Use poetry run to execute any script with the correct environment:

# Example: Train the PPO agent
poetry run python experiments/tutorials/09_sequential_sb3_ppo.py
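Tutorials 06 and 08 drive the environments by hand, following the standard Gymnasium reset/step contract. The stdlib-only mock below shows that control flow; the class and its toy reward are illustrative stand-ins, not the real SequentialCleaningEnv:

```python
class MockCleaningEnv:
    """Toy stand-in for a sequential cleaning env: each action is an
    index into ACTIONS; the episode ends once every action was applied."""
    ACTIONS = ("impute", "deduplicate", "remove_outliers", "normalize")

    def reset(self, seed=None):
        self.applied = []
        return {"applied": tuple(self.applied)}, {}  # observation, info

    def step(self, action):
        self.applied.append(self.ACTIONS[action])
        terminated = len(self.applied) == len(self.ACTIONS)
        reward = 1.0 if terminated else 0.0          # toy end-of-episode reward
        # Gymnasium 5-tuple: obs, reward, terminated, truncated, info
        return {"applied": tuple(self.applied)}, reward, terminated, False, {}

env = MockCleaningEnv()
obs, info = env.reset()
total, terminated, step_count = 0.0, False, 0
while not terminated:
    obs, reward, terminated, truncated, info = env.step(step_count)
    total += reward
    step_count += 1
print(obs["applied"], total)
```

The real environments follow the same loop, which is what makes them directly consumable by Stable Baselines3 agents.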

⚙️ Configuration (Hydra)

Learn2Clean uses Hydra to manage configurations. All config files are in experiments/configs/.

You can override any parameter from the command line.

Example 1: Change the dataset to OpenML

poetry run python experiments/tutorials/01_titanic_csv_dummy.py dataset=openml

Example 2: Change the RL Agent hyperparameters

poetry run python experiments/tutorials/09_sequential_sb3_ppo.py \
    agent.params.learning_rate=0.0005 \
    experiment.total_timesteps=10000
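Conceptually, Hydra resolves each dotted override into a nested config update. The stdlib sketch below mimics that mapping for intuition only; it is not Hydra's actual implementation:

```python
def apply_overrides(cfg, overrides):
    """Apply 'a.b.c=value' style overrides to a nested dict in place."""
    for item in overrides:
        dotted, _, raw = item.partition("=")
        keys = dotted.split(".")
        node = cfg
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        # crude type coercion: try int, then float, fall back to the raw string
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw
        node[keys[-1]] = value
    return cfg

cfg = {"agent": {"params": {"learning_rate": 0.0003}}}
apply_overrides(cfg, ["agent.params.learning_rate=0.0005",
                      "experiment.total_timesteps=10000"])
print(cfg)
```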

🧰 Project Structure

Learn2Clean/
├── data/                       # Local datasets and documentation
│   ├── README.md
│   └── titanic.csv
│
├── docs/                       # Project documentation
│   ├── actions/                # Documentation for atomic actions
│   ├── tutorials/              # Markdown guides for the 10-step tutorial
│   │   ├── 01_foundations.md
│   │   ├── 02_benchmarking.md
│   │   ├── 03_environments.md
│   │   └── 04_reinforcement_learning.md
│   └── index.md
│
├── experiments/                # Experimentation & Tutorials Zone
│   ├── configs/                # Hydra Configuration Files (.yaml)
│   │   ├── action/             # Single action overrides
│   │   ├── actions/            # Action sets (cleaning, preparation...)
│   │   ├── agent/              # RL Agent params (PPO, DQN)
│   │   ├── dataset/            # Data source definitions
│   │   ├── distances/          # Metric definitions
│   │   ├── env/                # Environment settings
│   │   ├── experiment/         # Global experiment params
│   │   ├── hydra/              # Hydra logging & output settings
│   │   ├── paths/              # Project path management
│   │   ├── profiler/           # Data profiling settings
│   │   ├── tutorials/          # Specific configs for the 10 tutorials
│   │   ├── wandb/              # Weights & Biases settings
│   │   └── config.yaml         # Entry point config
│   │
│   ├── outputs/                # Artifacts generated by runs (Logs, Models)
│   ├── sandbox/                # Inspection scripts
│   ├── tools/                  # Helper scripts (WandB setup, Profiling, etc.)
│   └── tutorials/              # The 10 Step-by-step executable scripts
│       ├── 01_titanic_csv_dummy.py
│       ├── 02_titanic_openml_dummy.py
│       ├── 03_titanic_benchmark.py
│       ├── 04_titanic_wandb_benchmark.py
│       ├── 05_titanic_wandb_benchmark_full.py
│       ├── 06_sequential_gymnasium_env.py
│       ├── 07_permutation_space.py
│       ├── 08_permutation_gymnasium_env.py
│       ├── 09_sequential_sb3_ppo.py
│       ├── 10_permutations_sb3_dqn.py
│       └── titanic_smart_reward.py
│
├── src/
│   └── learn2clean/            # Core Library Source Code
│       ├── actions/            # Atomic Action Implementations
│       │   ├── cleaning/       # Deduplication, Imputation, Outlier, Inconsistency
│       │   ├── preparation/    # Feature Selection, Scaling
│       │   └── data_frame_action.py
│       ├── agents/             # Agent logic placeholders
│       ├── configs/            # Structured Config Classes (Dataclasses)
│       ├── distance/           # Distance Metrics (Wasserstein, Skewness, etc.)
│       ├── envs/               # Gymnasium Environments
│       │   ├── permutations_cleaning_env.py
│       │   └── sequential_cleaning_env.py
│       ├── evaluation/         # ML Models for Reward Calculation
│       │   ├── classification/
│       │   ├── clustering/
│       │   └── regression/
│       ├── loaders/            # Data Loaders (CSV, OpenML, Kaggle, HF)
│       ├── observers/          # State Observers (Data Stats)
│       ├── rewards/            # Reward Functions
│       ├── spaces/             # Action Space Logic (Permutations)
│       ├── utils/              # Logging, Wrappers & Mixins
│       └── types.py            # Type definitions
│
├── tests/                      # Unit and Integration Tests
├── .gitignore
├── mypy.ini
├── poetry.lock
├── pyproject.toml
└── README.md

💾 Data Loading

Learn2Clean supports multiple data sources out of the box. Configure them in experiments/configs/dataset/.

1. Local CSV

# configs/dataset/titanic_csv.yaml
_target_: learn2clean.loaders.CSVLoader
file_path: ${hydra:runtime.cwd}/data/titanic.csv

2. OpenML

# configs/dataset/titanic_openml.yaml
_target_: learn2clean.loaders.OpenMLLoader
name: "titanic"
version: 1

3. Kaggle

Requires ~/.kaggle/kaggle.json or environment variables.

# configs/dataset/kaggle_custom.yaml
_target_: learn2clean.loaders.KaggleLoader
dataset_id: "zillow/zecon"
filename: "State_time_series.csv"

4. Hugging Face

# configs/dataset/hf_custom.yaml
_target_: learn2clean.loaders.HuggingFaceLoader
path: "julien-c/titanic-survival"
split: "train"

🧪 Testing

To ensure everything is working correctly, run the test suite:

poetry run pytest

To see coverage reports:

poetry run pytest --cov=learn2clean
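Tests under tests/ follow the standard pytest shape: plain test_* functions containing assertions, discovered automatically by poetry run pytest. A self-contained illustration of that shape (dedup_rows here is a hypothetical helper, not the library's deduplication action):

```python
def dedup_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence in order."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def test_dedup_keeps_first_occurrence():
    assert dedup_rows([("a", 1), ("b", 2), ("a", 1)]) == [("a", 1), ("b", 2)]

def test_dedup_empty():
    assert dedup_rows([]) == []
```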


🤝 Contributing

Contributions are welcome!

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/AmazingFeature).
  3. Commit your changes (git commit -m 'Add some AmazingFeature').
  4. Push to the branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

🪪 License

Learn2Clean is licensed under the BSD 3-Clause "New" or "Revised" License.


👀 Author

Laure Berti
