Develop #1

Merged
merged 33 commits into from
Mar 27, 2024
Changes from 22 commits
e1a71c2
add pipeline from config
diegomarvid Mar 4, 2024
74369b1
enable custom step registration via folder path in config.json
diegomarvid Mar 5, 2024
93d304d
change load and register function to classmethod
diegomarvid Mar 5, 2024
3e63edf
add load and save in json
diegomarvid Mar 5, 2024
aa670ec
add simple readme
diegomarvid Mar 5, 2024
c5210f1
add feature importance to fit step
diegomarvid Mar 5, 2024
5f273be
fix optuna storage default param
diegomarvid Mar 5, 2024
a00981f
add extras for better pip installation
diegomarvid Mar 5, 2024
e4ddec5
save xgboost model to joblib
diegomarvid Mar 6, 2024
871bbec
adjust poetry extras
diegomarvid Mar 6, 2024
b447fc9
change config dict in init to **params
diegomarvid Mar 13, 2024
fef94d9
split step_registry to new class
diegomarvid Mar 13, 2024
14e05b9
refactor step registry
diegomarvid Mar 13, 2024
8d91815
improve readme
diegomarvid Mar 14, 2024
4819fad
improve explainer dashboard
diegomarvid Mar 14, 2024
548dac4
fix flake8 errors
diegomarvid Mar 14, 2024
67978af
remove importance in XGBoost Fit since we have Explainer Dashboard
diegomarvid Mar 14, 2024
86edc65
add generate and clean steps to core
diegomarvid Mar 18, 2024
b85720f
add calculate features and improve clean step
diegomarvid Mar 18, 2024
3b3db93
improve efficiency and error handling in calculate features step
diegomarvid Mar 18, 2024
5d65358
improve readme and fix minor issues
diegomarvid Mar 19, 2024
0ec5f16
change static variable to getter and setter
diegomarvid Mar 19, 2024
e2e99ec
add data flow key for improving step data connection
diegomarvid Mar 19, 2024
4808ab6
add more metrics for palf :)
diegomarvid Mar 19, 2024
2c96b4b
add model interface
diegomarvid Mar 25, 2024
ac8443c
differentiate between train & predict steps at exec
diegomarvid Mar 25, 2024
38279c0
delete unnecesary steps
diegomarvid Mar 25, 2024
c32b1ab
refactor metrics
diegomarvid Mar 25, 2024
bc9a4df
improve tabular split and add calc train metrics
diegomarvid Mar 25, 2024
58bca73
save runs & metrics to folder
diegomarvid Mar 25, 2024
6276cc5
update readme
diegomarvid Mar 26, 2024
556ef59
refactor clean step
diegomarvid Mar 26, 2024
a049dee
fix typehints in data container
diegomarvid Mar 26, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -6,6 +6,10 @@ __pycache__/
# C extensions
*.so

*.joblib
*.bin
*.json

# ignore examples folder
examples/

323 changes: 219 additions & 104 deletions README.md
@@ -1,104 +1,219 @@
# Repo Template

Kick off a project on the right foot.

A repository template for easily setting up a well-behaved development environment for a smooth
collaboration experience.

This template takes care of setting up and configuring:

- A **virtual environment**
- **Formatting and linting** tools
- Some shared default **VSCode settings**
- A **Pull Request template**
- A **GitHub Action** that runs formatting and linting checks

Any of these configurations and features can be freely disabled or modified after setup if the team
chooses to.

Note: [pyenv](https://github.com/pyenv/pyenv#installation) and
[poetry](https://python-poetry.org/docs/#installation) are used for setting up a virtual environment
with the correct python version. Make sure both of those are installed correctly on your machine.

# Usage

1. Click the `Use this template` button at the top of this repo's home page to spawn a new repo
from this template.

2. Clone the new repo to your local environment.

3. Run `sh init.sh <your_project_name> <python version>`.

Note that:

- the project's accepted python versions will be set to `^<python version>` - feel free
to change this manually in the `pyproject.toml` file after running the script.
- your project's source code should be placed in the newly-created folder with your project's
name, so that absolute imports (`from my_project.my_module import func`) work everywhere.

4. Nuke this readme and the `init.sh` file.

5. Add to git the changes made by the init script, such as the newly created `poetry.toml`,
`poetry.lock` and `.python-version` files.

6. Commit and push your changes - your project is all set up.

7. [Recommended] Set up the following in your GitHub project's `Settings` tab:
- Enable branch protection for the `main` branch in the `Branches` menu to prevent non-reviewed
pushes/merges to it.
- Enable `Automatically delete head branches` in the `General` tab for feature branches to be
cleaned up when merged.

# For ongoing projects

If you want to improve the current configs of an existing project, these files are the ones you'll
probably want to steal some content from:

- [VSCode settings](.vscode/settings.json)
- [Flake8 config](.flake8)
- [Black and iSort configs](pyproject.toml)
- [Style check GitHub Action](.github/workflows/style-checks.yaml)

Additionally, you might want to check the
[project's source code is correctly installed via Poetry](https://stackoverflow.com/questions/66586856/how-can-i-make-my-project-available-in-the-poetry-environment)
for intra-project imports to work as expected across the board.

# For developers of this template

To test new changes made to this template:

1. Run the template in test mode with `test=true sh init.sh <your_project_name> <python version>`,
which will not delete the [project_base/test.py](project_base/test.py) file from the source
directory.

2. Use that file to check everything works as expected (see details in its docstring).

3. Make sure not to version any of the files created by the script. `git reset --hard` + manually
deleting the created files not yet added to versioning works, for example.

# Issues and suggestions

Feel free to report issues or propose improvements to this template via GitHub issues or through the
`#team-tech-meta` channel in Slack.

# Can I use it without Poetry?

This template currently sets up your virtual environment via poetry only.

If you want to use a different dependency manager, you'll have to manually do the following:

1. Remove the `.venv` environment and the `pyproject.toml` and `poetry.lock` files.
2. Create a new environment with your dependency manager of choice.
3. Install flake8, black, and isort as dev dependencies.
4. Install the current project's source.
5. Set the path to your new environment's python in the `python.pythonPath` and
`python.defaultInterpreterPath` in [vscode settings](.vscode/settings.json).

Disclaimer: this has not been tested; additional steps may be needed.

# Troubleshooting

### pyenv not picking up correct python version from .python-version

Make sure the `PYENV_VERSION` env var isn't set in your current shell
(and if it is, run `unset PYENV_VERSION`).
# Pipeline Library

The Pipeline Library is a powerful and flexible tool designed to simplify the creation and management of machine learning pipelines. It provides a high-level interface for defining and executing pipelines, allowing users to focus on the core aspects of their machine learning projects. The library currently supports XGBoost models, with plans to expand support for more models in the future.

## Features

* Intuitive and easy-to-use API for defining pipeline steps and configurations
* Support for various data loading formats, including CSV and Parquet
* Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
* Seamless integration with XGBoost for model training and prediction
* Hyperparameter optimization using Optuna for fine-tuning models
* Evaluation metrics calculation and reporting
* Explainable AI (XAI) dashboard for model interpretability
* Extensible architecture for adding custom pipeline steps

## Installation

To install the Pipeline Library, you need to have Python 3.9 or higher and Poetry installed. Follow these steps:

1. Clone the repository:

```bash
git clone https://github.com/tryolabs/pipeline-lib.git
```

2. Navigate to the project directory:

```bash
cd pipeline-lib
```

3. Install the dependencies using Poetry:

```bash
poetry install
```

If you want to include optional dependencies, you can specify the extras:

```bash
poetry install --extras "xgboost"
```

or

```bash
poetry install --extras "all_models"
```

## Usage

Here's an example of how to use the library to run an XGBoost pipeline:

1. Create a `train.json` file with the following content:

```json
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "path": "examples/ocf/data/trainset_new.parquet"
        }
      },
      {
        "step_type": "CleanStep",
        "parameters": {
          "drop_na_columns": ["average_power_kw"],
          "drop_ids": {
            "ss_id": [7759, 7061]
          }
        }
      },
      {
        "step_type": "CalculateFeaturesStep",
        "parameters": {
          "datetime_columns": ["date"],
          "features": ["year", "month", "day", "hour", "minute"]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.95
        }
      },
      {
        "step_type": "XGBoostFitModelStep",
        "parameters": {
          "target": "average_power_kw",
          "drop_columns": ["ss_id", "operational_at", "total_energy_kwh"],
          "xgb_params": {
            "max_depth": 12,
            "eta": 0.12410097733370863,
            "objective": "reg:squarederror",
            "eval_metric": "mae",
            "n_jobs": -1,
            "n_estimators": 672,
            "min_child_weight": 7,
            "subsample": 0.8057743223537057,
            "colsample_bytree": 0.6316852278944352,
            "early_stopping_rounds": 10
          },
          "save_path": "model.joblib"
        }
      }
    ]
  }
}
```
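Because the configuration is plain JSON, it can also be loaded and inspected programmatically before running the pipeline. A small stdlib-only sketch (the config is inlined here with shortened, hypothetical values so the snippet stands alone; in practice you would read `train.json` from disk):

```python
import json

# Inspect which step types a pipeline config declares before running it.
config_text = """
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "steps": [
      {"step_type": "GenerateStep", "parameters": {"path": "data.parquet"}},
      {"step_type": "TabularSplitStep", "parameters": {"train_percentage": 0.95}}
    ]
  }
}
"""

config = json.loads(config_text)
step_types = [step["step_type"] for step in config["pipeline"]["steps"]]
print(step_types)  # → ['GenerateStep', 'TabularSplitStep']
```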

2. Run the pipeline using the following code:

```python
import logging

from pipeline_lib.core import Pipeline

logging.basicConfig(level=logging.INFO)

Pipeline.from_json("train.json").run()
```

3. Create a `predict.json` file with the pipeline configuration for prediction:

```json
{
  "pipeline": {
    "name": "XGBoostPredictionPipeline",
    "description": "Prediction pipeline for XGBoost models.",
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "path": "examples/ocf/data/testset_new.parquet"
        }
      },
      {
        "step_type": "CleanStep",
        "parameters": {
          "drop_na_columns": ["average_power_kw"]
        }
      },
      {
        "step_type": "CalculateFeaturesStep",
        "parameters": {
          "datetime_columns": ["date"],
          "features": ["year", "month", "day", "hour", "minute"]
        }
      },
      {
        "step_type": "XGBoostPredictStep",
        "parameters": {
          "target": "average_power_kw",
          "drop_columns": ["ss_id", "operational_at", "total_energy_kwh"],
          "load_path": "model.joblib"
        }
      },
      {
        "step_type": "CalculateMetricsStep",
        "parameters": {}
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "max_samples": 1000
        }
      }
    ]
  }
}
```

4. Run the prediction pipeline:

```python
Pipeline.from_json("predict.json").run()
```

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.
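The idea behind custom steps is the common "registry of named steps" pattern: a step class is registered under a name, and the `"step_type"` string in the JSON config is resolved through that registry. A minimal self-contained sketch of the pattern — `PipelineStep`, `register_step`, and `MyCustomStep` here are illustrative stand-ins, not the library's actual classes:

```python
# Illustrative sketch of the step-registry pattern, not the library's real API.
STEP_REGISTRY: dict = {}

def register_step(cls):
    """Register a step class under its class name so configs can refer to it."""
    STEP_REGISTRY[cls.__name__] = cls
    return cls

class PipelineStep:
    """Stand-in base class: each step transforms and returns a data mapping."""
    def __init__(self, **params):
        self.params = params

    def execute(self, data: dict) -> dict:
        raise NotImplementedError

@register_step
class MyCustomStep(PipelineStep):
    """Hypothetical step that adds a derived, scaled column."""
    def execute(self, data: dict) -> dict:
        factor = self.params.get("factor", 2)
        data["scaled"] = [x * factor for x in data["values"]]
        return data

# A config's "step_type" string resolves through the registry:
step = STEP_REGISTRY["MyCustomStep"](factor=3)
print(step.execute({"values": [1, 2]})["scaled"])  # → [3, 6]
```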

## Contributing

Contributions to the Pipeline Library are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
6 changes: 6 additions & 0 deletions pipeline_lib/__init__.py
@@ -0,0 +1,6 @@
from .core.pipeline import Pipeline

Pipeline.step_registry.auto_register_steps_from_package("pipeline_lib.core.steps")
Pipeline.step_registry.auto_register_steps_from_package(
"pipeline_lib.implementation.tabular.xgboost"
)
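The `auto_register_steps_from_package` calls above scan a package and register the step classes it contains. A rough stand-alone sketch of how that kind of package scan is typically built with `pkgutil` and `importlib` (not the library's actual implementation; demonstrated against the stdlib `json` package so it runs anywhere):

```python
import importlib
import inspect
import pkgutil

def auto_register_from_package(package_name: str, registry: dict) -> None:
    """Import every module in a package and record the classes it exposes."""
    package = importlib.import_module(package_name)
    for mod_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{mod_info.name}")
        for name, obj in inspect.getmembers(module, inspect.isclass):
            registry.setdefault(name, obj)

registry: dict = {}
auto_register_from_package("json", registry)
print("JSONDecoder" in registry)  # → True
```

In the real library this scan would additionally filter for subclasses of the step base class before registering.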
1 change: 0 additions & 1 deletion pipeline_lib/core/__init__.py
@@ -1,3 +1,2 @@
from .data_container import DataContainer # noqa: F401
from .pipeline import Pipeline # noqa: F401
from .steps import PipelineStep # noqa: F401