Develop #1

Merged
merged 33 commits into from
Mar 27, 2024
Changes from 22 commits
e1a71c2
add pipeline from config
diegomarvid Mar 4, 2024
74369b1
enable custom step registration via folder path in config.json
diegomarvid Mar 5, 2024
93d304d
change load and register function to classmethod
diegomarvid Mar 5, 2024
3e63edf
add load and save in json
diegomarvid Mar 5, 2024
aa670ec
add simple readme
diegomarvid Mar 5, 2024
c5210f1
add feature importance to fit step
diegomarvid Mar 5, 2024
5f273be
fix optuna storage default param
diegomarvid Mar 5, 2024
a00981f
add extras for better pip installation
diegomarvid Mar 5, 2024
e4ddec5
save xgboost model to joblib
diegomarvid Mar 6, 2024
871bbec
adjust poetry extras
diegomarvid Mar 6, 2024
b447fc9
change config dict in init to **params
diegomarvid Mar 13, 2024
fef94d9
split step_registry to new class
diegomarvid Mar 13, 2024
14e05b9
refactor step registry
diegomarvid Mar 13, 2024
8d91815
improve readme
diegomarvid Mar 14, 2024
4819fad
improve explainer dashboard
diegomarvid Mar 14, 2024
548dac4
fix flake8 errors
diegomarvid Mar 14, 2024
67978af
remove importance in XGBoost Fit since we have Explainer Dashboard
diegomarvid Mar 14, 2024
86edc65
add generate and clean steps to core
diegomarvid Mar 18, 2024
b85720f
add calculate features and improve clean step
diegomarvid Mar 18, 2024
3b3db93
improve efficiency and error handling in calculate features step
diegomarvid Mar 18, 2024
5d65358
improve readme and fix minor issues
diegomarvid Mar 19, 2024
0ec5f16
change static variable to getter and setter
diegomarvid Mar 19, 2024
e2e99ec
add data flow key for improving step data connection
diegomarvid Mar 19, 2024
4808ab6
add more metrics for palf :)
diegomarvid Mar 19, 2024
2c96b4b
add model interface
diegomarvid Mar 25, 2024
ac8443c
differentiate between train & predict steps at exec
diegomarvid Mar 25, 2024
38279c0
delete unnecesary steps
diegomarvid Mar 25, 2024
c32b1ab
refactor metrics
diegomarvid Mar 25, 2024
bc9a4df
improve tabular split and add calc train metrics
diegomarvid Mar 25, 2024
58bca73
save runs & metrics to folder
diegomarvid Mar 25, 2024
6276cc5
update readme
diegomarvid Mar 26, 2024
556ef59
refactor clean step
diegomarvid Mar 26, 2024
a049dee
fix typehints in data container
diegomarvid Mar 26, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -6,6 +6,10 @@ __pycache__/
# C extensions
*.so

*.joblib
*.bin
*.json

# ignore examples folder
examples/

323 changes: 219 additions & 104 deletions README.md
@@ -1,104 +1,219 @@
# Repo Template

Kick off a project on the right foot.

A repository template for easily setting up a well-behaved development environment for a smooth
collaboration experience.

This template takes care of setting up and configuring:

- A **virtual environment**
- **Formatting and linting** tools
- Some shared default **VSCode settings**
- A **Pull Request template**
- A **GitHub Action** that runs formatting and linting checks

Any of these configurations and features can be freely disabled or modified after setup if the team
chooses to.

Note: [pyenv](https://github.com/pyenv/pyenv#installation) and
[poetry](https://python-poetry.org/docs/#installation) are used for setting up a virtual environment
with the correct python version. Make sure both of those are installed correctly on your machine.

# Usage

1. Click the `Use this template` button at the top of this repo's home page to spawn a new repo
from this template.

2. Clone the new repo to your local environment.

3. Run `sh init.sh <your_project_name> <python version>`.

Note that:

- the project's accepted python versions will be set to `^<python version>` - feel free
to change this manually in the `pyproject.toml` file after running the script.
- your project's source code should be placed in the newly-created folder with your project's
name, so that absolute imports (`from my_project.my_module import func`) work everywhere.

4. Nuke this readme and the `init.sh` file.

5. Add to git the changes made by the init script, such as the newly created `poetry.toml`,
`poetry.lock` and `.python-version` files.

6. Commit and push your changes - your project is all set up.

7. [Recommended] Set up the following in your GitHub project's `Settings` tab:
- Enable branch protection for the `main` branch in the `Branches` menu to prevent non-reviewed
pushes/merges to it.
- Enable `Automatically delete head branches` in the `General` tab for feature branches to be
cleaned up when merged.

# For ongoing projects

If you want to improve the current configs of an existing project, these files are the ones you'll
probably want to steal some content from:

- [VSCode settings](.vscode/settings.json)
- [Flake8 config](.flake8)
- [Black and iSort configs](pyproject.toml)
- [Style check GitHub Action](.github/workflows/style-checks.yaml)

Additionally, you might want to check the
[project's source code is correctly installed via Poetry](https://stackoverflow.com/questions/66586856/how-can-i-make-my-project-available-in-the-poetry-environment)
for intra-project imports to work as expected across the board.

# For developers of this template

To test new changes made to this template:

1. Run the template in test mode with `test=true sh init.sh <your_project_name> <python version>`,
which will not delete the [project_base/test.py](project_base/test.py) file from the source
directory.

2. Use that file to check everything works as expected (see details in its docstring).

3. Make sure not to version any of the files created by the script. `git reset --hard` + manually
deleting the created files not yet added to versioning works, for example.

# Issues and suggestions

Feel free to report issues or propose improvements to this template via GitHub issues or through the
`#team-tech-meta` channel in Slack.

# Can I use it without Poetry?

This template currently sets up your virtual environment via poetry only.

If you want to use a different dependency manager, you'll have to manually do the following:

1. Remove the `.venv` environment and the `pyproject.toml` and `poetry.lock` files.
2. Create a new environment with your dependency manager of choice.
3. Install flake8, black, and isort as dev dependencies.
4. Install the current project's source.
5. Set the path to your new environment's python in the `python.pythonPath` and
`python.defaultInterpreterPath` in [vscode settings](.vscode/settings.json).

Disclaimer: this has not been tested; additional steps may be needed.

# Troubleshooting

### pyenv not picking up correct python version from .python-version

Make sure the `PYENV_VERSION` env var isn't set in your current shell
(and if it is, run `unset PYENV_VERSION`).
# Pipeline Library

The Pipeline Library is a powerful and flexible tool designed to simplify the creation and management of machine learning pipelines. It provides a high-level interface for defining and executing pipelines, allowing users to focus on the core aspects of their machine learning projects. The library currently supports XGBoost models, with plans to expand support for more models in the future.

## Features

* Intuitive and easy-to-use API for defining pipeline steps and configurations
* Support for various data loading formats, including CSV and Parquet
* Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
* Seamless integration with XGBoost for model training and prediction
* Hyperparameter optimization using Optuna for fine-tuning models
* Evaluation metrics calculation and reporting
* Explainable AI (XAI) dashboard for model interpretability
* Extensible architecture for adding custom pipeline steps

## Installation

To install the Pipeline Library, you need to have Python 3.9 or higher and Poetry installed. Follow these steps:

1. Clone the repository:

```bash
git clone https://github.com/tryolabs/pipeline-lib.git
```

2. Navigate to the project directory:

```bash
cd pipeline-lib
```

3. Install the dependencies using Poetry:

```bash
poetry install
```

If you want to include optional dependencies, you can specify the extras:

```bash
poetry install --extras "xgboost"
```

or

```bash
poetry install --extras "all_models"
```

## Usage

Here's an example of how to use the library to run an XGBoost pipeline:

1. Create a `train.json` file with the following content:

```json
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "path": "examples/ocf/data/trainset_new.parquet"
        }
      },
      {
        "step_type": "CleanStep",
        "parameters": {
          "drop_na_columns": ["average_power_kw"],
          "drop_ids": {
            "ss_id": [7759, 7061]
          }
        }
      },
      {
        "step_type": "CalculateFeaturesStep",
        "parameters": {
          "datetime_columns": ["date"],
          "features": ["year", "month", "day", "hour", "minute"]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.95
        }
      },
      {
        "step_type": "XGBoostFitModelStep",
        "parameters": {
          "target": "average_power_kw",
          "drop_columns": ["ss_id", "operational_at", "total_energy_kwh"],
          "xgb_params": {
            "max_depth": 12,
            "eta": 0.12410097733370863,
            "objective": "reg:squarederror",
            "eval_metric": "mae",
            "n_jobs": -1,
            "n_estimators": 672,
            "min_child_weight": 7,
            "subsample": 0.8057743223537057,
            "colsample_bytree": 0.6316852278944352,
            "early_stopping_rounds": 10
          },
          "save_path": "model.joblib"
        }
      }
    ]
  }
}
```
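Because the configuration is plain JSON, it can also be loaded and inspected programmatically before running the pipeline. A small stdlib-only sketch (the config is inlined here with shortened, hypothetical values so the snippet stands alone; in practice you would read `train.json` from disk):

```python
import json

# Inspect which step types a pipeline config declares before running it.
config_text = """
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "steps": [
      {"step_type": "GenerateStep", "parameters": {"path": "data.parquet"}},
      {"step_type": "TabularSplitStep", "parameters": {"train_percentage": 0.95}}
    ]
  }
}
"""

config = json.loads(config_text)
step_types = [step["step_type"] for step in config["pipeline"]["steps"]]
print(step_types)  # → ['GenerateStep', 'TabularSplitStep']
```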

2. Run the pipeline using the following code:

```python
import logging

from pipeline_lib.core import Pipeline

logging.basicConfig(level=logging.INFO)

Pipeline.from_json("train.json").run()
```

3. Create a `predict.json` file with the pipeline configuration for prediction:

```json
{
  "pipeline": {
    "name": "XGBoostPredictionPipeline",
    "description": "Prediction pipeline for XGBoost models.",
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "path": "examples/ocf/data/testset_new.parquet"
        }
      },
      {
        "step_type": "CleanStep",
        "parameters": {
          "drop_na_columns": ["average_power_kw"]
        }
      },
      {
        "step_type": "CalculateFeaturesStep",
        "parameters": {
          "datetime_columns": ["date"],
          "features": ["year", "month", "day", "hour", "minute"]
        }
      },
      {
        "step_type": "XGBoostPredictStep",
        "parameters": {
          "target": "average_power_kw",
          "drop_columns": ["ss_id", "operational_at", "total_energy_kwh"],
          "load_path": "model.joblib"
        }
      },
      {
        "step_type": "CalculateMetricsStep",
        "parameters": {}
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "max_samples": 1000
        }
      }
    ]
  }
}
```

4. Run the prediction pipeline:

```python
Pipeline.from_json("predict.json").run()
```

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.
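The idea behind custom steps is the common "registry of named steps" pattern: a step class is registered under a name, and the `"step_type"` string in the JSON config is resolved through that registry. A minimal self-contained sketch of the pattern — `PipelineStep`, `register_step`, and `MyCustomStep` here are illustrative stand-ins, not the library's actual classes:

```python
# Illustrative sketch of the step-registry pattern, not the library's real API.
STEP_REGISTRY: dict = {}

def register_step(cls):
    """Register a step class under its class name so configs can refer to it."""
    STEP_REGISTRY[cls.__name__] = cls
    return cls

class PipelineStep:
    """Stand-in base class: each step transforms and returns a data mapping."""
    def __init__(self, **params):
        self.params = params

    def execute(self, data: dict) -> dict:
        raise NotImplementedError

@register_step
class MyCustomStep(PipelineStep):
    """Hypothetical step that adds a derived, scaled column."""
    def execute(self, data: dict) -> dict:
        factor = self.params.get("factor", 2)
        data["scaled"] = [x * factor for x in data["values"]]
        return data

# A config's "step_type" string resolves through the registry:
step = STEP_REGISTRY["MyCustomStep"](factor=3)
print(step.execute({"values": [1, 2]})["scaled"])  # → [3, 6]
```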

## Contributing

Contributions to the Pipeline Library are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
6 changes: 6 additions & 0 deletions pipeline_lib/__init__.py
@@ -0,0 +1,6 @@
from .core.pipeline import Pipeline

Pipeline.step_registry.auto_register_steps_from_package("pipeline_lib.core.steps")
Pipeline.step_registry.auto_register_steps_from_package(
"pipeline_lib.implementation.tabular.xgboost"
)
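The `auto_register_steps_from_package` calls above scan a package and register the step classes it contains. A rough stand-alone sketch of how that kind of package scan is typically built with `pkgutil` and `importlib` (not the library's actual implementation; demonstrated against the stdlib `json` package so it runs anywhere):

```python
import importlib
import inspect
import pkgutil

def auto_register_from_package(package_name: str, registry: dict) -> None:
    """Import every module in a package and record the classes it exposes."""
    package = importlib.import_module(package_name)
    for mod_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{mod_info.name}")
        for name, obj in inspect.getmembers(module, inspect.isclass):
            registry.setdefault(name, obj)

registry: dict = {}
auto_register_from_package("json", registry)
print("JSONDecoder" in registry)  # → True
```

In the real library this scan would additionally filter for subclasses of the step base class before registering.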
1 change: 0 additions & 1 deletion pipeline_lib/core/__init__.py
@@ -1,3 +1,2 @@
from .data_container import DataContainer # noqa: F401
from .pipeline import Pipeline # noqa: F401
from .steps import PipelineStep # noqa: F401