-# Repo Template
-
-Kick off a project with the right foot.
-
-A repository template for easily setting up a well behaved development environment for a smooth
-collaboration experience.
-
-This template takes care of setting up and configuring:
-
-- A **virtual environment**
-- **Formatting and linting** tools
-- Some shared default **VSCode settings**
-- A **Pull Request template**
-- A **GitHub Action** that runs formatting and linting checks
-
-Any of these configurations and features can be disabled/modified freely after set up if the team
-chooses to.
-
-Note: [pyenv](https://github.com/pyenv/pyenv#installation) and
-[poetry](https://python-poetry.org/docs/#installation) are used for setting up a virtual environment
-with the correct python version. Make sure both of those are installed correctly in your machine.
-
-# Usage
-
-1. Click the `Use this template` button at the top of this repo's home page to spawn a new repo
-   from this template.
-
-2. Clone the new repo to your local environment.
-
-3. Run `sh init.sh <your_project_name> <python version>`.
-
-   Note that:
-
-   - the project's accepted python versions will be set to `^<python version>` - feel free
-     to change this manually in the `pyproject.toml` file after running the script.
-   - your project's source code should be placed in the newly-created folder with your project's
-     name, so that absolute imports (`from my_project.my_module import func`) work everywhere.
-
-4. Nuke this readme and the `init.sh` file.
-
-5. Add to git the changes made by the init script, such as the newly created `poetry.toml`,
-   `poetry.lock` and `.python-version` files.
-
-6. Commit and push your changes - your project is all set up.
-
-7. [Recommended] Set up the following in your GitHub project's `Settings` tab:
-   - Enable branch protection for the `main` branch in the `Branches` menu to prevent non-reviewed
-     pushes/merges to it.
-   - Enable `Automatically delete head branches` in the `General` tab for feature branches to be
-     cleaned up when merged.
-
-# For ongoing projects
-
-If you want to improve the current configs of an existing project, these files are the ones you'll
-probably want to steal some content from:
-
-- [VSCode settings](.vscode/settings.json)
-- [Flake8 config](.flake8)
-- [Black and iSort configs](pyproject.toml)
-- [Style check GitHub Action](.github/workflows/style-checks.yaml)
-
-Additionally, you might want to check the
-[project's source code is correctly installed via Poetry](https://stackoverflow.com/questions/66586856/how-can-i-make-my-project-available-in-the-poetry-environment)
-for intra-project imports to work as expected across the board.
-
-# For developers of this template
-
-To test new changes made to this template:
-
-1. Run the template in test mode with `test=true sh init.sh <your_project_name> <python version>`,
-   which will not delete the [project_base/test.py](project_base/test.py) file from the source
-   directory.
-
-2. Use that file to check everything works as expected (see details in its docstring).
-
-3. Make sure not to version any of the files created by the script. `git reset --hard` + manually
-   deleting the created files not yet added to versioning works, for example.
-
-# Issues and suggestions
-
-Feel free to report issues or propose improvements to this template via GitHub issues or through the
-`#team-tech-meta` channel in Slack.
-
-# Can I use it without Poetry?
-
-This template currently sets up your virtual environment via poetry only.
-
-If you want to use a different dependency manager, you'll have to manually do the following:
-
-1. Remove the `.venv` environment and the `pyproject.toml` and `poetry.lock` files.
-2. Create a new environment with your dependency manager of choice.
-3. Install flake, black and isort as dev dependencies.
-4. Install the current project's source.
-5. Set the path to your new environment's python in the `python.pythonPath` and
-   `python.defaultInterpreterPath` in [vscode settings](.vscode/settings.json).
-
-Disclaimer: this has not been tested, additional steps may be needed.
-
-# Troubleshooting
-
-### pyenv not picking up correct python version from .python-version
-
-Make sure the `PYENV_VERSION` env var isn't set in your current shell
-(and if it is, run `unset PYENV_VERSION`).
+# Pipeline Library
+
+The purpose of this library is to create pipelines for ML as simple as possible. At the moment we support XGBoost models, but we are working to support more models.
+
+This is an example of how to use the library to run an XGBoost pipeline:
+
+We create a `train.json` file with the following content:
+
+```json
+{
+    "custom_steps_path": "examples/ocf/",
+    "save_path": "runs/xgboost_train.pkl",
+    "pipeline": {
+        "name": "XGBoostTrainingPipeline",
+        "description": "Training pipeline for XGBoost models.",
+        "steps": [
+            {
+                "step_type": "OCFGenerateStep",
+                "parameters": {
+                    "path": "examples/ocf/data/trainset_new.parquet"
+                }
+            },
+            {
+                "step_type": "OCFCleanStep",
+                "parameters": {}
+            },
+            {
+                "step_type": "TabularSplitStep",
+                "parameters": {
+                    "id_column": "ss_id",
+                    "train_percentage": 0.95
+                }
+            },
+            {
+                "step_type": "XGBoostFitModelStep",
+                "parameters": {
+                    "target": "average_power_kw",
+                    "drop_columns": [
+                        "ss_id"
+                    ],
+                    "xgb_params": {
+                        "max_depth": 12,
+                        "eta": 0.12410097733370863,
+                        "objective": "reg:squarederror",
+                        "eval_metric": "mae",
+                        "n_jobs": -1,
+                        "n_estimators": 2,
+                        "min_child_weight": 7,
+                        "subsample": 0.8057743223537057,
+                        "colsample_bytree": 0.6316852278944352
+                    },
+                    "save_model": true
+                }
+            }
+        ]
+    }
+}
+```
+
+The user can define custom steps to generate and clean their own data and use them in the pipeline. Then we can run the pipeline with the following code:
+
+```python
+import logging
+
+from pipeline_lib.core import Pipeline
+
+logging.basicConfig(level=logging.INFO)
+
+Pipeline.from_json("train.json").run()
+```
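
The new README says a `train.json` config names each step by `step_type` and passes it a `parameters` object, but it does not show how such a config could be turned into running code. The sketch below illustrates one common way to do that, a name-to-class registry. It does not import or reproduce `pipeline_lib`; every name in it (`STEP_REGISTRY`, `register_step`, `GenerateNumbersStep`, `DropOddStep`, `run_from_config`, the `execute` method) is hypothetical and chosen only for illustration, and the library's real step interface may differ.

```python
import json

# Hypothetical registry: maps the "step_type" string used in the JSON
# config to the Python class that implements the step.
STEP_REGISTRY = {}


def register_step(cls):
    """Class decorator that registers a step under its class name."""
    STEP_REGISTRY[cls.__name__] = cls
    return cls


@register_step
class GenerateNumbersStep:
    """Toy 'generate' step: produces the integers 0..count-1."""

    def __init__(self, count):
        self.count = count

    def execute(self, data):
        data["values"] = list(range(self.count))
        return data


@register_step
class DropOddStep:
    """Toy 'clean' step: keeps only the even values."""

    def execute(self, data):
        data["values"] = [v for v in data["values"] if v % 2 == 0]
        return data


def run_from_config(config):
    """Instantiate each configured step and run the steps in order."""
    data = {}
    for spec in config["pipeline"]["steps"]:
        step_cls = STEP_REGISTRY[spec["step_type"]]
        step = step_cls(**spec.get("parameters", {}))
        data = step.execute(data)
    return data


# A config with the same shape as train.json, using the toy steps above.
CONFIG = json.loads("""
{
    "pipeline": {
        "name": "ToyPipeline",
        "steps": [
            {"step_type": "GenerateNumbersStep", "parameters": {"count": 6}},
            {"step_type": "DropOddStep", "parameters": {}}
        ]
    }
}
""")

result = run_from_config(CONFIG)
print(result)
```

With a layout like this, adding a custom step amounts to defining one more registered class and referencing its name in the config, which is roughly the workflow the README describes for user-defined generate and clean steps.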