katib: [USERGUIDE] LLM Hyperparameter Optimization API #3952

Open
mahdikhashan wants to merge 58 commits into master

Conversation

mahdikhashan
Member

ref: #3951


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Hi @mahdikhashan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mahdikhashan
Member Author

mahdikhashan commented Jan 7, 2025

Hi @andreyvelich, shall I keep it under user-guides/hp-tuning/?

@andreyvelich
Member

Sure, I think we can create a new page for this feature.
FYI, please follow the contribution guide to sign the commits: https://www.kubeflow.org/docs/about/contributing/#getting-started
cc @helenxie-bit

@andreyvelich
Member

Part of: kubeflow/katib#2339

@Arhell
Member

/ok-to-test


@helenxie-bit: GitHub didn't allow me to request PR reviews from the following users: kubeflow/wg-automl-leads.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Thanks for the great contribution!

/lgtm

/cc @kubeflow/wg-automl-leads @andreyvelich @Arhell

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow bot added the lgtm label Jan 18, 2025
@andreyvelich
Member

@varodrig @pdarshane @rimolive @hbelmiro Please help with reviewing this documentation page for LLM HP Tuning API.

google-oss-prow bot removed the lgtm label Jan 27, 2025
@helenxie-bit
Contributor

Thanks for the update!

/lgtm

or the [Kubeflow Katib GitHub](https://github.com/kubeflow/katib/issues).
{{% /alert %}}

This page describes Large Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
Contributor

describes how to implement Hyperparameter optimization (HPO) using Python API ...

Member Author

done. thank you.

+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
Contributor

Each web page has a feedback button at the bottom for users to add their feedback and create an issue if needed.
cc @andreyvelich

Member

We explicitly added this warning for this guide, since this feature might be unstable, and we want to hear user feedback.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
Contributor

suggestion:

**How to implement Hyperparameter optimization (HPO)**

@andreyvelich to add comments on this.

Member

Should we keep this name:

How to Optimize Hyperparameters for LLMs Fine-Tuning with Kubeflow

Member Author

done.

- [Optimizing Hyperparameters of Large Language Models](#optimizing-hyperparameters-of-large-language-models)
- [Example: Optimizing Hyperparameters of Llama-3.2 for Binary Classification on IMDB Dataset](#example-optimizing-hyperparameters-of-llama-32-for-binary-classification-on-imdb-dataset)

## Prerequisites
Contributor

Thanks for including the prerequisites. I'm wondering if these prerequisites should be applied to all of docs/components/katib/user-guides/hp-tuning/ and, in that case, whether they should be listed on this page.

Member Author

I'm not sure - I checked some of the other similar docs under Katib, and I'd say for them it may not make sense.

Member

Usually, we don't need it since these Prerequisites are explained in the Getting Started guide.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
description = "API description"
Contributor

The description could include more information about this page.
Additionally, it would be great to have a short paragraph explaining more about this topic, what we are trying to achieve and why, and to include a reference to this topic for the audience to learn more about it.

Member Author

yes, you are right - I'll extend it. thanks for reminding me of this.

Member Author

done.

| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |

```python
Contributor

@mahdikhashan if you haven't tested the code yet, we should mark this PR as hold. Please let us know. Thank you.

@mahdikhashan
Member Author

@varodrig thanks for your time and help with reviewing it - I'll address your requested changes.
Regarding the code, we have a notebook (nb) example that we (@helenxie-bit and I) are collaborating on together.

nb example issue: kubeflow/katib#2480

There is an in-progress PR related to this (regarding e2e tests; it's not specifically about this, but I have held off to incorporate the latest possible changes).

google-oss-prow bot removed the lgtm label Feb 12, 2025

New changes are detected. LGTM label has been removed.

@andreyvelich changed the title from "[USERGUIDE] LLM Hyperparameter Optimization API" to "katib: [USERGUIDE] LLM Hyperparameter Optimization API" on Feb 13, 2025
@andreyvelich
Member

Thank you for this effort @mahdikhashan!
I left a few comments.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
Member

Should we keep this name:

How to Optimize Hyperparameters for LLMs Fine-Tuning with Kubeflow

@@ -0,0 +1,351 @@
+++
Member

I would keep this guide under /user-guides/llm-hp-optimization.md for now for more visibility.
WDYT @mahdikhashan @helenxie-bit @Electronic-Waste?

Member Author

agreed. done.

+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
Member

We explicitly added this warning for this guide, since this feature might be unstable, and we want to hear user feedback.

Comment on lines 13 to 14
This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.
Member

Modify this message to say that this page describes how to optimize HPs in the process of LLM fine-tuning.

Member Author

done.

This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.

## Sections
Member

We can remove this Sections list, since the website has an outline in the right panel.

Member Author

done.

)
```

#### HuggingFaceModelParams
Member

Can we move these sections to the Training Operator doc and cross-reference it from this doc?
https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/fine-tuning/
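
For context in this thread, here is a minimal sketch of what the HuggingFaceModelParams section covers. The import path and field values are assumptions based on the Training Operator (kubeflow-training) SDK, not the final documented API:

```python
# Hedged sketch: import path and fields are assumptions based on the
# kubeflow-training storage initializer and may differ from the final docs.
import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
)

# Model to fine-tune, pulled from the Hugging Face Hub
# (gated models such as Llama also need a Hugging Face access token).
model_params = HuggingFaceModelParams(
    model_uri="hf://meta-llama/Llama-3.2-1B",
    transformer_type=transformers.AutoModelForSequenceClassification,
)

# Dataset used by the fine-tuning trials, IMDB as in the example section.
dataset_params = HuggingFaceDatasetParams(
    repo_id="stanfordnlp/imdb",
    split="train[:1000]",
)
```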


### Key Parameters for LLM Hyperparameter Tuning

| **Parameter** | **Description** | **Required** |
Member

Not all of these parameters should be used for LLMs.
Please exclude the ones that can't be used with the LLM Trainer (e.g. objective).

secret_key="YOUR_SECRET_KEY"
)
```
## Optimizing Hyperparameters of Large Language Models
Member

We should clearly say that right now the user can tune parameters from training_parameters and lora_config.
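
For readers of this thread, a minimal sketch of what "tunable from training_parameters and lora_config" could look like. The class names and import paths are assumptions based on the Training Operator SDK and Katib's search API, not the final documented interface:

```python
# Hedged sketch: Katib search spaces embedded in the HuggingFace training
# arguments and the PEFT LoRA config. Import paths are assumptions.
import kubeflow.katib as katib
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

trainer_parameters = HuggingFaceTrainerParams(
    # Hyperparameters searched inside transformers.TrainingArguments.
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        num_train_epochs=3,
    ),
    # Hyperparameters searched inside the LoRA adapter config.
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
    ),
)
```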

algorithm_name = "random",
max_trial_count = 10,
parallel_trial_count = 2,
resources_per_trial={
Member

I guess we should use TrainerResource here, shouldn't we, @mahdikhashan @helenxie-bit?
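
For context, a hedged sketch of passing per-trial resources through a TrainerResources-style object instead of a plain dict. The class name, import path, and field names are assumptions and may not match the released SDK:

```python
# Hedged sketch: check the Katib SDK for the exact TrainerResources definition;
# the import path and fields below are assumptions.
from kubeflow.katib.types import TrainerResources

resources_per_trial = TrainerResources(
    num_workers=1,                # training workers launched per trial
    num_procs_per_worker=1,       # processes (e.g. GPUs) per worker
    resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
)
```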

cl.wait_for_experiment_condition(name=exp_name)

# Get the best hyperparameters.
print(cl.get_optimal_hyperparameters(exp_name))
Member

We need to show output for the Experiment here.
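
For context, a hedged sketch of one way the Experiment result could be surfaced in that section. The client methods are from the Katib Python SDK; the Experiment name is hypothetical and the printed output depends on the actual cluster run, so no sample output is shown here:

```python
# Hedged sketch: inspect the finished Experiment instead of only printing
# the optimal hyperparameters. Output depends on the actual run.
from kubeflow.katib import KatibClient

exp_name = "llm-hp-tuning"  # hypothetical Experiment name
cl = KatibClient(namespace="kubeflow")

# Best hyperparameter assignments found across all trials.
print(cl.get_optimal_hyperparameters(exp_name))

# The Experiment object also carries trial counts and conditions,
# which is useful when showing expected output in the guide.
experiment = cl.get_experiment(name=exp_name)
print(experiment.status)
```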
