Add e2e test for train API #2199
Merged
google-oss-prow merged 114 commits into kubeflow:master from helenxie-bit:add-e2e-test-for-train-api on Dec 22, 2024
Commits
15b6cb0 add e2e test for train API (helenxie-bit)
daa0054 fix peft import error (helenxie-bit)
8d4af90 update settings of the job (helenxie-bit)
86c31c8 fix format (helenxie-bit)
01870e2 fix format (helenxie-bit)
17f3c33 fix error detection (helenxie-bit)
0685dc7 resolve conflict (helenxie-bit)
83de64b resolve conflict (helenxie-bit)
f954f2d resolve conflict (helenxie-bit)
ff48154 fix format (helenxie-bit)
304db5d fix NoneType error (helenxie-bit)
486154d fix format (helenxie-bit)
016c41d test bug (helenxie-bit)
1e7bd23 find bug (helenxie-bit)
1aced61 find bug (helenxie-bit)
3100aae find bug (helenxie-bit)
e5b9061 add storage_config (helenxie-bit)
ffb0685 fix format (helenxie-bit)
dc1b48a reduce pvc size (helenxie-bit)
8894517 set storage_config (helenxie-bit)
36872d7 set storage_config (helenxie-bit)
7dd8d40 set storage_config (helenxie-bit)
60c322d set storage_config (helenxie-bit)
dd970ab use gpu (helenxie-bit)
10bbfa0 use gpu (helenxie-bit)
d47d6a6 use gpu (helenxie-bit)
4ccd4a7 fix 'set_device' error (helenxie-bit)
0750322 add timeout error (helenxie-bit)
5ca0923 fix format (helenxie-bit)
387eb84 fix format (helenxie-bit)
9cc5429 fix format (helenxie-bit)
8a537ad fix typo (helenxie-bit)
e508ef4 update e2e test for train api (helenxie-bit)
788359b add num_labels (helenxie-bit)
9b4222e update pip install (helenxie-bit)
d75938d check disk space (helenxie-bit)
1148bc8 change sequence of e2e tests (helenxie-bit)
d29a85d add clean-up after each e2e test of pytorchjob (helenxie-bit)
82ea9be update cleanup function (helenxie-bit)
b45f9f7 update cleanup function (helenxie-bit)
a204746 update cleanup function-add check disk (helenxie-bit)
2d8f8b1 check docker volumes (helenxie-bit)
c748d0e update cleanup function (helenxie-bit)
a68e182 update cleanup function (helenxie-bit)
227129e check docker directory (helenxie-bit)
79e9e32 update pip install and 'num_workers' (helenxie-bit)
b7dbf5c update pip install and 'num_workers' (helenxie-bit)
1f639a7 update pip install (helenxie-bit)
8322730 change the value of 'clean_pod_policy' (helenxie-bit)
ed10574 change the value of 'update cleanup function (helenxie-bit)
50ed9e8 update cleanup function (helenxie-bit)
b2cd27a update cleanup function (helenxie-bit)
3af5d87 check docker volumes (helenxie-bit)
1a0eff3 check docker volumes (helenxie-bit)
604265a stop the controller and restart it again to clean up (helenxie-bit)
a4f848f update cleanup function (helenxie-bit)
3e86e90 update cleanup function (helenxie-bit)
558330b update cleanup function (helenxie-bit)
d4ed2d8 separate e2e test for train api (helenxie-bit)
7a2ae05 fix format (helenxie-bit)
9efcce5 fix parameter of namespace (helenxie-bit)
a443ea2 fix format (helenxie-bit)
85fd8e6 reduce resources (helenxie-bit)
1a0c455 separate e2e test for train API (helenxie-bit)
afe4240 remove go setup (helenxie-bit)
250b830 adjust the version of k8s (helenxie-bit)
c5b39a4 move test file to new place (helenxie-bit)
fa99a92 fix typos (helenxie-bit)
f0d8cc4 rerun tests (helenxie-bit)
d2c3cac update install packages (helenxie-bit)
c3f04c3 Merge remote-tracking branch 'upstream/master' into add-e2e-test-for-… (helenxie-bit)
9f42449 build and verify images of storage-intializer and trainer (helenxie-bit)
bb406ce fix image build error (helenxie-bit)
f0b6b38 fix image build error (helenxie-bit)
45eb7e0 check disk space (helenxie-bit)
f217794 make 'setup-storage-initializer-and-trainer' executable (helenxie-bit)
083e155 separate step of loading images (helenxie-bit)
dc74844 check disk space after loading image (helenxie-bit)
de18ef0 clean up and check disk space (helenxie-bit)
ef8742c prune docker build cache (helenxie-bit)
1eb3ef1 prune docker build cache (helenxie-bit)
1e407a5 adjust sequence of building and loading images (helenxie-bit)
7519559 move working directory (helenxie-bit)
f5d63c4 delete moving working directory (helenxie-bit)
08c8562 fix format (helenxie-bit)
d2ae542 use 'docker system prune' (helenxie-bit)
09fc8a9 make the format of the commands to be consistent (helenxie-bit)
a27e1a2 update base image (helenxie-bit)
59d8582 update base image (helenxie-bit)
581d2bc update base image (helenxie-bit)
1140a11 delete unnecessary space clear and check code (helenxie-bit)
82de69a merge e2e test for train api into integration tests (helenxie-bit)
f50094a resolve conflict in integration tests (helenxie-bit)
5efaf3b check for timeout error (helenxie-bit)
13ae587 fix name of trainer image (helenxie-bit)
dd4c2be fix env of building storage initializer image (helenxie-bit)
b21bedd clean format (helenxie-bit)
1669055 skip e2e test for train API when use scheduling (helenxie-bit)
3d91dfc Update name of fileholder (helenxie-bit)
ba7297a fix format (helenxie-bit)
16645f9 separate e2e test for train API (helenxie-bit)
5ec175f fix format (helenxie-bit)
b7986e6 move test script (helenxie-bit)
267cbe8 update path to test script (helenxie-bit)
617dba3 update path to test script (helenxie-bit)
b5ae618 rerun tests (helenxie-bit)
9b997a3 rerun tests (helenxie-bit)
6e8f3f7 rerun tests (helenxie-bit)
f4bb238 update kubernetes version (helenxie-bit)
775ff67 update kubernetes version (helenxie-bit)
fc21273 rerun tests (helenxie-bit)
4c51b76 rerun tests (helenxie-bit)
52c73b4 adjust kubernetes version to 1.30.6 (helenxie-bit)
13a0de2 adjust kubernetes version to 1.31.4 (helenxie-bit)
@@ -0,0 +1,61 @@
name: E2E Test with train API
on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.31.4"]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup E2E Tests
        uses: ./.github/workflows/setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: ${{ matrix.python-version }}

      - name: Build trainer
        run: |
          ./scripts/gha/build-trainer.sh
        env:
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load trainer
        run: |
          kind load docker-image ${{ env.TRAINER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Build storage initializer
        run: |
          ./scripts/gha/build-storage-initializer.sh
        env:
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load storage initializer
        run: |
          kind load docker-image ${{ env.STORAGE_INITIALIZER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test

      - name: Run tests
        run: |
          pip install pytest
          python3 -m pip install -e sdk/python[huggingface]
          pytest -s sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py --log-cli-level=debug
        env:
          STORAGE_INITIALIZER_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_TRANSFORMER_IMAGE_DEFAULT: kubeflowtraining/trainer:test
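For readers less familiar with the `concurrency` block in this workflow, the following toy Python model sketches what `cancel-in-progress: true` is documented to do: a new run that shares a group key with a run still in progress cancels the older one. This is only an illustration, not GitHub's implementation; `ConcurrencyTracker`, the run ids, and the ref strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConcurrencyTracker:
    """Toy model of `cancel-in-progress: true` (hypothetical, for illustration)."""

    in_progress: dict = field(default_factory=dict)  # group key -> run id
    cancelled: list = field(default_factory=list)

    def start(self, workflow: str, ref: str, run_id: int) -> str:
        # Mirrors the group key `${{ github.workflow }}-${{ github.ref }}`.
        group = f"{workflow}-{ref}"
        previous = self.in_progress.get(group)
        if previous is not None:
            # An older run in the same group is cancelled by the newer one.
            self.cancelled.append(previous)
        self.in_progress[group] = run_id
        return group


tracker = ConcurrencyTracker()
# Two pushes to the same pull request share a group; the ref values are made up.
tracker.start("E2E Test with train API", "refs/pull/2199/merge", run_id=1)
tracker.start("E2E Test with train API", "refs/pull/2199/merge", run_id=2)
# A run for a different pull request lands in its own group.
tracker.start("E2E Test with train API", "refs/pull/1/merge", run_id=3)
print(tracker.cancelled)
```

Only the first run is cancelled here, because the third run has a different ref and therefore a different group key.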
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow Training storage initializer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/storage_initializer -t "${STORAGE_INITIALIZER_CI_IMAGE}" -f sdk/python/kubeflow/storage_initializer/Dockerfile
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow Training trainer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/trainer -t "${TRAINER_CI_IMAGE}" -f sdk/python/kubeflow/trainer/Dockerfile.cpu
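Both build scripts follow the same shape: fail fast if the image variable is unset (`set -o nounset`), then call `docker build` with a context, a tag, and a Dockerfile. A rough Python equivalent, offered only as a sketch, makes that contract explicit. The helper name `docker_build_argv` is hypothetical, and the function only assembles the command rather than running it.

```python
import os


def docker_build_argv(context: str, dockerfile: str, image_env: str) -> list:
    """Assemble the `docker build` command used by the scripts (hypothetical helper)."""
    # os.environ[...] raises KeyError when the variable is unset,
    # which plays the role of `set -o nounset` in the shell scripts.
    image = os.environ[image_env]
    return ["docker", "build", context, "-t", image, "-f", dockerfile]


# Value taken from the workflow's env block for the "Build trainer" step.
os.environ["TRAINER_CI_IMAGE"] = "kubeflowtraining/trainer:test"
argv = docker_build_argv(
    context="sdk/python/kubeflow/trainer",
    dockerfile="sdk/python/kubeflow/trainer/Dockerfile.cpu",
    image_env="TRAINER_CI_IMAGE",
)
print(" ".join(argv))
```

To execute the command, `subprocess.run(argv, check=True)` would raise on a non-zero exit status, roughly matching `set -o errexit`.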
@@ -0,0 +1,18 @@
# Use an official Python runtime as a parent image
FROM python:3.11

# Set the working directory in the container
WORKDIR /app

# Copy the requirements.txt file into the container
COPY requirements.txt /app/requirements.txt

# Install a pinned PyTorch first, then the packages specified in requirements.txt
RUN pip install --no-cache-dir torch==2.5.1
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python package and its source code into the container
COPY . /app

# Run hf_llm_training.py with torchrun when the container launches
ENTRYPOINT ["torchrun", "hf_llm_training.py"]
sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py (96 changes: 96 additions & 0 deletions)
@@ -0,0 +1,96 @@
# Copyright 2024 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from kubeflow.training import TrainingClient, constants
from peft import LoraConfig

import test.e2e.utils as utils

logging.basicConfig(format="%(message)s")
logging.getLogger("kubeflow.training.api.training_client").setLevel(logging.DEBUG)

TRAINING_CLIENT = TrainingClient(job_kind=constants.PYTORCHJOB_KIND)


def test_sdk_e2e_create_from_train_api(job_namespace="default"):
    JOB_NAME = "pytorchjob-from-train-api"

    # Use test case from fine-tuning API tutorial.
    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
    TRAINING_CLIENT.train(
        name=JOB_NAME,
        namespace=job_namespace,
        # BERT model URI and type of Transformer to train it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
            num_labels=5,
        ),
        # In order to save test time, use 8 samples from Yelp dataset.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:8]",
        ),
        # Specify HuggingFace Trainer parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                evaluation_strategy="no",
                do_eval=False,
                disable_tqdm=True,
                log_level="info",
                num_train_epochs=1,
            ),
            # Set LoRA config to reduce number of trainable parameters.
            lora_config=LoraConfig(
                r=8,
                lora_alpha=8,
                lora_dropout=0.1,
                bias="none",
            ),
        ),
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={
            "gpu": 0,
            "cpu": 2,
            "memory": "10G",
        },
        storage_config={
            "size": "10Gi",
            "access_modes": ["ReadWriteOnce"],
        },
    )

    logging.info(f"List of created {TRAINING_CLIENT.job_kind}s")
    logging.info(TRAINING_CLIENT.list_jobs(job_namespace))

    try:
        utils.verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, wait_timeout=900)
    except Exception as e:
        utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
        TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
        raise Exception(f"PyTorchJob create from API E2E fails. Exception: {e}")

    utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
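The test repeats the same two cleanup calls on both the success and the failure path. One possible alternative, shown here only as a sketch and not as what the PR does, is to move them into a `finally` clause. `FakeClient`, `run_with_cleanup`, `failing_verify`, and `report` are hypothetical stand-ins so the snippet runs without a cluster; they are not part of the Kubeflow SDK.

```python
class FakeClient:
    """Hypothetical stand-in for TrainingClient; it only records calls."""

    def __init__(self):
        self.calls = []

    def delete_job(self, name, namespace):
        self.calls.append(("delete_job", name, namespace))


def run_with_cleanup(client, name, namespace, verify, report):
    try:
        verify(client, name, namespace)
    except Exception as e:
        # Same message as the test above, chained to the original error.
        raise Exception(f"PyTorchJob create from API E2E fails. Exception: {e}") from e
    finally:
        # Runs on success and on failure, before any exception propagates.
        report(client, name, namespace)
        client.delete_job(name, namespace)


def failing_verify(client, name, namespace):
    # Simulates a job that never reaches the expected condition.
    raise TimeoutError("job did not finish")


def report(client, name, namespace):
    # Plays the role of utils.print_job_results in the test.
    client.calls.append(("report", name, namespace))


client = FakeClient()
outcome = None
try:
    run_with_cleanup(client, "pytorchjob-from-train-api", "default", failing_verify, report)
except Exception as e:
    outcome = str(e)

print(client.calls)
print(outcome)
```

Even though verification fails, the results are reported and the job is deleted exactly once before the wrapped exception reaches the caller.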
Shall we change the Kubernetes version to align with the other CI tests? For example:
https://github.com/kubeflow/training-operator/blob/69094e16309382d929606f8c5ce9a9d8c00308b1/.github/workflows/test-example-notebooks.yaml#L16-L18

To save compute resources, I think for now we can run this test on a single k8s version, since we run the rest of the E2E tests on all versions.
WDYT @Electronic-Waste?

Yeah, I agree. Maybe we can select one k8s version from this list :)

OK, let me change the version to v1.30.6. We can update it in the future if needed.

Given that we support 1.28-1.31, I would suggest that we run our integration tests on 1.29, 1.30, and 1.31; we can update that in a following PR.
For the train API tests, I think running them on 1.31 should be sufficient.

Sounds good. So I think we will keep the v1.31.4 version.
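To make the compute argument in this thread concrete, the sketch below counts the jobs a GitHub Actions matrix would produce for the two options discussed. The Python versions come from the workflow in this PR; the three-version Kubernetes list reuses the minor versions named in the review comment and is only illustrative.

```python
from itertools import product

# Python versions from the workflow matrix in this PR.
python_versions = ["3.9", "3.10", "3.11"]


def matrix_jobs(kubernetes_versions, python_versions):
    """Expand a two-axis matrix into (kubernetes, python) job combinations."""
    return list(product(kubernetes_versions, python_versions))


# Option kept in the PR: a single Kubernetes version.
single = matrix_jobs(["v1.31.4"], python_versions)
# Illustrative alternative: the three minor versions named in the review.
three = matrix_jobs(["1.29", "1.30", "1.31"], python_versions)

print(len(single), len(three))
```

Running the train API test on one Kubernetes version yields 3 jobs per pull request instead of 9, and each job builds and loads two images before the fine-tuning run starts.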