-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RayJob training example using pytorch resnet image classifier #2107
Add RayJob training example using pytorch resnet image classifier #2107
Conversation
24b91ee
to
da09f40
Compare
da09f40
to
39c73c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to update files under pytorch-text-classifier/
. If we update the YAML/Python files, users will fail to follow the original documentation, which does not focus on checkpointing.
bucketName: GCS_BUCKET | ||
mountOptions: "implicit-dirs,uid=1000,gid=100" | ||
workerGroupSpecs: | ||
- replicas: 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I run kubectl create -f ...
, I got the following error. We should specify minReplicas
and maxReplicas
here.
error: error validating "ray-job.pytorch-image-classifier.yaml": error validating data: [ValidationError(RayJob.spec.rayClusterSpec.workerGroupSpecs[0]): missing required field "maxReplicas" in io.ray.v1.RayJob.spec.rayClusterSpec.workerGroupSpecs, ValidationError(RayJob.spec.rayClusterSpec.workerGroupSpecs[0]): missing required field "minReplicas" in io.ray.v1.RayJob.spec.rayClusterSpec.workerGroupSpecs]; if you choose to ignore these errors, turn validation off with --validate=false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, that's weird -- I didn't run into that error, maybe I didn't have validation enabled? I added both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Both of them are required.
kuberay/ray-operator/apis/ray/v1/raycluster_types.go
Lines 53 to 58 in 7fb46ab
// MinReplicas denotes the minimum number of desired Pods for this worker group. | |
// +kubebuilder:default:=0 | |
MinReplicas *int32 `json:"minReplicas"` | |
// MaxReplicas denotes the maximum number of desired Pods for this worker group, and the default value is maxInt32. | |
// +kubebuilder:default:=2147483647 | |
MaxReplicas *int32 `json:"maxReplicas"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those have kubebuilder defaults though, should it error if not specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding, the kubebuilder defaults will never be used because there is no omitempty
in either MinReplicas
or MaxReplicas
.
MaxReplicas *int32 `json:"maxReplicas,omitempty"`
containers: | ||
- name: ray-worker | ||
image: rayproject/ray-ml:2.9.0-gpu | ||
lifecycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this.
lifecycle:
preStop:
exec:
command: [ "/bin/sh","-c","ray stop" ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
…ckpointing Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
39c73c9
to
7b9759d
Compare
Signed-off-by: Andrew Sy Kim <[email protected]>
I removed changes in the pytorch-text-classifier sample. I wanted to have an example of it where it uses a shared filesystem, but I guess we can keep those two examples separate |
@@ -131,15 +131,16 @@ def train_func(config): | |||
# The checkpoints and metrics are reported by `RayTrainReportCallback` | |||
run_config = RunConfig( | |||
name="ptl-sent-classification", | |||
storage_path="/mnt/cluster_storage", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this change? This change may cause users fail to follow this doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this was leftover from the last change, reverting it
Signed-off-by: Andrew Sy Kim <[email protected]>
bucketName: GCS_BUCKET | ||
mountOptions: "implicit-dirs,uid=1000,gid=100" | ||
workerGroupSpecs: | ||
- replicas: 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding, the kubebuilder defaults will never be used because there is no omitempty
in either MinReplicas
or MaxReplicas
.
MaxReplicas *int32 `json:"maxReplicas,omitempty"`
The CI failure is not related to this PR. Merge. |
Why are these changes needed?
Add an example RayJob based on the Finetuning a Pytorch Image Classifier with Ray Train example.
This example will be referenced for user guides that demonstrate distributed checkpointing with GCSFuse.
Also updates the existing pytorch text classifier with an example using GCSFuse.
Related issue number
Checks