Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RayJob training example using pytorch resnet image classifier #2107

Merged

Conversation

andrewsykim
Copy link
Collaborator

Why are these changes needed?

Add an example RayJob based on the Finetuning a Pytorch Image Classifier with Ray Train example.

This example will be referenced for user guides that demonstrate distributed checkpointing with GCSFuse.

Also updates the existing pytorch text classifier with an example using GCSFuse.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to update files under pytorch-text-classifier/. If we update the YAML/Python files, users will fail to follow the original documentation, which does not focus on checkpointing.

bucketName: GCS_BUCKET
mountOptions: "implicit-dirs,uid=1000,gid=100"
workerGroupSpecs:
- replicas: 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run kubectl create -f ..., I got the following error. We should specify minReplicas and maxReplicas here.

error: error validating "ray-job.pytorch-image-classifier.yaml": error validating data: [ValidationError(RayJob.spec.rayClusterSpec.workerGroupSpecs[0]): missing required field "maxReplicas" in io.ray.v1.RayJob.spec.rayClusterSpec.workerGroupSpecs, ValidationError(RayJob.spec.rayClusterSpec.workerGroupSpecs[0]): missing required field "minReplicas" in io.ray.v1.RayJob.spec.rayClusterSpec.workerGroupSpecs]; if you choose to ignore these errors, turn validation off with --validate=false

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that's weird -- I didn't run into that error, maybe I didn't have validation enabled? I added both

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Both of them are required.

// MinReplicas denotes the minimum number of desired Pods for this worker group.
// +kubebuilder:default:=0
MinReplicas *int32 `json:"minReplicas"`
// MaxReplicas denotes the maximum number of desired Pods for this worker group, and the default value is maxInt32.
// +kubebuilder:default:=2147483647
MaxReplicas *int32 `json:"maxReplicas"`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those have kubebuilder defaults though, should it error if not specified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, the kubebuilder defaults will never be used because there is no omitempty in either MinReplicas or MaxReplicas.

MaxReplicas *int32 `json:"maxReplicas,omitempty"`

containers:
- name: ray-worker
image: rayproject/ray-ml:2.9.0-gpu
lifecycle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this.

lifecycle:
  preStop:
    exec:
      command: [ "/bin/sh","-c","ray stop" ]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@andrewsykim andrewsykim force-pushed the pytorch-lightning-image-classifier branch from 39c73c9 to 7b9759d Compare May 16, 2024 16:50
@andrewsykim
Copy link
Collaborator Author

I prefer not to update files under pytorch-text-classifier/. If we update the YAML/Python files, users will fail to follow the original documentation, which does not focus on checkpointing.

I removed changes in the pytorch-text-classifier sample. I wanted to have an example of it where it uses a shared filesystem, but I guess we can keep those two examples separate

@@ -131,15 +131,16 @@ def train_func(config):
# The checkpoints and metrics are reported by `RayTrainReportCallback`
run_config = RunConfig(
name="ptl-sent-classification",
storage_path="/mnt/cluster_storage",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this change? This change may cause users fail to follow this doc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this was leftover from the last change, reverting it

bucketName: GCS_BUCKET
mountOptions: "implicit-dirs,uid=1000,gid=100"
workerGroupSpecs:
- replicas: 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, the kubebuilder defaults will never be used because there is no omitempty in either MinReplicas or MaxReplicas.

MaxReplicas *int32 `json:"maxReplicas,omitempty"`

@kevin85421
Copy link
Member

The CI failure is not related to this PR. Merge.

@kevin85421 kevin85421 merged commit 732453e into ray-project:master May 16, 2024
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants