-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat][kubectl-plugin] Implement kubectl ray job submit with Deletion Policy API for RayJob Cleanup #3064
base: master
Are you sure you want to change the base?
Conversation
@@ -161,6 +162,7 @@ func NewJobSubmitCommand(streams genericclioptions.IOStreams) *cobra.Command { | |||
cmd.Flags().StringVar(&options.workerMemory, "worker-memory", "4Gi", "amount of memory in each worker group replica") | |||
cmd.Flags().StringVar(&options.workerGPU, "worker-gpu", "0", "number of GPUs in each worker group replica") | |||
cmd.Flags().BoolVar(&options.dryRun, "dry-run", false, "print the generated YAML instead of creating the cluster. Only works when filename is not provided") | |||
cmd.Flags().BoolVar(&options.cleanupJob, "cleanup", false, "Delete the Ray job after job completion") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we can have this be a --deletion-policy
that maps to this: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L112
For compatibility maybe we need a flag for both shutdownAfterJobFinishes
and deletionPolicy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rewrited the code and introduced the shutdown-after-job-finishes
, deletion-policy
, and ttl-seconds-after-finished
flags for cleanup.
3c14973
to
c5f33b3
Compare
7aeff99
to
70ddaa8
Compare
ec4a3ca
to
a673e44
Compare
73c9099
to
748812f
Compare
dashboardAddr = "http://localhost:8265" | ||
clusterTimeout = 120.0 | ||
portforwardtimeout = 60.0 | ||
rayjobDeletionTimeout = 30.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Judge within 30 seconds whether the event RayJobDeletionPolicy feature gate must be enabled to use the DeletionPolicy feature
is received.
…olicy flag Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
Signed-off-by: win5923 <[email protected]>
d94a1ec
to
188d52b
Compare
func (c *k8sClient) WaitRayJobDeletionPolicyEnabled(ctx context.Context, namespace, name string, startTime time.Time, timeout time.Duration) error { | ||
timeoutCtx, cancel := context.WithTimeout(ctx, timeout*time.Second) | ||
defer cancel() | ||
|
||
watcher, err := c.KubernetesClient().CoreV1().Events(namespace).Watch(ctx, metav1.ListOptions{ | ||
FieldSelector: fmt.Sprintf("involvedObject.name=%s", name), | ||
}) | ||
if err != nil { | ||
return fmt.Errorf("failed to watch events for RayJob %s in namespace %s: %w", name, namespace, err) | ||
} | ||
defer watcher.Stop() | ||
|
||
for { | ||
select { | ||
case <-timeoutCtx.Done(): | ||
// If the RayJobDeletionPolicy feature event is not received within the timeout period, it is considered enabled. | ||
return nil | ||
case event := <-watcher.ResultChan(): | ||
if event.Type == watch.Error { | ||
return fmt.Errorf("error watching events: %v", event.Object) | ||
} | ||
|
||
e, ok := event.Object.(*corev1.Event) | ||
if !ok { | ||
continue | ||
} | ||
|
||
if strings.Contains(e.Message, "RayJobDeletionPolicy feature gate must be enabled to use the DeletionPolicy feature") { | ||
if e.FirstTimestamp.Time.After(startTime) || e.LastTimestamp.Time.After(startTime) { | ||
return fmt.Errorf("%s", e.Message) | ||
} | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will block for 30 seconds until the RayJobDeletionPolicy feature event is found. If the event is not found within 30 seconds, it assume that the user has enabled RayJobDeletionPolicy by default.
There might be a better way, but I haven't thought of one yet.
Why are these changes needed?
Adds the
shutdown-after-job-finishes
,deletion-policy
andttl-seconds-after-finished
flag to thekubectl ray job submit
to allow users to manage RayJob CRs cleanup more easily by configuring deletion policies.This behavior relies on ray-operator enable
RayJobDeletionPolicy
feature gate.If the user has not enabled the RayJobDeletionPolicy
ttl-seconds-after-finished
but not setshutdown-after-job-finishes
ttl-seconds-after-finished
andshutdown-after-job-finishes
deletion-policy
to DeleteSelfdeletion-policy
to DeleteClusterdeletion-policy
to DeleteWorkersdeletion-policy
to DeleteNoneRelated issue number
Checks