SDK Info Exposure Rework (status/details, wait, error-handling, etc.) #33

Open
Maxusmusti opened this issue Nov 21, 2022 · 4 comments

@Maxusmusti
Collaborator

Merge cluster.status() and cluster.is_ready() into a single function (likely still called status). It will tell the user exactly where their cluster currently is in the setup process (still in the AppWrapper stages or already in the Ray stages). A second function, cluster.details(), will then output the cluster information: all of the specs, worker count, URI, active/inactive state, etc. (what we currently see when calling cluster.status() on a fully set-up cluster).
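A minimal sketch of how that split could look, for discussion only. The stage names, the state strings ("Running", "ready"), and the `ClusterDetails` fields are all assumptions made up for illustration; none of them are taken from the SDK.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SetupStage(Enum):
    # Hypothetical stage names, chosen for illustration only.
    APPWRAPPER_PENDING = 1
    APPWRAPPER_RUNNING = 2
    RAY_STARTING = 3
    RAY_READY = 4
    UNKNOWN = 5


def status(appwrapper_state: Optional[str], ray_state: Optional[str]) -> Tuple[SetupStage, bool]:
    """Combine the old status() and is_ready() answers into one result."""
    # Once a Ray state is visible, the cluster is past the AppWrapper stages.
    if ray_state is not None:
        if ray_state == "ready":
            return SetupStage.RAY_READY, True
        return SetupStage.RAY_STARTING, False
    if appwrapper_state is not None:
        if appwrapper_state == "Running":
            return SetupStage.APPWRAPPER_RUNNING, False
        return SetupStage.APPWRAPPER_PENDING, False
    return SetupStage.UNKNOWN, False


@dataclass
class ClusterDetails:
    # Fields mirror the items listed in the comment; the names are assumptions.
    name: str
    workers: int
    dashboard_uri: str
    active: bool


def details(info: ClusterDetails) -> str:
    """Format the cluster information that status() used to print."""
    state = "active" if info.active else "inactive"
    return f"{info.name}: {info.workers} worker(s), {info.dashboard_uri}, {state}"
```

Here status() answers only "where is setup right now, and is it done?", while details() carries the descriptive output, so neither function has to do both jobs.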

@Maxusmusti
Collaborator Author

Also add a wait() function (likely a simple loop that polls the status() function mentioned above).
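A rough sketch of such a polling loop, assuming status() returns a `(status, ready)` pair. The function name `wait_until_ready`, the `timeout` and `poll_interval` parameters, and the injectable `sleep`/`clock` hooks are hypothetical and only there to make the idea concrete and testable.

```python
import time
from typing import Callable, Optional, Tuple


def wait_until_ready(
    status_fn: Callable[[], Tuple[object, bool]],
    timeout: Optional[float] = None,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until status_fn() reports ready, or raise TimeoutError."""
    start = clock()
    while True:
        _, ready = status_fn()
        if ready:
            return
        # Give up once the optional deadline has passed.
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"cluster not ready after {timeout} seconds")
        sleep(poll_interval)
```

With no timeout the loop waits indefinitely, which may or may not be the desired default; a caller could pass something like `timeout=300` to bound the wait.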

@Maxusmusti
Collaborator Author

Change the return value of status() from a bool to an info object.
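One possible shape for that info object, sketched under assumptions: the class name `StatusInfo` and its fields are hypothetical. Defining `__bool__` is just one option that would let existing boolean checks on the return value keep working.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusInfo:
    # Hypothetical fields; the thread only says "info object".
    stage: str
    ready: bool
    message: str = ""

    def __bool__(self) -> bool:
        # Keeps `if cluster.status():` style checks behaving like the old bool.
        return self.ready


info = StatusInfo(stage="STARTING", ready=False, message="waiting for Ray head")
if not info:
    print(f"not ready yet: {info.stage} ({info.message})")
```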

@MichaelClifford MichaelClifford moved this from In Progress to Todo in codeflare-sdk sprint board Jan 24, 2023
@Maxusmusti Maxusmusti assigned Maxusmusti and unassigned atinsood Jan 24, 2023
@Maxusmusti Maxusmusti moved this from Todo to In Progress in codeflare-sdk sprint board Feb 3, 2023
@Maxusmusti Maxusmusti moved this from Todo to In Progress in [deprecated] project-codeflare sprint board Feb 3, 2023
@Maxusmusti Maxusmusti changed the title Consolidate status() and is_ready(), and create a new function details() SDK Info Exposure Rework (status/details, wait, error-handling, etc.) Feb 13, 2023
@thinkahead

The RayCluster status is missing the state field (ray-project/kuberay#991).

oc create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v0.5.0&timeout=90s"

The status from a RayCluster shows:

status:
  availableWorkerReplicas: 2
  desiredWorkerReplicas: 1
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
  head:
    serviceIP: 172.21.234.58
  lastUpdateTime: "2023-06-05T20:01:31Z"
  maxWorkerReplicas: 1
  minWorkerReplicas: 1

This causes a problem for the CodeFlare SDK API (https://github.com/project-codeflare/codeflare-sdk/blob/main/src/codeflare_sdk/cluster/cluster.py#L428): it looks for the state field, and because the field is missing, the cluster stays reported as STARTING: (<CodeFlareClusterStatus.STARTING: 2>, False)
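To illustrate the failure mode (this is not the code at the linked line, only a simplified stand-in): if the lookup treats anything other than a "ready" state as still starting, a status block with no state key can never leave STARTING. The enum below is a local mock; only `STARTING = 2` comes from the output above, and the `READY` member and the `"ready"` string are assumptions.

```python
from enum import Enum
from typing import Tuple


class CodeFlareClusterStatus(Enum):
    # Local mock for illustration; only STARTING = 2 is taken from the thread.
    READY = 1
    STARTING = 2


def classify(status: dict) -> Tuple[CodeFlareClusterStatus, bool]:
    """Map a RayCluster status block to a (status, ready) pair."""
    state = status.get("state")  # None when the operator does not set it
    if state == "ready":
        return CodeFlareClusterStatus.READY, True
    # A missing state is indistinguishable from "not ready yet" here.
    return CodeFlareClusterStatus.STARTING, False


# The status block reported above, which has no "state" key.
reported = {
    "availableWorkerReplicas": 2,
    "desiredWorkerReplicas": 1,
    "endpoints": {"client": "10001", "dashboard": "8265", "gcs": "6379"},
    "head": {"serviceIP": "172.21.234.58"},
    "lastUpdateTime": "2023-06-05T20:01:31Z",
    "maxWorkerReplicas": 1,
    "minWorkerReplicas": 1,
}
print(classify(reported))
```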

@thinkahead

Please ignore my previous comment; it works with 0.5.0 (I was previously using 0.4.0).


3 participants