Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: record last state transition times #2053

Merged
merged 1 commit into from
May 1, 2024

Conversation

davidxia
Copy link
Contributor

@davidxia davidxia commented Mar 29, 2024

Why are these changes needed?

Problem Statement

My ML platform team runs the kuberay ray-operator. We want to measure the time
it takes for RayCluster's to transition from their initial "unhealthy" state to
some other state. This metric is important for us because our users want their
RayClusters to start in a timely manner. It seems like neither the ray-operator
nor RayClusters provide this info currently.

Design

Add a new .status.stateTransitionTimes field to the RayCluster custom
resource. This field is a map[ClusterState]*metav1.Time that indicates the
time of the last state transition for each state. This field is updated
whenever the .status.state changes.

manual testing steps:

  1. make manifests generate
  2. kubectl config use-context CONTEXT
  3. make docker-build docker-push deploy IMG=europe-west4-docker.pkg.dev/spotify-workbench/images/operator:$USER-$(git rev-parse --short=7 HEAD)
  4. kubectl --context CONTEXT apply -f /path/to/raycluster.yaml
  5. verify there's a .status.stateTransitionTimes in the output of kubectl --context CONTEXT -n NAMESPACE get rayclusters NAME -o yaml

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review April 2, 2024 20:10
@kevin85421 kevin85421 self-assigned this Apr 2, 2024
@kevin85421
Copy link
Member

@sydneyw-spotify just saw you update this PR recently. Feel free to ping me whenever this PR is ready for review.

@davidxia davidxia marked this pull request as ready for review April 26, 2024 07:16
@davidxia
Copy link
Contributor Author

@kevin85421, thanks. This is ready for review now. Example input and output in the info gist link in PR description.

@davidxia davidxia force-pushed the state-transitions branch from ec886c7 to 4996fca Compare April 29, 2024 14:52
## Problem Statement

My ML platform team runs the kuberay ray-operator. We want to measure the time
it takes for RayCluster's to transition from their initial "unhealthy" state to
some other state. This metric is important for us because our users want their
RayClusters to start in a timely manner. It seems like neither the ray-operator
nor RayClusters provide this info currently.

## Design

Add a new `.status.stateTransitionTimes` field to the `RayCluster` custom
resource. This field is a `map[ClusterState]*metav1.Time` that indicates the
time of the last state transition for each state. This field is updated
whenever the `.status.state` changes.

* [original discussion doc](https://docs.google.com/document/d/14yPSZ9iLk7a0qEg14rNWr60Btz0HEeQ3oWKP-GN9QTM)
* [related Slack thread](https://ray-distributed.slack.com/archives/C01CKH05XBN/p1709321264762029)
* [example input and output RayClusters](https://gist.github.com/davidxia/205d2b23202356a2d3172c51e0912f35)
@davidxia davidxia force-pushed the state-transitions branch from 4996fca to fdfb6c7 Compare April 29, 2024 17:23
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will open a follow up to move the envtest to a better place.

@kevin85421 kevin85421 merged commit 3c44ba0 into ray-project:master May 1, 2024
24 checks passed
@davidxia davidxia deleted the state-transitions branch May 1, 2024 12:36
@davidxia
Copy link
Contributor Author

davidxia commented May 1, 2024

LGTM. I will open a follow up to move the envtest to a better place.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants