@dekaihu
Contributor

@dekaihu dekaihu commented Oct 9, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:
This pull request introduces a resource interpreter customization for Kubeflow Notebooks. The changes include the customization definition with Lua scripts for health interpretation, status aggregation, and status reflection, along with corresponding tests and test data.

Which issue(s) this PR fixes:
Fixes #
Part of #6589

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Introduced a built-in interpreter for Kubeflow Notebooks.

@gemini-code-assist

Summary of Changes

Hello @dekaihu, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates support for Kubeflow Notebook resources by adding a new resource interpreter. This interpreter provides essential logic for health checking, status aggregation, and status reflection, which are crucial for managing and observing Notebook instances within a Kubernetes environment. The addition ensures that Notebooks can be properly monitored and their states accurately represented.

Highlights

  • Kubeflow Notebook Resource Interpretation: Introduced custom resource interpretation for kubeflow.org/v1/Notebook resources, enabling the system to understand and manage their lifecycle more effectively.
  • Lua Scripted Logic: Implemented Lua scripts for healthInterpretation, statusAggregation, and statusReflection to define how Notebooks' health, aggregated status, and reflected status are determined.
  • Comprehensive Test Coverage: Added dedicated test configurations and sample YAML data (desired-notebook.yaml, observed-notebook.yaml, status-file.yaml) to validate the correctness of the new Notebook interpreter logic.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 9, 2025
@dekaihu dekaihu force-pushed the interpreter-notebook branch from 5ed377c to 28ccf7a Compare October 9, 2025 07:23

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a resource interpreter customization for Kubeflow Notebooks. The implementation has several issues in the Lua scripts for health interpretation, status aggregation, and status reflection that could lead to incorrect behavior. I've provided suggestions to fix these issues and improve the robustness of the scripts. Additionally, there's a minor inconsistency in the test configuration that should be addressed.

@codecov-commenter

codecov-commenter commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.22%. Comparing base (4594ca0) to head (dcc2735).
⚠️ Report is 92 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6814      +/-   ##
==========================================
+ Coverage   45.73%   46.22%   +0.49%     
==========================================
  Files         689      692       +3     
  Lines       57104    47194    -9910     
==========================================
- Hits        26114    21817    -4297     
+ Misses      29358    23723    -5635     
- Partials     1632     1654      +22     
Flag Coverage Δ
unittests 46.22% <ø> (+0.49%) ⬆️

@XiShanYongYe-Chang
Member

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a resource interpreter customization for Kubeflow Notebooks. The changes include the customization definition with Lua scripts for health interpretation, status aggregation, and status reflection, along with corresponding tests and test data.

My review of the Lua scripts has identified a couple of issues that could affect the correctness and robustness of the interpreter:

  • In the statusAggregation script, the containerState is being overwritten in each loop iteration, which will result in loss of information from all but the last member cluster.
  • The statusReflection script performs unsafe access to status fields, which could lead to an inconsistent structure in the returned status if some fields are missing in the observed object.

I have provided specific suggestions to address these points. After these fixes, the PR should be in good shape.
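
As a rough illustration of the second point, and not the code from this PR, a guarded status reflection might look like the sketch below. It assumes Karmada's Lua entry point `ReflectStatus(observedObj)` and the three status fields shown elsewhere in this conversation (`conditions`, `containerState`, `readyReplicas`). Defaulting `readyReplicas` to 0 is an assumption made for the example.

```lua
-- Hypothetical sketch, not the script merged in this PR.
-- Builds the reflected status without ever indexing a missing table.
function ReflectStatus(observedObj)
  local status = {}
  -- Nothing to reflect when the object or its status is absent.
  if observedObj == nil or observedObj.status == nil then
    return status
  end
  -- Copy each field only when it is present on the observed object.
  if observedObj.status.conditions ~= nil then
    status.conditions = observedObj.status.conditions
  end
  if observedObj.status.containerState ~= nil then
    status.containerState = observedObj.status.containerState
  end
  -- Assumption: a missing readyReplicas is reported as 0.
  status.readyReplicas = observedObj.status.readyReplicas or 0
  return status
end
```

With this shape, an object that has no status yields an empty table instead of raising an error on a nil index.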

@dekaihu dekaihu force-pushed the interpreter-notebook branch 2 times, most recently from 18f0f45 to 14f50ea Compare October 9, 2025 11:57
@XiShanYongYe-Chang
Member

For DependencyInterpretation, does the Notebook involve any dependency-distribution scenarios?

@dekaihu
Contributor Author

dekaihu commented Oct 14, 2025

> For DependencyInterpretation, does the Notebook involve any dependency-distribution scenarios?

A Notebook is a single-Pod-template workload whose only dependency is the Pod's volume mounts, and the mounted data generally already exists in the member cluster. There should be no scenario in which it relies on resource propagation.

@XiShanYongYe-Chang
Member

> A Notebook is a single-Pod-template workload whose only dependency is the Pod's volume mounts, and the mounted data generally already exists in the member cluster. There should be no scenario in which it relies on resource propagation.

Okay, if new dependent resources are identified later, we can continue iterating.

@dekaihu dekaihu force-pushed the interpreter-notebook branch from e711f27 to 23f9b07 Compare October 22, 2025 08:55
@dekaihu
Contributor Author

dekaihu commented Oct 22, 2025

Karmada control plane results:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
        resources:
          requests:
            cpu: 100m
status:
  conditions:
  - lastProbeTime: "2025-10-22T08:54:17Z"
    lastTransitionTime: "2025-10-22T08:54:17Z"
    message: All notebooks are ready
    reason: Ready
    status: "True"
    type: Ready
  containerState:
    running:
      startedAt: "2025-10-22T08:54:14Z"
  readyReplicas: 2

member1:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
        resources:
          requests:
            cpu: 100m
status:
  conditions:
  - lastProbeTime: "2025-10-22T08:54:14Z"
    lastTransitionTime: "2025-10-22T08:54:10Z"
    status: "True"
    type: Initialized
  - lastProbeTime: "2025-10-22T08:54:14Z"
    lastTransitionTime: "2025-10-22T08:54:14Z"
    status: "True"
    type: Ready
  - lastProbeTime: "2025-10-22T08:54:14Z"
    lastTransitionTime: "2025-10-22T08:54:14Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: "2025-10-22T08:54:14Z"
    lastTransitionTime: "2025-10-22T08:54:10Z"
    status: "True"
    type: PodScheduled
  containerState:
    running:
      startedAt: "2025-10-22T08:54:14Z"
  readyReplicas: 1

member2:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
        resources:
          requests:
            cpu: 100m
status:
  conditions:
  - lastProbeTime: "2025-10-22T08:53:31Z"
    lastTransitionTime: "2025-10-22T08:53:28Z"
    status: "True"
    type: Initialized
  - lastProbeTime: "2025-10-22T08:53:31Z"
    lastTransitionTime: "2025-10-22T08:53:31Z"
    status: "True"
    type: Ready
  - lastProbeTime: "2025-10-22T08:53:31Z"
    lastTransitionTime: "2025-10-22T08:53:31Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: "2025-10-22T08:53:31Z"
    lastTransitionTime: "2025-10-22T08:53:28Z"
    status: "True"
    type: PodScheduled
  containerState:
    running:
      startedAt: "2025-10-22T08:53:31Z"
  readyReplicas: 1

@XiShanYongYe-Chang
Member

Hi @dekaihu could you help explain this process a bit?

@dekaihu
Contributor Author

dekaihu commented Oct 22, 2025

> Hi @dekaihu could you help explain this process a bit?

OK. A Notebook is a single-Pod workload that provides services such as Jupyter. If it is scheduled to a single cluster, status aggregation on the Karmada control plane passes every field through and returns the member cluster's status directly. When it is distributed across multiple clusters, the aggregated status is Ready only once all member clusters are Ready, and it reports the latest start time. If the service in any member cluster is not Ready, the aggregated status should be UnReady. Likewise, when aggregating across multiple member clusters, we build a consolidated set of conditions for the conditions field, which is what the user sees as the result.
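
A minimal sketch of the rule described above, for illustration only and not the script from this PR. It assumes Karmada's Lua entry point `AggregateStatus(desiredObj, statusItems)`, where each item carries a `status` table. The helper `isReady` and the single synthesized `Ready` condition are assumptions made for the example; the real script may emit different condition types.

```lua
-- Hypothetical sketch of the aggregation rule; field handling is assumed.

-- Returns true when the status carries a Ready condition set to "True".
local function isReady(status)
  for _, cond in ipairs(status.conditions or {}) do
    if cond.type == "Ready" and cond.status == "True" then
      return true
    end
  end
  return false
end

function AggregateStatus(desiredObj, statusItems)
  if statusItems == nil or #statusItems == 0 then
    return desiredObj
  end
  -- Single cluster: pass the member status through unchanged.
  if #statusItems == 1 then
    desiredObj.status = statusItems[1].status
    return desiredObj
  end

  local readyReplicas = 0
  local allReady = true
  local latestStart = nil
  for _, item in ipairs(statusItems) do
    local st = item.status or {}
    readyReplicas = readyReplicas + (st.readyReplicas or 0)
    if not isReady(st) then
      allReady = false
    end
    local running = st.containerState and st.containerState.running
    -- UTC timestamps in the same RFC 3339 layout compare correctly as strings.
    if running and running.startedAt and
        (latestStart == nil or running.startedAt > latestStart) then
      latestStart = running.startedAt
    end
  end

  desiredObj.status = {
    readyReplicas = readyReplicas,
    conditions = { { type = "Ready", status = allReady and "True" or "False" } },
  }
  -- Report a running state only when every member cluster is Ready.
  if allReady and latestStart ~= nil then
    desiredObj.status.containerState = { running = { startedAt = latestStart } }
  end
  return desiredObj
end
```

Fed the two member statuses posted earlier in this conversation, the sketch sums readyReplicas to 2 and keeps the later of the two startedAt values.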


if state.waiting ~= nil then
  local reason = state.waiting.reason or ""
  if reason == "ContainerCreating" then
Member

What are the possible values for "reason"? Is there any possibility that some values are missed?

Contributor Author

Common reason values for a Notebook in the Waiting state include ContainerCreating, ErrImagePull, ImagePullBackOff, CreateContainerError, and CrashLoopBackOff. Here, I treat only ContainerCreating as the state in which the Notebook is still being created normally.
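
To make the rule concrete, here is a small sketch under stated assumptions; it is not the script from this PR. The function `ClassifyContainerState` and its return labels are hypothetical names invented for the example, and treating a terminated container as failed is also an assumption.

```lua
-- Hypothetical helper, for illustration only.
-- Maps a containerState table to a coarse label.
function ClassifyContainerState(state)
  -- A missing or empty state carries no information.
  if state == nil or next(state) == nil then
    return "unknown"
  end
  if state.running ~= nil then
    return "running"
  end
  if state.waiting ~= nil then
    local reason = state.waiting.reason or ""
    -- Only ContainerCreating counts as normal start-up.
    if reason == "ContainerCreating" then
      return "creating"
    end
    -- Every other reason (ErrImagePull, CrashLoopBackOff, ...) lands here.
    return "failed"
  end
  if state.terminated ~= nil then
    -- Assumption: a terminated Notebook container is treated as failed.
    return "failed"
  end
  return "unknown"
end
```

Because the comparison is against the single allowed reason, a waiting reason that is not listed above still falls into the failed branch rather than being missed.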

Comment on lines 95 to 97
if aggregatedContainerState == nil or aggregatedContainerState.running ~= nil then
  aggregatedContainerState = st.containerState
Member

Wouldn't this situation lead to having both running and waiting/terminated states at the same time?

Member

Thank you for the explanation, I get it.

@XiShanYongYe-Chang
Member

Thanks~
/lgtm

Can you help add the release note?

/cc @RainbowMango

@karmada-bot karmada-bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Oct 23, 2025
@dekaihu dekaihu force-pushed the interpreter-notebook branch from 668d75c to 4722e9c Compare October 23, 2025 06:52
@XiShanYongYe-Chang
Member

Sorry, I didn't notice what was changed.

@dekaihu dekaihu force-pushed the interpreter-notebook branch 2 times, most recently from ce57d0a to 8069331 Compare October 23, 2025 07:55
@dekaihu
Contributor Author

dekaihu commented Oct 23, 2025

> Sorry, I didn't notice what was changed.

Sorry, I just fixed the aggregation of an empty containerState. Please refer to the following aggregation example.

Karmada aggregation results:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
status:
  conditions:
  - lastProbeTime: "2025-10-23T07:52:33Z"
    lastTransitionTime: "2025-10-23T07:52:33Z"
    message: 'Notebook executed failed in member clusters: member1'
    reason: NotebookFailed
    status: "True"
    type: Failed
  containerState:
    waiting:
      reason: Unschedulable
  readyReplicas: 1

member1:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
status:
  conditions:
  - lastProbeTime: "2025-10-23T07:52:30Z"
    lastTransitionTime: "2025-10-23T07:52:30Z"
    message: '0/6 nodes are available: 1 node(s) had untolerated taint {virtual-kubelet.io/provider:
      leinao}, 5 node(s) were unschedulable. preemption: 0/6 nodes are available:
      6 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  containerState: {}
  readyReplicas: 0

member2:

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample-v1
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: ghcr.io/kubeflow/kubeflow/notebook-servers/jupyter:latest
        name: notebook-sample-v1
status:
  conditions:
  - lastProbeTime: "2025-10-23T07:52:33Z"
    lastTransitionTime: "2025-10-23T07:52:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: "2025-10-23T07:52:33Z"
    lastTransitionTime: "2025-10-23T07:52:33Z"
    status: "True"
    type: Ready
  - lastProbeTime: "2025-10-23T07:52:33Z"
    lastTransitionTime: "2025-10-23T07:52:33Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: "2025-10-23T07:52:33Z"
    lastTransitionTime: "2025-10-23T07:52:30Z"
    status: "True"
    type: PodScheduled
  containerState:
    running:
      startedAt: "2025-10-23T07:52:32Z"
  readyReplicas: 1
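
One way to read the example above, sketched under assumptions rather than taken from the PR: a non-running state wins over a running one, and an empty containerState falls back to the reason of a failed PodScheduled condition. The function name `PickContainerState` is hypothetical, and the PodScheduled fallback is only inferred from the aggregated output shown here.

```lua
-- Hypothetical sketch; the fallback from an empty containerState to the
-- PodScheduled condition is an assumption inferred from the example output.
function PickContainerState(statusItems)
  local picked = nil
  for _, item in ipairs(statusItems) do
    local st = item.status or {}
    local cs = st.containerState
    if cs == nil or next(cs) == nil then
      -- Empty state: derive a waiting state from an unscheduled Pod, if any.
      cs = nil
      for _, cond in ipairs(st.conditions or {}) do
        if cond.type == "PodScheduled" and cond.status == "False" then
          cs = { waiting = { reason = cond.reason } }
        end
      end
    end
    -- Replace the current pick only while it is unset or still running,
    -- so the first non-running state is kept.
    if cs ~= nil and (picked == nil or picked.running ~= nil) then
      picked = cs
    end
  end
  return picked
end
```

Applied to the member1 and member2 statuses above, this selection yields a waiting state with reason Unschedulable regardless of the order in which the clusters are visited.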

@dekaihu dekaihu force-pushed the interpreter-notebook branch from 8ed8297 to ebd2b2b Compare October 24, 2025 08:02
@dekaihu dekaihu force-pushed the interpreter-notebook branch from d13cd61 to dcc2735 Compare October 24, 2025 08:35
@XiShanYongYe-Chang
Member

Hi @dekaihu, if you're ready, you can cc me.

@dekaihu
Contributor Author

dekaihu commented Oct 24, 2025

@XiShanYongYe-Chang It's ready. The CI process e2e test occasionally fails. Is there a way to re-trigger the e2e test without submitting an empty commit?

@XiShanYongYe-Chang
Member

> It's ready. The CI process e2e test occasionally fails. Is there a way to re-trigger the e2e test without submitting an empty commit?

You need to push again; it's enough if the commit ID has changed. For example, you can use git commit --amend to achieve this.

@dekaihu
Contributor Author

dekaihu commented Oct 24, 2025

> > It's ready. The CI process e2e test occasionally fails. Is there a way to re-trigger the e2e test without submitting an empty commit?
>
> You need to push again; it's enough if the commit ID has changed. For example, you can use git commit --amend to achieve this.

OK, thanks

@XiShanYongYe-Chang
Member

Thanks~
/lgtm
Can you help fill in the release note?

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2025
Member

@RainbowMango RainbowMango left a comment

/assign

@RainbowMango RainbowMango added this to the v1.16 milestone Nov 3, 2025
Member

@RainbowMango RainbowMango left a comment

/lgtm
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2025
@karmada-bot karmada-bot merged commit 634cc33 into karmada-io:master Nov 3, 2025
24 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
