Apply PodSetInfo to PipelineRun taskRunTemplate.podTemplate #84

@gbenhaim

Description

What would you like to be added:

Enhance the `RunWithPodSetsInfo` method in the PipelineRun controller to apply the node selector (labelSelector) and tolerations from the `PodSetInfo` to the `pipelinerun.spec.taskRunTemplate.podTemplate` field.

Currently, the `RunWithPodSetsInfo` method in `internal/controller/pipelinerun_controller.go` only clears the spec status (un-suspending the run) and returns:

```go
func (p *PipelineRun) RunWithPodSetsInfo(podSetsInfo []podset.PodSetInfo) error {
	p.Spec.Status = ""
	return nil
}
```

The enhancement should:

1. **Extract node scheduling information from PodSetInfo:** parse the `podSetsInfo` parameter to extract:
   - node selectors from the resource flavor
   - tolerations for tainted nodes
   - any additional pod template specifications
2. **Apply to the PipelineRun taskRunTemplate:** update the PipelineRun's `spec.taskRunTemplate.podTemplate` field with:
   - `nodeSelector` from the `podset.PodSetInfo`
   - `tolerations` from the `podset.PodSetInfo`
3. **Handle multiple PodSetInfo entries:** when multiple `PodSetInfo` entries exist, apply the appropriate scheduling constraints so that all TaskRuns in the PipelineRun are scheduled according to the resource flavor requirements.

Why is this needed:

Currently, when Kueue admits a PipelineRun workload and assigns it to a specific resource flavor, the node scheduling information (labelSelector and tolerations) from the resource flavor is not propagated to the actual TaskRun pods. This creates a disconnect between Kueue's resource management and Tekton's pod scheduling.

This enhancement is critical for several reasons:

  1. Resource Flavor Enforcement: When administrators configure ClusterQueues with specific resource flavors (e.g., GPU nodes, high-memory nodes), the PipelineRun's TaskRuns should actually run on those designated nodes. Without this, workloads might be scheduled on inappropriate nodes despite Kueue's resource allocation.

  2. Node Affinity and Tolerations: Resource flavors often include node selectors and tolerations to target specific node pools (e.g., node-type=gpu, workload-type=build). These constraints must be applied to TaskRun pods to ensure proper scheduling.

  3. Multi-tenant Isolation: In multi-tenant environments, resource flavors provide isolation by directing workloads to specific node pools. This isolation is only effective if TaskRun pods respect these constraints.

  4. Compliance with Kueue Design: The RunWithPodSetsInfo method exists specifically to allow job controllers to apply Kueue-determined scheduling constraints. The current no-op implementation defeats this purpose.

Example Impact:

```yaml
# ResourceFlavor targeting GPU nodes (referenced by a ClusterQueue)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    node-type: gpu
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```

Without this enhancement, a PipelineRun admitted with the gpu-flavor would not have its TaskRuns scheduled on GPU nodes, leading to resource misallocation and potential workload failures.
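Conversely, with the enhancement in place, an admitted PipelineRun would end up carrying the flavor's constraints in its shared pod template. A sketch of the expected mutation (resource names are illustrative, not actual controller output):

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: build-on-gpu   # hypothetical PipelineRun
spec:
  taskRunTemplate:
    podTemplate:
      nodeSelector:
        node-type: gpu          # from the flavor's nodeLabels
      tolerations:              # from the flavor's tolerations
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
```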

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.
