Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upstream: Ensure that k8s plugins (dask, kfoperator) have resource requests and limits set #6264

Merged

Conversation

katrogan
Copy link
Contributor

@katrogan katrogan commented Feb 21, 2025

Why are the changes needed?

When a user specifies dask scheduler or worker requests, but not limits, k8s pod scheduling fails with

95s         Warning   FailedCreate                        replicaset/a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-944f52ccec-6d86b47fff   Error creating: pods "a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-944f52ccec-k6p94" is forbidden: failed quota: project-quota: must specify limits.cpu for: dask-worker; limits.memory for: dask-worker
85s         Warning   FailedCreate                        replicaset/a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-c20eb58c04-686775db55   (combined from similar events): Error creating: pods "a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-c20eb58c04-2kgxq" is forbidden: failed quota: project-quota: must specify limits.cpu for: dask-worker; limits.memory for: dask-worker
85s         Warning   FailedCreate                        replicaset/a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-944f52ccec-6d86b47fff   (combined from similar events): Error creating: pods "a2bjwpn2f5h8666jxmv4-fofhakuy-0-default-worker-944f52ccec-r667v" is forbidden: failed quota: project-quota: must specify limits.cpu for: dask-worker; limits.memory for: dask-worker

What changes were proposed in this pull request?

This change follows existing container and k8s pod task logic to ensure that when one of requests or limits are not set, the unspecified value comes from the specified one, or otherwise both values come from the platform defaults.

How was this patch tested?

Ran locally on a flyte deployment, verified after changes that worker pods god scheduled

8m11s       Normal    SuccessfulCreate                    replicaset/azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-c44f7b7bd9-79f8584d77   Created pod: azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-c44f7b7bd9-bptqq
8m12s       Normal    ScalingReplicaSet                   deployment/azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-c44f7b7bd9              Scaled up replica set azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-c44f7b7bd9-79f8584d77 to 1
...
8m12s       Normal    SuccessfulCreate                    replicaset/azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-1755145870-6d8c6d5b79   Created pod: azpkkgg9cwsr6nbmq2k5-fofhakuy-0-default-worker-1755145870-tltd9

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR enhances Kubernetes resource management by implementing proper resource requests and limits handling through a new ApplyK8sResourceOverrides utility. The changes focus on preventing pod scheduling failures in quota-restricted environments by ensuring both CPU and memory limits are properly specified. The implementation includes safeguards for platform-defined defaults when values are unspecified.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 2

Signed-off-by: Katrina Rogan <[email protected]>
@katrogan katrogan added fixed For any bug fixes changed For changes in existing functionality labels Feb 21, 2025
@flyte-bot
Copy link
Collaborator

flyte-bot commented Feb 21, 2025

Code Review Agent Run #3b164d

Actionable Suggestions - 5
  • flyteplugins/go/tasks/plugins/k8s/dask/dask.go - 3
    • Consider validating TaskExecutionMetadata before use · Line 88-88
    • Consider validating task execution metadata · Line 94-94
    • Consider adding teMetadata parameter validation · Line 116-116
  • flyteplugins/go/tasks/plugins/k8s/dask/dask_test.go - 2
    • Consider memory limit increase implications · Line 331-331
    • Consider separating proto clone and assertion · Line 466-466
Review Details
  • Files reviewed - 4 · Commit Range: acfbfb9..acfbfb9
    • flyteplugins/go/tasks/pluginmachinery/flytek8s/utils.go
    • flyteplugins/go/tasks/plugins/k8s/dask/dask.go
    • flyteplugins/go/tasks/plugins/k8s/dask/dask_test.go
    • flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful

AI Code Review powered by Bito Logo

Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.

Project coverage is 36.85%. Comparing base (e9e227b) to head (acfbfb9).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...plugins/go/tasks/pluginmachinery/flytek8s/utils.go 0.00% 6 Missing ⚠️
.../plugins/k8s/kfoperators/common/common_operator.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6264      +/-   ##
==========================================
- Coverage   36.86%   36.85%   -0.01%     
==========================================
  Files        1318     1318              
  Lines      134773   134782       +9     
==========================================
- Hits        49683    49679       -4     
- Misses      80758    80771      +13     
  Partials     4332     4332              
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 51.85% <ø> (-0.03%) ⬇️
unittests-flytecopilot 30.99% <ø> (ø)
unittests-flytectl 62.29% <ø> (ø)
unittests-flyteidl 7.22% <ø> (ø)
unittests-flyteplugins 53.98% <46.15%> (-0.02%) ⬇️
unittests-flytepropeller 42.78% <ø> (ø)
unittests-flytestdlib 55.35% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@flyte-bot
Copy link
Collaborator

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Feature Improvement - Enhanced Resource Management for K8s Plugins

utils.go - Added new function ApplyK8sResourceOverrides to ensure both resource requests and limits are set

dask.go - Updated Dask worker and scheduler specs to apply resource overrides

common_operator.go - Added resource override handling for KFOperator plugins

Testing - Enhanced Resource Management Test Coverage

dask_test.go - Added comprehensive tests for resource requirement adjustments and platform limits

@katrogan katrogan changed the title Upstream: Ensure that k8s plugins (dask, kfoperator) have resource requests and limits set ( Upstream: Ensure that k8s plugins (dask, kfoperator) have resource requests and limits set Feb 21, 2025
@katrogan katrogan merged commit 199731b into master Feb 21, 2025
51 checks passed
@katrogan katrogan deleted the union/upstream-2712fcaef2c6b6fc0c1f410266fbbf8b1d69b7b1 branch February 21, 2025 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changed For changes in existing functionality fixed For any bug fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants