Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Simplify flytekit.Resources #6245

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

granthamtaylor
Copy link

Defining resources for tasks is error prone and counter-intuitive due to the nuances around requests and limits.

This RFC discusses an alternative solution to simplify the definition of resources, remove some footguns, and smoothen flytekit altogether.

Signed-off-by: Grantham Taylor <[email protected]>
Copy link

welcome bot commented Feb 14, 2025

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@flyte-bot
Copy link
Collaborator

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Copy link

codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.87%. Comparing base (b04df59) to head (fa3fb0a).

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6245       +/-   ##
===========================================
- Coverage   50.51%   36.87%   -13.64%     
===========================================
  Files        1162     1318      +156     
  Lines       91811   134703    +42892     
===========================================
+ Hits        46375    49672     +3297     
- Misses      41340    80703    +39363     
- Partials     4096     4328      +232     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 51.96% <ø> (+0.02%) ⬆️
unittests-flytecopilot 30.99% <ø> (?)
unittests-flytectl 62.29% <ø> (ø)
unittests-flyteidl 7.22% <ø> (?)
unittests-flyteplugins 54.03% <ø> (ø)
unittests-flytepropeller 42.77% <ø> (-0.01%) ⬇️
unittests-flytestdlib 55.35% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davidmirror-ops davidmirror-ops added the rfc A label for RFC issues label Feb 14, 2025
@cjidboon94
Copy link
Contributor

This RFC would be a big loss for me. I love having control over reqs/lims directly.
The suggestion in the RFC could be a new golden path sure if it's supposed to "smoothen" things,
while keeping the classic requests/limits setting for power users and people think they know what they are doing with a warning in code and documentation that it's for advanced users.

@granthamtaylor
Copy link
Author

This RFC would be a big loss for me. I love having control over reqs/lims directly.

The suggestion in the RFC could be a new golden path sure if it's supposed to "smoothen" things,

while keeping the classic requests/limits setting for power users and people think they know what they are doing with a warning in code and documentation that it's for advanced users.

Hi! Thank you for sharing. Perhaps, going forward, we can allow power users to define the request and limits separately in the form of a tuple ("1Gi", "2Gi"). I think that, ideally, this is preferable over the current pattern of two separate arguments ("requests" and "limits") in that, as a single argument it'll be easy to validate that within Resources. Additionally, I think this will help keep new users from hurting themselves.

Is it just mem and cpu that you want to have complete control over?


Some of the footguns around `requests` and `limits` are non-deterministic and not intuitive, thus can escape detection during development but have significant and detrimental impact in production.

For example, the possibility of setting `requests < limits` for `mem` can lead to a pod becoming overloaded. For `cpu`, doing so can lead to CPU availability, for some multi-processing applications, to be non-deterministic, thus resulting in unpredictable performance degradation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the memory case in particular, when a requests != limits the pod is considered burstable (quality of service designation) and more likely to be evicted when the node is under memory pressure


## 6 Alternatives

Should any resource type (`CPU`) need both `limits` and `requests`, we should allow definition of them via a `tuple` (IE `(400m, 600m)`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to having an escape hatch but prioritizing the common use-case

@katrogan
Copy link
Contributor

@cjidboon94 that's really helpful feedback, do you mind expanding in which circumstances you prefer to set request != limits

@fg91
Copy link
Member

fg91 commented Feb 27, 2025

I agree that the UX of the task arguments @task(requests=..., limits=..., accelerator=..., shared_memory=) is definitely very Kubernetes-centric:

  • separate requests and limits directly translate to pod requests/limits
  • accelerator not part of requests/limits because it translates to a node selector/affinity and toleration
  • shared memory not part of requests limits because it translates to a volume+mount

I also agree that for ML engineers/Data Scientists who don't know/care about K8s, the UX can feel a bit awkward. Especially the intricacies of requests vs. limits.

We maintain a decorator wrapping the flytekit task decorator which derives the limits from the requests in a way that makes sense for us. Our users never use the limits arg.

At the same time, I feel for platform engineers it absolutely makes sense to have full control over both requests and limits.

To summarize:
From an ML engineer's perspective, I would favor:

  • Aggregation of cpu, memory, gpu, shared memory, ephemeral storage, ... into a single struct
  • Deriving sane defaults for the limits derived from the specified requests so that I as an ML engineer don't have to think about K8s intricacies.

From a platform engineer's perspective:

  • I would like to be able to override the requests/limits derived by default from the specified resources.

All of this being said, this is a very breaking change which needs to be considered carefullly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc A label for RFC issues
Projects
Status: New
Development

Successfully merging this pull request may close these issues.

6 participants