
Fix scaleUpChron check for queue time using max_queue_time_minutes #6618


Merged on May 13, 2025 (3 commits)


@jeanschmidt (Contributor) commented May 9, 2025

The HUD query queued_jobs_aggregate returns min_queue_time_minutes, which is the shortest queue time among the queued jobs it reports. In contrast, max_queue_time_minutes is the queue time of the job that has been waiting longest for a particular instance type.

Currently we filter on min_queue_time_minutes, which doesn't make much sense. It adds no extra checks or protections and can introduce fatal failures. If the query's configuration diverges from the lambda's, say 10 minutes instead of the 30 minutes currently used, and new jobs keep arriving, scaleUpChron will never run.

Since the check itself is still worthwhile (it still doesn't do exactly what we wanted), I kept it, but it now validates that the longest-queued job for an instance type has waited at least SCALE_UP_MAX_QUEUE_TIME_MINUTES (defaults to 30). Overprovisioning is still possible if HUD fails, but this is better than nothing.
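To illustrate the difference between the two filters, here is a minimal sketch, not the actual lambda code. Only min_queue_time_minutes, max_queue_time_minutes, and SCALE_UP_MAX_QUEUE_TIME_MINUTES come from this PR; the row shape, the other field names, the function names, and the sample data are assumptions made up for the example.

```typescript
// Hypothetical shape of one row returned by queued_jobs_aggregate.
// Only the two *_queue_time_minutes fields are named in this PR;
// runner_label and num_queued_jobs are assumed for illustration.
interface QueuedJobsRow {
  runner_label: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

// Default stated in the PR description.
const SCALE_UP_MAX_QUEUE_TIME_MINUTES = 30;

// Previous check: keeps a row only if even its most recently queued job
// has waited long enough. A steady stream of new jobs keeps
// min_queue_time_minutes low, so the row is always dropped.
function filterByMinQueueTime(
  rows: QueuedJobsRow[],
  thresholdMinutes: number
): QueuedJobsRow[] {
  return rows.filter((row) => row.min_queue_time_minutes >= thresholdMinutes);
}

// Check described in this PR: keeps a row if its longest-waiting job
// has been queued for at least the threshold.
function filterByMaxQueueTime(
  rows: QueuedJobsRow[],
  thresholdMinutes: number
): QueuedJobsRow[] {
  return rows.filter((row) => row.max_queue_time_minutes >= thresholdMinutes);
}

// Made-up sample data: new jobs keep arriving for both runner types.
const rows: QueuedJobsRow[] = [
  { runner_label: "runner-a", num_queued_jobs: 12, min_queue_time_minutes: 1, max_queue_time_minutes: 45 },
  { runner_label: "runner-b", num_queued_jobs: 3, min_queue_time_minutes: 2, max_queue_time_minutes: 8 },
];

const byMin = filterByMinQueueTime(rows, SCALE_UP_MAX_QUEUE_TIME_MINUTES);
const byMax = filterByMaxQueueTime(rows, SCALE_UP_MAX_QUEUE_TIME_MINUTES);

console.log(byMin.map((r) => r.runner_label)); // [] (nothing would ever scale up)
console.log(byMax.map((r) => r.runner_label)); // [ 'runner-a' ]
```

With the min-based filter, runner-a is skipped even though one of its jobs has waited 45 minutes; the max-based filter picks it up while still ignoring runner-b, whose oldest job has waited only 8 minutes.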


vercel bot commented May 9, 2025

1 skipped deployment: torchci, status Ignored, updated May 13, 2025 0:54am (UTC)

@facebook-github-bot added the CLA Signed label May 9, 2025
@ZainRizvi (Contributor) left a comment

We should change the config name as well

@jeanschmidt merged commit 367dacc into main May 13, 2025
6 checks passed
@jeanschmidt deleted the jeanschmidt/fix_scaleUpChron_queue_check branch May 13, 2025 13:00