Fix scaleUpChron check for queue time using max_queue_time_minutes #6618

jeanschmidt · 2025-05-09T18:16:21Z

The parameter min_queue_time_minutes returned by the HUD query queued_jobs_aggregate is, for the returned group of queued jobs, the minimum of those jobs' queue times. The parameter max_queue_time_minutes, in contrast, is the queue time of the job that has been waiting longest for a particular instance type.
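To make the difference between the two fields concrete, here is a minimal sketch of how such an aggregate could be computed from individual queued jobs. The QueuedJob shape, the runner_label and num_queued_jobs fields, and the aggregateQueuedJobs helper are assumptions made for illustration; they are not taken from the actual HUD query.

```typescript
// Hypothetical shape of a single queued job (not the real HUD schema).
interface QueuedJob {
  runnerLabel: string;
  queueTimeMinutes: number;
}

// Hypothetical shape of one aggregated row. The min/max field names follow
// the PR description; the other fields are assumed for illustration.
interface QueuedJobsAggregate {
  runner_label: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

// Group jobs by runner label and track the shortest and longest wait.
function aggregateQueuedJobs(jobs: QueuedJob[]): QueuedJobsAggregate[] {
  const byLabel = new Map<string, QueuedJobsAggregate>();
  for (const job of jobs) {
    const row = byLabel.get(job.runnerLabel);
    if (row === undefined) {
      byLabel.set(job.runnerLabel, {
        runner_label: job.runnerLabel,
        num_queued_jobs: 1,
        min_queue_time_minutes: job.queueTimeMinutes,
        max_queue_time_minutes: job.queueTimeMinutes,
      });
    } else {
      row.num_queued_jobs += 1;
      row.min_queue_time_minutes = Math.min(row.min_queue_time_minutes, job.queueTimeMinutes);
      row.max_queue_time_minutes = Math.max(row.max_queue_time_minutes, job.queueTimeMinutes);
    }
  }
  return Array.from(byLabel.values());
}

// One job has waited 45 minutes, but a job that arrived 2 minutes ago
// pulls the minimum down to 2.
const exampleAggregate = aggregateQueuedJobs([
  { runnerLabel: "example.runner", queueTimeMinutes: 45 },
  { runnerLabel: "example.runner", queueTimeMinutes: 2 },
]);
```

In this example the single row has a minimum of 2 and a maximum of 45, which is why a steady stream of new jobs keeps the minimum low regardless of how long the oldest job has waited.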

Currently we've been filtering on min_queue_time_minutes, which doesn't make a lot of sense. It does not add any additional checks or protections and can introduce fatal failures. If the query has a configuration that diverges from the lambda's, say 10 minutes instead of the 30 minutes used currently, and new jobs are always coming in (which keeps the minimum queue time low), scaleUpChron will never run.

So, since this is still a reasonable check (it still doesn't do exactly what we wanted it to do), I kept it, but it now validates that the longest-queued job for an instance type has waited at least SCALE_UP_MAX_QUEUE_TIME_MINUTES (defaults to 30). This still makes it possible to overprovision if HUD fails, but it is better than nothing.
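The change in the check can be sketched as follows. This is an illustrative sketch, not the code from the PR: the row shape, the filterByMinQueueTime and filterByMaxQueueTime helpers, and the sample data are hypothetical; only the field names and SCALE_UP_MAX_QUEUE_TIME_MINUTES with its default of 30 come from the description above.

```typescript
// Hypothetical shape of one row returned by the queue query.
interface QueuedJobsForRunner {
  runner_label: string;
  num_queued_jobs: number;
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

// Threshold named in the PR description; 30 is the stated default.
const SCALE_UP_MAX_QUEUE_TIME_MINUTES = 30;

// Old behaviour (as described): keep a row only if even the newest job
// has waited at least the threshold.
function filterByMinQueueTime(
  rows: QueuedJobsForRunner[],
  thresholdMinutes: number,
): QueuedJobsForRunner[] {
  return rows.filter((row) => row.min_queue_time_minutes >= thresholdMinutes);
}

// New behaviour (as described): keep a row if the longest-waiting job
// has waited at least the threshold.
function filterByMaxQueueTime(
  rows: QueuedJobsForRunner[],
  thresholdMinutes: number,
): QueuedJobsForRunner[] {
  return rows.filter((row) => row.max_queue_time_minutes >= thresholdMinutes);
}

// Sample data: jobs keep arriving, so the minimum stays small even though
// the oldest job has been queued for 45 minutes.
const sampleRows: QueuedJobsForRunner[] = [
  {
    runner_label: "example.runner",
    num_queued_jobs: 12,
    min_queue_time_minutes: 2,
    max_queue_time_minutes: 45,
  },
];

const keptByOldCheck = filterByMinQueueTime(sampleRows, SCALE_UP_MAX_QUEUE_TIME_MINUTES);
const keptByNewCheck = filterByMaxQueueTime(sampleRows, SCALE_UP_MAX_QUEUE_TIME_MINUTES);
```

With the sample data, the min-based filter drops the row, so nothing would be scaled up, while the max-based filter keeps it.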

vercel · 2025-05-09T18:16:24Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Updated (UTC)
torchci	⬜️ Ignored (Inspect)	Visit Preview	May 13, 2025 0:54am

ZainRizvi

We should change the config name as well

…nschmidt/fix_scaleUpChron_queue_check

Fix scaleUpChron check for queue time using max_queue_time_minutes

89356f0

facebook-github-bot added the CLA Signed label May 9, 2025

ZainRizvi approved these changes May 9, 2025

jeanschmidt added 2 commits May 13, 2025 14:43

Merge branch 'main' of https://github.com/pytorch/test-infra into jea…

22b0a67

…nschmidt/fix_scaleUpChron_queue_check

20250513145352

57ce649

zxiiro approved these changes May 13, 2025

jeanschmidt merged commit 367dacc into main May 13, 2025
6 checks passed

jeanschmidt deleted the jeanschmidt/fix_scaleUpChron_queue_check branch May 13, 2025 13:00
