Fix scaleUpChron check for queue time using max_queue_time_minutes #6618
The HUD query `queued_jobs_aggregate` returns the parameter `min_queue_time_minutes`, which is the minimum queue time among the queued jobs it counts. The parameter `max_queue_time_minutes`, in contrast, is the queue time of the job that has been waiting longest for a particular instance type.

Until now we have been filtering on `min_queue_time_minutes`, which doesn't make much sense. It adds no extra checks or protections and can introduce fatal failures: if the query's configuration diverges from the lambda's (say 10 minutes versus the 30 minutes currently used) and new jobs keep arriving, scaleUpChron will never run.

The check is still worth having, even though it doesn't do exactly what we wanted, so I kept it but now validate that the longest-queued job for an instance type has waited at least `SCALE_UP_MAX_QUEUE_TIME_MINUTES` (defaults to 30). This can still overprovision if HUD fails, but it is better than nothing.
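To illustrate the difference between the two filters, here is a minimal sketch, not the actual lambda code. The `QueuedJobsRow` shape, the `runner_label` and `num_queued_jobs` fields, and both function names are assumptions made for the example; only `min_queue_time_minutes`, `max_queue_time_minutes`, and the default of 30 come from the description above.

```typescript
// Hypothetical row shape for one instance type in the HUD response.
interface QueuedJobsRow {
  runner_label: string; // assumed field name
  num_queued_jobs: number; // assumed field name
  min_queue_time_minutes: number;
  max_queue_time_minutes: number;
}

// Stand-in for SCALE_UP_MAX_QUEUE_TIME_MINUTES (default 30).
const SCALE_UP_MAX_QUEUE_TIME_MINUTES = 30;

// Old behavior: keep a row only if even the newest queued job is old enough.
function filterByMinQueueTime(
  rows: QueuedJobsRow[],
  threshold: number = SCALE_UP_MAX_QUEUE_TIME_MINUTES,
): QueuedJobsRow[] {
  return rows.filter((row) => row.min_queue_time_minutes >= threshold);
}

// New behavior: keep a row if the longest-waiting job is old enough.
function filterByMaxQueueTime(
  rows: QueuedJobsRow[],
  threshold: number = SCALE_UP_MAX_QUEUE_TIME_MINUTES,
): QueuedJobsRow[] {
  return rows.filter((row) => row.max_queue_time_minutes >= threshold);
}

// New jobs keep arriving, so the minimum stays low while one job starves.
const rows: QueuedJobsRow[] = [
  {
    runner_label: "example.large",
    num_queued_jobs: 12,
    min_queue_time_minutes: 2,
    max_queue_time_minutes: 45,
  },
  {
    runner_label: "example.small",
    num_queued_jobs: 3,
    min_queue_time_minutes: 1,
    max_queue_time_minutes: 5,
  },
];

const oldResult = filterByMinQueueTime(rows);
const newResult = filterByMaxQueueTime(rows);
```

With this sample data the old filter keeps no rows, because fresh arrivals hold the minimum below the threshold, while the new filter keeps `example.large`, whose oldest job has waited 45 minutes.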