ai/worker: Restart warm containers when they crash #3399
base: master
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             master      #3399        +/-  ##
===================================================
- Coverage   32.11405%  32.11178%  -0.00227%
===================================================
  Files            147        147
  Lines          40789      40795         +6
===================================================
+ Hits           13099      13100         +1
- Misses         26916      26920         +4
- Partials         774        775         +1

... and 1 file with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
@@ -498,6 +499,13 @@ func (m *DockerManager) watchContainer(rc *RunnerContainer, borrowCtx context.Co
 	if failures >= maxHealthCheckFailures && time.Since(startTime) > pipelineStartGracePeriod {
 		slog.Error("Container health check failed too many times", slog.String("container", rc.Name))
 		m.destroyContainer(rc, false)
+		if rc.KeepWarm {
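The truncated hunk above restarts a KeepWarm container after its health check fails too many times. A minimal, self-contained Go sketch of that flow (the RunnerContainer and DockerManager types here are simplified stand-ins, and createWarmContainer is a hypothetical helper, not the real ai-worker API):

```go
package main

import (
	"fmt"
	"log/slog"
)

// Simplified stand-in for ai-worker's RunnerContainer; KeepWarm mirrors the PR.
type RunnerContainer struct {
	Name     string
	Pipeline string
	ModelID  string
	KeepWarm bool
}

type DockerManager struct {
	containers map[string]*RunnerContainer
}

func (m *DockerManager) destroyContainer(rc *RunnerContainer, returnToPool bool) {
	delete(m.containers, rc.Name)
}

// createWarmContainer is an assumed helper: recreate a container with the
// same pipeline/model so the warm slot is not lost after a crash.
func (m *DockerManager) createWarmContainer(pipeline, modelID string) *RunnerContainer {
	rc := &RunnerContainer{
		Name:     pipeline + "_" + modelID,
		Pipeline: pipeline,
		ModelID:  modelID,
		KeepWarm: true,
	}
	m.containers[rc.Name] = rc
	return rc
}

// onHealthCheckFailure mirrors the patched branch in watchContainer: destroy
// the unhealthy container, then recreate it if it was meant to stay warm.
func (m *DockerManager) onHealthCheckFailure(rc *RunnerContainer) {
	slog.Error("Container health check failed too many times", slog.String("container", rc.Name))
	m.destroyContainer(rc, false)
	if rc.KeepWarm {
		m.createWarmContainer(rc.Pipeline, rc.ModelID)
	}
}

func main() {
	m := &DockerManager{containers: map[string]*RunnerContainer{}}
	rc := m.createWarmContainer("text-to-image", "sd-turbo")
	m.onHealthCheckFailure(rc)
	fmt.Println(len(m.containers)) // the warm slot is occupied again
}
```

The key point of the patch is that destruction and recreation happen in the same place, so a crashed warm container never leaves the slot cold.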
I think this will not work when the container is idle. For example, you can test locally:
- Start the Orchestrator with a warm runner => the Orchestrator starts the container. ALL GOOD
- Wait 60s => "Container is idle, returning container to pool". ALL GOOD
- Stop the container (to simulate a crash) => it won't be recreated. NOT GOOD

I think that maybe we should move this watchContainer (or create a similar function), so that the warm container is ALWAYS health-checked (and recreated), not only when it's not idle.
Good catch. Yeah, we only watch containers while they are being used for now. It is unlikely they will crash while idle, but it's not too big of a refactor to change this behavior, so I'll include that here.
Yeah, if you're in a rush, then I'm OK merging this PR as it is. Otherwise, it would be good to also restart IDLE containers. But right, it's unlikely they crash while not doing anything 🙃
What does this pull request do? Explain your changes. (required)
This makes sure that we keep warm containers running even when they crash due to a failed healthcheck.
Specific updates (required)
- Keep state on RunnerContainer so we can repeat the same warm call

How did you test each of these updates (required)
TODO
Does this pull request close any open issues?
Fixes https://linear.app/livepeer/issue/ENG-2443/startup-time-warm-containers-can-go-cold
Checklist:
- make runs successfully
- All tests in ./test.sh pass