ai/worker: Restart warm containers when they crash #3399

Open
wants to merge 1 commit into
base: master

victorges
Member

@victorges victorges commented Feb 18, 2025

What does this pull request do? Explain your changes. (required)
This ensures that warm containers keep running even when they crash due to a
failed health check.

Specific updates (required)

  • Save optimization flags on RunnerContainer so we can repeat the same warm call
  • Restart kept-warm containers when they fail the health check
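
A minimal sketch of the first bullet, assuming simplified stand-in types: `optimizationFlags`, `runnerContainer`, `warm`, and `rewarm` are hypothetical names for illustration, not the actual signatures in ai/worker. It only shows why saving the flags on the container makes the warm call repeatable.

```go
package main

import "fmt"

// optimizationFlags is a hypothetical stand-in for the flags passed to a warm call.
type optimizationFlags map[string]string

// runnerContainer is a simplified, hypothetical version of RunnerContainer.
type runnerContainer struct {
	Pipeline          string
	ModelID           string
	KeepWarm          bool
	OptimizationFlags optimizationFlags // saved so the warm call can be repeated
}

// warm simulates starting a kept-warm container and records the flags on it.
func warm(pipeline, modelID string, flags optimizationFlags) *runnerContainer {
	saved := make(optimizationFlags, len(flags))
	for k, v := range flags {
		saved[k] = v // copy so later changes by the caller do not leak in
	}
	return &runnerContainer{
		Pipeline:          pipeline,
		ModelID:           modelID,
		KeepWarm:          true,
		OptimizationFlags: saved,
	}
}

// rewarm repeats the original warm call for a crashed container.
// It returns nil for containers that were not kept warm.
func rewarm(crashed *runnerContainer) *runnerContainer {
	if !crashed.KeepWarm {
		return nil
	}
	return warm(crashed.Pipeline, crashed.ModelID, crashed.OptimizationFlags)
}

func main() {
	rc := warm("example-pipeline", "example-model", optimizationFlags{"EXAMPLE_FLAG": "true"})
	restarted := rewarm(rc)
	fmt.Println(restarted.Pipeline, restarted.ModelID, restarted.OptimizationFlags["EXAMPLE_FLAG"])
}
```

Without the saved flags, a restart would have to guess how the container was originally configured.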

How did you test each of these updates (required)
TODO

Does this pull request close any open issues?
Fixes https://linear.app/livepeer/issue/ENG-2443/startup-time-warm-containers-can-go-cold

Checklist:

@github-actions github-actions bot added the go (Pull requests that update Go code) and AI (Issues and PR related to the AI-video branch) labels Feb 18, 2025

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.

Project coverage is 32.11178%. Comparing base (232df3a) to head (3e86a1d).

Files with missing lines | Patch % | Lines
ai/worker/docker.go | 45.45455% | 5 Missing and 1 partial ⚠️

@@                 Coverage Diff                 @@
##              master       #3399         +/-   ##
===================================================
- Coverage   32.11405%   32.11178%   -0.00227%     
===================================================
  Files            147         147                 
  Lines          40789       40795          +6     
===================================================
+ Hits           13099       13100          +1     
- Misses         26916       26920          +4     
- Partials         774         775          +1     
Files with missing lines | Coverage Δ
ai/worker/container.go | 43.85965% <ø> (ø)
ai/worker/docker.go | 73.53630% <45.45455%> (-1.04802%) ⬇️

... and 1 file with indirect coverage changes


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 232df3a...3e86a1d. Read the comment docs.

@leszko
Contributor

leszko commented Feb 19, 2025

@@ -498,6 +499,13 @@ func (m *DockerManager) watchContainer(rc *RunnerContainer, borrowCtx context.Co
    if failures >= maxHealthCheckFailures && time.Since(startTime) > pipelineStartGracePeriod {
        slog.Error("Container health check failed too many times", slog.String("container", rc.Name))
        m.destroyContainer(rc, false)
        if rc.KeepWarm {

I think this will not work when the container is idle. For example, you can test locally:

  1. Start Orchestrator with Warm runner => Orchestrator starts the container (ALL GOOD)
  2. Wait 60s => Container is idle and is returned to the pool (ALL GOOD)
  3. Stop Container (simulate that it crashes) => It won't be recreated (NOT GOOD)

I think that maybe we should move this watchContainer call (or create a similar function) so that the warm container is ALWAYS health-checked (and recreated), not only when it's in use.

victorges (Member, Author) replied:

Good catch, yeah, we only watch containers as they are being used for now. It is unlikely they will crash when idle, but I guess it's not too big of a refactor to change this behavior so I'll include that here.

leszko (Contributor) replied:

Yeah, if you're in a rush, I'm OK merging this PR as it is. Otherwise, it would be good to also restart IDLE containers. But right, it's unlikely they'd crash while not doing anything 🙃


linear bot commented Feb 19, 2025

2 participants