ai/worker: Restart warm containers when they crash #3399

victorges · 2025-02-18T22:18:52Z

What does this pull request do? Explain your changes. (required)
This makes sure that we keep the warm containers running even when they crash due to
a failed health check.

Specific updates (required)

Save optimization flags on RunnerContainer so we can repeat the same warm call
Restart kept-warm containers when they fail the health check
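
A rough sketch of how the two updates fit together, not the actual change in this PR: the container remembers the options it was warmed with, so the health-check failure path can repeat the same warm call. `runnerContainer`, `warmFunc`, `handleUnhealthy`, and the `Pipeline`/`ModelID` fields are hypothetical, simplified stand-ins; only `KeepWarm`, `Name`, and the idea of saving the optimization flags come from this PR.

```go
package main

import "fmt"

// runnerContainer is a simplified, hypothetical stand-in for RunnerContainer.
type runnerContainer struct {
	Name     string
	Pipeline string // assumed field, for illustration only
	ModelID  string // assumed field, for illustration only
	KeepWarm bool
	// Saved at creation time so a restart can repeat the same warm call.
	OptimizationFlags map[string]string
}

// warmFunc is a hypothetical hook that starts a warm container.
type warmFunc func(pipeline, modelID string, flags map[string]string) error

// handleUnhealthy destroys the failed container and, if it was kept warm,
// repeats the original warm call with the saved flags.
func handleUnhealthy(rc *runnerContainer, destroy func(*runnerContainer), warm warmFunc) error {
	destroy(rc)
	if !rc.KeepWarm {
		return nil
	}
	if err := warm(rc.Pipeline, rc.ModelID, rc.OptimizationFlags); err != nil {
		return fmt.Errorf("restart warm container %s: %w", rc.Name, err)
	}
	return nil
}

func main() {
	rc := &runnerContainer{
		Name:              "runner-0",
		Pipeline:          "example-pipeline",
		ModelID:           "example-model",
		KeepWarm:          true,
		OptimizationFlags: map[string]string{"EXAMPLE_FLAG": "true"},
	}
	destroy := func(c *runnerContainer) { fmt.Println("destroyed", c.Name) }
	warm := func(pipeline, modelID string, flags map[string]string) error {
		fmt.Println("warming", pipeline, modelID, flags)
		return nil
	}
	if err := handleUnhealthy(rc, destroy, warm); err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing `destroy` and `warm` as functions keeps the sketch independent of Docker and makes the restart path easy to unit test.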

How did you test each of these updates (required)
TODO

Does this pull request close any open issues?
Fixes https://linear.app/livepeer/issue/ENG-2443/startup-time-warm-containers-can-go-cold

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

codecov · 2025-02-18T22:32:10Z

Codecov Report

Attention: Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.

Project coverage is 32.11178%. Comparing base (232df3a) to head (3e86a1d).

Files with missing lines	Patch %	Lines
ai/worker/docker.go	45.45455%	5 Missing and 1 partial ⚠️

Additional details and impacted files

@@                 Coverage Diff                 @@
##              master       #3399         +/-   ##
===================================================
- Coverage   32.11405%   32.11178%   -0.00227%     
===================================================
  Files            147         147                 
  Lines          40789       40795          +6     
===================================================
+ Hits           13099       13100          +1     
- Misses         26916       26920          +4     
- Partials         774         775          +1

Files with missing lines	Coverage Δ
ai/worker/container.go	`43.85965% <ø> (ø)`
ai/worker/docker.go	`73.53630% <45.45455%> (-1.04802%)`	⬇️

... and 1 file with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


leszko · 2025-02-19T09:24:07Z

fix https://linear.app/livepeer/issue/ENG-2441/startup-time-delay-after-payment-created

leszko · 2025-02-19T13:31:13Z

ai/worker/docker.go

@@ -498,6 +499,13 @@ func (m *DockerManager) watchContainer(rc *RunnerContainer, borrowCtx context.Co
 		if failures >= maxHealthCheckFailures && time.Since(startTime) > pipelineStartGracePeriod {
 			slog.Error("Container health check failed too many times", slog.String("container", rc.Name))
 			m.destroyContainer(rc, false)
+			if rc.KeepWarm {


I think this will not work when the container is idle. For example, you can test locally:

1. Start Orchestrator with a warm runner => Orchestrator starts the container (ALL GOOD)
2. Wait 60s => Container is idle and is returned to the pool (ALL GOOD)
3. Stop the container (simulate that it crashes) => It won't be recreated (NOT GOOD)

I think that maybe we should move this watchContainer (or create a similar function) so that the warm container is ALWAYS health-checked (and recreated), not only when it's not idle.

victorges · 2025-02-19

Good catch, yeah, we only watch containers as they are being used for now. It is unlikely they will crash when idle, but I guess it's not too big of a refactor to change this behavior so I'll include that here.

leszko · 2025-02-20

Yeah, if in a rush, then I'm OK merging this PR as it is. Otherwise, it would be good to also restart IDLE containers. But right, it's unlikely they crash while not doing anything 🙃

linear · 2025-02-19T18:17:53Z

ENG-2443 startup time: warm containers can go cold

ai/worker: Restart warm containers when they crash

3e86a1d

github-actions bot added the go and AI labels Feb 18, 2025

leszko reviewed Feb 19, 2025

Review thread on ai/worker/docker.go: leszko (Feb 19, 2025), victorges (Feb 19, 2025), leszko (Feb 20, 2025)
