ai/worker: Restart warm containers when they crash #3399

Open · wants to merge 1 commit into base: master
9 changes: 5 additions & 4 deletions ai/worker/container.go
@@ -36,10 +36,11 @@ type RunnerContainerConfig struct {
	ContainerImageID string

	// For managed containers only
-	ID               string
-	GPU              string
-	KeepWarm         bool
-	containerTimeout time.Duration
+	ID                string
+	GPU               string
+	KeepWarm          bool
+	OptimizationFlags OptimizationFlags
+	containerTimeout  time.Duration
}

// Create global references to functions to allow for mocking in tests.
16 changes: 12 additions & 4 deletions ai/worker/docker.go
@@ -405,10 +405,11 @@
	Endpoint: RunnerEndpoint{
		URL: "http://localhost:" + containerHostPort,
	},
-	ID:               resp.ID,
-	GPU:              gpu,
-	KeepWarm:         keepWarm,
-	containerTimeout: runnerContainerTimeout,
+	ID:                resp.ID,
+	GPU:               gpu,
+	KeepWarm:          keepWarm,
+	OptimizationFlags: optimizationFlags,
+	containerTimeout:  runnerContainerTimeout,
}

rc, err := NewRunnerContainer(ctx, cfg, containerName)
@@ -498,6 +499,13 @@
	if failures >= maxHealthCheckFailures && time.Since(startTime) > pipelineStartGracePeriod {
		slog.Error("Container health check failed too many times", slog.String("container", rc.Name))
		m.destroyContainer(rc, false)
+		if rc.KeepWarm {
Contributor
I think this will not work when the container is idle. For example, you can test locally:

  1. Start Orchestrator with a warm runner => Orchestrator starts the container ALL GOOD
  2. Wait 60s => Container goes idle and is returned to the pool ALL GOOD
  3. Stop the container (to simulate a crash) => It won't be recreated NOT GOOD

I think maybe we should move this watchContainer (or create a similar function) so that the warm container is ALWAYS health-checked (and recreated), not only while it's in use. A rough sketch of that idea is below.
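
Purely as an illustration of that suggestion, a detached watcher could look something like the following sketch. Only m.Warm's signature and the RunnerContainer fields come from this diff; the DockerManager receiver name, watchIdleWarmContainers, m.warmContainers, m.isRunning, and the 10s interval are all assumptions, not code from this PR:

// Sketch only: periodically health-check warm containers even while they
// sit idle in the pool, and re-warm any that have died.
func (m *DockerManager) watchIdleWarmContainers(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, rc := range m.warmContainers() { // hypothetical accessor
				if m.isRunning(rc) { // hypothetical liveness check
					continue
				}
				slog.Info("Idle warm container died, restarting", slog.String("container", rc.Name))
				if err := m.Warm(ctx, rc.Pipeline, rc.ModelID, rc.OptimizationFlags); err != nil {
					slog.Error("Error restarting idle warm container", slog.String("error", err.Error()))
				}
			}
		}
	}
}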

Member Author

Good catch, yeah, we only watch containers while they are being used for now. It's unlikely they will crash while idle, but it's not too big a refactor to change this behavior, so I'll include that here.

Contributor

Yeah, if you're in a rush, I'm OK merging this PR as it is. Otherwise, it would be good to also restart IDLE containers. But right, it's unlikely they crash while not doing anything 🙃

+			slog.Info("Container was kept warm, restarting", slog.String("container", rc.Name))
+			err := m.Warm(context.Background(), rc.Pipeline, rc.ModelID, rc.OptimizationFlags)
+			if err != nil {
+				slog.Error("Error restarting warm container", slog.String("container", rc.Name), slog.String("error", err.Error()))
+			}

Check warning (Codecov / codecov/patch) on ai/worker/docker.go, line 507: added lines #L503-L507 were not covered by tests.
+		}
return
}
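
One design note on the new branch: the restart call uses context.Background() rather than the health-check loop's own context, presumably because that context may already be cancelled by the time the container is torn down. A minimal sketch of the same pattern, with a timeout added as a further suggestion; restartWarm is a hypothetical wrapper, the DockerManager and RunnerContainer types are assumed from context, and only m.Warm's signature is taken from the diff above:

// Sketch, not PR code: restart a warm container on a context detached
// from the (possibly already-cancelled) watcher context, bounded by a
// timeout so a wedged Docker daemon cannot block forever.
func restartWarm(m *DockerManager, rc *RunnerContainer) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := m.Warm(ctx, rc.Pipeline, rc.ModelID, rc.OptimizationFlags); err != nil {
		slog.Error("Error restarting warm container", slog.String("container", rc.Name), slog.String("error", err.Error()))
	}
}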
