feat: Graceful Shutdown with State Persistence#258
feat: Graceful Shutdown with State Persistence#258bupd merged 11 commits intocontainer-registry:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a CLI/env Changes
Sequence Diagram(s)sequenceDiagram
participant OS as OS
participant Main as cmd/main.go
participant Satellite as Satellite
participant Scheduler as Scheduler(s)
participant Replicator as Replicator
OS->>Main: termination signal
Main->>Main: parse --shutdown-timeout / create timeout ctx
Main->>Satellite: Satellite.Stop(ctx)
Satellite->>Satellite: log scheduler_count
Satellite->>Scheduler: Stop(ctx) for each scheduler
Scheduler->>Scheduler: stop scheduling new tasks
Scheduler->>Scheduler: wait for wg (running goroutines) (rgba(0,128,0,0.5))
Replicator->>Replicator: per-item loops check ctx.Done()
alt within timeout
Scheduler->>Satellite: stopped
Satellite->>Main: return success
else timeout
Scheduler->>Satellite: returns ctx error
Satellite->>Main: return error -> exit non-zero
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codacy's Analysis Summary2 new issues (≤ 0 issue) Review Pull Request in Codacy →
|
c1dfbb4 to
0c340f8
Compare
|
Hey @bupd , how we can stop this spam pr reviewer bot 😭 , Codacy . I think this bot is not by harbor |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cmd/main.go`:
- Around line 191-211: The shutdown path currently waits only on wg.Wait()
(errgroup) but not on scheduler goroutines started in satellite.Run; make
Scheduler provide explicit synchronization and block Stop until all its internal
goroutines and spawned process goroutines complete. Add a WaitGroup (or a done
channel) to the Scheduler struct, increment it in Scheduler.Run and around
launchProcess() invocations, and decrement when those goroutines finish; change
Scheduler.Stop to signal cancellation and then wait on that internal WaitGroup
(or close/wait on the done channel) before returning. Finally, update main
shutdown logic to call Scheduler.Stop() (or Scheduler.Wait()) and only consider
shutdownDone after the scheduler's wait has returned so that wg.Wait,
Scheduler.Stop/Wait, and shutdownCtx are all coordinated.
🧹 Nitpick comments (1)
cmd/main.go (1)
99-215: Consider extracting graceful shutdown into a helper to satisfy funlen.
runnow exceeds the configured function-length threshold. A small helper for shutdown logic should keep the linter happy and simplifyrun.♻️ Proposed refactor (extract graceful shutdown)
func run(jsonLogging bool, token, groundControlURL string, shutdownTimeout string) error { @@ - // Wait until context is cancelled - <-ctx.Done() - - // Graceful shutdown with timeout - shutdownDuration, err := time.ParseDuration(shutdownTimeout) - if err != nil { - log.Warn().Err(err).Str("shutdownTimeout", shutdownTimeout). - Msg("Invalid shutdown timeout, defaulting to 30s") - shutdownDuration = 30 * time.Second - } - - log.Info().Dur("timeout", shutdownDuration). - Msg("Received shutdown signal, initiating graceful shutdown") - - // Create a shutdown context with timeout - shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownDuration) - defer shutdownCancel() - - // Stop schedulers to prevent new tasks from being accepted - log.Info().Msg("Stopping schedulers to prevent new replication tasks") - s.Stop(ctx) - - // Wait for in-progress tasks with timeout - log.Info().Msg("Waiting for in-progress replication tasks to complete") - shutdownDone := make(chan struct{}) - go func() { - err := wg.Wait() - if err != nil { - log.Error().Err(err).Msg("Error waiting for goroutines during shutdown") - } - close(shutdownDone) - }() - - select { - case <-shutdownDone: - log.Info().Msg("Graceful shutdown completed successfully") - case <-shutdownCtx.Done(): - log.Warn().Msg("Shutdown timeout exceeded, forcing exit") - return fmt.Errorf("graceful shutdown timeout exceeded") - } - - return nil + return gracefulShutdown(ctx, log, s, wg, shutdownTimeout) } + +func gracefulShutdown(ctx context.Context, log *zerolog.Logger, s *satellite.Satellite, wg *errgroup.Group, shutdownTimeout string) error { + // Wait until context is cancelled + <-ctx.Done() + + shutdownDuration, err := time.ParseDuration(shutdownTimeout) + if err != nil { + log.Warn().Err(err).Str("shutdownTimeout", shutdownTimeout). + Msg("Invalid shutdown timeout, defaulting to 30s") + shutdownDuration = 30 * time.Second + } + + log.Info().Dur("timeout", shutdownDuration). + Msg("Received shutdown signal, initiating graceful shutdown") + + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), shutdownDuration) + defer shutdownCancel() + + log.Info().Msg("Stopping schedulers to prevent new replication tasks") + s.Stop(ctx) + + log.Info().Msg("Waiting for in-progress replication tasks to complete") + shutdownDone := make(chan struct{}) + go func() { + if err := wg.Wait(); err != nil { + log.Error().Err(err).Msg("Error waiting for goroutines during shutdown") + } + close(shutdownDone) + }() + + select { + case <-shutdownDone: + log.Info().Msg("Graceful shutdown completed successfully") + case <-shutdownCtx.Done(): + log.Warn().Msg("Shutdown timeout exceeded, forcing exit") + return fmt.Errorf("graceful shutdown timeout exceeded") + } + + return nil +}As per coding guidelines:
**/*.go: Use strict golangci-lint with limits on function length (100 lines, 50 statements).
6c28cd4 to
840a5cf
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cmd/main.go`:
- Around line 194-196: The code calls s.Stop(ctx) after waiting on <-ctx.Done(),
so ctx is already cancelled; change the call to use the shutdown context created
for teardown (shutdownCtx) instead: locate the scheduler stop call s.Stop(ctx)
and replace it with s.Stop(shutdownCtx) so Stop() receives the live shutdownCtx
(not the cancelled ctx) for deadline/timeout-aware shutdown operations.
bae21a3 to
028f788
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cmd/main.go`:
- Around line 194-215: s.Stop(shutdownCtx) can block forever and is not guarded
by the shutdown timeout; run Stop in its own goroutine (or change Scheduler.Stop
to accept/respect a context) and include its completion in the same select that
races wg.Wait() against shutdownCtx. Concretely: start a goroutine that calls
s.Stop(shutdownCtx) and signals a new channel (e.g., stopDone), keep the
existing goroutine that waits on wg.Wait() signaling shutdownDone, then select
over both stopDone and shutdownDone vs shutdownCtx to determine graceful vs
timed-out shutdown and log/return appropriately; alternatively modify
Scheduler.Stop(ctx context.Context) to return promptly when ctx is done and then
the existing logic can keep calling s.Stop(shutdownCtx).
🧹 Nitpick comments (1)
internal/scheduler/scheduler.go (1)
127-131:Stop()blocks indefinitely without timeout or cancellation.The method waits for all goroutines via
wg.Wait()but has no timeout mechanism or context awareness. IfRun()hasn't exited (e.g., context not cancelled),Stop()will block forever. Consider accepting a context parameter to allow timeout-based waiting, or document that the caller must cancel the scheduler's context before callingStop().♻️ Proposed refactor to add context-aware waiting
-// Stop signals the scheduler to stop and waits for all goroutines to complete -func (s *Scheduler) Stop() { - // Wait for the scheduler's main Run goroutine to complete - s.wg.Wait() +// Stop signals the scheduler to stop and waits for all goroutines to complete. +// The caller must cancel the context passed to Run() before calling Stop(), +// otherwise this will block indefinitely. +func (s *Scheduler) Stop(ctx context.Context) error { + done := make(chan struct{}) + go func() { + s.wg.Wait() + close(done) + }() + + select { + case <-done: + return nil + case <-ctx.Done(): + return ctx.Err() + } }
6dd63fe to
95126a6
Compare
|
Hey @bupd I applied all suggestion of coderabbit mostly . Now can you take a look , if you have a moment :) |
979b0a3 to
84b46d0
Compare
84b46d0 to
3a38c4e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/satellite/satellite.go`:
- Around line 110-115: The current logging is misleading: change the warn log
inside the loop that calls sched.Stop(ctx) to a generic failure message (e.g.,
"Scheduler stop failed") and include the actual error and scheduler name using
sched.Name() and err; do not claim it "timed out" unless you check the error
type. Also track whether any schedulers returned an error (e.g., a bool anyErr
or an error slice) while iterating over sched.Stop(ctx), and replace the
unconditional final log.Info().Msg("All schedulers stopped") with conditional
logging—if anyErr log a warning/info like "Some schedulers failed to stop"
(include count or details), otherwise log "All schedulers stopped". Ensure you
reference sched.Stop, sched.Name, ctx, and err when making these changes.
Resolved :) |
3f0cbb6 to
7a02bb7
Compare
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: intojhanurag <aojharaj2004@gmail.com>
Signed-off-by: bupd <bupdprasanth@gmail.com>
Signed-off-by: bupd <bupdprasanth@gmail.com>
Signed-off-by: bupd <bupdprasanth@gmail.com>
7a02bb7 to
033ca05
Compare
Description
I fixed the issue as described in the issue description.
Closed:
#229
Summary by CodeRabbit
New Features
Improvements