Recovery line may recover parts of different global checkpoints #2

ArcadeMode · 2021-08-16T11:25:08Z

In coordinated mode, the ChandyLamportBarrierSource may generate barriers while a failure is present (but undetected) in the system.
The issue is rarely observed with checkpoint intervals of 20-30 seconds or more. At lower intervals, however, the failure detection time becomes roughly equal to or higher than the checkpoint interval, so barriers are almost guaranteed to be generated while a failure is in the system. This causes a partial global checkpoint to be taken, because barriers won't move past the failed instance. The recovery line calculation then computes a consistent global checkpoint that may include checkpoints from the previous global checkpoint, so that part of the latest global checkpoint and part of the second-latest global checkpoint are restored.
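To make the failure mode concrete, here is a minimal Python sketch. It is not the project's recovery line code: the instance names, the `stored` layout, and both functions are hypothetical, and "newest checkpoint per instance" is only a simplified stand-in for the real recovery line calculation. It shows how a partial global checkpoint 2 leads to a mixed restore, and how restricting the choice to global checkpoints stored by every instance would avoid it.

```python
from typing import Dict, Set

# Hypothetical bookkeeping: instance name -> ids of the global checkpoints
# for which that instance has stored a local checkpoint.
Checkpoints = Dict[str, Set[int]]


def latest_per_instance(checkpoints: Checkpoints) -> Dict[str, int]:
    """Pick each instance's newest checkpoint, ignoring which global
    checkpoint it belongs to (simplified stand-in for the recovery line)."""
    return {name: max(ids) for name, ids in checkpoints.items()}


def latest_complete_global(checkpoints: Checkpoints) -> int:
    """Return the newest global checkpoint id stored by every instance.

    Raises ValueError if no global checkpoint is complete.
    """
    common = set.intersection(*checkpoints.values())
    return max(common)


# Barrier 2 was generated while "operator" had already failed, so it never
# moved past that instance: only "source" took checkpoint 2.
stored = {"source": {1, 2}, "operator": {1}, "sink": {1}}

mixed = latest_per_instance(stored)        # mixes global checkpoints 2 and 1
complete = latest_complete_global(stored)  # restores checkpoint 1 everywhere
```

In this sketch `mixed` restores `source` from global checkpoint 2 and the other instances from global checkpoint 1, which is the behavior described above, while `complete` falls back to checkpoint 1 for all instances.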

A hotfix has been put in place that preemptively stops the barrier generation timer when a connection from the coordinator to a worker fails. This is not a proper fix, but it does significantly reduce how often this behavior occurs. Since the behavior is easily observed in rollback distance metrics, runs in which it is triggered can be re-run with a high probability that the re-run will succeed.
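The idea behind the hotfix can be sketched roughly as follows. This is an illustration in Python, not the actual implementation: the `BarrierTimer` class and its method names are hypothetical, and resuming the timer once recovery completes is an assumption that the description above does not state.

```python
from typing import Callable


class BarrierTimer:
    """Hypothetical barrier generation timer driven by external ticks."""

    def __init__(self, emit_barrier: Callable[[int], None]) -> None:
        self._emit_barrier = emit_barrier
        self._running = True
        self._next_id = 1

    def on_tick(self) -> None:
        """Called once per checkpoint interval; emits a barrier if running."""
        if not self._running:
            return  # suspected failure: do not start a new global checkpoint
        self._emit_barrier(self._next_id)
        self._next_id += 1

    def on_worker_connection_failed(self, worker: str) -> None:
        """Stop generating barriers as soon as a worker connection fails."""
        self._running = False

    def on_recovery_complete(self) -> None:
        """Assumption: barrier generation resumes after recovery."""
        self._running = True


emitted = []
timer = BarrierTimer(emitted.append)
timer.on_tick()                                # barrier 1 is emitted
timer.on_worker_connection_failed("worker-1")  # connection drops
timer.on_tick()                                # no barrier while stopped
timer.on_recovery_complete()
timer.on_tick()                                # barrier 2 is emitted
```

A connection failure is only a proxy for an instance failure, so a barrier can still be generated in the window before the connection is seen to drop. That is consistent with this being a mitigation rather than a proper fix.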
