You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In coordinated mode, the ChandyLamportBarrierSource may generate barriers while a failure is present (but undetected) in the system.
The issue is hardly observed with checkpoint intervals of 20-30 seconds plus. but at lower intervals the detection-time for a failure becomes largely equal or higher than the checkpoint interval, meaning its almost guaranteed to generate barriers while a failure is in the system. this causes a partial global checkpoint to be taken (barriers wont move past the failed instance). The recovery line calculations then computes a consistent global checkpoint which may include checkpoints from the previous global checkpoint, resulting in part of the latest global checkpoint and part of the second latest global checkpoint to be restored.
A hotfix has been put in place that preemptively stops the barrier generation timer when a connection from the coordinator to a worker fails. This is not a proper fix but does reduce the emergence of this behavior significantly since the behavior is easily observed in rollback distance metrics, cases where this behavior is triggered can be re-run with high probability that the run will then be successful.
The text was updated successfully, but these errors were encountered:
In coordinated mode, the ChandyLamportBarrierSource may generate barriers while a failure is present (but undetected) in the system.
The issue is hardly observed with checkpoint intervals of 20-30 seconds plus. but at lower intervals the detection-time for a failure becomes largely equal or higher than the checkpoint interval, meaning its almost guaranteed to generate barriers while a failure is in the system. this causes a partial global checkpoint to be taken (barriers wont move past the failed instance). The recovery line calculations then computes a consistent global checkpoint which may include checkpoints from the previous global checkpoint, resulting in part of the latest global checkpoint and part of the second latest global checkpoint to be restored.
A hotfix has been put in place that preemptively stops the barrier generation timer when a connection from the coordinator to a worker fails. This is not a proper fix but does reduce the emergence of this behavior significantly since the behavior is easily observed in rollback distance metrics, cases where this behavior is triggered can be re-run with high probability that the run will then be successful.
The text was updated successfully, but these errors were encountered: