## "Bad" combination of data batch and model parameter state
The PaLM team observed dozens of loss spikes at "highly irregular intervals" when training larger models. While they were not able to track down the root cause, they mitigated the issue by restarting training from an earlier checkpoint and skipping the potentially problematic data batches. See [Section 5.1 Training instability](https://arxiv.org/pdf/2204.02311.pdf).
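
If you run your own training loop, the same mitigation looks roughly like the following minimal sketch. This is an illustration, not the PaLM code: the HF-style `model(**batch).loss` call, the `resume_step` bookkeeping and the assumption that the dataloader has already been fast-forwarded to the checkpoint position are placeholders for your own setup.

```python
def resume_and_skip(model, optimizer, train_loader, resume_step, skip_batches=200):
    """Resume after a loss spike.

    Assumes the dataloader has already been fast-forwarded to `resume_step`
    (the position of the last known-good checkpoint); the next `skip_batches`
    batches - the ones that coincided with the spike - are discarded before
    normal training continues.
    """
    step = resume_step
    for batch in train_loader:
        step += 1
        if step <= resume_step + skip_batches:
            continue  # drop the potentially problematic batches

        loss = model(**batch).loss  # HF-style model returning an object with .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```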
## Time-domain correlation divergence in Adam
[A Theory on Adam Instability in Large-Scale Machine Learning](https://arxiv.org/abs/2304.09871) performs a rigorous study of divergence spikes while training LLMs of up to 546B parameters and suggests that the time-domain correlation of gradient estimates leads to the divergence of Adam. This is triggered when the epsilon value is not small enough and the gradient estimation components become similar in magnitude to epsilon.
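
To get an intuition for why the magnitude of epsilon matters, here is a small numerical illustration (not taken from the paper) of the standard per-parameter Adam update `m_hat / (sqrt(v_hat) + eps)`: once the moment estimates shrink towards epsilon, the constant starts to dominate the denominator and the update leaves the usual ~±1 regime.

```python
import numpy as np

eps = 1e-8
for grad_scale in (1e-2, 1e-4, 1e-6, 1e-8):
    m_hat = grad_scale        # bias-corrected first moment ~ gradient scale
    v_hat = grad_scale ** 2   # bias-corrected second moment ~ gradient^2
    update = m_hat / (np.sqrt(v_hat) + eps)
    print(f"grad ~ {grad_scale:.0e}  ->  unit-lr update {update:.3f}")

# grad ~ 1e-02  ->  unit-lr update 1.000   (normal, "sign-like" regime)
# grad ~ 1e-04  ->  unit-lr update 1.000
# grad ~ 1e-06  ->  unit-lr update 0.990
# grad ~ 1e-08  ->  unit-lr update 0.500   (epsilon-dominated regime)
```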
In section 7.1 they propose several practical suggestions; the most interesting one is setting epsilon to 0 and explicitly dealing with the resulting division-by-zero condition.
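
A hedged sketch of what that suggestion could look like in a hand-written per-parameter Adam step (this is not the paper's code; the hyperparameter values and the zero-guard strategy are just one possible choice):

```python
import torch

def adam_step_eps0(param, grad, exp_avg, exp_avg_sq, step,
                   lr=1e-4, beta1=0.9, beta2=0.95):
    # update the first and second moment estimates
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # bias correction
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)

    # eps == 0: guard the 0/0 case explicitly - where the second-moment
    # estimate is exactly zero the first moment is zero too, so those
    # elements simply receive no update
    denom = v_hat.sqrt()
    zero = denom == 0
    safe_denom = torch.where(zero, torch.ones_like(denom), denom)
    update = torch.where(zero, torch.zeros_like(m_hat), m_hat / safe_denom)
    param.add_(update, alpha=-lr)
```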