## "Bad" combination of data batch and model parameter state
The PaLM team observed dozens of loss spikes at "highly irregular intervals" when training larger models. While they were not able to track down the root cause, they mitigated the issue by restarting training from an earlier checkpoint and skipping the potentially problematic data batches. See [Section 5.1 Training instability](https://arxiv.org/pdf/2204.02311.pdf).
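
If you run your own training loop, the same mitigation looks roughly like the following minimal sketch. This is an illustration, not the PaLM code: the HF-style `model(**batch).loss` call, the `resume_step` bookkeeping and the assumption that the dataloader has already been fast-forwarded to the checkpoint position are placeholders for your own setup.

```python
def resume_and_skip(model, optimizer, train_loader, resume_step, skip_batches=200):
    """Resume after a loss spike.

    Assumes the dataloader has already been fast-forwarded to `resume_step`
    (the position of the last known-good checkpoint); the next `skip_batches`
    batches - the ones that coincided with the spike - are discarded before
    normal training continues.
    """
    step = resume_step
    for batch in train_loader:
        step += 1
        if step <= resume_step + skip_batches:
            continue  # drop the potentially problematic batches

        loss = model(**batch).loss  # HF-style model returning an object with .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```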
## Time-domain correlation divergence in Adam
[A Theory on Adam Instability in Large-Scale Machine Learning](https://arxiv.org/abs/2304.09871) performs a rigorous study of divergence spikes while training LLMs of up to 546B parameters and suggests that the time-domain correlation of gradient estimates leads to the divergence of Adam. This is triggered when the epsilon value is not small enough and the gradient estimation components become similar in magnitude to epsilon.
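
To get an intuition for why the magnitude of epsilon matters, here is a small numerical illustration (not taken from the paper) of the standard per-parameter Adam update `m_hat / (sqrt(v_hat) + eps)`: once the moment estimates shrink towards epsilon, the constant starts to dominate the denominator and the update leaves the usual ~±1 regime.

```python
import numpy as np

eps = 1e-8
for grad_scale in (1e-2, 1e-4, 1e-6, 1e-8):
    m_hat = grad_scale        # bias-corrected first moment ~ gradient scale
    v_hat = grad_scale ** 2   # bias-corrected second moment ~ gradient^2
    update = m_hat / (np.sqrt(v_hat) + eps)
    print(f"grad ~ {grad_scale:.0e}  ->  unit-lr update {update:.3f}")

# grad ~ 1e-02  ->  unit-lr update 1.000   (normal, "sign-like" regime)
# grad ~ 1e-04  ->  unit-lr update 1.000
# grad ~ 1e-06  ->  unit-lr update 0.990
# grad ~ 1e-08  ->  unit-lr update 0.500   (epsilon-dominated regime)
```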
In section 7.1 they propose several practical suggestions; the most interesting one is setting epsilon to 0 and explicitly dealing with the resulting division-by-zero condition.
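
A hedged sketch of what that suggestion could look like in a hand-written per-parameter Adam step (this is not the paper's code; the hyperparameter values and the zero-guard strategy are just one possible choice):

```python
import torch

def adam_step_eps0(param, grad, exp_avg, exp_avg_sq, step,
                   lr=1e-4, beta1=0.9, beta2=0.95):
    # update the first and second moment estimates
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # bias correction
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)

    # eps == 0: guard the 0/0 case explicitly - where the second-moment
    # estimate is exactly zero the first moment is zero too, so those
    # elements simply receive no update
    denom = v_hat.sqrt()
    zero = denom == 0
    safe_denom = torch.where(zero, torch.ones_like(denom), denom)
    update = torch.where(zero, torch.zeros_like(m_hat), m_hat / safe_denom)
    param.add_(update, alpha=-lr)
```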