Commit 1c0eaf9

Time-domain correlation divergence in Adam
1 parent 6b06a5a commit 1c0eaf9

1 file changed (+8, −0 lines)

training/instabilities/README.md

@@ -73,3 +73,11 @@ The issue is that `alpha` is multiplied after the matrix-matrix multiplication i
## "Bad" combination of data batch and model parameter state

The PaLM team observed dozens of loss spikes at "highly irregular intervals" when training their larger models. While they were not able to track down the root cause, they mitigated the issue by restarting from an earlier checkpoint and skipping the potentially problematic data batches. See [Section 5.1 Training instability](https://arxiv.org/pdf/2204.02311.pdf).
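A minimal sketch of this rewind-and-skip mitigation is below. The training-loop API, spike threshold, skip window, and checkpoint frequency are all hypothetical illustrations, not details from the PaLM paper:

```python
def train_with_spike_recovery(state, batches, train_step, factor=2.0, skip=2):
    """Toy training loop: on a loss spike, roll back and skip suspect batches.

    `train_step(state, batch) -> (new_state, loss)` stands in for a real
    optimizer step; `factor` and `skip` are hypothetical hyperparameters.
    """
    checkpoint = state          # last known-good state
    losses = []                 # loss history for spike detection
    skipped = 0
    i = 0
    while i < len(batches):
        new_state, loss = train_step(state, batches[i])
        recent = losses[-10:]
        if recent and loss > factor * (sum(recent) / len(recent)):
            state = checkpoint  # discard the spiked update
            skipped += skip
            i += skip           # skip the potentially problematic batches
            continue
        state = new_state
        losses.append(loss)
        checkpoint = state      # checkpoint every good step (toy frequency)
        i += 1
    return state, skipped
```

With a toy `train_step` that adds the batch value to the state and reports it as the loss, a single outlier batch triggers a rollback and is skipped rather than applied.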
## Time-domain correlation divergence in Adam
[A Theory on Adam Instability in Large-Scale Machine Learning](https://arxiv.org/abs/2304.09871) performs a rigorous study of divergence spikes while training LLMs at up to 546B parameters, and suggests that time-domain correlation of the gradient estimates leads to the divergence of Adam. This is triggered when the epsilon value is not small enough and the gradient-estimation components become similar in magnitude to epsilon.
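To see why the magnitude of epsilon matters, consider one scalar Adam step. The sketch below uses the standard Adam update equations (hyperparameter values are just illustrative defaults): while the gradient is much larger than epsilon the per-step update magnitude is roughly `lr`, but once the gradient shrinks toward epsilon, the `eps` term dominates the denominator and distorts the update:

```python
import math

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a single scalar parameter; returns (update, m, v)."""
    m = b1 * m + (1 - b1) * g               # first-moment EMA
    v = b2 * v + (1 - b2) * g * g           # second-moment EMA
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# At t=1 the bias-corrected estimates reduce to m_hat = g, sqrt(v_hat) = |g|:
big = adam_update(1e-2, 0.0, 0.0, 1)[0]    # |g| >> eps: update is ~lr
tiny = adam_update(1e-8, 0.0, 0.0, 1)[0]   # |g| ~ eps: update collapses to lr/2
```

In the second case the update size is no longer controlled by the gradient statistics but by the ratio of the gradient to epsilon, which is the regime the paper associates with instability.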
In section 7.1 the authors propose practical suggestions, the most interesting of which is setting epsilon to 0 and dealing with the resulting potential division-by-zero condition.
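One way such a guard could look is the hypothetical sketch below (the paper does not prescribe a specific implementation): with epsilon removed, the denominator is exactly zero only when the second-moment estimate is zero, and that case can be handled by emitting a zero update.

```python
import math

def adam_update_eps_zero(m_hat, v_hat, lr=1e-3):
    """Adam-style update with eps = 0; guards the division-by-zero case.

    With eps removed, sqrt(v_hat) is exactly zero for a parameter whose
    gradients have all been zero so far; return a zero update in that
    case instead of dividing by zero.
    """
    denom = math.sqrt(v_hat)
    if denom == 0.0:
        return 0.0
    return lr * m_hat / denom
```

Note that with eps = 0, a parameter with tiny but nonzero gradient estimates (e.g. `m_hat = 1e-8`, `v_hat = 1e-16`) receives a full-magnitude update of about `lr`, instead of being scaled down by the epsilon term.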
