Commit 331aaaa

added adagard
1 parent 190596e commit 331aaaa

1 file changed

+110
-0
lines changed
  • docs/Deep Learning/Optimizers in Deep Learning

@@ -0,0 +1,110 @@
# Add AdaGrad in Deep Learning Optimizers

This section explains the AdaGrad optimization algorithm used in deep learning and shows how to use it in Keras. AdaGrad adapts the learning rate of each parameter individually, based on the gradients that parameter has received so far.

## Table of Contents
- [Introduction](#introduction)
- [Mathematical Explanation](#mathematical-explanation)
  - [AdaGrad in Gradient Descent](#adagrad-in-gradient-descent)
  - [Update Rule](#update-rule)
- [Implementation in Keras](#implementation-in-keras)
- [Usage](#usage)
- [Results](#results)
- [Advantages of AdaGrad](#advantages-of-adagrad)
- [Limitations of AdaGrad](#limitations-of-adagrad)
- [What Next](#what-next)

## Introduction

AdaGrad (Adaptive Gradient Algorithm) is an optimization method that adjusts the learning rate for each parameter individually, based on that parameter's accumulated squared gradients. This makes it well suited to problems with sparse features: parameters that receive large or frequent gradients have their learning rate scaled down, while rarely updated parameters keep a comparatively large one.

## Mathematical Explanation

### AdaGrad in Gradient Descent

AdaGrad modifies standard gradient descent by dividing the learning rate for each parameter by the square root of the sum of the squares of that parameter's past gradients.

### Update Rule

The update rule for AdaGrad is as follows (all operations are element-wise, with $G_0 = 0$):

1. Accumulate the squared gradients:

$$
G_t = G_{t-1} + g_t^2
$$

2. Update the parameters:

$$
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t
$$

where:
- $G_t$ is the accumulated sum of squares of gradients up to time step $t$
- $g_t$ is the gradient at time step $t$
- $\theta_t$ is the parameter vector after the update at time step $t$
- $\eta$ is the learning rate
- $\epsilon$ is a small constant to prevent division by zero

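The two steps above can be sketched in plain Python. This is only a minimal illustration, not how Keras implements the optimizer: the function name `adagrad_step`, the toy objective $f(x, y) = x^2 + 10y^2$, the starting point, and the learning rate of 0.5 are all assumptions chosen for the example.

```python
import math

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad update in place, element by element (illustrative helper)."""
    for i, g in enumerate(grad):
        accum[i] += g * g                                  # G_t = G_{t-1} + g_t^2
        theta[i] -= lr / (math.sqrt(accum[i]) + eps) * g   # parameter update
    return theta, accum

def grad_f(theta):
    """Gradient of the toy objective f(x, y) = x^2 + 10 * y^2 (assumed for this demo)."""
    x, y = theta
    return [2.0 * x, 20.0 * y]

theta = [3.0, -2.0]   # arbitrary starting point
accum = [0.0, 0.0]    # G_0 = 0
for _ in range(500):
    adagrad_step(theta, grad_f(theta), accum, lr=0.5)

print(theta)  # both coordinates should end up very close to 0
```

Each coordinate keeps its own accumulator, so the steeper `y` direction is automatically given a smaller effective step than the flatter `x` direction.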
## Implementation in Keras

Here is a simple example that trains a model with the built-in AdaGrad optimizer in Keras:

```python
import numpy as np
from keras import Input
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adagrad

# Generate dummy data: 1000 samples, 20 features, random binary labels
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(2, size=(1000, 1))

# Define a simple model with one hidden layer
model = Sequential()
model.add(Input(shape=(20,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with the AdaGrad optimizer
optimizer = Adagrad(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)
```

In this example:
- We generate dummy training data: 1,000 samples with 20 features each and random binary labels.
- We define a simple neural network with one hidden layer of 64 ReLU units and a sigmoid output.
- We compile the model using the AdaGrad optimizer with a learning rate of 0.01.
- We train the model for 50 epochs with a batch size of 32.

## Usage

To use this example, ensure you have the required dependencies installed:

```bash
pip install numpy keras
```

Keras also needs a backend such as TensorFlow to be installed. Then save the example above as a Python script and run it to train a model using the AdaGrad optimizer.

## Results

Keras prints the training loss and accuracy after each epoch. Because the dummy labels are random, these numbers are not a meaningful measure of model quality; they only show that training runs. You can adjust the learning rate and other hyperparameters to see how they affect the training process.

## Advantages of AdaGrad

1. **Adaptive Learning Rates**: AdaGrad adapts the learning rate for each parameter individually, so a single global step size does not have to suit every parameter.
2. **Less Need for Manual Learning Rate Decay**: Since the effective learning rate shrinks automatically as gradients accumulate, AdaGrad reduces the need for hand-tuned learning rate schedules. The initial learning rate still has to be chosen.
3. **Good for Sparse Data**: AdaGrad performs well on problems with sparse features, which are common in natural language processing, because rarely updated parameters keep a relatively large learning rate.

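The sparse-data behaviour can be illustrated with a small sketch. The gradient stream below is hypothetical: one parameter receives a unit gradient on every step and the other only on every tenth step. The helper name `effective_lr` and the values of `lr` and `eps` are assumptions made for this example.

```python
import math

def effective_lr(accum, lr=0.1, eps=1e-8):
    """Per-parameter step multiplier eta / (sqrt(G) + eps) (illustrative helper)."""
    return [lr / (math.sqrt(g) + eps) for g in accum]

# Hypothetical gradient stream: parameter 0 gets a gradient on every step,
# parameter 1 only on every 10th step (a "sparse" feature).
accum = [0.0, 0.0]
for step in range(100):
    grad = [1.0, 1.0 if step % 10 == 0 else 0.0]
    accum = [a + g * g for a, g in zip(accum, grad)]

dense_lr, sparse_lr = effective_lr(accum)
print(f"frequent parameter: {dense_lr:.4f}")
print(f"sparse parameter:   {sparse_lr:.4f}")
```

After 100 steps the accumulators are 100 and 10, so the sparse parameter's effective learning rate is larger by a factor of about $\sqrt{10} \approx 3.16$.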
## Limitations of AdaGrad

1. **Aggressive Learning Rate Decay**: The accumulated sum of squared gradients only ever grows, so the effective learning rate keeps shrinking and can become small enough that learning effectively stalls.
2. **Less Suitable for Dense Gradients**: When most parameters receive sizeable gradients on every step, the accumulator grows quickly for all of them, and the resulting decay can slow down convergence.
3. **Memory Usage**: AdaGrad stores one accumulated value per parameter, which adds optimizer state the same size as the model itself and can be significant for large models.

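The first limitation can be made concrete with a short sketch. It assumes, purely for illustration, a single parameter whose gradient stays at a constant value of 1, with `lr = 0.1`; under that assumption $G_t = t$ and the step size is roughly $\eta / \sqrt{t}$.

```python
import math

lr, eps, g = 0.1, 1e-8, 1.0   # illustrative values; the gradient is held constant
accum = 0.0
step_sizes = {}
for t in range(1, 10001):
    accum += g * g                              # G_t = t * g^2
    step = lr / (math.sqrt(accum) + eps) * g    # size of the update at step t
    if t in (1, 100, 10000):
        step_sizes[t] = step

for t, s in step_sizes.items():
    print(f"t={t:>5}: step size = {s:.4f}")
```

The step size drops from 0.1 at the first step to about 0.01 after 100 steps and about 0.001 after 10,000 steps, even though the gradient itself never changes.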
## What Next

To address these issues, other optimization algorithms have been developed, such as Adam, which replaces the ever-growing sum of squared gradients with an exponentially decaying average. We will cover these in the next section.
