
Commit c98af4a

Merge branch 'CodeHarborHub:main' into master2
2 parents f67363f + 1210ea3 commit c98af4a

File tree

5 files changed: +547 -0 lines changed

blog/authors.yml (+7)
@@ -76,3 +76,10 @@ DharshiBalasubramaniyam:
  title: Full Stack Developer
  url: "https://github.com/DharshiBalasubramaniyam"
  image_url: https://avatars.githubusercontent.com/u/139672976?s=400&v=4

akshitha-chiluka:
  name: Akshitha Chiluka
  title: Software Engineering Undergraduate
  url: https://github.com/AKSHITHA-CHILUKA
  image_url: https://avatars.githubusercontent.com/u/120377576?v=4

@@ -0,0 +1,109 @@

# Add AdaGrad in Deep Learning Optimizers

This section contains an explanation and implementation of the AdaGrad optimization algorithm used in deep learning. AdaGrad is known for its ability to adapt the learning rate based on the frequency of updates for each parameter.

## Table of Contents

- [Introduction](#introduction)
- [Mathematical Explanation](#mathematical-explanation)
  - [AdaGrad in Gradient Descent](#adagrad-in-gradient-descent)
  - [Update Rule](#update-rule)
- [Implementation in Keras](#implementation-in-keras)
- [Usage](#usage)
- [Results](#results)
- [Advantages of AdaGrad](#advantages-of-adagrad)
- [Limitations of AdaGrad](#limitations-of-adagrad)
- [What Next](#what-next)

## Introduction

AdaGrad (Adaptive Gradient Algorithm) is an optimization method that adjusts the learning rate for each parameter individually based on the accumulated squared gradients. This allows the algorithm to perform well in scenarios where sparse features are involved, as it effectively scales down the learning rate for frequently updated parameters.

## Mathematical Explanation

### AdaGrad in Gradient Descent

AdaGrad modifies the standard gradient descent algorithm by adjusting the learning rate for each parameter based on the sum of the squares of the past gradients.

### Update Rule

The update rule for AdaGrad is as follows:

1. Accumulate the squared gradients:

$$
G_t = G_{t-1} + g_t^2
$$

2. Update the parameters:

$$
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t
$$

where:
- $G_t$ is the accumulated sum of squares of gradients up to time step $t$
- $g_t$ is the gradient at time step $t$
- $\theta_t$ are the model parameters at time step $t$
- $\eta$ is the learning rate
- $\epsilon$ is a small constant to prevent division by zero

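The two steps above translate directly into a few lines of NumPy. Below is a minimal, hypothetical sketch of a single AdaGrad step on a toy parameter vector (the names `adagrad_step`, `theta`, and `G` are illustrative, not part of any library):

```python
import numpy as np

eta = 0.01       # learning rate
epsilon = 1e-8   # small constant to prevent division by zero

theta = np.array([0.5, -0.3])   # toy parameters
G = np.zeros_like(theta)        # accumulated squared gradients

def adagrad_step(theta, G, grad):
    # Step 1: accumulate the squared gradients, G_t = G_{t-1} + g_t^2
    G = G + grad ** 2
    # Step 2: update the parameters with a per-parameter scaled learning rate
    theta = theta - (eta / (np.sqrt(G) + epsilon)) * grad
    return theta, G

# Apply one update with an example gradient
grad = np.array([0.2, -0.1])
theta, G = adagrad_step(theta, G, grad)
print(theta, G)
```
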
## Implementation in Keras

Here is a simple implementation of the AdaGrad optimizer using Keras:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adagrad

# Generate dummy data
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(2, size=(1000, 1))

# Define a simple model
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with the AdaGrad optimizer
optimizer = Adagrad(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)
```

In this example:
- We generate some dummy data for training.
- We define a simple neural network model with one hidden layer.
- We compile the model using the AdaGrad optimizer with a learning rate of 0.01.
- We train the model for 50 epochs with a batch size of 32.

## Usage

To use this implementation, ensure you have the required dependencies installed:

```bash
pip install numpy keras
```

Then, you can run the provided script to train a model using the AdaGrad optimizer.

## Results

The results of the training process, including the loss and accuracy, will be displayed after each epoch. You can adjust the learning rate and other hyperparameters to see how they affect the training process.

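For example, here is a hedged sketch of a small learning-rate sweep, reusing `X_train` and `y_train` from the script above (the specific rates are arbitrary choices for illustration):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adagrad

# Retrain the same architecture with a few different learning rates and compare
for lr in [0.1, 0.01, 0.001]:
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=20))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adagrad(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
    print(f"learning_rate={lr}: final training accuracy = {history.history['accuracy'][-1]:.3f}")
```
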
## Advantages of AdaGrad

1. **Adaptive Learning Rates**: AdaGrad adapts the learning rate for each parameter, making it effective for dealing with sparse data and features.
2. **No Need for Manual Learning Rate Decay**: Since AdaGrad automatically decays the learning rate, it eliminates the need to manually set learning rate schedules.
3. **Good for Sparse Data**: AdaGrad performs well on problems with sparse, high-dimensional features, such as many natural language processing tasks.

## Limitations of AdaGrad

1. **Aggressive Learning Rate Decay**: The accumulated gradient sum can grow very large, causing the learning rate to become very small and eventually stalling learning, as illustrated in the sketch below.
2. **Not Suitable for Non-Sparse Data**: For dense data, AdaGrad’s aggressive learning rate decay can slow down convergence, making it less effective.
3. **Memory Usage**: AdaGrad requires storing the sum of squared gradients for each parameter, which can be memory-intensive for large models.

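The first limitation can be seen with a tiny, hypothetical NumPy loop: under a constant gradient, the effective step size $\frac{\eta}{\sqrt{G_t} + \epsilon}$ shrinks roughly like $\eta / \sqrt{t}$:

```python
import numpy as np

eta, epsilon = 0.01, 1e-8
G = 0.0
for t in range(1, 1001):
    g = 1.0                  # assume a constant gradient of 1 for illustration
    G += g ** 2              # the accumulated sum only ever grows
    effective_lr = eta / (np.sqrt(G) + epsilon)
    if t in (1, 10, 100, 1000):
        print(f"t={t}: effective learning rate = {effective_lr:.6f}")
```
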
## What Next

To address these limitations, later optimizers such as Adam incorporate techniques like decaying averages of past squared gradients (as in RMSProp) and momentum, which we'll see in the next section.

@@ -0,0 +1,116 @@

# Add Adam in Deep Learning Optimizers

This section contains an explanation and implementation of the Adam optimization algorithm used in deep learning. Adam (Adaptive Moment Estimation) is a popular optimizer that combines the benefits of two other widely used methods: AdaGrad and RMSProp.

## Table of Contents

- [Introduction](#introduction)
- [Mathematical Explanation](#mathematical-explanation)
  - [Adam in Gradient Descent](#adam-in-gradient-descent)
  - [Update Rule](#update-rule)
- [Implementation in Keras](#implementation-in-keras)
- [Results](#results)
- [Advantages of Adam](#advantages-of-adam)
- [Limitations of Adam](#limitations-of-adam)

## Introduction

Adam is an optimization algorithm that computes adaptive learning rates for each parameter. It combines the advantages of the AdaGrad and RMSProp algorithms by using estimates of the first and second moments of the gradients. Adam is widely used in deep learning due to its efficiency and effectiveness.

## Mathematical Explanation

### Adam in Gradient Descent

Adam optimizes stochastic gradient descent by calculating individual adaptive learning rates for each parameter based on the first and second moments of the gradients.

### Update Rule

The update rule for Adam is as follows:

1. Compute the first moment estimate (mean of gradients):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

2. Compute the second moment estimate (uncentered variance of gradients):

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

3. Correct the bias for the first moment estimate:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
$$

4. Correct the bias for the second moment estimate:

$$
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

5. Update the parameters:

$$
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
$$

where:
- $\theta$ are the model parameters
- $\eta$ is the learning rate
- $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates
- $\epsilon$ is a small constant to prevent division by zero
- $g_t$ is the gradient at time step $t$

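As with AdaGrad, the five steps above can be written as a short NumPy sketch of a single Adam update. This is a minimal, hypothetical illustration (the names `adam_step`, `m`, and `v` are illustrative, not the Keras internals):

```python
import numpy as np

eta, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8  # typical default values

theta = np.array([0.5, -0.3])   # toy parameters
m = np.zeros_like(theta)        # first moment estimate
v = np.zeros_like(theta)        # second moment estimate

def adam_step(theta, m, v, grad, t):
    # Steps 1-2: update the (biased) first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Steps 3-4: bias correction
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 5: parameter update
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta, m, v

# Apply one update with an example gradient (t starts at 1)
grad = np.array([0.2, -0.1])
theta, m, v = adam_step(theta, m, v, grad, t=1)
print(theta)
```
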
## Implementation in Keras

Here is a simple implementation of the Adam optimizer using Keras:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Generate dummy data
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(2, size=(1000, 1))

# Define a simple model
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with the Adam optimizer
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)
```

In this example:
- We generate some dummy data for training.
- We define a simple neural network model with one hidden layer.
- We compile the model using the Adam optimizer with a learning rate of 0.001.
- We train the model for 50 epochs with a batch size of 32.

## Results

The results of the training process, including the loss and accuracy, will be displayed after each epoch. You can adjust the learning rate and other hyperparameters to see how they affect the training process.

## Advantages of Adam

1. **Adaptive Learning Rates**: Adam computes adaptive learning rates for each parameter, which helps in faster convergence.
2. **Momentum**: Adam includes momentum through the first moment estimate, which helps in smoothing the optimization path and escaping shallow local minima.
3. **Bias Correction**: Adam includes bias correction, improving convergence in the early stages of training.
4. **Robustness**: Adam works well in practice for a wide range of problems, including those with noisy gradients or sparse data.

## Limitations of Adam

1. **Hyperparameter Sensitivity**: The performance of Adam is sensitive to the choice of hyperparameters ($\beta_1$, $\beta_2$, $\eta$), which may require careful tuning.
2. **Memory Usage**: Adam requires additional memory to store the first and second moment estimates for every parameter, which can be significant for large models.
3. **Generalization**: Models trained with Adam might not generalize as well as those trained with simpler optimizers like SGD in certain cases; see the comparison sketch after this list.
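
If generalization is a concern, a common sanity check is to train the same architecture with plain SGD and compare validation metrics. A minimal sketch, assuming the `X_train`/`y_train` data from the script above (the momentum value is an arbitrary but common choice):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Rebuild the same architecture so the weights are freshly initialized
sgd_model = Sequential()
sgd_model.add(Dense(64, activation='relu', input_dim=20))
sgd_model.add(Dense(1, activation='sigmoid'))
sgd_model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                  loss='binary_crossentropy', metrics=['accuracy'])

# Hold out 20% of the data for validation to compare against the Adam run
sgd_model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=32)
```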
