Commit 233b563

Merge pull request #4087 from Shantnu-singh/main
Added Gradient decent in Deep Learning Optimizers
2 parents 277bd76 + d258fbf commit 233b563

1 file changed: +131 -0 lines changed
@@ -0,0 +1,131 @@
# Gradient Descent in Deep Learning Optimizers

This repository contains an in-depth explanation and implementation of Gradient Descent, a fundamental optimization algorithm used in deep learning. Gradient Descent minimizes a model's loss function by iteratively updating the model's parameters.

## Table of Contents
- [Introduction](#introduction)
- [Mathematical Explanation](#mathematical-explanation)
  - [Gradient in Gradient Descent](#gradient-in-gradient-descent)
  - [Basic Gradient Descent](#basic-gradient-descent)
  - [Stochastic Gradient Descent (SGD)](#stochastic-gradient-descent-sgd)
  - [Mini-Batch Gradient Descent](#mini-batch-gradient-descent)
  - [Comparison](#comparison)
- [Implementation in Keras](#implementation-in-keras)
- [Usage](#usage)
- [Limitations of Gradient Descent](#problems-with-gradient-descent-as-a-deep-learning-optimizer)
- [Results](#results)

## Introduction

Gradient Descent is an optimization algorithm used to minimize the loss function in machine learning and deep learning models. It works by iteratively adjusting the model parameters in the direction opposite to the gradient of the loss function with respect to those parameters.

## Mathematical Explanation

### Gradient in Gradient Descent

The gradient of a function points in the direction of steepest increase at a given point, and its magnitude measures how steep that increase is. In the context of Gradient Descent, the gradient of the loss function with respect to the parameters indicates how the loss will change for a small change in each parameter.

Mathematically, the gradient is a vector of partial derivatives:

$$\nabla J(\theta) = \left[ \frac{\partial J(\theta)}{\partial \theta_1}, \frac{\partial J(\theta)}{\partial \theta_2}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right]$$

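As a rough illustration of this definition, the sketch below approximates each partial derivative with a central finite difference. The loss function `loss` and the helper `numerical_gradient` are hypothetical names chosen for this example; they are not part of Keras or of the original text.

```python
import numpy as np

def loss(theta):
    # Hypothetical loss for illustration: J(theta) = theta_1^2 + 3 * theta_2^2
    return theta[0] ** 2 + 3.0 * theta[1] ** 2

def numerical_gradient(f, theta, eps=1e-6):
    # Approximate each partial derivative dJ/dtheta_i with a central difference
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

theta = np.array([1.0, 2.0])
grad = numerical_gradient(loss, theta)
# The analytic gradient is [2 * theta_1, 6 * theta_2], so this should be close to [2, 12]
print(grad)
```

In practice, deep learning frameworks compute gradients with automatic differentiation rather than finite differences; the numerical version is mainly useful for checking intuition.
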
### Basic Gradient Descent

The update rule for the parameters $\theta$ in basic gradient descent is:

$$\theta = \theta - \eta \nabla J(\theta)$$

where:
- $\theta$ are the model parameters
- $\eta$ is the learning rate, a small positive number
- $\nabla J(\theta)$ is the gradient of the loss function with respect to the parameters

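The update rule above can be sketched in a few lines of NumPy. This is a minimal example under stated assumptions: the data is a synthetic, noiseless line, the loss is mean squared error, and the names `mse_gradient`, `eta`, and the step count are illustrative choices, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0            # synthetic targets: slope 3, intercept 2

theta = np.zeros(2)                # parameters [slope, intercept]
eta = 0.1                          # learning rate (assumed value)

def mse_gradient(theta, X, y):
    # Gradient of the mean squared error over the ENTIRE dataset
    err = theta[0] * X[:, 0] + theta[1] - y
    return np.array([2.0 * np.mean(err * X[:, 0]), 2.0 * np.mean(err)])

for _ in range(500):
    # theta = theta - eta * grad J(theta)
    theta = theta - eta * mse_gradient(theta, X, y)

print(theta)  # should approach the true slope and intercept
```

Each iteration touches all 200 samples before making a single update, which is what makes this the full-batch form of the algorithm.
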
### Stochastic Gradient Descent (SGD)

In Stochastic Gradient Descent, the parameters are updated after each individual training example rather than after calculating the gradient over the entire dataset.

$$\theta = \theta - \eta \nabla J(\theta; x^{(i)}, y^{(i)})$$

where $(x^{(i)}, y^{(i)})$ represents the $i$-th training example.

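A minimal sketch of the per-example update, assuming a synthetic noiseless line and a squared-error loss. The learning rate and epoch count are arbitrary illustrative values. Because the toy data has no noise, the iterates settle on the true parameters; on real data the per-example updates would keep fluctuating.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0            # synthetic targets: slope 3, intercept 2

theta = np.zeros(2)                # parameters [slope, intercept]
eta = 0.05                         # learning rate (assumed value)

for epoch in range(20):
    # Visit the examples in a fresh random order each epoch
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i, 0], y[i]
        err = theta[0] * x_i + theta[1] - y_i
        # Gradient of the squared error on this single example
        grad = np.array([2.0 * err * x_i, 2.0 * err])
        theta = theta - eta * grad

print(theta)
```

Here one epoch performs 200 parameter updates, one per training example.
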
### Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent (the basic form, which uses the entire dataset for each update) and Stochastic Gradient Descent. It updates the parameters after computing the gradient on a mini-batch of the training data.

$$\theta = \theta - \eta \nabla J(\theta; x^{\text{mini-batch}}, y^{\text{mini-batch}})$$

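The mini-batch variant can be sketched the same way, averaging the gradient over a small slice of shuffled data. The batch size of 32, the learning rate, and the synthetic data are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0            # synthetic targets: slope 3, intercept 2

theta = np.zeros(2)                # parameters [slope, intercept]
eta = 0.1                          # learning rate (assumed value)
batch_size = 32                    # mini-batch size (assumed value)

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # the last batch may be smaller
        x_b, y_b = X[idx, 0], y[idx]
        err = theta[0] * x_b + theta[1] - y_b
        # Gradient of the mean squared error over this mini-batch only
        grad = np.array([2.0 * np.mean(err * x_b), 2.0 * np.mean(err)])
        theta = theta - eta * grad

print(theta)
```

With 200 samples and a batch size of 32, each epoch makes 7 updates: six full batches and one final batch of 8.
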
### Comparison

| Method | Description | Update Frequency | Pros | Cons |
|--------|-------------|------------------|------|------|
| Batch Gradient Descent | Computes the gradient over the entire dataset | Once per epoch | Stable convergence | Slow for large datasets |
| Stochastic Gradient Descent (SGD) | Computes the gradient for each training example | Once per training example | Faster updates; noise can help escape shallow local minima | Noisy updates; may oscillate around a minimum instead of settling |
| Mini-Batch Gradient Descent | Computes the gradient over small batches of the dataset | Once per mini-batch | Balance between speed and stability | Requires tuning of the mini-batch size |

## Implementation in Keras

Here is a simple implementation of Gradient Descent using Keras:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import SGD

# Generate dummy data: 1000 samples, 20 features, binary labels
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(2, size=(1000, 1))

# Define a model with one hidden layer
model = Sequential()
model.add(Input(shape=(20,)))  # explicit input layer instead of input_dim
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Stochastic Gradient Descent (SGD) optimizer
optimizer = SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)
```

In this example:
- We generate some dummy data for training.
- We define a simple neural network model with one hidden layer.
- We compile the model using the SGD optimizer with a learning rate of 0.01.
- We train the model for 50 epochs with a batch size of 32.

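Note that although the optimizer is named `SGD`, the `batch_size=32` argument to `model.fit` means each update averages the gradient over up to 32 examples, so the example performs mini-batch gradient descent. The short sketch below counts how many parameter updates one epoch makes for different batch sizes; `updates_per_epoch` is a hypothetical helper written for this illustration, not a Keras function.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    # The final batch may be smaller, so round up
    return math.ceil(num_samples / batch_size)

num_samples = 1000  # matches the dummy data in the Keras example

mini_batch_updates = updates_per_epoch(num_samples, 32)          # mini-batch
batch_gd_updates = updates_per_epoch(num_samples, num_samples)   # full batch
sgd_updates = updates_per_epoch(num_samples, 1)                  # one example at a time

print(mini_batch_updates, batch_gd_updates, sgd_updates)
```

Setting `batch_size` to the dataset size would give batch gradient descent, and setting it to 1 would give per-example SGD.
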
## Usage

To use this implementation, ensure you have the required dependencies installed. Recent Keras releases also need a backend such as TensorFlow:

```bash
pip install numpy tensorflow keras
```

Then, you can run the example code above to train a model using Gradient Descent.

## Problems with Gradient Descent as a Deep Learning Optimizer

Gradient descent, while a fundamental optimization algorithm, faces several challenges in the context of deep learning:

### 1. Vanishing and Exploding Gradients
* **Problem:** In deep neural networks, gradients can become extremely small (vanishing) or extremely large (exploding) as they propagate backward through many layers.
* **Impact:** Vanishing gradients leave the earlier layers with almost no learning signal, while exploding gradients make the updates unstable.

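A simplified numeric illustration of why this happens: backpropagation multiplies one factor per layer, so factors below 1 shrink the gradient geometrically and factors above 1 grow it. The sketch ignores the weights and uses the sigmoid's maximum slope for the shrinking case; the per-layer factor of 1.5 for the growing case is a hypothetical value picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

max_slope = sigmoid_grad(0.0)
growth_factor = 1.5  # hypothetical per-layer factor greater than 1

vanishing = {}
exploding = {}
for depth in (5, 10, 20):
    # Product of identical per-layer factors across `depth` layers
    vanishing[depth] = max_slope ** depth
    exploding[depth] = growth_factor ** depth
    print(depth, vanishing[depth], exploding[depth])
```

Even in the best case for the sigmoid, 20 layers scale the gradient by less than one part in a trillion, which is why the earliest layers barely move.
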
### 2. Saddle Points and Local Minima
* **Problem:** The optimization landscape of deep neural networks often contains numerous saddle points (points where the gradient is zero but which are neither a minimum nor a maximum) and local minima.
* **Impact:** Gradient descent can stall at or near these points, because the gradient there is zero or very small, which prevents it from reaching the global minimum.

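A toy sketch of stalling at a saddle point, using the function f(x, y) = x² - y², which has a saddle at the origin. The function, the starting points, and the helper names `grad_f` and `run_gd` are assumptions made for this example. Since this function is unbounded below, "escaping" here simply means moving away from the saddle.

```python
import numpy as np

def grad_f(p):
    # Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def run_gd(start, eta=0.1, steps=100):
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - eta * grad_f(p)
    return p

# Starting exactly on the x-axis, the y-gradient is zero, so the
# iterates slide into the saddle point and stay there.
stuck = run_gd([1.0, 0.0])

# A tiny offset in y is amplified at every step and carries the
# iterates away from the saddle.
escaped = run_gd([1.0, 1e-6])

print(stuck, escaped)
```

This is one reason the noise in stochastic and mini-batch updates can help: a small random perturbation plays the role of the tiny offset in the second run.
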
### 3. Slow Convergence
* **Problem:** Gradient descent can be slow to converge, especially for large datasets and complex models.
* **Impact:** This increases training time and computational costs.

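One source of slow convergence can be sketched on a simple quadratic loss whose curvature differs between directions. The learning rate must stay small enough for the steepest direction, so progress along the flat direction becomes slow. The helper `steps_to_converge`, the curvature values, and the tolerance are all illustrative assumptions.

```python
import numpy as np

def steps_to_converge(curvatures, tol=1e-6, max_steps=100_000):
    # Minimize J(theta) = 0.5 * sum(c_i * theta_i^2), whose gradient is c * theta
    curvatures = np.asarray(curvatures, dtype=float)
    eta = 0.9 / curvatures.max()      # step size limited by the steepest direction
    theta = np.ones_like(curvatures)
    for step in range(1, max_steps + 1):
        theta = theta - eta * curvatures * theta
        if np.linalg.norm(curvatures * theta) < tol:
            return step
    return max_steps

well_conditioned = steps_to_converge([1.0, 1.0])
ill_conditioned = steps_to_converge([1.0, 100.0])

print(well_conditioned, ill_conditioned)
```

The second problem has the same minimum but needs far more iterations, purely because one direction is 100 times steeper than the other.
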
To address these issues, various optimization algorithms have been developed, such as Adam and Adagrad, which incorporate techniques like momentum and adaptive learning rates. We will cover them in the next section.

## Results

The results of the training process, including the loss and accuracy, are displayed after each epoch. You can adjust the learning rate and other hyperparameters to see how they affect the training process.
