I use relu.md to illustrate the main ideas behind activation functions, using ReLU as the running example.
In the realm of machine learning, engineers have the liberty to select any differentiable function to serve as an activation function. The inclusion of non-linear elements within a neural network is what allows it to model complex, non-linear relationships; without non-linear activations, a stack of linear layers would collapse into a single equivalent linear transformation.
Activation functions can take either scalar or vector arguments. Scalar activations apply a function to a single number; when applied to a vector, they act individually on each element, maintaining a direct relationship between each input and its corresponding output, which simplifies the computation of derivatives. Popular choices of scalar activation functions are Sigmoid, ReLU, Tanh, and GELU, as shown in the following figure.
If an activation function is applied element-wise to each individual element of the input $Z$, it is a scalar activation. In this case, each element of the output $A$ depends only on the corresponding element of $Z$.
The one-to-one correspondence in scalar activations like ReLU simplifies derivative calculations:
- For $x_1 = -1$, the derivative $\text{ReLU}'(x_1) = 0$ because $x_1$ is negative.
- For $x_2 = 5$, the derivative $\text{ReLU}'(x_2) = 1$ because $x_2$ is positive.
- For $x_3 = -3$, the derivative $\text{ReLU}'(x_3) = 0$ because $x_3$ is negative.
This element-wise approach allows for straightforward computation of derivatives, a key aspect in neural network optimization through backpropagation.
The term "element-wise derivative" refers to the derivative calculated independently for each element of a function's output with respect to its corresponding input element. This concept is widely used in functions applied to vectors or matrices, such as activation functions in neural networks.
When computing the element-wise derivative of an activation function's output $A$ with respect to its input $Z$, each output element is differentiated only with respect to its corresponding input element.
Consider a simple example using a vector input and the ReLU activation function, defined as $\text{ReLU}(x) = \max(0, x)$.
For an input vector $z = [x_1, x_2, x_3] = [-1, 5, -3]$, the output is $A = \text{ReLU}(z) = [0, 5, 0]$.
The element-wise derivative of $A$ with respect to $z$ is computed one element at a time.
For the ReLU function, this derivative is 1 for positive inputs and 0 for negative inputs, giving $[0, 1, 0]$ for this example.
In the context of neural networks, particularly during the backpropagation process, the element-wise derivative plays a crucial role in weight updates. It enables the network to understand how small changes in each weight influence the overall loss, allowing the gradient descent algorithm to minimize the loss function effectively. This individualized computation ensures that each weight is optimized based on its specific contribution to the network's performance.
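A minimal NumPy sketch of this element-wise behavior, using the vector $[-1, 5, -3]$ from the example above:

```python
import numpy as np

z = np.array([-1.0, 5.0, -3.0])   # x1, x2, x3 from the example

a = np.maximum(0.0, z)            # ReLU forward: [0., 5., 0.]
dadz = (z > 0).astype(float)      # element-wise derivative: [0., 1., 0.]

print(a)     # [0. 5. 0.]
print(dadz)  # [0. 1. 0.]
```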
On the other hand, vector activation functions like the Softmax involve outputs that are interdependent on all input elements, complicating the derivative computation process. If the activation function processes each row (or column) of the input as a whole, every output element can depend on every input element of that row, so the derivative is no longer a simple element-wise quantity.
Consider a neural network output for a 3-class classification problem with logits $z = [z_1, z_2, z_3] = [2, 1, -1]$. Applying Softmax gives:
$$\text{Softmax}(z_1) = \frac{e^2}{e^2 + e^1 + e^{-1}}, \quad \text{Softmax}(z_2) = \frac{e^1}{e^2 + e^1 + e^{-1}}, \quad \text{Softmax}(z_3) = \frac{e^{-1}}{e^2 + e^1 + e^{-1}}$$
Each output probability depends on all input logits, illustrating the interconnected nature of vector activation functions.
The derivative of the Softmax function with respect to any input $z_k$ involves all of the output probabilities, not just one.
For instance, the derivative of $\text{Softmax}(z_i)$ with respect to $z_k$ is:
- If $i = k$: $\frac{\partial \text{Softmax}(z_i)}{\partial z_k} = \text{Softmax}(z_i) \cdot (1 - \text{Softmax}(z_k))$
- If $i \neq k$: $\frac{\partial \text{Softmax}(z_i)}{\partial z_k} = -\text{Softmax}(z_i) \cdot \text{Softmax}(z_k)$
This reflects the fact that an increase in one logit not only increases its corresponding probability but also decreases the probabilities of the other classes, due to the shared sum in the denominator.
This interconnectedness makes the derivative computation for vector activation functions like Softmax more complex compared to scalar activations like ReLU.
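To make the two cases concrete, here is a small NumPy sketch (using the logits $[2, 1, -1]$ from the example above) that builds the full Softmax Jacobian as $\mathrm{diag}(s) - s s^\top$, which is exactly the $i = k$ / $i \neq k$ formula written as a matrix:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating; the result is unchanged mathematically.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, -1.0])
s = softmax(z)

# Jacobian of Softmax: J[i, k] = s_i * (1 - s_k) if i == k, else -s_i * s_k
J = np.diag(s) - np.outer(s, s)

print(s)
print(J)                                 # diagonal entries positive, off-diagonal negative
print(np.allclose(J.sum(axis=0), 0.0))   # True: each column sums to 0 because the outputs sum to 1
```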
- Class attributes:
  - Activation functions have no trainable parameters.
  - Variables stored during forward-propagation to compute derivatives during back-propagation: layer output $A$.
- Class methods:
  - $forward$: The forward method takes in a batch of data $Z$ of shape $N \times C$ (representing $N$ samples where each sample has $C$ features) and applies the activation function to $Z$ to compute the output $A$ of shape $N \times C$.
  - $backward$: The backward method takes in $dLdA$, a measure of how the post-activations (output) affect the loss. Using this and the derivative of the activation function itself, the method calculates and returns $dLdZ$, how changes in pre-activation features (input) $Z$ affect the loss $L$. In the case of scalar activations, $dLdZ$ is computed as: $$dLdZ = dLdA \odot \frac{\partial A}{\partial Z}$$

Forward Example:
To illustrate this with an example, let's consider a simple case where we have a batch of 3 samples ($N = 3$) and each sample has 2 features ($C = 2$). So, our input matrix $Z$ could look something like this:

$$Z = \begin{bmatrix} -1 & 2 \\ 3 & -4 \\ 0 & 5 \end{bmatrix}$$

Applying the ReLU activation function, which is defined as $\text{ReLU}(x) = \max(0, x)$, element-wise to $Z$ gives:

$$A = \text{ReLU}(Z) = \begin{bmatrix} 0 & 2 \\ 3 & 0 \\ 0 & 5 \end{bmatrix}$$

As you can see, each element in $A$ depends only on the corresponding element in $Z$: negative entries are clamped to zero and non-negative entries pass through unchanged.
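A quick NumPy check of this forward pass, using the illustrative matrix above:

```python
import numpy as np

Z = np.array([[-1.0,  2.0],
              [ 3.0, -4.0],
              [ 0.0,  5.0]])   # shape (N, C) = (3, 2)

A = np.maximum(0.0, Z)         # ReLU applied element-wise
print(A)
# [[0. 2.]
#  [3. 0.]
#  [0. 5.]]
```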
Let's explore an example of the backward method in a neural network using the ReLU (Rectified Linear Unit) activation function. The ReLU function is defined as $\text{ReLU}(z) = \max(0, z)$.

Suppose we have a neuron that receives a negative pre-activation input, say $z = -2$, so the forward pass produces $a = \text{ReLU}(-2) = 0$.

During the backward pass, we compute $dLdZ$, the gradient of the loss with respect to the pre-activation input $z$.

The derivative of the ReLU function with respect to its input is 1 when the input is positive and 0 when it is negative.

For $z = -2$, this derivative is $\frac{\partial a}{\partial z} = 0$.

Let's say the backward pass provides us with $dLdA = 0.5$, the gradient of the loss with respect to the activation output $a$.

To find $dLdZ$, we multiply $dLdA$ by the derivative of the activation: $dLdZ = dLdA \cdot \frac{\partial a}{\partial z}$.

Substituting the known values: $dLdZ = 0.5 \cdot 0 = 0$.

This result implies that in this scenario, where the ReLU function clamps the negative input to zero, no gradient flows back through this neuron: small changes in $z$ have no effect on the loss.
This example demonstrates computing the gradient of the loss with respect to pre-activation inputs using the backward method and scalar activation functions.
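The same computation for a whole batch is just an element-wise product, as in the scalar-activation formula for $dLdZ$ above; a minimal sketch with arbitrary illustrative values:

```python
import numpy as np

Z    = np.array([[-2.0,  1.5],
                 [ 3.0, -0.5]])   # pre-activations (illustrative values)
dLdA = np.array([[ 0.5, -1.0],
                 [ 2.0,  0.7]])   # gradient of the loss w.r.t. post-activations

dAdZ = (Z > 0).astype(float)      # element-wise ReLU derivative
dLdZ = dLdA * dAdZ                # dLdZ = dLdA element-wise times dA/dZ

print(dLdZ)
# [[ 0. -1.]
#  [ 2.  0.]]
```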
For scalar activation functions, the Jacobian of the output with respect to the input is a diagonal matrix.

To comprehend why, consider a single input of size $1 \times C$ and the $C \times C$ Jacobian of the activation applied to it.

The Jacobian matrix represents all first-order partial derivatives of a function that maps from $\mathbb{R}^n$ to $\mathbb{R}^m$; here both the input and the output have $C$ components.

When scalar activation functions like ReLU or Sigmoid are applied element-wise to a vector, each output element depends only on its corresponding input element, so every partial derivative with respect to a different input element is zero.

Consider the ReLU activation function applied to an input vector $z = [z_1, z_2, \dots, z_C]$, producing $a = [\text{ReLU}(z_1), \text{ReLU}(z_2), \dots, \text{ReLU}(z_C)]$.

The derivative of $a_j$ with respect to $z_k$ is non-zero only when $j = k$.

For an input vector of size $1 \times C$, the Jacobian is therefore a $C \times C$ diagonal matrix whose diagonal entries are the element-wise derivatives $\text{ReLU}'(z_j)$.

The diagonal nature of the Jacobian matrix for element-wise scalar activations arises because each output element is a function of its corresponding input element alone, making all off-diagonal partial derivatives zero.
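A small NumPy sketch that builds this diagonal Jacobian explicitly for one ReLU sample and checks that multiplying by it matches the element-wise product used earlier (the gradient values are illustrative):

```python
import numpy as np

z    = np.array([-1.0, 5.0, -3.0])   # one sample, C = 3
dLdA = np.array([ 0.2, -0.4, 1.0])   # gradient w.r.t. this sample's output (illustrative)

deriv = (z > 0).astype(float)        # element-wise ReLU derivative
J = np.diag(deriv)                   # C x C diagonal Jacobian

# A row vector times a diagonal Jacobian equals the element-wise product.
print(dLdA @ J)                      # [ 0.  -0.4  0. ]
print(dLdA * deriv)                  # same result
```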
The Jacobian of a vector activation function is not a diagonal matrix. For each input vector $Z^{(i)}$ of size $1 \times C$ in the batch, the Jacobian $J^{(i)}$ is a full $C \times C$ matrix, because every output element can depend on every input element.
After computing each $J^{(i)}$, the gradient for that sample is obtained as $dLdZ^{(i)} = dLdA^{(i)} \cdot J^{(i)}$, and the per-sample results are stacked to form $dLdZ$.
Specifically, the Jacobian matrix of a vector-valued function represents the collection of all first-order partial derivatives of the function's outputs with respect to its inputs. Each element in the Jacobian matrix is a partial derivative of one of the function's output components with respect to one of its input components.
In the context of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ with output components $f_1, \dots, f_m$ and input components $x_1, \dots, x_n$, the Jacobian $J$ is the $m \times n$ matrix with entries $J_{jk} = \frac{\partial f_j}{\partial x_k}$.
Here, $f_j$ denotes the $j$-th output component and $x_k$ the $k$-th input component.
In summary, the Jacobian matrix itself is a matrix of derivatives that describes how each component of the output vector changes with respect to changes in each component of the input vector. It is not the input but rather a mathematical object that characterizes the sensitivity of the output to changes in the input.
Consider a neural network layer with 2 samples in a batch ($N = 2$) and $C = 3$ features (classes) per sample, using the Softmax activation.

Our batch of input vectors $Z$ has shape $2 \times 3$, with rows $Z^{(1)}$ and $Z^{(2)}$.

Softmax is applied to each sample (row) independently.

After applying Softmax to each row of $Z$, we obtain the output $A$, also of shape $2 \times 3$, with rows $A^{(1)} = \text{Softmax}(Z^{(1)})$ and $A^{(2)} = \text{Softmax}(Z^{(2)})$; every element of $A^{(i)}$ depends on every element of $Z^{(i)}$.

For Softmax, the Jacobian $J^{(i)}$ of $A^{(i)}$ with respect to $Z^{(i)}$ is a full $3 \times 3$ matrix that must be computed separately for each sample.

For our first sample $Z^{(1)}$, the entries of $J^{(1)}$ follow the Softmax derivative formula:

- For diagonal entries ($j = k$): $J^{(1)}_{jk} = A^{(1)}_j \cdot (1 - A^{(1)}_k)$
- For off-diagonal entries ($j \neq k$): $J^{(1)}_{jk} = -A^{(1)}_j \cdot A^{(1)}_k$

Similarly, $J^{(2)}$ is computed from the second sample's output $A^{(2)}$.

Computing the Gradient

Let's assume we have the gradient of the loss with respect to the activation output $dLdA$ for our batch, a $2 \times 3$ matrix with rows $dLdA^{(1)}$ and $dLdA^{(2)}$.

The gradient with respect to the pre-activations is then computed one sample at a time:

$$dLdZ^{(1)} = dLdA^{(1)} \cdot J^{(1)}, \qquad dLdZ^{(2)} = dLdA^{(2)} \cdot J^{(2)}$$

This operation would be performed for each sample, and the resulting $1 \times 3$ row vectors $dLdZ^{(1)}$ and $dLdZ^{(2)}$ are stacked to form the $2 \times 3$ matrix $dLdZ$.

Due to the complexity of the Softmax derivatives and for brevity, detailed numeric computations of each element of $J^{(1)}$, $J^{(2)}$, and $dLdZ$ are omitted here; the essential steps are calculating each per-sample Jacobian, multiplying each $dLdA^{(i)}$ by its $J^{(i)}$, and stacking the results.

This example illustrates the process of computing the gradient of the loss with respect to the inputs for a layer using a vector activation function, where the interdependence of the inputs in producing the outputs requires the computation of a full Jacobian matrix for each sample.
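The whole procedure can be sketched in a few lines of NumPy; the values of $Z$ and $dLdA$ below are illustrative placeholders rather than the ones from the omitted computation:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))    # stable softmax for a 1-D vector
    return exp_z / exp_z.sum()

Z    = np.array([[2.0, 1.0, -1.0],
                 [0.5, 0.5,  0.0]])  # N = 2 samples, C = 3 classes (illustrative)
dLdA = np.array([[0.1, -0.2, 0.3],
                 [0.4,  0.0, -0.1]]) # gradient w.r.t. the Softmax outputs (illustrative)

dLdZ = np.zeros_like(Z)
for i in range(Z.shape[0]):
    a = softmax(Z[i])                # A^(i), shape (C,)
    J = np.diag(a) - np.outer(a, a)  # full C x C Jacobian for sample i
    dLdZ[i] = dLdA[i] @ J            # dLdZ^(i) = dLdA^(i) . J^(i)

print(dLdZ)                          # shape (2, 3): the stacked per-sample gradients
```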
Consider the following class structure for the scalar activations:
```python
class Activation:

    def forward(self, Z):
        self.A = None  # TODO: apply the activation function to Z
        return self.A

    def backward(self, dLdA):
        dAdZ = None  # TODO: element-wise derivative of A with respect to Z
        dLdZ = None  # TODO: dLdA * dAdZ for scalar activations
        return dLdZ
```
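As one possible way to fill in this skeleton (a sketch, not the official solution), a ReLU subclass could store $A$ in forward and apply the element-wise rule in backward:

```python
import numpy as np

class ReLU(Activation):  # Activation is the base class sketched above

    def forward(self, Z):
        # Store the post-activation output for use during backward.
        self.A = np.maximum(0, Z)
        return self.A

    def backward(self, dLdA):
        # dA/dZ is 1 where the stored output is positive, 0 elsewhere.
        dAdZ = np.where(self.A > 0, 1.0, 0.0)
        dLdZ = dLdA * dAdZ
        return dLdZ
```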
| Code Name | Math | Type | Shape | Meaning |
|---|---|---|---|---|
| N | $N$ | scalar | - | batch size |
| C | $C$ | scalar | - | number of features |
| Z | $Z$ | matrix | $N \times C$ | batch of $N$ pre-activation inputs, each with $C$ features |
| A | $A$ | matrix | $N \times C$ | batch of $N$ post-activation outputs, each with $C$ features |
| dLdA | $\frac{\partial L}{\partial A}$ | matrix | $N \times C$ | how changes in post-activation features affect loss |
| dLdZ | $\frac{\partial L}{\partial Z}$ | matrix | $N \times C$ | how changes in pre-activation features affect loss |
The topology of the activation function is illustrated in Figure C. To understand its context within the broader network architecture, refer back to Figure A.
Note: In this document, we adhere to a specific convention:
- $Z$ represents the output of a linear layer.
- $A$ signifies the input to a linear layer.
In this framework, $A = f(Z)$, where $f$ is the activation function.
This equation highlights how the output of a linear layer, $Z$, is transformed by the activation function into $A$, which then serves as the input to the next linear layer.
Note: $dLdZ$ is used in the backward pass because it directly relates the loss to the parameters we want to optimize (weights and biases) through $Z$, since $Z = W \cdot A_{prev} + b$, followed by $A = f(Z)$, where $f$ is the activation function.

In the case of scalar activations, $dLdZ$ is computed as:

$$dLdZ = dLdA \odot \frac{\partial A}{\partial Z}$$

In the case of vector activation functions, $dLdZ$ is computed as follows. For each input vector $Z^{(i)}$ of size $1 \times C$ and its corresponding output vector $A^{(i)}$ (also $1 \times C$) within the batch, the Jacobian matrix $J^{(i)}$ must be computed individually. This matrix has dimensions $C \times C$. Consequently, the gradient $dLdZ^{(i)}$ for each sample in the batch is determined by:

$$dLdZ^{(i)} = dLdA^{(i)} \cdot J^{(i)}$$
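For a vector activation such as Softmax, a sketch of how this per-sample Jacobian rule might be implemented within the same class interface (an illustration under the conventions above, not the official solution):

```python
import numpy as np

class Softmax:

    def forward(self, Z):
        # Row-wise Softmax; subtracting the row max keeps the exponentials stable.
        exp_Z = np.exp(Z - Z.max(axis=1, keepdims=True))
        self.A = exp_Z / exp_Z.sum(axis=1, keepdims=True)
        return self.A

    def backward(self, dLdA):
        N, C = self.A.shape
        dLdZ = np.zeros((N, C))
        for i in range(N):
            a = self.A[i]                    # A^(i), shape (C,)
            J = np.diag(a) - np.outer(a, a)  # C x C Jacobian for sample i
            dLdZ[i] = dLdA[i] @ J            # dLdZ^(i) = dLdA^(i) . J^(i)
        return dLdZ
```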
Sigmoid: Suitable for binary classification, where you need to output a single probability score for the positive class.
Softmax: Suitable for multi-class classification, where you need to output a probability distribution over multiple classes.
- Context: Binary classification.
- Purpose: Converts a single logit into a probability.
- Operation: Takes a single logit (a raw score from the model) and maps it to a value between 0 and 1, which can be interpreted as a probability.
Given a logit $z$, the sigmoid function is defined as: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- Input: A single logit value (scalar).
- Output: A single probability value between 0 and 1.
```python
import numpy as np

def sigmoid(z):
    # Map a single logit to a probability in (0, 1).
    return 1 / (1 + np.exp(-z))

logit = 2.0
probability = sigmoid(logit)
print(probability)  # Output: 0.8807970779778823
```
- Context: Multi-class classification.
- Purpose: Converts a vector of logits into a probability distribution.
- Operation: Takes a vector of logits (raw scores from the model) and maps them to a probability distribution over multiple classes, where the sum of probabilities is 1.
Given a vector of logits $z = (z_1, \dots, z_K)$, the softmax function is defined as:

$$\sigma(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$$

for $i = 1, \dots, K$.
- Input: A vector of logits.
- Output: A vector of probabilities that sum to 1.
```python
import numpy as np

def softmax(z):
    # Exponentiate each logit, then normalize so the outputs sum to 1.
    # (In practice, subtract np.max(z) from z first for numerical stability.)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=0)

logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)
print(probabilities)  # Output: [0.65900114 0.24243297 0.09856589]
```
Both sigmoid and softmax functions operate on logits, but they are used in different scenarios:
- Sigmoid Function: Used in binary classification to map a single logit to a probability between 0 and 1.
- Softmax Function: Used in multi-class classification to map a vector of logits to a probability distribution over multiple classes.
- CMU_11785_Introduction_To_Deep_Learning