Neural networks are a foundational element of machine learning, inspired by the structure and function of the human brain. These computational models are designed to recognize patterns and make decisions based on input data. At the core of neural networks are layers of interconnected nodes or "neurons," each capable of performing simple calculations. The data flows through these layers, transforming through various operations, leading to the final output which can be used for tasks such as classification, regression, and more.
Among these layers, linear (or fully-connected) layers play a crucial role. They are called fully-connected because each neuron is connected to every neuron in the previous layer, forming a dense network, and linear because they apply a linear transformation to the incoming data, learning to weigh the importance of input features through adjustable parameters known as weights and biases.
Refer to the end-to-end topology figure for a visual representation of a linear layer.
Class attributes:
- Learnable model parameters: weight $W$, bias $b$.
- Variables stored during forward propagation to compute derivatives during backward propagation: layer input $A$, batch size $N$.
- Variables stored during backward propagation to train the model parameters: $dLdW$, $dLdb$.
Class methods:
- __init__: Two parameters define a linear layer: in_features $(C_{in})$ and out_features $(C_{out})$. Zero-initialize weight $W$ and bias $b$ based on these inputs. Refer to Table 1 (Linear Layer Components) to see how the shapes of $W$ and $b$ are related to the inputs. (Hint: check in_features and out_features and create numpy arrays of zeros matching the required shapes of $W$ and $b$ given in Table 1.)
- forward: takes in a batch of data $A$ of shape $N \times C_{in}$ (representing $N$ samples, each with $C_{in}$ features) and computes the output $Z$ of shape $N \times C_{out}$, so that each data sample is now represented by $C_{out}$ features.
- backward: takes in $dLdZ$, how changes in the layer's output $Z$ affect the loss $L$. It calculates and stores $dLdW$ and $dLdb$ (how changes in the layer's weights and bias affect the loss), which are used to improve the model. It returns $dLdA$ (how changes in the layer's input affect the loss) to enable the preceding layer's backward computation.
Please consider the following class structure:
class Linear:

    def __init__(self, in_features, out_features):
        self.W = None  # TODO
        self.b = None  # TODO

    def forward(self, A):
        self.A = None  # TODO
        self.N = None  # TODO: store the batch size
        Z = None  # TODO
        return Z

    def backward(self, dLdZ):
        dLdA = None  # TODO
        dLdW = None  # TODO
        dLdb = None  # TODO
        self.dLdW = dLdW
        self.dLdb = dLdb
        return dLdA
Table 1: Linear Layer Components

| Code Name | Math | Type | Shape | Meaning |
| --- | --- | --- | --- | --- |
| N | $N$ | scalar | - | batch size |
| in_features | $C_{in}$ | scalar | - | number of input features |
| out_features | $C_{out}$ | scalar | - | number of output features |
| A | $A$ | matrix | $N \times C_{in}$ | batch of $N$ inputs, each represented by $C_{in}$ features |
| Z | $Z$ | matrix | $N \times C_{out}$ | batch of $N$ outputs, each represented by $C_{out}$ features |
| W | $W$ | matrix | $C_{out} \times C_{in}$ | weight parameters |
| b | $b$ | matrix | $C_{out} \times 1$ | bias parameters |
| dLdZ | $\frac{\partial L}{\partial Z}$ | matrix | $N \times C_{out}$ | how changes in outputs affect loss |
| dLdA | $\frac{\partial L}{\partial A}$ | matrix | $N \times C_{in}$ | how changes in inputs affect loss |
| dLdW | $\frac{\partial L}{\partial W}$ | matrix | $C_{out} \times C_{in}$ | how changes in weights affect loss |
| dLdb | $\frac{\partial L}{\partial b}$ | matrix | $C_{out} \times 1$ | how changes in bias affect loss |
Note: the shape of dLdZ matches Z's shape because when you compute the gradient of the loss $L$ with respect to the output $Z$, you are asking how the (scalar) loss changes with respect to each individual element of $Z$. Since $Z$ has shape $N \times C_{out}$ (where $N$ is the batch size and $C_{out}$ is the number of output features), dLdZ must have the same shape to hold one partial derivative per element of $Z$.
During forward propagation, we apply a linear transformation to the incoming data $A$ to obtain the output $Z$, using the weight matrix $W$ and the bias vector $b$:

$$Z = A \cdot W^T + 1_N \cdot b^T \in \mathbb{R}^{N \times C_{out}}$$

where $1_N$ is a column vector of size $N$ containing all 1s, used to broadcast the bias across the batch.
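As a minimal NumPy sketch of this forward computation (the sizes and random values below are illustrative assumptions, not from the writeup):

```python
import numpy as np

N, C_in, C_out = 4, 3, 2              # illustrative sizes
A = np.random.randn(N, C_in)          # batch of N inputs
W = np.random.randn(C_out, C_in)      # weights, shape (C_out, C_in)
b = np.random.randn(C_out, 1)         # bias, shape (C_out, 1)
ones = np.ones((N, 1))                # the 1_N column vector

Z = A.dot(W.T) + ones.dot(b.T)        # Z = A W^T + 1_N b^T
assert Z.shape == (N, C_out)

# ones.dot(b.T) simply repeats b^T for every row, i.e. ordinary broadcasting:
assert np.allclose(Z, A.dot(W.T) + b.T)
```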
As mentioned earlier, the objective of backward propagation is to calculate the derivative of the loss with respect to the weight matrix, the bias, and the input to the linear layer, i.e., $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial b}$, and $\frac{\partial L}{\partial A}$.
Given:

$$\frac{\partial Z}{\partial A} = W, \qquad \frac{\partial Z}{\partial W} = A, \qquad \frac{\partial Z}{\partial b} = 1_N$$

In the above equations, $\frac{\partial Z}{\partial A}$, $\frac{\partial Z}{\partial W}$, and $\frac{\partial Z}{\partial b}$ represent how the input, the weight matrix, and the bias respectively affect the output of the linear layer.
Now, applying the chain rule (with transposes arranged so that the shapes match Table 1):

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Z} \cdot W \in \mathbb{R}^{N \times C_{in}}$$

$$\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial Z}\right)^T \cdot A \in \mathbb{R}^{C_{out} \times C_{in}}$$

$$\frac{\partial L}{\partial b} = \left(\frac{\partial L}{\partial Z}\right)^T \cdot 1_N \in \mathbb{R}^{C_{out} \times 1}$$
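These formulas can be sanity-checked with a finite-difference approximation. The sketch below (the sizes and the squared-sum loss are illustrative assumptions) perturbs one weight entry and compares the numerical slope against the analytic $\frac{\partial L}{\partial W}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, C_out = 4, 3, 2
A = rng.normal(size=(N, C_in))
W = rng.normal(size=(C_out, C_in))
b = rng.normal(size=(C_out, 1))
ones = np.ones((N, 1))

def loss(W_):
    # Illustrative scalar loss: L = sum(Z^2) / 2, so dLdZ = Z.
    Z = A.dot(W_.T) + ones.dot(b.T)
    return 0.5 * np.sum(Z ** 2)

Z = A.dot(W.T) + ones.dot(b.T)
dLdZ = Z                                # gradient of the illustrative loss
dLdW = dLdZ.T.dot(A)                    # analytic: dLdW = (dLdZ)^T . A

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps                     # nudge a single weight
numeric = (loss(W_pert) - loss(W)) / eps
assert np.isclose(numeric, dLdW[0, 0], atol=1e-4)
```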
Note: to illustrate why the collection of 2D arrays for $\frac{\partial Z}{\partial b}$ forms a 3D tensor rather than a 2D tensor, let's use a simple numerical example.
Let's consider a linear layer with:

- 2 input features ($C_{in} = 2$)
- 3 output features ($C_{out} = 3$)
- a batch size of 2 ($N = 2$)
Suppose the bias vector is $b = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}^T$, and the output matrix before adding the bias is $A \cdot W^T$ with entries $z_{ij}$. After adding the bias, $Z$ becomes:

$$Z = \begin{bmatrix} z_{11} + b_1 & z_{12} + b_2 & z_{13} + b_3 \\ z_{21} + b_1 & z_{22} + b_2 & z_{23} + b_3 \end{bmatrix}$$
The derivative $\frac{\partial Z}{\partial b}$ represents how changes in each bias element affect each output in $Z$. For each element of $b$, we consider how changing that element affects all elements of $Z$.
Changing $b_1$ affects the first column of $Z$:

$$\frac{\partial Z}{\partial b_1} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}$$

Changing $b_2$ affects the second column of $Z$:

$$\frac{\partial Z}{\partial b_2} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

Changing $b_3$ affects the third column of $Z$:

$$\frac{\partial Z}{\partial b_3} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$

Each of these matrices is a 2D array showing how changing one element of $b$ affects all elements of $Z$. Collectively, these 2D arrays can be thought of as "slices" or "layers" in a 3D structure, where each slice corresponds to the derivative with respect to one bias element:

$$\frac{\partial Z}{\partial b} \in \mathbb{R}^{3 \times 2 \times 3}, \qquad \left(\frac{\partial Z}{\partial b}\right)_{k,i,j} = \frac{\partial Z_{ij}}{\partial b_k} = \begin{cases} 1 & j = k \\ 0 & \text{otherwise} \end{cases}$$
This 3D tensor can be visualized as a stack of the three 2D arrays, one for each bias element. However, in practice, this tensor is often simplified to a 2D or even 1D structure for efficiency, as explained previously.
This example demonstrates why considering the derivative of each output element with respect to each bias element conceptually leads to a 3D tensor, even though the actual computation might be simplified.
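A small NumPy sketch of this idea (sizes taken from the example above; explicitly materializing the 3D tensor is purely illustrative, since the actual implementation never builds it):

```python
import numpy as np

N, C_out = 2, 3

# Build dZ/db explicitly: one (N, C_out) slice per bias element b_k.
dZdb = np.zeros((C_out, N, C_out))
for k in range(C_out):
    dZdb[k, :, k] = 1.0                 # changing b_k shifts column k of Z

# Contracting an arbitrary dLdZ against the 3D tensor ...
dLdZ = np.arange(N * C_out, dtype=float).reshape(N, C_out)
dLdb_3d = np.tensordot(dZdb, dLdZ, axes=([1, 2], [0, 1]))    # shape (C_out,)

# ... matches the simplified 2D formula dLdb = dLdZ^T . 1_N.
dLdb_2d = dLdZ.T.dot(np.ones((N, 1)))                        # shape (C_out, 1)
assert np.allclose(dLdb_3d, dLdb_2d.ravel())
```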
[Figure: my personal understanding of $\frac{\partial Z}{\partial A}$]

Note: why we don't calculate and return $\frac{\partial L}{\partial Z}$
In the context of neural networks and backpropagation, the reason we return dLdA (the gradient of the loss with respect to the input of the layer) instead of dLdZ (the gradient of the loss with respect to the output of the layer) during the backward pass is due to the nature of backpropagation itself. Backpropagation is a method used to calculate the gradient of the loss function with respect to each weight in the network by propagating the error gradient backwards through the network.
Let's go through a specific example to illustrate why dLdA is returned.
Example: Imagine a simple neural network with two layers: a hidden linear layer (Layer 1) and an output linear layer (Layer 2). The output of Layer 1 (Z1) becomes the input to Layer 2 (A2), and the output of Layer 2 (Z2) is used to compute the loss (L).
- Layer 1: input A1, output Z1
- Layer 2: input A2 (which is Z1 from Layer 1), output Z2

When we perform backpropagation, we start from the output of the network and move backward:
- Layer 2 (Output Layer): We first compute dLdZ2, which is the gradient of the loss with respect to the output of Layer 2. This is usually straightforward, since we know the loss function and the output of the network.
- Layer 2 to Layer 1: To update the weights and biases of Layer 2, we compute dLdW2 and dLdb2, which depend on dLdZ2. However, to propagate the error back to Layer 1, we need dLdA2, which represents how changes in the input to Layer 2 (the output of Layer 1, Z1) affect the loss. This is why we compute dLdA in the backward method of Layer 2; dLdA2 is essentially dLdZ1, since A2 is Z1.
- Layer 1 (Hidden Layer): Now that we have dLdZ1 (from dLdA2 of Layer 2), we can perform the backward pass through Layer 1. We use dLdZ1 to compute dLdW1 and dLdb1 for Layer 1. Additionally, if there were layers before Layer 1, we would also need to compute dLdA1 to propagate the error further back.
Why not return dLdZ?
Returning dLdZ from a layer would be redundant in the context of backpropagation: the layer already received dLdZ as its input. What the previous layer needs is the gradient of the loss with respect to its own output, and since the previous layer's output is the current layer's input, that gradient is exactly the current layer's dLdA.
Specific Example:
Imagine you have a loss L at the output of the network, and you're currently backpropagating through Layer 2. You've computed dLdZ2 based on the loss function. To update the weights (W2) and biases (b2) of Layer 2, you use dLdZ2. However, to tell Layer 1 how it should adjust its weights (W1) and biases (b1), you need to communicate how changes in its outputs (Z1, which are inputs to Layer 2: A2) affect the loss. This is done by passing back dLdA2 (which is dLdZ1 from the perspective of Layer 1), allowing Layer 1 to adjust W1 and b1 effectively to minimize the loss.
In summary, during backpropagation, each layer needs to know how changes in its inputs affect the overall loss so that it can adjust its weights and biases accordingly. This is why dLdA is returned from the backward method of each layer and used as dLdZ for the preceding layer in the network.
Here's a breakdown of the process:

- Layer 2 Backward Pass: For Layer 2, you compute $\frac{\partial L}{\partial W_2}$ and $\frac{\partial L}{\partial b_2}$ using $\frac{\partial L}{\partial Z_2}$ (where $Z_2$ is the output of Layer 2). This step captures how changes in Layer 2's weights and biases affect the overall loss $L$, allowing you to update $W_2$ and $b_2$.
- Error Propagation to Layer 1: The gradient $\frac{\partial L}{\partial A_2}$ is computed in Layer 2's backward pass and is essentially $\frac{\partial L}{\partial Z_1}$ from Layer 1's perspective, since $A_2 = Z_1$. This gradient represents how changes in Layer 1's output affect the loss $L$, not just the output $Z_2$ of Layer 2.
- Layer 1 Backward Pass: Using $\frac{\partial L}{\partial Z_1}$, you compute $\frac{\partial L}{\partial W_1}$ and $\frac{\partial L}{\partial b_1}$ for Layer 1. These calculations show how changes in Layer 1's weights and biases affect the overall loss $L$, allowing you to update $W_1$ and $b_1$.

A usage sketch of this two-layer chain, built on the Linear implementation below, follows the code.
import numpy as np

class Linear:

    def __init__(self, in_features, out_features, debug=False):
        """
        Initialize the weights and biases with zeros.
        Check out the np.zeros function.
        Read the writeup to identify the right shapes for all.
        """
        self.W = np.zeros((out_features, in_features))  # W has shape (C_out, C_in)
        self.b = np.zeros((out_features, 1))            # b has shape (C_out, 1)
        self.debug = debug

    def forward(self, A):
        """
        :param A: Input to the linear layer with shape (N, C_in)
        :return: Output Z of linear layer with shape (N, C_out)
        Read the writeup for implementation details.
        """
        self.A = A                        # stored for the backward pass
        self.N = A.shape[0]               # batch size
        self.Ones = np.ones((self.N, 1))  # the 1_N column vector
        Z = self.A.dot(self.W.T) + self.Ones.dot(self.b.T)  # Z = A W^T + 1_N b^T
        return Z

    def backward(self, dLdZ):
        """
        :param dLdZ: Gradient of the loss w.r.t. Z, with shape (N, C_out)
        :return: dLdA, gradient of the loss w.r.t. A, with shape (N, C_in)
        """
        dLdA = dLdZ.dot(self.W)            # dLdA = dLdZ . W
        self.dLdW = dLdZ.T.dot(self.A)     # dLdW = dLdZ^T . A
        self.dLdb = dLdZ.T.dot(self.Ones)  # dLdb = dLdZ^T . 1_N
        if self.debug:
            self.dLdA = dLdA
        return dLdA
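As a usage sketch of the two-layer breakdown above (the layer sizes, batch size, and the random stand-in for a loss gradient are illustrative assumptions), the backward pass of Layer 2 returns dLdA2, which becomes dLdZ1 for Layer 1:

```python
import numpy as np

# Two linear layers: 4 -> 3 -> 2 (illustrative sizes).
layer1 = Linear(in_features=4, out_features=3)
layer2 = Linear(in_features=3, out_features=2)

# Overwrite the zero-initialized weights so the gradients are non-trivial.
rng = np.random.default_rng(0)
layer1.W = rng.normal(size=(3, 4))
layer2.W = rng.normal(size=(2, 3))

A1 = rng.normal(size=(5, 4))        # batch of 5 samples
Z1 = layer1.forward(A1)             # Z1 is A2, the input to Layer 2
Z2 = layer2.forward(Z1)

dLdZ2 = rng.normal(size=Z2.shape)   # stand-in for a loss function's gradient
dLdA2 = layer2.backward(dLdZ2)      # dLdA2 is dLdZ1 from Layer 1's view
dLdA1 = layer1.backward(dLdA2)

assert layer2.dLdW.shape == layer2.W.shape   # (2, 3)
assert layer1.dLdW.shape == layer1.W.shape   # (3, 4)
assert dLdA1.shape == A1.shape               # (5, 4)
```

The key step is that layer2.backward returns dLdA2, which is passed directly into layer1.backward as its dLdZ, exactly as described in the breakdown above.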