Deep Learning

Lecture 2: Multi-layer perceptron

Prof. Gilles Louppe
g.louppe@uliege.be

Today

Explain and motivate the basic constructs of neural networks.

From linear discriminant analysis to logistic regression
Stochastic gradient descent
From logistic regression to the multi-layer perceptron
Vanishing gradients and rectified networks
Universal approximation theorem

Neural networks

Perceptron

The Mark I Perceptron (Rosenblatt, 1960) is one of the earliest instances of a neural network.

.center.width-80[]

???

A perceptron is a signal transmission network consisting of sensory units (S units), association units (A units), and output or response units (R units). The ‘retina’ of the perceptron is an array of sensory elements (photocells). An S-unit produces a binary output depending on whether or not it is excited. A randomly selected set of retinal cells is connected to the next level of the network, the A units. As originally proposed there were extensive connections among the A units, the R units, and feedback between the R units and the A units.

In essence an association unit is also an MCP neuron which is 1 if a single specific pattern of inputs is received, and it is 0 for all other possible patterns of inputs. Each association unit will have a certain number of inputs which are selected from all the inputs to the perceptron. So the number of inputs to a particular association unit does not have to be the same as the total number of inputs to the perceptron, but clearly the number of inputs to an association unit must be less than or equal to the total number of inputs to the perceptron. Each association unit's output then becomes the input to a single MCP neuron, and the output from this single MCP neuron is the output of the perceptron. So a perceptron consists of a "layer" of MCP neurons, and all of these neurons send their output to a single MCP neuron.

The Mark I Percetron was implemented in hardware.

The machine could learn to classify simple images.

The Mark I Perceptron is composed of association and response units (or "perceptrons"), each acting as a binary classifier that computes a linear combination of its inputs and applies a step function to the result.

In the modern sense, given an input $\mathbf{x} \in \mathbb{R}^p$, each unit computes its output as $$f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\ 0 &\text{otherwise} \end{cases}$$

The classification rule can be rewritten as $$f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b)$$ where $\text{sign}(x)$ is the non-linear activation function $$\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\ 0 &\text{otherwise} \end{cases}$$

Computational graphs

.grid[ .kol-3-5[.width-90[]] .kol-2-5[ The computation of $$f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b)$$ can be represented as a computational graph where

white nodes correspond to inputs and outputs;
red nodes correspond to model parameters;
blue nodes correspond to intermediate operations. ] ]

???

Draw the NN diagram.

In terms of tensor operations, $f$ can be rewritten as $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b),$$ for which the corresponding computational graph of $f$ is:

.center.width-70[]

???

Ask about their intuition on the intuitive meaning of $f(x)$ (i.e., the product as a similarity measure).

Linear discriminant analysis

Consider training data $(\mathbf{x}, y) \sim p_{X,Y}$, with

$\mathbf{x} \in \mathbb{R}^p$,
$y \in \{0,1\}$.

Assume class populations are Gaussian, with same covariance matrix $\Sigma$ (homoscedasticity):

$$p(\mathbf{x}|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp \left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_y)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_y) \right)$$

???

Switch to blackboard.

Using the Bayes' rule, we have:

$$\begin{aligned} p(y=1|\mathbf{x}) &= \frac{p(\mathbf{x}|y=1) p(y=1)}{p(\mathbf{x})} \\\ &= \frac{p(\mathbf{x}|y=1) p(y=1)}{p(\mathbf{x}|y=0)p(y=0) + p(\mathbf{x}|y=1)p(y=1)} \\\ &= \frac{1}{1 + \frac{p(\mathbf{x}|y=0)p(y=0)}{p(\mathbf{x}|y=1)p(y=1)}}. \end{aligned}$$

--

It follows that with

$$\sigma(x) = \frac{1}{1 + \exp(-x)},$$

we get

$$p(y=1|\mathbf{x}) = \sigma\left(\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=0)} + \log \frac{p(y=1)}{p(y=0)}\right).$$

Therefore,

$$\begin{aligned} &p(y=1|\mathbf{x}) \\\ &= \sigma\left(\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=0)} + \underbrace{\log \frac{p(y=1)}{p(y=0)}}_{a}\right) \\\ &= \sigma\left(\log p(\mathbf{x}|y=1) - \log p(\mathbf{x}|y=0) + a\right) \\\ &= \sigma\left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_1)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_1) + \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_0)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_0) + a\right) \\\ &= \sigma\left(\underbrace{(\mu_1-\mu_0)^T \Sigma^{-1}}_{\mathbf{w}^T}\mathbf{x} + \underbrace{\frac{1}{2}(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1) + a}_{b} \right) \\\ &= \sigma\left(\mathbf{w}^T \mathbf{x} + b\right) \end{aligned}$$

Note that the sigmoid function $$\sigma(x) = \frac{1}{1 + \exp(-x)}$$ looks like a soft heavyside:

Therefore, the overall model $f(\mathbf{x};\mathbf{w},b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$ is very similar to the perceptron.

.center.width-70[]

This unit is the main primitive of all neural networks!

Logistic regression

Same model $$p(y=1|\mathbf{x}) = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$$ as for linear discriminant analysis.

But,

ignore model assumptions (Gaussian class populations, homoscedasticity);
instead, find $\mathbf{w}, b$ that maximizes the likelihood of the data.

???

Switch to blackboard.

We have,

$$\begin{aligned} &\arg \max_{\mathbf{w},b} p(\mathbf{d}|\mathbf{w},b) \\\ &= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y=y_i|\mathbf{x}_i, \mathbf{w},b) \\\ &= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \sigma(\mathbf{w}^T \mathbf{x}_i + b)^{y_i} (1-\sigma(\mathbf{w}^T \mathbf{x}_i + b))^{1-y_i} \\\ &= \arg \min_{\mathbf{w},b} \underbrace{\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} -{y_i} \log\sigma(\mathbf{w}^T \mathbf{x}_i + b) - {(1-y_i)} \log (1-\sigma(\mathbf{w}^T \mathbf{x}_i + b))}_{\mathcal{L}(\mathbf{w}, b) = \sum_i \ell(y_i, \hat{y}(\mathbf{x}_i; \mathbf{w}, b))} \end{aligned}$$

This loss is an instance of the cross-entropy $$H(p,q) = \mathbb{E}_p[-\log q]$$ for $p=p_{Y|\mathbf{x}_i}$ and $q=p_{\hat{Y}|\mathbf{x}_i}$.

Multi-layer perceptron

So far we considered the logistic unit $h=\sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$, where $h \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{w} \in \mathbb{R}^p$ and $b \in \mathbb{R}$.

These units can be composed in parallel to form a layer with $q$ outputs: $$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})$$ where $\mathbf{h} \in \mathbb{R}^q$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{W} \in \mathbb{R}^{p\times q}$, $b \in \mathbb{R}^q$ and where $\sigma(\cdot)$ is upgraded to the element-wise sigmoid function.

.center.width-70[![](figures/lec2/graphs/layer.svg)]

???

Draw the NN diagram.

Similarly, layers can be composed in series, such that: $$\begin{aligned} \mathbf{h}_0 &= \mathbf{x} \\ \mathbf{h}_1 &= \sigma(\mathbf{W}_1^T \mathbf{h}_0 + \mathbf{b}_1) \\ ... \\ \mathbf{h}_L &= \sigma(\mathbf{W}_L^T \mathbf{h}_{L-1} + \mathbf{b}_L) \\ f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}_L \end{aligned}$$ where $\theta$ denotes the model parameters $\{ \mathbf{W}_k, \mathbf{b}_k, ... | k=1, ..., L\}$.

This model is the multi-layer perceptron, also known as the fully connected feedforward network.

???

Draw the NN diagram.

Output layers

For binary classification, the width $q$ of the last layer $L$ is set to $1$ and the activation function is the sigmoid $\sigma(\cdot) = \frac{1}{1 + \exp(-\cdot)}$, which results in a single output $h_L \in [0,1]$ that models the probability $p(y=1|\mathbf{x})$.
For multi-class classification, the sigmoid activation $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}_L \in \bigtriangleup^C$ of probability estimates $p(y=i|\mathbf{x})$. This activation is the $\text{Softmax}$ function, where its $i$-th output is defined as $$\text{Softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^C \exp(z_j)},$$ for $i=1, ..., C$.
For regression, the width $q$ of the last layer $L$ is set to the dimensionality of the output $d_\text{out}$ and the activation function is the identity $\sigma(\cdot) = \cdot$, which results in a vector $\mathbf{h}_L \in \mathbb{R}^{d_\text{out}}$.

???

Draw each.

(demo)

Expressiveness

Let us consider the 1-hidden layer MLP $$f(x) = \sum w_i \text{sign}(x + b_i).$$ This model can approximate any smooth 1D function to arbitrary precision, provided enough hidden units.

.bold[Universal approximation theorem.] (Cybenko 1989; Hornik et al, 1991) Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_p$ denote the $p$-dimensional hypercube, and $C(I_p)$ denote the space of continuous functions on $I_p$. Given any $f \in C(I_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v_i, w_i, b_i, i=1, ..., q$ such that $$F(x) = \sum_{i \leq q} v_i \sigma(w_i^T x + b_i)$$ satisfies $$\sup_{x \in I_p} |f(x) - F(x)| < \epsilon.$$

It guarantees that even a single hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth);
It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993).

Training

Loss functions

The parameters (e.g., $\mathbf{W}_k$ and $\mathbf{b}_k$ for each layer $k$) of an MLP $f(\mathbf{x}; \theta)$ are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \{ (\mathbf{x}_j, \mathbf{y}_j) \}$ of input-output pairs.

The loss function is derived from the likelihood:

For classification, assuming a categorical likelihood, the loss is the cross-entropy $\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(\mathbf{x}_j, \mathbf{y}_j) \in \mathbf{d}} \sum_{i=1}^C y_{ji} \log f_{i}(\mathbf{x}_j; \theta)$.
For regression, assuming a Gaussian likelihood, the loss is the mean squared error $\mathcal{L}(\theta) = \frac{1}{N} \sum_{(\mathbf{x}_j, \mathbf{y}_j) \in \mathbf{d}} (\mathbf{y}_j - f(\mathbf{x}_j; \theta))^2$.

???

Switch to blackboard.

Gradient descent

To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimum.

For $\theta_0 \in \mathbb{R}^d$, a first-order approximation around $\theta_0$ can be defined as $$\hat{\mathcal{L}}(\epsilon; \theta_0) = \mathcal{L}(\theta_0) + \epsilon^T\nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma}||\epsilon||^2.$$

.center.width-60[]

???

Switch to blackboard.

A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta_0)$ is given for $$\begin{aligned} \nabla_\epsilon \hat{\mathcal{L}}(\epsilon; \theta_0) &= 0 \\ &= \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma} \epsilon, \end{aligned}$$ which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.

Therefore, model parameters can be updated iteratively using the update rule $$\theta_{t+1} = \theta_t -\gamma \nabla_\theta \mathcal{L}(\theta_t),$$ where

$\theta_0$ are the initial parameters of the model;
$\gamma$ is the learning rate;
both are critical for the convergence of the update rule.