CMU-10707 Deep Learning. #40

NorbertZheng opened this issue Oct 23, 2022 · 11 comments

Related Reference


Introduction to Machine Learning, Regression


Important Breakthroughs


Representation Learning

Examples of Representation Learning

How do we transform the data (e.g. with polynomial features, kernel methods, etc.) so that it becomes linearly separable in the feature space for the learning algorithms that follow?
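As a minimal sketch of such a feature transformation (the toy data and the hand-picked separator are assumptions for illustration): a 1-D dataset that no single threshold on $x$ can split becomes linearly separable after the polynomial map $\phi(x)=(x,x^{2})$.

```python
import numpy as np

# 1-D data: class 1 lies inside [-1, 1], class 0 outside -- not linearly
# separable in the original space (no single threshold on x works).
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Polynomial feature map phi(x) = (x, x^2): in this 2-D feature space the
# classes are separated by the line x^2 = 1, a *linear* decision boundary.
phi = np.stack([x, x**2], axis=1)

# A linear classifier w . phi(x) + b with w = (0, -1), b = 1 separates them.
w, b = np.array([0.0, -1.0]), 1.0
pred = (phi @ w + b > 0).astype(int)  # agrees with y on every example
```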

Feature Engineering

  • Computer Vision Features.
  • Audio Features.
  • Representation Learning: Can we automatically learn these representations?


Types of Learning

Consider observing a series of input vectors:

$$
\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\mathbf{x}_{4},\dots
$$

  • Supervised Learning: We are also given target outputs (labels, responses), $y_{1},y_{2},\dots$, and the goal is to predict the correct output given a new input.

    • Classification: Target outputs $y_{i}$ are discrete class labels. The goal is to correctly classify new inputs.
    • Regression: Target outputs $y_{i}$ are continuous. The goal is to predict the output given new inputs.
  • Unsupervised Learning: The goal is to build a statistical model of $\mathbf{x}$ that finds a useful representation of the data.

    • Clustering.
    • Dimensionality reduction.
    • Modeling the data density.
    • Finding hidden causes (useful explanation) of the data.


  • Reinforcement Learning: The model (agent) produces a set of actions, $a_{1},a_{2},\dots$, that affect the state of the world, and receives rewards $r_{1},r_{2},\dots$. The goal is to learn actions that maximize the reward.

  • Semi-supervised Learning: We are given only a limited amount of labels, but lots of unlabeled data.


Neural Networks I


Feed-forward Neural Networks



Activation Function

The output range of a unit is determined by $g(\cdot)$; the bias only shifts the position of the curve.

  • Linear activation function: $g(a)=a$.

    • No nonlinear transformation.
    • No input squashing.


  • Sigmoid activation function: $g(a)=\mathrm{sigm}(a)=\frac{1}{1+\exp(-a)}$.

    • Squashes the neuron's output between 0 and 1.
    • Always positive.
    • Bounded.
    • Strictly increasing.


  • Hyperbolic tangent ("tanh") activation function: $g(a)=\tanh(a)=\frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}=\frac{\exp(2a)-1}{\exp(2a)+1}$.

    • Squashes the neuron's activation between -1 and 1.
    • Can be positive or negative.
    • Bounded.
    • Strictly increasing.


  • Rectified linear (ReLU) activation function: $g(a)=\mathrm{reclin}(a)=\max(0,a)$.

    • Bounded below by 0 (always non-negative).
    • Tends to produce units with sparse activities.
    • Not upper bounded.
    • Strictly increasing.


  • Softmax activation function: $g(\mathbf{a})=\mathrm{softmax}(\mathbf{a})=\left[\frac{\exp(a_{1})}{\sum_{c}\exp(a_{c})},\dots,\frac{\exp(a_{C})}{\sum_{c}\exp(a_{c})}\right]^{T}$.

    • Strictly positive.
    • Sums to one.
    • Well-suited to multi-way classification, e.g. as the output layer of a discriminative classifier.
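The activation functions above can be sketched in NumPy (a minimal illustration; the function names mirror the notation in the list, and the softmax subtracts the max before exponentiating, a standard trick for numerical stability):

```python
import numpy as np

def linear(a):
    return a  # identity: no nonlinear transformation, no squashing

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))  # squashes output into (0, 1)

def tanh(a):
    return np.tanh(a)  # squashes activation into (-1, 1)

def reclin(a):
    return np.maximum(0.0, a)  # ReLU: non-negative, sparse, unbounded above

def softmax(a):
    # subtract the max for numerical stability; output is positive, sums to 1
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([-2.0, 0.0, 3.0])
p = softmax(a)  # a valid probability distribution over 3 classes
```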


Universal Approximation

  • Universal Approximation Theorem (Hornik, 1991):
    • A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.
  • This applies for sigmoid, tanh and many other activation functions.
  • However, this does not mean that there is a learning algorithm that can find the necessary parameter values.

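One intuition behind the theorem can be sketched numerically: the linear output layer can subtract two steep sigmoid hidden units to build a localized "bump", and enough such bumps can tile any continuous function. The steepness and bump location below are arbitrary choices for illustration:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Two steep sigmoid hidden units combined by a linear output layer with
# weights (+1, -1): h1 switches on near x = 0.4, h2 near x = 0.6, so their
# difference is ~1 on [0.4, 0.6] and ~0 elsewhere -- a "bump".
x = np.linspace(0.0, 1.0, 201)
steep = 100.0  # arbitrary steepness (assumption)
bump = sigm(steep * (x - 0.4)) - sigm(steep * (x - 0.6))
```

Summing many such bumps with different heights and locations approximates an arbitrary continuous target, which is the constructive idea behind the theorem.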


Stochastic Gradient Descent

Perform updates after seeing each example:

  • Initialize: $\theta=\{\mathbf{W}^{(1)},\mathbf{b}^{(1)},\cdot\cdot\cdot,\mathbf{W}^{(L+1)},\mathbf{b}^{(L+1)}\}$.
  • For $t=1:T$
    • For each training example $(\mathbf{x}^{(t)},y^{(t)})$ (one full pass over all examples is a training epoch):
      • $\Delta=-\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})-\lambda\nabla_{\theta}\Omega(\theta)$
      • $\theta \leftarrow \theta + \alpha\Delta$

To train a neural net, we need:

  • Loss function & Regularizer: $l(\mathbf{f}(\mathbf{x}^{(t)};\theta),y^{(t)})$, $\Omega(\theta)$.
    • Let us start by considering a classification problem with a softmax output layer.
    • We need to estimate: $f_{c}(\mathbf{x})=p(y=c|\mathbf{x})$.
      • We can maximize the log-probability of the correct class given an input: $\log p(y^{(t)}=c|\mathbf{x}^{(t)})$.
      • Alternatively, we can minimize the negative log-likelihood (i.e. the cross-entropy for a multi-class classification problem): $l(f(\mathbf{x}),y)=-\sum_{c}1_{(y=c)}\log f_{c}(\mathbf{x})=-\log f_{y}(\mathbf{x})$.
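The update loop above can be sketched for a softmax output layer trained with cross-entropy and an L2 regularizer $\Omega(\theta)=\lVert\mathbf{W}\rVert^{2}/2$ (the toy two-blob dataset and all hyperparameters below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy two-blob dataset: C = 2 classes, D = 2 features (assumed setup).
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

C, D = 2, 2
W, b = np.zeros((C, D)), np.zeros(C)  # theta = {W, b}
alpha, lam = 0.1, 1e-3                # learning rate, regularization strength

for t in range(5):                    # T epochs
    for i in rng.permutation(len(X)): # one pass over all examples = 1 epoch
        f = softmax(W @ X[i] + b)     # f(x) estimates p(y = c | x)
        e_y = np.eye(C)[y[i]]         # one-hot vector e(y)
        grad_a = -(e_y - f)           # loss gradient at output pre-activation
        # Delta = -grad l - lam * grad Omega, then theta <- theta + alpha*Delta
        W += alpha * (-np.outer(grad_a, X[i]) - lam * W)
        b += alpha * (-grad_a)

preds = np.array([softmax(W @ x + b).argmax() for x in X])
acc = (preds == y).mean()
```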


A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$.
Consider a network with $L$ hidden layers.

  • Hidden layer ( $k \in [1,L]$ ) pre-activation:

$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$

  • Hidden layer ( $k \in [1,L]$ ) activation:

$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$

  • Output layer ( $k=L+1$ ) activation:

$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$
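These three equations translate directly into a forward pass. This is a minimal sketch assuming sigmoid hidden activations for $\mathbf{g}$ and a softmax output for $\mathbf{o}$; the layer sizes and random weights are purely illustrative:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward pass through L hidden layers.

    Ws, bs hold W^(1)..W^(L+1) and b^(1)..b^(L+1).
    """
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h              # pre-activation a^(k)(x)
        h = sigm(a)                # activation h^(k)(x)
    a_out = bs[-1] + Ws[-1] @ h    # output pre-activation a^(L+1)(x)
    return softmax(a_out)          # f(x) = o(a^(L+1)(x))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # L = 1 hidden layer
bs = [np.zeros(4), np.zeros(2)]
f = forward(rng.normal(size=3), Ws, bs)  # distribution over 2 classes
```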

Loss gradient at output.

  • Partial derivative:

$$ \frac{\partial}{\partial f_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)=-\frac{1_{(y=c)}}{f_{y}(\mathbf{x})} $$

  • Gradient:

$$ \begin{aligned} &\nabla_{f(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&-\frac{1}{f_{y}(\mathbf{x})}\left[\begin{matrix} 1_{(y=0)}\\ \vdots\\ 1_{(y=C-1)}\end{matrix}\right]\\ =&-\frac{\mathbf{e}(y)}{f_{y}(\mathbf{x})} \end{aligned} $$

Loss gradient at output pre-activation.

  • Partial derivative:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

  • Gradient:

$$ \begin{aligned} &\nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&-(\mathbf{e}(y)-f(\mathbf{x})) \end{aligned} $$
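This closed-form gradient can be checked numerically with central finite differences (the toy pre-activation values below are an assumption for illustration):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(a, y):
    # cross-entropy of the softmax output: -log f_y(x)
    return -np.log(softmax(a)[y])

a = np.array([0.3, -1.2, 2.0])  # toy output pre-activations (assumption)
y = 1
f = softmax(a)
e_y = np.eye(3)[y]
analytic = -(e_y - f)           # the claimed gradient -(e(y) - f(x))

# central finite differences over each component of a
eps = 1e-6
numeric = np.array([
    (loss(a + eps * np.eye(3)[c], y) - loss(a - eps * np.eye(3)[c], y)) / (2 * eps)
    for c in range(3)
])
```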


Note that we have the following equation:

$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$

Then we can derive the following equation:

$$ \begin{aligned} &\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y} \end{aligned} $$
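Applying the quotient rule above to the $y$-th softmax component completes the derivation:

$$ \begin{aligned} &\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\frac{\exp(a^{(L+1)}_{y}(\mathbf{x}))}{\sum_{c'}\exp(a^{(L+1)}_{c'}(\mathbf{x}))}\\ =&1_{(y=c)}f_{y}(\mathbf{x})-f_{y}(\mathbf{x})f_{c}(\mathbf{x})\\ =&f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right) \end{aligned} $$

so that

$$ \frac{-1}{f_{y}(\mathbf{x})}\cdot f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right)=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

which recovers the loss gradient at the output pre-activation given earlier.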
