CMU-10707 Deep Learning. #40

NorbertZheng opened this issue Oct 23, 2022 · 11 comments

Related Reference


Introduction to Machine Learning, Regression


Important Breakthroughs


Representation Learning

Examples of Representation Learning

How do we transform the data (e.g. with polynomial features, kernel methods, etc.) so that it becomes linearly separable in the feature space for the learning algorithms that follow?
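As a minimal sketch of such a feature transformation (the toy data and the hand-picked separator are assumptions for illustration): a 1-D dataset that no single threshold on $x$ can split becomes linearly separable after the polynomial map $\phi(x)=(x,x^{2})$.

```python
import numpy as np

# 1-D data: class 1 lies inside [-1, 1], class 0 outside -- not linearly
# separable in the original space (no single threshold on x works).
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Polynomial feature map phi(x) = (x, x^2): in this 2-D feature space the
# classes are separated by the line x^2 = 1, a *linear* decision boundary.
phi = np.stack([x, x**2], axis=1)

# A linear classifier w . phi(x) + b with w = (0, -1), b = 1 separates them.
w, b = np.array([0.0, -1.0]), 1.0
pred = (phi @ w + b > 0).astype(int)  # agrees with y on every example
```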

Feature Engineering

  • Computer Vision Features.
  • Audio Features.
  • Representation Learning: Can we automatically learn these representations?


Types of Learning

Consider observing a series of input vectors:

$$
\mathbf{x}_{1},\mathbf{x}_{2},\mathbf{x}_{3},\mathbf{x}_{4},\dots
$$

  • Supervised Learning: We are also given target outputs (labels, responses), $y_{1},y_{2},\dots$, and the goal is to predict the correct output given a new input.

    • Classification: Target outputs $y_{i}$ are discrete class labels. The goal is to correctly classify new inputs.
    • Regression: Target outputs $y_{i}$ are continuous. The goal is to predict the output given new inputs.
  • Unsupervised Learning: The goal is to build a statistical model of $\mathbf{x}$ that finds a useful representation of the data.

    • Clustering.
    • Dimensionality reduction.
    • Modeling the data density.
    • Finding hidden causes (useful explanation) of the data.


  • Reinforcement Learning: The model (agent) produces a set of actions, $a_{1},a_{2},\dots$, that affect the state of the world, and receives rewards $r_{1},r_{2},\dots$. The goal is to learn actions that maximize the reward.

  • Semi-supervised Learning: We are given only a limited amount of labels, but lots of unlabeled data.


Neural Networks I


Feed-forward Neural Networks



Activation Function

The output range of a unit is determined by $g(\cdot)$; the bias only shifts the position of the curve.

  • Linear activation function: $g(a)=a$.

    • No nonlinear transformation.
    • No input squashing.


  • Sigmoid activation function: $g(a)=\mathrm{sigm}(a)=\frac{1}{1+\exp(-a)}$.

    • Squashes the neuron's output between 0 and 1.
    • Always positive.
    • Bounded.
    • Strictly increasing.


  • Hyperbolic tangent ("tanh") activation function: $g(a)=\tanh(a)=\frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}=\frac{\exp(2a)-1}{\exp(2a)+1}$.

    • Squashes the neuron's activation between -1 and 1.
    • Can be positive or negative.
    • Bounded.
    • Strictly increasing.


  • Rectified linear (ReLU) activation function: $g(a)=\mathrm{reclin}(a)=\max(0,a)$.

    • Bounded below by 0 (always non-negative).
    • Tends to produce units with sparse activities.
    • Not upper bounded.
    • Strictly increasing.


  • Softmax activation function: $g(\mathbf{a})=\mathrm{softmax}(\mathbf{a})=\left[\frac{\exp(a_{1})}{\sum_{c}\exp(a_{c})},\dots,\frac{\exp(a_{C})}{\sum_{c}\exp(a_{c})}\right]^{T}$.

    • Strictly positive.
    • Sums to one.
    • Well-suited to multi-way classification, e.g. as the output layer of a discriminative classifier.
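The activation functions above can be sketched in NumPy (a minimal illustration; the function names mirror the notation in the list, and the softmax subtracts the max before exponentiating, a standard trick for numerical stability):

```python
import numpy as np

def linear(a):
    return a  # identity: no nonlinear transformation, no squashing

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))  # squashes output into (0, 1)

def tanh(a):
    return np.tanh(a)  # squashes activation into (-1, 1)

def reclin(a):
    return np.maximum(0.0, a)  # ReLU: non-negative, sparse, unbounded above

def softmax(a):
    # subtract the max for numerical stability; output is positive, sums to 1
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([-2.0, 0.0, 3.0])
p = softmax(a)  # a valid probability distribution over 3 classes
```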


Universal Approximation

  • Universal Approximation Theorem (Hornik, 1991):
    • A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.
  • This applies for sigmoid, tanh and many other activation functions.
  • However, this does not mean that there is a learning algorithm that can find the necessary parameter values.

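One intuition behind the theorem can be sketched numerically: the linear output layer can subtract two steep sigmoid hidden units to build a localized "bump", and enough such bumps can tile any continuous function. The steepness and bump location below are arbitrary choices for illustration:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Two steep sigmoid hidden units combined by a linear output layer with
# weights (+1, -1): h1 switches on near x = 0.4, h2 near x = 0.6, so their
# difference is ~1 on [0.4, 0.6] and ~0 elsewhere -- a "bump".
x = np.linspace(0.0, 1.0, 201)
steep = 100.0  # arbitrary steepness (assumption)
bump = sigm(steep * (x - 0.4)) - sigm(steep * (x - 0.6))
```

Summing many such bumps with different heights and locations approximates an arbitrary continuous target, which is the constructive idea behind the theorem.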


Stochastic Gradient Descent

Perform updates after seeing each example:

  • Initialize: $\theta=\{\mathbf{W}^{(1)},\mathbf{b}^{(1)},\cdot\cdot\cdot,\mathbf{W}^{(L+1)},\mathbf{b}^{(L+1)}\}$.
  • For $t=1:T$
    • For each training example $(\mathbf{x}^{(t)},y^{(t)})$ (one full pass over all examples is a training epoch):
      • $\Delta=-\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})-\lambda\nabla_{\theta}\Omega(\theta)$
      • $\theta \leftarrow \theta + \alpha\Delta$

To train a neural net, we need:

  • Loss function & Regularizer: $l(\mathbf{f}(\mathbf{x}^{(t)};\theta),y^{(t)})$, $\Omega(\theta)$.
    • Let us start by considering a classification problem with a softmax output layer.
    • We need to estimate: $f_{c}(\mathbf{x})=p(y=c|\mathbf{x})$.
      • We can maximize the log-probability of the correct class given an input: $\log p(y^{(t)}=c|\mathbf{x}^{(t)})$.
      • Alternatively, we can minimize the negative log-likelihood (i.e. the cross-entropy for a multi-class classification problem): $l(f(\mathbf{x}),y)=-\sum_{c}1_{(y=c)}\log f_{c}(\mathbf{x})=-\log f_{y}(\mathbf{x})$.
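The update loop above can be sketched for a softmax output layer trained with cross-entropy and an L2 regularizer $\Omega(\theta)=\lVert\mathbf{W}\rVert^{2}/2$ (the toy two-blob dataset and all hyperparameters below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy two-blob dataset: C = 2 classes, D = 2 features (assumed setup).
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

C, D = 2, 2
W, b = np.zeros((C, D)), np.zeros(C)  # theta = {W, b}
alpha, lam = 0.1, 1e-3                # learning rate, regularization strength

for t in range(5):                    # T epochs
    for i in rng.permutation(len(X)): # one pass over all examples = 1 epoch
        f = softmax(W @ X[i] + b)     # f(x) estimates p(y = c | x)
        e_y = np.eye(C)[y[i]]         # one-hot vector e(y)
        grad_a = -(e_y - f)           # loss gradient at output pre-activation
        # Delta = -grad l - lam * grad Omega, then theta <- theta + alpha*Delta
        W += alpha * (-np.outer(grad_a, X[i]) - lam * W)
        b += alpha * (-grad_a)

preds = np.array([softmax(W @ x + b).argmax() for x in X])
acc = (preds == y).mean()
```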


A procedure to compute gradients: $\nabla_{\theta}l(f(\mathbf{x}^{(t)};\theta),y^{(t)})$.
Consider a network with $L$ hidden layers.

  • Hidden layer ( $k \in [1,L]$ ) pre-activation:

$$ \mathbf{a}^{(k)}(\mathbf{x})=\mathbf{b}^{(k)}+\mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x}) $$

  • Hidden layer ( $k \in [1,L]$ ) activation:

$$ \mathbf{h}^{(k)}(\mathbf{x})=\mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x})) $$

  • Output layer ( $k=L+1$ ) activation:

$$ \mathbf{h}^{(L+1)}(\mathbf{x})=\mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x}))=f(\mathbf{x}) $$
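These three equations translate directly into a forward pass. This is a minimal sketch assuming sigmoid hidden activations for $\mathbf{g}$ and a softmax output for $\mathbf{o}$; the layer sizes and random weights are purely illustrative:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward pass through L hidden layers.

    Ws, bs hold W^(1)..W^(L+1) and b^(1)..b^(L+1).
    """
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h              # pre-activation a^(k)(x)
        h = sigm(a)                # activation h^(k)(x)
    a_out = bs[-1] + Ws[-1] @ h    # output pre-activation a^(L+1)(x)
    return softmax(a_out)          # f(x) = o(a^(L+1)(x))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # L = 1 hidden layer
bs = [np.zeros(4), np.zeros(2)]
f = forward(rng.normal(size=3), Ws, bs)  # distribution over 2 classes
```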

Loss gradient at output.

  • Partial derivative:

$$ \frac{\partial}{\partial f_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)=-\frac{1_{(y=c)}}{f_{y}(\mathbf{x})} $$

  • Gradient:

$$ \begin{aligned} &\nabla_{f(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&-\frac{1}{f_{y}(\mathbf{x})}\left[\begin{matrix} 1_{(y=0)}\\ \vdots\\ 1_{(y=C-1)}\end{matrix}\right]\\ =&-\frac{\mathbf{e}(y)}{f_{y}(\mathbf{x})} \end{aligned} $$

Loss gradient at output pre-activation.

  • Partial derivative:

$$ \frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

  • Gradient:

$$ \begin{aligned} &\nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&-(\mathbf{e}(y)-f(\mathbf{x})) \end{aligned} $$
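This closed-form gradient can be checked numerically with central finite differences (the toy pre-activation values below are an assumption for illustration):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(a, y):
    # cross-entropy of the softmax output: -log f_y(x)
    return -np.log(softmax(a)[y])

a = np.array([0.3, -1.2, 2.0])  # toy output pre-activations (assumption)
y = 1
f = softmax(a)
e_y = np.eye(3)[y]
analytic = -(e_y - f)           # the claimed gradient -(e(y) - f(x))

# central finite differences over each component of a
eps = 1e-6
numeric = np.array([
    (loss(a + eps * np.eye(3)[c], y) - loss(a - eps * np.eye(3)[c], y)) / (2 * eps)
    for c in range(3)
])
```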


Note that we have the following equation:

$$ \frac{\partial \frac{g(x)}{h(x)}}{\partial x}=\frac{\partial g(x)}{\partial x}\frac{1}{h(x)}-\frac{g(x)}{h(x)^{2}}\frac{\partial h(x)}{\partial x} $$

Then we can derive the following equation:

$$ \begin{aligned} &\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\left(-\log f_{y}(\mathbf{x})\right)\\ =&\frac{-1}{f_{y}(\mathbf{x})}\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\mathrm{softmax}(\mathbf{a}^{(L+1)}(\mathbf{x}))_{y} \end{aligned} $$
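Applying the quotient rule above to the $y$-th softmax component completes the derivation:

$$ \begin{aligned} &\frac{\partial}{\partial a^{(L+1)}_{c}(\mathbf{x})}\frac{\exp(a^{(L+1)}_{y}(\mathbf{x}))}{\sum_{c'}\exp(a^{(L+1)}_{c'}(\mathbf{x}))}\\ =&1_{(y=c)}f_{y}(\mathbf{x})-f_{y}(\mathbf{x})f_{c}(\mathbf{x})\\ =&f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right) \end{aligned} $$

so that

$$ \frac{-1}{f_{y}(\mathbf{x})}\cdot f_{y}(\mathbf{x})\left(1_{(y=c)}-f_{c}(\mathbf{x})\right)=-(1_{(y=c)}-f_{c}(\mathbf{x})) $$

which recovers the loss gradient at the output pre-activation given earlier.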
