\documentclass{homework}
% \documentclass[UTF8]{ctexart}
\usepackage[UTF8]{ctex}
\usepackage{amsmath, amsthm, amssymb, bm, color, framed, graphicx, mathrsfs}
\author{Your Name}
\class{CLASS NAME / COURSE NUMBER}
\date{\today}
\title{\Large \textbf{Homework-1}\\Derivation and Analysis of Some Fundamental Formulas in Deep Learning Theory}
% \address{YOUR ADDRESS}
\definecolor{shadecolor}{RGB}{241, 241, 255}
\graphicspath{{./media/}}
\begin{document} \maketitle
% Draw a line
\rule[0ex]{\textwidth}{1.5pt}
% Draw background
\begin{shaded}
\question \textbf{MLP for Classification: }
In a multi-class classification problem, say a problem with $M$ classes, we use a model $f_{\theta}$ to produce the output $\hat{y}$. Here $\hat{y}$ is an $M$-dimensional vector whose entries sum to 1, and whose components give the probability of $x$ belonging to each class.
In this section, you will use a simple model -- a 3-layer MLP -- to tackle this problem. Assume that we have $K$ labeled data points $\left(x_{k}, y_{k}\right)$, $k=1,2,\ldots,K$, where $x_{k} \in \mathbb{R}^{N}$ and $y_{k} \in \mathbb{R}^{M}$ is an $M$-dimensional one-hot vector.
\begin{enumerate}
\item[a)] Define the cross entropy loss for a multi-class classification problem.
\item[b)] A 3-layer MLP consists of an input layer, a hidden layer and an output layer, with two learned parameter matrices $W^{1}, W^{2}$ between successive layers. Note that a Softmax layer is usually placed after the output layer to rescale the output; we omit it in this sub-question for simplicity. Given an input $x$, the model performs the following forward computations in sequence:
$$\begin{aligned}
& a_{1}=W^{1} x, \\
& h=\sigma\left(a_{1}\right), \\
& a_{2}=W^{2} h, \\
& \hat{y}=\sigma\left(a_{2}\right) .
\end{aligned}$$
where $W^{1} \in \mathbb{R}^{D \times N}$ and $W^{2} \in \mathbb{R}^{M \times D}$ are parameter matrices and $\sigma(z)=\frac{1}{1+e^{-z}}$ is the element-wise Sigmoid function. Using the loss function you defined in sub-question a), apply an SGD update to the parameters $W^{1}$ and $W^{2}$ with learning rate $\eta$ via back-propagation. Your derivation must give the closed-form gradient of each parameter, namely $\frac{\partial \mathcal{L}}{\partial W^{2}(m, d)}$ and $\frac{\partial \mathcal{L}}{\partial W^{1}(d, n)}$.
\end{enumerate}
\textbf{HINT:}
1. For simplicity, you may assume that the SGD procedure uses only one training sample per batch.
2. Use the identity $\sigma^{\prime}(z)=\sigma(z)(1-\sigma(z))$ for a scalar $z$.
\end{shaded}
a) For a multi-class classification problem, let $y \in \mathbb{R}^M$ denote the true (one-hot) label of a sample and $\hat{y} \in \mathbb{R}^M$ the predicted label. The cross entropy loss is then defined as
$$\mathcal{L}(y, \hat{y})=-\sum_{i=1}^{M} y_i \log \hat{y}_i$$
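As a quick numerical illustration of this definition (a small sketch assuming NumPy is available; it is not part of the derivation itself):
\begin{lstlisting}[language=Python, caption={Illustrative check of the cross entropy loss.}]
import numpy as np

def cross_entropy(y, y_hat):
    # y: one-hot label, y_hat: predicted probabilities, both of length M
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])      # the true class is the second class
y_hat = np.array([0.2, 0.7, 0.1])  # predicted probability vector (sums to 1)
print(cross_entropy(y, y_hat))     # -log(0.7), approximately 0.3567
\end{lstlisting}
Because $y$ is one-hot, only the entry of $\hat{y}$ at the true class contributes to the loss.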
b) For a training sample with input $x \in \mathbb{R}^N$ and label $y \in \mathbb{R}^M$, the model's forward computation is
$$\begin{aligned}
& a_{1}=W^{1} x, \\
& h=\sigma\left(a_{1}\right), \\
& a_{2}=W^{2} h, \\
& \hat{y}=\sigma\left(a_{2}\right) .
\end{aligned}$$
where $W^{1} \in \mathbb{R}^{D \times N}$ and $W^{2} \in \mathbb{R}^{M \times D}$ are the weight matrices to be trained and $\sigma(z)=\frac{1}{1+e^{-z}}$ is the Sigmoid function.
With the cross entropy loss, the loss of the model over the $K$ training samples can be written as
$$\mathcal{L}=\frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\left(y_{k}, \hat{y}_{k}\right)=-\frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{M} y_{k,i} \log \hat{y}_{k,i}$$
To update the parameters $W^{1}$ and $W^{2}$ by gradient descent, we need the gradient of the loss with respect to each of them. By the chain rule, the computation splits into two parts: the gradient propagated from the output layer back to the hidden layer, and the gradient propagated from the hidden layer back to the input layer.
To be continued\ldots
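One way the derivation might continue, sketched here for a single training sample $(x, y)$ as suggested in Hint 1, so that $\mathcal{L}=-\sum_{i=1}^{M} y_i \log \hat{y}_i$:
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial a_{2}(m)} &= -\frac{y(m)}{\hat{y}(m)} \cdot \hat{y}(m)\bigl(1-\hat{y}(m)\bigr) = -y(m)\bigl(1-\hat{y}(m)\bigr), \\
\frac{\partial \mathcal{L}}{\partial W^{2}(m, d)} &= \frac{\partial \mathcal{L}}{\partial a_{2}(m)}\, h(d) = -y(m)\bigl(1-\hat{y}(m)\bigr)\, h(d), \\
\frac{\partial \mathcal{L}}{\partial a_{1}(d)} &= \left(\sum_{m=1}^{M} \frac{\partial \mathcal{L}}{\partial a_{2}(m)}\, W^{2}(m, d)\right) h(d)\bigl(1-h(d)\bigr), \\
\frac{\partial \mathcal{L}}{\partial W^{1}(d, n)} &= \frac{\partial \mathcal{L}}{\partial a_{1}(d)}\, x(n),
\end{aligned}$$
so the SGD updates with learning rate $\eta$ are $W^{2}(m, d) \leftarrow W^{2}(m, d)-\eta \frac{\partial \mathcal{L}}{\partial W^{2}(m, d)}$ and $W^{1}(d, n) \leftarrow W^{1}(d, n)-\eta \frac{\partial \mathcal{L}}{\partial W^{1}(d, n)}$.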
\newpage
\begin{shaded}
\question \textbf{Dropout and Regularization}
Dropout is a well-known way to prevent neural networks from overfitting. In this section, you will make this regularization effect explicit for linear regression. Recall that linear regression optimizes $w \in \mathbb{R}^{d}$ to minimize the following MSE objective:
$$
\mathcal{L}(w)=\|y-X w\|^{2}
$$
where $y \in \mathbb{R}^{n}$ is the response vector for the design matrix $X \in \mathbb{R}^{n \times d}$. One way of using dropout during training on the $d$-dimensional input features $x_{i}$ is to keep each feature at random with probability $p$ (and zero it out if not kept).
\begin{enumerate}
\item[a)] Show that when we apply such dropout, the learning objective becomes
$$
\mathcal{L}(w)=\mathbb{E}_{M \sim \operatorname{Bernoulli}(p)}\|y-(M \odot X) w\|^{2}
$$
where $\odot$ denotes the element-wise product and $M \in\{0,1\}^{n \times d}$ is a random mask matrix whose elements $m_{i, j}$ are i.i.d. Bernoulli random variables with success probability $p$.
\item[b)] Show that the dropout learning objective can be rewritten as an explicitly regularized objective:
$$
\mathcal{L}(w)=\|y-p X w\|^{2}+p(1-p)\|\Gamma w\|^{2}
$$
and define a suitable matrix $\Gamma$.
\end{enumerate}
\end{shaded}
Answer to Problem 2\ldots
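A sketch of how part b) might be approached (an illustrative outline under the stated assumption that the mask entries are i.i.d. $\operatorname{Bernoulli}(p)$; not a completed answer): expanding the expectation and using $\mathbb{E}[m_{i j} m_{i k}]=p^{2}$ for $j \neq k$ and $\mathbb{E}[m_{i j}^{2}]=p$,
$$\begin{aligned}
\mathbb{E}_{M}\|y-(M \odot X) w\|^{2}
&=\|y\|^{2}-2 p\, y^{\top} X w+w^{\top}\left(p^{2} X^{\top} X+p(1-p) \operatorname{diag}\left(X^{\top} X\right)\right) w \\
&=\|y-p X w\|^{2}+p(1-p)\|\Gamma w\|^{2},
\end{aligned}$$
which suggests taking $\Gamma=\operatorname{diag}\left(X^{\top} X\right)^{1 / 2}$, i.e. $\Gamma_{j j}=\left\|X_{:, j}\right\|_{2}$, the norm of the $j$-th column of $X$.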
\newpage
\begin{shaded}
\question Figure \ref{wheel} shows two cipher wheels. The left one is from Jeffrey Hoffstein et al. \cite{hoffstein2008introduction}. Write a Python 3 program that uses it to encrypt: \texttt{FOUR SCORE AND SEVEN YEARS AGO}.
\end{shaded}
\img<wheel>[0.26]{Cipher wheels.}{cipher.png, diagram.jpg}
The Python program is given in listing \ref{cpr} and the encryption is given in table \ref{enc}.
\lstinputlisting[language=Python, caption={Python 3 implementation of the left wheel in figure \ref{wheel}.}, label=cpr]{code/prog.py}
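The file \texttt{code/prog.py} itself is not reproduced here. As an illustration only, and assuming the left wheel amounts to a Caesar shift of five positions (which is consistent with table \ref{enc}), a minimal program of this kind might look like the following sketch; it is not the actual listing.
\begin{lstlisting}[language=Python, caption={Illustrative Caesar-shift sketch, assuming a shift of five.}]
# Shift every letter five places forward, wrapping around the alphabet.
SHIFT = 5

def encrypt(plaintext, shift=SHIFT):
    out = []
    for ch in plaintext:
        if ch.isalpha():
            # Map A-Z to 0-25, shift modulo 26, and map back to a letter.
            out.append(chr((ord(ch.upper()) - ord('A') + shift) % 26 + ord('A')))
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return ''.join(out)

print(encrypt('FOUR SCORE AND SEVEN YEARS AGO'))
# KTZW XHTWJ FSI XJAJS DJFWX FLT
\end{lstlisting}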
\tbl<enc>{Caesar cipher} {
Plain Text & FOUR & SCORE & AND & SEVEN & YEARS & AGO \\
Cipher Text & KTZW & XHTWJ & FSI & XJAJS & DJFWX & FLT \\
}
% citations
\newpage
\large
\bibliographystyle{plain}
\bibliography{citations}
\end{document}