PixelRNN
========

We now give a brief overview of PixelRNN. PixelRNNs belong to a family
of explicit density models called **fully visible belief networks
(FVBN)**. We can represent our model with the following equation:
$$p(x) = p(x_1, x_2, \dots, x_n),$$ where the left hand side $p(x)$
represents the likelihood of an entire image $x$, and the right hand
side represents the joint likelihood of all the pixels in the image. Using
the chain rule, we can decompose this likelihood into a product of
1-dimensional distributions:
$$p(x) = \prod_{i = 1}^n p(x_i \mid x_1, \dots, x_{i - 1}).$$ By maximizing
the likelihood of the training data under this factorization, we obtain our model, PixelRNN.
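
To make this factorization concrete, below is a minimal sketch (our own illustration, not the authors' code) of evaluating the log-likelihood of a single grayscale image as a sum of per-pixel conditional log-probabilities. The `model` argument is a hypothetical autoregressive network that returns 256-way logits for every pixel position.

```python
import torch
import torch.nn.functional as F

def image_log_likelihood(model, image):
    """image: LongTensor of shape (n, n) with integer pixel values in [0, 255]."""
    logits = model(image)                      # assumed shape (n, n, 256): logits for p(x_i | x_<i)
    log_probs = F.log_softmax(logits, dim=-1)  # normalize each conditional distribution
    # log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    per_pixel = log_probs.gather(-1, image.unsqueeze(-1)).squeeze(-1)
    return per_pixel.sum()
```

Training then amounts to maximizing this quantity over the training set (equivalently, minimizing the negative log-likelihood, the NLL reported in the Performance section below).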

Introduction
============

PixelRNN, first introduced in van den Oord et al. 2016, uses an RNN-like
structure, modeling the pixels one-by-one, to maximize the likelihood
function given above. One of the more difficult tasks in generative
modeling is to create a model that is tractable, and PixelRNN seeks to
address that. It does so by tractably modeling a joint distribution of
the pixels in the image, casting it as a product of conditional
distributions. The factorization turns the joint modeling problem into a
sequence problem: we have to predict the next pixel given all the
previously generated pixels. Thus, we use Recurrent Neural Networks for
this task, as they learn sequentially. More precisely, we generate image
pixels starting from the top left corner, and we model each pixel’s
dependency on previous pixels using an RNN (LSTM).
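
This sequential generation procedure can be sketched as a simple double loop. The code below is illustrative only; `conditional_logits` is a hypothetical stand-in for the RNN's prediction of $p(x_i \mid x_{<i})$ given the pixels generated so far.

```python
import torch

def sample_image(conditional_logits, n=32):
    """Generate an n-by-n grayscale image pixel by pixel, top-left to bottom-right."""
    image = torch.zeros(n, n, dtype=torch.long)
    for r in range(n):
        for c in range(n):
            logits = conditional_logits(image, r, c)          # (256,) logits for pixel (r, c)
            probs = torch.softmax(logits, dim=-1)
            image[r, c] = torch.multinomial(probs, 1).item()  # sample a discrete pixel value
    return image
```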

![(From CS231N Slides) Sequential Image Generation using PixelRNN](imports/pixelrnn.png "fig:")

Specifically, the PixelRNN framework is made up of twelve
two-dimensional LSTM layers, with convolutions applied to each dimension
of the data. There are two types of layers. One is the Row LSTM
layer, where the convolution is applied along each row. The second type
is the Diagonal BiLSTM layer, where the convolution is applied along the
diagonals of the image. In addition, the pixel values are modeled as
discrete values using a multinomial distribution implemented with a
softmax layer. This is in contrast to many previous approaches, which
model pixels as continuous values.
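
As a rough illustration of this discrete output (a sketch under our own assumptions, not the paper's exact output head), a final $h$-channel feature map can be mapped to a 256-way distribution per pixel with a $1 \times 1$ convolution followed by a softmax:

```python
import torch
import torch.nn as nn

class DiscretePixelHead(nn.Module):
    """Maps h feature channels to a 256-way multinomial distribution per pixel."""
    def __init__(self, h, num_values=256):
        super().__init__()
        self.to_logits = nn.Conv2d(h, num_values, kernel_size=1)

    def forward(self, features):
        logits = self.to_logits(features)   # (batch, 256, n, n)
        return torch.softmax(logits, dim=1) # per-pixel distribution over values 0..255
```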

Model
=====

The approach of PixelRNN is as follows. The RNN scans each
individual pixel, going row-wise, predicting the conditional
distribution over the possible pixel values given the context the
network has seen so far. As mentioned before, PixelRNN uses a two-dimensional LSTM
network which begins scanning at the top left of the image and makes its
way to the bottom right. One of the reasons an LSTM is used is that it
can better capture longer-range dependencies between pixels - this
is essential for understanding image composition. A
two-dimensional structure is used to ensure that the signals
propagate well in both the left-to-right and top-to-bottom directions.

The input image to the network is represented by a 1D vector of pixel
values $\{x_1, \dots, x_{n^2}\}$ for an $n$-by-$n$ image, where
$\{x_1, \dots, x_{n}\}$ represents the pixels from the first row. Our goal
is to use these pixel values to find a probability distribution $p(x)$
for each image $x$. We define this probability as:
$$p(x) = \prod_{i = 1}^{n^2} p(x_i \mid x_1, \dots, x_{i - 1}).$$

This is the product of the conditional distributions across all the
pixels in the image - for pixel $x_i$, we have
$p(x_i \mid x_1, \dots, x_{i - 1})$. In turn, each of these conditional
distributions is determined by three values, one for each of the
color channels present in the image (red, green and blue). In other
words:

$$p(x_i \mid x_1, \dots, x_{i - 1}) = p(x_{i,R} \mid \textbf{x}_{<i}) \cdot p(x_{i,G} \mid \textbf{x}_{<i}, x_{i,R}) \cdot p(x_{i,B} \mid \textbf{x}_{<i}, x_{i,R}, x_{i,G}).$$
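
For a single pixel, this three-factor product can be sketched as follows. The `*_logits` arguments are hypothetical stand-ins for network outputs that already condition on the previous pixels (and, for green and blue, on the channel values generated before them).

```python
import torch
import torch.nn.functional as F

def pixel_log_prob(red_logits, green_logits, blue_logits, r, g, b):
    """Each *_logits is a (256,) tensor; r, g, b are the observed channel values."""
    log_p_r = F.log_softmax(red_logits, dim=-1)[r]    # log p(x_{i,R} | x_<i)
    log_p_g = F.log_softmax(green_logits, dim=-1)[g]  # log p(x_{i,G} | x_<i, x_{i,R})
    log_p_b = F.log_softmax(blue_logits, dim=-1)[b]   # log p(x_{i,B} | x_<i, x_{i,R}, x_{i,G})
    return log_p_r + log_p_g + log_p_b                # log of the three-factor product
```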

In the next section we will see how these distributions are calculated
and used within the Recurrent Neural Network framework proposed in
PixelRNN.

Architecture
------------

As we have seen, there are two distinct components to the
“two-dimensional” LSTM: the Row LSTM and the Diagonal BiLSTM. Figure 2
illustrates how each of these two LSTMs operates when applied to an RGB
image.

![Visualization of the mappings for Row LSTM and Diagonal BiLSTM](Screen Shot 2021-06-15 at 9.41.08 AM.png "fig:")

**Row LSTM** is a unidirectional layer that processes the image row by
row from top to bottom, computing features for a whole row at once using
a 1D convolution. As we can see in the image above, the Row LSTM
captures a triangle-shaped context for a given pixel. An LSTM layer has
an input-to-state component and a recurrent state-to-state component
that together determine the four gates inside the LSTM core. In the Row
LSTM, the input-to-state component is computed for the whole
two-dimensional input map with a one-dimensional convolution, row-wise.
The output of the convolution is a $4h \times n \times n$ tensor, where the first
dimension represents the four gate vectors for each position in the
input map ($h$ here is the number of output feature maps). Below are the
computations for the LSTM core, combining the state-to-state and
input-to-state components with the previous hidden state ($h_{i-1}$)
and previous cell state ($c_{i-1}$):
$$[o_i, f_i, i_i, g_i] = \sigma(\textbf{K}^{ss} \circledast h_{i-1} + \textbf{K}^{is} \circledast \textbf{x}_{i})$$
$$c_i = f_i \odot c_{i-1} + i_i \odot g_i$$
$$h_i = o_i \odot \tanh(c_{i})$$

Here, $\textbf{x}_i$ is the row of the input representation, and
$\textbf{K}^{ss}$, $\textbf{K}^{is}$ are the kernel weights for the
state-to-state and input-to-state components respectively. $g_i, o_i, f_i$ and
$i_i$ are the content, output, forget and input gates. $\sigma$
represents the activation function (tanh activation for the content
gate, and sigmoid for the rest of the gates).
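
The update above can be sketched in code as follows. This is an illustration under our own assumptions (a kernel of size 3, no masking of the input-to-state convolution), not the authors' implementation; `K_ss` and `K_is` play the roles of $\textbf{K}^{ss}$ and $\textbf{K}^{is}$.

```python
import torch
import torch.nn as nn

class RowLSTMStep(nn.Module):
    """One Row LSTM step: processes row i given the hidden/cell state of row i-1."""
    def __init__(self, h, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.K_ss = nn.Conv1d(h, 4 * h, kernel_size, padding=pad)  # state-to-state
        self.K_is = nn.Conv1d(h, 4 * h, kernel_size, padding=pad)  # input-to-state
        self.h = h

    def forward(self, x_i, h_prev, c_prev):
        """x_i, h_prev, c_prev: tensors of shape (batch, h, n) -- one image row each."""
        gates = self.K_ss(h_prev) + self.K_is(x_i)      # (batch, 4h, n)
        o, f, i, g = torch.split(gates, self.h, dim=1)  # four gate vectors per position
        c_i = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_i = torch.sigmoid(o) * torch.tanh(c_i)
        return h_i, c_i
```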

**Diagonal BiLSTM** The Diagonal BiLSTM is able to capture the entire
image context by scanning along both diagonals of the image, one for each
direction of the LSTM. We first compute the input-to-state and
state-to-state components of the layer. For each of the directions, the
input-to-state component is simply a $1 \times 1$ convolution $K^{is}$,
generating a $4h \times n \times n$ tensor (here again the first dimension represents
the four gate vectors for each position in the input map, where $h$ is the
number of output feature maps). The state-to-state component is calculated using
the kernel $K^{ss}$, which has size $2 \times 1$. This step takes the
previous hidden and cell states, combines them with the contribution of the
input-to-state component, and produces the next hidden and cell states,
as explained in the equations for the Row LSTM above. We repeat this process
for each of the two directions.
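
Below is a minimal sketch of these two components for a single scan direction. It only illustrates the kernel shapes (a $1 \times 1$ input-to-state convolution and a $2 \times 1$ state-to-state convolution) and the resulting $4h \times n \times n$ gate tensor; the diagonal traversal of the feature map is omitted, and the class and method names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalLSTMComponents(nn.Module):
    """Input-to-state and state-to-state convolutions for one scan direction."""
    def __init__(self, h):
        super().__init__()
        self.K_is = nn.Conv2d(h, 4 * h, kernel_size=1)       # 1x1 input-to-state
        self.K_ss = nn.Conv2d(h, 4 * h, kernel_size=(2, 1))  # 2x1 state-to-state

    def gates(self, x, h_prev):
        """x, h_prev: (batch, h, n, n) feature maps; returns a (batch, 4h, n, n) gate tensor."""
        h_padded = F.pad(h_prev, (0, 0, 1, 0))  # pad one row at the top so output height stays n
        return self.K_is(x) + self.K_ss(h_padded)
```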

Performance
===========

When originally presented, the PixelRNN model’s performance was tested
on some of the most prominent datasets in the computer vision space -
ImageNet and CIFAR-10. The results in some cases were state-of-the-art.
On the ImageNet dataset, it achieved NLL scores of 3.86 and 3.63 on
the 32x32 and 64x64 image sizes respectively. On CIFAR-10, it achieved
an NLL score of 3.00, which was state-of-the-art at the time of
publication.
