PixelRNN
========

We now give a brief overview of PixelRNN. PixelRNN belongs to a family
of explicit density models called **fully visible belief networks
(FVBN)**. We can represent our model with the following equation:
$$p(x) = p(x_1, x_2, \dots, x_n),$$ where the left-hand side $p(x)$
is the likelihood of an entire image $x$, and the right-hand side is the
joint likelihood of the pixels in the image. Using the chain rule, we can
decompose this likelihood into a product of one-dimensional conditional
distributions:
$$p(x) = \prod_{i = 1}^n p(x_i \mid x_1, \dots, x_{i - 1}).$$ Maximizing
the likelihood of the training data under this factorization gives us our
model, PixelRNN.
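
To make the factorization concrete, below is a minimal Python sketch of how the
joint log-likelihood of an image decomposes into a sum of per-pixel conditional
log-probabilities. The `conditional_prob` callable is hypothetical; it stands in
for whatever model supplies $p(x_i \mid x_1, \dots, x_{i-1})$.

```python
import numpy as np

def image_log_likelihood(pixels, conditional_prob):
    """Compute log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}).

    pixels: 1D array of discrete pixel values in raster-scan order.
    conditional_prob: hypothetical model callable returning the probability
        of value x_i given the pixels that precede it.
    """
    log_p = 0.0
    for i, x_i in enumerate(pixels):
        context = pixels[:i]               # everything generated before pixel i
        log_p += np.log(conditional_prob(x_i, context))
    return log_p
```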

Introduction
============

PixelRNN, first introduced in van den Oord et al. 2016, uses an RNN-like
structure to model the pixels one by one and maximize the likelihood
function given above. One of the more difficult tasks in generative
modeling is to create a model that is tractable, and PixelRNN seeks to
address that: it models the joint distribution of the pixels in the image
tractably by casting it as a product of conditional distributions. The
factorization turns the joint modeling problem into a sequence problem,
i.e., we have to predict the next pixel given all the previously generated
pixels. Recurrent Neural Networks are a natural fit for this task because
they learn sequentially. More precisely, we generate image pixels starting
from the top-left corner, and we model each pixel’s dependency on previous
pixels using an RNN (an LSTM).

Specifically, the PixelRNN framework is made up of twelve
two-dimensional LSTM layers, with convolutions applied along one of the
spatial dimensions of the input. There are two types of layers. One is
the Row LSTM layer, where the convolution is applied along each row. The
second is the Diagonal BiLSTM layer, where the convolution is applied
along the diagonals of the image. In addition, the pixel values are
modeled as discrete values using a multinomial distribution implemented
with a softmax layer. This is in contrast to many previous approaches,
which model pixels as continuous values.
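
To illustrate that last point, here is a small NumPy sketch of the discrete
output: the network produces 256 scores per colour channel, a softmax turns
them into a multinomial distribution over pixel values, and a value is sampled
from it. The random logits below are only a stand-in for the network's actual
output.

```python
import numpy as np

rng = np.random.default_rng(0)

def pixel_value_distribution(logits):
    """Turn 256 raw scores into a multinomial distribution over pixel values."""
    z = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs                           # probs[v] = p(channel value == v)

logits = rng.normal(size=256)              # stand-in for the network's output
probs = pixel_value_distribution(logits)
sampled_value = rng.choice(256, p=probs)   # one discrete channel value in [0, 255]
```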

Model
=====

The approach of PixelRNN is as follows. The RNN scans each individual
pixel, going row by row, and predicts the conditional distribution over
the possible pixel values given the context the network has seen so far.
As mentioned before, PixelRNN uses a two-dimensional LSTM network which
begins scanning at the top left of the image and makes its way to the
bottom right. One reason an LSTM is used is that it can better capture
longer-range dependencies between pixels, which is essential for
understanding image composition. The reason a two-dimensional structure
is used is to ensure that the signals propagate well in both the
left-to-right and top-to-bottom directions.

The input image to the network is represented by a 1D vector of pixel
values $\{x_1, \dots, x_{n^2}\}$ for an $n$-by-$n$ image, where
$\{x_1, \dots, x_{n}\}$ are the pixels of the first row. Our goal is to
use these pixel values to assign a probability $p(x)$ to each image $x$.
We define this probability as:
$$p(x) = \prod_{i = 1}^{n^2} p(x_i \mid x_1, \dots, x_{i - 1}).$$

This is the product of the conditional distributions over all the pixels
in the image: for pixel $x_i$, we have
$p(x_i \mid x_1, \dots, x_{i - 1})$. In turn, each of these conditional
distributions is factored over the three color channels present in the
image (red, green and blue). In other words:

$$p(x_i \mid x_1, \dots, x_{i - 1}) = p(x_{i,R} \mid \textbf{x}_{<i}) \cdot p(x_{i,G} \mid \textbf{x}_{<i}, x_{i,R}) \cdot p(x_{i,B} \mid \textbf{x}_{<i}, x_{i,R}, x_{i,G}).$$
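
A hedged sketch of what generation looks like under this factorization: pixels
are visited in raster order and, within each pixel, the R, G and B values are
sampled one after another, each conditioned on everything produced so far. The
`channel_distribution` callable is hypothetical and stands in for the trained
network.

```python
import numpy as np

def sample_image(n, channel_distribution, rng=None):
    """Generate an n x n x 3 image one channel value at a time.

    channel_distribution: hypothetical model returning a length-256
        probability vector for the channel at (row, col, channel),
        given the partially generated image.
    """
    if rng is None:
        rng = np.random.default_rng()
    image = np.zeros((n, n, 3), dtype=np.int64)
    for row in range(n):                   # top to bottom
        for col in range(n):               # left to right
            for channel in range(3):       # R, then G, then B
                probs = channel_distribution(image, row, col, channel)
                image[row, col, channel] = rng.choice(256, p=probs)
    return image
```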

In the next section we will see how these distributions are calculated
and used within the Recurrent Neural Network framework proposed in
PixelRNN.

Architecture
------------

As we have seen, there are two distinct components to the
“two-dimensional” LSTM: the Row LSTM and the Diagonal BiLSTM. Figure 2
illustrates how each of these two LSTMs operates when applied to an RGB
image.

[Figure 2: the context seen by the Row LSTM and the Diagonal BiLSTM when scanning an RGB image.]

**Row LSTM** is a unidirectional layer that processes the image row by
row, from top to bottom, computing features for a whole row at once with
a 1D convolution. As the figure above indicates, the Row LSTM captures a
triangle-shaped context for a given pixel. An LSTM layer has an
input-to-state component and a recurrent state-to-state component that
together determine the four gates inside the LSTM core. In the Row LSTM,
the input-to-state component is computed for the whole two-dimensional
input map with a one-dimensional convolution applied row-wise. The output
of the convolution is a $4h \times n \times n$ tensor, where the first
dimension holds the four gate vectors for each position in the input map
($h$ is the number of output feature maps). The computations below combine
the state-to-state component, which uses the previous hidden state
$h_{i-1}$ and previous cell state $c_{i-1}$, with the input-to-state
component:
$$[o_i, f_i, i_i, g_i] = \sigma(\textbf{K}^{ss} \circledast h_{i-1} + \textbf{K}^{is} \circledast \textbf{x}_{i})$$
$$c_i = f_i \odot c_{i-1} + i_i \odot g_i$$
$$h_i = o_i \odot \tanh(c_{i})$$

Here, $\textbf{x}_i$ is row $i$ of the input representation, and
$\textbf{K}^{ss}$ and $\textbf{K}^{is}$ are the kernel weights for the
state-to-state and input-to-state convolutions respectively. $g_i$, $o_i$,
$f_i$ and $i_i$ are the content, output, forget and input gates. $\sigma$
denotes the activation function: tanh for the content gate, and the
sigmoid for the remaining gates.
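
Read literally, these equations map onto only a few lines of code. The
following is a rough PyTorch sketch of a single Row LSTM step, not the paper's
implementation; the tensor shapes, the "same" padding, and the assumption of
odd kernel sizes are choices made for the sketch.

```python
import torch
import torch.nn.functional as F

def row_lstm_step(x_row, h_prev, c_prev, K_is, K_ss):
    """One Row LSTM step following the equations above (simplified sketch).

    x_row, h_prev, c_prev: tensors of shape (1, h_feat, n) -- one image row
        with h_feat feature maps and n columns.
    K_is, K_ss: 1D convolution kernels of shape (4 * h_feat, h_feat, k)
        producing the four gate pre-activations at every column.
    """
    h_feat = h_prev.shape[1]
    gates = (F.conv1d(h_prev, K_ss, padding=K_ss.shape[-1] // 2) +   # state-to-state
             F.conv1d(x_row, K_is, padding=K_is.shape[-1] // 2))     # input-to-state
    o, f, i, g = torch.split(gates, h_feat, dim=1)
    o, f, i = torch.sigmoid(o), torch.sigmoid(f), torch.sigmoid(i)
    g = torch.tanh(g)                              # content gate uses tanh
    c = f * c_prev + i * g                         # new cell state
    h_new = o * torch.tanh(c)                      # new hidden state
    return h_new, c
```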

**Diagonal BiLSTM.** The Diagonal BiLSTM is able to capture the entire
available context of the image by scanning along the diagonals of the
image, once in each direction of the LSTM. We first compute the
input-to-state and state-to-state components of the layer. For each of
the two directions, the input-to-state component is simply a $1 \times 1$
convolution $K^{is}$, producing a $4h \times n \times n$ tensor (here
again the first dimension holds the four gate vectors for each position
in the input map, and $h$ is the number of output feature maps). The
state-to-state component is computed with a kernel $K^{ss}$ of size
$2 \times 1$. This step takes the previous hidden and cell states,
combines them with the contribution of the input-to-state component, and
produces the next hidden and cell states, as in the Row LSTM equations
above. We repeat this process for each of the two directions.
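
As a rough PyTorch sketch of how the two directional scans could be merged:
the `diagonal_lstm_pass` function is hypothetical and stands in for one
directional scan with its $1 \times 1$ input-to-state and $2 \times 1$
state-to-state convolutions; the one-row shift before merging follows the
original paper, which shifts the mirrored output map down by one row so that
no pixel sees positions that come after it in the scan order.

```python
import torch
import torch.nn.functional as F

def diagonal_bilstm(x, diagonal_lstm_pass):
    """Combine the two directional passes of the Diagonal BiLSTM (sketch).

    x: input feature map of shape (1, h_feat, n, n).
    diagonal_lstm_pass: hypothetical function implementing one directional
        diagonal scan and returning a hidden-state map of the same size.
    """
    left = diagonal_lstm_pass(x)                         # scan one diagonal direction
    right = diagonal_lstm_pass(torch.flip(x, dims=[3]))  # mirror columns, scan again
    right = torch.flip(right, dims=[3])                  # undo the mirroring
    # Shift the mirrored map down one row before merging (as in the paper),
    # so the combined map at a pixel only depends on pixels above and to its left.
    right = F.pad(right, (0, 0, 1, 0))[:, :, :-1, :]
    return left + right
```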

Performance
===========

When originally presented, the PixelRNN model’s performance was evaluated
on some of the most prominent datasets in computer vision: ImageNet and
CIFAR-10. The results were in some cases state-of-the-art. On ImageNet,
PixelRNN achieved negative log-likelihood (NLL) scores of 3.86 and 3.63
on the 32x32 and 64x64 image sizes respectively. On CIFAR-10, it achieved
an NLL score of 3.00, which was state-of-the-art at the time of
publication.
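
These NLL figures are conventionally reported in bits per dimension, i.e. the
total negative log-likelihood in base 2 divided by the number of predicted
values, where every colour channel of every pixel counts as one dimension. A
small sketch of that conversion, assuming the model's log-likelihood is
available in nats:

```python
import numpy as np

def bits_per_dim(total_log_likelihood_nats, image_shape=(32, 32, 3)):
    """Convert a summed image log-likelihood (in nats) to bits per dimension."""
    n_dims = np.prod(image_shape)                   # e.g. 3072 for a 32x32 RGB image
    return -total_log_likelihood_nats / (n_dims * np.log(2.0))
```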