Basic knowledge of PyTorch, convolutional and recurrent neural networks is assumed.
Questions, suggestions, or corrections can be posted as issues.
I'm using `PyTorch 0.4` in `Python 3.6`.
### Encoder

See `Encoder` in `models.py`.
We use a pretrained ResNet-101 already available in PyTorch's `torchvision` module, and discard its last two layers (pooling and linear layers), since we only need to encode the image, not classify it.
We do add an `AdaptiveAvgPool2d()` layer to **resize the encoding to a fixed size**. This makes it possible to feed images of variable size to the Encoder. (We did, however, resize our input images to `256, 256` because we had to store them together as a single tensor.)
Since we may want to fine-tune the Encoder, we add a `fine_tune()` method which enables or disables the calculation of gradients for the Encoder's parameters. We **only fine-tune convolutional blocks 2 through 4 in the ResNet**, because the first convolutional block would have usually learned something very fundamental to image processing, such as detecting lines, edges, curves, etc. We don't mess with the foundations.

**The output of the Encoder is the encoded image with dimensions `N, 14, 14, 2048`**.
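As a rough sketch (not a verbatim copy of `models.py`; the layer indices for the fine-tuned blocks and the default sizes here are assumptions), such an Encoder might look like this:

```python
import torch
from torch import nn
import torchvision


class Encoder(nn.Module):
    """Pretrained ResNet-101 minus its last two layers, plus adaptive pooling."""

    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        resnet = torchvision.models.resnet101(pretrained=True)
        # Discard the final average-pooling and linear (classification) layers
        self.resnet = nn.Sequential(*list(resnet.children())[:-2])
        # Resize the encoding to a fixed 14x14 spatial size, whatever the input image size
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))
        self.fine_tune(True)

    def forward(self, images):
        out = self.resnet(images)          # (N, 2048, image_size/32, image_size/32)
        out = self.adaptive_pool(out)      # (N, 2048, 14, 14)
        return out.permute(0, 2, 3, 1)     # (N, 14, 14, 2048)

    def fine_tune(self, fine_tune=True):
        """Enable or disable gradients for everything except the earliest layers."""
        for p in self.resnet.parameters():
            p.requires_grad = False
        # Only allow fine-tuning of convolutional blocks 2 through 4
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune
```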
### Attention
See `Attention` in `models.py`.
The Attention network is simple - it's composed of only linear layers and a couple of activations.
Separate linear layers **transform both the encoded image (flattened to `N, 14 * 14, 2048`) and the hidden state (output) from the Decoder to the same dimension**, viz. the Attention size. They are then added and ReLU activated. A third linear layer **transforms this result to a dimension of 1**, whereupon we **apply the softmax to generate the weights** `alpha`.

**The outputs of the Attention network are (1) the weighted average of the encoding with dimensions `N, 2048`, and (2) the weights with dimensions `N, 14 * 14`**.
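A minimal sketch of such an Attention network (the layer names and default dimensions here are illustrative assumptions):

```python
import torch
from torch import nn


class Attention(nn.Module):
    """Additive attention over the encoded image, conditioned on the Decoder's hidden state."""

    def __init__(self, encoder_dim=2048, decoder_dim=512, attention_dim=512):
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # transform the encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # transform the Decoder's hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # collapse to one score per pixel
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (N, num_pixels, encoder_dim), decoder_hidden: (N, decoder_dim)
        att1 = self.encoder_att(encoder_out)                      # (N, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)                   # (N, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1)))  # (N, num_pixels, 1)
        alpha = self.softmax(att.squeeze(2))                      # (N, num_pixels)
        weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (N, encoder_dim)
        return weighted_encoding, alpha
```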
### Decoder
See `DecoderWithAttention` in `models.py`.
The output of the Encoder is received here and flattened to dimensions `N, 14 * 14, 2048`. This is just convenient, and prevents having to reshape the tensor multiple times.
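For example, flattening the two spatial dimensions into a single `num_pixels` dimension is a one-line reshape (toy tensor shown):

```python
import torch

encoder_out = torch.randn(32, 14, 14, 2048)   # output of the Encoder for a batch of 32 images
encoder_out = encoder_out.view(encoder_out.size(0), -1, encoder_out.size(-1))  # (32, 14 * 14, 2048)
```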
We **initialize the hidden and cell state of the LSTM** using the encoded image with the `init_hidden_state()` method, which uses two separate linear layers.
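One natural way to do this (an assumption here; see `init_hidden_state()` in `models.py` for the actual implementation) is to average the encoding over all pixels and project it through the two linear layers:

```python
import torch
from torch import nn

encoder_dim, decoder_dim = 2048, 512   # assumed sizes
init_h = nn.Linear(encoder_dim, decoder_dim)
init_c = nn.Linear(encoder_dim, decoder_dim)


def init_hidden_state(encoder_out):
    # encoder_out: (N, num_pixels, encoder_dim)
    mean_encoder_out = encoder_out.mean(dim=1)  # summarize the image as one vector per sample
    h = init_h(mean_encoder_out)                # initial hidden state, (N, decoder_dim)
    c = init_c(mean_encoder_out)                # initial cell state, (N, decoder_dim)
    return h, c
```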
At the very outset, we **sort the `N` images and captions by decreasing caption lengths**. This is so that we can process only _valid_ timesteps, i.e., not process the `<pad>`s.
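A toy illustration of the sort (the tensors here are made up; the same sort indices reorder the images and captions so everything stays aligned):

```python
import torch

caption_lengths = torch.tensor([3, 7, 5, 2])   # lengths of 4 captions in the batch
encoder_out = torch.randn(4, 14 * 14, 2048)    # corresponding encoded images
captions = torch.randint(0, 100, (4, 7))       # padded caption word indices

# Sort by decreasing caption length, then reorder the rest with the same indices
caption_lengths, sort_ind = caption_lengths.sort(dim=0, descending=True)
encoder_out = encoder_out[sort_ind]
captions = captions[sort_ind]
print(caption_lengths)  # tensor([7, 5, 3, 2])
```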
We can then iterate over each timestep, processing only the captions that are still being decoded, i.e., the **_effective_ batch size** `N_t` at that timestep. The sorting allows the top `N_t` at any timestep to align with the outputs from the previous step. At the third timestep, for example, we might process only the top 5 images, using the top 5 outputs from the previous step.
This **iteration is performed _manually_ in a `for` loop** with a PyTorch [`LSTMCell`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTMCell) instead of iterating automatically with a PyTorch [`LSTM`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTM). This is because we need to execute the Attention mechanism between each decode step. An `LSTMCell` is a single timestep operation, whereas an `LSTM` would iterate over multiple timesteps continuously and provide all outputs at once.
We **compute the weights and attention-weighted encoding** at each timestep with the Attention network. In section `4.2.1` of the paper, they recommend **passing the attention-weighted encoding through a filter or gate**. This gate is a sigmoid activated linear transform of the Decoder's previous hidden state. The authors state that this helps the Attention network put more emphasis on the objects in the image.
We **concatenate this filtered attention-weighted encoding with the embedding of the previous word** (`<start>` to begin), and run the `LSTMCell` to **generate the new hidden state (or output)**. A linear layer **transforms this new hidden state into scores for each word in the vocabulary**, which is stored.
We also store the weights returned by the Attention network at each timestep. You will see why soon enough.
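Putting the last few paragraphs together, a sketch of the decoding loop might look like this. It builds on the `Attention` and `init_hidden_state` sketches above, and the names, sizes, and toy inputs are all assumptions rather than the exact code in `models.py`:

```python
import torch
from torch import nn

embed_dim, decoder_dim, encoder_dim, vocab_size = 512, 512, 2048, 10000
num_pixels = 14 * 14

embedding = nn.Embedding(vocab_size, embed_dim)
attention = Attention(encoder_dim, decoder_dim, attention_dim=512)  # from the sketch above
f_beta = nn.Linear(decoder_dim, encoder_dim)         # the sigmoid gate from section 4.2.1
decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim)
fc = nn.Linear(decoder_dim, vocab_size)              # hidden state -> vocabulary scores

# Toy batch, already sorted by decreasing caption length
N = 4
decode_lengths = [7, 5, 3, 2]
encoder_out = torch.randn(N, num_pixels, encoder_dim)
captions = torch.randint(0, vocab_size, (N, max(decode_lengths) + 1))
embeddings = embedding(captions)                     # (N, L + 1, embed_dim)
h, c = init_hidden_state(encoder_out)                # from the sketch above

predictions = torch.zeros(N, max(decode_lengths), vocab_size)
alphas = torch.zeros(N, max(decode_lengths), num_pixels)

for t in range(max(decode_lengths)):
    # Effective batch size: only the captions that still have words to decode at step t
    batch_size_t = sum([l > t for l in decode_lengths])
    awe, alpha = attention(encoder_out[:batch_size_t], h[:batch_size_t])
    gate = torch.sigmoid(f_beta(h[:batch_size_t]))   # filter the attention-weighted encoding
    awe = gate * awe
    # Concatenate the previous word's embedding with the gated encoding, run one LSTM step
    h, c = decode_step(torch.cat([embeddings[:batch_size_t, t, :], awe], dim=1),
                       (h[:batch_size_t], c[:batch_size_t]))
    predictions[:batch_size_t, t, :] = fc(h)         # scores over the vocabulary
    alphas[:batch_size_t, t, :] = alpha              # store the attention weights
```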
**The outputs of the Decoder are (1) predictions/scores at all timesteps with dimensions `N, L, V`, and (2) weights at all timesteps with dimensions `N, L, P`**, where `L` is the maximum caption decode-length in the batch, `V` is the vocabulary size, and `P` is the number of pixels.
We also output the sorted captions, their lengths, and sort indices.
# Training
See `train.py`.
Since we're generating a sequence of words, we use **[`CrossEntropyLoss`](https://pytorch.org/docs/master/nn.html#torch.nn.CrossEntropyLoss)**. You only need to submit the raw scores from the final layer in the Decoder, and the loss function will perform the softmax and log operations.
The authors of the paper recommend using a second loss - a "**doubly stochastic regularization**". We know the weights sum to 1 at a given timestep. But we also encourage the weights at a single pixel to sum to 1 across _all_ timesteps. This means we want the model to attend to every pixel over the course of generating the entire sequence. Towards this end, we try to **minimize the difference between 1 and the sum of a pixel's weights across all timesteps**.
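A self-contained sketch of this combined loss, with toy tensors and an assumed regularization weight `alpha_c`. In the real training code only the valid, non-`<pad>` timesteps would be scored, e.g. by packing the sequences with `pack_padded_sequence` first:

```python
import torch
from torch import nn

N, L, vocab_size, num_pixels = 4, 7, 10000, 14 * 14
scores = torch.randn(N, L, vocab_size, requires_grad=True)  # raw scores from the Decoder
targets = torch.randint(0, vocab_size, (N, L))              # gold next-word indices
alphas = torch.rand(N, L, num_pixels)                       # attention weights at each timestep

# Cross-entropy on the raw scores; softmax and log are handled inside the loss
criterion = nn.CrossEntropyLoss()
loss = criterion(scores.view(-1, vocab_size), targets.view(-1))

# Doubly stochastic regularization: push each pixel's weights to sum to 1 over all timesteps
alpha_c = 1.0
loss = loss + alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()
loss.backward()
```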
The sigmoid gate in the Decoder and this regularization are key to the model attending well to the relevant parts of the image.