
Commit 6fc9103 ("changes")

1 parent 378469e

2 files changed: +11, -19 lines

README.md (+11, -19)
@@ -4,7 +4,7 @@ This is the first of a series of tutorials I plan to write about _implementing_
Basic knowledge of PyTorch, convolutional and recurrent neural networks is assumed.

- Questions, suggestions, or corrections can be posted as issues.
+ Questions, suggestions, or corrections can be posted as issues.

I'm using `PyTorch 0.4` in `Python 3.6`.

@@ -251,48 +251,40 @@ See `Encoder` in `models.py`.
We use a pretrained ResNet-101 already available in PyTorch's `torchvision` module. Discard the last two layers (pooling and linear layers), since we only need to encode the image, and not classify it.

- We do add an `AdaptiveAvgPool2d()` layer to resize the encoding to a fixed size. This makes it possible to feed images of variable size to the Encoder. (We did, however, resize our input images to `256, 256` because we had to store them together as a single tensor.)
+ We do add an `AdaptiveAvgPool2d()` layer to **resize the encoding to a fixed size**. This makes it possible to feed images of variable size to the Encoder. (We did, however, resize our input images to `256, 256` because we had to store them together as a single tensor.)

- Since we may want to fine-tune the Encoder, we add a `fine_tune()` method which enables or disables the calculation of gradients for the Encoder's parameters. We only fine-tune convolutional blocks 2 through 4 in the ResNet, because the first convolutional block would have usually learned something very fundamental to image processing, such as detecting lines, edges, curves, etc. We don't mess with the foundations.
-
- **The output of the Encoder is the encoded image with dimensions `N, 14, 14, 2048`**.
+ Since we may want to fine-tune the Encoder, we add a `fine_tune()` method which enables or disables the calculation of gradients for the Encoder's parameters. We **only fine-tune convolutional blocks 2 through 4 in the ResNet**, because the first convolutional block would have usually learned something very fundamental to image processing, such as detecting lines, edges, curves, etc. We don't mess with the foundations.
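
To make the Encoder concrete, here is a minimal sketch of such a module. It follows the description above (`Encoder` and `fine_tune()` are the names the text uses); the `encoded_size` argument and other details are illustrative rather than a copy of `models.py`:

```python
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """Illustrative sketch of the Encoder described above."""
    def __init__(self, encoded_size=14):
        super(Encoder, self).__init__()
        resnet = torchvision.models.resnet101(pretrained=True)
        # Discard the last two layers (pooling and linear classification layers)
        self.resnet = nn.Sequential(*list(resnet.children())[:-2])
        # Resize feature maps to a fixed spatial size, so variable-size images are fine
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_size, encoded_size))
        self.fine_tune(True)

    def forward(self, images):              # images: (N, 3, 256, 256)
        out = self.resnet(images)           # (N, 2048, H/32, W/32)
        out = self.adaptive_pool(out)       # (N, 2048, 14, 14)
        return out.permute(0, 2, 3, 1)      # (N, 14, 14, 2048)

    def fine_tune(self, fine_tune=True):
        # Freeze everything, then optionally unfreeze convolutional blocks 2 through 4
        for p in self.resnet.parameters():
            p.requires_grad = False
        for block in list(self.resnet.children())[5:]:
            for p in block.parameters():
                p.requires_grad = fine_tune
```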

### Attention

See `Attention` in `models.py`.

The Attention network is simple - it's composed of only linear layers and a couple of activations.

- Separate linear layers transform both the encoded image (flattened to `N, 14 * 14, 2048`) and the hidden state (output) from the Decoder to the same dimension, viz. the Attention size. They are then added and ReLU activated. A third linear layer transforms this result to a dimension of 1, whereupon we apply the softmax to generate the weights `alpha`.
-
- **The outputs of the Attention network are (1) the weighted average with dimensions `N, 2048`, and (2) the weights with dimensions `N, 14 * 14`**.
+ Separate linear layers **transform both the encoded image (flattened to `N, 14 * 14, 2048`) and the hidden state (output) from the Decoder to the same dimension**, viz. the Attention size. They are then added and ReLU activated. A third linear layer **transforms this result to a dimension of 1**, whereupon we **apply the softmax to generate the weights** `alpha`.
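
A minimal sketch of such an Attention module follows; the attention and decoder sizes (512) and the layer names are illustrative assumptions, not necessarily those in `models.py`:

```python
import torch.nn as nn

class Attention(nn.Module):
    """Illustrative sketch of the additive Attention network described above."""
    def __init__(self, encoder_dim=2048, decoder_dim=512, attention_dim=512):
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # transforms the encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # transforms the Decoder's hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # reduces the sum to one score per pixel
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (N, num_pixels, encoder_dim), decoder_hidden: (N, decoder_dim)
        att1 = self.encoder_att(encoder_out)                      # (N, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)                   # (N, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1)))  # (N, num_pixels, 1)
        alpha = self.softmax(att.squeeze(2))                      # (N, num_pixels) -- the weights
        weighted = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (N, encoder_dim) -- weighted average
        return weighted, alpha
```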

### Decoder

See `DecoderWithAttention` in `models.py`.

The output of the Encoder is received here and flattened to dimensions `N, 14 * 14, 2048`. This is just convenient and prevents having to reshape the tensor multiple times.

- We initialize the hidden and cell state of the LSTM using the encoded image with the `init_hidden_state()` method, which uses two separate linear layers.
+ We **initialize the hidden and cell state of the LSTM** using the encoded image with the `init_hidden_state()` method, which uses two separate linear layers.
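
A sketch of what `init_hidden_state()` might do, assuming (as in the paper) that the encoder output is first averaged over all pixels; the layer names `init_h` and `init_c` are illustrative:

```python
import torch.nn as nn

encoder_dim, decoder_dim = 2048, 512
init_h = nn.Linear(encoder_dim, decoder_dim)  # produces the initial hidden state
init_c = nn.Linear(encoder_dim, decoder_dim)  # produces the initial cell state

def init_hidden_state(encoder_out):
    # encoder_out: (N, num_pixels, encoder_dim)
    mean_encoder_out = encoder_out.mean(dim=1)  # average the encoding over all pixels
    return init_h(mean_encoder_out), init_c(mean_encoder_out)
```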

- At the very outset, we sort the `N` images and captions by decreasing caption lengths. This is so that we can process only _valid_ timesteps, i.e., not process the `<pad>`s.
+ At the very outset, we **sort the `N` images and captions by decreasing caption lengths**. This is so that we can process only _valid_ timesteps, i.e., not process the `<pad>`s.

![](./img/sorted.jpg)

- We can iterate over each timestep, processing only the colored regions, which are the _effective_ batch size `N_t` at that timestep. The sorting allows the top `N_t` at any timestep to align with the outputs from the previous step. At the third timestep, for example, we process only the top 5 images, using the top 5 outputs from the previous step.
+ We can iterate over each timestep, processing only the colored regions, which are the **_effective_ batch size** `N_t` at that timestep. The sorting allows the top `N_t` at any timestep to align with the outputs from the previous step. At the third timestep, for example, we process only the top 5 images, using the top 5 outputs from the previous step.
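
As a sketch of the bookkeeping this implies (the tensor values below are dummies, and the variable names are illustrative):

```python
import torch

# Toy inputs with the shapes described above (values are dummies)
N, num_pixels, encoder_dim, max_len = 4, 14 * 14, 2048, 20
encoder_out = torch.randn(N, num_pixels, encoder_dim)
encoded_captions = torch.randint(0, 100, (N, max_len)).long()
caption_lengths = torch.tensor([[12], [18], [7], [15]])

# Sort by decreasing caption length so the valid rows are always at the top
caption_lengths, sort_ind = caption_lengths.squeeze(1).sort(dim=0, descending=True)
encoder_out = encoder_out[sort_ind]
encoded_captions = encoded_captions[sort_ind]

# We don't decode at the <end> token, so decode lengths are the caption lengths minus 1
decode_lengths = (caption_lengths - 1).tolist()

for t in range(max(decode_lengths)):
    # Effective batch size N_t: captions that still have a valid word at timestep t
    batch_size_t = sum(l > t for l in decode_lengths)
    # ...run the Attention network and the LSTMCell on the top batch_size_t rows only...
```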

- We iterate over the timesteps _manually_ in a `for` loop with a PyTorch [`LSTMCell`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTMCell) instead of iterating automatically without a loop with a PyTorch [`LSTM`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTM). This is because we need to execute the Attention mechanism between each decode step. An `LSTMCell` is a single timestep operation, whereas an `LSTM` would iterate over multiple timesteps continuously and provide all outputs at once.
+ This **iteration is performed _manually_ in a `for` loop** with a PyTorch [`LSTMCell`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTMCell) instead of iterating automatically without a loop with a PyTorch [`LSTM`](https://pytorch.org/docs/master/nn.html#torch.nn.LSTM). This is because we need to execute the Attention mechanism between each decode step. An `LSTMCell` is a single timestep operation, whereas an `LSTM` would iterate over multiple timesteps continuously and provide all outputs at once.

- We compute the weights and attention-weighted encoding at each timestep with the Attention network. In section `4.2.1` of the paper, they recommend passing the attention-weighted encoding through a filter or gate. This gate is a sigmoid-activated linear transform of the Decoder's previous hidden state. The authors state that this helps the Attention network put more emphasis on the objects in the image.
+ We **compute the weights and attention-weighted encoding** at each timestep with the Attention network. In section `4.2.1` of the paper, they recommend **passing the attention-weighted encoding through a filter or gate**. This gate is a sigmoid-activated linear transform of the Decoder's previous hidden state. The authors state that this helps the Attention network put more emphasis on the objects in the image.

- We concatenate this filtered attention-weighted encoding with the embedding of the previous word (`<start>` to begin), and run the `LSTMCell` to generate the new hidden state (or output). A linear layer transforms this new hidden state into scores for each word in the vocabulary, which are stored.
+ We **concatenate this filtered attention-weighted encoding with the embedding of the previous word** (`<start>` to begin), and run the `LSTMCell` to **generate the new hidden state (or output)**. A linear layer **transforms this new hidden state into scores for each word in the vocabulary**, which are stored.
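
Putting the last few paragraphs together, a single decode step might look roughly like the fragment below. It is a sketch, not the repo's exact `forward()`: it reuses `encoder_out`, `h`, `c`, `batch_size_t` and `decode_lengths` from the loop sketched earlier, assumes `embeddings`, `predictions` and `alphas` tensors exist, and the layer names (`f_beta`, `decode_step`, `fc`) are placeholders:

```python
import torch
import torch.nn as nn

# Assumed layers (names illustrative); `attention` is the Attention module sketched earlier
embed_dim, encoder_dim, decoder_dim, vocab_size = 512, 2048, 512, 10000
attention = Attention(encoder_dim, decoder_dim, attention_dim=512)
decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim)  # one timestep per call
f_beta = nn.Linear(decoder_dim, encoder_dim)                     # produces the sigmoid gate
fc = nn.Linear(decoder_dim, vocab_size)                          # hidden state -> word scores

# Inside the loop over timesteps t, with batch_size_t active rows:
awe, alpha = attention(encoder_out[:batch_size_t], h[:batch_size_t])
gate = torch.sigmoid(f_beta(h[:batch_size_t]))  # filter/gate from the previous hidden state
awe = gate * awe                                # gated attention-weighted encoding

h, c = decode_step(
    torch.cat([embeddings[:batch_size_t, t, :], awe], dim=1),  # previous word + gated encoding
    (h[:batch_size_t], c[:batch_size_t]))

predictions[:batch_size_t, t, :] = fc(h)  # scores over the vocabulary at this timestep
alphas[:batch_size_t, t, :] = alpha       # attention weights at this timestep
```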

We also store the weights returned by the Attention network at each timestep. You will see why soon enough.

- **The outputs of the Decoder are (1) predictions/scores at all timesteps with dimensions `N, L, V`, and (2) weights at all timesteps with dimensions `N, L, P`**, where `L` is the maximum caption decode-length in the batch, `V` is the vocabulary size, and `P` is the number of pixels.
-
- We also output the sorted captions, their lengths, and sort indices.
# Training

See `train.py`.
@@ -301,7 +293,7 @@ See `train.py`.

Since we're generating a sequence of words, we use **[`CrossEntropyLoss`](https://pytorch.org/docs/master/nn.html#torch.nn.CrossEntropyLoss)**. You only need to submit the raw scores from the final layer in the Decoder, and the loss function will perform the softmax and log operations.
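
As a sketch (reusing `predictions`, `decode_lengths` and the sorted `encoded_captions` from the earlier snippets; trimming padded positions with `pack_padded_sequence` is one convenient way to keep only valid timesteps):

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

criterion = nn.CrossEntropyLoss()

# predictions: (N, L, V) raw scores; targets: (N, L) word indices (the captions minus <start>)
targets = encoded_captions[:, 1:]

# Keep only the valid (non-<pad>) timesteps before computing the loss
scores = pack_padded_sequence(predictions, decode_lengths, batch_first=True).data
targets = pack_padded_sequence(targets, decode_lengths, batch_first=True).data

loss = criterion(scores, targets)  # softmax and log are handled inside the loss
```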

- The authors of the paper recommend using a second loss - a "**doubly stochastic regularization**". We know the weights sum up to 1 at a given timestep. But we also encourage the weight at a single pixel to sum up to 1 across _all_ timesteps. This means we want the model to attend to every pixel over the course of generating the entire sequence. Towards this end, we try to **minimize the difference between 1 and the sum of a pixel's weights across all timesteps**.
+ The authors of the paper recommend using a second loss - a "**doubly stochastic regularization**". We know the weights sum to 1 at a given timestep. But we also encourage the weight at a single pixel to sum to 1 across _all_ timesteps. This means we want the model to attend to every pixel over the course of generating the entire sequence. Towards this end, we try to **minimize the difference between 1 and the sum of a pixel's weights across all timesteps**.
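
On top of the cross-entropy loss above, this term can be added in one line; here `alphas` has dimensions `N, L, P`, and the weight `alpha_c` is an assumed hyperparameter:

```python
alpha_c = 1.0  # assumed regularization weight
# Penalize each pixel whose attention weights don't sum to ~1 across all timesteps
loss += alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()
```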

The sigmoid gate in the Decoder and this regularization are key to the model attending well to the relevant parts of the image.

img/sorted2.jpg (318 KB)