
Commit 1431f90

Update README.md

1 parent 30e90c3
1 file changed: README.md (+28, -22)
@@ -34,9 +34,11 @@ Here are some captions generated on _test_ images not seen during training or validation:

---

![](./img/plane.png)

---

![](./img/boats.png)

---

@@ -58,66 +60,68 @@ Notice how the model is looking at the boats when it says `boats`, and the sand when it says `the beach`.

There are more examples at the [end of the tutorial]().

---

# Concepts

* **Image captioning**. You will learn about the general structure of captioning models, how they work, and their implementation.

* **Encoder-Decoder architecture**. Any model that generates sequences will use an Encoder to encode the input into a fixed form and a Decoder to decode it, word by word, into a sequence.

* **Attention**. The use of Attention networks is widespread in deep learning, and with good reason. This is a way for a model to choose only those parts of the encoding that it thinks are relevant to the task at hand. The same mechanism you see employed here can be used in any model where the Encoder's output has multiple points in space or time. In image captioning, you consider some pixels more important than others. In sequence-to-sequence tasks like machine translation, you consider some words more important than others.

* **Transfer Learning**. This is when you borrow from an existing model by using parts of it in a new model. This is almost always better than training a new model from scratch (i.e., knowing nothing). As you will see, you can always fine-tune this second-hand knowledge to the specific task at hand. Using pretrained word embeddings is a dumb but valid example. For our image captioning problem, we will use a pretrained Encoder, and then fine-tune it as needed.

* **Beam Search**. This is where you don't let your Decoder be lazy and simply choose the word with the _best_ score at each decode-step. Instead, you keep several candidate sequences in play and pick the best complete caption at the end. A minimal sketch of the idea follows this list.

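Here is a minimal, generic beam-search sketch in the spirit of the bullet above. It is not the repository's implementation; `next_word_log_probs` is a hypothetical stand-in for one decode-step of the model, returning a log-probability for every possible next token.

```python
# Generic beam search over a hypothetical one-step scoring function.
# next_word_log_probs(partial_caption) -> {token_id: log_probability}

def beam_search(next_word_log_probs, start_token, end_token, beam_size=3, max_len=50):
    beams = [(0.0, [start_token])]            # (cumulative log-probability, tokens)
    completed = []

    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:          # this caption is already finished
                completed.append((score, seq))
                continue
            for token, log_p in next_word_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        if not candidates:                    # every beam has finished
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]        # keep only the beam_size best partial captions

    completed.extend(b for b in beams if b[1][-1] != end_token)
    best_score, best_seq = max(completed, key=lambda c: c[0])
    return best_seq
```

In practice the beams are batched tensors scored by the Decoder, but the pruning idea is the same.
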
# Overview

In this section, I will present a broad overview of this model. I don't really get into the _minutiae_ here - feel free to skip to the implementation section and commented code for details.

### Encoder

The Encoder **encodes the input image with 3 color channels into a smaller image with "learned" channels**.

This smaller encoded image is a summary representation of all that's useful in the original image.

Since we want to encode images, we use Convolutional Neural Networks (CNNs).

We don't need to train an encoder from scratch. Why? Because there are already CNNs trained to represent images.

For years, people have been building models that are extraordinarily good at classifying an image into one of a thousand categories. It stands to reason that these models capture the essence of an image very well.

I have chosen to use the **101-layered Residual Network trained on the ImageNet classification task**, available in PyTorch. As stated earlier, this is an example of Transfer Learning. You have the option of fine-tuning it to improve performance.

![ResNet Encoder](./img/encoder.png)
<p align="center">
*ResNet-101 Encoder*
</p>

These models progressively create smaller and smaller representations of the original image, and each subsequent representation is more "learned", with a greater number of channels. The final encoding produced by our ResNet-101 encoder has a size of 14x14 with 2048 channels, i.e., a `2048, 14, 14` size tensor.

I encourage you to experiment with other pretrained architectures. The paper uses a VGGnet, also pretrained on ImageNet, but without fine-tuning. Either way, modifications are necessary. Since the last layer or two of these models are linear layers coupled with softmax activation for classification, we strip them away.
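
As a concrete illustration of the Encoder described above, here is a minimal sketch assuming torchvision's pretrained ResNet-101: drop the final pooling and linear/softmax layers, resize the feature map to a fixed 14x14 grid, and freeze the pretrained weights until you choose to fine-tune. This is a sketch of the idea, not the repository's exact module.

```python
import torch
import torch.nn as nn
import torchvision


class Encoder(nn.Module):
    def __init__(self, encoded_size=14):
        super().__init__()
        # ImageNet-pretrained ResNet-101 (newer torchvision versions use the `weights=` argument).
        resnet = torchvision.models.resnet101(pretrained=True)
        # Drop the last two modules: the average pool and the 1000-way linear classifier.
        self.resnet = nn.Sequential(*list(resnet.children())[:-2])
        # Resize the final feature map to a fixed encoded_size x encoded_size grid.
        self.pool = nn.AdaptiveAvgPool2d((encoded_size, encoded_size))
        # Transfer learning: freeze the borrowed weights; selectively unfreeze to fine-tune.
        for param in self.resnet.parameters():
            param.requires_grad = False

    def forward(self, images):                # images: (batch, 3, H, W)
        features = self.resnet(images)        # (batch, 2048, H/32, W/32)
        return self.pool(features)            # (batch, 2048, 14, 14)


encoder = Encoder()
print(encoder(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 2048, 14, 14])
```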

### Decoder

The Decoder's job is to look at the encoded image and generate a caption word by word.

Since it's generating a sequence, it would need to be a Recurrent Neural Network (RNN). We will try the LSTM flavor.

In a typical setting without Attention, you could simply average the encoded image across all pixels. You could then feed this, with or without a linear transformation, into the Decoder as its first hidden state and generate the caption. Each predicted word is used to generate the next word.

![Decoder without Attention](./img/decoder_no_att.png)
<p align="center">
*Decoding without Attention*
</p>
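
To make that no-Attention baseline concrete, here is a rough sketch: average the encoded image over all pixels, transform the result into the LSTM's initial states, and greedily emit one word at a time. The layer sizes and the `start_token`/`end_token` ids are illustrative assumptions, not values from the repository.

```python
import torch
import torch.nn as nn


class SimpleDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, encoder_dim=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(encoder_dim, hidden_dim)    # mean image -> initial hidden state
        self.init_c = nn.Linear(encoder_dim, hidden_dim)    # mean image -> initial cell state
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)         # scores over the vocabulary

    def generate(self, encoder_out, start_token, end_token, max_len=20):
        # encoder_out: (batch, 2048, 14, 14) -> simple average over all 14x14 pixels
        mean_image = encoder_out.flatten(2).mean(dim=2)     # (batch, 2048)
        h, c = self.init_h(mean_image), self.init_c(mean_image)
        word = torch.full((encoder_out.size(0),), start_token, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.embedding(word), (h, c))  # feed the previous word
            word = self.fc(h).argmax(dim=1)                 # greedily pick the next word
            caption.append(word)
            if (word == end_token).all():
                break
        return torch.stack(caption, dim=1)                  # (batch, caption_length)
```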

In a setting _with_ Attention, we want the Decoder to be able to look at different parts of the image at different points in the sequence. For example, while generating the word `football` in `a man holds a football`, the Decoder would know to focus on the - you guessed it - football!

![Decoding with Attention](./img/decoder_att.png)
<p align="center">
*Decoding with Attention*
</p>

Instead of the simple average, we use the _weighted_ average across all pixels, with the weights of the important pixels being greater. This weighted representation of the image can be concatenated with the previously generated word at each step to generate the next word.
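
Here is a sketch of one decode-step with Attention as just described: the pixel weights `alpha` give a weighted average of the encoded image, which is concatenated with the embedding of the previously generated word before being fed to the LSTM. `alpha` is taken as given here (the next section shows one way to compute it), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

embed_dim, encoder_dim, hidden_dim, vocab_size = 512, 2048, 512, 10000
embedding = nn.Embedding(vocab_size, embed_dim)
lstm_step = nn.LSTMCell(embed_dim + encoder_dim, hidden_dim)
fc = nn.Linear(hidden_dim, vocab_size)


def decode_step(encoder_out, alpha, prev_word, h, c):
    # encoder_out: (batch, num_pixels, 2048); alpha: (batch, num_pixels), sums to 1 per image
    context = (alpha.unsqueeze(2) * encoder_out).sum(dim=1)         # weighted average: (batch, 2048)
    lstm_input = torch.cat([embedding(prev_word), context], dim=1)  # concatenate with previous word
    h, c = lstm_step(lstm_input, (h, c))
    return fc(h), h, c                                              # scores for the next word
```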

### Attention

@@ -134,6 +138,8 @@ This is exactly what the attention mechanism does - it considers the sequence generated thus far

We will use the _soft_ Attention, where the weights of the pixels add up to 1. You could interpret this as finding the probability that a certain pixel is _the_ important part of the image to generate the next word.
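
As a minimal sketch of soft Attention, assuming illustrative dimensions rather than the tutorial's exact layers: score every pixel from the encoded image and the Decoder's current hidden state, then softmax the scores so the weights add up to 1.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    def __init__(self, encoder_dim=2048, hidden_dim=512, attention_dim=512):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # transform each encoded pixel
        self.decoder_att = nn.Linear(hidden_dim, attention_dim)   # transform the Decoder's hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # one score per pixel

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (batch, num_pixels, 2048); decoder_hidden: (batch, hidden_dim)
        att = torch.relu(self.encoder_att(encoder_out) + self.decoder_att(decoder_hidden).unsqueeze(1))
        scores = self.full_att(att).squeeze(2)                    # (batch, num_pixels)
        alpha = torch.softmax(scores, dim=1)                      # pixel weights, sum to 1
        context = (alpha.unsqueeze(2) * encoder_out).sum(dim=1)   # weighted average of the pixels
        return context, alpha
```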

(Funny story - when I was a kid growing up in India doing drills at school, the PE teacher would
### Putting it all together

It might be clear by now what our combined network looks like.
