
Commit 69e3c40

Add VQA, TensorFlow implementation (pclubiitk#38)
* Add VQA TensorFlow
* Add comments and readability improvements
* Add results to README
* Update README.md
1 parent bb43442 commit 69e3c40

File tree

7 files changed: +419 -0 lines changed

README.md (+100 lines)
@@ -0,0 +1,100 @@
# TensorFlow Implementation of VQA

## Usage

```bash
$ python3 main.py
```

> **_NOTE:_** on a Colab notebook, use the following commands:

```python
!git clone link-to-repo
%run main.py
```

## Help Log

```
usage: main.py [-h] [--type TYPE] [--base_path BASE_PATH] [--epochs EPOCHS]
               [--batch_size BATCH_SIZE] [--data_limit DATA_LIMIT]
               [--weights_load WEIGHTS_LOAD] [--weight_path WEIGHT_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --type TYPE           Whether you want to train or validate, default train
  --base_path BASE_PATH
                        Relative path to location where to download data,
                        default '.'
  --epochs EPOCHS       Number of epochs, default 10
  --batch_size BATCH_SIZE
                        Batch Size, default 256
  --data_limit DATA_LIMIT
                        Number of data points to feed for training, default
                        215359 (size of dataset)
  --weights_load WEIGHTS_LOAD
                        Boolean to say whether to load pretrained model or
                        train new model, default False
  --weight_path WEIGHT_PATH
                        Relative path to location of saved weights, default
                        '.'
```
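For example, one possible invocation to evaluate a previously trained model is shown below (the flag values are illustrative; the weight path is where `main.py` saves weights by default when `--base_path` is `'.'`):

```bash
$ python3 main.py --type val --weights_load True --weight_path ./data/model/saved_model.h5
```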

## Contributed by:
* [Atharv Singh Patlan](https://github.com/AthaSSiN)

## References

* **Title**: VQA: Visual Question Answering
* **Authors**: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
* **Link**: https://arxiv.org/pdf/1505.00468
* **Tags**: Neural Network, Computer Vision, Natural Language Processing
* **Year**: 2015

# Summary

## Introduction

Visual Question Answering (VQA) is the task of answering questions about a given piece of visual content such as an image, video or infographic. Answering such questions requires a variety of skills, including recognizing entities and objects, reasoning about their interactions with each other (both spatially and temporally), reading text, parsing audio, interpreting abstract and graphical illustrations, and using external knowledge not directly present in the given content.

It is a combination of Natural Language Processing and Computer Vision, enabling our model to interpret the questions posed by the user and search for the answer in the input picture.

![Egs](https://miro.medium.com/max/552/1*jLshTllNrGvpXjJWkSDFSQ.png)

While seemingly an easy task for humans, VQA poses several challenges to AI systems, spanning the fields of natural language processing, computer vision, audio processing, knowledge representation and reasoning. Over the past few years, the advent of deep learning, the availability of large datasets to train VQA models, and the hosting of a number of benchmarking contests have contributed to a surge of interest in VQA among researchers in these disciplines.

One of the early and popular datasets for this task was the VQA-v1 dataset. It is a very large dataset consisting of two types of images, natural images (referred to as real images) and synthetic images (referred to as abstract scenes), and it comes in two answering modalities: multiple-choice question answering (selecting the right answer from a set of choices) and open-ended question answering (generating an answer with an open-ended vocabulary).

The real-images portion of the VQA-v1 dataset consists of over 200,000 natural images sourced from the MS-COCO dataset, a large-scale dataset of images used to benchmark tasks such as object detection, segmentation and image captioning. Each image is paired with 3 questions written by crowdsourced annotators.

The dataset contains a variety of question types, such as: What color, What kind, Why, How many, Is the, etc. To account for potential disagreements between humans on some questions, as well as for crowdsourcing noise, each question is accompanied by 10 answers.
Given an image and a question, the goal of a VQA system is to produce an answer that matches those provided by human annotators. For the open-ended answering modality, the evaluation metric used is:

__accuracy__ = min(#annotator answers matching the generated answer / 3, 1)

The intuition behind this metric is as follows:
If a system-generated answer matches one produced by at least 3 unique annotators, it gets the maximum score of 1, on account of producing a popular answer. If it generates an answer that isn't present among the 10 candidates, it gets a score of 0, and it is assigned a fractional score if it produces an answer that is deemed rare. If the denominator 3 were lowered, wrong and noisy answers in the dataset (often present due to annotation noise) would receive high credit. Conversely, if it were raised towards 10, a system producing the right answer might only receive partial credit when the answer choices consist of synonyms or happen to contain a few noisy answers.
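In code, the metric for a single question can be sketched as follows (the function and example values are illustrative, not part of this repository):

```python
def vqa_accuracy(predicted_answer, annotator_answers):
    # min(#annotators who gave this exact answer / 3, 1)
    matches = sum(ans == predicted_answer for ans in annotator_answers)
    return min(matches / 3.0, 1.0)

# 2 of the 10 annotators agree with the prediction -> partial credit of 2/3
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))  # 0.666...
```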

## Implementation

Our implementation, which is the standard and best-performing configuration described by the authors of the VQA paper, uses image embeddings from a pretrained VGG net and encodes the question by passing the GloVe word embeddings of the input words through 2 layers of LSTM. In contrast to averaging the word embeddings, using an LSTM preserves information regarding the order of the words in the question and leads to improved VQA accuracy.

The other possible variants of the VQA model are shown below:

![Model](https://miro.medium.com/max/1104/1*OULUt5c9t_MvGMWLmnPGTA.png)

We use the following default configuration (see the sketch after this list):

- Pretrained VGG activations of the VQA input images
- 2 layers of LSTM on top of pretrained GloVe word embeddings of the input questions
- Concatenation of the outputs of the NLP and vision branches, fed into a single dense layer
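Putting these pieces together with the modules added in this commit, the default model can be constructed roughly as follows (a sketch; the file paths assume the default `--base_path '.'` layout used in `main.py`):

```python
from model import VQA
from dataloader import get_metadata, prepare_embeddings

metadata = get_metadata("./data/image_data/data_prepro.json")
num_words = len(metadata['ix_to_word'])

# Builds (or loads a cached) GloVe embedding matrix for the question vocabulary
embedding_matrix = prepare_embeddings(
    num_words=num_words, embedding_dim=300, metadata=metadata,
    glove_path="./data/glove/glove.6B.300d.txt",
    train_questions_path="./data/train_ques/MultipleChoice_mscoco_train2014_questions.json",
    embedding_matrix_filename="./data/ckpts/embeddings_300.h5")

# VGG branch + 2-layer LSTM branch, concatenated and classified over the answer vocabulary
model = VQA(embedding_matrix, num_words, embedding_dim=300, seq_length=26,
            dropout_rate=0.5, num_classes=len(metadata['ix_to_ans']))
model.summary()
```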

## Results

Here are the results for the model specified above.

This figure shows P(model is correct | answer) for the 50 most frequent ground-truth answers on the VQA validation set (the plot is sorted by accuracy, not frequency).

![sysans](./assets/sys_give_ans.png)

This figure shows P(answer | model is correct) for the 50 most frequently predicted answers on the VQA validation set (the plot is sorted by prediction frequency, not accuracy).

![anssys](./assets/ans_giv_sys.png)

# Sources

- [Vanilla VQA](https://medium.com/ai2-blog/vanilla-vqa-adcaaaa94336)
dataloader.py (+129 lines)
@@ -0,0 +1,129 @@
import tensorflow as tf
import numpy as np
from tensorflow import keras
from keras.utils.np_utils import to_categorical
import json
import h5py
import os

############################################

def right_align(seq,lengths):
    # Align the input questions to the right side (pad on left with zeros)
    v = np.zeros(np.shape(seq))
    N = np.shape(seq)[1]
    for i in range(np.shape(seq)[0]):
        v[i][N-lengths[i]:N]=seq[i][0:lengths[i]]
    return v

#############################################

def read_data(data_img, data_prepro, data_limit):
    print("Reading Data...")
    img_data = h5py.File(data_img, 'r')
    ques_data = h5py.File(data_prepro, 'r')

    # Reading up to data_limit images
    img_data = np.array(img_data['images_train'])
    img_pos_train = ques_data['img_pos_train'][:data_limit]
    train_img_data = np.array([img_data[_-1,:] for _ in img_pos_train])

    # Normalizing images
    tem = np.sqrt(np.sum(np.multiply(train_img_data, train_img_data), axis=1))
    train_img_data = np.divide(train_img_data, np.transpose(np.tile(tem,(4096,1))))

    # Shifting padding to the left side
    ques_train = np.array(ques_data['ques_train'])[:data_limit, :]
    ques_length_train = np.array(ques_data['ques_length_train'])[:data_limit]
    ques_train = right_align(ques_train, ques_length_train)

    train_X = [train_img_data, ques_train]

    # All validation answers which are not in the training set have been labelled as 1
    train_y = to_categorical(ques_data['answers'])[:data_limit, :]

    return train_X, train_y

########################################

def get_val_data(val_annotations_path, data_img, data_prepro, data_prepro_meta):
    img_data = h5py.File(data_img, 'r')
    ques_data = h5py.File(data_prepro, 'r')
    metadata = get_metadata(data_prepro_meta)
    with open(val_annotations_path, 'r') as an_file:
        annotations = json.loads(an_file.read())

    img_data = np.array(img_data['images_test'])
    img_pos_train = ques_data['img_pos_test']
    train_img_data = np.array([img_data[_-1,:] for _ in img_pos_train])
    tem = np.sqrt(np.sum(np.multiply(train_img_data, train_img_data), axis=1))
    train_img_data = np.divide(train_img_data, np.transpose(np.tile(tem,(4096,1))))

    ques_train = np.array(ques_data['ques_test'])
    ques_length_train = np.array(ques_data['ques_length_test'])
    ques_train = right_align(ques_train, ques_length_train)

    # Convert all last indices to 0, as embeddings were made that way
    for _ in ques_train:
        if 12602 in _:
            _[_==12602] = 0

    val_X = [train_img_data, ques_train]

    ans_to_ix = {str(ans):int(i) for i,ans in metadata['ix_to_ans'].items()}
    ques_annotations = {}
    for _ in annotations['annotations']:
        idx = ans_to_ix.get(_['multiple_choice_answer'].lower())
        _['multiple_choice_answer_idx'] = 1 if idx in [None, 1000] else idx
        ques_annotations[_['question_id']] = _

    abs_val_y = [ques_annotations[ques_id]['multiple_choice_answer_idx'] for ques_id in ques_data['question_id_test']]
    abs_val_y = to_categorical(np.array(abs_val_y))

    multi_val_y = [list(set([ans_to_ix.get(_['answer'].lower()) for _ in ques_annotations[ques_id]['answers']])) for ques_id in ques_data['question_id_test']]
    for i,_ in enumerate(multi_val_y):
        multi_val_y[i] = [1 if ans in [None, 1000] else ans for ans in _]

    return val_X, abs_val_y, multi_val_y

###############################################

def get_metadata(data_prepro_meta):
    meta_data = json.load(open(data_prepro_meta, 'r'))
    meta_data['ix_to_word'] = {str(word):int(i) for i,word in meta_data['ix_to_word'].items()}
    return meta_data

###############################################

def prepare_embeddings(num_words, embedding_dim, metadata, glove_path, train_questions_path, embedding_matrix_filename):
    if os.path.exists(embedding_matrix_filename):
        with h5py.File(embedding_matrix_filename) as f:
            return np.array(f['embedding_matrix'])

    print("Embedding Data...")
    with open(train_questions_path, 'r') as qs_file:
        questions = json.loads(qs_file.read())
        texts = [str(_['question']) for _ in questions['questions']]

    embeddings_index = {}
    with open(glove_path, 'r') as glove_file:
        for line in glove_file:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((num_words, embedding_dim))
    word_index = metadata['ix_to_word']

    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    with h5py.File(embedding_matrix_filename, 'w') as f:
        f.create_dataset('embedding_matrix', data=embedding_matrix)

    return embedding_matrix
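As a quick illustration of what `right_align` above does (the token values are made up for this example):

```python
import numpy as np
from dataloader import right_align

seq = np.array([[7, 3, 9, 0, 0, 0]])   # one tokenized question, padded on the right
lengths = np.array([3])                 # its true length
print(right_align(seq, lengths))        # [[0. 0. 0. 7. 3. 9.]] -- padding moved to the left
```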
main.py (+98 lines)
@@ -0,0 +1,98 @@
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint
import os
import argparse
from model import VQA
from dataloader import read_data, get_val_data, get_metadata, prepare_embeddings
from utils import get_data

############# Parsing Arguments ##################

parser = argparse.ArgumentParser()

parser.add_argument('--type', type=str, default='train', help = "Whether you want to train or validate, default train")
parser.add_argument('--base_path', default = ".", help = "Relative path to location where to download data, default '.'")
parser.add_argument('--epochs', type=int, default=10, help = "Number of epochs, default 10")
parser.add_argument('--batch_size', type=int, default=256, help = "Batch Size, default 256")
parser.add_argument('--data_limit', type=int, default=215359, help="Number of data points to feed for training, default 215359 (size of dataset)")
parser.add_argument('--weights_load', default=False, help="Boolean to say whether to load pretrained model or train new model, default False")
parser.add_argument('--weight_path', help = "Relative path to location of saved weights, default '.'")

args = parser.parse_args()

##### Setting global variables and file paths #######
seq_length = 26
embedding_dim = 300

glove_path = args.base_path + "/data/glove/glove.6B.300d.txt"
train_questions_path = args.base_path + "/data/train_ques/MultipleChoice_mscoco_train2014_questions.json"
val_annotations_path = args.base_path + "/data/val_annotations/mscoco_val2014_annotations.json"
ckpt_model_weights_filename = args.base_path + "/data/ckpts/model_weights.h5"
data_img = args.base_path + "/data/image_data/data_img.h5"
data_prepro = args.base_path + "/data/image_data/data_prepro.h5"
data_prepro_meta = args.base_path + "/data/image_data/data_prepro.json"
embedding_matrix_filename = args.base_path + "/data/ckpts/embeddings_%s.h5"%embedding_dim
save_dest = args.base_path + "/data/model/saved_model.h5"


#####################################################
def get_model(dropout_rate, model_weights_filename, weights_load):

    print("Creating Model...")
    metadata = get_metadata(data_prepro_meta)
    num_classes = len(metadata['ix_to_ans'].keys())
    num_words = len(metadata['ix_to_word'].keys())

    embedding_matrix = prepare_embeddings(num_words, embedding_dim, metadata, glove_path, train_questions_path, embedding_matrix_filename)
    model = VQA(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes)
    if (weights_load and os.path.exists(model_weights_filename)):
        print("Loading Weights...")
        model.load_weights(model_weights_filename)

    return model

##################################################

def train(args):
    dropout_rate = 0.5
    train_X, train_y = read_data(data_img, data_prepro, args.data_limit)
    model = get_model(dropout_rate, args.weight_path, args.weights_load)
    checkpointer = ModelCheckpoint(filepath=ckpt_model_weights_filename,verbose=1)
    model.fit(train_X, train_y, epochs=args.epochs, batch_size=args.batch_size, callbacks=[checkpointer], shuffle="batch")
    if not os.path.exists(args.base_path + "/data/model"):
        os.makedirs(args.base_path + "/data/model")

    model.save_weights(save_dest, overwrite=True)

##################################################

def val():
    val_X, val_y, multi_val_y = get_val_data(val_annotations_path, data_img, data_prepro, data_prepro_meta)
    model = get_model(0.0, args.weight_path, args.weights_load)
    print("Evaluating Accuracy on validation set:")
    metric_vals = model.evaluate(val_X, val_y)
    print("")
    for metric_name, metric_val in zip(model.metrics_names, metric_vals):
        print(metric_name, " is ", metric_val)

    # Comparing prediction against multiple choice answers
    true_positive = 0
    preds = model.predict(val_X)
    pred_classes = [np.argmax(_) for _ in preds]
    for i, _ in enumerate(pred_classes):
        if _ in multi_val_y[i]:
            true_positive += 1
    print("true positive rate: ", np.float(true_positive)/len(pred_classes))

##################################################

if __name__ == "__main__":

    get_data(args.base_path)
    if args.type == 'train':
        train(args)
    elif args.type == 'val':
        val()
model.py (+54 lines)
@@ -0,0 +1,54 @@
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, LSTM, Flatten, Embedding, Concatenate, Input
import h5py

#############################################

def Word2Vec(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate):

    # Text model

    w2v_input = Input((seq_length,))
    w2v_embed = Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=seq_length,
                          weights=[embedding_matrix],trainable=False)(w2v_input)
    w2v_lstm1 = LSTM(512, input_shape=(seq_length, embedding_dim),return_sequences=True)(w2v_embed)
    w2v_drop1 = Dropout(dropout_rate)(w2v_lstm1)
    w2v_lstm2 = LSTM(512, return_sequences=False)(w2v_drop1)
    w2v_drop2 = Dropout(dropout_rate)(w2v_lstm2)
    w2v_dense = Dense(1024, activation='tanh')(w2v_drop2)

    model = Model(w2v_input, w2v_dense)
    return model

#############################################

def FromVGG(dropout_rate):

    # Image model
    vgg_input = Input((4096,))
    vgg_dense = Dense(1024, activation='tanh')(vgg_input)

    model = Model(vgg_input, vgg_dense)
    return model

##############################################

def VQA(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes):

    # VQA model
    vgg_model = FromVGG(dropout_rate)
    lstm_model = Word2Vec(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate)

    concat = Concatenate()([vgg_model.output, lstm_model.output])
    drop1 = Dropout(dropout_rate)(concat)
    dense1 = Dense(1000, activation='tanh')(drop1)
    drop2 = Dropout(dropout_rate)(dense1)
    dense2 = Dense(num_classes, activation='softmax')(drop2)

    model = Model([vgg_model.input, lstm_model.input], dense2)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
