
Commit 69e3c40

Add VQA, TensorFlow implementation (pclubiitk#38)
* Add VQA TensorFlow
* Add comments and readability improvements
* Add results to README
* Update README.md
1 parent bb43442 commit 69e3c40

File tree

7 files changed: +419 -0 lines changed

README.md (+100 lines)
@@ -0,0 +1,100 @@
# TensorFlow Implementation of VQA

## Usage

```bash
$ python3 main.py
```

> **_NOTE:_** on a Colab notebook, use the following commands:

```python
!git clone link-to-repo
%run main.py
```

## Help Log

```
usage: main.py [-h] [--type TYPE] [--base_path BASE_PATH] [--epochs EPOCHS]
               [--batch_size BATCH_SIZE] [--data_limit DATA_LIMIT]
               [--weights_load WEIGHTS_LOAD] [--weight_path WEIGHT_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --type TYPE           Whether you want to train or validate, default train
  --base_path BASE_PATH
                        Relative path to location where to download data,
                        default '.'
  --epochs EPOCHS       Number of epochs, default 10
  --batch_size BATCH_SIZE
                        Batch Size, default 256
  --data_limit DATA_LIMIT
                        Number of data points to feed for training, default
                        215359 (size of dataset)
  --weights_load WEIGHTS_LOAD
                        Boolean to say whether to load pretrained model or
                        train new model, default False
  --weight_path WEIGHT_PATH
                        Relative path to location of saved weights, default
                        '.'
```
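For example, one possible invocation to evaluate a previously trained model is shown below (the flag values are illustrative; the weight path is where `main.py` saves weights by default when `--base_path` is `'.'`):

```bash
$ python3 main.py --type val --weights_load True --weight_path ./data/model/saved_model.h5
```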

## Contributed by:
* [Atharv Singh Patlan](https://github.com/AthaSSiN)

## References

* **Title**: VQA: Visual Question Answering
* **Authors**: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
* **Link**: https://arxiv.org/pdf/1505.00468
* **Tags**: Neural Network, Computer Vision, Natural Language Processing
* **Year**: 2015

# Summary

## Introduction

Visual Question Answering (VQA) is the task of answering questions about a given piece of visual content such as an image, video or infographic. Answering such questions requires a variety of skills, including recognizing entities and objects, reasoning about their interactions with each other (both spatially and temporally), reading text, parsing audio, interpreting abstract and graphical illustrations, and using external knowledge not directly present in the given content.

It is a combination of Natural Language Processing and Computer Vision, enabling our model to interpret the questions posed by the user and search for the answer in the input picture.

![Egs](https://miro.medium.com/max/552/1*jLshTllNrGvpXjJWkSDFSQ.png)

While seemingly an easy task for humans, VQA poses several challenges to AI systems, spanning the fields of natural language processing, computer vision, audio processing, knowledge representation and reasoning. Over the past few years, the advent of deep learning, the availability of large datasets to train VQA models, and the hosting of a number of benchmarking contests have contributed to a surge of interest in VQA among researchers in these disciplines.

One of the early and popular datasets for this task was the VQA-v1 dataset. It is a very large dataset consisting of two types of images, natural images (referred to as real images) and synthetic images (referred to as abstract scenes), and it comes in two answering modalities: multiple-choice question answering (selecting the right answer from a set of choices) and open-ended question answering (generating an answer with an open-ended vocabulary).

The real-images portion of the VQA-v1 dataset consists of over 200,000 natural images sourced from the MS-COCO dataset, a large-scale dataset of images used to benchmark tasks such as object detection, segmentation and image captioning. Each image is paired with 3 questions written by crowdsourced annotators.

The dataset contains a variety of question types, such as: What color, What kind, Why, How many, Is the, etc. To account for potential disagreements between humans on some questions, as well as for crowdsourcing noise, each question is accompanied by 10 answers.
Given an image and a question, the goal of a VQA system is to produce an answer that matches those provided by human annotators. For the open-ended answering modality, the evaluation metric used is:

__accuracy__ = min(#annotator answers matching the generated answer / 3, 1)

The intuition behind this metric is as follows:
If a system-generated answer matches one produced by at least 3 unique annotators, it gets the maximum score of 1, on account of producing a popular answer. If it generates an answer that isn't present among the 10 candidates, it gets a score of 0, and it is assigned a fractional score if it produces an answer that is deemed rare. If the denominator 3 were lowered, wrong and noisy answers in the dataset (often present due to annotation noise) would receive high credit. Conversely, if it were raised towards 10, a system producing the right answer might only receive partial credit when the answer choices consist of synonyms or happen to contain a few noisy answers.
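In code, the metric for a single question can be sketched as follows (the function and example values are illustrative, not part of this repository):

```python
def vqa_accuracy(predicted_answer, annotator_answers):
    # min(#annotators who gave this exact answer / 3, 1)
    matches = sum(ans == predicted_answer for ans in annotator_answers)
    return min(matches / 3.0, 1.0)

# 2 of the 10 annotators agree with the prediction -> partial credit of 2/3
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))  # 0.666...
```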

## Implementation

Our implementation, which is the standard and best-performing configuration described by the authors of the VQA paper, uses image embeddings from a pretrained VGG net and encodes the question by passing the GloVe word embeddings of the input words through 2 layers of LSTM. In contrast to averaging the word embeddings, using an LSTM preserves information regarding the order of the words in the question and leads to improved VQA accuracy.

The other possible variants of the VQA model are shown below:

![Model](https://miro.medium.com/max/1104/1*OULUt5c9t_MvGMWLmnPGTA.png)

We use the following default configuration (see the sketch after this list):

- Pretrained VGG activations of the VQA input images
- 2 layers of LSTM on top of pretrained GloVe word embeddings of the input questions
- Concatenation of the outputs of the NLP and vision branches, fed into a single dense layer
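Putting these pieces together with the modules added in this commit, the default model can be constructed roughly as follows (a sketch; the file paths assume the default `--base_path '.'` layout used in `main.py`):

```python
from model import VQA
from dataloader import get_metadata, prepare_embeddings

metadata = get_metadata("./data/image_data/data_prepro.json")
num_words = len(metadata['ix_to_word'])

# Builds (or loads a cached) GloVe embedding matrix for the question vocabulary
embedding_matrix = prepare_embeddings(
    num_words=num_words, embedding_dim=300, metadata=metadata,
    glove_path="./data/glove/glove.6B.300d.txt",
    train_questions_path="./data/train_ques/MultipleChoice_mscoco_train2014_questions.json",
    embedding_matrix_filename="./data/ckpts/embeddings_300.h5")

# VGG branch + 2-layer LSTM branch, concatenated and classified over the answer vocabulary
model = VQA(embedding_matrix, num_words, embedding_dim=300, seq_length=26,
            dropout_rate=0.5, num_classes=len(metadata['ix_to_ans']))
model.summary()
```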

## Results

Here are the results for the model specified above.

This figure shows P(model is correct | answer) for the 50 most frequent ground-truth answers on the VQA validation set (the plot is sorted by accuracy, not frequency).

![sysans](./assets/sys_give_ans.png)

This figure shows P(answer | model is correct) for the 50 most frequently predicted answers on the VQA validation set (the plot is sorted by prediction frequency, not accuracy).

![anssys](./assets/ans_giv_sys.png)

# Sources

- [Vanilla VQA](https://medium.com/ai2-blog/vanilla-vqa-adcaaaa94336)
dataloader.py (+129 lines)
@@ -0,0 +1,129 @@
import tensorflow as tf
import numpy as np
from tensorflow import keras
from keras.utils.np_utils import to_categorical
import json
import h5py
import os

############################################

def right_align(seq,lengths):
    # Align the input questions to the right side (pad on left with zeros)
    v = np.zeros(np.shape(seq))
    N = np.shape(seq)[1]
    for i in range(np.shape(seq)[0]):
        v[i][N-lengths[i]:N]=seq[i][0:lengths[i]]
    return v

#############################################

def read_data(data_img, data_prepro, data_limit):
    print("Reading Data...")
    img_data = h5py.File(data_img, 'r')
    ques_data = h5py.File(data_prepro, 'r')

    # Reading up to data_limit images
    img_data = np.array(img_data['images_train'])
    img_pos_train = ques_data['img_pos_train'][:data_limit]
    train_img_data = np.array([img_data[_-1,:] for _ in img_pos_train])

    # Normalizing images
    tem = np.sqrt(np.sum(np.multiply(train_img_data, train_img_data), axis=1))
    train_img_data = np.divide(train_img_data, np.transpose(np.tile(tem,(4096,1))))

    # Shifting padding to the left side
    ques_train = np.array(ques_data['ques_train'])[:data_limit, :]
    ques_length_train = np.array(ques_data['ques_length_train'])[:data_limit]
    ques_train = right_align(ques_train, ques_length_train)

    train_X = [train_img_data, ques_train]

    # All validation answers which are not in the training set have been labelled as 1
    train_y = to_categorical(ques_data['answers'])[:data_limit, :]

    return train_X, train_y

########################################

def get_val_data(val_annotations_path, data_img, data_prepro, data_prepro_meta):
    img_data = h5py.File(data_img, 'r')
    ques_data = h5py.File(data_prepro, 'r')
    metadata = get_metadata(data_prepro_meta)
    with open(val_annotations_path, 'r') as an_file:
        annotations = json.loads(an_file.read())

    img_data = np.array(img_data['images_test'])
    img_pos_train = ques_data['img_pos_test']
    train_img_data = np.array([img_data[_-1,:] for _ in img_pos_train])
    tem = np.sqrt(np.sum(np.multiply(train_img_data, train_img_data), axis=1))
    train_img_data = np.divide(train_img_data, np.transpose(np.tile(tem,(4096,1))))

    ques_train = np.array(ques_data['ques_test'])
    ques_length_train = np.array(ques_data['ques_length_test'])
    ques_train = right_align(ques_train, ques_length_train)

    # Convert all last indices to 0, as embeddings were made that way
    for _ in ques_train:
        if 12602 in _:
            _[_==12602] = 0

    val_X = [train_img_data, ques_train]

    ans_to_ix = {str(ans):int(i) for i,ans in metadata['ix_to_ans'].items()}
    ques_annotations = {}
    for _ in annotations['annotations']:
        idx = ans_to_ix.get(_['multiple_choice_answer'].lower())
        _['multiple_choice_answer_idx'] = 1 if idx in [None, 1000] else idx
        ques_annotations[_['question_id']] = _

    abs_val_y = [ques_annotations[ques_id]['multiple_choice_answer_idx'] for ques_id in ques_data['question_id_test']]
    abs_val_y = to_categorical(np.array(abs_val_y))

    multi_val_y = [list(set([ans_to_ix.get(_['answer'].lower()) for _ in ques_annotations[ques_id]['answers']])) for ques_id in ques_data['question_id_test']]
    for i,_ in enumerate(multi_val_y):
        multi_val_y[i] = [1 if ans in [None, 1000] else ans for ans in _]

    return val_X, abs_val_y, multi_val_y

###############################################

def get_metadata(data_prepro_meta):
    meta_data = json.load(open(data_prepro_meta, 'r'))
    meta_data['ix_to_word'] = {str(word):int(i) for i,word in meta_data['ix_to_word'].items()}
    return meta_data

###############################################

def prepare_embeddings(num_words, embedding_dim, metadata, glove_path, train_questions_path, embedding_matrix_filename):
    if os.path.exists(embedding_matrix_filename):
        with h5py.File(embedding_matrix_filename) as f:
            return np.array(f['embedding_matrix'])

    print("Embedding Data...")
    with open(train_questions_path, 'r') as qs_file:
        questions = json.loads(qs_file.read())
        texts = [str(_['question']) for _ in questions['questions']]

    embeddings_index = {}
    with open(glove_path, 'r') as glove_file:
        for line in glove_file:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((num_words, embedding_dim))
    word_index = metadata['ix_to_word']

    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    with h5py.File(embedding_matrix_filename, 'w') as f:
        f.create_dataset('embedding_matrix', data=embedding_matrix)

    return embedding_matrix
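As a quick illustration of what `right_align` above does (the token values are made up for this example):

```python
import numpy as np
from dataloader import right_align

seq = np.array([[7, 3, 9, 0, 0, 0]])   # one tokenized question, padded on the right
lengths = np.array([3])                 # its true length
print(right_align(seq, lengths))        # [[0. 0. 0. 7. 3. 9.]] -- padding moved to the left
```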
main.py (+98 lines)
@@ -0,0 +1,98 @@
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import model_from_json
from keras.callbacks import ModelCheckpoint
import os
import argparse
from model import VQA
from dataloader import read_data, get_val_data, get_metadata, prepare_embeddings
from utils import get_data

############# Parsing Arguments ##################

parser = argparse.ArgumentParser()

parser.add_argument('--type', type=str, default='train', help = "Whether you want to train or validate, default train")
parser.add_argument('--base_path', default = ".", help = "Relative path to location where to download data, default '.'")
parser.add_argument('--epochs', type=int, default=10, help = "Number of epochs, default 10")
parser.add_argument('--batch_size', type=int, default=256, help = "Batch Size, default 256")
parser.add_argument('--data_limit', type=int, default=215359, help="Number of data points to feed for training, default 215359 (size of dataset)")
parser.add_argument('--weights_load', default=False, help="Boolean to say whether to load pretrained model or train new model, default False")
parser.add_argument('--weight_path', help = "Relative path to location of saved weights, default '.'")

args = parser.parse_args()

##### Setting global variables and file paths #######
seq_length = 26
embedding_dim = 300

glove_path = args.base_path + "/data/glove/glove.6B.300d.txt"
train_questions_path = args.base_path + "/data/train_ques/MultipleChoice_mscoco_train2014_questions.json"
val_annotations_path = args.base_path + "/data/val_annotations/mscoco_val2014_annotations.json"
ckpt_model_weights_filename = args.base_path + "/data/ckpts/model_weights.h5"
data_img = args.base_path + "/data/image_data/data_img.h5"
data_prepro = args.base_path + "/data/image_data/data_prepro.h5"
data_prepro_meta = args.base_path + "/data/image_data/data_prepro.json"
embedding_matrix_filename = args.base_path + "/data/ckpts/embeddings_%s.h5"%embedding_dim
save_dest = args.base_path + "/data/model/saved_model.h5"


#####################################################
def get_model(dropout_rate, model_weights_filename, weights_load):

    print("Creating Model...")
    metadata = get_metadata(data_prepro_meta)
    num_classes = len(metadata['ix_to_ans'].keys())
    num_words = len(metadata['ix_to_word'].keys())

    embedding_matrix = prepare_embeddings(num_words, embedding_dim, metadata, glove_path, train_questions_path, embedding_matrix_filename)
    model = VQA(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes)
    if (weights_load and os.path.exists(model_weights_filename)):
        print("Loading Weights...")
        model.load_weights(model_weights_filename)

    return model

##################################################

def train(args):
    dropout_rate = 0.5
    train_X, train_y = read_data(data_img, data_prepro, args.data_limit)
    model = get_model(dropout_rate, args.weight_path, args.weights_load)
    checkpointer = ModelCheckpoint(filepath=ckpt_model_weights_filename,verbose=1)
    model.fit(train_X, train_y, epochs=args.epochs, batch_size=args.batch_size, callbacks=[checkpointer], shuffle="batch")
    if not os.path.exists(args.base_path + "/data/model"):
        os.makedirs(args.base_path + "/data/model")

    model.save_weights(save_dest, overwrite=True)

##################################################

def val():
    val_X, val_y, multi_val_y = get_val_data(val_annotations_path, data_img, data_prepro, data_prepro_meta)
    model = get_model(0.0, args.weight_path, args.weights_load)
    print("Evaluating Accuracy on validation set:")
    metric_vals = model.evaluate(val_X, val_y)
    print("")
    for metric_name, metric_val in zip(model.metrics_names, metric_vals):
        print(metric_name, " is ", metric_val)

    # Comparing prediction against multiple choice answers
    true_positive = 0
    preds = model.predict(val_X)
    pred_classes = [np.argmax(_) for _ in preds]
    for i, _ in enumerate(pred_classes):
        if _ in multi_val_y[i]:
            true_positive += 1
    print("true positive rate: ", np.float(true_positive)/len(pred_classes))

##################################################

if __name__ == "__main__":

    get_data(args.base_path)
    if args.type == 'train':
        train(args)
    elif args.type == 'val':
        val()
model.py (+54 lines)
@@ -0,0 +1,54 @@
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, LSTM, Flatten, Embedding, Concatenate, Input
import h5py

#############################################

def Word2Vec(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate):

    # Text model

    w2v_input = Input((seq_length,))
    w2v_embed = Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=seq_length,
                          weights=[embedding_matrix],trainable=False)(w2v_input)
    w2v_lstm1 = LSTM(512, input_shape=(seq_length, embedding_dim),return_sequences=True)(w2v_embed)
    w2v_drop1 = Dropout(dropout_rate)(w2v_lstm1)
    w2v_lstm2 = LSTM(512, return_sequences=False)(w2v_drop1)
    w2v_drop2 = Dropout(dropout_rate)(w2v_lstm2)
    w2v_dense = Dense(1024, activation='tanh')(w2v_drop2)

    model = Model(w2v_input, w2v_dense)
    return model

#############################################

def FromVGG(dropout_rate):

    # Image model
    vgg_input = Input((4096,))
    vgg_dense = Dense(1024, activation='tanh')(vgg_input)

    model = Model(vgg_input, vgg_dense)
    return model

##############################################

def VQA(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate, num_classes):

    # VQA model
    vgg_model = FromVGG(dropout_rate)
    lstm_model = Word2Vec(embedding_matrix, num_words, embedding_dim, seq_length, dropout_rate)

    concat = Concatenate()([vgg_model.output, lstm_model.output])
    drop1 = Dropout(dropout_rate)(concat)
    dense1 = Dense(1000, activation='tanh')(drop1)
    drop2 = Dropout(dropout_rate)(dense1)
    dense2 = Dense(num_classes, activation='softmax')(drop2)

    model = Model([vgg_model.input, lstm_model.input], dense2)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
