--epochs EPOCHS         Number of epochs in pretrain, default 40
--hidden_size HIDDEN_SIZE
                        Number of neurons in hidden feed forward layer,
                        default 512
--num_heads NUM_HEADS
                        Number of heads used in multi headed attention layer,
                        default 12
--max_length MAX_LENGTH
                        Maximum token count of input sentence, default 512
                        (Note: if number of token exceeds max length, an error
                        will be thrown)
--batch_size BATCH_SIZE
                        Batch size, default 2 (WARN! using batch size > 2 on
                        just one GPU can cause OOM)
--train_corpus TRAIN_CORPUS
                        Path to training corpus, required argument.
```
#### Datasets
To replicate the Toronto BookCorpus dataset (no longer publicly available), follow the instructions in [this GitHub repository](https://github.com/sgraaf/Replicate-Toronto-BookCorpus).
This relatively small [BookCorpus](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) can also be downloaded directly as an alternative to the above dataset.
To prepare a corpus from Wikipedia articles (on which BERT was originally trained), follow [this link](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).
### 2. Finetuning (On IMDB dataset)
```bash
$ python3 finetune.py
```
> **_NOTE:_** On a Colab notebook, use the following command:
* **Title**: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
* **Authors**: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
* **Link**: https://arxiv.org/pdf/1810.04805.pdf
* **Tags**: Neural Network, Natural Language Processing
* **Year**: 2018
# Summary
## Introduction
Vaswani et al. introduced the concept of transformers in their seminal paper "Attention Is All You Need", which shook the NLP community and marked a sharp evolution in the field (it has been referred to as NLP's ImageNet moment, referencing how, years earlier, similar developments accelerated machine learning progress in computer vision).
This evolution was further accelerated by the development of BERT, which is not actually a new model, but a training strategy for transformer encoders.
BERT is a clever combination of NLP ideas that were emerging in 2018 which, in the right blend, produced very impressive results!
Traditional context-free models (like word2vec or GloVe) generate a single embedding for each word in the vocabulary, which means the word "right" would have the same representation in "I'm sure I'm right" and "Take a right turn." BERT, however, represents each word based on both its previous and next context, making it bidirectional. While the idea of bidirectionality had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations.
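This context sensitivity is easy to check empirically. The snippet below is only an illustration and is not part of this repository: it assumes the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint, and shows that the word "right" receives different vectors in the two example sentences above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    # Return the hidden state of the first occurrence of `word` in `sentence`.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embedding_of("i am sure i am right", "right")
v2 = embedding_of("take a right turn", "right")
# Cosine similarity is noticeably below 1.0: the same word gets
# different, context-dependent embeddings.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```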
Models like the original Transformer decoder and OpenAI GPT did not use bidirectionality because their networks were trained on next-word prediction: if bidirectionality were used, the model would quickly learn that the next word in the input is exactly the output it must produce, and the task would become trivial, which is not desired.
BERT instead uses a Masked Language Model (MLM): it masks out some of the words in the input and then conditions bidirectionally on the surrounding words to predict the masked ones. Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected; of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and the remaining 10% are left unchanged. The model then attempts to predict the original value of the masked words based on the context provided by the other, non-masked, words in the sequence. Unlike next-word prediction, this task cannot be made trivial by peeking ahead.
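A minimal sketch of this masking recipe (not the repository's exact implementation; `MASK_ID` and `VOCAB_SIZE` are placeholder values):

```python
import random

MASK_ID = 103        # assumed id of the [MASK] token (placeholder)
VOCAB_SIZE = 30522   # assumed vocabulary size (placeholder)

def mask_tokens(token_ids, mask_prob=0.15):
    """Apply the MLM corruption: select 15% of positions, then use the
    80/10/10 split between [MASK], a random token, and the original token.
    Returns the corrupted sequence and labels (-1 for unselected positions)."""
    inputs, labels = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            roll = random.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```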
The second technique is Next Sentence Prediction (NSP), through which BERT learns to model relationships between sentences. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the one that actually follows the first in the original document. Given two sentences A and B: is B the sentence that comes after A in the corpus, or just a random sentence?
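A rough sketch of how such sentence pairs could be generated (again, an illustration rather than the repository's code; `corpus` is assumed to be a list of documents, each a list of sentences):

```python
import random

def make_nsp_pair(document, index, corpus):
    """Build one NSP example: with probability 0.5 take the true next
    sentence (label 1 = IsNext), otherwise a random sentence (label 0 = NotNext)."""
    sent_a = document[index]
    if random.random() < 0.5 and index + 1 < len(document):
        sent_b, is_next = document[index + 1], 1
    else:
        # Note: a real implementation would make sure the random sentence
        # comes from a *different* document than `document`.
        other = random.choice(corpus)
        sent_b, is_next = random.choice(other), 0
    return sent_a, sent_b, is_next
```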
By default, our code implements the BERT Base model for both the pretraining and finetuning problems.
We use the following default configuration:
- Binary cross-entropy loss to measure the correctness of next sentence prediction
- Categorical cross-entropy loss for masked word prediction
- Learning rate scheduling, such that the learning rate increases linearly for the first 10000 minibatches to let the model warm up and then decays inversely proportional to the current iteration (a sketch follows this list)
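A minimal sketch of this default configuration, assuming PyTorch (the base learning rate and the `model` variable are placeholders, not values taken from the repository):

```python
import torch

# Losses for the two pretraining objectives listed above:
nsp_loss_fn = torch.nn.BCEWithLogitsLoss()                 # binary CE for next sentence prediction
mlm_loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)   # categorical CE for masked word prediction

def lr_scale(step, warmup_steps=10000):
    """Linear warmup over the first `warmup_steps` minibatches,
    then decay inversely proportional to the current iteration."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return warmup_steps / step

# Usage with a hypothetical `model`; the base LR of 1e-4 is an assumption.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# scheduler.step()  # called once per minibatch
```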
# Results
> **_NOTE:_** BERT is a very large model, so training for too many epochs on a small dataset like IMDB or CoLA causes overfitting; it is best to finetune BERT on such datasets for 2-4 epochs only.
The results after training for 2 epochs on the IMDB dataset were:
parser.add_argument('--num_layers', type=int, default=12, help="Number of Encoder layers, default 12")
parser.add_argument('--epochs', type=int, default=40, help="Number of epochs in pretrain, default 40")
parser.add_argument('--hidden_size', type=int, default=512, help="Number of neurons in hidden feed forward layer, default 512")
parser.add_argument('--num_heads', type=int, default=12, help="Number of heads used in multi headed attention layer, default 12")
parser.add_argument('--max_length', type=int, default=512, help="Maximum token count of input sentence, default 512 (Note: if number of token exceeds max length, an error will be thrown)")
parser.add_argument('--batch_size', type=int, default=2, help="Batch size, default 2 (WARN! using batch size > 2 on just one GPU can cause OOM)")
parser.add_argument('--train_corpus', type=str, required=True, help="Path to training corpus, required argument.")