Commit cf306f8

Add BERT TensorFlow implementation (pclubiitk#27)
* Add BERT TensorFlow
* Update README.md
1 parent 1235348 commit cf306f8

File tree

8 files changed: +708 -0 lines changed
NLP/BERT_TensorFlow/README.md

+159

# TensorFlow Implementation of BERT

## Dependencies
[Transformer implementation by HuggingFace](https://huggingface.co/transformers/)
This is used to load the WordPiece tokenizer for the pretraining task and to provide the pre-trained models for the finetuning task.
```bash
$ pip install transformers
```
> **_NOTE:_** In a Colab notebook, use the following command:
```python
!pip install transformers
```
## Usage
### 1. Pretraining (on a user-defined corpus)
```bash
$ python3 pretrain.py --train_corpus path/to/file.txt
```
> **_NOTE:_** In a Colab notebook, use the following commands:
```python
!git clone link-to-repo
%run pretrain.py --train_corpus path/to/file.txt
```
#### Help Log
```
usage: pretrain.py [-h] [--num_layers NUM_LAYERS] [--epochs EPOCHS]
                   [--hidden_size HIDDEN_SIZE] [--num_heads NUM_HEADS]
                   [--max_length MAX_LENGTH] [--batch_size BATCH_SIZE]
                   --train_corpus TRAIN_CORPUS

optional arguments:
  -h, --help            show this help message and exit
  --num_layers NUM_LAYERS
                        Number of Encoder layers, default 12
  --epochs EPOCHS       Number of epochs in pretrain, default 40
  --hidden_size HIDDEN_SIZE
                        Number of neurons in hidden feed forward layer,
                        default 512
  --num_heads NUM_HEADS
                        Number of heads used in multi headed attention layer,
                        default 12
  --max_length MAX_LENGTH
                        Maximum token count of input sentence, default 512
                        (Note: if the number of tokens exceeds max length, an
                        error will be thrown)
  --batch_size BATCH_SIZE
                        Batch size, default 2 (WARN! using batch size > 2 on
                        just one GPU can cause OOM)
  --train_corpus TRAIN_CORPUS
                        Path to training corpus, required argument.
```

#### Datasets
To replicate the no longer publicly available Toronto BookCorpus dataset, follow the instructions in [this GitHub repository](https://github.com/sgraaf/Replicate-Toronto-BookCorpus).

This relatively small [BookCorpus](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) can also be downloaded directly as an alternative to the above dataset.

To prepare a corpus from Wikipedia articles (on which BERT was originally trained), follow [this link](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).

### 2. Finetuning (on the IMDB dataset)
```bash
$ python3 finetune.py
```
> **_NOTE:_** In a Colab notebook, use the following commands:
```python
!git clone link-to-repo
%run finetune.py
```
#### Help Log
```
usage: finetune.py [-h] [--epochs EPOCHS] [--lr LR] [--batch_size BATCH_SIZE]
                   [--max_length MAX_LENGTH] [--train_samples TRAIN_SAMPLES]
                   [--test_samples TEST_SAMPLES]

optional arguments:
  -h, --help            show this help message and exit
  --epochs EPOCHS       Number of epochs in finetuning, default 2
  --lr LR               Learning rate for finetune, default 2e-5
  --batch_size BATCH_SIZE
                        Batch size, default 4 (WARN! using batch size > 32 on
                        just one GPU can cause OOM)
  --max_length MAX_LENGTH
                        Maximum length of input string to BERT, default 128
  --train_samples TRAIN_SAMPLES
                        Number of training samples, default (max): 25000
  --test_samples TEST_SAMPLES
                        Number of test samples, default (max): 25000
```

## Contributed by:
* [Atharv Singh Patlan](https://github.com/AthaSSiN)

## References

* **Title**: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
* **Authors**: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
* **Link**: https://arxiv.org/pdf/1810.04805.pdf
* **Tags**: Neural Network, Natural Language Processing
* **Year**: 2018

# Summary

## Introduction

Vaswani et al. introduced the Transformer in their seminal paper "Attention Is All You Need", which shook the NLP community and marked a sharp evolution in the field (it has been referred to as NLP's ImageNet moment, referencing how, years earlier, a similar development accelerated machine learning for Computer Vision tasks).

This evolution was further accelerated by BERT, which is not a new model architecture but rather a training strategy for Transformer encoders.

BERT is a clever combination of NLP ideas that were emerging in 2018, which, in the right blend, produced very impressive results.

![BERTTL](http://jalammar.github.io/images/bert-transfer-learning.png)

## BERT

Traditional context-free models (like word2vec or GloVe) generate a single embedding for each word in the vocabulary, so the word "right" would have the same representation in "I'm sure I'm right" and "Take a right turn." BERT, in contrast, represents each word based on both its previous and next context, making the representation bidirectional. While the idea of bidirectionality had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations in a deep neural network.

![Embsum](https://miro.medium.com/max/552/1*8416XWqbuR2SDgCY61gFHw.png)
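
This contrast can be made concrete with the HuggingFace `transformers` dependency listed above. The snippet below is an illustrative sketch, not part of this repository, and the indexing of the model output assumes a reasonably recent `transformers` version:
```python
# Illustrative sketch (not part of this repo): the token "right" gets a
# different vector in each sentence because BERT conditions on context.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

enc_a = tokenizer.encode_plus("I'm sure I'm right", return_tensors='tf')
enc_b = tokenizer.encode_plus("Take a right turn", return_tensors='tf')

# first output is the sequence of hidden states, shape (1, seq_len, 768)
hidden_a = model(enc_a['input_ids'])[0][0]
hidden_b = model(enc_b['input_ids'])[0][0]

tokens_a = tokenizer.convert_ids_to_tokens(enc_a['input_ids'][0].numpy().tolist())
tokens_b = tokenizer.convert_ids_to_tokens(enc_b['input_ids'][0].numpy().tolist())

vec_a = hidden_a[tokens_a.index('right')]
vec_b = hidden_b[tokens_b.index('right')]

# cosine similarity is well below 1.0; a context-free embedding would give exactly 1.0
print(float(-tf.keras.losses.cosine_similarity(vec_a, vec_b)))
```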

Models like the original Transformer decoder and OpenAI GPT did not use bidirectionality because they were trained on next-word prediction. If such a network could also attend to the words that follow, it would quickly learn that the next word in the input is exactly the target output, and the task would become trivial.

BERT instead uses a Masked Language Model (MLM): some of the words in the input are masked out, and the model predicts each masked word conditioned on its full bidirectional context. Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected for prediction; of these, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and the remaining 10% are kept unchanged. The model then attempts to recover the original value of the selected tokens from the context provided by the other, non-masked tokens. Because the targets are hidden from the input, bidirectional conditioning no longer makes the task trivial.

![MLM](https://miro.medium.com/max/552/1*icb8KIyD7MGKVKf39-TO1A.png)
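
The masking recipe itself is only a few lines. The sketch below is illustrative rather than this repository's `pretrain_preprocess.py` (that file is not reproduced on this page); it uses a label of 0 to mark positions that are not prediction targets, the same convention `pretrain.py` below relies on:
```python
# Illustrative sketch (not this repo's pretrain_preprocess.py) of the BERT
# masking recipe: select 15% of the tokens; of those, 80% become [MASK],
# 10% become a random token, and 10% stay unchanged.
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset()):
    inputs = list(token_ids)
    labels = [0] * len(inputs)              # 0 = not a prediction target
    for i, tok in enumerate(inputs):
        if tok in special_ids:              # never mask [CLS], [SEP], [PAD]
            continue
        if random.random() < 0.15:
            labels[i] = tok                 # remember the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id         # 80%: replace with [MASK]
            elif r < 0.9:
                # 10%: random token (a fuller implementation would skip special ids here)
                inputs[i] = random.randrange(vocab_size)
            # else: 10% keep the token unchanged
    return inputs, labels
```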

The second technique is Next Sentence Prediction (NSP), through which BERT learns to model relationships between sentences. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the one that follows the first in the original document. Given two sentences A and B, is B the sentence that actually comes after A in the corpus, or just a random sentence? For example:
![NSP](https://miro.medium.com/max/552/1*K1em9OWRbZsA8f3IisUCig.png)
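
Building such a pair is equally simple. The following is an illustrative sketch (again, the repository's own pair construction lives in `pretrain_preprocess.py`, which is not shown here), with label 1 for the true next sentence and 0 for a random one:
```python
# Illustrative sketch of building one NSP example from a list of corpus sentences.
import random

def make_nsp_pair(sentences, idx):
    sent_a = sentences[idx]
    # 50% of the time use the true next sentence (IsNext, label 1),
    # otherwise pick a random sentence from the corpus (NotNext, label 0).
    # (A fuller implementation would avoid drawing the true next sentence here.)
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return sent_a, sentences[idx + 1], 1
    return sent_a, random.choice(sentences), 0
```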

When training the BERT model, both techniques are trained together, minimizing the combined loss function of the two strategies.
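
A vectorized sketch of that combined objective is shown below; the commit's actual implementation is the loop-based `loss_function` in `pretrain.py` further down this page, so treat this as an equivalent formulation under assumed tensor shapes rather than the repository's code:
```python
# Illustrative sketch of the joint pretraining loss: NSP (binary CE) + MLM
# (sparse categorical CE averaged over masked positions only).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
sce = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def joint_loss(nsp_logits, is_next, mlm_logits, mlm_labels, mlm_mask):
    # assumed shapes: nsp_logits (batch, 1), is_next (batch, 1),
    # mlm_logits (batch, seq, vocab), mlm_labels (batch, seq),
    # mlm_mask (batch, seq) with 1.0 at masked positions and 0.0 elsewhere
    nsp_loss = bce(is_next, nsp_logits)
    per_token = sce(mlm_labels, mlm_logits)                 # (batch, seq)
    mlm_loss = tf.reduce_sum(per_token * mlm_mask) / tf.maximum(
        tf.reduce_sum(mlm_mask), 1.0)
    return nsp_loss + mlm_loss
```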

## Implementation

The BERT architecture builds on top of the Transformer. Two variants are available:
- BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
- BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters

![Model](https://miro.medium.com/max/552/1*IOskqRtq3UOjvchtFxe-AA.png)

By default, our code implements the BERT Base configuration for both the pretraining and finetuning tasks.

We use the following default configuration:
- Binary cross-entropy loss for the next sentence prediction task
- Sparse categorical cross-entropy loss for the masked word prediction task
- Learning rate scheduling, in which the learning rate increases linearly over the first 10000 minibatches to let the model warm up and then decreases in inverse proportion to the current iteration value (see the sketch after this list)
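
The `CustomSchedule` used for this lives in `utils.py`, which is not included on this page. The sketch below follows the warmup schedule from the TensorFlow Transformer tutorial that the code is based on (see Sources), with the warmup length set to 10000 as described above; the repository's exact decay formula may differ:
```python
# Illustrative warmup schedule in the style of the TensorFlow Transformer
# tutorial. The repo's actual CustomSchedule (utils.py) is not shown here,
# so the inverse-square-root decay below is an assumption.
import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, hidden_size, warmup_steps=10000):
        super().__init__()
        self.d_model = tf.cast(hidden_size, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # linear warmup for the first `warmup_steps` minibatches, then decay
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(WarmupSchedule(512),
                                     beta_1=0.9, beta_2=0.999, epsilon=1e-9)
```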

# Results

> **_NOTE:_** BERT is a very large model, so training for too many epochs on a small dataset like IMDB or CoLA causes overfitting; it is best to finetune BERT on such datasets for 2-4 epochs only.

The results after training for 2 epochs on the IMDB dataset were:
1. Plot of losses:
![loss](https://github.com/AthaSSiN/model-zoo/blob/master/NLP/BERT_TensorFlow/assets/loss.png)

2. Plot of accuracy:
![acc](https://github.com/AthaSSiN/model-zoo/blob/master/NLP/BERT_TensorFlow/assets/acc.png)

# Sources

- [Understanding BERT: Is it a Game Changer in NLP?](https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad)
- Template on which the code was built: [Transformer on TensorFlow tutorials](https://www.tensorflow.org/tutorials/text/transformer)

NLP/BERT_TensorFlow/assets/acc.png

12.5 KB

NLP/BERT_TensorFlow/assets/loss.png

12.2 KB

NLP/BERT_TensorFlow/finetune.py

+50
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import TFBertForSequenceClassification
from transformers import BertTokenizer
from utils import encode_examples
import argparse
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()

parser.add_argument('--epochs', type = int, default = 2, help = "Number of epochs in finetuning, default 2")
parser.add_argument('--lr', type = float, default = 2e-5, help = "Learning rate for finetune, default 2e-5")
parser.add_argument('--batch_size', type = int, default = 4, help = "Batch size, default 4 (WARN! using batch size > 32 on just one GPU can cause OOM)")
parser.add_argument('--max_length', type = int, default = 128, help = "Maximum length of input string to BERT, default 128")
parser.add_argument('--train_samples', type = int, default = 25000, help = "Number of training samples, default (max): 25000")
parser.add_argument('--test_samples', type = int, default = 25000, help = "Number of test samples, default (max): 25000")

args = parser.parse_args()

######## Define Tokenizer ################
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

######## Get IMDB dataset from tfds ######
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
                                         split = (tfds.Split.TRAIN, tfds.Split.TEST),
                                         as_supervised=True,
                                         with_info=True)
print('info', ds_info)

######## Encode dataset in BERT input format ####
# train dataset
ds_train_encoded = encode_examples(ds_train, tokenizer, args.max_length, args.train_samples).shuffle(25000).batch(args.batch_size)
# test dataset
ds_test_encoded = encode_examples(ds_test, tokenizer, args.max_length, args.test_samples).shuffle(25000).batch(args.batch_size)

######### Define BERT model ###############
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

######### Define optimizer ################
optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr, epsilon=1e-08)

######### Define loss function and metrics #########
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

######### Compile model ###################
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

######### Train and evaluate ##############
history = model.fit(ds_train_encoded, epochs=args.epochs, validation_data=ds_test_encoded)
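
`encode_examples` is imported from `utils.py`, which is not part of the diff shown on this page. A hypothetical sketch of such a helper (names, arguments, and the `padding='max_length'` call, which assumes `transformers` >= 3.0, are assumptions) could look like:
```python
# Hypothetical sketch of the encode_examples helper imported above; the real
# helper lives in utils.py, which is not shown in this commit page.
import tensorflow as tf

def encode_examples_sketch(dataset, tokenizer, max_length, limit):
    input_ids, attention_masks, token_type_ids, labels = [], [], [], []
    for text, label in dataset.take(limit):
        enc = tokenizer.encode_plus(text.numpy().decode('utf-8'),
                                    max_length=max_length,
                                    padding='max_length',
                                    truncation=True)
        input_ids.append(enc['input_ids'])
        attention_masks.append(enc['attention_mask'])
        token_type_ids.append(enc['token_type_ids'])
        labels.append([label.numpy()])
    features = {'input_ids': input_ids,
                'attention_mask': attention_masks,
                'token_type_ids': token_type_ids}
    # TFBertForSequenceClassification accepts this dict of inputs in model.fit
    return tf.data.Dataset.from_tensor_slices((features, labels))
```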

NLP/BERT_TensorFlow/pretrain.py

+134
import tensorflow as tf
from pretrain_preprocess import preprocess
from pretrain_model import BertModel
from transformers import BertTokenizer
import time
import argparse
from utils import CustomSchedule

parser = argparse.ArgumentParser()

parser.add_argument('--num_layers', type = int, default = 12, help = "Number of Encoder layers, default 12")
parser.add_argument('--epochs', type = int, default = 40, help = "Number of epochs in pretrain, default 40")
parser.add_argument('--hidden_size', type = int, default = 512, help = "Number of neurons in hidden feed forward layer, default 512")
parser.add_argument('--num_heads', type = int, default = 12, help = "Number of heads used in multi headed attention layer, default 12")
parser.add_argument('--max_length', type = int, default = 512, help = "Maximum token count of input sentence, default 512 (Note: if the number of tokens exceeds max length, an error will be thrown)")
parser.add_argument('--batch_size', type = int, default = 2, help = "Batch size, default 2 (WARN! using batch size > 2 on just one GPU can cause OOM)")
parser.add_argument('--train_corpus', type = str, required = True, help = "Path to training corpus, required argument.")

############ PARSING ARGUMENTS ###########
args = parser.parse_args()

num_layers = args.num_layers
hidden_size = args.hidden_size
dff = 4 * hidden_size  # feed-forward layer is 4x the hidden size, as in the paper
num_heads = args.num_heads
max_length = args.max_length

BATCH_SIZE = args.batch_size
EPOCHS = args.epochs

########### Define Tokenizer ############
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

input_vocab_size = len(tokenizer.vocab)

########### Define Learning rate and optimizer ########
learning_rate = CustomSchedule(hidden_size)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-9)  # values as in the paper

########### Define Loss function #######
bce_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)  # for NSP
sce_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # for MLM

def loss_function(nsp, mlm, is_next, seg_input, masked):
    nsp_result = bce_loss(is_next, nsp)

    # Accumulate the MLM loss over masked positions only (a label of 0 marks
    # positions that are not prediction targets); stop once the segment id
    # decreases, which the code treats as the start of padding.
    mlm_result = 0
    for i in range(len(masked)):
        seg_val = 0
        for j in range(len(masked[i])):
            if seg_input[i][j] < seg_val:
                break
            seg_val = seg_input[i][j]
            if masked[i][j] != 0:
                mlm_result += sce_loss(masked[i][j], mlm[i, j])

    return nsp_result + mlm_result

########### Define Metrics ##############
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')
nsp_accuracy = tf.keras.metrics.BinaryAccuracy(name='nsp_accuracy', threshold=0.0)

########## Define Model ################
BertPretrain = BertModel(num_layers, hidden_size, num_heads,
                         dff, input_vocab_size,
                         max_length)

########## Define Checkpoints ##########
checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(model=BertPretrain,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

########## Define Training step #########
def train_step(model, index_input, seg_input, mask_input, is_next, is_masked):

    with tf.GradientTape() as tape:
        nsp, mlm = model(index_input, training=False, seg=seg_input, mask=mask_input)
        loss = loss_function(nsp, mlm, is_next, seg_input, masked=is_masked)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)

    # Update the MLM accuracy over masked positions only, using the same
    # stopping rule as loss_function, then update the NSP accuracy.
    for i in range(len(is_masked)):
        seg_val = 0
        for j in range(len(is_masked[i])):
            if seg_input[i][j] < seg_val:
                break
            seg_val = seg_input[i][j]
            if is_masked[i][j] != 0:
                train_accuracy(is_masked[i][j], mlm[i, j])

    nsp_accuracy(is_next, nsp)

######## Get preprocessed training dataset #
train_dataset = preprocess(args.train_corpus, BATCH_SIZE, max_length, tokenizer)

##########################################

def main():
    # Train Loop
    for epoch in range(EPOCHS):
        start = time.time()

        train_loss.reset_states()
        train_accuracy.reset_states()
        nsp_accuracy.reset_states()

        for ind, batch in enumerate(train_dataset):
            input_ids, segments, masks, is_next_list, is_masked = batch
            train_step(BertPretrain, input_ids, segments, masks, is_next_list, is_masked)

            print('Epoch {} batch {} BatchLoss {:.4f} MLM Accuracy {:.4f} NSP Accuracy {:.4f}'.format(
                epoch + 1, ind, train_loss.result(), train_accuracy.result(), nsp_accuracy.result()))

        if (epoch + 1) % 5 == 0:
            ckpt_save_path = ckpt_manager.save()
            print('Saving checkpoint for epoch {} at {}'.format(epoch + 1, ckpt_save_path))

        print('Epoch {} Loss {:.4f} MLM Accuracy {:.4f} NSP Accuracy {:.4f}'.format(
            epoch + 1, train_loss.result(), train_accuracy.result(), nsp_accuracy.result()))

        print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

if __name__ == '__main__':
    main()
