--epochs EPOCHS         Number of epochs in pretrain, default 40
--hidden_size HIDDEN_SIZE
                        Number of neurons in hidden feed forward layer,
                        default 512
--num_heads NUM_HEADS
                        Number of heads used in multi headed attention layer,
                        default 12
--max_length MAX_LENGTH
                        Maximum token count of input sentence, default 512
                        (Note: if number of token exceeds max length, an error
                        will be thrown)
--batch_size BATCH_SIZE
                        Batch size, default 2 (WARN! using batch size > 2 on
                        just one GPU can cause OOM)
--train_corpus TRAIN_CORPUS
                        Path to training corpus, required argument.
```
#### Datasets
To replicate the Toronto BookCorpus dataset (no longer publicly available), follow the instructions in [this GitHub repository](https://github.com/sgraaf/Replicate-Toronto-BookCorpus).
This relatively small [BookCorpus](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) can also be downloaded directly as an alternative to the above dataset.
To prepare a corpus from Wikipedia articles (on which BERT was originally trained), follow [this link](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html).
### 2. Finetuning (On IMDB dataset)
```bash
$ python3 finetune.py
```
> **_NOTE:_** On a Colab notebook, use the following command:
* **Title**: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
* **Authors**: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
* **Link**: https://arxiv.org/pdf/1810.04805.pdf
* **Tags**: Neural Network, Natural Language Processing
* **Year**: 2018
# Summary
## Introduction
Vaswani et al. introduced the concept of transformers in their seminal paper "Attention Is All You Need", which shook the NLP community and marked a sharp evolution in the field (it has been referred to as NLP's ImageNet moment, referencing how, years earlier, similar developments accelerated machine learning progress in computer vision).
This evolution was further accelerated by the development of BERT, which is not actually a new model, but a training strategy for transformer encoders.
BERT is a clever combination of NLP ideas that were emerging in 2018 which, in the right blend, produced very impressive results!
Traditional context-free models (like word2vec or GloVe) generate a single embedding for each word in the vocabulary, which means the word "right" would have the same representation in "I'm sure I'm right" and "Take a right turn." BERT, however, represents each word based on both its previous and next context, making it bidirectional. While the idea of bidirectionality had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations.
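This context sensitivity is easy to check empirically. The snippet below is only an illustration and is not part of this repository: it assumes the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint, and shows that the word "right" receives different vectors in the two example sentences above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    # Return the hidden state of the first occurrence of `word` in `sentence`.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embedding_of("i am sure i am right", "right")
v2 = embedding_of("take a right turn", "right")
# Cosine similarity is noticeably below 1.0: the same word gets
# different, context-dependent embeddings.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```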
Models like the original Transformer decoder and OpenAI GPT did not use bidirectionality because their networks were trained on next-word prediction: if bidirectionality were used, the model would quickly learn that the next word in the input is exactly the output it must produce, and the task would become trivial, which is not desired.
BERT instead uses a Masked Language Model (MLM): it masks out some of the words in the input and then conditions bidirectionally on the surrounding words to predict the masked ones. Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected; of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and the remaining 10% are left unchanged. The model then attempts to predict the original value of the masked words based on the context provided by the other, non-masked, words in the sequence. Unlike next-word prediction, this task cannot be made trivial by peeking ahead.
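A minimal sketch of this masking recipe (not the repository's exact implementation; `MASK_ID` and `VOCAB_SIZE` are placeholder values):

```python
import random

MASK_ID = 103        # assumed id of the [MASK] token (placeholder)
VOCAB_SIZE = 30522   # assumed vocabulary size (placeholder)

def mask_tokens(token_ids, mask_prob=0.15):
    """Apply the MLM corruption: select 15% of positions, then use the
    80/10/10 split between [MASK], a random token, and the original token.
    Returns the corrupted sequence and labels (-1 for unselected positions)."""
    inputs, labels = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            roll = random.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```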
The second technique is Next Sentence Prediction (NSP), through which BERT learns to model relationships between sentences. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the one that actually follows the first in the original document. Given two sentences A and B: is B the sentence that comes after A in the corpus, or just a random sentence?
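A rough sketch of how such sentence pairs could be generated (again, an illustration rather than the repository's code; `corpus` is assumed to be a list of documents, each a list of sentences):

```python
import random

def make_nsp_pair(document, index, corpus):
    """Build one NSP example: with probability 0.5 take the true next
    sentence (label 1 = IsNext), otherwise a random sentence (label 0 = NotNext)."""
    sent_a = document[index]
    if random.random() < 0.5 and index + 1 < len(document):
        sent_b, is_next = document[index + 1], 1
    else:
        # Note: a real implementation would make sure the random sentence
        # comes from a *different* document than `document`.
        other = random.choice(corpus)
        sent_b, is_next = random.choice(other), 0
    return sent_a, sent_b, is_next
```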
By default, our code implements the BERT Base model for both the pretraining and finetuning problems.
We use the following default configuration:
- Binary cross-entropy loss to measure the correctness of next sentence prediction
- Categorical cross-entropy loss for masked word prediction
- Learning rate scheduling, such that the learning rate increases linearly for the first 10000 minibatches to let the model warm up and then decays inversely proportional to the current iteration (a sketch follows this list)
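A minimal sketch of this default configuration, assuming PyTorch (the base learning rate and the `model` variable are placeholders, not values taken from the repository):

```python
import torch

# Losses for the two pretraining objectives listed above:
nsp_loss_fn = torch.nn.BCEWithLogitsLoss()                 # binary CE for next sentence prediction
mlm_loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)   # categorical CE for masked word prediction

def lr_scale(step, warmup_steps=10000):
    """Linear warmup over the first `warmup_steps` minibatches,
    then decay inversely proportional to the current iteration."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return warmup_steps / step

# Usage with a hypothetical `model`; the base LR of 1e-4 is an assumption.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# scheduler.step()  # called once per minibatch
```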
# Results
> **_NOTE:_** BERT is a very large model, so training for too many epochs on a small dataset like IMDB or CoLA causes overfitting; it is best to finetune BERT on such datasets for 2-4 epochs only.
The results after training for 2 epochs on the IMDB dataset were:
parser.add_argument('--num_layers', type=int, default=12, help="Number of Encoder layers, default 12")
parser.add_argument('--epochs', type=int, default=40, help="Number of epochs in pretrain, default 40")
parser.add_argument('--hidden_size', type=int, default=512, help="Number of neurons in hidden feed forward layer, default 512")
parser.add_argument('--num_heads', type=int, default=12, help="Number of heads used in multi headed attention layer, default 12")
parser.add_argument('--max_length', type=int, default=512, help="Maximum token count of input sentence, default 512 (Note: if number of token exceeds max length, an error will be thrown)")
parser.add_argument('--batch_size', type=int, default=2, help="Batch size, default 2 (WARN! using batch size > 2 on just one GPU can cause OOM)")
parser.add_argument('--train_corpus', type=str, required=True, help="Path to training corpus, required argument.")