
Commit 08a0b56

Added Word2Vec in Tensorflow (pclubiitk#11)
1 parent e7e9205 commit 08a0b56

File tree

8 files changed, +440 -0 lines changed


NLP/word2vec/.gitignore

Lines changed: 1 addition & 0 deletions
__pycache__

NLP/word2vec/README.md

Lines changed: 110 additions & 0 deletions
# TensorFlow Implementation of Word2Vec (dataset from Kaggle)

## Usage
### To train
```bash
$ python3 main.py --epochs 100 --optimizer "adam" --batch_size 2000 --dim_embedding 100
```
### To get the similarity between two words, word1 and word2 (`getSimilarity`)
```bash
$ python3 main.py --mode "getSimilarity" --word1 "window" --word2 "house"
```
### To get the ten closest words to a given word (`getTenClosestWords`)
```bash
$ python3 main.py --mode "getTenClosestWords" --word "window"
```
### To solve an analogy, word1_ : word2_ :: word3_ : word4 (`analogy`)
```bash
$ python3 main.py --mode "analogy" --word1_ "window" --word2_ "house" --word3_ "door"
```
### To plot the embeddings in 2D
```bash
$ python3 main.py --mode "plot"
```
## References
* [Stanford CS224n](http://web.stanford.edu/class/cs224n/)
* [Stanford Word2Vec Notes](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf)
* [Original Word2Vec paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Google Word2Vec paper which suggested improvements in training using negative sampling and sub-sampling](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

## Contributed by:
* [Aashish Patel](https://github.com/aashishpiitk/)

# Summary

Word2Vec is a model in which a network is trained to represent each word in the text corpus as an embedding, i.e. a vector of real numbers.
These embeddings can be used to perform a variety of tasks, such as:
```
• finding the similarity between two words
• searching for the ten words most similar to a given word
• finding analogies
```
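For a rough feel of how such an embedding matrix is queried, here is a small stand-alone NumPy sketch. The tiny vocabulary, the `word_to_id` mapping, and all the numbers are invented for illustration; note also that it uses the textbook cosine similarity, whereas the repository's own `getSimilarity` helper in `evaluation.py` normalizes slightly differently.
```python
import numpy as np

# Toy embedding matrix: one row per word.
# The words, ids and values below are invented purely for illustration.
word_to_id = {"window": 0, "house": 1, "door": 2}
emb = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.5],
                [0.7, 0.2, 0.4]])

def cosine_similarity(u, v):
    # Textbook cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity between two words.
print(cosine_similarity(emb[word_to_id["window"]], emb[word_to_id["house"]]))

# Rank the whole (tiny) vocabulary by similarity to "window".
ranked = sorted(word_to_id,
                key=lambda w: cosine_similarity(emb[word_to_id[w]], emb[word_to_id["window"]]),
                reverse=True)
print(ranked)
```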

There are two main approaches to training this model and creating the word embeddings:
```
• Skip-gram
• Continuous Bag of Words (CBOW)
```

### Skip-gram
```
Input – a single word
Output – the probability of each word in the corpus appearing in the context of the given input word
```
### Continuous Bag of Words (CBOW)
```
Input – the context words of a word in a sentence/phrase
Output – a single word (in one-hot encoded form, each value coming from a probability distribution)
Loss – categorical cross-entropy
```
In my current implementation I have used the CBOW approach to train the model.

### Preparing data for CBOW
```
1. converting each word in the vocabulary to a one-hot encoded representation
2. forming a list of (context words, target word) pairs, choosing a suitable window size (see the sketch below)
```
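The actual pairing is done in `utils.py` (`constructBagOfWordsInWindowSize` and `contextPairToOneHot`), which is not shown in this commit view. The sketch below is a simplified stand-alone version under assumed choices: a window size of 2 and a context represented as the sum of its one-hot vectors.
```python
import numpy as np

# Simplified CBOW data preparation; the window size, variable names and the
# context-as-sum-of-one-hots choice are assumptions made for this sketch.
tokenized_data = [["room", "was", "clean", "and", "quiet"]]
window_size = 2

vocab = sorted({w for sentence in tokenized_data for w in sentence})
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_id[word]] = 1.0
    return vec

pairs = []  # (context vector, target one-hot) pairs
for sentence in tokenized_data:
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window_size):i] + sentence[i + 1:i + 1 + window_size]
        context_vec = np.sum([one_hot(w) for w in context], axis=0)
        pairs.append((context_vec, one_hot(target)))

print(len(pairs))   # one training pair per target word
print(pairs[0][1])  # one-hot vector of the first target word
```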
## Architecture of CBOW
```
1. This is a three-layer neural network, the last layer being the output layer (a minimal sketch is given below)
2. The weights of the first layer are the actual embeddings, which are used in the further tasks
3. The output has the size of the vocabulary, with the entries being the softmax output
4. Categorical cross-entropy loss is used
```
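The repository builds this network from custom layers in `model.py`. As a point of comparison, here is a minimal sketch of the same shape using stock Keras `Dense` layers; `vocab_size` and `embedding_dim` are placeholder values, not numbers taken from the trained model.
```python
from tensorflow import keras

# Minimal CBOW-style network: one-hot/bag-of-words input -> embedding -> softmax
# over the vocabulary. vocab_size and embedding_dim are placeholders.
vocab_size = 5000
embedding_dim = 100

model = keras.Sequential([
    # The weights of this layer are the word embeddings.
    keras.layers.Dense(embedding_dim, use_bias=False, input_shape=(vocab_size,)),
    # Scores over the vocabulary.
    keras.layers.Dense(vocab_size, use_bias=False, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```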
## Instructions for using a custom dataset to train the model
```
1. In the `dataloader.py` file, on the `30th` line (inside `performTokenization()`), change the file name and path (the file must be a `.csv`)
2. On the `31st` line, change the column name to the name of the `column` in your `.csv` file that contains the text
3. On the next line, choose how many example sentences to keep from the `.csv` file. This option is useful when there is not enough RAM on your machine to load all the lines (see the snippet below)
```
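For reference, these are the three lines inside `performTokenization()` as they might look after editing for a hypothetical custom dataset; `my_reviews.csv` and `review_text` are placeholder names, not files that ship with this repository (`pandas` is already imported at the top of `dataloader.py`).
```python
# Inside performTokenization() in dataloader.py (lines 30-32), edited for a
# hypothetical custom dataset; the file and column names are placeholders.
hotel_data = pd.read_csv('./my_reviews.csv')       # line 30: path to your .csv file
hotel_data = hotel_data['review_text'].tolist()    # line 31: the text column of your .csv
hotel_data = hotel_data[0:100]                     # line 32: number of sentences to keep (RAM permitting)
```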
## Instructions for using the Kaggle dataset
```
1. kaggle datasets download -d harmanpreet93/hotelreviews
2. unzip the dataset and keep it in a folder named hotelreviews
3. if you want to change the folder name, follow the guidelines in the section above
```
## Examples
```
getSimilarity("window","door")
result 0.067170754
getSimilarity("window","house")
result 0.029237064
getSimilarity("vegas","girls")
result 0.10303633
getSimilarity("vegas","money")
result 0.22041301
getSimilarity("vegas","gold")
result 0.072522774
getSimilarity("good","bad")
result 0.23856965

getTenClosestWords("water")
result [['guess', 0.30290997], ['disappointed', 0.29180372], ['understand', 0.29148042], ['also', 0.27842966], ['earth', 0.2709463], ['water', 0.26725885], ['one', 0.2629741], ['power', 0.25627777], ['unbelievably', 0.25437462], ['spouse', 0.2514236]]

getTenClosestWords("money")
result [['need', 0.3438718], ['chose', 0.34114993], ['money', 0.332256], ['nearby', 0.3176784], ['think', 0.31027701], ['heading', 0.3087694], ['although', 0.30681318], ['understands', 0.3010662], ['lodging', 0.29836887], ['must', 0.29601774]]
```

NLP/word2vec/dataloader.py

Lines changed: 36 additions & 0 deletions
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Downloading the dataset
# !kaggle datasets download -d harmanpreet93/hotelreviews
# unzip the dataset and keep it in a folder named hotelreviews




def tokenizeData(indv_lines):
    review_data_list = list()
    for line in indv_lines:
        tokenizer = RegexpTokenizer(r'\w+')  # raw string so '\w' is not treated as an escape
        tokens = tokenizer.tokenize(line)

        words = [word.lower() for word in tokens]

        stop_word_list = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_word_list]

        review_data_list.append(words)

    return review_data_list

def performTokenization():
    hotel_data = pd.read_csv('./hotelreviews/hotel-reviews.csv')
    hotel_data = hotel_data['Description'].tolist()
    hotel_data = hotel_data[0:100]  # you can increase the upper limit depending on your RAM size

    indv_lines = hotel_data

    return tokenizeData(indv_lines)

NLP/word2vec/embedding.npy

175 KB
Binary file not shown.

NLP/word2vec/evaluation.py

Lines changed: 49 additions & 0 deletions
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn import preprocessing

def getSimilarity(word1, word2, data, emb):
    # Similarity score between two words: the dot product of their embeddings,
    # scaled by the product of the squared vector norms. (Standard cosine
    # similarity would divide by the norms themselves, via np.linalg.norm.)
    word_to_id = data["word_to_id"]
    word1_emb = emb[word_to_id[word1], :]
    word2_emb = emb[word_to_id[word2], :]

    similarity = np.dot(word1_emb, word2_emb.T) / (np.abs(np.dot(word1_emb, word1_emb.T)) * np.abs(np.dot(word2_emb, word2_emb.T)))
    return similarity

def getSimilarityByEmbedding(emb1, emb2):
    # Same scoring as getSimilarity, but taking the embedding vectors directly.
    similarity = np.dot(emb1, emb2.T) / (np.abs(np.dot(emb1, emb1.T)) * np.abs(np.dot(emb2, emb2.T)))
    return similarity

def getTenClosestWords(search, vocab, data, emb):
    # Score every word in the vocabulary against the search word and keep the top ten.
    topTen = list()
    for word in vocab:
        topTen.append([word, getSimilarity(search, word, data, emb)])
    topTen.sort(key=lambda x: x[1], reverse=True)
    return topTen[:10]

def analogy(word1, word2, word3, data, vocab, emb):
    # word1 : word2 :: word3 : ? -- build the query vector
    # emb(word1) - emb(word2) + emb(word3) and return the ten words whose
    # embeddings score highest against it.
    word_to_id = data["word_to_id"]
    word4_emb = emb[word_to_id[word1], :] - emb[word_to_id[word2], :] + emb[word_to_id[word3], :]

    topTen = list()
    for word in vocab:
        topTen.append([word, getSimilarityByEmbedding(word4_emb, emb[word_to_id[word]])])
    topTen.sort(key=lambda x: x[1], reverse=True)
    return topTen[:10]

def plotEmbeddingsIn2D(emb, data):
    # Project the embeddings to 2D with t-SNE, normalize them, and plot the
    # first 100 vocabulary words with their labels.
    word_to_id = data["word_to_id"]
    vocab = list(data["vocab"])[:100]
    model = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    vectors = model.fit_transform(emb)
    normalizer = preprocessing.Normalizer(norm='l2')
    vectors = normalizer.fit_transform(vectors)
    fig, ax = plt.subplots(figsize=(10, 20))
    for word in vocab:
        print(word, vectors[word_to_id[word]][1])
        # plot the point as well, so the axes autoscale to the data
        ax.scatter(vectors[word_to_id[word]][0], vectors[word_to_id[word]][1], s=5)
        ax.annotate(word, (vectors[word_to_id[word]][0], vectors[word_to_id[word]][1]))
    plt.show()

NLP/word2vec/main.py

Lines changed: 142 additions & 0 deletions
from model import Word2Vec, ScoringLayer, EmbeddingLayer
from utils import constructBagOfWordsInWindowSize, contextPairToOneHot, OneHotOfAllInVocab
from keras.callbacks import TensorBoard
from dataloader import tokenizeData, performTokenization
import argparse
import datetime
from numpy import save, load
from evaluation import getSimilarity, getSimilarityByEmbedding, getTenClosestWords, analogy, plotEmbeddingsIn2D
from collections import OrderedDict

def parse_args():
    parser = argparse.ArgumentParser()

    # optimizer config
    parser.add_argument('--epochs', type=int, default=100)
    parser.add_argument('--batch_size', type=int, default=2000)
    parser.add_argument('--optimizer', type=str, default="adam")
    # model config
    parser.add_argument('--dim_embedding', type=int, default=100)
    # evaluation
    parser.add_argument('--mode', default="train", type=str)
    # getSimilarity
    parser.add_argument('--word1', type=str, default="window")
    parser.add_argument('--word2', type=str, default="house")
    # getTenClosestWords
    parser.add_argument('--word', type=str, default="window")
    # analogy
    parser.add_argument('--word1_', type=str, default="window")
    parser.add_argument('--word2_', type=str, default="house")
    parser.add_argument('--word3_', type=str, default="door")
    # wordIsInVocab
    parser.add_argument('--word_', type=str)

    args = parser.parse_args()

    optim_config = OrderedDict([
        ('epochs', args.epochs),
        ('batch_size', args.batch_size),
        ('optimizer', args.optimizer)
    ])

    model_config = OrderedDict([
        ('dim_embedding', args.dim_embedding)
    ])

    evaluation_config = OrderedDict([
        ('word1', args.word1),
        ('word2', args.word2),
        ('word', args.word),
        ('word1_', args.word1_),
        ('word2_', args.word2_),
        ('word3_', args.word3_),
        ('word_', args.word_),
    ])

    config = OrderedDict([
        ('optim_config', optim_config),
        ('evaluation_config', evaluation_config),
        ('model_config', model_config),
        ('mode', args.mode),
    ])

    return config

config = parse_args()

model_config = config['model_config']
optim_config = config['optim_config']
evaluation_config = config['evaluation_config']
mode = config['mode']
# log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
# tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)


tokenized_data = performTokenization()
context_tuple_list = constructBagOfWordsInWindowSize(tokenized_data)
oneHotNumpy, data = contextPairToOneHot(context_tuple_list, tokenized_data)
print("The total number of words in the vocabulary is: ", data["vocabSize"])


if(mode == "train"):
    def train():

        dimensionality_of_embeddings = model_config['dim_embedding']
        optimizer = optim_config['optimizer']
        epochs = optim_config['epochs']
        batch_size = optim_config['batch_size']

        model = Word2Vec(input_dim=data['vocabSize'], units=int(dimensionality_of_embeddings))
        model.compile(loss='categorical_crossentropy',
                      optimizer=optimizer,
                      metrics=['accuracy'])
        model.fit(oneHotNumpy[:, 0, :], oneHotNumpy[:, 1, :],
                  epochs=epochs,
                  batch_size=batch_size)

        # the weights of the first layer are the word embeddings; save them so
        # the evaluation modes below can load the same file
        emb = model.get_weights()[0]
        save("embedding.npy", emb)

    train()

elif(mode == "help"):
    print("$ python3 main.py --epochs 100 --optimizer \"adam\" --batch_size 2000 --dim_embedding 100\n")
    print("$ python3 main.py --mode \"getSimilarity\" --word1 \"window\" --word2 \"house\"\n")
    print("$ python3 main.py --mode \"getTenClosestWords\" --word \"window\"\n")
    print("$ python3 main.py --mode \"analogy\" --word1_ \"window\" --word2_ \"house\" --word3_ \"door\"\n")
    print("$ python3 main.py --mode \"plot\"")
    print("$ python3 main.py --mode \"help\"")
    print("$ python3 main.py --mode \"wordIsInVocab\" --word_ \"window\"")

else:
    # load the saved embeddings (the pretrained embedding.npy ships with the repo)
    emb = load("embedding.npy")

    if(mode == "getSimilarity"):
        word1 = evaluation_config['word1']
        word2 = evaluation_config['word2']

        print(getSimilarity(word1, word2, data, emb))

    if(mode == "getTenClosestWords"):
        word = evaluation_config['word']

        print(getTenClosestWords(word, data['vocab'], data, emb))

    if(mode == "analogy"):
        word1_ = evaluation_config['word1_']
        word2_ = evaluation_config['word2_']
        word3_ = evaluation_config['word3_']

        print(analogy(word1_, word2_, word3_, data, data['vocab'], emb))

    if(mode == "wordIsInVocab"):
        word_ = evaluation_config['word_']
        vocabList = list(data['vocab'])

        if word_ in vocabList:
            print("YES")
        else:
            print("NO")

    if(mode == "plot"):
        plotEmbeddingsIn2D(emb, data)

NLP/word2vec/model.py

Lines changed: 49 additions & 0 deletions
import tensorflow as tf
from keras import layers
import keras


class EmbeddingLayer(layers.Layer):
    # Maps a one-hot (or bag-of-words) input of size input_dim to an embedding
    # of size units; the weight matrix w holds the word embeddings.

    def __init__(self, units, input_dim):
        super(EmbeddingLayer, self).__init__()

        self.input_dim = input_dim

        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(initial_value=w_init(shape=(input_dim, units),
                                                  dtype='float32'),
                             trainable=True,
                             name="emb")

    def call(self, inputs):
        embedding = tf.matmul(inputs, self.w)
        return embedding

class ScoringLayer(layers.Layer):
    # Maps an embedding of size input_dim to a softmax distribution over units outputs.

    def __init__(self, units, input_dim):
        super(ScoringLayer, self).__init__()

        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(initial_value=w_init(shape=(input_dim, units),
                                                  dtype='float32'),
                             trainable=True)

    def call(self, inputs):
        output = tf.matmul(inputs, self.w)
        softmax = tf.nn.softmax(output, axis=-1)
        return softmax

class Word2Vec(keras.Model):
    # CBOW Word2Vec: one-hot context -> embedding -> softmax scores over the vocabulary.

    def __init__(self, units, input_dim):
        super(Word2Vec, self).__init__()

        self.embedding = EmbeddingLayer(units, input_dim)
        # note the swapped arguments: the scoring layer takes an embedding of
        # size `units` and produces scores over the `input_dim`-sized vocabulary
        self.scoring = ScoringLayer(input_dim, units)

    def call(self, inputs):
        embedding = self.embedding(inputs)
        output = self.scoring(embedding)
        return output

0 commit comments
