Document Embeddings with Doc2Vec
This page describes how to build a similarity analyzer (like Topic Modelling with LDA) based on the Doc2Vec algorithm contained in the gensim package. It will guide you through training a Doc2Vec model.
- Build the codifier using the provided data.
>>> import codifier
>>> cod = codifier.build(start=1999, end=2018, data_dir='<data-dir>')
- Export the whole corpus via
>>> cod.export_codifier_corpus('corpus.txt', 'labels.txt')
This writes each statute on a separate line and generates corpus.txt and labels.txt, containing the corpus and the corresponding labels respectively; a quick alignment check is sketched below.
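The two files are expected to be line-aligned: statute i on line i of corpus.txt, its label on line i of labels.txt. A minimal sanity-check sketch to verify this could look like:
# Verify that corpus.txt and labels.txt are line-aligned:
# line i of corpus.txt should correspond to line i of labels.txt.
with open('corpus.txt', 'r') as f:
    docs = f.read().splitlines()
with open('labels.txt', 'r') as f:
    labels = f.read().splitlines()
assert len(docs) == len(labels), 'corpus and labels differ in length'
print('Exported %d statutes' % len(docs))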
- Import gensim and the other required modules:
import gensim.models as g
import logging
import sys
import tokenizer
from gensim.models.doc2vec import TaggedDocument
from multiprocessing import cpu_count
#enable logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
- Define the doc2vec parameters (you can adjust them to obtain different results):
#doc2vec parameters
vector_size = 150          # dimensionality of the document vectors
window_size = 8            # context window size
min_count = 4              # ignore words appearing fewer than 4 times
sampling_threshold = 1e-5  # downsampling threshold for frequent words
negative_size = 5          # number of negative samples
train_epoch = 50           # number of training epochs
dm = 0                     # 0 = dbow; 1 = dmpv
worker_count = cpu_count() - 1  # use all but one core
- Get the labels
with open('labels.txt', 'r') as f:
    labels = f.read().splitlines()
- Create TaggedDocument objects, using the labels as single tags:
taggeddocs = []
with open('corpus.txt', 'r') as f:
    docs = f.read().splitlines()
for label, doc in zip(labels, docs):
    td = TaggedDocument(words=tokenizer.tokenizer.split(doc.lower(), delimiter=' '), tags=[label])
    taggeddocs.append(td)
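Each entry pairs the token list of a statute with its identifier as the single tag. For illustration, an element of taggeddocs would look roughly like this (the tokens shown here are hypothetical; the tag format matches the labels used later on this page):
>>> taggeddocs[0]
TaggedDocument(words=['άρθρο', 'πρώτο', ...], tags=['ν. 4009/2011'])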
- Train the model with:
model = g.Doc2Vec(taggeddocs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, iter=train_epoch)
Note that in gensim >= 4.0 the size and iter arguments are named vector_size and epochs respectively.
#save model
model.save('laws_model.bin')
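To reuse the saved model later, for instance from a separate script, it can be loaded back with gensim's standard load method:
import gensim.models as g

# load the model trained and saved above
model = g.Doc2Vec.load('laws_model.bin')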
- The full script is located at train_doc2vec.py
- Query the most similar words via model.most_similar, e.g. for 'υπουργός' ('minister'):
>>> model.most_similar('υπουργός', topn=10)
[('υφυπουργός', 0.5026417374610901), ('αναθέτει', 0.48663175106048584), ('συμβούλιο', 0.4686082601547241), ('ραδιοτηλεόρασης.', 0.4630734920501709), ('υπουργό', 0.4535461366176605), ('αποφαινόμενο', 0.4525904059410095), ('προσφύγει', 0.4489518404006958), ('έγγραφες', 0.4454580545425415), ('αποφασίζει', 0.44335660338401794), ('αρμόδιος', 0.4433564841747284)]
- Or the most similar documents via model.docvecs.most_similar:
>>> model.docvecs.most_similar('ν. 4009/2011')
[('π.δ. 127/2010', 0.29893651604652405), ('π.δ. 58/2015', 0.24261419475078583), ('π.δ. 13/2013', 0.23538821935653687), ('ν. 4001/2011', 0.22230146825313568), ('ν. 4321/2015', 0.19321933388710022), ('ν. 4420/2016', 0.19074279069900513), ('π.δ. 48/1999', 0.18942637741565704), ('π.δ. 148/2010', 0.18506628274917603), ('π.δ. 172/2014', 0.17951558530330658), ('π.δ. 134/2017', 0.17921951413154602)]
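Both queries above address statutes that were part of the training corpus. For a new, unseen text, a vector can first be inferred with infer_vector and then compared against the stored document vectors. A minimal sketch, assuming the same tokenizer module used during training and a hypothetical input string:
# Infer a vector for an unseen text and retrieve the closest statutes.
new_text = 'ο υπουργός αποφασίζει'  # hypothetical input
tokens = tokenizer.tokenizer.split(new_text.lower(), delimiter=' ')
vec = model.infer_vector(tokens)
print(model.docvecs.most_similar([vec], topn=5))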
During this year's GSoC we decided to train a doc2vec model on the corpus and include it in the project. The model can be found in the 'models' directory and contains roughly 3000 vectors. We will continue to grow it in size. Any future expansion of the project's legislation extraction capabilities could also yield a much larger model containing administrative and parliamentary decisions. You can find more information on how to contribute in the relevant section of the wiki.