Commit 7f3e9f8

Merge pull request #8 from raszidzie/master
Added new similarity algorithm using Gensim Library
2 parents cbcba02 + 317c5d1 commit 7f3e9f8

3 files changed: +57 -2 lines changed

README.md

Lines changed: 13 additions & 2 deletions
@@ -304,6 +304,19 @@ print(fourgram.distance(s1, s2))
 
 ```
 
+## Gensim
+Gensim is billed as a Natural Language Processing package that does ‘Topic Modeling for Humans’, but in practice it is much more than that.
+
+If you are unfamiliar with topic modeling, it is a technique to extract the underlying topics from large volumes of text. Gensim provides algorithms like LDA and LSI (which we will see later in this post) and the necessary sophistication to build high-quality topic models.
+
+You may argue that topic models and word embeddings are available in other packages like scikit-learn, R, etc. But the breadth and scope of facilities to build and evaluate topic models are unparalleled in gensim, and it offers many more convenient facilities for text processing.
+
+It is a great package for processing texts, working with word vector models (such as Word2Vec, FastText, etc.) and building topic models.
+
+Another significant advantage of gensim is that it lets you handle large text files without having to load the entire file into memory.
+
+Gensim Tutorial – A Complete Beginners Guide: https://www.machinelearningplus.com/nlp/gensim-tutorial/
+
 ## Shingle (n-gram) based algorithms
 A few algorithms work by converting strings into sets of n-grams (sequences of n characters, also sometimes called k-shingles). The similarity or distance between the strings is then the similarity or distance between the sets.

@@ -366,8 +379,6 @@ SIFT4 is a general purpose string distance algorithm inspired by JaroWinkler and

 **Not implemented yet**

-
-
 ## Users
 * [StringSimilarity.NET](https://github.com/feature23/StringSimilarity.NET) a .NET port of java-string-similarity

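The README text added above says gensim can handle large text files without loading the whole file into memory. The usual way to get that behaviour is to pass an iterable that yields one document at a time. The following is a minimal standard-library sketch of that pattern, not code from this commit: the class name `StreamedCorpus`, the helper `count_tokens`, and the whitespace tokenizer (a stand-in for `nltk`'s `word_tokenize`) are all assumptions made for illustration.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical helper: yields one tokenised document per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Whitespace split is a crude stand-in for a real tokenizer.
                tokens = line.lower().split()
                if tokens:
                    yield tokens


def count_tokens(corpus):
    """Consume the stream once; only one line is held in memory at a time."""
    return sum(len(doc) for doc in corpus)


# Small demo file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("I make my own fun.\n\nMy socks are a force multiplier.\n")
    demo_path = tmp.name

corpus = StreamedCorpus(demo_path)
first_pass = list(corpus)      # materialised here only to inspect the demo
total = count_tokens(corpus)   # second pass works because __iter__ re-opens the file
os.remove(demo_path)
```

An object like this can be iterated several times, which matters when one pass builds a vocabulary and a later pass builds vectors.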
requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+gensim
+nltk

similarity/gensim_similarity.py

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
+import gensim
+from nltk.tokenize import word_tokenize
+
+class GensimSimilarity:
+    def __init__(self):
+        self.raw_documents = ["I'm taking the show on the road.",
+                              "My socks are a force multiplier.",
+                              "I am the barber who cuts everyone's hair who doesn't cut their own.",
+                              "Legend has it that the mind is a mad monkey.",
+                              "I make my own fun."]
+
+    def getSimilarity(gen):
+        gen_docs = [[w.lower() for w in word_tokenize(text)]
+                    for text in gen.raw_documents]
+        print(gen_docs)
+        dictionary = gensim.corpora.Dictionary(gen_docs)
+        print("Number of words in dictionary:",len(dictionary))
+
+        for i in range(len(dictionary)):
+            print(i, dictionary[i])
+
+        corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
+        print(corpus)
+
+        tf_idf = gensim.models.TfidfModel(corpus)
+        print(tf_idf)
+        s = 0
+        for i in corpus:
+            s += len(i)
+        print(s)
+
+        sims = gensim.similarities.Similarity('workdir/',tf_idf[corpus],num_features=len(dictionary))
+
+        query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
+        print(query_doc)
+        query_doc_bow = dictionary.doc2bow(query_doc)
+        print(query_doc_bow)
+        query_doc_tf_idf = tf_idf[query_doc_bow]
+        print(f'Result: {sims[query_doc_tf_idf]}')
+
+similarity = GensimSimilarity()
+similarity.getSimilarity()
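
The module above builds a dictionary, converts each document to a bag of words, weights it with TF-IDF, and scores a query against every document. To make those steps concrete without installing gensim or nltk, here is a standard-library sketch of the same idea. It is not the commit's code and will not reproduce gensim's exact numbers. The function names are hypothetical, the tokenizer is a crude stand-in for `word_tokenize`, and the weighting (raw term count times `log2(N / df)`, then L2 normalisation) is an assumption about gensim's defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude stand-in for nltk's word_tokenize: lowercase, split, trim punctuation.
    words = (w.strip(".,!?;:") for w in text.lower().split())
    return [w for w in words if w]


def build_idf(docs):
    """Assumed weighting: idf(term) = log2(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse, L2-normalised TF-IDF vector; terms outside the vocabulary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items() if idf[t] > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


raw_documents = [
    "I'm taking the show on the road.",
    "My socks are a force multiplier.",
    "I am the barber who cuts everyone's hair who doesn't cut their own.",
    "Legend has it that the mind is a mad monkey.",
    "I make my own fun.",
]

docs = [tokenize(d) for d in raw_documents]
idf = build_idf(docs)
doc_vectors = [tfidf_vector(d, idf) for d in docs]

query = tfidf_vector(tokenize("Socks are a force for good."), idf)
scores = [cosine(query, v) for v in doc_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

With this tokenizer the query shares words only with the second and fourth documents, so the remaining scores are zero and the second document ranks first.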
