Message from @cubiedev
Hello again!
@chrisjbryant suggested the use of python-Levenshtein for getting similarity ratios between strings in the comments of pull request #10, and though I've been busy until now, I finally got around to testing it.
This pull request has some scripts in the Markdown. They are included in case you want to reproduce my tests and results. However, I would recommend not giving the scripts themselves much attention, and focusing on the explanations and outputs.
Is it equivalent?
I created the following file in the root folder of the project, where it could read from `test_values.py`:
```python
#!/usr/bin/env python
# encoding: utf-8

from difflib import SequenceMatcher
from Levenshtein import ratio
import unittest


class TestWordForms(unittest.TestCase):
    """
    Simple TestCase for a specific input to output, one instance generated per test case for use in a TestSuite
    """

    def __init__(self, text_input: str, expected_output: str, description: str = ""):
        super().__init__()
        self.text_input = text_input
        self.expected_output = expected_output
        self.description = description

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def runTest(self):
        self.assertEqual(
            SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
            ratio(self.text_input, self.expected_output),
            self.description,
        )


if __name__ == "__main__":
    from test_values import test_values

    suite = unittest.TestSuite()
    suite.addTests(
        TestWordForms(
            inp,
            value,
            f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    )
    unittest.TextTestRunner().run(suite)
```
In short, this takes all input values from `test_values.py`, and all individual outputs for these inputs, and checks whether the similarity ratio between the two is identical when using `difflib.SequenceMatcher().ratio()` vs. `Levenshtein.ratio()`. The output is the following:
```
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 621 tests in 0.080s

OK
```
So, `Levenshtein.ratio()` is indeed equivalent to `difflib.SequenceMatcher().ratio()` for these test cases.
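For intuition, both functions compute the same similarity measure: twice the number of matching characters divided by the total length of both strings. A minimal standalone sketch using only the standard library's `difflib` (the example words below are mine, not from the test suite):

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return 2.0 * M / T, where M is the number of matching
    characters and T is the total number of characters in a and b."""
    return SequenceMatcher(None, a, b).ratio()


# "python" vs "pythonic": 6 matching characters out of 14 total.
score = similarity("python", "pythonic")
print(score)  # 2 * 6 / 14 ≈ 0.857
```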
But is it faster?
Again, I wrote a quick script for this. None of these are included in the actual commits, as they should not be packaged with `word_forms`. The script is:
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from difflib import SequenceMatcher
from Levenshtein import ratio
from test_values import test_values

test_cases = [
    (
        inp,
        value,
    )
    for inp, out in test_values
    for values in out.values()
    for value in values
]

n = 100
ratio_list = []
for str_one, str_two in test_cases:
    diff = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
    leven = timeit(lambda: ratio(str_one, str_two), number=n)
    ratio_list.append(diff / leven)
    # print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")

print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")
```
Which outputted:
```
Minimum performance gain (ratio): 21.2509 times as fast
Maximum performance gain (ratio): 194.4625 times as fast
Median performance gain (ratio) : 78.2975 times as fast
```
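As an aside, the median line in the script above indexes into the sorted list with `round(len(...)/2)`, which picks the upper-middle element when the list has an even length. The standard library's `statistics.median` averages the two middle values instead; a small sketch with made-up sample gains:

```python
import statistics

ratio_list = [21.3, 60.0, 80.5, 194.5]  # made-up sample gains

# Index-based pick: the upper-middle element of the sorted list.
indexed = sorted(ratio_list)[round(len(ratio_list) / 2)]

# statistics.median averages the two middle values for even-length data.
averaged = statistics.median(ratio_list)

print(indexed)   # 80.5
print(averaged)  # 70.25
```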
So, yes, it is much faster. Will it be a noticeable performance increase when implemented in `word_forms`? Well, I made a quick script for that too.
Is it actually noticeably faster in `get_word_forms`?
```python
#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from word_forms.word_forms import get_word_forms
from test_values import test_values

n = 200
speed_list = []
for test_case in test_values:
    str_one = test_case[0]
    speed = timeit(lambda: get_word_forms(str_one), number=n)
    speed_list.append(speed)
    # print(f"Took {speed:.8f}s")

print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")
```
Which outputted the following when using `difflib.SequenceMatcher().ratio()` (note: the execution time is for calling the function 200 times):
```
Minimum execution time (seconds): 0.01940580s
Maximum execution time (seconds): 1.16317950s
Median execution time (seconds) : 0.07265300s
```
and the following for `Levenshtein.ratio()`:
```
Minimum execution time (seconds): 0.01647300s
Maximum execution time (seconds): 1.23246420s
Median execution time (seconds) : 0.05827050s
```
Considering the median, there is a noticeable performance increase of roughly 20%, though this is likely not enough for any user to actually notice. Regardless, using `Levenshtein.ratio()` is definitely preferable, unless you want to keep the number of dependencies to a minimum.
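One way to get the speedup without making the dependency mandatory is an optional-import fallback. This is only a sketch of that pattern, not something the PR itself proposes:

```python
from difflib import SequenceMatcher

try:
    # python-Levenshtein is a C extension and much faster, but optional.
    from Levenshtein import ratio as _ratio
except ImportError:
    # Fall back to the pure-Python standard library implementation.
    def _ratio(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0.0, 1.0]; uses Levenshtein when available."""
    return _ratio(a, b)


print(similarity("aardwolf", "aardwolves"))
```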
Thank you for the recommendation, @chrisjbryant!
All tests still pass, as expected.
- Tom Aarsen
Message from @gutfeeling
- Added a simple lemmatizer in `word_forms.lemmatizer`. It uses `get_word_forms()` to generate all forms of the word and then picks the shortest form that appears first in the dictionary (i.e. in alphabetically sorted order).
- Updated dependencies. `Unipath` has been replaced by Python 3's `pathlib`. `NLTK` and `inflect` versions have been bumped.