
v2.1.0: Lemmatizer added, performance improved, dependencies updated

@gutfeeling gutfeeling released this 18 Sep 11:30

Message from @cubiedev

Hello again!

In the comments of pull request #10, @chrisjbryant suggested using python-Levenshtein to compute similarity ratios between strings, and though I've been busy until now, I finally got around to testing it.
This pull request includes some scripts in the Markdown, in case you want to reproduce my tests and results. However, I would recommend not giving the scripts themselves much attention, and focusing on the explanations and outputs instead.

Is it equivalent?

I created the following file in the root folder of the project, so that it could read from test_values.py.

#!/usr/bin/env python
# encoding: utf-8

from difflib import SequenceMatcher
from Levenshtein import ratio
import unittest


class TestWordForms(unittest.TestCase):
    """
    Simple TestCase for a specific input to output, one instance generated per test case for use in a TestSuite
    """

    def __init__(self, text_input: str, expected_output: str, description: str = ""):
        super().__init__()
        self.text_input = text_input
        self.expected_output = expected_output
        self.description = description

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def runTest(self):
        self.assertEqual(
            SequenceMatcher(None, self.text_input, self.expected_output).ratio(),
            ratio(self.text_input, self.expected_output),
            self.description,
        )


if __name__ == "__main__":
    from test_values import test_values

    suite = unittest.TestSuite()
    suite.addTests(
        TestWordForms(
            inp,
            value,
            f"difflib.SequenceMatcher(None, {repr(inp)}, {repr(value)}) ~= Levenshtein.ratio({repr(inp)}, {repr(value)})",
        )
        for inp, out in test_values
        for values in out.values()
        for value in values
    )
    unittest.TextTestRunner().run(suite)

In short, this takes all input values from test_values.py, and all individual outputs for these inputs, and checks whether the similarity ratio between these two is identical when using difflib.SequenceMatcher().ratio() vs Levenshtein.ratio(). The output is the following:

.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 621 tests in 0.080s

OK

So, Levenshtein.ratio() is indeed equivalent to difflib.SequenceMatcher().ratio() for these test cases.
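A plausible reason the two agree on these cases: difflib's ratio is 2*M/T (M matched characters, T the total length of both strings), while python-Levenshtein's ratio() is based on an edit distance in which a substitution costs 2 (a delete plus an insert), giving 1 - dist/T. The helper below is my own stdlib-only reimplementation of that metric for illustration; it is not code from this PR.

```python
from difflib import SequenceMatcher


def levenshtein_ratio(a: str, b: str) -> float:
    # Edit distance with insert/delete cost 1 and substitution cost 2,
    # the metric python-Levenshtein's ratio() is defined on:
    # ratio = 1 - dist / (len(a) + len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1 - prev[-1] / (len(a) + len(b)) if (a or b) else 1.0


pair = ("aggression", "aggressive")
print(SequenceMatcher(None, *pair).ratio())  # 0.8
print(levenshtein_ratio(*pair))              # 0.8
```

The two formulas coincide whenever SequenceMatcher's matching blocks recover a longest common subsequence, which evidently holds for all 621 pairs in test_values.py; it is not a guarantee for arbitrary strings.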

But is it faster?

Again, I wrote a quick script for this. None of these are included in the actual commits as they should not be packaged with word_forms. The script is:

#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from difflib import SequenceMatcher
from Levenshtein import ratio

from test_values import test_values
test_cases = [
    (
        inp,
        value,
    )
    for inp, out in test_values
    for values in out.values()
    for value in values
]

n = 100
ratio_list = []
for str_one, str_two in test_cases:
    diff  = timeit(lambda: SequenceMatcher(None, str_one, str_two).ratio(), number=n)
    leven = timeit(lambda: ratio(str_one, str_two), number=n)
    ratio_list.append(diff / leven)
    #print(f"Levenshtein.ratio() is {ratio_list[-1]:.4f} times as fast as difflib.SequenceMatcher().ratio() for {repr(str_one)} to {repr(str_two)}")
print(f"Minimum performance gain (ratio): {min(ratio_list):.4f} times as fast")
print(f"Maximum performance gain (ratio): {max(ratio_list):.4f} times as fast")
print(f"Median performance gain (ratio) : {sorted(ratio_list)[round(len(ratio_list)/2)]:.4f} times as fast")

The output:

Minimum performance gain (ratio): 21.2509 times as fast
Maximum performance gain (ratio): 194.4625 times as fast
Median performance gain (ratio) : 78.2975 times as fast
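As an aside: the median line in the script above picks a single element with sorted(...)[round(len(...) / 2)], which works as an approximation but, for even-length lists, skips the averaging of the two middle values. The standard library's statistics.median (not used in the original script) handles both cases; the values below are made up to show the difference.

```python
from statistics import median

ratio_list = [21.2509, 78.2975, 194.4625, 80.0]

# For an even-length list, median() averages the two middle values,
# while indexing at round(len / 2) just picks one of them.
print(median(ratio_list))                              # 79.14875
print(sorted(ratio_list)[round(len(ratio_list) / 2)])  # 80.0
```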

So, yes, it is much faster. Will it be a noticeable performance increase when implemented in word_forms? Well, I made a quick script for that too.

Is it actually noticeably faster in get_word_forms?

#!/usr/bin/env python
# encoding: utf-8

from timeit import timeit
from word_forms.word_forms import get_word_forms
from test_values import test_values

n = 200
speed_list = []
for test_case in test_values:
    str_one = test_case[0]
    speed = timeit(lambda: get_word_forms(str_one), number=n)
    speed_list.append(speed)
    #print(f"Took {speed:.8f}s")
print(f"Minimum execution time (seconds): {min(speed_list):.8f}s")
print(f"Maximum execution time (seconds): {max(speed_list):.8f}s")
print(f"Median execution time (seconds) : {sorted(speed_list)[round(len(speed_list)/2)]:.8f}s")

The output when using difflib.SequenceMatcher().ratio() (note: each execution time covers 200 calls of the function):

Minimum execution time (seconds): 0.01940580s
Maximum execution time (seconds): 1.16317950s
Median execution time (seconds) : 0.07265300s

and the following for Levenshtein.ratio():

Minimum execution time (seconds): 0.01647300s
Maximum execution time (seconds): 1.23246420s
Median execution time (seconds) : 0.05827050s
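For reference, the roughly 20% improvement quoted below follows directly from the two median values above:

```python
difflib_median = 0.07265300  # median, difflib.SequenceMatcher().ratio()
leven_median = 0.05827050    # median, Levenshtein.ratio()

# Relative reduction in execution time after switching to Levenshtein.ratio()
reduction = 1 - leven_median / difflib_median
print(f"{reduction:.1%}")  # 19.8%
```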

When considering the median, there is a noticeable performance increase of roughly 20%, but this is likely not enough for any user to actually notice. Regardless, using Levenshtein.ratio() is definitely preferable, unless you desperately want to keep the number of dependencies to a minimum.


Thank you for the recommendation, @chrisjbryant!

All tests still pass, as expected.

  • Tom Aarsen

Message from @gutfeeling

  1. Added a simple lemmatizer in word_forms.lemmatizer. It uses get_word_forms() to generate all forms of the word and then picks the shortest form that appears first in the dictionary (i.e. in alphabetically sorted order).
  2. Updated dependencies. Unipath has been replaced by Python3's pathlib. NLTK and inflect versions have been bumped.
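The selection rule described in item 1 can be sketched as follows. Note that pick_lemma is a hypothetical helper for illustration, not the actual word_forms.lemmatizer API; it only demonstrates "shortest form, ties broken alphabetically".

```python
def pick_lemma(forms):
    # Shortest form wins; among equally short forms, the one that comes
    # first in alphabetically sorted order is chosen.
    return min(forms, key=lambda form: (len(form), form))


print(pick_lemma({"operations", "operational", "operation", "operate"}))  # operate
```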