Hi,
I am starting to use this library and came across these potential glitches.
I would expect cosine, jaro and jarowinkler to return exactly 1.0 for identical strings, but I get the following (jarowinkler is omitted, but it shows the same behaviour as jaro here):
server_encoding | UTF8
client_encoding | UTF8
pg_similarity.cosine_is_normalized | on
pg_similarity.jaro_is_normalized | on
cl_generated_1000=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.9 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
# set pg_similarity.cosine_tokenizer = 'alnum';
SET
# set pg_similarity.jaro_tokenizer = 'alnum';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | t | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'word';
SET
# set pg_similarity.jaro_tokenizer = 'word';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | f | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'gram';
SET
# set pg_similarity.jaro_tokenizer = 'gram';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
f | f | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'camelcase';
SET
# set pg_similarity.jaro_tokenizer = 'camelcase';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | f | t | t
(1 row)
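One possible explanation for the cosine results, offered as a guess and not confirmed against the pg_similarity source, is ordinary floating-point rounding: if the score is computed as the shared token count divided by the product of two square roots, then an input that splits into two tokens yields 2 / (sqrt(2) * sqrt(2)), which lands just below 1.0 in double precision, while a single-token input yields exactly 1.0. The sketch below only illustrates that arithmetic. `token_cosine` and `is_one` are hypothetical helpers, not part of the library, and the token counts are assumptions.

```python
import math


def token_cosine(n_common: int, n_a: int, n_b: int) -> float:
    """Naive cosine over token sets: shared tokens / (sqrt(|A|) * sqrt(|B|)).

    Hypothetical helper; it is an assumption that the library computes
    the score this way.
    """
    return n_common / (math.sqrt(n_a) * math.sqrt(n_b))


def is_one(score: float, tol: float = 1e-9) -> bool:
    """Compare a similarity score to 1.0 with an absolute tolerance."""
    return math.isclose(score, 1.0, rel_tol=0.0, abs_tol=tol)


# Identical strings, assumed to split into two tokens (e.g. 'Michael', 'C').
two_tokens = token_cosine(2, 2, 2)
# Identical strings, assumed to form a single token.
one_token = token_cosine(1, 1, 1)

print(two_tokens < 1.0)    # True: sqrt(2) * sqrt(2) is slightly above 2
print(one_token < 1.0)     # False: 1 / (1 * 1) is exactly 1.0
print(is_one(two_tokens))  # True: equal to 1.0 within the tolerance
```

If rounding is indeed the cause, comparing against 1.0 with a small tolerance on the caller's side would sidestep the strict `< 1.` check, though it would not explain the jaro results.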