Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

From François CHAHUNEAU
Subject Re: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date
Msg-id AM4PR02MB131633B46174D056E6FE6130F6330@AM4PR02MB1316.eurprd02.prod.outlook.com
Whole thread Raw
In response to Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List pgsql-bugs
Hello Alexander,
This is fine with us. Yes, separate thresholds seem preferable.
Best Regards


From: Alexander Korotkov <a.korotkov@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
 

François CHAHUNEAU
Directeur des technologies

 
NUMEN DIGITAL24, rue Marc Seguin 75018 Paris Francewww.numen.fr
Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94
Pensez vert, n’imprimez que nécessaire. Les informations contenues dans le présent e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et peuvent contenir des informations confidentielles, protégées par un secret professionnel. L’utilisation de ces informations par d’autres personnes que le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas destinataire de ce message, la publication, la reproduction, la diffusion et /ou la distribution de ces informations auprès de tiers n’est pas autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en informer immédiatement, détruire l'email, ses copies et documents joints et le supprimer.
 
On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point.  I've no complaints about that.  I'm going to propose corresponding patch to the next commitfest.

I've written a draft patch for fixing this inconsistency.  Please, find it in attachment.  This patch doesn't contain proper documentation and comments yet.

I've called existing behavior subset_similarity().  I didn't use name substring_similarity(), because it doesn't really looking for substring with appropriate padding, but rather searching for continuous subset of trigrams.  For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are provided.  I've added extra arrow sign to denote these operators look deeper into string.

Simultaneously, word_similarity() now forces extent bounds to be word bounds.  Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.

# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
        t         | subset_similarity | word_similarity
------------------+-------------------+-----------------
 message          |               0.6 |             0.3
 message s        |               0.8 |        0.363636
 message sag      |                 1 |             0.5
 message sag sag  |                 1 |             0.5
 message sag sage |                 1 |               1
(5 rows)

The difference here is only in 'messsage s' row, because word_similarity() allows matching one word to two or more while my_word_similarity() doesn't allow that.  In this case word_similarity() returns similarity between 'sage' and 'message s'.

# select similarity('sage', 'message s');
 similarity
------------
   0.363636
(1 row)

I think behavior of word_similarity() appears better here, because typo can break word into two.

I also wonder if word_similarity() and subset_similarity() should share same threshold value for indexed search.  subset_similarity() typically returns higher values than word_similarity().  Thus, it's probably makes sense to split their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

pgsql-bugs by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: BUG #14952: COPY fails to fill in IDENTITY column default value
Next
From: Robert Haas
Date:
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug