Home > mailing lists

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

From	François CHAHUNEAU
Subject	Re: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date	December 7, 2017 23:39:05
Msg-id	AM4PR02MB131633B46174D056E6FE6130F6330@AM4PR02MB1316.eurprd02.prod.outlook.com Whole thread Raw
In response to	Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug (Alexander Korotkov <a.korotkov@postgrespro.ru>)
List	pgsql-bugs

Tree view

Hello Alexander,

This is fine with us. Yes, separate thresholds seem preferable.

Best Regards

Obtenez Outlook pour iOS

From: Alexander Korotkov <a.korotkov@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

François CHAHUNEAU
Directeur des technologies

NUMEN DIGITAL| 24, rue Marc Seguin 75018 Paris France | www.numen.fr
Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94

Pensez vert, n’imprimez que nécessaire. Les informations contenues dans le présent e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et peuvent contenir des informations confidentielles, protégées par un secret professionnel. L’utilisation de ces informations par d’autres personnes que le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas destinataire de ce message, la publication, la reproduction, la diffusion et /ou la distribution de ces informations auprès de tiers n’est pas autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en informer immédiatement, détruire l'email, ses copies et documents joints et le supprimer.

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.

I've written a draft patch for fixing this inconsistency. Please, find it in attachment. This patch doesn't contain proper documentation and comments yet.

I've called existing behavior subset_similarity(). I didn't use name substring_similarity(), because it doesn't really looking for substring with appropriate padding, but rather searching for continuous subset of trigrams. For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are provided. I've added extra arrow sign to denote these operators look deeper into string.

Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.

# with data(t) as (

values

('message'),

('message s'),

('message sag'),

('message sag sag'),

('message sag sage')

)

select t, subset_similarity('sage', t), word_similarity('sage', t)

from data;

t | subset_similarity | word_similarity

------------------+-------------------+-----------------

message | 0.6 | 0.3

message s | 0.8 | 0.363636

message sag | 1 | 0.5

message sag sag | 1 | 0.5

message sag sage | 1 | 1

(5 rows)

The difference here is only in 'messsage s' row, because word_similarity() allows matching one word to two or more while my_word_similarity() doesn't allow that. In this case word_similarity() returns similarity between 'sage' and 'message s'.

# select similarity('sage', 'message s');

similarity

------------

0.363636

(1 row)

I think behavior of word_similarity() appears better here, because typo can break word into two.

I also wonder if word_similarity() and subset_similarity() should share same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(). Thus, it's probably makes sense to split their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

pgsql-bugs by date:

From: Peter Eisentraut
Date: 07 December 2017, 23:13:53
Subject: Re: BUG #14952: COPY fails to fill in IDENTITY column default value

From: Robert Haas
Date: 07 December 2017, 23:59:06
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

Previous

Next