Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

From	Alexander Korotkov
Subject	Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date	December 7, 2017 19:38:59
Msg-id	CAPpHfdtJ+JdeKUqBCOP_nHoDGs8iPsZSywUGJftLxOofehb96w@mail.gmail.com
In response to	Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses	Re: [BUGS] pg_trgm word_similarity inconsistencies or bug (François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>)
List	pgsql-bugs

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point. I have no complaints about that. I'm going to propose a corresponding patch for the next commitfest.

I've written a draft patch to fix this inconsistency. Please find it attached. The patch doesn't contain proper documentation or comments yet.

I've called the existing behavior subset_similarity(). I didn't use the name substring_similarity(), because it doesn't really look for a substring with appropriate padding, but rather searches for a continuous subset of trigrams. For index search over subset similarity, the %>>, <<%, <->>>, and <<<-> operators are provided. I've added an extra arrow sign to denote that these operators look deeper into the string.

At the same time, word_similarity() now forces the extent bounds to be word bounds. With this change, word_similarity() behaves similarly to the my_word_similarity() function proposed on Stack Overflow.

# with data(t) as (
    values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
  )
  select t, subset_similarity('sage', t), word_similarity('sage', t)
  from data;
        t         | subset_similarity | word_similarity
------------------+-------------------+-----------------
 message          |               0.6 |             0.3
 message s        |               0.8 |        0.363636
 message sag      |                 1 |             0.5
 message sag sag  |                 1 |             0.5
 message sag sage |                 1 |               1
(5 rows)
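
To make the subset_similarity column easier to follow, here is a rough Python model of "searching for a continuous subset of trigrams". It is only a sketch under stated assumptions, not the C code from the patch: it assumes the usual pg_trgm trigram extraction (lowercased words, each padded with two leading spaces and one trailing space), and it simply tries every continuous run of the string's trigrams, keeping the best ratio of shared to combined trigrams. The names ordered_trigrams and subset_similarity_model are hypothetical.

```python
import re


def ordered_trigrams(text):
    """Trigrams in order of appearance, duplicates kept.

    Assumption: words are runs of alphanumeric characters, lowercased,
    and padded with two leading spaces and one trailing space.
    """
    result = []
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def subset_similarity_model(query, text):
    """Brute-force model: best score over all continuous trigram runs."""
    q = set(ordered_trigrams(query))
    seq = ordered_trigrams(text)
    best = 0.0
    for start in range(len(seq)):
        window = set()
        for end in range(start, len(seq)):
            window.add(seq[end])
            # Shared trigrams divided by all distinct trigrams involved.
            best = max(best, len(window & q) / len(window | q))
    return best


for t in ["message", "message s", "message sag"]:
    print(t, round(subset_similarity_model("sage", t), 6))
# message 0.6
# message s 0.8
# message sag 1.0
```

For 'message' the best run is "sag", "age", "ge ", which covers 3 of the 5 trigrams of 'sage'. Nothing in this model ties the run to word boundaries, which is why these scores are higher than the word_similarity column.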

The only difference from my_word_similarity() is in the 'message s' row, because word_similarity() allows matching one word against two or more words, while my_word_similarity() doesn't. In this case word_similarity() returns the similarity between 'sage' and 'message s'.

# select similarity('sage', 'message s');
 similarity
------------
   0.363636
(1 row)

I think the behavior of word_similarity() is better here, because a typo can break a word into two.
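
The 0.363636 figure and the one-word versus multi-word distinction can be checked with a small Python model. This is a sketch under assumptions, not the patch's implementation: it assumes pg_trgm-style trigrams (lowercased words padded with two leading spaces and one trailing space) and scores a pair of trigram sets as shared count divided by combined count. word_bounded_model tries every run of consecutive whole words, and single_word_model is a hypothetical stand-in for an approach that only ever compares against one word at a time.

```python
import re


def split_words(text):
    # Assumption: words are runs of alphanumeric characters, lowercased.
    return re.findall(r"[^\W_]+", text.lower())


def trigram_set(word_list):
    # Pad each word with two leading spaces and one trailing space.
    result = set()
    for word in word_list:
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_bounded_model(query, text):
    """Best score over every run of consecutive whole words in text."""
    q = trigram_set(split_words(query))
    ws = split_words(text)
    return max(
        (jaccard(q, trigram_set(ws[i:j]))
         for i in range(len(ws)) for j in range(i + 1, len(ws) + 1)),
        default=0.0,
    )


def single_word_model(query, text):
    """Best score over individual words only (hypothetical comparison)."""
    q = trigram_set(split_words(query))
    return max((jaccard(q, trigram_set([w])) for w in split_words(text)),
               default=0.0)


for t in ["message", "message s", "message sag"]:
    print(t, round(word_bounded_model("sage", t), 6),
          round(single_word_model("sage", t), 6))
# message 0.3 0.3
# message s 0.363636 0.3
# message sag 0.5 0.5
```

In the 'message s' row the two-word run shares 4 trigrams with 'sage' out of 11 distinct ones, giving 4/11, whereas the best single word ('message') gives 3/10. The other rows come out the same either way.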

I also wonder whether word_similarity() and subset_similarity() should share the same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(), so it probably makes sense to give them separate threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Attachment

pg-trgm-word-subset-similarity-1.patch
