Home > mailing lists

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

From	Alexander Korotkov
Subject	Re: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date	October 31, 2017 00:08:21
Msg-id	CAPpHfdtz4a9dTxp-Fpem8uBdSBuOQV_i2xuFyL1tG6zM6nAkpg@mail.gmail.com Whole thread Raw
In response to	Re: [BUGS] pg_trgm word_similarity inconsistencies or bug (Arthur Zakirov <a.zakirov@postgrespro.ru>)
Responses	Re: [BUGS] pg_trgm word_similarity inconsistencies or bug (Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com>)
List	pgsql-bugs

Tree view

On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
> Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>
> But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
>
> Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>

Interesting. An klin's answer from stackoverflow.com is right.

The initial example can be reduced to the next:

=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1

It computes maximum similarity using closest trigrams not considering order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.

Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'

Maybe the problem should be solved by considering 'sage' trigrams
initial order.

We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.

Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for *part* of word. And we already have users adopt this feature.

So, I see following solution:

1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,

2) Document both cases of word_similarity() behavior.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

pgsql-bugs by date:

From: konst583@gmail.com
Date: 30 October 2017, 18:53:45
Subject: [BUGS] BUG #14879: Bug connected with table structure modification andtrigger function query plan invalidation

From: miha@rthand.com
Date: 31 October 2017, 00:34:57
Subject: [BUGS] BUG #14880: pg_dump doesn't run or crashes immediately

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

Previous

Next