On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
> Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>
> But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather than doing a word by word comparison.
>
> Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>
It computes the maximum similarity using the closest trigrams, without considering the order of the 'sage' trigrams. It determines that all trigrams from 'sage' match trigrams from 'age sag'.
Initial order of 'age sag' trigrams:
'  a', ' ag', 'age', 'ge ', '  s', ' sa', 'sag', 'ag '
              ^                           ^
              |from                       |to
Sorted 'sage' trigrams (all of them occurred within 'age sag' trigrams continuously):
'  s', ' sa', 'age', 'ge ', 'sag'
Maybe the problem should be solved by considering the initial order of the 'sage' trigrams.
We search for a continuous extent of the second string's trigrams (in their original order) which has the best similarity with the first string's trigrams.
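
To make the extent search concrete, here is a rough Python sketch of the behavior described above. It is not the pg_trgm code: the helper names (trigrams, best_extent_similarity) are made up for illustration, the search is brute force over all extents, and the similarity is assumed to be a plain shared/union ratio over trigram sets.

```python
import re

def trigrams(text):
    """Trigrams in original order; each word is padded with two leading
    spaces and one trailing space (the padding pg_trgm uses)."""
    out = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        out.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

def best_extent_similarity(query, text):
    """Best similarity between the query's trigram set and any continuous
    extent of the text's ordered trigrams (brute force, assumed metric)."""
    q = set(trigrams(query))
    t = trigrams(text)
    best = 0.0
    for lo in range(len(t)):
        for hi in range(lo, len(t)):
            extent = set(t[lo:hi + 1])
            # Assumed metric: shared trigrams divided by union size.
            best = max(best, len(q & extent) / len(q | extent))
    return best

print(trigrams("age sag"))
print(best_extent_similarity("sage", "age sag"))
```

In this sketch the extent from 'age' to 'sag' contains exactly the five 'sage' trigrams, so the result is 1.0 even though 'sage' is not a word of the second string.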
A possible solution could be to force the boundaries of this extent to be at word boundaries. However, it would then become less convenient to search for *part* of a word. And we already have users who have adopted this feature.
So, I see the following solution:
1) Define a GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,
2) Document both cases of word_similarity() behavior.
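
For illustration only, the two behaviors could be compared with a sketch like the one below. The strict_word_bounds parameter is a hypothetical stand-in for the proposed GUC (no name has been chosen), the function name is made up, and the shared/union ratio over trigram sets is an assumed metric rather than the pg_trgm implementation.

```python
import re

def word_trigrams(word):
    # Pad with two leading spaces and one trailing space, as pg_trgm does.
    padded = "  " + word + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_similarity_sketch(query, text, strict_word_bounds=False):
    """Best extent similarity; with strict_word_bounds (hypothetical flag),
    an extent may only start and end at word boundaries."""
    q = set()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        q.update(word_trigrams(word))

    t, starts, ends = [], set(), set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        starts.add(len(t))       # index of the word's first trigram
        t.extend(word_trigrams(word))
        ends.add(len(t) - 1)     # index of the word's last trigram

    best = 0.0
    for lo in range(len(t)):
        if strict_word_bounds and lo not in starts:
            continue
        for hi in range(lo, len(t)):
            if strict_word_bounds and hi not in ends:
                continue
            extent = set(t[lo:hi + 1])
            # Assumed metric: shared trigrams divided by union size.
            best = max(best, len(q & extent) / len(q | extent))
    return best

print(word_similarity_sketch("sage", "age sag"))
print(word_similarity_sketch("sage", "age sag", strict_word_bounds=True))
print(word_similarity_sketch("sag", "sage"))
print(word_similarity_sketch("sag", "sage", strict_word_bounds=True))
```

Under these assumptions 'sage' against 'age sag' drops from 1.0 to 0.625 when boundaries are forced, and the partial-word search 'sag' against 'sage' drops from 0.75 to 0.5, which is the trade-off mentioned above.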