On 10.10.2013 15:03, Fujii Masao wrote:
> Hi,
>
> The behavior of pg_trgm's similarity function seems strange. Is this
> intentional?
>
> I was thinking that the following three calls of the similarity function return
> the same number because the second argument is just the three characters
> contained in the first argument in every calls.
>
> =# SELECT similarity('12345', '123');
> =# SELECT similarity('12345', '234');
> =# SELECT similarity('12345', '345');
>
> But that's not true. Each returns the different number.
>
> =# SELECT similarity('12345', '123');
> similarity
> ------------
> 0.428571
> (1 row)
>
> =# SELECT similarity('12345', '234');
> similarity
> ------------
> 0.111111
> (1 row)
>
> =# SELECT similarity('12345', '345');
> similarity
> ------------
> 0.25
> (1 row)
>
> This happens because, for example, similarity('12345', '123') returns
> the similarity number of '**12345*' and '**123*' (* means the blank character),
> NOT '12345' and '123'. IOW, two and one blank characters are added into
> the heading and tailing of each argument, respectively. I wonder why
> pg_trgm's similarity function works in this way. We should change this
> so that no blank characters are added into the arguments?
Well, you could also argue that "111111" and "222222" are quite similar,
even though pg_trgm's similarity will not think so. It comes down to the
definition of similarity, and how well that definition matches your
intuition.
FWIW, it feels right to me that a match in the beginning of a word is
worth more than one in the middle of a string. -1 on changing that.
- Heikki