Re: strange behavior of pg_trgm's similarity function - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: strange behavior of pg_trgm's similarity function
Date
Msg-id 525699BE.9040508@vmware.com
Whole thread Raw
In response to strange behavior of pg_trgm's similarity function  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: strange behavior of pg_trgm's similarity function  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-hackers
On 10.10.2013 15:03, Fujii Masao wrote:
> Hi,
>
> The behavior of pg_trgm's similarity function seems strange. Is this
> intentional?
>
> I was thinking that the following three calls of the similarity function return
> the same number because the second argument is just the three characters
> contained in the first argument in every calls.
>
> =# SELECT similarity('12345', '123');
> =# SELECT similarity('12345', '234');
> =# SELECT similarity('12345', '345');
>
> But that's not true. Each returns the different number.
>
> =# SELECT similarity('12345', '123');
>   similarity
> ------------
>     0.428571
> (1 row)
>
> =# SELECT similarity('12345', '234');
>   similarity
> ------------
>     0.111111
> (1 row)
>
> =# SELECT similarity('12345', '345');
>   similarity
> ------------
>         0.25
> (1 row)
>
> This happens because, for example, similarity('12345', '123') returns
> the similarity number of '**12345*' and '**123*' (* means the blank character),
> NOT '12345' and '123'. IOW, two and one blank characters are added into
> the heading and tailing of each argument, respectively. I wonder why
> pg_trgm's similarity function works in this way. We should change this
> so that no blank characters are added into the arguments?

Well, you could also argue that "111111" and "222222" are quite similar, 
even though pg_trgm's similarity will not think so. It comes down to the 
definition of similarity, and how well that definition matches your 
intuition.

FWIW, it feels right to me that a match in the beginning of a word is 
worth more than one in the middle of a string. -1 on changing that.

- Heikki



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: strange behavior of pg_trgm's similarity function
Next
From: Stephen Frost
Date:
Subject: Re: Auto-tuning work_mem and maintenance_work_mem