Thread: strange behavior of pg_trgm's similarity function
Hi, The behavior of pg_trgm's similarity function seems strange. Is this intentional? I was thinking that the following three calls of the similarity function return the same number because the second argument is just the three characters contained in the first argument in every calls. =# SELECT similarity('12345', '123'); =# SELECT similarity('12345', '234'); =# SELECT similarity('12345', '345'); But that's not true. Each returns the different number. =# SELECT similarity('12345', '123');similarity ------------ 0.428571 (1 row) =# SELECT similarity('12345', '234');similarity ------------ 0.111111 (1 row) =# SELECT similarity('12345', '345');similarity ------------ 0.25 (1 row) This happens because, for example, similarity('12345', '123') returns the similarity number of '**12345*' and '**123*' (* means the blank character), NOT '12345' and '123'. IOW, two and one blank characters are added into the heading and tailing of each argument, respectively. I wonder why pg_trgm's similarity function works in this way. We should change this so that no blank characters are added into the arguments? Regards, -- Fujii Masao
On 10.10.2013 15:03, Fujii Masao wrote: > Hi, > > The behavior of pg_trgm's similarity function seems strange. Is this > intentional? > > I was thinking that the following three calls of the similarity function return > the same number because the second argument is just the three characters > contained in the first argument in every calls. > > =# SELECT similarity('12345', '123'); > =# SELECT similarity('12345', '234'); > =# SELECT similarity('12345', '345'); > > But that's not true. Each returns the different number. > > =# SELECT similarity('12345', '123'); > similarity > ------------ > 0.428571 > (1 row) > > =# SELECT similarity('12345', '234'); > similarity > ------------ > 0.111111 > (1 row) > > =# SELECT similarity('12345', '345'); > similarity > ------------ > 0.25 > (1 row) > > This happens because, for example, similarity('12345', '123') returns > the similarity number of '**12345*' and '**123*' (* means the blank character), > NOT '12345' and '123'. IOW, two and one blank characters are added into > the heading and tailing of each argument, respectively. I wonder why > pg_trgm's similarity function works in this way. We should change this > so that no blank characters are added into the arguments? Well, you could also argue that "111111" and "222222" are quite similar, even though pg_trgm's similarity will not think so. It comes down to the definition of similarity, and how well that definition matches your intuition. FWIW, it feels right to me that a match in the beginning of a word is worth more than one in the middle of a string. -1 on changing that. - Heikki
On Thu, Oct 10, 2013 at 7:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 10.10.2013 15:03, Fujii Masao wrote: >> >> Hi, >> >> The behavior of pg_trgm's similarity function seems strange. Is this >> intentional? >> >> I was thinking that the following three calls of the similarity function >> return >> the same number because the second argument is just the three characters >> contained in the first argument in every calls. >> >> =# SELECT similarity('12345', '123'); >> =# SELECT similarity('12345', '234'); >> =# SELECT similarity('12345', '345'); >> >> But that's not true. Each returns the different number. >> >> =# SELECT similarity('12345', '123'); >> similarity >> ------------ >> 0.428571 >> (1 row) >> >> =# SELECT similarity('12345', '234'); >> similarity >> ------------ >> 0.111111 >> (1 row) >> >> =# SELECT similarity('12345', '345'); >> similarity >> ------------ >> 0.25 >> (1 row) >> >> This happens because, for example, similarity('12345', '123') returns >> the similarity number of '**12345*' and '**123*' (* means the blank >> character), >> NOT '12345' and '123'. IOW, two and one blank characters are added into >> the heading and tailing of each argument, respectively. I wonder why >> pg_trgm's similarity function works in this way. We should change this >> so that no blank characters are added into the arguments? > > > Well, you could also argue that "111111" and "222222" are quite similar, > even though pg_trgm's similarity will not think so. It comes down to the > definition of similarity, and how well that definition matches your > intuition. > > FWIW, it feels right to me that a match in the beginning of a word is worth > more than one in the middle of a string. -1 on changing that. I'm not so sure that the assumption that leading trigrams should effectively weight > 3x is a good one to build into the library. However, the behavior is clearly documented and can't be changed. I think you'd need to improvise an alternate set of "trigram ops" if you wanted to rig an alternate matching behavior. merlin
On Thu, Oct 10, 2013 at 11:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Thu, Oct 10, 2013 at 7:12 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> On 10.10.2013 15:03, Fujii Masao wrote: >>> >>> Hi, >>> >>> The behavior of pg_trgm's similarity function seems strange. Is this >>> intentional? >>> >>> I was thinking that the following three calls of the similarity function >>> return >>> the same number because the second argument is just the three characters >>> contained in the first argument in every calls. >>> >>> =# SELECT similarity('12345', '123'); >>> =# SELECT similarity('12345', '234'); >>> =# SELECT similarity('12345', '345'); >>> >>> But that's not true. Each returns the different number. >>> >>> =# SELECT similarity('12345', '123'); >>> similarity >>> ------------ >>> 0.428571 >>> (1 row) >>> >>> =# SELECT similarity('12345', '234'); >>> similarity >>> ------------ >>> 0.111111 >>> (1 row) >>> >>> =# SELECT similarity('12345', '345'); >>> similarity >>> ------------ >>> 0.25 >>> (1 row) >>> >>> This happens because, for example, similarity('12345', '123') returns >>> the similarity number of '**12345*' and '**123*' (* means the blank >>> character), >>> NOT '12345' and '123'. IOW, two and one blank characters are added into >>> the heading and tailing of each argument, respectively. I wonder why >>> pg_trgm's similarity function works in this way. We should change this >>> so that no blank characters are added into the arguments? >> >> >> Well, you could also argue that "111111" and "222222" are quite similar, >> even though pg_trgm's similarity will not think so. It comes down to the >> definition of similarity, and how well that definition matches your >> intuition. >> >> FWIW, it feels right to me that a match in the beginning of a word is worth >> more than one in the middle of a string. -1 on changing that. Okay, understood. > I'm not so sure that the assumption that leading trigrams should > effectively weight > 3x is a good one to build into the library. > However, the behavior is clearly documented and can't be changed. I > think you'd need to improvise an alternate set of "trigram ops" if you > wanted to rig an alternate matching behavior. Yeah, this makes sense. Regards, -- Fujii Masao