Home > mailing lists

Re: Fuzzy substring searching with the pg_trgm extension - Mailing list pgsql-hackers

From	Jeff Janes
Subject	Re: Fuzzy substring searching with the pg_trgm extension
Date	February 26, 2016 01:00:58
Msg-id	CAMkU=1zREk5a4g=d134q6J0VnV4H9s4YZHFA0owLbaD3ghHizQ@mail.gmail.com Whole thread Raw
In response to	Re: Fuzzy substring searching with the pg_trgm extension (Teodor Sigaev <teodor@sigaev.ru>)
Responses	Re: Fuzzy substring searching with the pg_trgm extension
List	pgsql-hackers

Tree view

On Fri, Jan 29, 2016 at 6:15 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> The behavior of this function is surprising to me.
>>
>> select substring_similarity('dog' ,  'hotdogpound') ;
>>
>>   substring_similarity
>> ----------------------
>>                   0.25
>>
> Substring search was desined to search similar word in string:
> contrib_regression=# select substring_similarity('dog' ,  'hot dogpound') ;
>  substring_similarity
> ----------------------
>                  0.75
>
> contrib_regression=# select substring_similarity('dog' ,  'hot dog pound') ;
>  substring_similarity
> ----------------------
>                     1
> It seems to me that users search words in long string. But I'm agree that
> more detailed explanation needed and, may be, we need to change feature name
> to fuzzywordsearch or something else, I can't imagine how.

If we implement my proposed behavior, and I wanted the existing
behavior instead, I could just do:

select substring_similarity(' dog ' ,  'hotdogpound');

But with the existing implementation, there isn't anything I could to
switch to the one I want instead. So treating is purely as a substring
is more flexible than treating it as a word match.

The reason I like the option of not treating word boundaries as
special in this case is that often in scientific vocabulary, and in
catalog part numbers, people are pretty inconsistent about whether
they included spaces.  "HEK 293", "HEK293", and "HEK-293" could be all
the same thing.  So I like to strip out spaces and punctuation on both
sides of operator.  Of course I can't do that if there are invisible
un-removable spaces on the substring side.

But, It doesn't sound like I am going to win that debate.  Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.

Cheers,

Jeff

pgsql-hackers by date:

From: Vitaly Burovoy
Date: 26 February 2016, 00:44:32
Subject: Re: [PATCH] Integer overflow in timestamp[tz]_part() and date/time boundaries check

From: Michael Paquier
Date: 26 February 2016, 01:10:09
Subject: Re: Password identifiers, protocol aging and SCRAM protocol

Re: Fuzzy substring searching with the pg_trgm extension - Mailing list pgsql-hackers

Previous

Next