Re: Fuzzy substring searching with the pg_trgm extension - Mailing list pgsql-hackers

From Artur Zakirov
Subject Re: Fuzzy substring searching with the pg_trgm extension
Msg-id 56AB8C2F.2080609@postgrespro.ru
In response to Re: Fuzzy substring searching with the pg_trgm extension  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: Fuzzy substring searching with the pg_trgm extension  (Artur Zakirov <a.zakirov@postgrespro.ru>)
List pgsql-hackers
On 29.01.2016 18:39, Alvaro Herrera wrote:
> Teodor Sigaev wrote:
>>> The behavior of this function is surprising to me.
>>>
>>> select substring_similarity('dog' ,  'hotdogpound') ;
>>>
>>>   substring_similarity
>>> ----------------------
>>>                   0.25
>>>
>> Substring search was designed to search for a similar word in a string:
>> contrib_regression=# select substring_similarity('dog' ,  'hot dogpound') ;
>>   substring_similarity
>> ----------------------
>>                   0.75
>>
>> contrib_regression=# select substring_similarity('dog' ,  'hot dog pound') ;
>>   substring_similarity
>> ----------------------
>>                      1
>
> Hmm, this behavior looks too much like magic to me.  I mean, a substring
> is a substring -- why are we treating the space as a special character
> here?
>

I think I can rename this function to subword_similarity() and correct
the documentation.

The current behavior is designed to find the most similar word in a text.
For example, if we searched for just a substring (not a word), we would
get the following result:

select substring_similarity('dog', 'dogmatist');
 substring_similarity
----------------------
                    1
(1 row)

But I think this is wrong: they are completely different words.
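
To illustrate why the space matters in the examples quoted above, here is a
small sketch using show_trgm() from pg_trgm. It assumes the extension is
available in the current database. The counting query at the end is only a
simplified illustration of mine, not the algorithm from the patch. pg_trgm
pads every word with two spaces in front and one behind, so 'dog' yields
four trigrams, and the padded ones can only match at a word boundary in
the text.

```sql
-- Assumption: pg_trgm is installable in this database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigrams of the query word, including the space-padded ones.
SELECT show_trgm('dog');

-- Trigrams of the three texts from the quoted examples.
SELECT show_trgm('hotdogpound');
SELECT show_trgm('hot dogpound');
SELECT show_trgm('hot dog pound');

-- Simplified illustration (not the patch's algorithm): count how many
-- of the query's trigrams also occur anywhere in each text.
SELECT t.txt,
       (SELECT count(*)
          FROM unnest(show_trgm('dog')) AS q(trgm)
         WHERE q.trgm = ANY (show_trgm(t.txt))) AS shared_trigrams
  FROM (VALUES ('hotdogpound'), ('hot dogpound'), ('hot dog pound')) AS t(txt);
```

If I count the trigrams correctly, the shared counts are 1, 3 and 4 out of
the four trigrams of 'dog', which lines up with the 0.25, 0.75 and 1
results quoted above.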

Perhaps another function should be added for searching for a similar
substring (not a word) in a text?

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


