Re: Fuzzy substring searching with the pg_trgm extension - Mailing list pgsql-hackers

From	Artur Zakirov
Subject	Re: Fuzzy substring searching with the pg_trgm extension
Date	January 29, 2016 15:58:44
Msg-id	56AB8C2F.2080609@postgrespro.ru
In response to	Re: Fuzzy substring searching with the pg_trgm extension (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses	Re: Fuzzy substring searching with the pg_trgm extension
List	pgsql-hackers

On 29.01.2016 18:39, Alvaro Herrera wrote:
> Teodor Sigaev wrote:
>>> The behavior of this function is surprising to me.
>>>
>>> select substring_similarity('dog' ,  'hotdogpound') ;
>>>
>>>   substring_similarity
>>> ----------------------
>>>                   0.25
>>>
>> Substring search was designed to search similar word in string:
>> contrib_regression=# select substring_similarity('dog' ,  'hot dogpound') ;
>>   substring_similarity
>> ----------------------
>>                   0.75
>>
>> contrib_regression=# select substring_similarity('dog' ,  'hot dog pound') ;
>>   substring_similarity
>> ----------------------
>>                      1
>
> Hmm, this behavior looks too much like magic to me.  I mean, a substring
> is a substring -- why are we treating the space as a special character
> here?
>

I think I can rename this function to subword_similarity() and correct
the documentation.

The current behavior is designed to find the most similar word in a text.
For example, if we searched for just a substring (not a word), we would
get the following result:

select substring_similarity('dog', 'dogmatist');
 substring_similarity
----------------------
                    1
(1 row)

But I think this is wrong: they are completely different words.

To search for a similar substring (not a word) in a text, maybe another
function should be added?
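
As an illustration of why the space matters in the results quoted above,
here is a minimal Python sketch of word-oriented trigram matching. It is
not the pg_trgm C code. The helper names are hypothetical, the padding
rule (two spaces before and one after each word) follows the pg_trgm
documentation, and the brute-force search over contiguous trigram windows
is an assumption chosen because it reproduces the numbers in this thread.

```python
import re


def trigrams(text):
    """Return the ordered trigrams of text, padding each word with
    two leading spaces and one trailing space."""
    result = []
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def subword_similarity_sketch(needle, haystack):
    """Best Jaccard similarity between the needle's trigram set and
    any contiguous window of the haystack's trigrams (brute force)."""
    needle_set = set(trigrams(needle))
    hay = trigrams(haystack)
    best = 0.0
    for start in range(len(hay)):
        for end in range(start + 1, len(hay) + 1):
            window = set(hay[start:end])
            shared = len(needle_set & window)
            union = len(needle_set | window)
            best = max(best, shared / union)
    return best


print(subword_similarity_sketch("dog", "hotdogpound"))    # 0.25
print(subword_similarity_sketch("dog", "hot dogpound"))   # 0.75
print(subword_similarity_sketch("dog", "hot dog pound"))  # 1.0
```

In this sketch 'dog' yields the trigrams "  d", " do", "dog" and "og ".
Inside 'hotdogpound' only "dog" matches (1 of 4). A space before 'dog'
adds the two leading-boundary trigrams (3 of 4), and a space after it
adds "og " (4 of 4). The padded boundary trigrams are what make the
match word-oriented rather than a plain substring match.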

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
