Thread: Can pg_trgm handle non-alphanumeric characters?

Can pg_trgm handle non-alphanumeric characters?

From
"MauMau"
Date:
Hello,

This question may be appropriate for pgsql-general, but let me ask here 
because the only relevant discussion seems to have been done on 
pgsql-hackers:

http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php

Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM 
in contrib/pg_trgm/trgm.h? If no, what kind of problems would happen?

Regards
MauMau




Re: Can pg_trgm handle non-alphanumeric characters?

From
Fujii Masao
Date:
On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote:
> Hello,
>
> This question may be appropriate for pgsql-general, but let me ask here
> because the only relevant discussion seems to have been done on
> pgsql-hackers:
>
> http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php
>
> Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM
> in contrib/pg_trgm/trgm.h?

Yes unless I'm missing something.

Regards,

-- 
Fujii Masao


Re: Can pg_trgm handle non-alphanumeric characters?

From
"MauMau"
Date:
From: "Fujii Masao" <masao.fujii@gmail.com>
> On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote:
>> This question may be appropriate for pgsql-general, but let me ask here
>> because the only relevant discussion seems to have been done on
>> pgsql-hackers:
>>
>> http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php
>>
>> Can pg_trgm support non-alphanumeric characters by undefining 
>> KEEPONLYALNUM
>> in contrib/pg_trgm/trgm.h?
>
> Yes unless I'm missing something.

Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would 
it cause any problems? If no, I wish that, because it eliminates the need to 
do the removal every time the users applies minor releases.


Regards
MauMau




Re: Can pg_trgm handle non-alphanumeric characters?

From
Euler Taveira
Date:
On 09-05-2012 19:17, MauMau wrote:
> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
> cause any problems? If no, I wish that, because it eliminates the need to do
> the removal every time the users applies minor releases.
> 
If you do so, you'll break minor versions. IMHO the default is the desirable
behavior for almost all use cases (you are the first one that complain about
it). Maybe in the future, we should be able to flip this flag without
rebuilding binaries.


--   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/  PostgreSQL: Consultoria, Desenvolvimento,
Suporte24x7 e Treinamento
 


Re: Can pg_trgm handle non-alphanumeric characters?

From
Fujii Masao
Date:
On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote:
> On 09-05-2012 19:17, MauMau wrote:
>> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it
>> cause any problems? If no, I wish that, because it eliminates the need to do
>> the removal every time the users applies minor releases.
>>
> If you do so, you'll break minor versions.

Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
so that should be proposed during major version development cycle.

> IMHO the default is the desirable
> behavior for almost all use cases (you are the first one that complain about
> it).

Really? I was thinking non-English users (including me) basicaly would not be
satisfied with the default because they cannot use pg_trgm for N-gram full text
search of non-English text. Though I agree some users would prefer the default.

> Maybe in the future, we should be able to flip this flag without
> rebuilding binaries.

Agreed.

Regards,

-- 
Fujii Masao


Re: Can pg_trgm handle non-alphanumeric characters?

From
"MauMau"
Date:
From: "Fujii Masao" <masao.fujii@gmail.com>
> On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote:
>> On 09-05-2012 19:17, MauMau wrote:
>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? 
>>> Would it
>>> cause any problems? If no, I wish that, because it eliminates the need 
>>> to do
>>> the removal every time the users applies minor releases.
>>>
>> If you do so, you'll break minor versions.
>
> Right. And removing KEEPONLYALNUM is a feature change rather than bug fix,
> so that should be proposed during major version development cycle.

For information, what kind of breakage would occur? Is it performance 
degradation, extra index storage consumption, or undesirable query results? 
I imagined removing KEEPONLYALNUM would just accept non-alphanumeric 
characters and cause no harm to those who use only alphanumeric characters.

Regards
MauMau



Re: Can pg_trgm handle non-alphanumeric characters?

From
"Kevin Grittner"
Date:
"MauMau" <maumau307@gmail.com> wrote:
>>> On 09-05-2012 19:17, MauMau wrote:
>>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in
>>>> 9.1.4? Would it cause any problems?
Yes, it will cause problems.
> For information, what kind of breakage would occur?
> I imagined removing KEEPONLYALNUM would just accept
> non-alphanumeric characters and cause no harm to those who use
> only alphanumeric characters.
This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters.  It
would make similarity (and distance) values less useful for our
current name searches using it.  To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');                        show_trgm                         
-----------------------------------------------------------{"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');                     show_trgm                      
-----------------------------------------------------{"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');similarity 
------------  0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');similarity 
------------    0.3125
(1 row)
So making the proposed change unconditionally could indeed hurt
current users of the technique.  On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.
-Kevin


Re: Can pg_trgm handle non-alphanumeric characters?

From
"MauMau"
Date:
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> "MauMau" <maumau307@gmail.com> wrote:
>> For information, what kind of breakage would occur?
>
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
>
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters.  It
> would make similarity (and distance) values less useful for our
> current name searches using it.  To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
>
> test=# select show_trgm('smith,john');
>                         show_trgm
> -----------------------------------------------------------
> {"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
>
> test=# select show_trgm('smith8john');
>                      show_trgm
> -----------------------------------------------------
> {"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
>
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
>   0.615385
> (1 row)
>
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
>     0.3125
> (1 row)
>
> So making the proposed change unconditionally could indeed hurt
> current users of the technique.  On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thanks for your explanation. Although I haven't understood it well yet, I'll 
consider what you taught. And I'll consider if the tentative measure of 
removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm 
against Japanese text.

Regards
MauMau



Re: Can pg_trgm handle non-alphanumeric characters?

From
Fujii Masao
Date:
On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:
> Thanks for your explanation. Although I haven't understood it well yet, I'll
> consider what you taught. And I'll consider if the tentative measure of
> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
> against Japanese text.

In Japanese, it's common to do a text search with two characters keyword.
But since pg_trgm is 3-gram, you basically would not be able to use index
for such text search. So you might need something like pg_bigm or pg_unigm
for Japanese text search.

Regards,

-- 
Fujii Masao


Re: Can pg_trgm handle non-alphanumeric characters?

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:
>> Thanks for your explanation. Although I haven't understood it well yet, I'll
>> consider what you taught. And I'll consider if the tentative measure of
>> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
>> against Japanese text.

> In Japanese, it's common to do a text search with two characters keyword.
> But since pg_trgm is 3-gram, you basically would not be able to use index
> for such text search. So you might need something like pg_bigm or pg_unigm
> for Japanese text search.

I believe the trigrams are three *bytes* not three characters.  So a
couple of kanji should work just fine for this.
        regards, tom lane


Re: Can pg_trgm handle non-alphanumeric characters?

From
"MauMau"
Date:
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> "MauMau" <maumau307@gmail.com> wrote:
>> For information, what kind of breakage would occur?
>
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
>
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters.  It
> would make similarity (and distance) values less useful for our
> current name searches using it.  To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
>
> test=# select show_trgm('smith,john');
>                         show_trgm
> -----------------------------------------------------------
> {"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
>
> test=# select show_trgm('smith8john');
>                      show_trgm
> -----------------------------------------------------
> {"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
>
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
>   0.615385
> (1 row)
>
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
>     0.3125
> (1 row)
>
> So making the proposed change unconditionally could indeed hurt
> current users of the technique.  On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thank you for your concise examples. I probably got it.

From your examples, I thought KEEPONLYALNUM controls whether 
non-alphanumeric characters are included in trigrams, though I haven't read 
the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams 
unnecessary for users who handle only alphanumeric text. That would lead to 
undesirable query results.

Then, I wonder what would be the ideal specification...to add 
alphanumeric/non-alphanumeric boolean switch to similarity() function, add 
non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric 
version of operator classes (ex. gin_allchars_trgm_ops)? At least, I 
understood the fix is not appropriate for minor releases.

Regards
MauMau



Re: Can pg_trgm handle non-alphanumeric characters?

From
Fujii Masao
Date:
On Fri, May 11, 2012 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote:
>>> Thanks for your explanation. Although I haven't understood it well yet, I'll
>>> consider what you taught. And I'll consider if the tentative measure of
>>> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm
>>> against Japanese text.
>
>> In Japanese, it's common to do a text search with two characters keyword.
>> But since pg_trgm is 3-gram, you basically would not be able to use index
>> for such text search. So you might need something like pg_bigm or pg_unigm
>> for Japanese text search.

Even if an index can be used for two characters text search, bitmap index scan
picks up all rows, so it's too slow.

> I believe the trigrams are three *bytes* not three characters.  So a
> couple of kanji should work just fine for this.

Really? As far as I read the code of pg_trgm, the trigram is three characters
and its CRC32 is used as an index key if its size is more than three bytes.

Regards,

--
Fujii Masao