Re: Can pg_trgm handle non-alphanumeric characters? - Mailing list pgsql-hackers

From MauMau
Subject Re: Can pg_trgm handle non-alphanumeric characters?
Date
Msg-id 96F7F5F69579425F9B42EAAA8B19CD14@maumau
Whole thread Raw
In response to Re: Can pg_trgm handle non-alphanumeric characters?  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> "MauMau" <maumau307@gmail.com> wrote:
>> For information, what kind of breakage would occur?
>
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
>
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters.  It
> would make similarity (and distance) values less useful for our
> current name searches using it.  To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
>
> test=# select show_trgm('smith,john');
>                         show_trgm
> -----------------------------------------------------------
> {"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
>
> test=# select show_trgm('smith8john');
>                      show_trgm
> -----------------------------------------------------
> {"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
>
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
>   0.615385
> (1 row)
>
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
>     0.3125
> (1 row)
>
> So making the proposed change unconditionally could indeed hurt
> current users of the technique.  On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thank you for your concise examples. I probably got it.

From your examples, I thought KEEPONLYALNUM controls whether 
non-alphanumeric characters are included in trigrams, though I haven't read 
the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams 
unnecessary for users who handle only alphanumeric text. That would lead to 
undesirable query results.

Then, I wonder what would be the ideal specification...to add 
alphanumeric/non-alphanumeric boolean switch to similarity() function, add 
non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric 
version of operator classes (ex. gin_allchars_trgm_ops)? At least, I 
understood the fix is not appropriate for minor releases.

Regards
MauMau



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Draft release notes complete
Next
From: Tom Lane
Date:
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows