Thread: pg_trgm

pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 00:08:18

Hi,

Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
not, I'm interested in the work.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Teodor Sigaev

Date:

27 May 2010, 07:54:46

> Anyone working on make contrib/pg_trgm mutibyte encoding aware? If
> not, I'm interested in the work.

It's already multibyte safe since 8.4
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 08:53:46

> It's already multibyte safe since 8.4

No, it doesn't.

$ psql test
Pager usage is off.
psql (8.4.4)
Type "help" for help.

test=# select similarity('abc', 'abd');    -- OKsimilarity 
------------  0.333333
(1 row)

test=# select similarity('日本語', '日本後');    -- NGsimilarity 
------------       NaN
(1 row)

test=# select show_trgm('abc');    -- OK       show_trgm        
-------------------------{"  a"," ab",abc,"bc "}
(1 row)

test=# select show_trgm('日本語');    -- NGshow_trgm 
-----------{}
(1 row)

Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Andres Freund

Date:

27 May 2010, 09:18:34

Hi,

On Thursday 27 May 2010 13:53:37 Tatsuo Ishii wrote:
> > It's already multibyte safe since 8.4
>
> No, it doesn't.
> Encoding is EUC_JP, locale is C. Included is the script to reproduce
> the problem.
test=# select show_trgm('日本語');               show_trgm               ---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}(1row) 
Time: 0.443 ms
test=# select similarity('日本語', '日本後'); similarity ------------   0.333333(1 row)
Time: 0.426 ms


Encoding is UTF-8...

Andres

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 09:40:49

> > No, it doesn't.
> > Encoding is EUC_JP, locale is C. Included is the script to reproduce
> > the problem.
> test=# select show_trgm('日本語');
>                 show_trgm               
>  ---------------------------------------
>   {0x8194c0,0x836e53,0x1dc363,0x1e22e9}
>  (1 row)
>  
> Time: 0.443 ms
> test=# select similarity('日本語', '日本後');
>   similarity 
>  ------------
>     0.333333
>  (1 row)
>  
> Time: 0.426 ms
> 
> 
> Encoding is UTF-8...

What is your locale?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Andres Freund

Date:

27 May 2010, 10:07:04

On Thursday 27 May 2010 14:40:41 Tatsuo Ishii wrote:
> > > No, it doesn't.
> > > Encoding is EUC_JP, locale is C. Included is the script to reproduce
> > > the problem.
> > 
> > test=# select show_trgm('日本語');
> > 
> >                 show_trgm
> >  
> >  ---------------------------------------
> >  
> >   {0x8194c0,0x836e53,0x1dc363,0x1e22e9}
> >  
> >  (1 row)
> > 
> > Time: 0.443 ms
> > test=# select similarity('日本語', '日本後');
> > 
> >   similarity
> >  
> >  ------------
> >  
> >     0.333333
> >  
> >  (1 row)
> > 
> > Time: 0.426 ms
> > 
> > 
> > Encoding is UTF-8...
> 
> What is your locale?
It was en_EN.UTF-8. Interesting. With C it fails...

Andres

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 10:26:41

> > What is your locale?
> It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 10:56:00

Tatsuo Ishii <ishii@postgresql.org> writes:
> What is your locale?
>> It was en_EN.UTF-8. Interesting. With C it fails...

> Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation.  pg_trgm uses the locale's definition of "alpha",
"digit", etc.  In C locale only basic ASCII letters and digits will be
recognized as word constituents.
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 11:05:55

> > Yes, pg_trgm seems to have problems with multibyte + C locale.
> 
> It's not a problem, it's just pilot error, or possibly inadequate
> documentation.  pg_trgm uses the locale's definition of "alpha",
> "digit", etc.  In C locale only basic ASCII letters and digits will be
> recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale?  If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 11:15:57

Tatsuo Ishii <ishii@postgresql.org> writes:
>> It's not a problem, it's just pilot error, or possibly inadequate
>> documentation.  pg_trgm uses the locale's definition of "alpha",
>> "digit", etc.  In C locale only basic ASCII letters and digits will be
>> recognized as word constituents.

> That means there is no chance to make pg_trgm work with multibyte + C
> locale?  If so, I will leave pg_trgm as it is and provide private
> patches for those who need the functionality.

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that.  The
core text search functionality behaves the same way.
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 11:20:50

> Exactly what do you consider to be the missing functionality?
> You need a notion of word vs non-word character from somewhere,
> and the locale setting is the standard place to get that.  The
> core text search functionality behaves the same way.

No. Text search works fine with multibyte + C locale.

Anyway locale is completely usesless for finding word vs non-character
an agglutinative language such as Japanese.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 11:24:28

Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Anyway locale is completely usesless for finding word vs non-character
> an agglutinative language such as Japanese.

Well, that doesn't mean that the answer is to use C locale ;-)

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c)    (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c)    (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 11:39:18

> Well, that doesn't mean that the answer is to use C locale ;-)

Of course it's up to user whether to use C locale or not. I just want
pg_trgm work with C locale as well.

> However, you could possibly think about making this bit of code
> more flexible:
> 
> #ifdef KEEPONLYALNUM
> #define iswordchr(c)    (t_isalpha(c) || t_isdigit(c))
> #else
> #define iswordchr(c)    (!t_isspace(c))
> #endif
> 
> Currently it seems to be hard-wired to the first case in standard
> builds.

Yup. Here is the patch in my mind:

*** trgm_op.c~    2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c    2010-05-27 23:38:20.000000000 +0900
***************
*** 59,65 **** }  #ifdef KEEPONLYALNUM
! #define iswordchr(c)    (t_isalpha(c) || t_isdigit(c)) #else #define iswordchr(c)    (!t_isspace(c)) #endif
--- 59,65 ---- }  #ifdef KEEPONLYALNUM
! #define iswordchr(c)    (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c))) #else #define
iswordchr(c)   (!t_isspace(c)) #endif

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 11:53:10

Tatsuo Ishii <ishii@postgresql.org> writes:
> ! #define iswordchr(c)    (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))

This seems entirely arbitrary.  It might "fix" things in your view
but it will break the longstanding behavior for other people.

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.
        regards, tom lane

Re: pg_trgm

From

Greg Stark

Date:

27 May 2010, 12:04:00

On Thu, May 27, 2010 at 3:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think a more appropriate type of fix would be to expose the
> KEEPONLYALNUM option as a GUC, or some other way of letting the
> user decide what he wants.
>

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

-- 
greg

Re: pg_trgm

From

Peter Eisentraut

Date:

27 May 2010, 12:25:01

On tor, 2010-05-27 at 23:20 +0900, Tatsuo Ishii wrote:
> Anyway locale is completely usesless for finding word vs non-character
> an agglutinative language such as Japanese.

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages.  I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 12:46:28

> I don't know about Japanese, but the locale approach works just fine for
> other agglutinative languages.  I would rather suspect that it is the
> trigram approach that might be rather useless for such languages,
> because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes".  But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 12:51:37

> So I think a GUC is broken because pg_tgrm has a index opclasses and
> any indexes built using one setting will be broken if the GUC is
> changed.
> 
> Perhaps we need two sets of functions (which presumably call the same
> implementation with a flag to indicate which definition to use). Then
> you can define an index using one or the other and the meaning would
> be stable.

It's worse. pg_trgm has another compile option "IGNORECASE" which
might affect index opclasses.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Peter Eisentraut

Date:

27 May 2010, 15:01:17

On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
> > I don't know about Japanese, but the locale approach works just fine for
> > other agglutinative languages.  I would rather suspect that it is the
> > trigram approach that might be rather useless for such languages,
> > because you are going to get a lot of similarity hits for the affixes.
> 
> I'm not sure what you mean by "affixes".  But I will explain...
> 
> A Japanese sentence consists of words. Problem is, each word is not
> separated by space (agglutinative). So most text tools such as text
> search need preprocess which finds word boundaries by looking up
> dictionaries (and smart grammer analysis routine). In the process
> "affixes" can be determined and perhaps removed from the target word
> group to be used for text search (note that removing affixes is no
> relevant to locale). Once we get space separated sentence, it can be
> processed by text search or by pg_trgm just same as Engligh. (Note
> that these preprocessing are done outside PostgreSQL world). The
> difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues.  You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use.  E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');similarity
────────────       NaN
(1 row)

Re: pg_trgm

From

Robert Haas

Date:

27 May 2010, 16:00:32

On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
>> > I don't know about Japanese, but the locale approach works just fine for
>> > other agglutinative languages.  I would rather suspect that it is the
>> > trigram approach that might be rather useless for such languages,
>> > because you are going to get a lot of similarity hits for the affixes.
>>
>> I'm not sure what you mean by "affixes".  But I will explain...
>>
>> A Japanese sentence consists of words. Problem is, each word is not
>> separated by space (agglutinative). So most text tools such as text
>> search need preprocess which finds word boundaries by looking up
>> dictionaries (and smart grammer analysis routine). In the process
>> "affixes" can be determined and perhaps removed from the target word
>> group to be used for text search (note that removing affixes is no
>> relevant to locale). Once we get space separated sentence, it can be
>> processed by text search or by pg_trgm just same as Engligh. (Note
>> that these preprocessing are done outside PostgreSQL world). The
>> difference is just the "word" can be consists of non ASCII letters.
>
> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues.  You will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use.  E.g., Russian with locale C and encoding UTF8:
>
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
> \u043D\u044B');
>  similarity
> ────────────
>        NaN
> (1 row)

What I can't help wondering as I'm reading this discussion is -
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search.  So what is full text search
doing differently than pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 20:54:49

> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues.  You will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use.  E.g., Russian with locale C and encoding UTF8:
> 
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
> \u043D\u044B');
>  similarity
> ────────────
>         NaN
> (1 row)

Of course. That's why I started this thread.

With my patch:

test=# select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E\u043D\u044B');similarity 
------------      0.75
(1 row)

Or you could just #undef KEEPONLYALNUM in trgm.h. But I'm not sure
this is the right thing for you.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 20:55:14

> What I can't help wondering as I'm reading this discussion is -
> Tatsuo-san said upthread that he has a problem with pg_trgm that he
> does not have with full text search.  So what is full text search
> doing differently than pg_trgm?

Problem with pg_trgm is, it uses isascii() etc. to recognize a letter,
which will skip any non ASCII range character in C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 21:14:38

Tatsuo Ishii <ishii@postgresql.org> writes:
> Problem with pg_trgm is, it uses isascii() etc. to recognize a letter,
> which will skip any non ASCII range character in C locale.

The only place I see that is in those ISPRINTABLE macros, which are only
used in show_trgm(), which is just a debugging function.  It could stand
to be improved but it doesn't seem exactly critical.
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 21:40:23

> > Problem with pg_trgm is, it uses isascii() etc. to recognize a letter,
> > which will skip any non ASCII range character in C locale.
> 
> The only place I see that is in those ISPRINTABLE macros, which are only
> used in show_trgm(), which is just a debugging function.  It could stand
> to be improved but it doesn't seem exactly critical.

Really?

similarity -> generate_trgm -> find_word -> iswordchr -> t_isalpha -> isalpha

if locale is C and USE_WIDE_UPPER_LOWER defined which is the case in
most modern OSs.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 21:44:51

Tatsuo Ishii <ishii@postgresql.org> writes:
> similarity -> generate_trgm -> find_word -> iswordchr -> t_isalpha -> isalpha

> if locale is C and USE_WIDE_UPPER_LOWER defined which is the case in
> most modern OSs.

Quite.  And *if locale is C then only standard ASCII letters are letters*.
You may not like that but it's not wrong; in fact any other behavior
would be wrong.
        regards, tom lane

Re: pg_trgm

From

Tom Lane

Date:

27 May 2010, 21:51:43

Tatsuo Ishii <ishii@postgresql.org> writes:
> Or you could just #undef KEEPONLYALNUM in trgm.h. But I'm not sure
> this is the right thing for you.

It's not a practical solution for people working with prebuilt Postgres
versions, which is most people.  I don't object to finding a way to
provide a "not-space" behavior instead of an "is-alnum" behavior,
but as noted upthread a GUC isn't the right way.  How do you feel
about a new set of functions with an additional flag argument of
some sort?
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 21:54:09

> Tatsuo Ishii <ishii@postgresql.org> writes:
> > similarity -> generate_trgm -> find_word -> iswordchr -> t_isalpha -> isalpha
> 
> > if locale is C and USE_WIDE_UPPER_LOWER defined which is the case in
> > most modern OSs.
> 
> Quite.  And *if locale is C then only standard ASCII letters are letters*.
> You may not like that but it's not wrong; in fact any other behavior
> would be wrong.

> *if locale is C then only standard ASCII letters are letters*.

That's just the definition/implementaion of isalpha. My point is,
using isalpha is quite correct for pg_trgm or not. Text search, oracle
compat functions and any other string handling functions in PostgreSQL
behave quite different from what pg_trgm does.

The essential question is, are there any reason for pg_trgm to limit
to only ASCII range characters?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 22:04:29

> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues.  You will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use.  E.g., Russian with locale C and encoding UTF8:
> 
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
> \u043D\u044B');
>  similarity
> ────────────
>         NaN
> (1 row)

Wait. This works fine for me with stock pg_trgm. local is C and
encoding is UTF8. What version of PostgreSQL are you using? Mine is
8.4.4.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tatsuo Ishii

Date:

27 May 2010, 22:10:16

> It's not a practical solution for people working with prebuilt Postgres
> versions, which is most people.  I don't object to finding a way to
> provide a "not-space" behavior instead of an "is-alnum" behavior,
> but as noted upthread a GUC isn't the right way.  How do you feel
> about a new set of functions with an additional flag argument of
> some sort?

Let me see how many functions we need to create...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Peter Eisentraut

Date:

28 May 2010, 03:21:24

On fre, 2010-05-28 at 10:04 +0900, Tatsuo Ishii wrote:
> > I think the problem at hand has nothing at all to do with agglutination
> > or CJK-specific issues.  You will get the same problem with other
> > languages *if* you set a locale that does not adequately support the
> > characters in use.  E.g., Russian with locale C and encoding UTF8:
> > 
> > select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
> > \u043D\u044B');
> >  similarity
> > ────────────
> >         NaN
> > (1 row)
> 
> Wait. This works fine for me with stock pg_trgm. local is C and
> encoding is UTF8. What version of PostgreSQL are you using? Mine is
> 8.4.4.

This is in 9.0, because 8.4 doesn't recognize the \u escape syntax.  If
you run this in 8.4, you're just comparing a sequence of ASCII letters
and digits.

Re: pg_trgm

From

Tatsuo Ishii

Date:

29 May 2010, 05:13:40

> > It's not a practical solution for people working with prebuilt Postgres
> > versions, which is most people.  I don't object to finding a way to
> > provide a "not-space" behavior instead of an "is-alnum" behavior,
> > but as noted upthread a GUC isn't the right way.  How do you feel
> > about a new set of functions with an additional flag argument of
> > some sort?
> 
> Let me see how many functions we need to create...

After thinking a little bit more, I think following patch would not
break existing behavior and also adopts mutibyte + C locale case. What
do you think?

*** trgm_op.c~    2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c    2010-05-29 17:07:28.000000000 +0900
***************
*** 59,65 **** }  #ifdef KEEPONLYALNUM
! #define iswordchr(c)    (t_isalpha(c) || t_isdigit(c)) #else #define iswordchr(c)    (!t_isspace(c)) #endif
--- 59,67 ---- }  #ifdef KEEPONLYALNUM
! #define iswordchr(c)    (lc_ctype_is_c()? \
!                                 ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
!                                 (t_isalpha(c) || t_isdigit(c))) #else #define iswordchr(c)    (!t_isspace(c)) #endif

Re: pg_trgm

From

Greg Stark

Date:

29 May 2010, 11:09:44

On Sat, May 29, 2010 at 9:13 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> ! #define iswordchr(c)  (lc_ctype_is_c()? \
> !                                                               ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) ||
t_isdigit(c))): \ 
>

Surely isspace(c) will always be false for non-ascii characters in C locale?

Now it might be sensible to just treat any non-ascii character as a
word-character in addition to alpha and digits, so what might make
sense is
  t_isalpha(c) || t_isdigit(c)) || (lc_ctype_is_c() && *(c)&0x80)

Though I wonder whether it wouldn't be generally more useful to users
to provide the non-space version as an option. I could see that being
useful for people in other circumstances aside from working around
this locale problem.

--
greg

Re: pg_trgm

From

Tom Lane

Date:

29 May 2010, 11:31:21

Tatsuo Ishii <ishii@postgresql.org> writes:
> After thinking a little bit more, I think following patch would not
> break existing behavior and also adopts mutibyte + C locale case. What
> do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable.  You need to provide
a way for the user to control the behavior.  (Once you've done that,
I think it can be just either "alnum" or "!isspace", but maybe some
other behaviors would be interesting.)
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

29 May 2010, 22:53:25

> This is still ignoring the point: arbitrarily changing the module's
> longstanding standard behavior isn't acceptable.  You need to provide
> a way for the user to control the behavior.  (Once you've done that,
> I think it can be just either "alnum" or "!isspace", but maybe some
> other behaviors would be interesting.)

To be honest I don't know what "module's longstanding standard
behavior" should be. It's not documented anywhere. If you mean that is
whatever the current implementation is, then any effort to touch the
module should be prohibited.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Tatsuo Ishii

Date:

29 May 2010, 23:05:59

> > Wait. This works fine for me with stock pg_trgm. local is C and
> > encoding is UTF8. What version of PostgreSQL are you using? Mine is
> > 8.4.4.
> 
> This is in 9.0, because 8.4 doesn't recognize the \u escape syntax.  If
> you run this in 8.4, you're just comparing a sequence of ASCII letters
> and digits.

Hum. Still I prefer 8.4's behavior since anything is better than
returning NaN. It seems 9.0 does not have any escape route for
multibyte+C locale users.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Peter Eisentraut

Date:

30 May 2010, 08:37:09

On sön, 2010-05-30 at 11:05 +0900, Tatsuo Ishii wrote:
> > > Wait. This works fine for me with stock pg_trgm. local is C and
> > > encoding is UTF8. What version of PostgreSQL are you using? Mine is
> > > 8.4.4.
> > 
> > This is in 9.0, because 8.4 doesn't recognize the \u escape syntax.  If
> > you run this in 8.4, you're just comparing a sequence of ASCII letters
> > and digits.
> 
> Hum. Still I prefer 8.4's behavior since anything is better than
> returning NaN. It seems 9.0 does not have any escape route for
> multibyte+C locale users.

I think you are confusing some things here.  The \u escape syntax is for
string literals in general.  The behavior of pg_trgm is still the same
in 8.4 and in 9.0.  It's just easier in 9.0 to write out examples
relevant to the current problem.

Re: pg_trgm

From

Tom Lane

Date:

30 May 2010, 11:42:13

Tatsuo Ishii <ishii@postgresql.org> writes:
>> This is still ignoring the point: arbitrarily changing the module's
>> longstanding standard behavior isn't acceptable.  You need to provide
>> a way for the user to control the behavior.  (Once you've done that,
>> I think it can be just either "alnum" or "!isspace", but maybe some
>> other behaviors would be interesting.)

> To be honest I don't know what "module's longstanding standard
> behavior" should be. It's not documented anywhere.

Well, that's a documentation problem rather than an argument for
changing the code.

> If you mean that is
> whatever the current implementation is, then any effort to touch the
> module should be prohibited.

I don't think it's unreasonable to insist that behavioral changes be
made in an upward compatible fashion ... especially ones that seem as
least as likely to break some current usages as to enable new usages.
        regards, tom lane

Re: pg_trgm

From

Tatsuo Ishii

Date:

30 May 2010, 11:52:44

> > > This is in 9.0, because 8.4 doesn't recognize the \u escape syntax.  If
> > > you run this in 8.4, you're just comparing a sequence of ASCII letters
> > > and digits.
> > 
> > Hum. Still I prefer 8.4's behavior since anything is better than
> > returning NaN. It seems 9.0 does not have any escape route for
> > multibyte+C locale users.
> 
> I think you are confusing some things here. The \u escape syntax is for
> string literals in general.  The behavior of pg_trgm is still the same
> in 8.4 and in 9.0.  It's just easier in 9.0 to write out examples
> relevant to the current problem.

I just wanted to point out from the point of view of users. I do not
object the new \u escape syntax. I think pg_trgm has a problem. But
Tom thinks that it's not a problem. That's the point.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: pg_trgm

From

Greg Stark

Date:

30 May 2010, 13:59:54

On Sun, May 30, 2010 at 3:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't think it's unreasonable to insist that behavioral changes be
> made in an upward compatible fashion ... especially ones that seem as
> least as likely to break some current usages as to enable new usages.

Fwiw I don't think we've traditionally been so tense about contrib
modules. With the advent of extensions that users can easily install
with a single command that might be about to change though.

There seem to be three behaviours on the table here:

1) Status quo -- only alpha and digit characters for the current
locale are considered word elements

2) All characters aside from space characters for the current locale
are considered word elements

3) Alpha and digit characters for the current locale, and for C locale
any non-ascii (high bit set) character is considered a word element

1 -> 3 seems like a pretty safe change considering that anyone using
non-ascii characters in C locale probably isn't using pg_tgrm or they
would be complaining about it already. How big a user-base do we think
pg_tgrm has anyways? How many of those are using it on non-ascii
characters in C locale? And of those how many expect the non-ascii
characters to be considered non-word characters? It doesn't sound like
terribly useful behaviour to me.

Behaviour 2 also seems like it would be useful so providing it as well
is also a perfectly reasonable option. But I agree that 1->2 would be
a user-visible change for basically all users so it would have to be
an additional option.

-- 
greg

Re: pg_trgm

From

Tom Lane

Date:

30 May 2010, 14:51:30

Greg Stark <gsstark@mit.edu> writes:
> There seem to be three behaviours on the table here:

You're neglecting

4) Let the user decide whether he wants pg_trgm to consider word
elements to be "alphanumerics" or "any non-space".

The main problem I have with Tatsuo's patch is that it forecloses any
reasonably upward-compatible extension to a user-selected behavior like
(4).  The current behavior can be extended and is simple to document
(though we've neglected to do so).  But once you've put in this
arbitrary warping of the behavior of C locale, you're going to be at
a dead end for improving it later.
        regards, tom lane