Thread: regular expressions stranges

regular expressions stranges

From

Teodor Sigaev

Date:

23 January 2007, 08:53:49

Regexps behave differently with non-ASCII characters depending on the server
encoding (bug.sql contains non-ASCII characters):

% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
  true
------
  t
(1 row)

  true | true
------+------
  t    | t
(1 row)
% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
  true
------
  f
(1 row)

  true | true
------+------
  f    | t
(1 row)

As far as I can see, that is because isalpha (and the other is* functions),
tolower and toupper are used instead of the isw* and tow* functions. Is there
any reason to use them? If not, I can modify regc_locale.c similarly to the
locale part of tsearch2.
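
A minimal sketch of the effect described above, not the actual regc_locale.c code: a classifier that only consults isalpha() for values that fit in one byte rejects every wider character value. The function name narrow_only_isalpha is hypothetical, and the sketch assumes the wide value of a UTF8 character is its Unicode code point (U+0416 for the Cyrillic letter ZHE), whereas in KOI8-R each Cyrillic letter is a single byte.

```c
#include <ctype.h>
#include <limits.h>

/*
 * Hypothetical byte-range-only classifier: values above UCHAR_MAX are
 * never passed to isalpha(), so they are always reported as non-alpha.
 */
static int
narrow_only_isalpha(unsigned int c)
{
    return c <= UCHAR_MAX && isalpha((unsigned char) c) != 0;
}
```

With a single-byte encoding every character value is at most UCHAR_MAX, so the locale's isalpha() gets a chance to classify it. With a multibyte encoding, any character whose wide value exceeds UCHAR_MAX falls through to "not a letter", whatever the locale says.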



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
set client_encoding='KOI8';

SELECT    '�' ~* '[[:alpha:]]' as "true";
SELECT
        '������' ~* '������' as "true",
        '������' ~* '������' as "true";

Re: regular expressions stranges

From

Tom Lane

Date:

23 January 2007, 11:00:52

Teodor Sigaev <teodor@sigaev.ru> writes:
> As far as I can see, that is because isalpha (and the other is* functions),
> tolower and toupper are used instead of the isw* and tow* functions. Is there
> any reason to use them? If not, I can modify regc_locale.c similarly to the
> locale part of tsearch2.

The regex code is working with pg_wchar strings, which aren't
necessarily the same representation that the OS' wide-char functions
expect.  If we could guarantee compatibility then the above plan
would make sense ...
        regards, tom lane

Re: regular expressions stranges

From

Teodor Sigaev

Date:

23 January 2007, 11:19:32

> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect.  If we could guarantee compatibility then the above plan
> would make sense ...

It seems to me that this is possible for the UTF8 encoding. So the isalpha()
function could be defined as:

static int
pg_wc_isalpha(pg_wchar c)
{
    if (c >= 0 && c <= UCHAR_MAX)
        return isalpha((unsigned char) c);
#ifdef HAVE_WCSTOMBS
    else if (GetDatabaseEncoding() == PG_UTF8)
        return iswalpha((wint_t) c);
#endif
    return 0;
}



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: regular expressions stranges

From

Tom Lane

Date:

23 January 2007, 11:27:05

Teodor Sigaev <teodor@sigaev.ru> writes:
>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect.  If we could guarantee compatibility then the above plan
>> would make sense ...

> It seems to me that this is possible for the UTF8 encoding.

Why?  The one thing that a wchar certainly is not is UTF8.
It might be that the <wctype.h> functions are expecting UTF16 or UTF32,
but we don't know which, and really we can hardly even be sure they're
expecting Unicode at all.
        regards, tom lane
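
The distinction drawn above can be sketched in C. This is only an illustration under stated assumptions, not code from PostgreSQL: utf8_decode, wchar_is_unicode and codepoint_isalpha are hypothetical helpers. The sketch shows that a UTF-8 byte sequence is not itself a wide-character value (it has to be decoded to a code point first), and that the standard macro __STDC_ISO_10646__ is one way a program could check whether the platform documents wchar_t as holding ISO 10646 (Unicode) code points before handing such a value to the <wctype.h> functions.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Decode one UTF-8 sequence of up to three bytes into a code point.
 * Returns the number of bytes consumed, or 0 if the input is malformed
 * or longer than this sketch supports.  No overlong or surrogate checks.
 */
static size_t
utf8_decode(const unsigned char *s, size_t len, unsigned int *cp)
{
    if (len >= 1 && s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
    {
        *cp = ((unsigned int) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
    {
        *cp = ((unsigned int) (s[0] & 0x0F) << 12) |
              ((unsigned int) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* Returns 1 if wchar_t is documented to hold ISO 10646 values, else 0. */
static int
wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Classify a code point.  ASCII is handled directly; anything else goes
 * to iswalpha() only when wchar_t is known to be Unicode.  Returns 1 or 0,
 * or -1 when the platform gives no such guarantee.
 */
static int
codepoint_isalpha(unsigned int cp)
{
    if (cp < 0x80)
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
    if (!wchar_is_unicode())
        return -1;
    return iswalpha((wint_t) cp) != 0;
}
```

Even where the macro is defined, the result of iswalpha() for non-ASCII values still depends on the current LC_CTYPE locale, so the guard only addresses the representation question raised above, not the locale setup.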