Thread: regular expressions stranges

regular expressions stranges

From
Teodor Sigaev
Date:
Regexps behave differently with non-ASCII characters depending on the server
encoding (bug.sql contains non-ASCII characters):

% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
  true
------
  t
(1 row)

  true | true
------+------
  t    | t
(1 row)
% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
  true
------
  f
(1 row)

  true | true
------+------
  f    | t
(1 row)

As far as I can see, this is because of the use of isalpha (and the other is*
functions), tolower and toupper instead of the isw* and tow* functions. Is there
any reason to use them? If not, I can modify regc_locale.c similarly to the
tsearch2 locale code.
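
A minimal sketch of why the two clusters could disagree. It assumes the classifier only consults isalpha() for values that fit in one byte (the same range check that appears in the function proposed later in this thread), that a single-byte encoding such as KOI8-R maps each byte directly to one character value, and that under UTF8 the character value is the Unicode code point. The names sketch_wchar and sketch_narrow_isalpha are hypothetical and do not come from the PostgreSQL source.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical stand-in for PostgreSQL's pg_wchar (an unsigned integer type). */
typedef unsigned int sketch_wchar;

/*
 * Byte-range classifier: any value above UCHAR_MAX is rejected before
 * isalpha() is ever consulted. Returns 1 for "alphabetic", 0 otherwise.
 */
int
sketch_narrow_isalpha(sketch_wchar c)
{
    return c <= UCHAR_MAX && isalpha((unsigned char) c) != 0;
}
```

Under these assumptions, the Cyrillic letter "zhe" is the single byte 0xD6 in KOI8-R, so it reaches isalpha() and the ru_RU.KOI8-R locale can classify it. In a UTF8 database the same letter is code point U+0436, which is larger than UCHAR_MAX, so the sketch returns 0 no matter which locale is active.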



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
Attachment (bug.sql):

set client_encoding='KOI8';

SELECT    '�' ~* '[[:alpha:]]' as "true";
SELECT
        '������' ~* '������' as "true",
        '������' ~* '������' as "true";

Re: regular expressions stranges

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> As far as I can see, this is because of the use of isalpha (and the other is*
> functions), tolower and toupper instead of the isw* and tow* functions. Is there
> any reason to use them? If not, I can modify regc_locale.c similarly to the
> tsearch2 locale code.

The regex code is working with pg_wchar strings, which aren't
necessarily the same representation that the OS' wide-char functions
expect.  If we could guarantee compatibility then the above plan
would make sense ...
        regards, tom lane


Re: regular expressions stranges

From
Teodor Sigaev
Date:
> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect.  If we could guarantee compatibility then the above plan
> would make sense ...

It seems to me that this is possible for the UTF8 encoding. So the isalpha()
function may be defined as:

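static int
pg_wc_isalpha(pg_wchar c)
{
    /* single-byte range: the narrow classifier is safe */
    if (c >= 0 && c <= UCHAR_MAX)
        return isalpha((unsigned char) c);
#ifdef HAVE_WCSTOMBS
    /* wider values: try the wide classifier, but only under UTF8 */
    else if (GetDatabaseEncoding() == PG_UTF8)
        return iswalpha((wint_t) c);
#endif
    return 0;
}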
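
For experimenting outside the server, the proposal can be approximated as a standalone function. This is only a sketch: sketch_wchar stands in for pg_wchar, the encoding_is_utf8 parameter is a hypothetical replacement for the GetDatabaseEncoding() == PG_UTF8 test, and the HAVE_WCSTOMBS guard is dropped on the assumption that <wctype.h> is available.

```c
#include <ctype.h>
#include <limits.h>
#include <wctype.h>

/* Hypothetical stand-in for PostgreSQL's pg_wchar. */
typedef unsigned int sketch_wchar;

/*
 * Narrow classifier for one-byte values, wide classifier for larger values
 * when the caller says the encoding is UTF8, and "not alphabetic" otherwise.
 * Returns 1 or 0.
 */
int
sketch_wc_isalpha(sketch_wchar c, int encoding_is_utf8)
{
    if (c <= UCHAR_MAX)
        return isalpha((unsigned char) c) != 0;
    if (encoding_is_utf8)
        return iswalpha((wint_t) c) != 0;
    return 0;
}
```

What the wide branch returns for a value such as 0x0436 depends on the LC_CTYPE locale selected with setlocale() and on whether the platform's wchar_t values are Unicode code points, so the sketch makes no promise about that case.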

Re: regular expressions stranges

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect.  If we could guarantee compatibility then the above plan
>> would make sense ...

> It seems to me that this is possible for the UTF8 encoding.

Why?  The one thing that a wchar certainly is not is UTF8.
It might be that the <wctype.h> functions are expecting UTF16 or UTF32,
but we don't know which, and really we can hardly even be sure they're
expecting Unicode at all.
        regards, tom lane
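
The distinction drawn above can be illustrated with a small sketch. UTF-8 is a byte encoding, so a value handed to the <wctype.h> functions has to be decoded first, and the decoded code point is only meaningful to them if the platform's wchar_t holds ISO 10646 values. C99 lets an implementation advertise that through the __STDC_ISO_10646__ macro. The function names below are hypothetical, and the decoder is deliberately simplified: it does not validate continuation bytes or reject overlong forms.

```c
/*
 * Decode the first UTF-8 sequence in s (1 to 4 bytes) into a code point.
 * The caller must supply enough bytes for the sequence the lead byte
 * announces. Returns 0xFFFFFFFF if s[0] is not a valid lead byte.
 */
unsigned int
sketch_utf8_to_codepoint(const unsigned char *s)
{
    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)
        return ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    if ((s[0] & 0xF0) == 0xE0)
        return ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) |
               (s[2] & 0x3Fu);
    if ((s[0] & 0xF8) == 0xF0)
        return ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
               ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 0xFFFFFFFFu;
}

/*
 * Returns 1 if the implementation declares that wchar_t values are
 * ISO 10646 (Unicode) code points, 0 if it makes no such promise.
 */
int
sketch_wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A caller might pass the decoded value to iswalpha() only when sketch_wchar_is_unicode() returns 1 and fall back to the byte-range behaviour otherwise. Whether that is an acceptable guarantee for the regex code is exactly the open question in this thread.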