Thread: regular expressions stranges
Regexp works differently with non-ASCII characters depending on server encoding (bug.sql contains non-ASCII characters):

% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
 true
------
 t
(1 row)

 true | true
------+------
 t    | t
(1 row)

% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
 true
------
 f
(1 row)

 true | true
------+------
 f    | t
(1 row)

As far as I can see, that is because isalpha (and the other is* functions), tolower and toupper are used instead of the isw* and tow* functions. Is there any reason to use them? If not, I can modify regc_locale.c similarly to the tsearch2 locale part.

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

bug.sql:

set client_encoding='KOI8';
SELECT '�' ~* '[[:alpha:]]' as "true";
SELECT '������' ~* '������' as "true", '������' ~* '������' as "true";
Teodor Sigaev <teodor@sigaev.ru> writes:
> As I can see, that is because of using isalpha (and other is*), tolower &
> toupper instead of isw* and tow* functions. Is any reason to use them? If not, I
> can modify regc_locale.c similarly to tsearch2 locale part.

The regex code is working with pg_wchar strings, which aren't necessarily the same representation that the OS' wide-char functions expect. If we could guarantee compatibility then the above plan would make sense ...

regards, tom lane
> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect. If we could guarantee compatibility then the above plan
> would make sense ...

It seems to me that this is possible for the UTF8 encoding. So the isalpha() function could be defined as:

    static int
    pg_wc_isalpha(pg_wchar c)
    {
        /* single-byte range: the plain <ctype.h> function is safe */
        if (c >= 0 && c <= UCHAR_MAX)
            return isalpha((unsigned char) c);
    #ifdef HAVE_WCSTOMBS
        /* UTF8 database: hand the value to the OS wide-char function */
        else if (GetDatabaseEncoding() == PG_UTF8)
            return iswalpha((wint_t) c);
    #endif
        return 0;
    }

-- 
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect. If we could guarantee compatibility then the above plan
>> would make sense ...

> it seems to me, that is possible for UTF8 encoding.

Why? The one thing that a wchar certainly is not is UTF8. It might be that the <wctype.h> functions are expecting UTF16 or UTF32, but we don't know which, and really we can hardly even be sure they're expecting Unicode at all.

regards, tom lane
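To make the distinction in the last message concrete: UTF8 is a sequence of bytes, while a wide-character function takes one integer per character. The sketch below is not PostgreSQL code; demo_utf8_to_codepoint is a hypothetical helper that turns one UTF8 sequence into a Unicode code point. Whether that code point is also what the platform's <wctype.h> functions expect in a wint_t is exactly the open question raised above, so the sketch deliberately stops short of calling iswalpha().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not from the PostgreSQL tree):
 * decode a single UTF8 sequence of 1 to 4 bytes into a Unicode code point.
 * Returns the number of bytes consumed, or 0 if the input is truncated or
 * malformed. Overlong encodings are not rejected in this sketch.
 */
static size_t
demo_utf8_to_codepoint(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t   need;
    size_t   i;
    uint32_t c;

    assert(s != NULL && cp != NULL);

    if (len == 0)
        return 0;

    /* The lead byte tells us how long the sequence is. */
    if (s[0] < 0x80)
    {
        need = 1;
        c = s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        need = 2;
        c = s[0] & 0x1F;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        need = 3;
        c = s[0] & 0x0F;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        need = 4;
        c = s[0] & 0x07;
    }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len < need)
        return 0;               /* truncated sequence */

    /* Each continuation byte contributes its low six bits. */
    for (i = 1; i < need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    *cp = c;
    return need;
}
```

For example, the two bytes 0xD0 0xB1 decode to the single value 0x0431. Neither byte on its own is that character, which is why a per-byte isalpha() cannot classify it, and passing 0x0431 to iswalpha() is only meaningful on platforms where wchar_t is known to hold Unicode code points.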