Thread: Regular expression
Hello,
Case-insensitive pattern matching gives strange results for non-ASCII characters (such as UTF-8 encoded Cyrillic letters):
test=# select 'б' ~* 'Б' ;
?column?
----------
f
(1 row)
('б' and 'Б' are the lowercase and uppercase forms of the Cyrillic letter 'Б'.)
At the same time:
test=# select 'б' ilike 'Б' ;
?column?
----------
t
(1 row)
(PG 8.3 on Linux, UTF-8 locale)
Also, why are Cyrillic letters not treated by the regexp engine as part of the [:alpha:], [:alnum:], \w etc. classes? Or were they never meant to be?
"Vyacheslav Kalinin" <vka@mgcp.com> writes:
> Case insensitive pattern matching gives strange results for non-ascii
> character (such as UTF-8 encoded cyrillic letters):
Yeah, the regex locale support doesn't work well in multibyte character sets --- it basically will not recognize that non-ASCII characters have any case variants. Fixing this has been on the TODO list for awhile ...
regards, tom lane
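A possible workaround, not proposed in the thread itself, is to fold case explicitly and then use the case-sensitive operator. This is only a sketch: it assumes lower() is locale-aware for the database encoding, which the working ILIKE result above suggests but does not prove for this setup.

```sql
-- Hypothetical workaround (an assumption, not from the thread):
-- lower-case both the subject and the pattern, then match with the
-- case-sensitive regex operator (~) instead of ~*.
select lower('б') ~ lower('Б');
```

Note that applying lower() to the pattern also lower-cases any escapes in it, so a pattern containing \W or \D would change meaning (to \w or \d). The trick is only safe for patterns without case-sensitive escapes.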