Bug concerning regular expressions and UTF-8 - Mailing list pgsql-bugs

From Helmar Spangenberg
Subject Bug concerning regular expressions and UTF-8
Date
Msg-id 200601201803.17706.hspangenberg@frey.de
List pgsql-bugs
Hello folks,

My system is SuSE Linux 10.0 with a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). The locale is set to de_DE.UTF-8.

The bug shows up when the operator '~*' is used with umlauts. An easy way
to produce a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE"; however, Postgres thinks it's "FALSE" (see
also the discussion at www.pg-forum.de, subject "Konfiguration", thread
"Umlaute bei Regular Expressions"). It seems that this problem does not
exist in Windows-based installations.

It seems to me that this bug originates in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C functions toupper(unsigned char) and
tolower(unsigned char), which definitely are the wrong choice for UTF-8
characters beyond the ASCII range.

To check my assumption, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply with "return towlower(c);" and "return towupper(c);",
which led to the correct result for
select 'XXXMÜLLERYyyy' ~* '.*müller.*';
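
To illustrate the difference outside of PostgreSQL, here is a minimal
standalone sketch. It is not the code from regc_locale.c; the helper names
equal_fold_bytes and equal_fold_wide are made up for this example. The
first helper folds case byte by byte with tolower(), the second folds whole
wide characters with towlower(). The sketch assumes that wchar_t holds
Unicode code points (as on glibc) and that the wide variant runs after a
successful setlocale(LC_CTYPE, ...) call for a UTF-8 locale.

```c
#include <assert.h>
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: case-insensitive equality, folding one BYTE at a
 * time with tolower(). The bytes of a UTF-8 multibyte sequence are not
 * letters on their own, so tolower() leaves them unchanged. */
static int equal_fold_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (; *a != '\0' && *b != '\0'; a++, b++)
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
    return *a == *b;    /* equal only if both strings ended together */
}

/* Hypothetical helper: the same comparison, folding one WIDE CHARACTER
 * at a time with towlower(). Whether non-ASCII letters are folded
 * depends on the LC_CTYPE locale currently in effect. */
static int equal_fold_wide(const wchar_t *a, const wchar_t *b)
{
    assert(a != NULL && b != NULL);
    for (; *a != L'\0' && *b != L'\0'; a++, b++)
        if (towlower((wint_t) *a) != towlower((wint_t) *b))
            return 0;
    return *a == *b;    /* equal only if both strings ended together */
}
```

In UTF-8, "Ü" is the byte pair 0xC3 0x9C and "ü" is 0xC3 0xBC. The
byte-wise helper never maps 0x9C to 0xBC, so "MÜLLER" and "müller" compare
as different, which matches the faulty "FALSE" above. The wide helper sees
the single code points U+00DC and U+00FC and can fold them, provided the
active locale defines that mapping.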

Since I have no idea what side effects this change might have, please let
me know as soon as an "official" patch is available. I definitely need
regular expressions that handle UTF-8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg@frey.de
