
Bug concerning regular expressions and UTF-8 - Mailing list pgsql-bugs

From	Helmar Spangenberg
Subject	Bug concerning regular expressions and UTF-8
Date	January 22, 2006 00:56:22
Msg-id	200601201803.17706.hspangenberg@frey.de
List	pgsql-bugs


Hello folks,

my system is a SuSE 10.0 Linux and a plain PostgreSQL 8.1.2 (compiled by
myself, NLS enabled). LOCALE is set to de_DE.UTF-8.

The bug shows up when using the operator '~*' with umlauts. An easy way to
produce a faulty result is

select 'XXXMÜLLERYyyy' ~* '.*müller.*';

The result should be "TRUE", but Postgres returns "FALSE" (see also the
discussion at www.pg-forum.de, subject "Konfiguration", thread "Umlaute bei
Regular Expressions"). The problem does not seem to exist in Windows-based
installations.

It seems to me that this bug originates in the file
src/backend/regex/regc_locale.c. The functions pg_wc_tolower(pg_wchar) and
pg_wc_toupper(pg_wchar) rely on the C functions toupper(unsigned char) and
tolower(unsigned char), which are definitely the wrong choice for UTF-8
characters beyond the ASCII range.

To check my assumption, I replaced the bodies of pg_wc_tolower and
pg_wc_toupper simply with "return towlower(c);" and "return towupper(c);",
which led to the correct result for
select 'XXXMÜLLERYyyy' ~* '.*müller.*';
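
To illustrate the difference, here is a minimal standalone sketch. It is not
the PostgreSQL source: demo_wchar, byte_tolower, wide_tolower and
check_ascii_agreement are hypothetical names made up for this example, and
the byte-oriented variant only approximates the behaviour described above.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_wchar (assumption: an unsigned code point). */
typedef unsigned int demo_wchar;

/*
 * Byte-oriented folding: only values that fit into an unsigned char are
 * handed to tolower(); everything else is returned unchanged.
 */
demo_wchar
byte_tolower(demo_wchar c)
{
    if (c <= UCHAR_MAX)
        return (demo_wchar) tolower((unsigned char) c);
    return c;
}

/* Wide-character folding: the code point is handed to towlower(). */
demo_wchar
wide_tolower(demo_wchar c)
{
    return (demo_wchar) towlower((wint_t) c);
}

/* In the default "C" locale both variants agree on plain ASCII. */
void
check_ascii_agreement(void)
{
    for (demo_wchar c = 0; c < 128; c++)
        assert(byte_tolower(c) == wide_tolower(c));
}
```

In the "C" locale and in UTF-8 locales, tolower() treats the single byte 0xDC
as a non-letter, so byte_tolower(0xDC) leaves U+00DC ("Ü") unchanged. After
setlocale(LC_CTYPE, "de_DE.UTF-8"), and assuming a platform where wchar_t
holds Unicode code points, wide_tolower(0xDC) would be expected to return
0xFC ("ü"), which matches what I see with the modified functions.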

Since I have no idea about the side effects of this change, please let me
know as soon as an "official" patch is available - I definitely need
regular expressions that handle UTF-8 correctly...

Thanks,
Helmar Spangenberg
e-mail: hspangenberg@frey.de

