BUG #4200: Regexp character classes not UTF8-compliant - Mailing list pgsql-bugs

From Jean-Baptiste Quenot
Subject BUG #4200: Regexp character classes not UTF8-compliant
Date
Msg-id 200805261913.m4QJD5gh048059@wwwmaster.postgresql.org
Responses Re: BUG #4200: Regexp character classes not UTF8-compliant  (Bruce Momjian <bruce@momjian.us>)
List pgsql-bugs
The following bug has been logged online:

Bug reference:      4200
Logged by:          Jean-Baptiste Quenot
Email address:      jbq@caraldi.com
PostgreSQL version: 8.3.1
Operating system:   Linux Ubuntu Hardy
Description:        Regexp character classes not UTF8-compliant
Details:

The PostgreSQL documentation at
http://www.postgresql.org/docs/8.3/static/functions-matching.html describes
the various character classes that can be used to match or replace
strings in regular expressions.  However, the [:alnum:] and [:alpha:] character
classes are not UTF8-compliant, as shown in the examples below:

dockee=# show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

dockee=# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

dockee=# select regexp_replace('bébéàu', '[[:alnum:]]', '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

ovhdev=# select regexp_replace('bébéàu', '[[:alpha:]]', '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

dockee=# select regexp_replace('bébéàu', $$\w$$, '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

In each case, only characters in the ASCII range were recognized as belonging
to the character class.  The accented letters é and à are alphanumeric too,
so they should have been removed as well, leaving an empty string.
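
For comparison, here is a minimal sketch of the same replacement in Python,
whose re module can run in either ASCII-only or Unicode-aware mode.  It is
only an illustration of the expected behaviour, not a statement about how
PostgreSQL implements its character classes; the variable names are my own,
and the test string is assumed to use precomposed (NFC) characters.

```python
import re

# Same test string as in the psql sessions above, written with escapes so the
# accented letters are unambiguously precomposed: b, é, b, é, à, u.
text = "b\u00e9b\u00e9\u00e0u"

# ASCII-only mode: \w matches just [a-zA-Z0-9_], so the accented letters
# survive, which mirrors the output reported above.
ascii_only = re.sub(r"\w", "", text, flags=re.ASCII)

# Unicode-aware mode (the default for str patterns in Python 3): accented
# letters count as word characters, so nothing is left.
unicode_aware = re.sub(r"\w", "", text)

print(repr(ascii_only))
print(repr(unicode_aware))
```

The first result reproduces what the queries above return; the second is what
I would expect from a UTF8-compliant character class.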
