Thread: BUG #4200: Regexp character classes not UTF8-compliant

BUG #4200: Regexp character classes not UTF8-compliant

From: Jean-Baptiste Quenot

The following bug has been logged online:

Bug reference:      4200
Logged by:          Jean-Baptiste Quenot
Email address:      jbq@caraldi.com
PostgreSQL version: 8.3.1
Operating system:   Linux Ubuntu Hardy
Description:        Regexp character classes not UTF8-compliant
Details:

PostgreSQL documentation at
http://www.postgresql.org/docs/8.3/static/functions-matching.html describes
the various character classes, and they can be used to match or replace
strings with regexp support.  However, the [:alnum:] and [:alpha:] character
classes are not UTF8-compliant, as shown in the examples below:

dockee=# show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

dockee=# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

dockee=# select regexp_replace('bébéàu', '[[:alnum:]]', '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

ovhdev=# select regexp_replace('bébéàu', '[[:alpha:]]', '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

dockee=# select regexp_replace('bébéàu', $$\w$$, '', 'g');
 regexp_replace
----------------
 ééà
(1 row)

Only characters in the ASCII range are recognized as belonging to the
[:alnum:] character class, although accented letters such as é and à are
alphanumeric too.
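
For comparison, the same ASCII-only versus Unicode-aware difference can be
reproduced outside PostgreSQL.  The following is a minimal Python sketch, not
PostgreSQL's regex engine: it assumes Python 3's re module, whose re.ASCII
flag restricts \w to [a-zA-Z0-9_], and shows the kind of result reported
above next to what a Unicode-aware class would give.  The variable names are
illustrative only.

```python
import re

# 'bébéàu' written with escapes so the letters are precomposed code points.
text = "b\u00e9b\u00e9\u00e0u"

# ASCII-only classification: \w matches only [a-zA-Z0-9_], so the
# accented letters are left behind.
ascii_only = re.sub(r"\w", "", text, flags=re.ASCII)

# Unicode-aware classification (the default for str patterns in Python 3):
# the accented letters count as word characters and are removed too.
unicode_aware = re.sub(r"\w", "", text)

print(repr(ascii_only))
print(repr(unicode_aware))
```

With re.ASCII only b, b and u are removed, which is the same result as the
psql output above; without it, all six letters are removed.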

Re: BUG #4200: Regexp character classes not UTF8-compliant

From: Bruce Momjian

I am not sure how to help you, except to say that UTF8 is a character set
encoding, while en_US.UTF-8 is a locale that also names an encoding.  My
guess is that if you used a *.UTF-8 locale for the proper localization
language, it would work.

    http://www.postgresql.org/docs/8.2/static/locale.html
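
To illustrate the distinction between an encoding name and a locale name: a
POSIX locale name generally has the form
language[_territory][.codeset][@modifier], so en_US.UTF-8 carries a language
and a territory in addition to the UTF-8 codeset.  A small Python sketch,
using a hypothetical helper split_locale_name that only splits the string and
does not consult the operating system's locale database:

```python
def split_locale_name(name):
    """Split 'language[_territory][.codeset][@modifier]' into its parts.

    Hypothetical helper for illustration; missing parts are returned as None.
    """
    rest, _, modifier = name.partition("@")
    rest, _, codeset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return {
        "language": language,
        "territory": territory or None,
        "codeset": codeset or None,
        "modifier": modifier or None,
    }


parts = split_locale_name("en_US.UTF-8")
print(parts)
```

Here the language is en, the territory is US and the codeset is UTF-8,
whereas a client_encoding value such as UTF8 names only an encoding.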

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +