Thread: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

[9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

From
"David Johnston"
Date:

PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB Install Executable)

 

CREATE DATABASE betatest
                TEMPLATE template0
                ENCODING 'UTF8'
                LC_COLLATE 'C'
                LC_CTYPE 'C';

               

[connect to database]

 

CREATE DOMAIN idcode AS text
                NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$');

 

SELECT 'AAAAAéaaaaa'::idcode; -- SQL Error: ERROR:  value for domain idcode violates check constraint "idcode_check" (note the accented “e” between all the “A”s)

 

This is running just fine against a 9.0 install on the same machine.  [\w] is Unicode aware and server encoding is set (and confirmed via SHOW) to be “UTF8”.

 

David J.

 

 

"David Johnston" <polobo@yahoo.com> writes:
> PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB
> Install Executable)

> CREATE DATABASE betatest
>                 TEMPLATE template0
>                 ENCODING 'UTF8'
>                 LC_COLLATE 'C'
>                 LC_CTYPE 'C';

> CREATE DOMAIN idcode AS text
>                 NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$')
> ;

> SELECT 'AAAAAéaaaaa'::idcode; -- SQL Error: ERROR:  value for domain
> idcode violates check constraint "idcode_check" (note the accented “e”
> between all the “A”s)

AFAICS that's correct behavior.  C locale should not think that é is
a letter.

> This is running just fine against a 9.0 install on the same machine.

We made some strides towards getting locale-sensitive stuff to work as
it "should" in 9.1.  In particular, platform-specific creative
interpretations of what C locale means shouldn't happen anymore ...

            regards, tom lane
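
To see how a given server classifies the character, a quick check along the following lines may help. It is only a sketch. It assumes a UTF8 database, a client encoding that transmits é correctly, and standard_conforming_strings = on (the 9.1 default) so that \w reaches the regex engine unescaped. The column aliases are made up for illustration.

```sql
-- Which character-classification locale is this database using?
SHOW lc_ctype;

-- Does \w accept a plain ASCII letter and an accented letter?
-- Per the discussion above, with LC_CTYPE 'C' on 9.1 the accented
-- letter is expected not to count as a word character.
SELECT 'e' ~ '^\w$' AS plain_e_is_word_char,
       'é' ~ '^\w$' AS e_acute_is_word_char;
```

If the two columns differ, the domain's CHECK constraint will reject any value containing that accented letter.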

Re: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

From
"David Johnston"
Date:
Got it.  Changing LC_CTYPE to "English_United States.1252" restores the
correct behavior.

Thanks.

David J.
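
For reference, the change described above might look like the following. This is a sketch under assumptions: the database name betatest_win1252 is hypothetical, the locale name is the Windows spelling quoted in the message, and LC_COLLATE is left at 'C' because only LC_CTYPE was reported as changed. LC_CTYPE is fixed when a database is created, so the setting goes into a new CREATE DATABASE rather than an ALTER.

```sql
-- Hypothetical database name; same settings as the original report
-- except for LC_CTYPE.
CREATE DATABASE betatest_win1252
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'English_United States.1252';

-- After connecting to the new database, re-create the domain
-- and retry the cast that failed under LC_CTYPE 'C'.
CREATE DOMAIN idcode AS text
    NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$');

SELECT 'AAAAAéaaaaa'::idcode;
```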

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, May 30, 2011 10:40 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] [9.1beta1] UTF-8/Regex Word-Character Definition
> excluding accented letters
>
> "David Johnston" <polobo@yahoo.com> writes:
> > PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit
> > (EnterpriseDB Install Executable)
>
> > CREATE DATABASE betatest
> >                 TEMPLATE template0
> >                 ENCODING 'UTF8'
> >                 LC_COLLATE 'C'
> >                 LC_CTYPE 'C';
>
> > CREATE DOMAIN idcode AS text
> >                 NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$') ;
>
> > SELECT 'AAAAAéaaaaa'::idcode; -- SQL Error: ERROR:  value for
> > domain idcode violates check constraint "idcode_check" (note the
> > accented “e” between all the “A”s)
>
> AFAICS that's correct behavior.  C locale should not think that é is a
> letter.
>
> > This is running just fine against a 9.0 install on the same machine.
>
> We made some strides towards getting locale-sensitive stuff to work as it
> "should" in 9.1.  In particular, platform-specific creative
> interpretations of what C locale means shouldn't happen anymore ...
>
>             regards, tom lane