Thread: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters
[9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters
From
"David Johnston"
Date:
PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB Install Executable)
CREATE DATABASE betatest
TEMPLATE template0
ENCODING 'UTF8'
LC_COLLATE 'C'
LC_CTYPE 'C';
[connect to database]
CREATE DOMAIN idcode AS text
NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$')
;
SELECT 'AAAAAéaaaaa'::idcode; -- -> SQL Error: ERROR: value for domain idcode violates check constraint "idcode_check" (note the accented “e” between all the “A”s)
This runs just fine against a 9.0 install on the same machine. [\w] is Unicode-aware, and the server encoding is set to “UTF8” (confirmed via SHOW).
David J.
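For anyone reproducing this, the encoding is only half of the picture: the locale settings chosen at CREATE DATABASE time can be inspected as well. The queries below are a minimal sketch, assuming a session connected to the database in question; they are not part of the original post.

```sql
-- Encoding of the current database (the post reports UTF8).
SHOW server_encoding;

-- Collation and character-classification locale, both fixed when the
-- database was created (here they would be expected to read 'C' and 'C').
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();
```

The datctype column is the setting that matters for which characters count as letters; the encoding only determines how the bytes are decoded.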
"David Johnston" <polobo@yahoo.com> writes:
> PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB
> Install Executable)

> CREATE DATABASE betatest
> TEMPLATE template0
> ENCODING 'UTF8'
> LC_COLLATE 'C'
> LC_CTYPE 'C';

> CREATE DOMAIN idcode AS text
> NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$')
> ;

> SELECT 'AAAAAéaaaaa'::idcode; // -> SQL Error: ERROR: value for domain
> idcode violates check constraint "idcode_check" (note the accented “e”
> between all the “A”s)

AFAICS that's correct behavior. C locale should not think that é is a letter.

> This is running just fine against a 9.0 install on the same machine.

We made some strides towards getting locale-sensitive stuff to work as it "should" in 9.1. In particular, platform-specific creative interpretations of what C locale means shouldn't happen anymore ...

regards, tom lane
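The point about character classification can be checked in isolation, without the domain. The query below is a small sketch (the column aliases are made up for illustration) that asks the regex engine directly whether a single character counts as a word character in the connected database.

```sql
-- Probe \w against one accented and one plain ASCII letter.
-- The result for 'é' depends on the database's LC_CTYPE setting.
SELECT 'é' ~ '^\w$' AS accented_is_word_char,
       'e' ~ '^\w$' AS ascii_is_word_char;
```

Going by the explanation above, the first column should come back false in a 9.1 database created with LC_CTYPE 'C', while the second should be true regardless of locale.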
Re: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters
From
"David Johnston"
Date:
Got it. Changing LC_CTYPE to "English_United States.1252" restores the correct behavior.

Thanks.

David J.

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, May 30, 2011 10:40 PM
> To: David Johnston
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] [9.1beta1] UTF-8/Regex Word-Character Definition
> excluding accented letters
>
> "David Johnston" <polobo@yahoo.com> writes:
> > PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit
> > (EnterpriseDB Install Executable)
>
> > CREATE DATABASE betatest
> > TEMPLATE template0
> > ENCODING 'UTF8'
> > LC_COLLATE 'C'
> > LC_CTYPE 'C';
>
> > CREATE DOMAIN idcode AS text
> > NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$') ;
>
> > SELECT 'AAAAAéaaaaa'::idcode; // -> SQL Error: ERROR: value for
> > domain idcode violates check constraint "idcode_check" (note the accented e
> > between all the As)
>
> AFAICS that's correct behavior. C locale should not think that é is a letter.
>
> > This is running just fine against a 9.0 install on the same machine.
>
> We made some strides towards getting locale-sensitive stuff to work as it
> "should" in 9.1. In particular, platform-specific creative interpretations of
> what C locale means shouldn't happen anymore ...
>
> regards, tom lane
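Since LC_CTYPE is fixed when a database is created, the change described in this reply presumably means creating the database again with a different setting. The statements below sketch that under a few assumptions: the database name betatest_ctype is hypothetical, LC_COLLATE is left at 'C' because the reply only mentions LC_CTYPE, and the locale name is the Windows one quoted in the reply.

```sql
-- Hypothetical database name; only LC_CTYPE differs from the original post.
CREATE DATABASE betatest_ctype
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'English_United States.1252';

-- [connect to the new database], then recreate the domain and retest:
CREATE DOMAIN idcode AS text
  NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$');

SELECT 'AAAAAéaaaaa'::idcode;
```

With a non-C LC_CTYPE the accented letter is classified by the operating system's locale data, which is why the cast is reported to succeed again.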