Thread: BUG #4319: lower()/upper() does not know about UNICODE case mapping
BUG #4319: lower()/upper() does not know about UNICODE case mapping
From
"Valentine Gogichashvili"
Date:
The following bug has been logged online: Bug reference: 4319 Logged by: Valentine Gogichashvili Email address: valgog@gmail.com PostgreSQL version: 8.3.1 Operating system: SuSE Linux (kernel 2.6.13-15.11-default) Description: lower()/upper() does not know about UNICODE case mapping Details: Hi, I understand, that it is more a feature, but it does not help me anyways... On the UNICODE databases lower and upper functions are using system locale settings (that cannot be changed after initializing DB?) and does not know anything about UNICODE case mapping. The problem really becomes 'a problem' on multilingual systems. I have to store data for German, Russian and Romanian languages together. On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to check if how it works on the 8.3 as I do not have any db instance with the LC_CTYPE set to en_EN. I can understand, that LC_COLLATE is something that is not clear from the context of the character. But case pair is always defined in the UNICODE standard and should be not dependent of the LC_CTYPE. Or am I wrong? Regards, -- Valentine Gogichashvili
Valentine Gogichashvili escribió: > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually > corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to > check if how it works on the 8.3 as I do not have any db instance with the > LC_CTYPE set to en_EN. I don't know about Unicode case mapping, but en_EN is not a UTF8 locale which is why it's corrupting the data. Try en_EN.utf8 instead. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote: > Valentine Gogichashvili escribi=F3: > > > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actua= lly > > corrupting UTF8 data, lowering UTF8 control bytes... I did have a chanc= e to > > check if how it works on the 8.3 as I do not have any db instance with = the > > LC_CTYPE set to en_EN. > > I don't know about Unicode case mapping, but en_EN is not a UTF8 locale > which is why it's corrupting the data. =A0Try en_EN.utf8 instead. > > -- > Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. > > -- > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org) > To make changes to your subscription:http://www.postgresql.org/mailpref/p= gsql-bugs Will en_EN.utf8 know about German, Russian, Romanian and Dutch?
On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote: > Valentine Gogichashvili escribi=F3: > > > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actua= lly > > corrupting UTF8 data, lowering UTF8 control bytes... I did have a chanc= e to > > check if how it works on the 8.3 as I do not have any db instance with = the > > LC_CTYPE set to en_EN. > > I don't know about Unicode case mapping, but en_EN is not a UTF8 locale > which is why it's corrupting the data. =A0Try en_EN.utf8 instead. > > -- > Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. > > -- > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org) > To make changes to your subscription:http://www.postgresql.org/mailpref/p= gsql-bugs Why Postgres allows creating UNICODE database with the locale, that can possibly corrupt my data?
Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping
From
Peter Eisentraut
Date:
Am Tuesday, 22. July 2008 schrieb valgog: > Why Postgres allows creating UNICODE database with the locale, that > can possibly corrupt my data? It doesn't allow it, as of 8.3. In 8.2 it does, but we have fixed that, for the reasons that are becoming obvious to you now. Perhaps part of the problem is that en_EN isn't actually a valid locale, as far as I can tell, unless SUSE has invented a new country. :) Try locale -a and pick one from that list.
I checked for de_DE.utf8 (created a test database with initdb) and lower/upper is working correctly there on both 8.2 and 8.3 in different languages On Jul 22, 11:26=A0am, valgog <val...@gmail.com> wrote: > On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote: > > > > > Valentine Gogichashvili escribi=F3: > > > > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is act= ually > > > corrupting UTF8 data, lowering UTF8 control bytes... I did have a cha= nce to > > > check if how it works on the 8.3 as I do not have any db instance wit= h the > > > LC_CTYPE set to en_EN. > > > I don't know about Unicode case mapping, but en_EN is not a UTF8 locale > > which is why it's corrupting the data. =A0Try en_EN.utf8 instead. > > > -- > > Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0http://www.CommandPrompt.com/ > > The PostgreSQL Company - Command Prompt, Inc. > > > -- > > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org) > > To make changes to your subscription:http://www.postgresql.org/mailpref= /pgsql-bugs > > Will en_EN.utf8 know about German, Russian, Romanian and Dutch?
On Jul 22, 11:53=A0am, pete...@gmx.net (Peter Eisentraut) wrote: > Am Tuesday, 22. July 2008 schrieb valgog: > > > Why Postgres allows creating UNICODE database with the locale, that > > can possibly corrupt my data? > > It doesn't allow it, as of 8.3. =A0In 8.2 it does, but we have fixed that= , for > the reasons that are becoming obvious to you now. > > Perhaps part of the problem is that en_EN isn't actually a valid locale, = as > far as I can tell, unless SUSE has invented a new country. :) =A0Try loca= le -a > and pick one from that list. > > -- > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org) > To make changes to your subscription:http://www.postgresql.org/mailpref/p= gsql-bugs Ok, I have checked in databases with locale set to one of the utf8 locales convert data correctly. But I still do not understand, why postgres initdb allows using non-UTF8 locales together with UNICODE database encoding. Another question. Is it possible to change the current Database LC_CTYPE on the database without recreating it with initdb and reimporting all the data. I would rather prefer my indexes recreated, rather then reimporting database (cannot afford such a long time out of service :(
Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping
From
Tomasz Ostrowski
Date:
On 2008-07-23 11:48, valgog wrote: > Another question. Is it possible to change the current Database > LC_CTYPE on the database without recreating it with initdb and > reimporting all the data. No. > I would rather prefer my indexes recreated, rather then reimporting > database (cannot afford such a long time out of service :( You can use slony replication for minimal downtime (seconds), but it is rather complicated. You can also seriously reduce reimport time by: - temporary disabling sync; - temporary enabling hard disk write cache; - importing and exporting concurrently with a pipe: old_pgdata = /var/lib/pgsql/data new_pgdata = /var/lib/pgsql/newdata initdb -D $new_pgdata postgres -p 10001 -F -D $new_pgdata & su -c 'service postgresql stop' postgres -p 10002 -F -D $old_pgdata & pg_dumpall -p 10002 | pg_restore -p 10001 killall postgres mv $old_pgdata $old_pgdata.bak mv $new_pgdata $old_pgdata su -c 'service postgresql start' - importing and exporting from one disk or cluster to another Regards Tometzky -- ...although Eating Honey was a very good thing to do, there was a moment just before you began to eat it which was better than when you were... Winnie the Pooh
Valentine Gogichashvili napsal(a): > Hi, Hi > I understand, that it is more a feature, but it does not help me anyways... > > On the UNICODE databases lower and upper functions are using system locale > settings (that cannot be changed after initializing DB?) and does not know > anything about UNICODE case mapping. > > The problem really becomes 'a problem' on multilingual systems. I have to > store data for German, Russian and Romanian languages together. I understand you and I hope things will be better. See collation per database project: http://wiki.postgresql.org/wiki/Gsoc08-collation Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 25, 10:17=A0am, Zdenek.Kot...@Sun.COM (Zdenek Kotala) wrote: > Valentine Gogichashvili napsal(a): > > > Hi, > > Hi > > > I understand, that it is more a feature, but it does not help me anyway= s... > > > On the UNICODE databases lower and upper functions are using system loc= ale > > settings (that cannot be changed after initializing DB?) and does not k= now > > anything about UNICODE case mapping. > > > The problem really becomes 'a problem' on multilingual systems. I have = to > > store data for German, Russian and Romanian languages together. =A0 > > I understand you and I hope things will be better. See collation per data= base > project:http://wiki.postgresql.org/wiki/Gsoc08-collation > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 Zdenek > > -- > Zdenek Kotala =A0 =A0 =A0 =A0 =A0 =A0 =A0Sun Microsystems > Prague, Czech Republic =A0 =A0http://sun.com/postgresql > > -- > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org) > To make changes to your subscription:http://www.postgresql.org/mailpref/p= gsql-bugs This is actually a high time to include support for different collates... but do you think it is enough to add the change to the database or maybe start building the collation support for every text column actually?