Thread: BUG #4319: lower()/upper() does not know about UNICODE case mapping

BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

"Valentine Gogichashvili"

Date:

21 July 2008, 11:20:14

The following bug has been logged online:

Bug reference:      4319
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 8.3.1
Operating system:   SuSE Linux  (kernel 2.6.13-15.11-default)
Description:        lower()/upper() does not know about UNICODE case mapping
Details:

Hi,

I understand, that it is more a feature, but it does not help me anyways...

On the UNICODE databases lower and upper functions are using system locale
settings (that cannot be changed after initializing DB?) and does not know
anything about UNICODE case mapping.

The problem really becomes 'a problem' on multilingual systems. I have to
store data for German, Russian and Romanian languages together.

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I can understand, that LC_COLLATE is something that is not clear from the
context of the character. But case pair is always defined in the UNICODE
standard and should be not dependent of the LC_CTYPE. Or am I wrong?

Regards,

-- Valentine Gogichashvili

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

Alvaro Herrera

Date:

21 July 2008, 14:33:08

Valentine Gogichashvili escribió:

> On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
> corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
> check if how it works on the 8.3 as I do not have any db instance with the
> LC_CTYPE set to en_EN.

I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
which is why it's corrupting the data.  Try en_EN.utf8 instead.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

valgog

Date:

22 July 2008, 09:41:59

On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:
> Valentine Gogichashvili escribi=F3:
>
> > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actua=
lly
> > corrupting UTF8 data, lowering UTF8 control bytes... I did have a chanc=
e to
> > check if how it works on the 8.3 as I do not have any db instance with =
the
> > LC_CTYPE set to en_EN.
>
> I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
> which is why it's corrupting the data. =A0Try en_EN.utf8 instead.
>
> --
> Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
>
> --
> Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
> To make changes to your subscription:http://www.postgresql.org/mailpref/p=
gsql-bugs

Will en_EN.utf8 know about German, Russian, Romanian and Dutch?

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

valgog

Date:

22 July 2008, 09:41:59

On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:
> Valentine Gogichashvili escribi=F3:
>
> > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actua=
lly
> > corrupting UTF8 data, lowering UTF8 control bytes... I did have a chanc=
e to
> > check if how it works on the 8.3 as I do not have any db instance with =
the
> > LC_CTYPE set to en_EN.
>
> I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
> which is why it's corrupting the data. =A0Try en_EN.utf8 instead.
>
> --
> Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
>
> --
> Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
> To make changes to your subscription:http://www.postgresql.org/mailpref/p=
gsql-bugs

Why Postgres allows creating UNICODE database with the locale, that
can possibly corrupt my data?

Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

Peter Eisentraut

Date:

22 July 2008, 09:53:23

Am Tuesday, 22. July 2008 schrieb valgog:
> Why Postgres allows creating UNICODE database with the locale, that
> can possibly corrupt my data?

It doesn't allow it, as of 8.3.  In 8.2 it does, but we have fixed that, for
the reasons that are becoming obvious to you now.

Perhaps part of the problem is that en_EN isn't actually a valid locale, as
far as I can tell, unless SUSE has invented a new country. :)  Try locale -a
and pick one from that list.

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

valgog

Date:

23 July 2008, 10:12:14

I checked for de_DE.utf8 (created a test database with initdb) and
lower/upper is working correctly there on both 8.2 and 8.3 in
different languages


On Jul 22, 11:26=A0am, valgog <val...@gmail.com> wrote:
> On Jul 21, 4:32=A0pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:
>
>
>
> > Valentine Gogichashvili escribi=F3:
>
> > > On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is act=
ually
> > > corrupting UTF8 data, lowering UTF8 control bytes... I did have a cha=
nce to
> > > check if how it works on the 8.3 as I do not have any db instance wit=
h the
> > > LC_CTYPE set to en_EN.
>
> > I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
> > which is why it's corrupting the data. =A0Try en_EN.utf8 instead.
>
> > --
> > Alvaro Herrera =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0http://www.CommandPrompt.com/
> > The PostgreSQL Company - Command Prompt, Inc.
>
> > --
> > Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
> > To make changes to your subscription:http://www.postgresql.org/mailpref=
/pgsql-bugs
>
> Will en_EN.utf8 know about German, Russian, Romanian and Dutch?

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

valgog

Date:

23 July 2008, 10:12:15

On Jul 22, 11:53=A0am, pete...@gmx.net (Peter Eisentraut) wrote:
> Am Tuesday, 22. July 2008 schrieb valgog:
>
> > Why Postgres allows creating UNICODE database with the locale, that
> > can possibly corrupt my data?
>
> It doesn't allow it, as of 8.3. =A0In 8.2 it does, but we have fixed that=
, for
> the reasons that are becoming obvious to you now.
>
> Perhaps part of the problem is that en_EN isn't actually a valid locale, =
as
> far as I can tell, unless SUSE has invented a new country. :) =A0Try loca=
le -a
> and pick one from that list.
>
> --
> Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
> To make changes to your subscription:http://www.postgresql.org/mailpref/p=
gsql-bugs

Ok, I have checked in databases with locale set to one of the utf8
locales convert data correctly. But I still do not understand, why
postgres initdb allows using non-UTF8 locales together with UNICODE
database encoding.

Another question. Is it possible to change the current Database
LC_CTYPE on the database without recreating it with initdb and
reimporting all the data. I would rather prefer my indexes recreated,
rather then reimporting database (cannot afford such a long time out
of service :(

Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

Tomasz Ostrowski

Date:

23 July 2008, 13:07:24

On 2008-07-23 11:48, valgog wrote:

> Another question. Is it possible to change the current Database
> LC_CTYPE on the database without recreating it with initdb and
> reimporting all the data.

No.

> I would rather prefer my indexes recreated, rather then reimporting
> database (cannot afford such a long time out of service :(

You can use slony replication for minimal downtime (seconds), but it is
rather complicated.

You can also seriously reduce reimport time by:
- temporary disabling sync;
- temporary enabling hard disk write cache;
- importing and exporting concurrently with a pipe:
    old_pgdata = /var/lib/pgsql/data
    new_pgdata = /var/lib/pgsql/newdata
    initdb -D $new_pgdata
    postgres -p 10001 -F -D $new_pgdata &
    su -c 'service postgresql stop'
    postgres -p 10002 -F -D $old_pgdata &
    pg_dumpall -p 10002 | pg_restore -p 10001
    killall postgres
    mv $old_pgdata $old_pgdata.bak
    mv $new_pgdata $old_pgdata
    su -c 'service postgresql start'
- importing and exporting from one disk or cluster to another

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
                                                      Winnie the Pooh

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

Zdenek Kotala

Date:

25 July 2008, 08:23:36

Valentine Gogichashvili napsal(a):

> Hi,

Hi

> I understand, that it is more a feature, but it does not help me anyways...
>
> On the UNICODE databases lower and upper functions are using system locale
> settings (that cannot be changed after initializing DB?) and does not know
> anything about UNICODE case mapping.
>
> The problem really becomes 'a problem' on multilingual systems. I have to
> store data for German, Russian and Romanian languages together.

I understand you and I hope things will be better. See collation per database
project: http://wiki.postgresql.org/wiki/Gsoc08-collation

        Zdenek


--
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

From

valgog

Date:

25 July 2008, 14:48:42

On Jul 25, 10:17=A0am, Zdenek.Kot...@Sun.COM (Zdenek Kotala) wrote:
> Valentine Gogichashvili napsal(a):
>
> > Hi,
>
> Hi
>
> > I understand, that it is more a feature, but it does not help me anyway=
s...
>
> > On the UNICODE databases lower and upper functions are using system loc=
ale
> > settings (that cannot be changed after initializing DB?) and does not k=
now
> > anything about UNICODE case mapping.
>
> > The problem really becomes 'a problem' on multilingual systems. I have =
to
> > store data for German, Russian and Romanian languages together. =A0
>
> I understand you and I hope things will be better. See collation per data=
base
> project:http://wiki.postgresql.org/wiki/Gsoc08-collation
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 Zdenek
>
> --
> Zdenek Kotala =A0 =A0 =A0 =A0 =A0 =A0 =A0Sun Microsystems
> Prague, Czech Republic =A0 =A0http://sun.com/postgresql
>
> --
> Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
> To make changes to your subscription:http://www.postgresql.org/mailpref/p=
gsql-bugs

This is actually a high time to include support for different
collates... but do you think it is enough to add the change to the
database or maybe start building the collation support for every text
column actually?