Thread: initdb in 8.3

initdb in 8.3

From

Tim Tassonis

Date:

23 April 2008, 10:16:21

Hi

I just recently compiled and installed 8.3.1 on a System that has UTF-8
as the default characterset in the environment. Copied the binaries, run
initdb without parameters, the usual stuff.

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems anyway etc and you
have to use locale=C to be able to create databases with different
charactersets.

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Apart from that I'd of course like to restate that postgresql is a
wonderful software which I like very much.

Bye
Tim

Re: initdb in 8.3

From

Richard Huxton

Date:

23 April 2008, 10:42:21

Tim Tassonis wrote:
> Hi
>
> I just recently compiled and installed 8.3.1 on a System that has UTF-8
> as the default characterset in the environment. Copied the binaries, run
> initdb without parameters, the usual stuff.
>
> As you probably are all aware of, this results now in a cluster that
> will only allow you to create UTF-8 databases. I have read some posts
> regarding this topic where it is explained that allowing LATIN1 on a
> cluster initialized with UTF-8 will give you problems anyway etc and you
> have to use locale=C to be able to create databases with different
> charactersets.

You can only have one locale per installation (initdb'd cluster).
You can create a database with any encoding that is compatible with that
locale (LATIN1, LATIN9, UTF-8, etc).

Now, the locale controls (amongst other things) all your sorting. If you
choose a locale of "C" you get a simple binary sorting. By default PG
tries to match whatever sorting you have set up on your operating-system.

See the difference below:

richardh@server3:db$ LANG=C sort /tmp/words.txt
  apple
  berry
Apple
apple
apples
the apple

richardh@server3:db$ LANG=en_GB.UTF-8 sort /tmp/words.txt
apple
  apple
Apple
apples
  berry
the apple

I think someone is looking at per-database locales for 8.4 - the issue
is more tricky than you might think because you need to worry about
system catalogue sort-order.

--
   Richard Huxton
   Archonet Ltd

Re: initdb in 8.3

From

Peter Eisentraut

Date:

23 April 2008, 11:15:28

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:
> My question is: Why then is --locale=C not the default for initdb, as I
> do regard it as a rather big annoyance that a default installation on
> probably almost any modern linux distribution results in a UTF-8 only
> cluster, fixable only by dropping all databases, rerun initdb and the
> reimporting them again.

Because the vast majority of users prefers UTF-8 encoded databases over C
locale databases.

Re: initdb in 8.3

From

Tim Tassonis

Date:

23 April 2008, 14:35:30

Peter Eisentraut wrote:
> Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:
>> My question is: Why then is --locale=C not the default for initdb, as I
>> do regard it as a rather big annoyance that a default installation on
>> probably almost any modern linux distribution results in a UTF-8 only
>> cluster, fixable only by dropping all databases, rerun initdb and the
>> reimporting them again.
>
> Because the vast majority of users prefers UTF-8 encoded databases over C
> locale databases.

Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
UTF-8 databases can be created, if C is chosen, you can specify
different encodings (UTF-8, LATIN1 etc) for each database.

As I understood now, sorting will then still be in C style and not in
the locale specific way. Which leads me to the following questions:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

 From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If specifying a characterset different from the default locale for a
database is not a bad idea, why does the default install forbid me to do
exactly this?

Regards
Tim

Re: initdb in 8.3

From

Martijn van Oosterhout

Date:

23 April 2008, 16:00:32

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:
> Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
> UTF-8 databases can be created, if C is chosen, you can specify
> different encodings (UTF-8, LATIN1 etc) for each database.
>
> As I understood now, sorting will then still be in C style and not in
> the locale specific way. Which leads me to the following questions:
>
> If specifying a characterset different from the default locale for a
> database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

> From how I understand you, if I wanted a postgres server machine
> supporting databases with different charsets, I'm advised to initialise
> one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

> If specifying a characterset different from the default locale for a
> database is not a bad idea, why does the default install forbid me to do
> exactly this?

It is a bad idea, because most normal the C library can only handle one
encoding at a time. Locale C is a backdoor because it has system
independant semantics and does not require libc. It's also not what
people usually want, and so not recommended.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

signature.asc

Re: initdb in 8.3

From

Peter Eisentraut

Date:

23 April 2008, 16:03:47

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:
> If specifying a characterset different from the default locale for a
> database is such a bad idea, why is it possible at all?

Because Japanese users need this functionality.  Aside from spectacularly
bizarre niche applications, that is really the only reason.  If you are
dealing with European-type languages, just forget about it and specify the
locale you want to initdb and let the encodings fall into place.

Re: initdb in 8.3

From

Tim Tassonis

Date:

23 April 2008, 16:22:38

Martijn van Oosterhout wrote:
> On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:
>>
>> If specifying a characterset different from the default locale for a
>> database is such a bad idea, why is it possible at all?
>
> It isn't possible, that's the point. What is possible is that client
> can use any encoding they like to talk to the server, but the server
> will store and manage it all in one. What locale C means "I'm an
> encoding wizard and will ensure all my programs can handle all the
> encodings I want to use, because I understand the database will treat
> everything I send as ASCII bytes no matter what encoding the clients
> say it is".
>
>> From how I understand you, if I wanted a postgres server machine
>> supporting databases with different charsets, I'm advised to initialise
>> one cluster per locale.
>
> If you want to control the *storage* charset, yes. If you just want
> clients to think it's a LATIN9 DB, doing a:
>
> ALTER DATABASE foo SET client_encoding=latin9;

Ok, got it, it's really this setting that's interesting. If I have a
legacy application that defaults to latin1, I can leave the DB to UTF-8
,set the client_encoding to latin1 and then all my selects and inserts
can use latin1, but the data will be stored in utf-8.

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

Re: initdb in 8.3

From

Karsten Hilbert

Date:

23 April 2008, 18:13:07

On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:

> As you probably are all aware of, this results now in a cluster that
> will only allow you to create UTF-8 databases. I have read some posts
> regarding this topic where it is explained that allowing LATIN1 on a
> cluster initialized with UTF-8 will give you problems

Ain't it the other way round ?

Creating UTF8 databases on LATIN1 clusters won't work ?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: initdb in 8.3

From

"Christopher Condit"

Date:

23 April 2008, 18:15:01

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with  varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

Thanks,
-Chris

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tim Tassonis
Sent: Wednesday, April 23, 2008 12:22
To: Martijn van Oosterhout
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Martijn van Oosterhout wrote:
> On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:
>>
>> If specifying a characterset different from the default locale for a
>> database is such a bad idea, why is it possible at all?
>
> It isn't possible, that's the point. What is possible is that client
> can use any encoding they like to talk to the server, but the server
> will store and manage it all in one. What locale C means "I'm an
> encoding wizard and will ensure all my programs can handle all the
> encodings I want to use, because I understand the database will treat
> everything I send as ASCII bytes no matter what encoding the clients
> say it is".
>
>> From how I understand you, if I wanted a postgres server machine
>> supporting databases with different charsets, I'm advised to
initialise
>> one cluster per locale.
>
> If you want to control the *storage* charset, yes. If you just want
> clients to think it's a LATIN9 DB, doing a:
>
> ALTER DATABASE foo SET client_encoding=latin9;

Ok, got it, it's really this setting that's interesting. If I have a
legacy application that defaults to latin1, I can leave the DB to UTF-8
,set the client_encoding to latin1 and then all my selects and inserts
can use latin1, but the data will be stored in utf-8.

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: initdb in 8.3

From

Tim Tassonis

Date:

23 April 2008, 23:00:59

Karsten Hilbert wrote:
> On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:
>
>> As you probably are all aware of, this results now in a cluster that
>> will only allow you to create UTF-8 databases. I have read some posts
>> regarding this topic where it is explained that allowing LATIN1 on a
>> cluster initialized with UTF-8 will give you problems
>
> Ain't it the other way round ?
>
> Creating UTF8 databases on LATIN1 clusters won't work ?

No it's this way round, but maybe the other way round, too.
You can only create databases with arbitrary encodings on C locale
clusters from 8.3 on.

Regards
Tim

>
> Karsten

Re: initdb in 8.3

From

Craig Ringer

Date:

24 April 2008, 04:20:38

Christopher Condit wrote:
> I have a question related to this issue:
> Now that the locale has changed, it seems that the planner no longer
> wants to use the indexes for running LIKE queries on varchar columns
> unless I specify varchar_pattern_ops when creating the index. And if I
> create the index with  varchar_pattern_ops, then the planner won't use
> it for = queries.
>
> What's the correct solution to this problem (when using UTF-8 and
> lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

By the way, when I tried to leave a comment on the pattern matching
operator doc page I found that after logging in I was presented with a
blank comments form again. When I re-entered my comment and submitted I
got an error indicating that '' is not a valid NUMERIC. Going back to
the original comments form on the pattern matching operator doc page
(having logged in) and submitting there works fine.

--
Craig Ringer

Re: initdb in 8.3

From

"Christopher Condit"

Date:

24 April 2008, 13:51:46

Ahhh - I See. Thanks, Craig.
Although, once you've built the index with varchar_pattern_ops index,
the following two (essentially equivalent) queries will run at vastly
different speeds:
select * from A where A.value like 'Nacho';
select * from A where A.value = 'Nacho';

Seems that the optimizer should catch this...
C

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: Thursday, April 24, 2008 01:20
To: Christopher Condit
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Christopher Condit wrote:
> I have a question related to this issue:
> Now that the locale has changed, it seems that the planner no longer
> wants to use the indexes for running LIKE queries on varchar columns
> unless I specify varchar_pattern_ops when creating the index. And if I
> create the index with  varchar_pattern_ops, then the planner won't use
> it for = queries.
>
> What's the correct solution to this problem (when using UTF-8 and
> lc_collate and lc_ctype are both 1252)? Do I need to create two
indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

By the way, when I tried to leave a comment on the pattern matching
operator doc page I found that after logging in I was presented with a
blank comments form again. When I re-entered my comment and submitted I
got an error indicating that '' is not a valid NUMERIC. Going back to
the original comments form on the pattern matching operator doc page
(having logged in) and submitting there works fine.

--
Craig Ringer

Re: initdb in 8.3

From

Tom Lane

Date:

24 April 2008, 14:02:29

"Christopher Condit" <condit@sdsc.edu> writes:
> Although, once you've built the index with varchar_pattern_ops index,
> the following two (essentially equivalent) queries will run at vastly
> different speeds:
> select * from A where A.value like 'Nacho';
> select * from A where A.value = 'Nacho';

> Seems that the optimizer should catch this...

Yeah, it's on the to-do list ...
http://archives.postgresql.org/pgsql-hackers/2008-02/msg01003.php

            regards, tom lane

Re: initdb in 8.3

From

Zdenek Kotala

Date:

25 April 2008, 10:43:52

Richard Huxton napsal(a):


> I think someone is looking at per-database locales for 8.4 - the issue
> is more tricky than you might think because you need to worry about
> system catalogue sort-order.

There is Google Soc project for implementing collation per database level. I
hope it will appear in 8.4 version.

Catalog sort-order is not affected by locale, because "name" data type uses
different operator then varchar.


        Zdenek