Thread: initdb in 8.3
Hi I just recently compiled and installed 8.3.1 on a System that has UTF-8 as the default characterset in the environment. Copied the binaries, run initdb without parameters, the usual stuff. As you probably are all aware of, this results now in a cluster that will only allow you to create UTF-8 databases. I have read some posts regarding this topic where it is explained that allowing LATIN1 on a cluster initialized with UTF-8 will give you problems anyway etc and you have to use locale=C to be able to create databases with different charactersets. My question is: Why then is --locale=C not the default for initdb, as I do regard it as a rather big annoyance that a default installation on probably almost any modern linux distribution results in a UTF-8 only cluster, fixable only by dropping all databases, rerun initdb and the reimporting them again. Apart from that I'd of course like to restate that postgresql is a wonderful software which I like very much. Bye Tim
Tim Tassonis wrote: > Hi > > I just recently compiled and installed 8.3.1 on a System that has UTF-8 > as the default characterset in the environment. Copied the binaries, run > initdb without parameters, the usual stuff. > > As you probably are all aware of, this results now in a cluster that > will only allow you to create UTF-8 databases. I have read some posts > regarding this topic where it is explained that allowing LATIN1 on a > cluster initialized with UTF-8 will give you problems anyway etc and you > have to use locale=C to be able to create databases with different > charactersets. You can only have one locale per installation (initdb'd cluster). You can create a database with any encoding that is compatible with that locale (LATIN1, LATIN9, UTF-8, etc). Now, the locale controls (amongst other things) all your sorting. If you choose a locale of "C" you get a simple binary sorting. By default PG tries to match whatever sorting you have set up on your operating-system. See the difference below: richardh@server3:db$ LANG=C sort /tmp/words.txt apple berry Apple apple apples the apple richardh@server3:db$ LANG=en_GB.UTF-8 sort /tmp/words.txt apple apple Apple apples berry the apple I think someone is looking at per-database locales for 8.4 - the issue is more tricky than you might think because you need to worry about system catalogue sort-order. -- Richard Huxton Archonet Ltd
Am Mittwoch, 23. April 2008 schrieb Tim Tassonis: > My question is: Why then is --locale=C not the default for initdb, as I > do regard it as a rather big annoyance that a default installation on > probably almost any modern linux distribution results in a UTF-8 only > cluster, fixable only by dropping all databases, rerun initdb and the > reimporting them again. Because the vast majority of users prefers UTF-8 encoded databases over C locale databases.
Peter Eisentraut wrote: > Am Mittwoch, 23. April 2008 schrieb Tim Tassonis: >> My question is: Why then is --locale=C not the default for initdb, as I >> do regard it as a rather big annoyance that a default installation on >> probably almost any modern linux distribution results in a UTF-8 only >> cluster, fixable only by dropping all databases, rerun initdb and the >> reimporting them again. > > Because the vast majority of users prefers UTF-8 encoded databases over C > locale databases. Ok, let me put it in another way. If UTF-8 is chosen at initdb, only UTF-8 databases can be created, if C is chosen, you can specify different encodings (UTF-8, LATIN1 etc) for each database. As I understood now, sorting will then still be in C style and not in the locale specific way. Which leads me to the following questions: If specifying a characterset different from the default locale for a database is such a bad idea, why is it possible at all? From how I understand you, if I wanted a postgres server machine supporting databases with different charsets, I'm advised to initialise one cluster per locale. If specifying a characterset different from the default locale for a database is not a bad idea, why does the default install forbid me to do exactly this? Regards Tim
On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote: > Ok, let me put it in another way. If UTF-8 is chosen at initdb, only > UTF-8 databases can be created, if C is chosen, you can specify > different encodings (UTF-8, LATIN1 etc) for each database. > > As I understood now, sorting will then still be in C style and not in > the locale specific way. Which leads me to the following questions: > > If specifying a characterset different from the default locale for a > database is such a bad idea, why is it possible at all? It isn't possible, that's the point. What is possible is that client can use any encoding they like to talk to the server, but the server will store and manage it all in one. What locale C means "I'm an encoding wizard and will ensure all my programs can handle all the encodings I want to use, because I understand the database will treat everything I send as ASCII bytes no matter what encoding the clients say it is". > From how I understand you, if I wanted a postgres server machine > supporting databases with different charsets, I'm advised to initialise > one cluster per locale. If you want to control the *storage* charset, yes. If you just want clients to think it's a LATIN9 DB, doing a: ALTER DATABASE foo SET client_encoding=latin9; > If specifying a characterset different from the default locale for a > database is not a bad idea, why does the default install forbid me to do > exactly this? It is a bad idea, because most normal the C library can only handle one encoding at a time. Locale C is a backdoor because it has system independant semantics and does not require libc. It's also not what people usually want, and so not recommended. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
Am Mittwoch, 23. April 2008 schrieb Tim Tassonis: > If specifying a characterset different from the default locale for a > database is such a bad idea, why is it possible at all? Because Japanese users need this functionality. Aside from spectacularly bizarre niche applications, that is really the only reason. If you are dealing with European-type languages, just forget about it and specify the locale you want to initdb and let the encodings fall into place.
Martijn van Oosterhout wrote: > On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote: >> >> If specifying a characterset different from the default locale for a >> database is such a bad idea, why is it possible at all? > > It isn't possible, that's the point. What is possible is that client > can use any encoding they like to talk to the server, but the server > will store and manage it all in one. What locale C means "I'm an > encoding wizard and will ensure all my programs can handle all the > encodings I want to use, because I understand the database will treat > everything I send as ASCII bytes no matter what encoding the clients > say it is". > >> From how I understand you, if I wanted a postgres server machine >> supporting databases with different charsets, I'm advised to initialise >> one cluster per locale. > > If you want to control the *storage* charset, yes. If you just want > clients to think it's a LATIN9 DB, doing a: > > ALTER DATABASE foo SET client_encoding=latin9; Ok, got it, it's really this setting that's interesting. If I have a legacy application that defaults to latin1, I can leave the DB to UTF-8 ,set the client_encoding to latin1 and then all my selects and inserts can use latin1, but the data will be stored in utf-8. Well, that's really all I need, sorry for the confusion. Thanks a lot Tim
On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote: > As you probably are all aware of, this results now in a cluster that > will only allow you to create UTF-8 databases. I have read some posts > regarding this topic where it is explained that allowing LATIN1 on a > cluster initialized with UTF-8 will give you problems Ain't it the other way round ? Creating UTF8 databases on LATIN1 clusters won't work ? Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
I have a question related to this issue: Now that the locale has changed, it seems that the planner no longer wants to use the indexes for running LIKE queries on varchar columns unless I specify varchar_pattern_ops when creating the index. And if I create the index with varchar_pattern_ops, then the planner won't use it for = queries. What's the correct solution to this problem (when using UTF-8 and lc_collate and lc_ctype are both 1252)? Do I need to create two indexes? Thanks, -Chris -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tim Tassonis Sent: Wednesday, April 23, 2008 12:22 To: Martijn van Oosterhout Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] initdb in 8.3 Martijn van Oosterhout wrote: > On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote: >> >> If specifying a characterset different from the default locale for a >> database is such a bad idea, why is it possible at all? > > It isn't possible, that's the point. What is possible is that client > can use any encoding they like to talk to the server, but the server > will store and manage it all in one. What locale C means "I'm an > encoding wizard and will ensure all my programs can handle all the > encodings I want to use, because I understand the database will treat > everything I send as ASCII bytes no matter what encoding the clients > say it is". > >> From how I understand you, if I wanted a postgres server machine >> supporting databases with different charsets, I'm advised to initialise >> one cluster per locale. > > If you want to control the *storage* charset, yes. If you just want > clients to think it's a LATIN9 DB, doing a: > > ALTER DATABASE foo SET client_encoding=latin9; Ok, got it, it's really this setting that's interesting. If I have a legacy application that defaults to latin1, I can leave the DB to UTF-8 ,set the client_encoding to latin1 and then all my selects and inserts can use latin1, but the data will be stored in utf-8. Well, that's really all I need, sorry for the confusion. Thanks a lot Tim -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Karsten Hilbert wrote: > On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote: > >> As you probably are all aware of, this results now in a cluster that >> will only allow you to create UTF-8 databases. I have read some posts >> regarding this topic where it is explained that allowing LATIN1 on a >> cluster initialized with UTF-8 will give you problems > > Ain't it the other way round ? > > Creating UTF8 databases on LATIN1 clusters won't work ? No it's this way round, but maybe the other way round, too. You can only create databases with arbitrary encodings on C locale clusters from 8.3 on. Regards Tim > > Karsten
Christopher Condit wrote: > I have a question related to this issue: > Now that the locale has changed, it seems that the planner no longer > wants to use the indexes for running LIKE queries on varchar columns > unless I specify varchar_pattern_ops when creating the index. And if I > create the index with varchar_pattern_ops, then the planner won't use > it for = queries. > > What's the correct solution to this problem (when using UTF-8 and > lc_collate and lc_ctype are both 1252)? Do I need to create two indexes? http://www.postgresql.org/docs/faqs.FAQ.html#item4.6 http://www.postgresql.org/docs/8.3/static/indexes-opclass.html It should really also be mentioned in the section on the LIKE operator. By the way, when I tried to leave a comment on the pattern matching operator doc page I found that after logging in I was presented with a blank comments form again. When I re-entered my comment and submitted I got an error indicating that '' is not a valid NUMERIC. Going back to the original comments form on the pattern matching operator doc page (having logged in) and submitting there works fine. -- Craig Ringer
Ahhh - I See. Thanks, Craig. Although, once you've built the index with varchar_pattern_ops index, the following two (essentially equivalent) queries will run at vastly different speeds: select * from A where A.value like 'Nacho'; select * from A where A.value = 'Nacho'; Seems that the optimizer should catch this... C -----Original Message----- From: Craig Ringer [mailto:craig@postnewspapers.com.au] Sent: Thursday, April 24, 2008 01:20 To: Christopher Condit Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] initdb in 8.3 Christopher Condit wrote: > I have a question related to this issue: > Now that the locale has changed, it seems that the planner no longer > wants to use the indexes for running LIKE queries on varchar columns > unless I specify varchar_pattern_ops when creating the index. And if I > create the index with varchar_pattern_ops, then the planner won't use > it for = queries. > > What's the correct solution to this problem (when using UTF-8 and > lc_collate and lc_ctype are both 1252)? Do I need to create two indexes? http://www.postgresql.org/docs/faqs.FAQ.html#item4.6 http://www.postgresql.org/docs/8.3/static/indexes-opclass.html It should really also be mentioned in the section on the LIKE operator. By the way, when I tried to leave a comment on the pattern matching operator doc page I found that after logging in I was presented with a blank comments form again. When I re-entered my comment and submitted I got an error indicating that '' is not a valid NUMERIC. Going back to the original comments form on the pattern matching operator doc page (having logged in) and submitting there works fine. -- Craig Ringer
"Christopher Condit" <condit@sdsc.edu> writes: > Although, once you've built the index with varchar_pattern_ops index, > the following two (essentially equivalent) queries will run at vastly > different speeds: > select * from A where A.value like 'Nacho'; > select * from A where A.value = 'Nacho'; > Seems that the optimizer should catch this... Yeah, it's on the to-do list ... http://archives.postgresql.org/pgsql-hackers/2008-02/msg01003.php regards, tom lane
Richard Huxton napsal(a): > I think someone is looking at per-database locales for 8.4 - the issue > is more tricky than you might think because you need to worry about > system catalogue sort-order. There is Google Soc project for implementing collation per database level. I hope it will appear in 8.4 version. Catalog sort-order is not affected by locale, because "name" data type uses different operator then varchar. Zdenek