Thread: locale
Is anyone working to make the locale support in pg better? Running initdb to set the locale is a bit heavy. It would be nice to at least be able to set it per database. -- /Dennis Björklund
Dennis Bjorklund wrote: > Is anyone working to make the locale support in pg better? Running initdb > to set the locale is a bit heavy. It would be nice to at least be able to > set it per database. Uh, createdb and CREATE DATABASE both have encoding options. initdb only sets the encoding for template1, and the default for future databases, but you can override it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Dennis Bjorklund wrote: > Is anyone working to make the locale support in pg better? Running initdb > to set the locale is a bit heavy. It would be nice to at least be able to > set it per database. Oops, I confused locale and multibyte. Yes, I don't see a way to change locale for new databases, but I don't see why we can't. The initdb manual page says: initdb initializes the database cluster's default locale and character set encoding. Some locale categoriesare fixed for the lifetime of the cluster, so it is important to make the right choice when runninginitdb. Other locale categories can be changed later when the server is started. initdb will write thoselocale settings into the postgresql.conf configuration file so they are the default, but they canbe changed by editing that file. To set the locale that initdb uses, see the description of the --localeoption. The character set encoding can be set separately for each database as it is created. initdb determines the encoding for the template1 database, which will serve as the default for all other databases. To alter the default encoding use the --encoding option. and --locale=locale Sets the default locale for the database cluster. If this option is not specified, the locale is inherited from the environment that initdb runs in. --lc-collate=locale --lc-ctype=locale --lc-messages=locale --lc-monetary=locale --lc-numeric=locale --lc-time=locale Like --locale, but only sets the locale in the specified category. My only guess is that you can use ALTER DATABASE SET to set some of the values when someone connects to the database. Looking at guc.c I see: { {"lc_collate", PGC_INTERNAL, CLIENT_CONN_LOCALE, gettext_noop("Shows the collation order locale."), NULL, GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE }, &locale_collate, "C", NULL,NULL }, { {"lc_ctype", PGC_INTERNAL, CLIENT_CONN_LOCALE, gettext_noop("Shows the character classification andcase conversion locale."), NULL, GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE }, &locale_ctype, "C", NULL, NULL }, { {"lc_messages", PGC_SUSET, CLIENT_CONN_LOCALE, gettext_noop("Sets the language in which messages aredisplayed."), NULL }, &locale_messages, "", locale_messages_assign, NULL }, { {"lc_monetary", PGC_USERSET, CLIENT_CONN_LOCALE, gettext_noop("Sets the locale for formatting monetaryamounts."), NULL }, &locale_monetary, "C", locale_monetary_assign, NULL }, { {"lc_numeric", PGC_USERSET, CLIENT_CONN_LOCALE, gettext_noop("Sets the locale for formatting numbers."), NULL }, &locale_numeric, "C", locale_numeric_assign, NULL }, { {"lc_time", PGC_USERSET, CLIENT_CONN_LOCALE, gettext_noop("Sets the locale for formatting date and timevalues."), NULL }, &locale_time, "C", locale_time_assign, NULL }, You can't change the internal ones, but you can modify some of the others. Anyone know why we don't allow locale to be set per database? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Dennis Bjorklund wrote: > Is anyone working to make the locale support in pg better? Running > initdb to set the locale is a bit heavy. It would be nice to at least > be able to set it per database. I was supposed to do that but I got distracted. I send out a longish implementation and transition plan some time ago, if you're interested. Setting the locale per database is quite doable actually, you just need a plan to prevent corruption of the shared system catalogs and you need to deal with modifications of the template database(s). There was some discussion about that as well. See the thread "Translations in the distributions" around 2004-01-09. I can help out if you want to do what was discussed there.
Bruce Momjian wrote: >Dennis Bjorklund wrote: > > >>Is anyone working to make the locale support in pg better? Running initdb >>to set the locale is a bit heavy. It would be nice to at least be able to >>set it per database. >> >> > >Uh, createdb and CREATE DATABASE both have encoding options. initdb >only sets the encoding for template1, and the default for future >databases, but you can override it. > > > That is true for encoding, but not true for LC_CTYPE and LC_COLLATE locale settings, which only initdb can set. cheers andrew
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Anyone know why we don't allow locale to be set per database? Changing it on the fly would corrupt index sort ordering. See also Peter's response nearby. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Anyone know why we don't allow locale to be set per database? > > Changing it on the fly would corrupt index sort ordering. See also > Peter's response nearby. I was asking why we can't set it to a new static value when we create the database. I don't think the poster was asking for the ability to change it after the database was created. Added to TODO: * Allow locale to be set at database creation -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I was asking why we can't set it to a new static value when we create > the database. Because that would corrupt indexes on shared tables. (It might be possible to finesse that, but it's not a no-brainer.) regards, tom lane
I said: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> I was asking why we can't set it to a new static value when we create >> the database. > Because that would corrupt indexes on shared tables. (It might be > possible to finesse that, but it's not a no-brainer.) And even more to the point, it would corrupt non-shared indexes inherited from template1. This could not be finessed --- AFAICS you'd need to do the equivalent of a REINDEX in the new database to make it work. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I was asking why we can't set it to a new static value when we create > > the database. > > Because that would corrupt indexes on shared tables. (It might be > possible to finesse that, but it's not a no-brainer.) Oh, I hadn't thought of that. The problem isn't encoding, because we handle that already, but differen representations of time and stuff? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Because that would corrupt indexes on shared tables. (It might be >> possible to finesse that, but it's not a no-brainer.) > Oh, I hadn't thought of that. The problem isn't encoding, because we > handle that already, but differen representations of time and stuff? No, the problem is sort ordering of indexes on textual columns. regards, tom lane
On Wed, 7 Apr 2004, Tom Lane wrote: > And even more to the point, it would corrupt non-shared indexes > inherited from template1. This could not be finessed --- AFAICS you'd > need to do the equivalent of a REINDEX in the new database to make it > work. From what I can tell there is only 3 tables we talk about: pg_database pg_shadow pg_group and in each case there is the name column that is indexed (that matters to us, int columns are the same no matter what locale). These name columns all use the special name datatype, maybe one could simply treat name differently, like comparing these strings bytewise. For my small databases I don't even need an index on any of these. But I can imaging someone having a couple of thousand users. -- /Dennis Björklund
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Added to TODO: > * Allow locale to be set at database creation BTW, that is redundant with the locale todo items already present. regards, tom lane
Dennis Bjorklund <db@zigo.dhs.org> writes: > From what I can tell there is only 3 tables we talk about: > pg_database > pg_shadow > pg_group If that were so, we'd not have a problem. The reason we have to tread very carefully is that we do not know what tables/indexes users might have added to template1. If we copy a text index into a new database and claim that it is sorted by some new locale, we'd be breaking things. In any case, the whole idea is substantially inferior to the correct solution, which is per-column locale settings within databases. That does what we want, is required functionality per SQL spec, and avoids problems during CREATE DATABASE. It's just a tad harder to do :-( regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Added to TODO: > > * Allow locale to be set at database creation > > BTW, that is redundant with the locale todo items already present. I see:* Allow locale to be set at database creation* Allow locale on a per-column basis, default to ASCII The first seems easier than the second. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, 7 Apr 2004, Tom Lane wrote: > If that were so, we'd not have a problem. The reason we have to tread > very carefully is that we do not know what tables/indexes users might > have added to template1. Aah, now I see the real problem! > If we copy a text index into a new database and claim that it is sorted > by some new locale, we'd be breaking things. How is this handled for encodings? You can very well have something in template1 in an encoding that is not compatible with the encoding you use to create a new database. Right now I can't imagine how that was solved. > In any case, the whole idea is substantially inferior to the correct > solution, which is per-column locale settings within databases. Of course, but that solution might be many years ahead. Had it been fairly easy to create a database with a different locale it would have been worth it (and still is if one could come up with some solution). I have a number of different data directories with different locales, and add to that a number of different versions of pg and you can imagine what it looks like when I run ps :-) -- /Dennis Björklund
Tom Lane wrote: >Dennis Bjorklund <db@zigo.dhs.org> writes: > > >>From what I can tell there is only 3 tables we talk about: >> pg_database >> pg_shadow >> pg_group >> >> > >If that were so, we'd not have a problem. The reason we have to tread >very carefully is that we do not know what tables/indexes users might >have added to template1. If we copy a text index into a new database >and claim that it is sorted by some new locale, we'd be breaking things. > > Wouldn't reindex correct that? If so, it could be forced with a flag on "create database" maybe, or else some test to compare the two locale settings and force it if necessary? >In any case, the whole idea is substantially inferior to the correct >solution, which is per-column locale settings within databases. That >does what we want, is required functionality per SQL spec, and avoids >problems during CREATE DATABASE. It's just a tad harder to do :-( > > Yeah. But everything higher than the table level can surely be finessed with differrent locations / databases. Not having this right (i.e. at the column level) is a great pity, to say the least. cheers andrew
Dennis Bjorklund <db@zigo.dhs.org> writes: > On Wed, 7 Apr 2004, Tom Lane wrote: >> If we copy a text index into a new database and claim that it is sorted >> by some new locale, we'd be breaking things. > How is this handled for encodings? You can very well have something in > template1 in an encoding that is not compatible with the encoding you use > to create a new database. This is likely broken; but that's no excuse for creating similar breakage for locale settings. Note that Peter's planned project would hopefully clean up both of these issues. In practice, we know that we have seen index failures from altering the locale settings (back before we installed the code that locks down LC_COLLATE/LC_CTYPE at initdb time). I do not recall having heard any reports of index failures that could be traced to changing encoding. This may be because strcoll() derives its assumptions about encoding from the LC_CTYPE setting and doesn't actually know what PG thinks the encoding is. So you might have a stored string that is illegal per the current encoding, but nonetheless it will sort the same as it did in the mother database. regards, tom lane
Dennis Bjorklund <db@zigo.dhs.org> writes: > On Wed, 7 Apr 2004, Tom Lane wrote: >> In any case, the whole idea is substantially inferior to the correct >> solution, which is per-column locale settings within databases. > Of course, but that solution might be many years ahead. Peter E. seems to think that it's not an infeasible amount of work. (See previous discussion that he mentioned earlier in this thread.) Basically, I'd rather see us tackle that than expend effort on kluging CREATE DATABASE to allow per-database locales. regards, tom lane
On Wed, 7 Apr 2004, Tom Lane wrote: > >> solution, which is per-column locale settings within databases. > > > Of course, but that solution might be many years ahead. > > Peter E. seems to think that it's not an infeasible amount of work. > (See previous discussion that he mentioned earlier in this thread.) I don't know how it should work in theory yet, much less what an implementation would look like. What happens when you have two columns with different locales and try to compare them with with the operator <. Is the locale part of the string type, like text@sv_SE.UTF-8. What does that do to overloaded functions. What would happen when a locale and an encoding does not match. Should one just assume that it wont happen. I've got lots of questions like that, some are probably answered by the sql standard and others maybe don't have an answer. > Basically, I'd rather see us tackle that than expend effort on > kluging CREATE DATABASE to allow per-database locales. Don't think for a second that I don't want this. You are an american that live in a ASCII world and you wants this. You can not imagine how much I want it :-) -- /Dennis Björklund
> On Wed, 7 Apr 2004, Tom Lane wrote: > > > If that were so, we'd not have a problem. The reason we have to tread > > very carefully is that we do not know what tables/indexes users might > > have added to template1. > > Aah, now I see the real problem! > > > If we copy a text index into a new database and claim that it is sorted > > by some new locale, we'd be breaking things. > > How is this handled for encodings? You can very well have something in > template1 in an encoding that is not compatible with the encoding you use > to create a new database. Are you talking about the sort order? Then there's no problem with encoding itself. -- Tatsuo Ishii
On Thu, 8 Apr 2004, Tatsuo Ishii wrote: > Are you talking about the sort order? Then there's no problem with > encoding itself. The tables in template1 in encoding E1 are compied into the new database in encoding E2. Not all encodings are compatable, so you can't even convert from E1 to E2. -- /Dennis Björklund
On Wed, Apr 07, 2004 at 03:40:57PM -0400, Tom Lane wrote: > > In practice, we know that we have seen index failures from altering the > locale settings (back before we installed the code that locks down > LC_COLLATE/LC_CTYPE at initdb time). I do not recall having heard any Cannot the same failure happen if one upgrades their glibc / locales and the new version implements the locale differently? Perhaps fixing previous bug, or simply producing different results for strcoll / strxfrm? If PostgreSQL depends on external locale information for something as important as indexes, shouldn't it make elementary checks (upon startup, perhaps) that the current locale settings and the current locale version produces results compatible with the existing indexes? And if it does not, reindex? -- ------------------------------------------------------------------------Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/.project:Perl, mod_perl, DBI, Oracle, large Web systems, XML/XSL, ... Only self-confidentpeople can be simple.
> > Are you talking about the sort order? Then there's no problem with > > encoding itself. > > The tables in template1 in encoding E1 are compied into the new database > in encoding E2. Not all encodings are compatable, so you can't even > convert from E1 to E2. In this case you just set your terminal encoding to E1, then SELECT the table. Point is you do not use set client_encoding or \encoding command. This will work as long as both E1 and E2 are single byte encodings. -- Tatsuo Ishii
On Thu, 8 Apr 2004, Tatsuo Ishii wrote: > > The tables in template1 in encoding E1 are compied into the new database > > in encoding E2. Not all encodings are compatable, so you can't even > > convert from E1 to E2. > > In this case you just set your terminal encoding to E1, then SELECT > the table. Point is you do not use set client_encoding or \encoding > command. This will work as long as both E1 and E2 are single byte > encodings. That is not a solution. As you said, it does not even work for all encodings. If the database is in Latin1 I'd expect that the strings in the table are just latin1 and not something else. And for some multibyte encodings that something else might not just be the wrong characters but an invalid string (think utf-8). I can also imagine the indexes being wrong when you keep the encoding of tables when you create a new database. Since the same character can be represented differently, the sort order also changes if you try to interpret something with another encoding then what the compare operator think it is. That makes the index invalid. It's simply broken if you ask me. -- /Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes: > I can also imagine the indexes being wrong when you keep the encoding of > tables when you create a new database. Since the same character can be > represented differently, the sort order also changes if you try to > interpret something with another encoding then what the compare operator > think it is. That makes the index invalid. See my previous point: the index does not actually fail, in our current implementation, because strcoll() is unaffected by the database's encoding setting. You'd be likely to have trouble with I/O translation and with other encoding-dependent operations like upper()/lower() ... but not with indexes. > It's simply broken if you ask me. It's certainly ungood, but I don't think we can materially improve things without a fundamental rewrite along the lines of Peter's proposal to support per-column locale/encoding. Database-level settings are simply the wrong tool for this. regards, tom lane
On Thu, 8 Apr 2004, Tom Lane wrote: > See my previous point: the index does not actually fail, in our current > implementation, because strcoll() is unaffected by the database's > encoding setting. How can it be? If I have a utf-8 template1 and a table with an index sorted according to the utf-8 characters in some locale. Then this table and index is copied into a Latin1 database. When I interpret these bytes as Latin1 in the index, the ordering does not have to be the same as it was before and the index can not be used. I don't understand what you mean when you say that strcoll() is unaffected by the database's encoding setting. It interprets characters, how can it not be? If it works it must be something more going on that I don't know/understand yet. If I am I would be happy to be corrected, if not we have a more broken system then we expected before. The objection to a per database locale is that we can not copy a table from the template into the database since the index would not be valid anymore. To me that is solvable by just reindexing. The current problem with encodings does not look solvable at all to me (except to not copy tables when we can not reencode the strings). -- /Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes: > On Thu, 8 Apr 2004, Tom Lane wrote: >> See my previous point: the index does not actually fail, in our current >> implementation, because strcoll() is unaffected by the database's >> encoding setting. > How can it be? If I have a utf-8 template1 and a table with an index > sorted according to the utf-8 characters in some locale. Then this table > and index is copied into a Latin1 database. When I interpret these bytes > as Latin1 in the index, the ordering does not have to be the same as it > was before and the index can not be used. No, the ordering *will* be the same as it was before, because strcoll() is still functioning the same. You'd get the same answer from a sort operation since it depends on the same operators. Now, you will probably complain that the sort order doesn't appear correct according to your Latin1 interpretation --- and you're right. But the index is not corrupt, it is still consistent in its own terms. > I don't understand what you mean when you say that strcoll() is unaffected > by the database's encoding setting. It interprets characters, how can it > not be? It interprets them according to LC_CTYPE, which does not change. regards, tom lane
On Thu, 8 Apr 2004, Tom Lane wrote: > No, the ordering *will* be the same as it was before, because strcoll() > is still functioning the same. You'd get the same answer from a sort > operation since it depends on the same operators. > > It interprets them according to LC_CTYPE, which does not change. I'm afraid that I don't understand you yet, and would like to have it explained in more detail if possible. While I feel a bit stupid to not understand what you are stating, but I'm sure there are more then me who feels like that :-) Maybe we can look at an example. Let us take some utf-8 strings correctly ordered in swedish Åke Ära now, since these are utf-8 they are encoded as c3 85 6b 65 (Åke) c3 84 72 61 (Ära) and that is the order they have in the index. Now, this index is copied into a new database where the encoding is Latin1. Now we want to in the above table lookup the string that in Latin1 is represented as c3 84 72 61 So we look in the index and see that the first row in the index is not the same. But, now when we compare these strings as latin1 strings it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings we compare each character and c3 = c3, and then 84 < 85 (in latin1 84 and 85 are some control characters). Se, we will not find this string in the index since we think it should have been before the first entry. We might even insert a new copy of this string in another position in the index. So, my question is. a) What have we gained by copying this table into the latin1 database. It looks broken to me. As far as I understand wehave to rebuild the index to get something that works at least a little. b) Maybe one should not just reindex but reencode. In some cases that works and produces good result. For example from latin1to utf-8. c) if we are going to reindex anyway, then why not do that and solve the per database locale also. This is an independentpoint from a) and b) that I still want to understand the first two points even if we don't talk about per databaselocale. -- /Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes: > On Thu, 8 Apr 2004, Tom Lane wrote: >> No, the ordering *will* be the same as it was before, because strcoll() >> is still functioning the same. You'd get the same answer from a sort >> operation since it depends on the same operators. > But, now when we compare these strings as latin1 strings > it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings > we compare each character and c3 = c3, and then 84 < 85 (in latin1 84 > and 85 are some control characters). You're missing the point: strcoll() is not going to compare them as latin1 strings. It's going to interpret the bytes as utf-8 strings, because that's what LC_CTYPE will tell it to do. So the sort ordering of any particular byte string remains the same as it was before, and the index does not become corrupt. Whether the index is delivering answers that you find useful is a whole different question ;-). For example, if you do a "WHERE col = 'foo'" type of query, you'll be presenting the latin1 encoding of 'foo', which may well not equal the utf-8 encoding of 'foo', meaning you won't find that row even if it exists. However this would be true whether you used the index or not --- it's really a data failure and not an index failure. > a) What have we gained by copying this table into the latin1 database. > It looks broken to me. It looks broken to me too, in terms of user functionality. I was simply responding to your assertion that the indexes will be corrupt. They won't be. AFAICS, to support per-database encoding and locale correctly, CREATE DATABASE would have to be prepared to re-encode *and* re-index every textual column in the copied database. I don't really foresee us going to that much work in order to have a solution that's still half-baked and non-spec-compliant. It's much more likely that per-column locale and encoding will get done instead. regards, tom lane
On Thu, 8 Apr 2004, Tom Lane wrote: > You're missing the point: strcoll() is not going to compare them as > latin1 strings. It's going to interpret the bytes as utf-8 strings, > because that's what LC_CTYPE will tell it to do. My current understanding of what you are saying now is that LC_CTYPE is always UTF-8 and all comparisons in the new database are going to be wrong. This since all strings will be compared as if they where UTF-8. LC_CTYPE is per cluster and not per database as some of the other LC_xxxx. Yes, this actually makes sense. I really hope that this is the way it work because I think I can understand this. I don't like it, but I can understand what pg currently do, which is good (unless pg does something else :-) Thanks for the explanation. -- /Dennis Björklund
Dennis Bjorklund <db@zigo.dhs.org> writes: > LC_CTYPE is per cluster and not per database as some of the other LC_xxxx. Yup, exactly. If we did not force both LC_COLLATE and LC_CTYPE to have the same values cluster-wide, then we *would* have index corruption issues. regards, tom lane
On Thu, 8 Apr 2004, Tom Lane wrote: > Yup, exactly. If we did not force both LC_COLLATE and LC_CTYPE to have > the same values cluster-wide, then we *would* have index corruption > issues. We really show warn people that using another encoding in a database then what the cluster uses, breaks sorting. I was under the impression that as long as I've set the right locale when doing initdb I could then create different databases with different encodings and it all works, but it does not. I simply trust pg too much (not without reason since it is an amazing project). -- /Dennis Björklund
Tom Lane wrote: > It's certainly ungood, but I don't think we can materially improve > things without a fundamental rewrite along the lines of Peter's > proposal to support per-column locale/encoding. Database-level > settings are simply the wrong tool for this. Well, the complete redo is about two years out if you ask me. Allowing the locale to be set on a database level would already improve things a lot for many people. Since we have a perfectly good reindex command, I think the problems that we have discussed are not showstoppers.
Peter Eisentraut wrote: > Tom Lane wrote: > > It's certainly ungood, but I don't think we can materially improve > > things without a fundamental rewrite along the lines of Peter's > > proposal to support per-column locale/encoding. Database-level > > settings are simply the wrong tool for this. > > Well, the complete redo is about two years out if you ask me. Allowing > the locale to be set on a database level would already improve things a > lot for many people. Since we have a perfectly good reindex command, I > think the problems that we have discussed are not showstoppers. I added a TODO item for fixing per-database locales, so we are ready if someone wants to code it: o Allow locale to be set at database creation -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073