Thread: locale

locale

From

Dennis Bjorklund

Date:

07 April 2004, 08:04:09

Is anyone working to make the locale support in pg better? Running initdb
to set the locale is a bit heavy. It would be nice to at least be able to 
set it per database.

-- 
/Dennis Björklund

Re: locale

From

Bruce Momjian

Date:

07 April 2004, 14:17:32

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running initdb
> to set the locale is a bit heavy. It would be nice to at least be able to 
> set it per database.

Uh, createdb and CREATE DATABASE both have encoding options.  initdb
only sets the encoding for template1, and the default for future
databases, but you can override it.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: locale

From

Bruce Momjian

Date:

07 April 2004, 14:28:58

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running initdb
> to set the locale is a bit heavy. It would be nice to at least be able to 
> set it per database.

Oops, I confused locale and multibyte.  Yes, I don't see a way to change
locale for new databases, but I don't see why we can't.

The initdb manual page says:
      initdb initializes the database cluster's  default  locale      and  character  set  encoding.  Some locale
categoriesare      fixed for the lifetime of the cluster, so it is  important      to  make  the  right  choice  when
runninginitdb.  Other      locale categories can be changed later when the server  is      started.  initdb will write
thoselocale settings into the      postgresql.conf  configuration  file  so  they   are   the      default,  but they
canbe changed by editing that file. To      set the locale that initdb uses, see  the  description  of      the
--localeoption. The character set encoding can be set      separately for each database as  it  is  created.   initdb
  determines  the encoding for the template1 database, which      will serve as the default  for  all  other
databases. To      alter the default encoding use the --encoding option.
 

and
      --locale=locale             Sets  the  default locale for the database cluster.             If this option is
not specified,  the  locale  is             inherited from the environment that initdb runs in.
 
      --lc-collate=locale
      --lc-ctype=locale
      --lc-messages=locale
      --lc-monetary=locale
      --lc-numeric=locale
      --lc-time=locale             Like --locale, but only  sets  the  locale  in  the             specified category.

My only guess is that you can use ALTER DATABASE SET to set some of the
values when someone connects to the database.

Looking at guc.c I see:
   {       {"lc_collate", PGC_INTERNAL, CLIENT_CONN_LOCALE,           gettext_noop("Shows the collation order
locale."),          NULL,           GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE       },       &locale_collate,       "C",
NULL,NULL   },
 
   {       {"lc_ctype", PGC_INTERNAL, CLIENT_CONN_LOCALE,           gettext_noop("Shows the character classification
andcase conversion locale."),           NULL,           GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE       },
&locale_ctype,      "C", NULL, NULL   },
 
   {       {"lc_messages", PGC_SUSET, CLIENT_CONN_LOCALE,           gettext_noop("Sets the language in which messages
aredisplayed."),           NULL       },       &locale_messages,       "", locale_messages_assign, NULL   },
 
   {       {"lc_monetary", PGC_USERSET, CLIENT_CONN_LOCALE,           gettext_noop("Sets the locale for formatting
monetaryamounts."),           NULL       },       &locale_monetary,       "C", locale_monetary_assign, NULL   },
 
   {       {"lc_numeric", PGC_USERSET, CLIENT_CONN_LOCALE,           gettext_noop("Sets the locale for formatting
numbers."),          NULL       },       &locale_numeric,       "C", locale_numeric_assign, NULL   },
 
   {       {"lc_time", PGC_USERSET, CLIENT_CONN_LOCALE,           gettext_noop("Sets the locale for formatting date and
timevalues."),           NULL       },       &locale_time,       "C", locale_time_assign, NULL   },
 


You can't change the internal ones, but you can modify some of the others.

Anyone know why we don't allow locale to be set per database?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: locale

From

Peter Eisentraut

Date:

07 April 2004, 14:32:08

Dennis Bjorklund wrote:
> Is anyone working to make the locale support in pg better? Running
> initdb to set the locale is a bit heavy. It would be nice to at least
> be able to set it per database.

I was supposed to do that but I got distracted.  I send out a longish 
implementation and transition plan some time ago, if you're interested.  
Setting the locale per database is quite doable actually, you just need 
a plan to prevent corruption of the shared system catalogs and you need 
to deal with modifications of the template database(s).  There was some 
discussion about that as well.  See the thread "Translations in the 
distributions" around 2004-01-09.  I can help out if you want to do 
what was discussed there.

Re: locale

From

Andrew Dunstan

Date:

07 April 2004, 14:51:44

Bruce Momjian wrote:

>Dennis Bjorklund wrote:
>  
>
>>Is anyone working to make the locale support in pg better? Running initdb
>>to set the locale is a bit heavy. It would be nice to at least be able to 
>>set it per database.
>>    
>>
>
>Uh, createdb and CREATE DATABASE both have encoding options.  initdb
>only sets the encoding for template1, and the default for future
>databases, but you can override it.
>
>  
>

That is true for encoding, but not true for LC_CTYPE and LC_COLLATE 
locale settings, which only initdb can set.

cheers

andrew

Re: locale

From

Tom Lane

Date:

07 April 2004, 15:29:28

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Anyone know why we don't allow locale to be set per database?

Changing it on the fly would corrupt index sort ordering.  See also
Peter's response nearby.
        regards, tom lane

Re: locale

From

Bruce Momjian

Date:

07 April 2004, 15:34:47

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Anyone know why we don't allow locale to be set per database?
> 
> Changing it on the fly would corrupt index sort ordering.  See also
> Peter's response nearby.

I was asking why we can't set it to a new static value when we create
the database.  I don't think the poster was asking for the ability to
change it after the database was created.

Added to TODO:
* Allow locale to be set at database creation

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: locale

From

Tom Lane

Date:

07 April 2004, 15:40:55

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I was asking why we can't set it to a new static value when we create
> the database.

Because that would corrupt indexes on shared tables.  (It might be
possible to finesse that, but it's not a no-brainer.)
        regards, tom lane

Re: locale

From

Tom Lane

Date:

07 April 2004, 15:42:38

I said:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I was asking why we can't set it to a new static value when we create
>> the database.

> Because that would corrupt indexes on shared tables.  (It might be
> possible to finesse that, but it's not a no-brainer.)

And even more to the point, it would corrupt non-shared indexes
inherited from template1.  This could not be finessed --- AFAICS you'd
need to do the equivalent of a REINDEX in the new database to make it
work.
        regards, tom lane

Re: locale

From

Bruce Momjian

Date:

07 April 2004, 15:42:46

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I was asking why we can't set it to a new static value when we create
> > the database.
> 
> Because that would corrupt indexes on shared tables.  (It might be
> possible to finesse that, but it's not a no-brainer.)

Oh, I hadn't thought of that.  The problem isn't encoding, because we
handle that already, but differen representations of time and stuff?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: locale

From

Tom Lane

Date:

07 April 2004, 15:58:00

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Because that would corrupt indexes on shared tables.  (It might be
>> possible to finesse that, but it's not a no-brainer.)

> Oh, I hadn't thought of that.  The problem isn't encoding, because we
> handle that already, but differen representations of time and stuff?

No, the problem is sort ordering of indexes on textual columns.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

07 April 2004, 15:59:08

On Wed, 7 Apr 2004, Tom Lane wrote:

> And even more to the point, it would corrupt non-shared indexes
> inherited from template1.  This could not be finessed --- AFAICS you'd
> need to do the equivalent of a REINDEX in the new database to make it
> work.

From what I can tell there is only 3 tables we talk about:
 pg_database pg_shadow pg_group

and in each case there is the name column that is indexed (that matters to 
us, int columns are the same no matter what locale).

These name columns all use the special name datatype, maybe one could
simply treat name differently, like comparing these strings bytewise.

For my small databases I don't even need an index on any of these. But I 
can imaging someone having a couple of thousand users.

-- 
/Dennis Björklund

Re: locale

From

Tom Lane

Date:

07 April 2004, 16:01:06

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Added to TODO:
>     * Allow locale to be set at database creation

BTW, that is redundant with the locale todo items already present.
        regards, tom lane

Re: locale

From

Tom Lane

Date:

07 April 2004, 16:07:09

Dennis Bjorklund <db@zigo.dhs.org> writes:
> From what I can tell there is only 3 tables we talk about:
>   pg_database
>   pg_shadow
>   pg_group

If that were so, we'd not have a problem.  The reason we have to tread
very carefully is that we do not know what tables/indexes users might
have added to template1.  If we copy a text index into a new database
and claim that it is sorted by some new locale, we'd be breaking things.

In any case, the whole idea is substantially inferior to the correct
solution, which is per-column locale settings within databases.  That
does what we want, is required functionality per SQL spec, and avoids
problems during CREATE DATABASE.  It's just a tad harder to do :-(
        regards, tom lane

Re: locale

From

Bruce Momjian

Date:

07 April 2004, 16:14:55

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Added to TODO:
> >     * Allow locale to be set at database creation
> 
> BTW, that is redundant with the locale todo items already present.

I see:* Allow locale to be set at database creation* Allow locale on a per-column basis, default to ASCII

The first seems easier than the second.    

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: locale

From

Dennis Bjorklund

Date:

07 April 2004, 16:15:08

On Wed, 7 Apr 2004, Tom Lane wrote:

> If that were so, we'd not have a problem.  The reason we have to tread
> very carefully is that we do not know what tables/indexes users might
> have added to template1.

Aah, now I see the real problem!

> If we copy a text index into a new database and claim that it is sorted
> by some new locale, we'd be breaking things.

How is this handled for encodings? You can very well have something in
template1 in an encoding that is not compatible with the encoding you use 
to create a new database.

Right now I can't imagine how that was solved.

> In any case, the whole idea is substantially inferior to the correct
> solution, which is per-column locale settings within databases.

Of course, but that solution might be many years ahead. Had it been fairly 
easy to create a database with a different locale it would have been 
worth it (and still is if one could come up with some solution).

I have a number of different data directories with different locales, and 
add to that a number of different versions of pg and you can imagine 
what it looks like when I run ps :-)

-- 
/Dennis Björklund

Re: locale

From

Andrew Dunstan

Date:

07 April 2004, 16:35:28

Tom Lane wrote:

>Dennis Bjorklund <db@zigo.dhs.org> writes:
>  
>
>>From what I can tell there is only 3 tables we talk about:
>>  pg_database
>>  pg_shadow
>>  pg_group
>>    
>>
>
>If that were so, we'd not have a problem.  The reason we have to tread
>very carefully is that we do not know what tables/indexes users might
>have added to template1.  If we copy a text index into a new database
>and claim that it is sorted by some new locale, we'd be breaking things.
>  
>

Wouldn't reindex correct that? If so, it could be forced with a flag on 
"create database" maybe, or else some test to compare the two locale 
settings and force it if necessary?

>In any case, the whole idea is substantially inferior to the correct
>solution, which is per-column locale settings within databases.  That
>does what we want, is required functionality per SQL spec, and avoids
>problems during CREATE DATABASE.  It's just a tad harder to do :-(
>  
>

Yeah. But everything higher than the table level can surely be finessed 
with differrent locations / databases. Not having this right (i.e. at 
the column level) is a great pity, to say the least.

cheers

andrew

Re: locale

From

Tom Lane

Date:

07 April 2004, 16:41:08

Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Wed, 7 Apr 2004, Tom Lane wrote:
>> If we copy a text index into a new database and claim that it is sorted
>> by some new locale, we'd be breaking things.

> How is this handled for encodings? You can very well have something in
> template1 in an encoding that is not compatible with the encoding you use 
> to create a new database.

This is likely broken; but that's no excuse for creating similar
breakage for locale settings.  Note that Peter's planned project would
hopefully clean up both of these issues.

In practice, we know that we have seen index failures from altering the
locale settings (back before we installed the code that locks down
LC_COLLATE/LC_CTYPE at initdb time).  I do not recall having heard any
reports of index failures that could be traced to changing encoding.
This may be because strcoll() derives its assumptions about encoding
from the LC_CTYPE setting and doesn't actually know what PG thinks the
encoding is.  So you might have a stored string that is illegal per the
current encoding, but nonetheless it will sort the same as it did in the
mother database.
        regards, tom lane

Re: locale

From

Tom Lane

Date:

07 April 2004, 17:03:19

Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Wed, 7 Apr 2004, Tom Lane wrote:
>> In any case, the whole idea is substantially inferior to the correct
>> solution, which is per-column locale settings within databases.

> Of course, but that solution might be many years ahead.

Peter E. seems to think that it's not an infeasible amount of work.
(See previous discussion that he mentioned earlier in this thread.)

Basically, I'd rather see us tackle that than expend effort on
kluging CREATE DATABASE to allow per-database locales.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

07 April 2004, 17:55:38

On Wed, 7 Apr 2004, Tom Lane wrote:

> >> solution, which is per-column locale settings within databases.
> 
> > Of course, but that solution might be many years ahead.
> 
> Peter E. seems to think that it's not an infeasible amount of work.
> (See previous discussion that he mentioned earlier in this thread.)

I don't know how it should work in theory yet, much less what an 
implementation would look like.

What happens when you have two columns with different locales and try to 
compare them with with the operator <. Is the locale part of the string 
type, like text@sv_SE.UTF-8. What does that do to overloaded functions. 
What would happen when a locale and an encoding does not match. Should one 
just assume that it wont happen.

I've got lots of questions like that, some are probably answered by the 
sql standard and others maybe don't have an answer.

> Basically, I'd rather see us tackle that than expend effort on
> kluging CREATE DATABASE to allow per-database locales.

Don't think for a second that I don't want this. You are an american that
live in a ASCII world and you wants this. You can not imagine how much I
want it :-)

-- 
/Dennis Björklund

Re: locale

From

Tatsuo Ishii

Date:

07 April 2004, 20:22:08

> On Wed, 7 Apr 2004, Tom Lane wrote:
> 
> > If that were so, we'd not have a problem.  The reason we have to tread
> > very carefully is that we do not know what tables/indexes users might
> > have added to template1.
> 
> Aah, now I see the real problem!
> 
> > If we copy a text index into a new database and claim that it is sorted
> > by some new locale, we'd be breaking things.
> 
> How is this handled for encodings? You can very well have something in
> template1 in an encoding that is not compatible with the encoding you use 
> to create a new database.

Are you talking about the sort order? Then there's no problem with
encoding itself.
--
Tatsuo Ishii

Re: locale

From

Dennis Bjorklund

Date:

08 April 2004, 02:09:45

On Thu, 8 Apr 2004, Tatsuo Ishii wrote:

> Are you talking about the sort order? Then there's no problem with
> encoding itself.

The tables in template1 in encoding E1 are compied into the new database 
in encoding E2. Not all encodings are compatable, so you can't even 
convert from E1 to E2.

-- 
/Dennis Björklund

Re: locale

From

Honza Pazdziora

Date:

08 April 2004, 04:09:50

On Wed, Apr 07, 2004 at 03:40:57PM -0400, Tom Lane wrote:
> 
> In practice, we know that we have seen index failures from altering the
> locale settings (back before we installed the code that locks down
> LC_COLLATE/LC_CTYPE at initdb time).  I do not recall having heard any

Cannot the same failure happen if one upgrades their glibc / locales
and the new version implements the locale differently? Perhaps fixing
previous bug, or simply producing different results for strcoll /
strxfrm? If PostgreSQL depends on external locale information for
something as important as indexes, shouldn't it make elementary checks
(upon startup, perhaps) that the current locale settings and the
current locale version produces results compatible with the existing
indexes? And if it does not, reindex?

-- 
------------------------------------------------------------------------Honza Pazdziora | adelton@fi.muni.cz |
http://www.fi.muni.cz/~adelton/.project:Perl, mod_perl, DBI, Oracle, large Web systems, XML/XSL, ...    Only
self-confidentpeople can be simple.

Re: locale

From

Tatsuo Ishii

Date:

08 April 2004, 04:21:44

> > Are you talking about the sort order? Then there's no problem with
> > encoding itself.
> 
> The tables in template1 in encoding E1 are compied into the new database 
> in encoding E2. Not all encodings are compatable, so you can't even 
> convert from E1 to E2.

In this case you just set your terminal encoding to E1, then SELECT
the table. Point is you do not use set client_encoding or \encoding
command. This will work as long as both E1 and E2 are single byte
encodings.
--
Tatsuo Ishii

Re: locale

From

Dennis Bjorklund

Date:

08 April 2004, 04:32:01

On Thu, 8 Apr 2004, Tatsuo Ishii wrote:

> > The tables in template1 in encoding E1 are compied into the new database 
> > in encoding E2. Not all encodings are compatable, so you can't even 
> > convert from E1 to E2.
> 
> In this case you just set your terminal encoding to E1, then SELECT
> the table. Point is you do not use set client_encoding or \encoding
> command. This will work as long as both E1 and E2 are single byte
> encodings.

That is not a solution.

As you said, it does not even work for all encodings. If the database is
in Latin1 I'd expect that the strings in the table are just latin1 and not
something else. And for some multibyte encodings that something else might
not just be the wrong characters but an invalid string (think utf-8).

I can also imagine the indexes being wrong when you keep the encoding of
tables when you create a new database. Since the same character can be
represented differently, the sort order also changes if you try to
interpret something with another encoding then what the compare operator
think it is. That makes the index invalid.

It's simply broken if you ask me.

-- 
/Dennis Björklund

Re: locale

From

Tom Lane

Date:

08 April 2004, 10:56:10

Dennis Bjorklund <db@zigo.dhs.org> writes:
> I can also imagine the indexes being wrong when you keep the encoding of
> tables when you create a new database. Since the same character can be
> represented differently, the sort order also changes if you try to
> interpret something with another encoding then what the compare operator
> think it is. That makes the index invalid.

See my previous point: the index does not actually fail, in our current
implementation, because strcoll() is unaffected by the database's
encoding setting.  You'd be likely to have trouble with I/O translation
and with other encoding-dependent operations like upper()/lower() ...
but not with indexes.

> It's simply broken if you ask me.

It's certainly ungood, but I don't think we can materially improve
things without a fundamental rewrite along the lines of Peter's proposal
to support per-column locale/encoding.  Database-level settings are
simply the wrong tool for this.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

08 April 2004, 11:16:12

On Thu, 8 Apr 2004, Tom Lane wrote:

> See my previous point: the index does not actually fail, in our current
> implementation, because strcoll() is unaffected by the database's
> encoding setting.

How can it be? If I have a utf-8 template1 and a table with an index
sorted according to the utf-8 characters in some locale. Then this table
and index is copied into a Latin1 database. When I interpret these bytes
as Latin1 in the index, the ordering does not have to be the same as it 
was before and the index can not be used.

I don't understand what you mean when you say that strcoll() is unaffected
by the database's encoding setting. It interprets characters, how can it 
not be?

If it works it must be something more going on that I don't
know/understand yet. If I am I would be happy to be corrected, if not we
have a more broken system then we expected before.

The objection to a per database locale is that we can not copy a table 
from the template into the database since the index would not be valid 
anymore. To me that is solvable by just reindexing. The current problem 
with encodings does not look solvable at all to me (except to not copy 
tables when we can not reencode the strings).

-- 
/Dennis Björklund

Re: locale

From

Tom Lane

Date:

08 April 2004, 11:39:31

Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Thu, 8 Apr 2004, Tom Lane wrote:
>> See my previous point: the index does not actually fail, in our current
>> implementation, because strcoll() is unaffected by the database's
>> encoding setting.

> How can it be? If I have a utf-8 template1 and a table with an index
> sorted according to the utf-8 characters in some locale. Then this table
> and index is copied into a Latin1 database. When I interpret these bytes
> as Latin1 in the index, the ordering does not have to be the same as it 
> was before and the index can not be used.

No, the ordering *will* be the same as it was before, because strcoll()
is still functioning the same.  You'd get the same answer from a sort
operation since it depends on the same operators.

Now, you will probably complain that the sort order doesn't appear
correct according to your Latin1 interpretation --- and you're right.
But the index is not corrupt, it is still consistent in its own terms.

> I don't understand what you mean when you say that strcoll() is unaffected
> by the database's encoding setting. It interprets characters, how can it 
> not be?

It interprets them according to LC_CTYPE, which does not change.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

08 April 2004, 12:32:08

On Thu, 8 Apr 2004, Tom Lane wrote:

> No, the ordering *will* be the same as it was before, because strcoll()
> is still functioning the same.  You'd get the same answer from a sort
> operation since it depends on the same operators.
> 
> It interprets them according to LC_CTYPE, which does not change.

I'm afraid that I don't understand you yet, and would like to have
it explained in more detail if possible. While I feel a bit stupid to not 
understand what you are stating, but I'm sure there are more then me who 
feels like that :-)

Maybe we can look at an example. Let us take some utf-8 strings correctly
ordered in swedish
 Åke Ära

now, since these are utf-8 they are encoded as
 c3 85 6b 65        (Åke) c3 84 72 61        (Ära)

and that is the order they have in the index.

Now, this index is copied into a new database where
the encoding is Latin1. Now we want to in the above table
lookup the string that in Latin1 is represented as
  c3 84 72 61

So we look in the index and see that the first row in the index is
not the same. But, now when we compare these strings as latin1 strings
it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings
we compare each character and c3 = c3, and then 84 < 85 (in latin1 84
and 85 are some control characters). Se, we will not find this string
in the index since we think it should have been before the first entry.

We might even insert a new copy of this string in another
position in the index.

So, my question is.

a) What have we gained by copying this table into the latin1 database.  It looks broken to me. As far as I understand
wehave to rebuild  the index to get something that works at least a little.

b) Maybe one should not just reindex but reencode. In some cases that  works and produces good result. For example from
latin1to utf-8.

c) if we are going to reindex anyway, then why not do that and solve the  per database locale also. This is an
independentpoint from a) and b)  that I still want to understand the first two points even if we don't  talk about per
databaselocale.

-- 
/Dennis Björklund

Re: locale

From

Tom Lane

Date:

08 April 2004, 12:47:29

Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Thu, 8 Apr 2004, Tom Lane wrote:
>> No, the ordering *will* be the same as it was before, because strcoll()
>> is still functioning the same.  You'd get the same answer from a sort
>> operation since it depends on the same operators.

> But, now when we compare these strings as latin1 strings
> it's no longer the case that c3 84 72 61 > c3 85 6b 65. As latin1 strings
> we compare each character and c3 = c3, and then 84 < 85 (in latin1 84
> and 85 are some control characters).

You're missing the point: strcoll() is not going to compare them as
latin1 strings.  It's going to interpret the bytes as utf-8 strings,
because that's what LC_CTYPE will tell it to do.  So the sort ordering
of any particular byte string remains the same as it was before, and
the index does not become corrupt.

Whether the index is delivering answers that you find useful is a whole
different question ;-).  For example, if you do a "WHERE col = 'foo'"
type of query, you'll be presenting the latin1 encoding of 'foo', which
may well not equal the utf-8 encoding of 'foo', meaning you won't find
that row even if it exists.  However this would be true whether you used
the index or not --- it's really a data failure and not an index failure.

> a) What have we gained by copying this table into the latin1 database.
>    It looks broken to me.

It looks broken to me too, in terms of user functionality.  I was simply
responding to your assertion that the indexes will be corrupt.  They
won't be.

AFAICS, to support per-database encoding and locale correctly, CREATE
DATABASE would have to be prepared to re-encode *and* re-index every
textual column in the copied database.  I don't really foresee us going
to that much work in order to have a solution that's still half-baked
and non-spec-compliant.  It's much more likely that per-column locale
and encoding will get done instead.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

08 April 2004, 16:19:54

On Thu, 8 Apr 2004, Tom Lane wrote:

> You're missing the point: strcoll() is not going to compare them as
> latin1 strings.  It's going to interpret the bytes as utf-8 strings,
> because that's what LC_CTYPE will tell it to do.

My current understanding of what you are saying now is that LC_CTYPE is
always UTF-8 and all comparisons in the new database are going to be
wrong. This since all strings will be compared as if they where UTF-8.  
LC_CTYPE is per cluster and not per database as some of the other LC_xxxx.

Yes, this actually makes sense. I really hope that this is the way it work
because I think I can understand this. I don't like it, but I can
understand what pg currently do, which is good (unless pg does something
else :-)

Thanks for the explanation.

-- 
/Dennis Björklund

Re: locale

From

Tom Lane

Date:

09 April 2004, 00:25:08

Dennis Bjorklund <db@zigo.dhs.org> writes:
> LC_CTYPE is per cluster and not per database as some of the other LC_xxxx.

Yup, exactly.  If we did not force both LC_COLLATE and LC_CTYPE to have
the same values cluster-wide, then we *would* have index corruption
issues.
        regards, tom lane

Re: locale

From

Dennis Bjorklund

Date:

09 April 2004, 02:18:38

On Thu, 8 Apr 2004, Tom Lane wrote:

> Yup, exactly.  If we did not force both LC_COLLATE and LC_CTYPE to have
> the same values cluster-wide, then we *would* have index corruption
> issues.

We really show warn people that using another encoding in a database then 
what the cluster uses, breaks sorting.

I was under the impression that as long as I've set the right locale when
doing initdb I could then create different databases with different
encodings and it all works, but it does not. I simply trust pg too much
(not without reason since it is an amazing project).

-- 
/Dennis Björklund

Re: locale

From

Peter Eisentraut

Date:

12 April 2004, 14:31:44

Tom Lane wrote:
> It's certainly ungood, but I don't think we can materially improve
> things without a fundamental rewrite along the lines of Peter's
> proposal to support per-column locale/encoding.  Database-level
> settings are simply the wrong tool for this.

Well, the complete redo is about two years out if you ask me.  Allowing 
the locale to be set on a database level would already improve things a 
lot for many people.  Since we have a perfectly good reindex command, I 
think the problems that we have discussed are not showstoppers.

Re: locale

From

Bruce Momjian

Date:

12 April 2004, 15:08:56

Peter Eisentraut wrote:
> Tom Lane wrote:
> > It's certainly ungood, but I don't think we can materially improve
> > things without a fundamental rewrite along the lines of Peter's
> > proposal to support per-column locale/encoding.  Database-level
> > settings are simply the wrong tool for this.
> 
> Well, the complete redo is about two years out if you ask me.  Allowing 
> the locale to be set on a database level would already improve things a 
> lot for many people.  Since we have a perfectly good reindex command, I 
> think the problems that we have discussed are not showstoppers.

I added a TODO item for fixing per-database locales, so we are ready if
someone wants to code it:
o Allow locale to be set at database creation

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073