Re: locale - Mailing list pgsql-hackers

From Dennis Bjorklund
Subject Re: locale
Date
Msg-id Pine.LNX.4.44.0404081729510.4551-100000@zigo.dhs.org
In response to Re: locale  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: locale  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, 8 Apr 2004, Tom Lane wrote:

> No, the ordering *will* be the same as it was before, because strcoll()
> is still functioning the same.  You'd get the same answer from a sort
> operation since it depends on the same operators.
> 
> It interprets them according to LC_CTYPE, which does not change.

I'm afraid I don't understand you yet, and would like to have
it explained in more detail if possible. I feel a bit stupid for not
understanding what you are stating, but I'm sure there are more than me who
feel like that :-)

Maybe we can look at an example. Let us take some UTF-8 strings correctly
ordered in Swedish:
 Åke
 Ära

Now, since these are UTF-8, they are encoded as
 c3 85 6b 65        (Åke)
 c3 84 72 61        (Ära)

and that is the order they have in the index.

Now this index is copied into a new database where
the encoding is Latin1. In the copied table we want to
look up the string that in Latin1 is represented as
  c3 84 72 61

So we look in the index and see that the first row in the index is
not the same. But now, when we compare these strings as Latin1 strings,
it's no longer the case that c3 84 72 61 > c3 85 6b 65. As Latin1 strings
we compare each character: c3 = c3, and then 84 < 85 (in Latin1, 84
and 85 are control characters). So we will not find this string
in the index, since we think it should have come before the first entry.
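
To make the byte-level argument concrete, here is a minimal sketch in Python (not PostgreSQL code). It assumes the comparison in the Latin1 database boils down to comparing character codes one by one, which is only a stand-in for what strcoll() would do in a real Latin1 locale. All variable names are made up for illustration.

```python
# Hypothetical illustration of the example above; not PostgreSQL code.
# The two names, listed in correct Swedish order (A-ring sorts before A-umlaut).
names = ["\u00c5ke", "\u00c4ra"]  # "Åke", "Ära"

# The bytes actually stored when the database encoding is UTF-8.
utf8 = [s.encode("utf-8") for s in names]
hex_dump = [" ".join(f"{b:02x}" for b in raw) for raw in utf8]

# Reinterpret the very same bytes as Latin1: every byte becomes one character.
as_latin1 = [raw.decode("latin-1") for raw in utf8]
first_entry, probe = as_latin1

# Assumption: compare character codes one by one (Python's default for str).
# c3 == c3, then 84 < 85, so the probe sorts BEFORE the first index entry,
# although the index was built with it AFTER that entry.
probe_sorts_before_first = probe < first_entry

print(hex_dump)
print(probe_sorts_before_first)
```

The index order (Åke, then Ära) and the order seen by the Latin1 comparison disagree, which is exactly the situation where a lookup can miss an existing entry.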

We might even insert a new copy of this string in another
position in the index.

So, my questions are:

a) What have we gained by copying this table into the Latin1 database?
   It looks broken to me. As far as I understand, we have to rebuild
   the index to get something that works at least a little.

b) Maybe one should not just reindex but reencode. In some cases that
   works and produces a good result, for example from Latin1 to UTF-8.

c) If we are going to reindex anyway, then why not do that and solve the
   per-database locale as well? This is a point independent of a) and b);
   I still want to understand the first two points even if we don't
   talk about per-database locale.
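
As a rough illustration of point b), the following Python sketch shows why reencoding works in one direction but only sometimes in the other. The helper `reencode` is hypothetical and has nothing to do with how PostgreSQL converts data. The sketch relies only on the fact that every Latin1 character exists in Unicode, while not every Unicode character exists in Latin1.

```python
# Hypothetical helper for illustration only; not a PostgreSQL function.
def reencode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the source encoding and encode them in the target."""
    return data.decode(src).encode(dst)

# "Åke" as stored in a Latin1 database: c5 6b 65.
latin1_ake = bytes([0xC5, 0x6B, 0x65])

# Latin1 -> UTF-8 always succeeds, since every Latin1 character is in Unicode.
utf8_ake = reencode(latin1_ake, "latin-1", "utf-8")

# UTF-8 -> Latin1 can fail: the euro sign (U+20AC) has no Latin1 code.
try:
    reencode("\u20ac".encode("utf-8"), "utf-8", "latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(" ".join(f"{b:02x}" for b in utf8_ake))
print(euro_fits_latin1)
```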
 


-- 
/Dennis Björklund


