From Martijn van Oosterhout
Subject Re: [WIP] collation support revisited (phase 1)
Msg-id 20080712130843.GA12026@svana.org
In response to Re: [WIP] collation support revisited (phase 1)  (Zdenek Kotala <Zdenek.Kotala@Sun.COM>)
Responses Re: [WIP] collation support revisited (phase 1)  (Zdenek Kotala <Zdenek.Kotala@Sun.COM>)
List pgsql-hackers
On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify encoding in initdb phase. ANSI specify repertoire, charset,
> encoding and collation. If I understand it correctly, then charset is
> subset of repertoire and specify list of allowed characters for
> language->collation. Encoding is mapping of character set to binary format.
> For example for Czech alphabet(charset) we have 6 different encoding for
> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII, then
storing lower('HI') results in an error, because Turkish lowercases 'I'
to the dotless 'ı', which is not in ASCII.
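
To make that failure concrete, here is a minimal Python sketch. Python's
built-in str.lower() is not locale-aware, so turkish_lower below is a
hypothetical helper that hand-codes only the two Turkish special cases, and
check_ascii is a stand-in for the imagined "charset ASCII" constraint.
Neither is PostgreSQL code.

```python
# Hypothetical illustration: Turkish lowercasing maps 'I' to dotless 'ı'
# (U+0131) and 'İ' (U+0130) to 'i'. Only these two special cases are handled.
TURKISH_SPECIAL = {"I": "\u0131", "\u0130": "i"}

def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for I and İ (sketch only)."""
    return "".join(TURKISH_SPECIAL.get(ch, ch.lower()) for ch in text)

def check_ascii(text: str) -> str:
    """Stand-in for a 'charset ASCII' column constraint: reject non-ASCII."""
    if not text.isascii():
        raise ValueError(f"character outside ASCII charset in {text!r}")
    return text

lowered = turkish_lower("HI")   # 'h' followed by dotless 'ı'
try:
    check_ascii(lowered)
    stored = True
except ValueError:
    stored = False              # the constraint rejects the value
```

The input 'HI' is pure ASCII, but its Turkish lowercase form is not, which
is why the constraint would fire on the result rather than on the input.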

A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, but it does mean it can't
throw an error (the usual procedure is to sort on Unicode code point).
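
As a rough sketch of that fallback, assuming a toy "locale" that only knows
how to rank the 26 basic Latin letters: KNOWN_RANK and sort_key are
hypothetical names, and real collation implementations are far more
involved. The point is only that every character gets some key, so the
ordering is total and never raises.

```python
# Hypothetical tailoring: ranks for the characters this "locale" knows about.
# Upper and lower case share a rank here, purely to keep the sketch short.
KNOWN_RANK = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def sort_key(text: str):
    """Total ordering: known characters by rank, all others by code point."""
    key = []
    for ch in text:
        rank = KNOWN_RANK.get(ch.lower())
        if rank is not None:
            key.append((0, rank))      # tailored characters sort first
        else:
            key.append((1, ord(ch)))   # fallback: Unicode code point
    return tuple(key)

words = ["ア", "b", "あ", "Apple", "a"]
ordered = sorted(words, key=sort_key)
# ordered == ['a', 'Apple', 'b', 'あ', 'ア']
```

The Japanese strings end up after the Latin ones in code point order, which
is not linguistically meaningful but is well defined.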

> I think if we support UTF8 encoding, than it make sense to create own
> charsets, because system locales could have defined collation for that. We
> need conversion only in case when client encoding is not compatible with
> charset and conversion is not defined.

The problem is that locales in POSIX are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
than en_US.latin1; it's just that Japanese characters are not
representable in the latter.
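
The representability half of that can be sketched with Python's codecs. This
does not touch system locales at all; representable is a hypothetical
helper, and "latin-1" and "utf-8" are Python codec names standing in for the
encodings of the two locales.

```python
def representable(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = ["cafe", "café", "日本語"]
latin1_ok = [representable(s, "latin-1") for s in samples]
utf8_ok = [representable(s, "utf-8") for s in samples]

# Strings that survive both encodings decode to the same characters,
# so any comparison on the decoded text gives the same order either way.
both = [s for s in samples if representable(s, "latin-1")]
same_text = all(
    s.encode("latin-1").decode("latin-1") == s.encode("utf-8").decode("utf-8")
    for s in both
)
```

Only the Japanese sample fails under latin-1; everything that fits in both
encodings is the same text once decoded.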

locale-gen can create a locale for any pair of (locale code, encoding);
whether the result is meaningful is another question.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
