On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify the encoding in the initdb phase. ANSI specifies repertoire,
> charset, encoding and collation. If I understand it correctly, a charset is
> a subset of a repertoire and specifies the list of characters allowed for
> language->collation. An encoding is a mapping of a character set to a binary
> format. For example, for the Czech alphabet (charset) we have 6 different
> encodings for 8-bit ASCII, but on the other side UTF8 covers multiple charsets.
Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII, then
storing lower('HI') results in an error, because the Turkish lowercase
of 'I' is the dotless 'ı', which is not in ASCII.
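
To make the Turkish case concrete, here is a rough Python sketch. Python's
str.lower() is not locale-aware, so turkish_lower() below is a hypothetical
helper that emulates the Turkish rule by hand, and fits_charset() is only a
stand-in for the kind of check constraint being discussed, not anything
PostgreSQL provides.

```python
def turkish_lower(s: str) -> str:
    # Hypothetical helper: emulate Turkish lowercasing, where 'I' maps to
    # dotless 'ı' (U+0131) and dotted 'İ' (U+0130) maps to plain 'i'.
    # The two special cases are handled before the generic lower().
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()


def fits_charset(s: str, codec: str = "ascii") -> bool:
    # Stand-in for a charset check constraint: the string passes only if
    # every character can be encoded with the given codec.
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


lowered = turkish_lower("HI")        # 'h' followed by dotless 'ı'
print(fits_charset("HI"))            # the input is plain ASCII
print(fits_charset(lowered))         # the lowercased result is not
```

The input passes the ASCII check but its lowercased form does not, which is
the error case described above.
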
A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, but it does mean it can't
throw an error (the usual procedure is to sort by Unicode code point).
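
As a small illustration of the code point fallback, Python happens to
compare str values by code point, so a plain sorted() gives a total order
over any mix of scripts without raising. This only sketches the fallback
idea; it is not how any particular libc implements en_US collation, and the
sample words are made up.

```python
# Sample strings mixing ASCII, Latin-1 and Japanese characters.
words = ["zebra", "\u65e5\u672c", "apple", "\u00e9clair"]

# Python orders str values by Unicode code point, so this never fails,
# even for characters a given locale has no specific rules for.
by_code_point = sorted(words)

# The same ordering, spelled out explicitly as lists of code points.
explicit = sorted(words, key=lambda s: [ord(c) for c in s])

print(by_code_point)
```

The result is not linguistically meaningful ("éclair" lands after "zebra"),
but it is always defined.
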
> I think if we support UTF8 encoding, then it makes sense to create our own
> charsets, because system locales could have defined collation for that. We
> need conversion only in case when client encoding is not compatible with
> charset and conversion is not defined.
The problem is that locales in POSIX are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
than en_US.latin1; it's just that Japanese characters are not
representable in the latter.
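
The encoding side of this can be seen in a short Python sketch: the same
characters map to different bytes under latin-1 and UTF-8, and Japanese text
simply has no latin-1 representation. The sample strings are arbitrary, and
this says nothing about collation itself, only about representability.

```python
text = "caf\u00e9"                      # 'é' exists in both encodings
latin1_bytes = text.encode("latin-1")   # 'é' becomes one byte
utf8_bytes = text.encode("utf-8")       # 'é' becomes two bytes

japanese = "\u65e5\u672c\u8a9e"         # three Japanese characters
try:
    japanese.encode("latin-1")
    japanese_fits_latin1 = True
except UnicodeEncodeError:
    # latin-1 has no byte values assigned to these characters.
    japanese_fits_latin1 = False

# UTF-8 can encode every Unicode character, so this always succeeds.
japanese_utf8 = japanese.encode("utf-8")

print(latin1_bytes, utf8_bytes, japanese_fits_latin1)
```
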
locale-gen can create a locale for any pair of (locale code, encoding);
whether the result is meaningful is another question.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.