Re: [WIP] collation support revisited (phase 1) - Mailing list pgsql-hackers

From Zdenek Kotala
Subject Re: [WIP] collation support revisited (phase 1)
Date
Msg-id 48786510.9080502@sun.com
In response to Re: [WIP] collation support revisited (phase 1)  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses Re: [WIP] collation support revisited (phase 1)  (Martijn van Oosterhout <kleptog@svana.org>)
Re: [WIP] collation support revisited (phase 1)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Alvaro Herrera wrote:
> Zdenek Kotala wrote:
> 
>> The example is when you have translation data (vocabulary) in the database.
>> But the reason is that ANSI specifies (chapter 4.2) the charset as a part of
>> the string descriptor. See below:
>>
>> — The length or maximum length in characters of the character string type.
>> — The catalog name, schema name, and character set name of the character 
>> set of the character string type.
>> — The catalog name, schema name, and collation name of the collation of 
>> the character string type.
> 
> We already support multiple charsets, and are able to do conversions
> between them.  The set of charsets is hardcoded and it's hard to make a
> case that a user needs to create new ones.  I concur with Martijn's
> suggestion -- there's no need for this to appear in a system catalog.
> 
> Perhaps it could be argued that we need to be able to specify the
> charset a given string is in -- currently all strings are in the server
> encoding (charset) which is fixed at initdb time.  Making the system
> support multiple server encodings would be a major undertaking in itself
> and I'm not sure that there's a point.
> 

Background:
We specify the encoding in the initdb phase. ANSI specifies repertoire, charset,
encoding and collation. If I understand it correctly, a charset is a subset of a
repertoire and specifies the list of allowed characters for a
language->collation. An encoding is a mapping of a character set to a binary
format. For example, for the Czech alphabet (charset) we have 6 different 8-bit
(extended ASCII) encodings, but on the other side a single encoding such as UTF8
covers multiple charsets.
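
To illustrate the charset/encoding distinction above, here is a minimal Python
sketch. The sample string and the list of codecs are my own choice for
illustration (they are not the six Czech encodings referred to above, just a
few that Python ships); it only shows that the same characters map to different
byte sequences depending on the encoding.

```python
# Hypothetical sample: two Czech letters that lie outside 7-bit ASCII.
SAMPLE = "šž"

# Python codec names for a few encodings that can represent Czech text.
# This list is an assumption made for illustration only.
ENCODINGS = ["iso8859_2", "cp1250", "cp852", "utf_8"]


def encode_all(text, encodings):
    """Encode the same text with each codec and return {codec: bytes}."""
    return {name: text.encode(name) for name in encodings}


encoded = encode_all(SAMPLE, ENCODINGS)

# Print the byte sequence each encoding produces for the same characters.
for name, raw in encoded.items():
    print(f"{name:10s} {raw.hex(' ')}")
```

The 8-bit encodings use one byte per character here, while UTF8 uses two, and
the byte values differ between the 8-bit encodings even though the characters
are the same.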


I think that if we support the UTF8 encoding, then it makes sense to create our
own charsets, because system locales could have a collation defined for them. We
need conversion only in the case when the client encoding is not compatible with
the charset and a conversion is not defined.
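
As a rough sketch of what "compatible" could mean here, assuming a charset is
modelled simply as a set of characters: a client encoding is compatible if it
can represent every character in the charset. The function name and the sample
charset below are hypothetical, not anything that exists in PostgreSQL.

```python
# Hypothetical charset: the accented letters of the Czech alphabet (lowercase).
CZECH_SAMPLE = set("áčďéěíňóřšťúůýž")


def encoding_covers_charset(charset, encoding):
    """Return True if every character in `charset` is encodable in `encoding`."""
    try:
        "".join(sorted(charset)).encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Check a few client encodings (Python codec names) against the charset.
for client_encoding in ["iso8859_2", "cp1250", "latin_1", "utf_8"]:
    ok = encoding_covers_charset(CZECH_SAMPLE, client_encoding)
    print(client_encoding, "compatible" if ok else "not compatible")
```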
    Any comments?
        Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql


