
Re: [WIP] collation support revisited (phase 1) - Mailing list pgsql-hackers

From	Zdenek Kotala
Subject	Re: [WIP] collation support revisited (phase 1)
Date	July 12, 2008 08:05:41
Msg-id	48786510.9080502@sun.com
In response to	Re: [WIP] collation support revisited (phase 1) (Alvaro Herrera <alvherre@commandprompt.com>)
Responses	Re: [WIP] collation support revisited (phase 1) Re: [WIP] collation support revisited (phase 1)
List	pgsql-hackers


Alvaro Herrera wrote:
> Zdenek Kotala wrote:
> 
>> The example is when you have translation data (vocabulary) in the database. 
>> But the reason is that ANSI specifies (chapter 4.2) the charset as part of 
>> the string descriptor. See below:
>>
>> — The length or maximum length in characters of the character string type.
>> — The catalog name, schema name, and character set name of the character 
>> set of the character string type.
>> — The catalog name, schema name, and collation name of the collation of 
>> the character string type.
> 
> We already support multiple charsets, and are able to do conversions
> between them.  The set of charsets is hardcoded and it's hard to make a
> case that a user needs to create new ones.  I concur with Martijn's
> suggestion -- there's no need for this to appear in a system catalog.
> 
> Perhaps it could be argued that we need to be able to specify the
> charset a given string is in -- currently all strings are in the server
> encoding (charset) which is fixed at initdb time.  Making the system
> support multiple server encodings would be a major undertaking in itself
> and I'm not sure that there's a point.
> 

Background:
We specify the encoding at initdb time. ANSI specifies repertoire, charset, 
encoding and collation. If I understand it correctly, a charset is a subset of 
a repertoire and specifies the list of characters allowed for a language (and 
hence for its collation). An encoding is the mapping of a character set to a 
binary format. For example, for the Czech alphabet (charset) we have 6 different 
8-bit (extended ASCII) encodings, but on the other hand a single encoding such 
as UTF8 covers multiple charsets.
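
To make the charset/encoding distinction concrete, here is a minimal sketch in Python. It encodes one Czech letter under several encodings and shows that the character stays the same while the bytes differ. The list of encodings is my own illustrative choice (it is not the list of six referred to above), and encode_in_all is a hypothetical helper name.

```python
# Sketch only: the encoding list below is an illustrative assumption,
# not an authoritative list of Czech encodings.
CZECH_SAMPLE = "š"  # U+0161, LATIN SMALL LETTER S WITH CARON

ENCODINGS = ["iso8859_2", "cp1250", "cp852", "mac_latin2", "utf-8"]


def encode_in_all(text, encodings=ENCODINGS):
    """Return {encoding name: bytes} for the same text under each encoding."""
    return {enc: text.encode(enc) for enc in encodings}


if __name__ == "__main__":
    # Same character (same charset member), different binary representation.
    for enc, raw in encode_in_all(CZECH_SAMPLE).items():
        print(f"{enc:12} {raw.hex()}")
```

Decoding each byte string with its own encoding gives back the same character, which is the sense in which the charset is independent of the encoding.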


I think that if we support the UTF8 encoding, then it makes sense to be able to 
create our own charsets, because system locales could have a collation defined 
for them. We need a conversion only when the client encoding is not compatible 
with the charset and a conversion is not defined.
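
As a rough sketch of what "compatible" could mean here, the Python below treats a charset as a plain set of characters and checks whether every member can be represented in a given client encoding. This is only my reading of the idea, not how PostgreSQL works; fits_encoding and CZECH_LETTERS are hypothetical names.

```python
def fits_encoding(charset, encoding):
    """Return True if every character in `charset` can be encoded in `encoding`."""
    try:
        "".join(sorted(charset)).encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical "charset": only the Czech letters with diacritics (lower case).
CZECH_LETTERS = set("áčďéěíňóřšťúůýž")

if __name__ == "__main__":
    for enc in ("iso8859_2", "latin_1", "utf-8"):
        print(enc, fits_encoding(CZECH_LETTERS, enc))
```

Under this reading, a client using ISO-8859-2 or UTF8 can represent the charset directly, while a LATIN1 client cannot.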
    Any comments?
        Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
