Re: [WIP] collation support revisited (phase 1) - Mailing list pgsql-hackers

From Zdenek Kotala
Subject Re: [WIP] collation support revisited (phase 1)
Msg-id 488612DE.5060206@sun.com
In response to Re: [WIP] collation support revisited (phase 1)  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Martijn van Oosterhout wrote:
> On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
>> I was trying to sort out the problem with not creating new catalog for
>> character sets and I came up with the following ideas. Correct me if my ideas are
>> wrong.
>>
>> Since collation has to have a defined character set.
> 
> Not really. AIUI at least glibc and ICU define a collation over all
> possible characters (ie unicode). When you create a locale you take a
> subset and use that. Think about it: if you want to sort strings and
> one of them happens to contain a Chinese character, it can't *fail*.
> Note strcoll() has no error return for unknown characters.

It does, although through errno rather than the return value.
See http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

The strcoll() function may fail if:
    [EINVAL]  [CX]  The s1 or s2 arguments contain characters outside the domain
    of the collating sequence.
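
Because strcoll() reserves no return value for errors, a caller that cares has to clear errno before the call and inspect it afterwards. A minimal sketch of that check follows. The helper name checked_strcoll() is hypothetical and not part of PostgreSQL or libc, and whether a given libc ever sets EINVAL here is implementation-dependent.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Compares s1 and s2 with strcoll() under the current LC_COLLATE setting.
 * Returns 0 and stores the comparison result in *result on success.
 * Returns -1 if strcoll() set errno (for example to EINVAL); in that case
 * errno is left set so the caller can inspect it, and *result is untouched.
 */
int
checked_strcoll(const char *s1, const char *s2, int *result)
{
    int saved_errno = errno;    /* restore on success so we do not clobber it */
    int cmp;

    assert(s1 != NULL && s2 != NULL && result != NULL);

    errno = 0;                  /* strcoll() has no error return value */
    cmp = strcoll(s1, s2);
    if (errno != 0)
        return -1;

    errno = saved_errno;
    *result = cmp;
    return 0;
}
```

A caller would treat a -1 return as "these strings could not be collated" instead of trusting the comparison result.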


>> I'm suggesting to use
>> already written infrastructure of encodings and to use list of encodings in
>> chklocale.c. Currently databases are not created with specified character
>> set but with specified encoding. I think instead of pointing a record in
>> collation catalog to another record in character set catalog we might use
>> only name (string) of the encoding.
> 
> That's reasonable. From an abstract point of view collations and
> encodings are orthogonal, it's only when you're using POSIX locales
> that there are limitations on how you combine them. I think you can
> assume a collation can handle any characters that can be produced by
> encoding.

I don't think that is correct. You cannot use one collation over all of Unicode. See
http://www.unicode.org/reports/tr10/#Common_Misperceptions. The same characters can
be ordered differently in different languages.
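
To illustrate, here is a rough sketch that compares the same two strings under a caller-chosen LC_COLLATE locale. The helper name strcoll_in_locale() is hypothetical, and locale names such as "cs_CZ.UTF-8" or "en_US.UTF-8" are assumptions that may not be installed on a given system. The helper reports that case by returning 0 on libcs whose setlocale() rejects unknown names.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Temporarily switches LC_COLLATE to locale_name, compares s1 and s2 with
 * strcoll(), then restores the previous LC_COLLATE setting.
 * Returns 1 and stores the comparison in *result on success.
 * Returns 0 if the locale could not be selected.
 */
int
strcoll_in_locale(const char *locale_name, const char *s1, const char *s2,
                  int *result)
{
    char        saved[256];
    const char *current;

    assert(locale_name != NULL && s1 != NULL && s2 != NULL && result != NULL);

    /* The returned string may be overwritten by later calls, so copy it. */
    current = setlocale(LC_COLLATE, NULL);
    if (current == NULL || strlen(current) >= sizeof(saved))
        return 0;
    strcpy(saved, current);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 0;               /* locale not available */

    *result = strcoll(s1, s2);

    setlocale(LC_COLLATE, saved);
    return 1;
}
```

For example, Czech treats "ch" as a single letter that sorts after "h", so on a system with both locales installed one would expect strcoll_in_locale("cs_CZ.UTF-8", "chata", "cibule", &r) and strcoll_in_locale("en_US.UTF-8", "chata", "cibule", &r) to give results of opposite sign.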
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql


