Re: [WIP] patch - Collation at database level - Mailing list pgsql-hackers

From Radek Strnad
Subject Re: [WIP] patch - Collation at database level
Date
Msg-id de5165440808020639j4e7226bbu32e840e225e15c3e@mail.gmail.com
In response to Re: [WIP] patch - Collation at database level  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: [WIP] patch - Collation at database level  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Hello,

The main reason I submitted the patch was to start a discussion and learn other people's opinions on this problem.

On Tue, Jul 29, 2008 at 10:41 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

Where are the collations going to come from?  

There will be two new catalogs, pg_collate and pg_charset. Each will be populated with the ANSI standard collations and charsets (ISO8BIT, LATIN1, UTF-8, ...) and, in addition, with the default collation chosen when the cluster is created. For instance, if you create a database cluster with initdb and specify en_US.utf8, template0 will contain the standard rows (ISO8BIT, LATIN1, UTF-8, ...) plus one row for en_US.utf8. You can then connect to template0, create other collations if your POSIX locales support them, and use one of them per database.

Have the various build and distribution issues been thought about?

Yes. Since POSIX locales don't guarantee any particular collation, there will be hard-coded collations implemented according to the ANSI collation standard. Others can be defined with the CREATE COLLATION command.

 How are they going to be configured (not the SQL syntax, but how will the configuration be applied)?

In each database, pg_type, pg_attribute and pg_namespace will be extended with a collation OID column that specifies the collation.

 How are the collations going to be applied at run-time?
 
The collation will be set when connecting to the database, with setlocale(LC_COLLATE, XXX) and setlocale(LC_CTYPE, XXX).
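
To make the run-time step concrete, here is a minimal C sketch of switching the locale at connection time and then comparing strings under it. The helper names apply_database_locale and compare_in_current_locale are made up for illustration and are not taken from the patch; only setlocale() and strcoll() are real C library calls.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): switch the process-wide
 * LC_COLLATE and LC_CTYPE categories to the locale of the database.
 * Returns 0 on success, -1 if the C library rejects the locale name.
 * Note: if LC_COLLATE succeeds and LC_CTYPE fails, LC_COLLATE stays
 * changed; a real implementation would probably restore the old value.
 */
int
apply_database_locale(const char *locale_name)
{
    assert(locale_name != NULL);

    /* setlocale() returns NULL and leaves the category untouched on failure. */
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: compare two strings under whatever LC_COLLATE
 * is currently active. Negative, zero or positive like strcmp().
 */
int
compare_in_current_locale(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b);
}
```

A backend would call apply_database_locale() once after connecting, for example with "en_US.utf8", and every later comparison would go through strcoll(). Because setlocale() affects the whole process, this only works when a single collation is in effect per connection, which matches the per-database scope described here.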
 
 How are you going to handle locale and encoding conflicts?

Since I'm currently implementing collation support per database, I don't think this is an issue. (I know it will become one in the future.)
 
 I also think that the clauses you have attached to your CREATE COLLATION statement (case-insensitive, accent-insensitive) are an oversimplification of reality.  I suggest you look up the Unicode collation algorithm to learn about how collations work in practice.

I already did, at the very beginning of the development. The reason I'm not implementing the whole Unicode collation algorithm is that this patch should be a sort of framework. You'll be able to use different collation functions, not only POSIX locales, so further development towards the full Unicode collation algorithm is possible.
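
As a rough illustration of what "a sort of framework" could look like, the C sketch below treats a collation as a name paired with a comparison function, so that a POSIX-locale comparison is just one entry among several. The type and function names (collation_entry, find_collation and the three collate_* functions) are assumptions for this example, not the patch's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a collation is a name plus a comparison function. */
typedef int (*collate_fn)(const char *a, const char *b);

typedef struct
{
    const char *name;
    collate_fn  compare;
} collation_entry;

/* Plain byte-wise ordering. */
int
collate_bytewise(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Locale-driven ordering through the C library (uses the active LC_COLLATE). */
int
collate_posix(const char *a, const char *b)
{
    return strcoll(a, b);
}

/* Fold 'A'..'Z' to lower case; every other byte is returned unchanged. */
static int
ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* ASCII-only case-insensitive ordering, standing in for a custom function. */
int
collate_ascii_ci(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = ascii_lower((unsigned char) *a);
        int cb = ascii_lower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == '\0')
            return 0;
    }
}

/* Illustrative registry; a real system would read this from a catalog. */
static const collation_entry collations[] = {
    {"bytewise", collate_bytewise},
    {"posix", collate_posix},
    {"ascii_ci", collate_ascii_ci},
};

/* Look a collation up by name; returns NULL if it is not registered. */
const collation_entry *
find_collation(const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; i < sizeof(collations) / sizeof(collations[0]); i++)
    {
        if (strcmp(collations[i].name, name) == 0)
            return &collations[i];
    }
    return NULL;
}
```

With this shape, a function implementing the Unicode collation algorithm could later be registered as one more entry without touching the callers, which only ever see a collate_fn.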

At the end of next week I'll publish my bachelor's thesis on this topic, where everything will be explained in detail, so stay tuned.
 
Regards

Radek Strnad
