Thread: [WIP] patch - Collation at database level

[WIP] patch - Collation at database level

From
"Radek Strnad"
Date:
Hello,

I'm sending part of the code that I've done and is available for reviewing and I'm asking for your comments and some help because I'm new to PostgreSQL.

Proposal: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

The code contains changes of parser, pg_collation catalog, parsenodes.h for CREATE COLLATION... and DROP COLLATION statements.

Questions that I would like to know answers:

1) Should I move DROP statement from DropCollationStmt to DropStmt? Some statements are seperated and some are included in DropStmt? Is there any reason for that?
2) What type should all names in CREATE and DROP statement in gram.y have? I've chosen qualified_name but I know it's not the best choice.
3) All collations are created from existing collations. How do I ensure that the collation already exists? Is there any possibility to define it in gram.y?
4) For further functionality development is there anything more needed than adding T_CreateCollationStmt and T_DropCollationStmt to /src/backend/tcop/utility.c, and write functionality into collation.c?
5) Also can you look at the pg_catalog and tell me if anything is wrong with it?

Thank you for all your replies in advance.

Regards

Radek Strnad

Attachment

Re: [WIP] patch - Collation at database level

From
Alvaro Herrera
Date:
Radek Strnad escribió:

> 2) What type should all names in CREATE and DROP statement in gram.y have?
> I've chosen qualified_name but I know it's not the best choice.

I think it should be ColId.

> 3) All collations are created from existing collations. How do I ensure that
> the collation already exists? Is there any possibility to define it in
> gram.y?

Certainly not -- shouldn't they come from a catalog?  In that case, it
must come in parse analysis (parser/analyze.c I guess) or perhaps later,
when you actually execute the function to create the new collation.

> 5) Also can you look at the pg_catalog and tell me if anything is wrong with
> it?

Why does a collation have a schema?

What's the "existing collation"?

It seems a bit silly to have enum for what are basically boolean
variables.  Why not just use "true" and "false"?
-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [WIP] patch - Collation at database level

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Why does a collation have a schema?

Because the SQL spec says so.  Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.
        regards, tom lane


Re: [WIP] patch - Collation at database level

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Why does a collation have a schema?
> 
> Because the SQL spec says so.  Also, if we don't put them in schemas,
> we have no nice way to distinguish built-in and user-defined collations,
> which creates a problem for pg_dump.

Oh, I see :-)  In that case, qualified_name would seem the right symbol
to use in the parser.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [WIP] patch - Collation at database level

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Why does a collation have a schema?
>
> Because the SQL spec says so.  Also, if we don't put them in schemas,
> we have no nice way to distinguish built-in and user-defined collations,
> which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Wed, Jul 02, 2008 at 07:22:10PM +0100, Gregory Stark wrote:
> > Because the SQL spec says so.  Also, if we don't put them in schemas,
> > we have no nice way to distinguish built-in and user-defined collations,
> > which creates a problem for pg_dump.
>
> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
> to go around declaring what order code points should be sorted in? That seems
> like it would be... quite tedious!

Not that we'll ever use it, but ICU for example allows users to say:
"use collation X but move this code point somewhere else", essentially
allowing users tweak the collation on a small scale. In any case,
whatever collation library is used, we're unlikely to predefine every
possible collation in the system, there's too many (assuming they're
denumerable).

Have a niceday,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] patch - Collation at database level

From
"Radek Strnad"
Date:
My patch should be "sort of" wrapper that will implement guts for further development (collation at column level) like catalogs, creating collations etc. When creating collation user will be able to choose which function to use (by statement STRCOLFN - not in SQL standard). In the first stage I'll implement function that will use system locales. Adding ICU or any other library won't be that big deal.

Radek Strnad


On Wed, Jul 2, 2008 at 8:22 PM, Gregory Stark <stark@enterprisedb.com> wrote:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Why does a collation have a schema?
>
> Because the SQL spec says so.  Also, if we don't put them in schemas,
> we have no nice way to distinguish built-in and user-defined collations,
> which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's Slony Replication support!

Re: [WIP] patch - Collation at database level

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
> to go around declaring what order code points should be sorted in? That seems
> like it would be... quite tedious!

Hm, that's a good point.  SQL99 has
        <collation definition> ::=             CREATE COLLATION <collation name> FOR             <character set
specification>              FROM <existing collation name>                 [ <pad characteristic> ]
 
        <existing collation name> ::= <collation name>
        <pad characteristic> ::=               NO PAD             | PAD SPACE

which seems pretty stupid if you ask me --- all the mechanism required
to manage a new object type, just to enable PAD SPACE or not?
(Especially when PAD SPACE itself is an utterly broken, useless concept
... but I digress.)  You might as well just provide all the standard
collations in both variants and be done with it.

The statement looks the same in last year's 200n draft, so it's not
like they were just about to add some more capability.

We might be best off to treat collations like index access methods,
ie, they're theoretically add-able but there's no infrastructure for
managing them, and what's expected is that all the ones you need are
created by initdb.
        regards, tom lane


Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Gregory Stark <stark@enterprisedb.com> writes:
>> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
>> to go around declaring what order code points should be sorted in? That seems
>> like it would be... quite tedious!
> 
> Hm, that's a good point.  SQL99 has
> 
>          <collation definition> ::=
>               CREATE COLLATION <collation name> FOR
>               <character set specification>
>                 FROM <existing collation name>
>                   [ <pad characteristic> ]
> 
>          <existing collation name> ::= <collation name>
> 
>          <pad characteristic> ::=
>                 NO PAD
>               | PAD SPACE
> 
> which seems pretty stupid if you ask me --- all the mechanism required
> to manage a new object type, just to enable PAD SPACE or not?
> (Especially when PAD SPACE itself is an utterly broken, useless concept
> ... but I digress.)  You might as well just provide all the standard
> collations in both variants and be done with it.
> 
> The statement looks the same in last year's 200n draft, so it's not
> like they were just about to add some more capability.

The proposed syntax of CREATE COLLATION is:

CREATE COLLATION <collation name> FOR <character set specification>FROM <existing collation name> [STRCOLFN <fn name>][
<padcharacteristic> ] [ <case sensitive> ] [ <accent sensitive> ][ LCCOLLATE <lc_collate> ] [ LCCTYPE <lc_ctype> ]
 

Which extends ANSI specification.


> We might be best off to treat collations like index access methods,
> ie, they're theoretically add-able but there's no infrastructure for
> managing them, and what's expected is that all the ones you need are
> created by initdb.

I think you cannot create all collation at bootstrap. You can only create record 
for actual LC_COLLATION, because I there is not standard way how to obtain 
complete list of supported collations and there is also problem if you install 
new locales after initdb.

When I looked to another DB (MS SQL, MySQL, DB2, Firebird) then only Firebird 
supports CREATE COLLATION command. Other databases has hard coded list of 
locales. Hardcoded solution means to use some lib (e.g ICU) with unified names 
or has locale name mapping for all supported OS.

I personally prefer open solution when I can create own collation and specify 
collation function to handle it.

    Zdenek






Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Gregory Stark <stark@enterprisedb.com> writes:
>> Out of curiosity, what is a "user-defined collation"? Are there SQL statements
>> to go around declaring what order code points should be sorted in? That seems
>> like it would be... quite tedious!
> 

<snip>

> We might be best off to treat collations like index access methods,
> ie, they're theoretically add-able but there's no infrastructure for
> managing them, and what's expected is that all the ones you need are
> created by initdb.

I though more about it and I discussed it with Radek yesterday. The problem is 
that collation must be created before user want to use CREATE DATABASE ... 
COLLATE ... command. It inclines to have have pg_collation as a global catalog, 
but ANSI specifies to use schema name in collation specification and schemes are 
database specific ... It means that pg_collation have to be non-shared catalog 
and new database only inherits collation from template db. And CREATE DATABASE 
have to check list of collation in template database :(.

My conclusion is that CREATE COLLATION does not make much sense. I see two 
possible solutions:

1) have global an local catalog for collation and have modified variants of 
create cmd:
   CREATE COLLATION ... GLOBAL|LOCAL
   CREATE DATABASE will use only collation from global catalog

Local catalog will be useful when full support of collation will be available 
mostly for specifying case sensitivity of collation.


2) Use Tom's suggested approach. Create list of collations in initdb phase. But 
there is problem how to obtain list of supported collation on the server. I 
think, only what is possible to do is to use default locale for creating default 
collation for template1 database.

Any suggestion?
    thanks Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Zdenek Kotala napsal(a):

> 
> I though more about it and I discussed it with Radek yesterday. The 
> problem is that collation must be created before user want to use CREATE 
> DATABASE ... COLLATE ... command. It inclines to have have pg_collation 
> as a global catalog, but ANSI specifies to use schema name in collation 
> specification and schemes are database specific ... It means that 
> pg_collation have to be non-shared catalog and new database only 
> inherits collation from template db. And CREATE DATABASE have to check 
> list of collation in template database :(.

thinking more ...
It must be shared catalog because pg_database will depend on it.
Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:
> Zdenek Kotala napsal(a):
>
> >
> >I though more about it and I discussed it with Radek yesterday. The
> >problem is that collation must be created before user want to use CREATE
> >DATABASE ... COLLATE ... command. It inclines to have have pg_collation
> >as a global catalog, but ANSI specifies to use schema name in collation
> >specification and schemes are database specific ... It means that
> >pg_collation have to be non-shared catalog and new database only
> >inherits collation from template db. And CREATE DATABASE have to check
> >list of collation in template database :(.
>
> thinking more ...
> It must be shared catalog because pg_database will depend on it.

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Martijn van Oosterhout napsal(a):
> On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:
>> Zdenek Kotala napsal(a):
>>
>>> I though more about it and I discussed it with Radek yesterday. The 
>>> problem is that collation must be created before user want to use CREATE 
>>> DATABASE ... COLLATE ... command. It inclines to have have pg_collation 
>>> as a global catalog, but ANSI specifies to use schema name in collation 
>>> specification and schemes are database specific ... It means that 
>>> pg_collation have to be non-shared catalog and new database only 
>>> inherits collation from template db. And CREATE DATABASE have to check 
>>> list of collation in template database :(.
>> thinking more ...
>> It must be shared catalog because pg_database will depend on it.
> 
> Not necessarily. pg_class is not shared yet without it you can't even
> find pg_database. Same deal with pg_type. All it means is that
> pg_collation in template1 must contain all the collations used in
> template1, which shouldn't be hard to arrange.

I think, Collation situation is different, becasue pg_database will contains 
column colname. pg_class (and all bootstrap catalog) only contains row which 
specify that shared table exists and content is cloned to the new database from 
template database. In corner case you can get context specific dependency for 
example if Czech collation will have oid=10 in database test01 and Swedish 
collation will have oid=10 in database test02. How to handle CREATE DATABASE and 
connect database? OK it shouldn't happen in normal situation  but ...
Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Tue, Jul 08, 2008 at 12:00:34PM +0200, Zdenek Kotala wrote:
> >Not necessarily. pg_class is not shared yet without it you can't even
> >find pg_database. Same deal with pg_type. All it means is that
> >pg_collation in template1 must contain all the collations used in
> >template1, which shouldn't be hard to arrange.
>
> I think, Collation situation is different, becasue pg_database will
> contains column colname. pg_class (and all bootstrap catalog) only contains
> row which specify that shared table exists and content is cloned to the new
> database from template database. In corner case you can get context
> specific dependency for example if Czech collation will have oid=10 in
> database test01 and Swedish collation will have oid=10 in database test02.
> How to handle CREATE DATABASE and connect database? OK it shouldn't happen
> in normal situation  but ...

Oh I see, you're referring to the storage of the default collation for
a database. I was jumping ahead to the per-column collation state, when
the collation default is attached to columns, types and domains, and
not at the database level. So there the problem does not exist.

To be honest, I'd suggest storing the collation in pg_database as a
string, rather than as an identifier. This sidesteps the problem
entirly.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Martijn van Oosterhout napsal(a):
> 
> Oh I see, you're referring to the storage of the default collation for
> a database. I was jumping ahead to the per-column collation state, when
> the collation default is attached to columns, types and domains, and
> not at the database level. So there the problem does not exist.

Yeah, but you still need one source/one collation list for database, scheme, 
table and column. And of course shared tables need also collation for their indexes.

> To be honest, I'd suggest storing the collation in pg_database as a
> string, rather than as an identifier. This sidesteps the problem
> entirly.

I don't think that string is good idea. You need to use same approach on all 
levels by my opinion.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> Martijn van Oosterhout napsal(a):
>> Not necessarily. pg_class is not shared yet without it you can't even
>> find pg_database. Same deal with pg_type. All it means is that
>> pg_collation in template1 must contain all the collations used in
>> template1, which shouldn't be hard to arrange.

> I think, Collation situation is different,

All the argument here is based on the premise that we should have
database-level collation specifications, which AFAICS is not required
nor suggested by the SQL spec.  I wonder why we are allowing a
nonstandard half-measure to drive our thinking, rather than solving the
real problem which is column-level collations.
        regards, tom lane


Re: [WIP] patch - Collation at database level

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> ... And of course shared tables need also collation for their indexes.

No, they don't, because the only textual indexes on shared catalogs are
on "name" columns, which are intentionally not locale aware, and
wouldn't be collation aware either.
        regards, tom lane


Re: [WIP] patch - Collation at database level

From
Andrew Dunstan
Date:

Tom Lane wrote:
>
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec.  I wonder why we are allowing a
> nonstandard half-measure to drive our thinking, rather than solving the
> real problem which is column-level collations.
>
>
>   

Agreed. Are we even sure that we want per-database collations as a 
half-way house? Unless we can be sure that we want all the required 
catalog changes for the full requirement, it seems to me a rather messy 
way of getting to where we want to go.

cheers

andrew


Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Tue, Jul 08, 2008 at 10:54:28AM -0400, Andrew Dunstan wrote:
> Agreed. Are we even sure that we want per-database collations as a
> half-way house? Unless we can be sure that we want all the required
> catalog changes for the full requirement, it seems to me a rather messy
> way of getting to where we want to go.

Given that the current projected timeframe for full COLLATE support is
something like 2-3 major releases, I don't really see the problem with
doing this now. Who knows, it might prompt people to do something
sooner.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] patch - Collation at database level

From
Josh Berkus
Date:
Andrew, Tom,

> Agreed. Are we even sure that we want per-database collations as a
> half-way house? Unless we can be sure that we want all the required
> catalog changes for the full requirement, it seems to me a rather messy
> way of getting to where we want to go.

Given that we don't have a delivery date for table or column level collations, 
we don't want to turn down database-level collations.    If nothing else, 
Radek's work will expose what areas of our code are collation-dependant and 
hopefully make the work of more granular collations easier.  And if it takes 
us 3 years to get more granular collations, at least people can use 
database-level ones in the meantime so that they don't need to have separate 
PostgreSQL binaries for every language they want to support fully.

Also ... this is a Summer of Code Project, which we accepted, which at least 
in Google and the student's eyes means we're not going to discard the entire 
premise of the patch.  I'm not exaggerating when I say doing something like 
that could get PostgreSQL permanently banned from Google SoC.  

Tom, I think you need to be on the SoC committee in the future, just to raise 
objections.  Some 15+ PostgreSQL contributors on the SoC committee approved 
Radek's project.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: [WIP] patch - Collation at database level

From
Michael Paesold
Date:
Tom Lane wrote:

> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> Martijn van Oosterhout napsal(a):
>>> Not necessarily. pg_class is not shared yet without it you can't  
>>> even
>>> find pg_database. Same deal with pg_type. All it means is that
>>> pg_collation in template1 must contain all the collations used in
>>> template1, which shouldn't be hard to arrange.
>
>> I think, Collation situation is different,
>
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec.  I wonder why we are allowing a
> nonstandard half-measure to drive our thinking, rather than solving  
> the
> real problem which is column-level collations.

Wouldn't you still need per-database and per-table default collations?  
At least MySQL does have such a concept.

Best Regards
Michael Paesold 


Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> ... And of course shared tables need also collation for their indexes.
> 
> No, they don't, because the only textual indexes on shared catalogs are
> on "name" columns, which are intentionally not locale aware, and
> wouldn't be collation aware either.

Yeah, name uses strcmp, which is not locale aware but from ANSI perspective 
there is collation SQL_IDENTIFIER for it which is fortunately implementation 
defined.

What I see now as the problem is that we need also to know correct collation for 
ORDER  - for example:

select * from pg_shdescription order by description;

...thinking...

but it should solve by collation per column which will work well with 
pg_attribute cloning for new database as Martijn mentioned.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Andrew, Tom,
>> Agreed. Are we even sure that we want per-database collations as a
>> half-way house? Unless we can be sure that we want all the required
>> catalog changes for the full requirement, it seems to me a rather messy
>> way of getting to where we want to go.

> Given that we don't have a delivery date for table or column level collations, 
> we don't want to turn down database-level collations.

I am one hundred percent prepared to turn them down, if they end up
contorting the design in a way that we will have to undo (with
consequent backwards-compatibility problems) to get to the full feature.

If it's a partial implementation of the full feature, that's fine, but
I'm not getting good vibes about that from the discussions so far.
        regards, tom lane


Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Tom Lane napsal(a):
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> Martijn van Oosterhout napsal(a):
>>> Not necessarily. pg_class is not shared yet without it you can't even
>>> find pg_database. Same deal with pg_type. All it means is that
>>> pg_collation in template1 must contain all the collations used in
>>> template1, which shouldn't be hard to arrange.
> 
>> I think, Collation situation is different,
> 
> All the argument here is based on the premise that we should have
> database-level collation specifications, which AFAICS is not required
> nor suggested by the SQL spec. 

Yeah, it is not required, but by my opinion it should be derived from CREATE 
SCHEMA statement. There is following item:

--- SQL ANSI 2003 page 520 ---

5) If <schema character set specification> is not specified, then a <schema 
character set specification> that specifies an implementation-defined character 
set that contains at least every character that is in <SQL language character> 
is implicit.

----

It is not for collation directly, but if I understand it correctly when you want 
to create schema  then default charset is inherit from parent instance which is 
database (catalog).

Following sentence specified that pg_collation should be database specific.

---- SQL ANSI 2003 page 15 ---
Character sets defined by standards or by SQL-implementations reside in the 
Information Schema (named INFORMATION_SCHEMA) in each catalog, as do collations 
defined by standards and collations, transliterations, and transcodings defined 
by SQL-implementations.
----
> I wonder why we are allowing a> nonstandard half-measure to drive our thinking, rather than solving the> real problem
whichis column-level collations.
 

I try to determine how to implement collation itself - collation catalog 
structure and content and how to create new collation. Column-level collation is 
nice but until we will not have basic infrastructure we cannot start implemented it.
    Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Andrew Dunstan napsal(a):
> 
> 
> Tom Lane wrote:
>>
>> All the argument here is based on the premise that we should have
>> database-level collation specifications, which AFAICS is not required
>> nor suggested by the SQL spec.  I wonder why we are allowing a
>> nonstandard half-measure to drive our thinking, rather than solving the
>> real problem which is column-level collations.
>>
>>
>>   
> 
> Agreed. Are we even sure that we want per-database collations as a 
> half-way house? Unless we can be sure that we want all the required 
> catalog changes for the full requirement, it seems to me a rather messy 
> way of getting to where we want to go.

Andrew,
I would like also to see full collation implementation rather then collation per 
database. But from my point of view split collation into small parts is much 
better. Radek's work is mostly about creating infrastructure for full collation 
support. When it will be finished, then "only" changes in parser, executor... 
will be necessary to complete a job.

If you look on most discussion about collation they fell into ICU yes/no problem 
without any real decision how to implemented the feature.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Tue, Jul 08, 2008 at 09:05:11PM +0200, Zdenek Kotala wrote:
> >All the argument here is based on the premise that we should have
> >database-level collation specifications, which AFAICS is not required
> >nor suggested by the SQL spec.
>
> Yeah, it is not required, but by my opinion it should be derived from
> CREATE SCHEMA statement. There is following item:
>
> --- SQL ANSI 2003 page 520 ---
>
> 5) If <schema character set specification> is not specified, then a <schema

Careful, this is a 'character set specification" which has (almost)
nothing to do with collation. It's closer to the encoding field, which
is already in pg_database.

The issue with having a "default database collation" is that it's
unclear where it would be used. In the end the collation is defined by
the types and domains. Columns inherit from the types. I think the only
senseible definition is to decide that all the text/varchar/char types
inherit from the database. It's not in the spec but I think it does
make easier to decide what the default collation is. As an alternative
to:

ALTER TYPE text SET DEFAULT COLLATION TO foo;
<repeat for all other text types>

> I try to determine how to implement collation itself - collation catalog
> structure and content and how to create new collation. Column-level
> collation is nice but until we will not have basic infrastructure we cannot
> start implemented it.

I agree that this patch includes much basic work that needs to be done
for full collation support.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] patch - Collation at database level

From
Zdenek Kotala
Date:
Martijn van Oosterhout napsal(a):
> On Tue, Jul 08, 2008 at 09:05:11PM +0200, Zdenek Kotala wrote:
>>> All the argument here is based on the premise that we should have
>>> database-level collation specifications, which AFAICS is not required
>>> nor suggested by the SQL spec. 
>> Yeah, it is not required, but by my opinion it should be derived from 
>> CREATE SCHEMA statement. There is following item:
>>
>> --- SQL ANSI 2003 page 520 ---
>>
>> 5) If <schema character set specification> is not specified, then a <schema 
> 
> Careful, this is a 'character set specification" which has (almost)
> nothing to do with collation. It's closer to the encoding field, which
> is already in pg_database.

Yeah, I'm confuse why for character set is defined behavior when is not 
specified but there is not definition how to handle it for collation.

Maybe because charset has default collation.

> The issue with having a "default database collation" is that it's
> unclear where it would be used. In the end the collation is defined by
> the types and domains. Columns inherit from the types. I think the only
> senseible definition is to decide that all the text/varchar/char types
> inherit from the database. It's not in the spec but I think it does
> make easier to decide what the default collation is. As an alternative
> to:
> 
> ALTER TYPE text SET DEFAULT COLLATION TO foo;
> <repeat for all other text types>

It is good point I thought about following situation

select 'xxx' < 'yyy';

But how you mentioned because both are text and text type has default collation 
there is no reason to have collation per database.

Hmm, It seems to me that expectation should be create all string data types with 
default collation during initdb phase. It means modify BKI types definition 
according to default locale setting.

    Zdenek


Re: [WIP] patch - Collation at database level

From
Peter Eisentraut
Date:
Am Tuesday, 1. July 2008 schrieb Radek Strnad:
> I'm sending part of the code that I've done and is available for reviewing
> and I'm asking for your comments and some help because I'm new to
> PostgreSQL.
>
> Proposal: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php
>
> The code contains changes of parser, pg_collation catalog, parsenodes.h for
> CREATE COLLATION... and DROP COLLATION statements.

This patch is a small start on the way to adding a catalog and some SQL 
statements that add and remove entries from it.  But I don't see any 
collation support here, which is about 99% of the work that is left to do.  
Where are the collations going to come from?  Have the various build and 
distributions issues been thought about?  How are they going to be configured 
(not the SQL syntax, but how will the configuration be applied)?  How are the 
collations going to be applied at run-time?  How are you going to handle 
locale and encoding conflicts?  I also think that the clauses you have 
attached to your CREATE COLLATION statement (case-insensitive, 
accent-insensitive) are an oversimplification of reality.  I suggest you look 
up the Unicode collation algorithm to learn about who collations work in 
practice.

In my opinion, you are starting this project from the wrong end.  I would 
suggest you approach it like this:

- Find some collation implementations
- Patch PostgreSQL to link with them
- Patch PostgreSQL to apply them for comparison
- Implement system catalog to configure them
- Implement SQL statements to manipulate the system catalog


Re: [WIP] patch - Collation at database level

From
Gregory Stark
Date:
"Peter Eisentraut" <peter_e@gmx.net> writes:

> In my opinion, you are starting this project from the wrong end.  I would 
> suggest you approach it like this:
>
> - Find some collation implementations
> - Patch PostgreSQL to link with them

Well I think the feeling is that we may as well start with the lowest common
denominator of libc's collation implementation. It's the only one everyone's
going to have. Later adding compile-time options to use a different library
and different function calls might be useful but a lot of people are unhappy
about the idea of *requiring* a major outside library for this.

> - Patch PostgreSQL to apply them for comparison

Er, yes. Well we do that already but the tricky bit is keeping track of
multiple collations and applying the right one for each comparison.

So we do need the concept of multiple collations and the syntax to select a
collation for each ordering operation.



--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: [WIP] patch - Collation at database level

From
"Radek Strnad"
Date:
Hello,

the main reason why I've submitted the patch was to start a discussion and know other people's opinion on this problem.

On Tue, Jul 29, 2008 at 10:41 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

Where are the collations going to come from?  

There will be two new catalogs - pg_collate and pg_charset. Each of them will be filled with ANSI standard collations and charsets (ISO8BIT, LATIN1, UTF-8..) and alternatively with default collation set when creating. For instance if you create database cluster with initdb and specify en_US.utf8 there will be standard rows (ISO8BIT, LATIN1, UTF-8..) + one row with en_US.utf8 in template0. Then you can connect to template0 and create other collations if your POSIX locales support them and use them one per each database.

Have the various build and distributions issues been thought about?

Yes. Since POSIX locales doesn't guarantee any collation there will be hard-coded collations implemented regarding ANSI collation standard. Others can be set by command CREATE COLLATION.

 How are they going to be configured (not the SQL syntax, but how will the configuration be applied)?

pg_type, pg_attribute, pg_namespace of each database will be extended with collation oid column that will be specifying collation.

 How are the collations going to be applied at run-time?
 
Collation will be set when connecting to the database with setlocale(LC_COLLATION, XXX) and setlocale(LC_CTYPE, XXX)
 
 How are you going to handle locale and encoding conflicts?

Since I'm currently implementing collation support per database I don't think this is an issue. (It will be in the future I know.)
 
 I also think that the clauses you have attached to your CREATE COLLATION statement (case-insensitive,
accent-insensitive) are an oversimplification of reality.  I suggest you look
up the Unicode collation algorithm to learn about who collations work in
practice.

I already did in the very beginning of the development. The reason why I'm not implementing the whole Unicode collation algorithm is that this patch shold be sort of framework. You'll be able to use different collation functions not only POSIX locales so further development towards full Unicode collation algorithm is possible.

At the end of the next week I'll publish my bachelor thesis concerning this topic where everything will be explained in details so stay tuned.
 
Regards

Radek Strnad

Re: [WIP] patch - Collation at database level

From
Martijn van Oosterhout
Date:
On Sat, Aug 02, 2008 at 03:39:18PM +0200, Radek Strnad wrote:
> >  I also think that the clauses you have attached to your CREATE
> > COLLATION statement (case-insensitive, accent-insensitive) are an
> > oversimplification of reality.  I suggest you look up the Unicode
> > collation algorithm to learn about who collations work in practice.
>
> I already did in the very beginning of the development. The reason why I'm
> not implementing the whole Unicode collation algorithm is that this patch
> shold be sort of framework. You'll be able to use different collation
> functions not only POSIX locales so further development towards full Unicode
> collation algorithm is possible.

Agreed. Ofcourse it's a simplification of reality. POSIX locales are a
simplification of reality, but its the only form of collation currently
available to us. And quite frankly, I don't beleive postgresql should
be in the business of writing collation algorithms, we don't have the
expertese.

FWIW, I think case-insensitive and accent-insensitive are useful modifiers
that we should aim to support in the future.

> At the end of the next week I'll publish my bachelor thesis concerning this
> topic where everything will be explained in details so stay tuned.

Good luck!

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.