Thread: How to pass around collation information

How to pass around collation information

From

Peter Eisentraut

Date:

28 May 2010, 16:28:04

I have been thinking about this collation support business a bit.
Ignoring for the moment where we would get the actual collation routines
from, I wonder how we are going to pass this information around in the
system.  Someone declares a collation on a column in a table, and
somehow this information needs to arrive in bttextcmp() and friends.
Also, functions that take in a string and return one (e.g., substring),
need to take in this information and return it back out.  How should
this work?

Option 1, make it part of the datum.  That way it will pass through the
system just fine, but it would waste a lot of storage and break just
about everything that operates on string types now, as well as
pg_upgrade.  So that's probably out.

Option 2, invent some new mechanism that accompanies a datum or a type
whereever it goes.  Kind of like typmod, but not really.  Then the
collation information would presumably be made available to functions
through the fmgr interface.  The binary representation of data values
stays the same.

Option 2a, while we are at it, are there any other things of this nature
that would be worth supporting at the same time?  I could imagine that
having the option to pass around the ctype locale or the text search
dictionary in a similar way could be useful.  Is this something that
could be combined with typmod or other dormant data type metadata
requirements (PostGIS?, XML?)?

Ideas?

Re: How to pass around collation information

From

alvherre

Date:

28 May 2010, 17:05:50

Excerpts from Peter Eisentraut's message of vie may 28 12:27:52 -0400 2010:

> Option 2, invent some new mechanism that accompanies a datum or a type
> whereever it goes.  Kind of like typmod, but not really.  Then the
> collation information would presumably be made available to functions
> through the fmgr interface.  The binary representation of data values
> stays the same.

Is the collation a property of the datum, or one of the comparison?
If the latter, should it be really be made a sidecar of a datum, or
would it make more sense to attach it to the operation being performed?

I wonder if instead of trying to pass it down multiple layers till
bttextcmp and further down, it would make more sense to set a global
variable somewhere in the high levels, and only have it checked in
varstr_cmp.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: How to pass around collation information

From

Heikki Linnakangas

Date:

28 May 2010, 17:22:30

On 28/05/10 19:27, Peter Eisentraut wrote:
> I have been thinking about this collation support business a bit.
> Ignoring for the moment where we would get the actual collation routines
> from, I wonder how we are going to pass this information around in the
> system.  Someone declares a collation on a column in a table, and
> somehow this information needs to arrive in bttextcmp() and friends.

Yes. Comparison operators need it, as do functions like isalpha().

> Also, functions that take in a string and return one (e.g., substring),
> need to take in this information and return it back out.  How should
> this work?

Hmm, I don't see what substring would need collation for. And it 
certainly shouldn't be returning it. Collation is a property of the 
comparison operators (and isalpha etc.), and the planner needs to deduce 
the right collation for each such operation in the query. That involves 
looking at the tables and columns involved, as well as per-user 
information and any explicit COLLATE clauses in the query, but all that 
happens at plan-time.

> Option 1, make it part of the datum.  That way it will pass through the
> system just fine, but it would waste a lot of storage and break just
> about everything that operates on string types now, as well as
> pg_upgrade.  So that's probably out.

It's also fundamentally wrong, collation is not a property of a datum 
but of the operation.

> Option 2, invent some new mechanism that accompanies a datum or a type
> whereever it goes.  Kind of like typmod, but not really.  Then the
> collation information would presumably be made available to functions
> through the fmgr interface.  The binary representation of data values
> stays the same.

Something like that. I'm thinking that bttextcmp() and friends will 
simply take an extra argument indicating the collation, and we'll teach 
the operator / operator class infrastructure about that too.

One way to approach this is to realize that it's already possible to use 
multiple collations in a database. You just have to define separate < = > operators and operator classes for every
collation,and change all 

your queries to use the right operator depending on the desired 
collation everywhere where you use < = > (including ORDER BYs, with the 
USING <operator> syntax). The behavior is exactly what we want, it's 
just completely inpractical, so we need something to do the same in a 
less cumbersome way.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: How to pass around collation information

From

Peter Eisentraut

Date:

28 May 2010, 17:59:56

On fre, 2010-05-28 at 20:22 +0300, Heikki Linnakangas wrote:
> It's also fundamentally wrong, collation is not a property of a datum 
> but of the operation.

> One way to approach this is to realize that it's already possible to
> use 
> multiple collations in a database. You just have to define separate <
> = 
>  > operators and operator classes for every collation, and change all 
> your queries to use the right operator depending on the desired 
> collation everywhere where you use < = > (including ORDER BYs, with
> the 
> USING <operator> syntax). The behavior is exactly what we want, it's 
> just completely inpractical, so we need something to do the same in a 
> less cumbersome way.

Well, maybe we should step back a little and work out what sort of
feature we actually want, if any.  The feature I'm thinking of is what
people might call "per-column locale", and the SQL standard defines
that.  It would look like this:

CREATE TABLE test (   a varchar COLLATE de,   b varchar COLLATE fr
);

SELECT * FROM test WHERE a > 'baz' ORDER BY b;

So while it's true that the collation is used by the operations (> and
ORDER BY), the information which collation to use comes with the data
values.  It's basically saying, a is in language "de", so sort it like
that unless told otherwise.  There is also an override syntax available,
like this:

SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;

But here again the collation is attached to a data value, and only from
there it is passed to the operator.  What is actually happening is

SELECT * FROM test WHERE (a COLLATE en) > 'baz' ORDER BY (b COLLATE sv);

What you appear to be describing is a "per-operation locale", which also
sounds valid, but it would be a different thing.  It might be thought of
as this:

SELECT * FROM test WHERE a (> COLLATE en) 'baz' ORDER BY COLLATE sv b;

with some suitable global default.

So which one of these should it be?

Re: How to pass around collation information

From

Tom Lane

Date:

28 May 2010, 18:48:54

Peter Eisentraut <peter_e@gmx.net> writes:
> So while it's true that the collation is used by the operations (> and
> ORDER BY), the information which collation to use comes with the data
> values.  It's basically saying, a is in language "de", so sort it like
> that unless told otherwise.  There is also an override syntax available,
> like this:

> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;

That seems fairly bizarre.  What does this mean:
WHERE a COLLATE en > b COLLATE de

?  If it's an error, why is this not an error
WHERE a COLLATE en > b

if b is marked as COLLATE de in its table?

I guess the more general question is whether the spec expects that
collation settings can be derived statically (like type information)
or whether they might sometimes only be known at runtime.

We also need to think about whether we're okay with only applying
collation to built-in types (text, varchar, char) or whether we need
the feature to work for add-on types as well.  In particular, is citext
still a meaningful feature if we have this, or is it superseded by
COLLATE?  In the abstract I'd prefer to let it work for user-defined
types, but if we can have a much simpler implementation by not doing
so, it might be better to give that up.

Is COLLATE a property that can be attached to a domain over text?
        regards, tom lane

Re: How to pass around collation information

From

Robert Haas

Date:

28 May 2010, 19:03:49

On Fri, May 28, 2010 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> So while it's true that the collation is used by the operations (> and
>> ORDER BY), the information which collation to use comes with the data
>> values.  It's basically saying, a is in language "de", so sort it like
>> that unless told otherwise.  There is also an override syntax available,
>> like this:
>
>> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;
>
> That seems fairly bizarre.  What does this mean:
>
>        WHERE a COLLATE en > b COLLATE de
>
> ?  If it's an error, why is this not an error
>
>        WHERE a COLLATE en > b
>
> if b is marked as COLLATE de in its table?

I think we need to think of the comparison operators as ternary, and
the COLLATE syntax applied to columns or present in queries as various
ways of setting defaults or explicit overrides for what the third
argument will end up being.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: How to pass around collation information

From

Peter Eisentraut

Date:

28 May 2010, 19:20:35

On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote:
> I think we need to think of the comparison operators as ternary, and
> the COLLATE syntax applied to columns or present in queries as various
> ways of setting defaults or explicit overrides for what the third
> argument will end up being.

How could this extend to things like isalpha() or upper() that would
need access to ctype information?

Re: How to pass around collation information

From

Peter Eisentraut

Date:

28 May 2010, 19:32:46

On fre, 2010-05-28 at 14:48 -0400, Tom Lane wrote:
> > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;
> 
> That seems fairly bizarre.  What does this mean:
> 
>     WHERE a COLLATE en > b COLLATE de
> 
> ?  If it's an error, why is this not an error
> 
>     WHERE a COLLATE en > b
> 
> if b is marked as COLLATE de in its table?

The way I understand it, a collation "derivation" can be explicit or
implicit.  Explicit derivations override implicit derivations.  If in
the argument set of an operation, explicit collation derivations exist,
they must all be the same.

> I guess the more general question is whether the spec expects that
> collation settings can be derived statically (like type information)
> or whether they might sometimes only be known at runtime.

It looks like it is treated like type information.  The derivation and
validation rules are part of the Syntax Rules.

> We also need to think about whether we're okay with only applying
> collation to built-in types (text, varchar, char) or whether we need
> the feature to work for add-on types as well.  In particular, is citext
> still a meaningful feature if we have this, or is it superseded by
> COLLATE?  In the abstract I'd prefer to let it work for user-defined
> types, but if we can have a much simpler implementation by not doing
> so, it might be better to give that up.

I think if we get this done using the strcoll_l() API to do the work,
which looks like the path of least resistance at the moment, citext
would still be useful because all the standard locales would still be
case sensitive.

> Is COLLATE a property that can be attached to a domain over text?

According to the spec, yes.

Re: How to pass around collation information

From

Dimitri Fontaine

Date:

28 May 2010, 19:53:50

Hi,

Peter Eisentraut <peter_e@gmx.net> writes:
> On fre, 2010-05-28 at 20:22 +0300, Heikki Linnakangas wrote:
>> USING <operator> syntax). The behavior is exactly what we want, it's 
>> just completely inpractical, so we need something to do the same in a 
>> less cumbersome way.

For an example, here is something I did to better understand the system
a while ago. Of course I never got to use it for real:
 http://pgsql.tapoueh.org/btree_fr_ops/

> Well, maybe we should step back a little and work out what sort of
> feature we actually want, if any.  The feature I'm thinking of is what
> people might call "per-column locale", and the SQL standard defines
> that.  It would look like this:
>
> CREATE TABLE test (
>     a varchar COLLATE de,
>     b varchar COLLATE fr
> );
>
> SELECT * FROM test WHERE a > 'baz' ORDER BY b;
>
> So while it's true that the collation is used by the operations (> and
> ORDER BY), the information which collation to use comes with the data
> values.  It's basically saying, a is in language "de", so sort it like
> that unless told otherwise.  There is also an override syntax available,
> like this:
>
> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;
>
> But here again the collation is attached to a data value, and only from
> there it is passed to the operator.  What is actually happening is
>
> SELECT * FROM test WHERE (a COLLATE en) > 'baz' ORDER BY (b COLLATE sv);
>
>
> What you appear to be describing is a "per-operation locale", which also
> sounds valid, but it would be a different thing.  It might be thought of
> as this:
>
> SELECT * FROM test WHERE a (> COLLATE en) 'baz' ORDER BY COLLATE sv b;
>
> with some suitable global default.
>
>
> So which one of these should it be?

My understanding is that what we do is per-operation locale. The locale
information bears no semantic when not attached to some operation on
strings, like sorting or comparing.

So what you're showing here I think is how to attach a collation label
to every string in the system, at the catalog level or dynamically at
the query level. 

Now this collation label will only be used whenever you want to use a
collation aware function or operator. Those functions need to get the
labels for their implementation to have the expected meaning.

So we need both to attach collations to all known strings (defaulting to
the current database collation I guess), as you showed at the SQL level,
and to pass this information down to the functions operating on those
strings.

A confusing example on this grounds would be the following, which I hope
the standard disallow:
 SELECT * FROM test WHERE a COLLATE en > b COLLATE sv;

Regards,
-- 
dim

Re: How to pass around collation information

From

Robert Haas

Date:

28 May 2010, 20:15:36

On Fri, May 28, 2010 at 3:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote:
>> I think we need to think of the comparison operators as ternary, and
>> the COLLATE syntax applied to columns or present in queries as various
>> ways of setting defaults or explicit overrides for what the third
>> argument will end up being.
>
> How could this extend to things like isalpha() or upper() that would
> need access to ctype information?

Good question.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: How to pass around collation information

From

Pavel Stehule

Date:

28 May 2010, 20:24:16

2010/5/28 alvherre <alvherre@commandprompt.com>:
> Excerpts from Peter Eisentraut's message of vie may 28 12:27:52 -0400 2010:
>
>> Option 2, invent some new mechanism that accompanies a datum or a type
>> whereever it goes.  Kind of like typmod, but not really.  Then the
>> collation information would presumably be made available to functions
>> through the fmgr interface.  The binary representation of data values
>> stays the same.
>
> Is the collation a property of the datum, or one of the comparison?
> If the latter, should it be really be made a sidecar of a datum, or
> would it make more sense to attach it to the operation being performed?

>
> I wonder if instead of trying to pass it down multiple layers till
> bttextcmp and further down, it would make more sense to set a global
> variable somewhere in the high levels, and only have it checked in
> varstr_cmp.
>

Maybe collation is property of some operation: func call, sort, ... I
prefer to put collation info to FunctionCallInfo structure. Usually
you cannot change collation per row - collation is attached to
expression.

Regards

Pavel
> --
> Álvaro Herrera <alvherre@commandprompt.com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: How to pass around collation information

From

Heikki Linnakangas

Date:

28 May 2010, 21:19:10

On 28/05/10 23:15, Robert Haas wrote:
> On Fri, May 28, 2010 at 3:20 PM, Peter Eisentraut<peter_e@gmx.net>  wrote:
>> On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote:
>>> I think we need to think of the comparison operators as ternary, and
>>> the COLLATE syntax applied to columns or present in queries as various
>>> ways of setting defaults or explicit overrides for what the third
>>> argument will end up being.
>>
>> How could this extend to things like isalpha() or upper() that would
>> need access to ctype information?
>
> Good question.  :-(

Strictly speaking, collation and ctype are two different things. Which 
is a convenient way to evade the question :-).

But you could ask, how would we handle more fine-grained ctype in 
upper() then? Perhaps by adding a second argument for ctype. Similarly 
to to_tsvector([config, ] string), you could explicitly pass the ctype 
as an argument, or leave it out in which case a default is used. It 
wouldn't give you per-column ctype, though.

What does the spec have to say about the ctype used for upper() et al BTW?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: How to pass around collation information

From

Martijn van Oosterhout

Date:

28 May 2010, 21:23:31

On Fri, May 28, 2010 at 10:32:34PM +0300, Peter Eisentraut wrote:
> On fre, 2010-05-28 at 14:48 -0400, Tom Lane wrote:
> > > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;
> >
> > That seems fairly bizarre.  What does this mean:
> >
> >     WHERE a COLLATE en > b COLLATE de
> >
> > ?  If it's an error, why is this not an error
> >
> >     WHERE a COLLATE en > b
> >
> > if b is marked as COLLATE de in its table?
>
> The way I understand it, a collation "derivation" can be explicit or
> implicit.  Explicit derivations override implicit derivations.  If in
> the argument set of an operation, explicit collation derivations exist,
> they must all be the same.

The SQL standard has an explicit set of rules for determining the
collations of any particular operation (they apply to
operators/functions not to the datums).

The basic idea is that tables/columns/data types define an implicit
collation, which can be overidden by explicit collations. If there is
ambiguity you throw an error. I implemented this all several years ago,
it's not all that complicated really. IIRC I added a field to the Node type
and each level determined it's collection from the sublevels.

I solved the problem for the OP by providing an extra function to user
defined functions which would return the collation for that particular
call.

The more interesting question I found was that the standard only
defined collation for strings, whereas it can be applied much more
broadly. I described a possible solution several years back, it should
in the archives somewhere. It worked pretty well as I recall.

IIRC The idea was to let each type has its own set of collations and
when using an operator/function you let the collection be determined
by the argument that had the same type as the return type.

It would be nice if COLLATE could finally be implemented, it'd be quite
useful.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: How to pass around collation information

From

Peter Eisentraut

Date:

01 June 2010, 20:51:15

On lör, 2010-05-29 at 00:18 +0300, Heikki Linnakangas wrote:
> What does the spec have to say about the ctype used for upper() et al
> BTW?

It doesn't make any provisions for locale dependencies for that.

Re: How to pass around collation information

From

Peter Eisentraut

Date:

02 June 2010, 19:46:35

On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote:
> The feature I'm thinking of is what
> people might call "per-column locale", and the SQL standard defines
> that.  It would look like this:
> 
> CREATE TABLE test (
>     a varchar COLLATE de,
>     b varchar COLLATE fr
> );
> 
> SELECT * FROM test WHERE a > 'baz' ORDER BY b;

Perhaps it's also worth pointing out there could be use cases other than
supporting multiple natural languages.  For example, it is frequently
requested to be able to sort in ways that doesn't ignore special
characters, binary sort, or perhaps special file name sort that treats
'/' special in some way.  So it could be quite useful to be able to say

CREATE TABLE something (   description text COLLATE en,   code char(6) COLLATE binary,   file text COLLATE
filename_sort
);

or even something like

CREATE DOMAIN filename AS text COLLATE filename_sort;

These are examples where having the collation attached to the column
would appear to make more sense then having it attached only to
operations.

Re: How to pass around collation information

From

Robert Haas

Date:

02 June 2010, 20:56:51

On Wed, Jun 2, 2010 at 3:46 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote:
>> The feature I'm thinking of is what
>> people might call "per-column locale", and the SQL standard defines
>> that.  It would look like this:
>>
>> CREATE TABLE test (
>>     a varchar COLLATE de,
>>     b varchar COLLATE fr
>> );
>>
>> SELECT * FROM test WHERE a > 'baz' ORDER BY b;
>
> Perhaps it's also worth pointing out there could be use cases other than
> supporting multiple natural languages.  For example, it is frequently
> requested to be able to sort in ways that doesn't ignore special
> characters, binary sort, or perhaps special file name sort that treats
> '/' special in some way.  So it could be quite useful to be able to say
>
> CREATE TABLE something (
>    description text COLLATE en,
>    code char(6) COLLATE binary,
>    file text COLLATE filename_sort
> );
>
> or even something like
>
> CREATE DOMAIN filename AS text COLLATE filename_sort;
>
> These are examples where having the collation attached to the column
> would appear to make more sense then having it attached only to
> operations.

But in the end the only purpose of setting it on a column is to set
which one will be used for operations on that column.  And the user
might still override it for a particular query.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: How to pass around collation information

From

Peter Eisentraut

Date:

03 June 2010, 07:43:17

On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote:
> But in the end the only purpose of setting it on a column is to set
> which one will be used for operations on that column.  And the user
> might still override it for a particular query.

Of course.  I'm just saying that it *can* be useful to attach a
collation to a column definition, rather than only allowing it to be
specified along with the sort operation.

Re: How to pass around collation information

From

David Christensen

Date:

03 June 2010, 12:31:06

On Jun 3, 2010, at 2:43 AM, Peter Eisentraut wrote:

> On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote:
>> But in the end the only purpose of setting it on a column is to set
>> which one will be used for operations on that column.  And the user
>> might still override it for a particular query.
>
> Of course.  I'm just saying that it *can* be useful to attach a
> collation to a column definition, rather than only allowing it to be
> specified along with the sort operation.

How does collation relate to per-table/column encodings?  For that matter, are per-table/column encodings spec, and/or
somethingthat we're looking to implement down the line? 

Regards,

David
--
David Christensen
End Point Corporation
david@endpoint.com

Re: How to pass around collation information

From

Peter Eisentraut

Date:

03 June 2010, 14:42:05

On tor, 2010-06-03 at 07:30 -0500, David Christensen wrote:
> How does collation relate to per-table/column encodings?

There is some connection between collations and character sets or
encodings, because a collation is tied to one of those, just as a
necessity of implementation (depending on implementation details).  You
could have per-column (or per-some-other-small-unit) collation with a
global encoding, but a per-column encoding with a global collation
wouldn't work.

> For that matter, are per-table/column encodings spec,

Yes.

>  and/or something that we're looking to implement down the line?

I don't think anyone is seriously planning that.  But per-column
collations would have to come first, anyway.

Of course there is always the possibility that someone comes up with an
alternative plan that invalidates the above, but the above represents
the facts from the SQL standard and other implementations.

Re: How to pass around collation information

From

Robert Haas

Date:

03 June 2010, 15:46:29

On Thu, Jun 3, 2010 at 3:43 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote:
>> But in the end the only purpose of setting it on a column is to set
>> which one will be used for operations on that column.  And the user
>> might still override it for a particular query.
>
> Of course.  I'm just saying that it *can* be useful to attach a
> collation to a column definition, rather than only allowing it to be
> specified along with the sort operation.

Oh, I agree with that, for sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company