Thread: For review: Initial support for COLLATE

For review: Initial support for COLLATE

From
Martijn van Oosterhout
Date:
[Please CC any replies, thanks]

This patch is the beginnings of support for COLLATE. I need to do some
other work for a few days so I'm posting here to get some initial
reviews. Various parts are marked [done] and [not done].

The steps involved are:

- Add COLLATE to grammer as part of expression tree. New CollateClause
  node stores the relevent facts. [done]

- parse_expr goes through tree and determines the appropriate COLLATE
  state for each node as per SQL spec. [done]

  Note: PostgreSQL doesn't really have a way to identify text-like
  types and there's no real need to anyway. The implementation allows
  any node to have a collate state. This can be used for other things,
  see below.

- CREATE COLLATE statement [not done, currently added on the fly]

- Two new datatypes, pg_locale_t and pg_localedata_t. The former
  represents a locale (and will eventually have an OID). The latter is
  an anonymous cookie that stores locale specific information and is
  passed to functions that need it. It may be that once done, this
  latter type will vanish again. [done]

- Several utility functions in the new file pg_xlocale.c for use by the
  rest of the system. [done]

- Add boolean column 'proislocalized' to pg_proc which indicates if the
  output of this function is affected by the LOCALE. eg textcat doesn't
  care but textle does. [done] This is for:

  a) So the parser can complain about functions that look at the
     locale/collate order but it's not clear from the arguments (state
     None per SQL spec) and it's not specified explicitly. [not done]

  b) So when a column or function is indexed, the index code knows if
     the locale is relevent to sorting order. This is particularly
     interesting for btree indexes on character strings. [not done]

  Currently I've marked 58 functions as being locale sensetive, but
  that list will need careful going over. To some extent it can be
  checked automatically by examining which backend functions use the
  new PG_GETLOCALE() macro.

- Check for correct encoding in loaded locales [not done]

- Check the partial indexes do the right thing when matching
  expressions. [not done]

- Docs, regression, etc...

- make check: right now I'm getting some regressions in the rules
  and plpgsql, very odd... Possibly due to the fact that rules get
  collate nodes with locales that don't persist across invokations.

Goals:

- This setup extends the SQL spec a bit, in the sense that COLLATE can
  be attached to anything. It is my intention to allow functions such
  as to_char() and to_timestamp() to be localized. eg:

test=# select cash_out('1.00'::money collate 'nl_NL'), cash_out('1.00'::money collate 'en_AU');
 cash_out | cash_out
----------+----------
 EUR1,00  | $1.00
(1 row)

- Should LOCALE be created as a synonym for COLLATE? It reads more
  naturally.

- Currently LC_COLLATE is fixed at initdb and LC_NUMERIC and
  LC_CURRENCY can be altered. The idea is that eventually even
  LC_COLLATE can be altered anytime (from the users point view anyway),
  as any objects that care will store the collate order they want and
  can execute functions as appropriate.

- Eventually once the transition to full locale support is complete,
  change the backend so it always runs under locale C and only
  functions on userdata are affected by locales. This means that
  unquoted identifiers when converted to lowercase will be lowered as
  per ASCII rules. Judging by [1] this is what people want, but
  feedback would be nice.

[1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg39742.html

  Hence, right now, names compare using normal strcmp, but text,
  varchar, etc use strcoll.

- The patch as it currently stands won't compile at all without system
  xlocale support. The goal is to provide some level of backward
  compatability where the COLLATE clause can be used, but only to
  affect functions like to_char(). Changing COLLATE order would be
  forbidden (ie just like now).

- Eventually, once the parts relevent to locales are sufficiently
  abstracted, look into ICU to plug it in. Unfortunatly, it's a
  completely different model (all utf-16) so that's another phase
  altogether.

- The default type output functions should never be locale specific.
  This is to avoid issues with pgdump and frontends. Create a function
  'localize(anyelement)' to "do the obvious" to force it to happen.

Download:
  Compressed 57K, uncompressed >500K but that's due to rewriting the
  whole of pg_proc. The important code is not so big. Against todays
  CVS.

http://svana.org/kleptog/pgsql/collate2.patch.gz

Examples:

test=# SELECT text('a') < text('B') COLLATE 'C', text('a') < text('B') COLLATE 'en_US.UTF-8';
 ?column? | ?column?
----------+----------
 f        | t
(1 row)

test=# SELECT text('A') < text('b') COLLATE 'C', text('A') < text('b') COLLATE 'en_US.UTF-8';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

test=# SELECT text('A') COLLATE 'en_US.UTF-8' < text('b') COLLATE 'C';
ERROR:  Conflicting COLLATE clauses

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: For review: Initial support for COLLATE

From
Martijn van Oosterhout
Date:
On Wed, Sep 07, 2005 at 09:12:12PM +0200, Peter Eisentraut wrote:
> No, and in fact the terminology mixup in your patch and description
> concerns me.  If you are talking about collation, then the data types,
> system catalog columns, etc. should talk about collation, not about
> "locale", because that encompasses a number of other things that can be
> handled independent of the collation order.

That's because locale handling == collate handling. If you want to have
a system that allows you to choose what collate order to use when, it
also allows you to control all the other locale attributes with exactly
the same code. You can't seperate them.

To make multiple COLLATE orders work, PostgreSQL needs a new locale
framework, one that is not bound by POSIX. That's why you see both
things referred to.

Look at the POSIX interface, look at ICU. They both provide collate
order *and* date/time formats *and* number formatting. Pretending
they're seperate is silly.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: For review: Initial support for COLLATE

From
Peter Eisentraut
Date:
Martijn van Oosterhout wrote:
> - Should LOCALE be created as a synonym for COLLATE? It reads more
>   naturally.

No, and in fact the terminology mixup in your patch and description
concerns me.  If you are talking about collation, then the data types,
system catalog columns, etc. should talk about collation, not about
"locale", because that encompasses a number of other things that can be
handled independent of the collation order.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: For review: Initial support for COLLATE

From
Peter Eisentraut
Date:
Martijn van Oosterhout wrote:
> Look at the POSIX interface, look at ICU. They both provide collate
> order *and* date/time formats *and* number formatting. Pretending
> they're seperate is silly.

I'm not pretending.  They *are* separate.  That's why you can set them
separately.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: For review: Initial support for COLLATE

From
Martijn van Oosterhout
Date:
On Thu, Sep 08, 2005 at 12:04:42AM +0200, Peter Eisentraut wrote:
> Martijn van Oosterhout wrote:
> > Look at the POSIX interface, look at ICU. They both provide collate
> > order *and* date/time formats *and* number formatting. Pretending
> > they're seperate is silly.
>
> I'm not pretending.  They *are* separate.  That's why you can set them
> separately.

They're provided by the same library and set using the same interface.
Currently LC_COLLATE is fixed for the database, but LC_TIME and
LC_NUMERIC can vary between queries. For COLLATE support you need to be
able to specify a different LC_COLLATE for different parts of the
query, this provides that. The POSIX interface isn't flexible enough
for that.

To support COLLATE PostgreSQL's locale support needs to be upgraded and
made more flexible. This is not a surprise. Anything that doesn't use
the new interface isn't affected. I'm a little confused where the
problem is.

If you're worried about functions using the COLLATE tag for something
else, well then we won't do that. Each function/operator decides for
itself whether it's meaningful for them or not...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: For review: Initial support for COLLATE

From
Peter Eisentraut
Date:
Martijn van Oosterhout wrote:
> To support COLLATE PostgreSQL's locale support needs to be upgraded
> and made more flexible. This is not a surprise. Anything that doesn't
> use the new interface isn't affected. I'm a little confused where the
> problem is.

You were proposing to effectively make aliases out of the terms
collation and locale.  I was asking you not to do that because the term
locale encompasses more issues than collation.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: For review: Initial support for COLLATE

From
Martijn van Oosterhout
Date:
On Thu, Sep 08, 2005 at 07:00:08PM +0200, Peter Eisentraut wrote:
> Martijn van Oosterhout wrote:
> > To support COLLATE PostgreSQL's locale support needs to be upgraded
> > and made more flexible. This is not a surprise. Anything that doesn't
> > use the new interface isn't affected. I'm a little confused where the
> > problem is.
>
> You were proposing to effectively make aliases out of the terms
> collation and locale.  I was asking you not to do that because the term
> locale encompasses more issues than collation.

Ok, I won't do that then...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment