Thread: Per-column collation, the finale

Per-column collation, the finale

From
Peter Eisentraut
Date:
I've been going over this patch with a fine-tooth comb for the last two
weeks, fixed some small bugs, added comments, made initdb a little
friendlier, but no substantial changes.

I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
support and associated things now.


Attachment

Re: Per-column collation, the finale

From
Robert Haas
Date:
On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I've been going over this patch with a fine-tooth comb for the last two
> weeks, fixed some small bugs, added comments, made initdb a little
> friendlier, but no substantial changes.
>
> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
> support and associated things now.

Maybe I'm all wet here, but isn't it time to commit what you've got
and punt the things that aren't done to 9.2?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Per-column collation, the finale

From
Euler Taveira de Oliveira
Date:
Em 14-01-2011 20:47, Robert Haas escreveu:
> On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut<peter_e@gmx.net>  wrote:
>> I've been going over this patch with a fine-tooth comb for the last two
>> weeks, fixed some small bugs, added comments, made initdb a little
>> friendlier, but no substantial changes.
>>
>> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
>> support and associated things now.
>
> Maybe I'm all wet here, but isn't it time to commit what you've got
> and punt the things that aren't done to 9.2?
>
I think Peter want another person to take a look at his patch. I personally 
would like to eyeball his patch (but it will be during the week).


--   Euler Taveira de Oliveira  http://www.timbira.com/


Re: Per-column collation, the finale

From
Itagaki Takahiro
Date:
On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <peter_e@gmx.net> wrote:
> I've been going over this patch with a fine-tooth comb for the last two
> weeks, fixed some small bugs, added comments, made initdb a little
> friendlier, but no substantial changes.

What can I do to test the patch?
No regression tests are included in it, and I have an almost empty
pg_collation catalog with it:

=# SELECT * FROM pg_collation;collname | collnamespace | collencoding | collcollate | collctype
----------+---------------+--------------+-------------+-----------default  |            11 |            0 |
|
 
(1 row)

-- 
Itagaki Takahiro


Re: Per-column collation, the finale

From
Peter Eisentraut
Date:
On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
> On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <peter_e@gmx.net> wrote:
> > I've been going over this patch with a fine-tooth comb for the last two
> > weeks, fixed some small bugs, added comments, made initdb a little
> > friendlier, but no substantial changes.
> 
> What can I do to test the patch?
> No regression tests are included in it,

Please refer to the regress.sgml hunk about running the test.

> and I have an almost empty pg_collation catalog with it:
> 
> =# SELECT * FROM pg_collation;
>  collname | collnamespace | collencoding | collcollate | collctype
> ----------+---------------+--------------+-------------+-----------
>  default  |            11 |            0 |             |
> (1 row)

The initdb output should say something about how it got there.




Re: Per-column collation, the finale

From
Noah Misch
Date:
On Wed, Jan 26, 2011 at 12:35:07AM +0200, Peter Eisentraut wrote:
> On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
> > and I have an almost empty pg_collation catalog with it:
> > 
> > =# SELECT * FROM pg_collation;
> >  collname | collnamespace | collencoding | collcollate | collctype
> > ----------+---------------+--------------+-------------+-----------
> >  default  |            11 |            0 |             |
> > (1 row)
> 
> The initdb output should say something about how it got there.

FWIW, I tried and had the same problem.  initdb gave:
 creating collations ... not supported on this platform

"configure" was failing to detect locale_t for me, and this fixed it:

*** a/configure.in
--- b/configure.in
***************
*** 1116,1122 **** AC_TYPE_INTPTR_T AC_TYPE_UINTPTR_T AC_TYPE_LONG_LONG_INT 
! AC_CHECK_TYPES([locale_t], [#include <locale.h>])  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred],
[],[],
 
--- 1116,1122 ---- AC_TYPE_UINTPTR_T AC_TYPE_LONG_LONG_INT 
! AC_CHECK_TYPES([locale_t], [], [], [#include <locale.h>])  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct
sockcred],[], [],
 


Re: Per-column collation, the finale

From
Noah Misch
Date:
Hi Peter,

I'm reviewing this patch as part of CommitFest 2011-01.

On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote:
> I've been going over this patch with a fine-tooth comb for the last two
> weeks, fixed some small bugs, added comments, made initdb a little
> friendlier, but no substantial changes.
> 
> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
> support and associated things now.

This no longer applies cleanly against git master, so I've done my testing
against 52713d0.  I tested on Ubtunu 8.04 (GNU libc 2.7) and used the
configure.in change I sent earlier[1] to get locale support detected.  Since the
time to seriously verify every patch hunk would be measured in weeks, I have not
attempted that level of review detail.  I've just commented on what stood out
while reading and using the patch.

If you go about supporting non-GNU libc systems, you may want to look at the
gnulib locale module[2] for a starting point.

The new documentation is helpful.  It would be useful to document the
implications of marking your user-defined type COLLATABLE.  As best I can tell,
you should only set COLLATABLE if functions that would be expected to respect
collation, particularly members of a B-tree operator family, do check the fmgr
collation.  Otherwise, the user might specify collations for data of your type
and be surprised to get no differentiated behavior.  Are there other
considerations?  Also, should implementers of COLLATABLE types observe any
restrictions on when to respect the collation?  For example, is equality
permitted to be sensitive to collation?  On that note, the following is
accepted; should it be?
 CREATE TABLE parent (key text COLLATE "sv_SE"); CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); CREATE TABLE child
(keytext COLLATE "tr_TR" REFERENCES parent(key));
 

Does the system expect any invariants to hold on the platform implementation of
collations?  For example, that they be consistent with equality in some way?  If
so, would it be practical and/or helpful to provide a tool for testing a given
platform locale for conformance?

pg_attribute entries for an index currently retain the attcollation of the table
columns.  They should have the collation of the index columns.  At that point,
pg_index.indcollation would exist for optimization only.

Though there's no reason it ought to happen with the first commit, it would be
really excellent for "make check" to detect when the platform has sufficient
support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
locales).  Manual execution is fine for something like numeric_big, but this is
so platform-sensitive that testing it widely is important.

It would likewise be nice to have a way to display the collation of an arbitrary
expression.  A merger of pg_typeof and format_type would cover that.

Four call sites gain code like this (example is from varstr_cmp):

+#ifdef HAVE_LOCALE_T
+        locale_t    mylocale = pg_newlocale_from_collation(collid);
+#else
+        check_default_collation(collid);
+#endif
...
+#ifdef HAVE_LOCALE_T
+        result = strcoll_l(a1p, a2p, mylocale);
+#else        result = strcoll(a1p, a2p);
+#endif

Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
that merely calls check_default_collation, "#define strcoll_l(a, b, c)
strcoll(a, b)", etc.  I can see six TODO sites with similar needs, and the
abstraction would make the mainline code significantly cleaner.  Even so, it
might not pay off.  Thoughts?

Couldn't check_default_collation be a no-op except on assert-only builds?  It's
currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
have added non-default collations.  CREATE COLLATION will presumably be made to
fail unconditionally on such builds, too.

wchar2char and char2wchar take a collation argument, but they only use it to
"Assert(!lc_ctype_is_c(collation))".  Maybe that's best, but it seems weird.

The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
The documentation also says "An index can only support one collation.", but it's
one collation per column.

This surprised me; not sure how much can/should be done:
 [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID"; ERROR:  collations are
notsupported by type integer [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc
COLLATE"id_ID"; -- worked ...
 

This too:
 [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; -- worked ... [local] test=#
SELECTprosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; -- worked ... [local] test=#
SELECTprosrc,                 count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; ERROR:  column "pg_proc.prosrc"
mustappear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT prosrc, count(*) FROM pg_proc
GROUPBY prosrc COLLATE...
 

You implement the standard's requirement to forbid "CAST(foo AS text COLLATE
"id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead.  You also
reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"".
Is that desirable, needed to avoid syntactic ambiguity, or just an unintended
consequence?  In the absence of other considerations, it seems undesirable to
require the parentheses.

UNION on disparate collations seems to error or strip them, depending on how
it's written:
 CREATE TABLE t_sv (c text COLLATE "sv_SE"); CREATE TABLE t_tr (c text COLLATE "tr_TR"); CREATE TABLE t_both AS SELECT
cFROM t_sv UNION ALL SELECT c FROM t_tr; -- t_both.c has default collation CREATE TABLE t_otherboth AS SELECT
('foo'::text)COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; -- ERROR:  42P21: collation mismatch
betweenexplicit collations "sv_SE" and "tr_TR"
 

COLLATE is getting recognized in various places where it does not actually do
anything.  Probably a few more places need SimpleTypenameWithoutCollation:
 CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); ALTER OPERATOR || (text
COLLATE"id_ID", text) OWNER TO nm; CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; ALTER AGGREGATE
min(textcollate "id_ID") OWNER TO nm; CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT
1';CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; PREPARE foo(text
collate"id_ID") AS SELECT $1; -- maybe this one does something?  didn't verify
 


I ran this benchmark:
 -- Setup: table of 10M random Unicode characters SELECT setseed(0); CREATE TABLE t (c) AS SELECT chr(1 + (random() *
65534)::int)FROM generate_series(1,10000000) gen(n);
 
 -- en_US.UTF8, unpatched: 65.73s -- en_US.UTF8,   patched: 74.43s -- id_ID.UTF8, unpatched: 65.77s -- id_ID.UTF8,
patched:74.53s SELECT min(c)
 
 -- en_US.UTF8,   patched: 86.46s -- id_ID.UTF8,   patched: 83.70s --SELECT min(c COLLATE "id_ID")
 FROM (     SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t
 UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t )
t_all;

The 13% slowdown with the feature unused seems worrisome, but the additional 16%
slowdown when using the feature to adopt a different locale seems fine.  Here is
an opreport (entries >1%) from master (actually 52713d0) running the first test
SELECT under en_US.UTF8:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
59288    12.7660  ExecProcNode
44183     9.5136  advance_transition_function
43211     9.3043  slot_deform_tuple
36019     7.7557  ExecProject
33495     7.2122  heapgettup_pagemode
26394     5.6832  varstr_cmp
26316     5.6664  advance_aggregates
23895     5.1451  ExecAgg
14472     3.1161  ExecStoreTuple
12939     2.7860  ExecScan
12750     2.7454  HeapTupleSatisfiesMVCC
12250     2.6377  heapgetpage
11695     2.5182  text_smaller
11559     2.4889  slot_getsomeattrs
10774     2.3199  ExecClearTuple
9092      1.9577  text_cmp
7724      1.6631  SeqNext
7354      1.5835  heap_getnext
7257      1.5626  ExecAppend
6647      1.4312  pg_detoast_datum_packed
6505      1.4007  AllocSetReset
5692      1.2256  XidInMVCCSnapshot

Patched, same test case:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
58488    12.9975  ExecProcNode
45430    10.0957  advance_transition_function
42843     9.5208  slot_deform_tuple
35877     7.9728  ExecProject
32926     7.3170  heapgettup_pagemode
27181     6.0403  varstr_cmp
26342     5.8539  advance_aggregates
22243     4.9430  ExecAgg
14452     3.2116  ExecStoreTuple
13171     2.9269  ExecScan
11708     2.6018  text_smaller
11417     2.5371  slot_getsomeattrs
10758     2.3907  heapgetpage
10710     2.3800  ExecClearTuple
10035     2.2300  HeapTupleSatisfiesMVCC
8998      1.9996  text_cmp
7721      1.7158  SeqNext
7329      1.6287  heap_getnext
7324      1.6276  ExecAppend
6534      1.4520  AllocSetReset
6481      1.4402  pg_detoast_datum_packed
5862      1.3027  XidInMVCCSnapshot
4608      1.0240  MemoryContextReset

Nothing really pops out there.


pg_dump successfully preserves collation for table columns, index columns, and
domains.

Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
to have a different concept of collation naming.  This patch creates a "default"
collation that is implicitly the LC_COLLATE/LC_CTYPE of the database.  The
standard has specific names for specific default collations based on the
character repertoire (our server_encoding, I suppose).  So, the most-conformant
thing would be to set the name of the pg_collation of oid = 100 at CREATE
DATABASE time, based on the chosen ENCODING.  Note sure how valuable this is.

The standard requires that collations be GRANTable objects, but that seems
fairly useless in practice.

Section 11.4 puts the COLLATE clause as the last part of the column definition.
This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work,
but it does not.  It appears CREATE TABLE gets the COLLATE clause via
SimpleTypname, but to support the standard syntax, it would need to arrive via
ColConstraintElem or some such.  I have not looked at the problems this might
bring.  A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE =
'foo') COLLATE "id_ID"" (11.24).  Also, 11.4 syntax rule 9 seems to forbid the
following, which the patch accepts.  Seems just as well to part from the spec in
this regard, though:
 CREATE DOMAIN dom AS text; CREATE TABLE t (c dom COLLATE "id_ID");

Overall, this patch implements a useful feature, and I could not find any deep
unsoundness in the implementation.  There's obviously plenty left to do
(multi-platform support, DDL for collations, various TODOs in the patch), but
one need not expect to cover all that in one patch.  I'd be most concerned about
the performance regression when not using the feature.

I'm sure I've missed plenty, and even this review probably contains errors.

Thanks,
nm

[1] http://archives.postgresql.org/pgsql-hackers/2011-01/msg02443.php
[2] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=m4/locale_h.m4


Re: Per-column collation, the finale

From
Robert Haas
Date:
On Sat, Jan 29, 2011 at 2:52 AM, Noah Misch <noah@leadboat.com> wrote:
> The 13% slowdown with the feature unused seems worrisome

Very worrisome.  This is a frequently-requested feature, but this
seems like a potential show-stopper.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Per-column collation, the finale

From
Peter Eisentraut
Date:
On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> I'm reviewing this patch as part of CommitFest 2011-01.

Thank you very much for this thorough review.

> This no longer applies cleanly against git master, so I've done my testing
> against 52713d0. 

I will send an updated patch soon, when I have figured out the issues
discussed below.

> If you go about supporting non-GNU libc systems, you may want to look at the
> gnulib locale module[2] for a starting point.

I don't have a Mac system, but I'll see about figuring this out.

> The new documentation is helpful.  It would be useful to document the
> implications of marking your user-defined type COLLATABLE.  As best I can tell,
> you should only set COLLATABLE if functions that would be expected to respect
> collation, particularly members of a B-tree operator family, do check the fmgr
> collation.  Otherwise, the user might specify collations for data of your type
> and be surprised to get no differentiated behavior.  Are there other
> considerations?  Also, should implementers of COLLATABLE types observe any
> restrictions on when to respect the collation?  For example, is equality
> permitted to be sensitive to collation?

I improved the documentation in the CREATE TYPE man page about this.
Note, however, that the questions you raise are not new.  There already
is a set of types that observe locale information in a set of associated
functions.  The answers are already hidden somewhere; this feature just
exposes them in a different way.  Over time, we may want to document
more of this, but at the moment it's not clear how much to actually
expose.

> On that note, the following is accepted; should it be?
> 
>   CREATE TABLE parent (key text COLLATE "sv_SE");
>   CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
>   CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
> 
> Does the system expect any invariants to hold on the platform implementation of
> collations?  For example, that they be consistent with equality in some way?  If
> so, would it be practical and/or helpful to provide a tool for testing a given
> platform locale for conformance?

We discussed this previously in this thread.  Currently, collation
support does not affect equality.  Values are only equal if they are
bytewise equal.  That's why the above is currently OK, but it should
probably be restricted in the future.  I'll make a note about it.

> pg_attribute entries for an index currently retain the attcollation of the table
> columns.  They should have the collation of the index columns.  At that point,
> pg_index.indcollation would exist for optimization only.

Fixed.

> Though there's no reason it ought to happen with the first commit, it would be
> really excellent for "make check" to detect when the platform has sufficient
> support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
> locales).  Manual execution is fine for something like numeric_big, but this is
> so platform-sensitive that testing it widely is important.

I solicited ideas about this a while ago, but no one came up with a
solution.  If we do get Mac or Windows support, it would be a good time
to revisit this and devise a platform-independent approach, if possible.

> It would likewise be nice to have a way to display the collation of an arbitrary
> expression.  A merger of pg_typeof and format_type would cover that.

Agreed.  I have a follow-up patch pending for that.

> Four call sites gain code like this (example is from varstr_cmp):
> 
> +#ifdef HAVE_LOCALE_T
> +        locale_t    mylocale = pg_newlocale_from_collation(collid);
> +#else
> +        check_default_collation(collid);
> +#endif
> ...
> +#ifdef HAVE_LOCALE_T
> +        result = strcoll_l(a1p, a2p, mylocale);
> +#else
>          result = strcoll(a1p, a2p);
> +#endif
> 
> Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
> that merely calls check_default_collation, "#define strcoll_l(a, b, c)
> strcoll(a, b)", etc.  I can see six TODO sites with similar needs, and the
> abstraction would make the mainline code significantly cleaner.  Even so, it
> might not pay off.  Thoughts?

I had considered that, but I'm mainly hesitant about introducing a fake
locale_t type into the public namespace.  Maybe it should be wrapped
into a pg_locale_t; then we can do whatever we want.

> Couldn't check_default_collation be a no-op except on assert-only builds?  It's
> currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
> have added non-default collations.  CREATE COLLATION will presumably be made to
> fail unconditionally on such builds, too.

Unless you register the HAVE_LOCALE_T state in pg_control, you need to
be prepared to deal with schemas that contain nondefault collations even
though the server binary doesn't support it.

> wchar2char and char2wchar take a collation argument, but they only use it to
> "Assert(!lc_ctype_is_c(collation))".  Maybe that's best, but it seems weird.

Yeah, it's weird, but that is the existing interface.

> The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
> The documentation also says "An index can only support one collation.", but it's
> one collation per column.

Fixed, leftover from previous code.

> This surprised me; not sure how much can/should be done:
> 
>   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID";
>   ERROR:  collations are not supported by type integer
>   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
>   -- worked ...

Yeah, this could be fixed.  I'll make a note about this.  Note that this
is actually SQL92 syntax, removed in SQL99, so it's not essential.

> This too:
> 
>   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
>   -- worked ...
>   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
>   -- worked ...
>   [local] test=# SELECT prosrc,                 count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
>   ERROR:  column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
>   LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...

This is correct, I think.  At least if you want to keep open the
possibility that a collation could also determine equality.

> You implement the standard's requirement to forbid "CAST(foo AS text COLLATE
> "id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead.  You also
> reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"".
> Is that desirable, needed to avoid syntactic ambiguity, or just an unintended
> consequence?  In the absence of other considerations, it seems undesirable to
> require the parentheses.

This is just how the parser interprets it.  The :: syntax is full of odd
ambiguities like that.

> UNION on disparate collations seems to error or strip them, depending on how
> it's written:
> 
>   CREATE TABLE t_sv (c text COLLATE "sv_SE");
>   CREATE TABLE t_tr (c text COLLATE "tr_TR");
>   CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
>   -- t_both.c has default collation
>   CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
>   -- ERROR:  42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"

This is correct per SQL standard.  In the second case, the explicit
COLLATE clauses cause a conflict, so you get an error.  In the first
case, you have implicit collation derivations that conflict, so the
result has actually no collation (not default collation), which means
you cannot sort the resulting value at all.  It's actually a bit strange
here that we allow the creation of a table based on a query that has a
no collation derivation, but I did not see anything in the SQL standard
that prohibits it.  Maybe a notice would be useful, like when you create
a column of type "unknown".

> COLLATE is getting recognized in various places where it does not actually do
> anything.  Probably a few more places need SimpleTypenameWithoutCollation:
> 
>   CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
>   ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
>   CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
>   ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
>   CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
>   CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';

All of these places also accept useless typmods.  The parser is
currently pretty loose about that.  It would need some major
restructuring to fix that.

>   PREPARE foo(text collate "id_ID") AS SELECT $1; -- maybe this one does something?  didn't verify

It doesn't, but it probably could.  I'll check it out.

> The 13% slowdown with the feature unused seems worrisome, but the additional 16%
> slowdown when using the feature to adopt a different locale seems fine.

I figured out this problem.  It needs some small restructuring in
varstr_cmp to fix.  It kind of relates to how the
pg_newlocale_from_collation/check_default_collation restructuring is
done.  But it's definitely fixable.

Well, my numbers were actually smaller than 13%, but it was noticeable.
I'll give you something new to test later.

> Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
> to have a different concept of collation naming.  This patch creates a "default"
> collation that is implicitly the LC_COLLATE/LC_CTYPE of the database.  The
> standard has specific names for specific default collations based on the
> character repertoire (our server_encoding, I suppose).  So, the most-conformant
> thing would be to set the name of the pg_collation of oid = 100 at CREATE
> DATABASE time, based on the chosen ENCODING.  Note sure how valuable this is.

The problem is, you can't reach across databases when you create a new
one.  This is really the root of all evil with this default collation
business.

> The standard requires that collations be GRANTable objects, but that seems
> fairly useless in practice.

Yeah, omitted for now.

> Section 11.4 puts the COLLATE clause as the last part of the column definition.
> This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work,
> but it does not.  It appears CREATE TABLE gets the COLLATE clause via
> SimpleTypname, but to support the standard syntax, it would need to arrive via
> ColConstraintElem or some such.  I have not looked at the problems this might
> bring.  A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE =
> 'foo') COLLATE "id_ID"" (11.24).

This is only under the optional feature F692 "Extended collation
support", which I chose to not bother with, because it doesn't provide
any additional functionality.  (We could consider it if it were required
for compatibility with some other implementation.)

>   Also, 11.4 syntax rule 9 seems to forbid the
> following, which the patch accepts.  Seems just as well to part from the spec in
> this regard, though:
> 
>   CREATE DOMAIN dom AS text;
>   CREATE TABLE t (c dom COLLATE "id_ID");

Good catch.  Doesn't seem useful to restrict this, though.

This is good stuff.  I'll send you a new patch in a day or three for
perhaps another round of performance tests.  Some of the other issues
above can perhaps be postponed for follow-up patches.




Re: Per-column collation, the finale

From
Noah Misch
Date:
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote:
> On l??r, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> > The new documentation is helpful.  It would be useful to document the
> > implications of marking your user-defined type COLLATABLE.  As best I can tell,
> > you should only set COLLATABLE if functions that would be expected to respect
> > collation, particularly members of a B-tree operator family, do check the fmgr
> > collation.  Otherwise, the user might specify collations for data of your type
> > and be surprised to get no differentiated behavior.  Are there other
> > considerations?  Also, should implementers of COLLATABLE types observe any
> > restrictions on when to respect the collation?  For example, is equality
> > permitted to be sensitive to collation?
> 
> I improved the documentation in the CREATE TYPE man page about this.
> Note, however, that the questions you raise are not new.  There already
> is a set of types that observe locale information in a set of associated
> functions.  The answers are already hidden somewhere; this feature just
> exposes them in a different way.  Over time, we may want to document
> more of this, but at the moment it's not clear how much to actually
> expose.

That's a helpful way to think about it.

> > On that note, the following is accepted; should it be?
> > 
> >   CREATE TABLE parent (key text COLLATE "sv_SE");
> >   CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
> >   CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
> > 
> > Does the system expect any invariants to hold on the platform implementation of
> > collations?  For example, that they be consistent with equality in some way?  If
> > so, would it be practical and/or helpful to provide a tool for testing a given
> > platform locale for conformance?
> 
> We discussed this previously in this thread.  Currently, collation
> support does not affect equality.  Values are only equal if they are
> bytewise equal.  That's why the above is currently OK, but it should
> probably be restricted in the future.  I'll make a note about it.

I see that this is true for texteq, bpchareq and citext_eq.  However, nothing
stops a user from creating a COLLATABLE type and making the "=" operator for its
B-tree operator class sensitive to collation, right?  Could that break any
system assumptions?  That is, do we need to document that user-defined hash and
B-tree operator classes must ignore locale in their "=" operators?

> > Though there's no reason it ought to happen with the first commit, it would be
> > really excellent for "make check" to detect when the platform has sufficient
> > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
> > locales).  Manual execution is fine for something like numeric_big, but this is
> > so platform-sensitive that testing it widely is important.
> 
> I solicited ideas about this a while ago, but no one came up with a
> solution.  If we do get Mac or Windows support, it would be a good time
> to revisit this and devise a platform-independent approach, if possible.

As a rough idea, we could introduce the notion of a "skip condition" SQL command
for a pg_regress test file.  When the command throws an error, pg_regress skips
that file.  For these tests, the skip condition would resemble: SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR"

> > Four call sites gain code like this (example is from varstr_cmp):
> > 
> > +#ifdef HAVE_LOCALE_T
> > +        locale_t    mylocale = pg_newlocale_from_collation(collid);
> > +#else
> > +        check_default_collation(collid);
> > +#endif
> > ...
> > +#ifdef HAVE_LOCALE_T
> > +        result = strcoll_l(a1p, a2p, mylocale);
> > +#else
> >          result = strcoll(a1p, a2p);
> > +#endif
> > 
> > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
> > that merely calls check_default_collation, "#define strcoll_l(a, b, c)
> > strcoll(a, b)", etc.  I can see six TODO sites with similar needs, and the
> > abstraction would make the mainline code significantly cleaner.  Even so, it
> > might not pay off.  Thoughts?
> 
> I had considered that, but I'm mainly hesitant about introducing a fake
> locale_t type into the public namespace.  Maybe it should be wrapped
> into a pg_locale_t; then we can do whatever we want.

True.  There are advantages and disadvantages to both.  No strong opinion here.

> > Couldn't check_default_collation be a no-op except on assert-only builds?  It's
> > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
> > have added non-default collations.  CREATE COLLATION will presumably be made to
> > fail unconditionally on such builds, too.
> 
> Unless you register the HAVE_LOCALE_T state in pg_control, you need to
> be prepared to deal with schemas that contain nondefault collations even
> though the server binary doesn't support it.

Makes sense.

> >   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
> >   -- worked ...
> >   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> >   -- worked ...
> >   [local] test=# SELECT prosrc,                 count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> >   ERROR:  column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
> >   LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...
> 
> This is correct, I think.  At least if you want to keep open the
> possibility that a collation could also determine equality.

Yes; I agree, now.  Replace "COLLATE "id_ID"" with "||''", and you get the same.

> > UNION on disparate collations seems to error or strip them, depending on how
> > it's written:
> > 
> >   CREATE TABLE t_sv (c text COLLATE "sv_SE");
> >   CREATE TABLE t_tr (c text COLLATE "tr_TR");
> >   CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
> >   -- t_both.c has default collation
> >   CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
> >   -- ERROR:  42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"
> 
> This is correct per SQL standard.  In the second case, the explicit
> COLLATE clauses cause a conflict, so you get an error.  In the first
> case, you have implicit collation derivations that conflict, so the
> result has actually no collation (not default collation), which means
> you cannot sort the resulting value at all.  It's actually a bit strange
> here that we allow the creation of a table based on a query that has a
> no collation derivation, but I did not see anything in the SQL standard
> that prohibits it.  Maybe a notice would be useful, like when you create
> a column of type "unknown".

After rereading about collation derivation several times, I think I see why it's
correct.  MS SQL Server has the same behaviors, too.  (The exception I found:
SQL Server rejects the first UNION ALL above.  Both implementations reject it
with UNION ALL changed to UNION.)

> > COLLATE is getting recognized in various places where it does not actually do
> > anything.  Probably a few more places need SimpleTypenameWithoutCollation:
> > 
> >   CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
> >   ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
> >   CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
> >   ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
> >   CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
> >   CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';
> 
> All of these places also accept useless typmods.  The parser is
> currently pretty loose about that.  It would need some major
> restructuring to fix that.

Okay.  Peripheral to this patch, then.

> > The 13% slowdown with the feature unused seems worrisome, but the additional 16%
> > slowdown when using the feature to adopt a different locale seems fine.
> 
> I figured out this problem.  It needs some small restructuring in
> varstr_cmp to fix.  It kind of relates to how the
> pg_newlocale_from_collation/check_default_collation restructuring is
> done.  But it's definitely fixable.
> 
> Well, my numbers were actually smaller than 13%, but it was noticeable.
> I'll give you something new to test later.

Great.

> > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
> > to have a different concept of collation naming.  This patch creates a "default"
> > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database.  The
> > standard has specific names for specific default collations based on the
> > character repertoire (our server_encoding, I suppose).  So, the most-conformant
> > thing would be to set the name of the pg_collation of oid = 100 at CREATE
> > DATABASE time, based on the chosen ENCODING.  Note sure how valuable this is.
> 
> The problem is, you can't reach across databases when you create a new
> one.  This is really the root of all evil with this default collation
> business.

Understood.  Doesn't seem important to jump through hoops for this.

> This is good stuff.  I'll send you a new patch in a day or three for
> perhaps another round of performance tests.  Some of the other issues
> above can perhaps be postponed for follow-up patches.

I agree -- if the performance-when-unused gets solid, none of my other comments
ought to hold things up.

Thanks,
nm


Re: Per-column collation, the finale

From
Peter Eisentraut
Date:
On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
> > This is good stuff.  I'll send you a new patch in a day or three for
> > perhaps another round of performance tests.  Some of the other
> issues
> > above can perhaps be postponed for follow-up patches.
>
> I agree -- if the performance-when-unused gets solid, none of my other
> comments ought to hold things up.

Here is a new patch.

The main change is in varstr_cmp(), avoiding the calls to
pg_newlocale_from_collation() when the default locale is used.  This
accounts for the performance regression in my tests.  It also addresses
some of your refactoring ideas.


Attachment

Re: Per-column collation, the finale

From
Noah Misch
Date:
On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
> On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
> > > This is good stuff.  I'll send you a new patch in a day or three for
> > > perhaps another round of performance tests.  Some of the other
> > issues
> > > above can perhaps be postponed for follow-up patches.
> >
> > I agree -- if the performance-when-unused gets solid, none of my other
> > comments ought to hold things up.
>
> Here is a new patch.
>
> The main change is in varstr_cmp(), avoiding the calls to
> pg_newlocale_from_collation() when the default locale is used.  This
> accounts for the performance regression in my tests.  It also addresses
> some of your refactoring ideas.

Looks good and tests well.  I've attached the same benchmark script with updated
timings, and I've marked the patch Ready for Committer.

nm

Attachment

Re: Per-column collation, the finale

From
Bruce Momjian
Date:
Noah Misch wrote:
> On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
> > On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
> > > > This is good stuff.  I'll send you a new patch in a day or three for
> > > > perhaps another round of performance tests.  Some of the other
> > > issues
> > > > above can perhaps be postponed for follow-up patches.
> > > 
> > > I agree -- if the performance-when-unused gets solid, none of my other
> > > comments ought to hold things up. 
> > 
> > Here is a new patch.
> > 
> > The main change is in varstr_cmp(), avoiding the calls to
> > pg_newlocale_from_collation() when the default locale is used.  This
> > accounts for the performance regression in my tests.  It also addresses
> > some of your refactoring ideas.
> 
> Looks good and tests well.  I've attached the same benchmark script with updated
> timings, and I've marked the patch Ready for Committer.

Nice to see that performance hit is removed now!

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: Per-column collation, the finale

From
Peter Eisentraut
Date:
On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
> Looks good and tests well.  I've attached the same benchmark script
> with updated timings, and I've marked the patch Ready for Committer.

Committed.  Thanks everyone.



Re: Per-column collation, the finale

From
Thom Brown
Date:
On 8 February 2011 21:08, Peter Eisentraut <peter_e@gmx.net> wrote:
> On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
>> Looks good and tests well.  I've attached the same benchmark script
>> with updated timings, and I've marked the patch Ready for Committer.
>
> Committed.  Thanks everyone.

Awesome work Peter.  Few questions:

postgres=# create table meow (id serial, stuff text collate "de_XX");
NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for
serial column "meow.id"
ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
LINE 1: create table meow (id serial, stuff text collate "de_XX");

I wouldn't expect to see that first notice.  Shouldn't that step come
a bit later?

A bit of weirdness, I'm allowed to specify more than one collation on
a single column ordering...

postgres=# select * from meow order by stuff collate "en_GB" collate
"de_DE" desc;id | stuff
----+------- 2 | meow2 1 | meow
(2 rows)

Is this the same principal as casting, where they can be chained?
Which one wins in this case?  Although if anyone is actually doing
this, then it's just silly anyway. (says Thom having just done it)

Also, if a locale is installed after initdb, is it then impossible to
get pg_collate to pick up that new locale?  If a locale is somehow
removed from the system, what happens on the database side when
attempting to use a collated column? (I don't wish to TIAS on my own
system)

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935


Re: Per-column collation, the finale

From
Peter Eisentraut
Date:
On tis, 2011-02-08 at 22:17 +0000, Thom Brown wrote:
> postgres=# create table meow (id serial, stuff text collate "de_XX");
> NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for
> serial column "meow.id"
> ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
> LINE 1: create table meow (id serial, stuff text collate "de_XX");
> 
> I wouldn't expect to see that first notice.  Shouldn't that step come
> a bit later?

This isn't much different from writing

create table meow (id serial, stuff nonsense);

You'll still get the notice before it errors out on type-not-found.

> A bit of weirdness, I'm allowed to specify more than one collation on
> a single column ordering...

> Is this the same principal as casting, where they can be chained?
> Which one wins in this case?

Yeah, last one wins.

> Also, if a locale is installed after initdb, is it then impossible to
> get pg_collate to pick up that new locale?

Currently, you can insert it yourself into pg_collation.  I have a
CREATE COLLATION patch in the works.

> If a locale is somehow
> removed from the system, what happens on the database side when
> attempting to use a collated column?

Then you're hosed, but that has always been the case, with per-cluster
and per-database locales.