Re: Per-column collation, the finale - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Per-column collation, the finale
Date
Msg-id 20110129075253.GA18784@tornado.leadboat.com
Whole thread Raw
In response to Per-column collation, the finale  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: Per-column collation, the finale
Re: Per-column collation, the finale
List pgsql-hackers
Hi Peter,

I'm reviewing this patch as part of CommitFest 2011-01.

On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote:
> I've been going over this patch with a fine-tooth comb for the last two
> weeks, fixed some small bugs, added comments, made initdb a little
> friendlier, but no substantial changes.
> 
> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
> support and associated things now.

This no longer applies cleanly against git master, so I've done my testing
against 52713d0.  I tested on Ubtunu 8.04 (GNU libc 2.7) and used the
configure.in change I sent earlier[1] to get locale support detected.  Since the
time to seriously verify every patch hunk would be measured in weeks, I have not
attempted that level of review detail.  I've just commented on what stood out
while reading and using the patch.

If you go about supporting non-GNU libc systems, you may want to look at the
gnulib locale module[2] for a starting point.

The new documentation is helpful.  It would be useful to document the
implications of marking your user-defined type COLLATABLE.  As best I can tell,
you should only set COLLATABLE if functions that would be expected to respect
collation, particularly members of a B-tree operator family, do check the fmgr
collation.  Otherwise, the user might specify collations for data of your type
and be surprised to get no differentiated behavior.  Are there other
considerations?  Also, should implementers of COLLATABLE types observe any
restrictions on when to respect the collation?  For example, is equality
permitted to be sensitive to collation?  On that note, the following is
accepted; should it be?
 CREATE TABLE parent (key text COLLATE "sv_SE"); CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); CREATE TABLE child
(keytext COLLATE "tr_TR" REFERENCES parent(key));
 

Does the system expect any invariants to hold on the platform implementation of
collations?  For example, that they be consistent with equality in some way?  If
so, would it be practical and/or helpful to provide a tool for testing a given
platform locale for conformance?

pg_attribute entries for an index currently retain the attcollation of the table
columns.  They should have the collation of the index columns.  At that point,
pg_index.indcollation would exist for optimization only.

Though there's no reason it ought to happen with the first commit, it would be
really excellent for "make check" to detect when the platform has sufficient
support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
locales).  Manual execution is fine for something like numeric_big, but this is
so platform-sensitive that testing it widely is important.

It would likewise be nice to have a way to display the collation of an arbitrary
expression.  A merger of pg_typeof and format_type would cover that.

Four call sites gain code like this (example is from varstr_cmp):

+#ifdef HAVE_LOCALE_T
+        locale_t    mylocale = pg_newlocale_from_collation(collid);
+#else
+        check_default_collation(collid);
+#endif
...
+#ifdef HAVE_LOCALE_T
+        result = strcoll_l(a1p, a2p, mylocale);
+#else        result = strcoll(a1p, a2p);
+#endif

Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
that merely calls check_default_collation, "#define strcoll_l(a, b, c)
strcoll(a, b)", etc.  I can see six TODO sites with similar needs, and the
abstraction would make the mainline code significantly cleaner.  Even so, it
might not pay off.  Thoughts?

Couldn't check_default_collation be a no-op except on assert-only builds?  It's
currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
have added non-default collations.  CREATE COLLATION will presumably be made to
fail unconditionally on such builds, too.

wchar2char and char2wchar take a collation argument, but they only use it to
"Assert(!lc_ctype_is_c(collation))".  Maybe that's best, but it seems weird.

The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
The documentation also says "An index can only support one collation.", but it's
one collation per column.

This surprised me; not sure how much can/should be done:
 [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID"; ERROR:  collations are
notsupported by type integer [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc
COLLATE"id_ID"; -- worked ...
 

This too:
 [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; -- worked ... [local] test=#
SELECTprosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; -- worked ... [local] test=#
SELECTprosrc,                 count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; ERROR:  column "pg_proc.prosrc"
mustappear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT prosrc, count(*) FROM pg_proc
GROUPBY prosrc COLLATE...
 

You implement the standard's requirement to forbid "CAST(foo AS text COLLATE
"id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead.  You also
reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"".
Is that desirable, needed to avoid syntactic ambiguity, or just an unintended
consequence?  In the absence of other considerations, it seems undesirable to
require the parentheses.

UNION on disparate collations seems to error or strip them, depending on how
it's written:
 CREATE TABLE t_sv (c text COLLATE "sv_SE"); CREATE TABLE t_tr (c text COLLATE "tr_TR"); CREATE TABLE t_both AS SELECT
cFROM t_sv UNION ALL SELECT c FROM t_tr; -- t_both.c has default collation CREATE TABLE t_otherboth AS SELECT
('foo'::text)COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; -- ERROR:  42P21: collation mismatch
betweenexplicit collations "sv_SE" and "tr_TR"
 

COLLATE is getting recognized in various places where it does not actually do
anything.  Probably a few more places need SimpleTypenameWithoutCollation:
 CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); ALTER OPERATOR || (text
COLLATE"id_ID", text) OWNER TO nm; CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; ALTER AGGREGATE
min(textcollate "id_ID") OWNER TO nm; CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT
1';CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; PREPARE foo(text
collate"id_ID") AS SELECT $1; -- maybe this one does something?  didn't verify
 


I ran this benchmark:
 -- Setup: table of 10M random Unicode characters SELECT setseed(0); CREATE TABLE t (c) AS SELECT chr(1 + (random() *
65534)::int)FROM generate_series(1,10000000) gen(n);
 
 -- en_US.UTF8, unpatched: 65.73s -- en_US.UTF8,   patched: 74.43s -- id_ID.UTF8, unpatched: 65.77s -- id_ID.UTF8,
patched:74.53s SELECT min(c)
 
 -- en_US.UTF8,   patched: 86.46s -- id_ID.UTF8,   patched: 83.70s --SELECT min(c COLLATE "id_ID")
 FROM (     SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t
 UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t     UNION ALL SELECT c FROM t )
t_all;

The 13% slowdown with the feature unused seems worrisome, but the additional 16%
slowdown when using the feature to adopt a different locale seems fine.  Here is
an opreport (entries >1%) from master (actually 52713d0) running the first test
SELECT under en_US.UTF8:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
59288    12.7660  ExecProcNode
44183     9.5136  advance_transition_function
43211     9.3043  slot_deform_tuple
36019     7.7557  ExecProject
33495     7.2122  heapgettup_pagemode
26394     5.6832  varstr_cmp
26316     5.6664  advance_aggregates
23895     5.1451  ExecAgg
14472     3.1161  ExecStoreTuple
12939     2.7860  ExecScan
12750     2.7454  HeapTupleSatisfiesMVCC
12250     2.6377  heapgetpage
11695     2.5182  text_smaller
11559     2.4889  slot_getsomeattrs
10774     2.3199  ExecClearTuple
9092      1.9577  text_cmp
7724      1.6631  SeqNext
7354      1.5835  heap_getnext
7257      1.5626  ExecAppend
6647      1.4312  pg_detoast_datum_packed
6505      1.4007  AllocSetReset
5692      1.2256  XidInMVCCSnapshot

Patched, same test case:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
58488    12.9975  ExecProcNode
45430    10.0957  advance_transition_function
42843     9.5208  slot_deform_tuple
35877     7.9728  ExecProject
32926     7.3170  heapgettup_pagemode
27181     6.0403  varstr_cmp
26342     5.8539  advance_aggregates
22243     4.9430  ExecAgg
14452     3.2116  ExecStoreTuple
13171     2.9269  ExecScan
11708     2.6018  text_smaller
11417     2.5371  slot_getsomeattrs
10758     2.3907  heapgetpage
10710     2.3800  ExecClearTuple
10035     2.2300  HeapTupleSatisfiesMVCC
8998      1.9996  text_cmp
7721      1.7158  SeqNext
7329      1.6287  heap_getnext
7324      1.6276  ExecAppend
6534      1.4520  AllocSetReset
6481      1.4402  pg_detoast_datum_packed
5862      1.3027  XidInMVCCSnapshot
4608      1.0240  MemoryContextReset

Nothing really pops out there.


pg_dump successfully preserves collation for table columns, index columns, and
domains.

Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
to have a different concept of collation naming.  This patch creates a "default"
collation that is implicitly the LC_COLLATE/LC_CTYPE of the database.  The
standard has specific names for specific default collations based on the
character repertoire (our server_encoding, I suppose).  So, the most-conformant
thing would be to set the name of the pg_collation of oid = 100 at CREATE
DATABASE time, based on the chosen ENCODING.  Note sure how valuable this is.

The standard requires that collations be GRANTable objects, but that seems
fairly useless in practice.

Section 11.4 puts the COLLATE clause as the last part of the column definition.
This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work,
but it does not.  It appears CREATE TABLE gets the COLLATE clause via
SimpleTypname, but to support the standard syntax, it would need to arrive via
ColConstraintElem or some such.  I have not looked at the problems this might
bring.  A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE =
'foo') COLLATE "id_ID"" (11.24).  Also, 11.4 syntax rule 9 seems to forbid the
following, which the patch accepts.  Seems just as well to part from the spec in
this regard, though:
 CREATE DOMAIN dom AS text; CREATE TABLE t (c dom COLLATE "id_ID");

Overall, this patch implements a useful feature, and I could not find any deep
unsoundness in the implementation.  There's obviously plenty left to do
(multi-platform support, DDL for collations, various TODOs in the patch), but
one need not expect to cover all that in one patch.  I'd be most concerned about
the performance regression when not using the feature.

I'm sure I've missed plenty, and even this review probably contains errors.

Thanks,
nm

[1] http://archives.postgresql.org/pgsql-hackers/2011-01/msg02443.php
[2] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=m4/locale_h.m4


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Include WAL in base backup
Next
From: Noah Misch
Date:
Subject: Re: ALTER TYPE 3: add facility to identify further no-work cases