Re: Per-column collation, the finale - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Per-column collation, the finale |
Date | |
Msg-id | 20110129075253.GA18784@tornado.leadboat.com Whole thread Raw |
In response to | Per-column collation, the finale (Peter Eisentraut <peter_e@gmx.net>) |
Responses |
Re: Per-column collation, the finale
Re: Per-column collation, the finale |
List | pgsql-hackers |
Hi Peter, I'm reviewing this patch as part of CommitFest 2011-01. On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote: > I've been going over this patch with a fine-tooth comb for the last two > weeks, fixed some small bugs, added comments, made initdb a little > friendlier, but no substantial changes. > > I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION > support and associated things now. This no longer applies cleanly against git master, so I've done my testing against 52713d0. I tested on Ubtunu 8.04 (GNU libc 2.7) and used the configure.in change I sent earlier[1] to get locale support detected. Since the time to seriously verify every patch hunk would be measured in weeks, I have not attempted that level of review detail. I've just commented on what stood out while reading and using the patch. If you go about supporting non-GNU libc systems, you may want to look at the gnulib locale module[2] for a starting point. The new documentation is helpful. It would be useful to document the implications of marking your user-defined type COLLATABLE. As best I can tell, you should only set COLLATABLE if functions that would be expected to respect collation, particularly members of a B-tree operator family, do check the fmgr collation. Otherwise, the user might specify collations for data of your type and be surprised to get no differentiated behavior. Are there other considerations? Also, should implementers of COLLATABLE types observe any restrictions on when to respect the collation? For example, is equality permitted to be sensitive to collation? On that note, the following is accepted; should it be? CREATE TABLE parent (key text COLLATE "sv_SE"); CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); CREATE TABLE child (keytext COLLATE "tr_TR" REFERENCES parent(key)); Does the system expect any invariants to hold on the platform implementation of collations? For example, that they be consistent with equality in some way? If so, would it be practical and/or helpful to provide a tool for testing a given platform locale for conformance? pg_attribute entries for an index currently retain the attcollation of the table columns. They should have the collation of the index columns. At that point, pg_index.indcollation would exist for optimization only. Though there's no reason it ought to happen with the first commit, it would be really excellent for "make check" to detect when the platform has sufficient support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR locales). Manual execution is fine for something like numeric_big, but this is so platform-sensitive that testing it widely is important. It would likewise be nice to have a way to display the collation of an arbitrary expression. A merger of pg_typeof and format_type would cover that. Four call sites gain code like this (example is from varstr_cmp): +#ifdef HAVE_LOCALE_T + locale_t mylocale = pg_newlocale_from_collation(collid); +#else + check_default_collation(collid); +#endif ... +#ifdef HAVE_LOCALE_T + result = strcoll_l(a1p, a2p, mylocale); +#else result = strcoll(a1p, a2p); +#endif Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation that merely calls check_default_collation, "#define strcoll_l(a, b, c) strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the abstraction would make the mainline code significantly cleaner. Even so, it might not pay off. Thoughts? Couldn't check_default_collation be a no-op except on assert-only builds? It's currently only called in !HAVE_LOCALE_T branches, in which case initdb will not have added non-default collations. CREATE COLLATION will presumably be made to fail unconditionally on such builds, too. wchar2char and char2wchar take a collation argument, but they only use it to "Assert(!lc_ctype_is_c(collation))". Maybe that's best, but it seems weird. The documentation for pg_type.typcollation marks it as a bool, but it's an oid. The documentation also says "An index can only support one collation.", but it's one collation per column. This surprised me; not sure how much can/should be done: [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID"; ERROR: collations are notsupported by type integer [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE"id_ID"; -- worked ... This too: [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; -- worked ... [local] test=# SELECTprosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; -- worked ... [local] test=# SELECTprosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; ERROR: column "pg_proc.prosrc" mustappear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUPBY prosrc COLLATE... You implement the standard's requirement to forbid "CAST(foo AS text COLLATE "id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead. You also reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"". Is that desirable, needed to avoid syntactic ambiguity, or just an unintended consequence? In the absence of other considerations, it seems undesirable to require the parentheses. UNION on disparate collations seems to error or strip them, depending on how it's written: CREATE TABLE t_sv (c text COLLATE "sv_SE"); CREATE TABLE t_tr (c text COLLATE "tr_TR"); CREATE TABLE t_both AS SELECT cFROM t_sv UNION ALL SELECT c FROM t_tr; -- t_both.c has default collation CREATE TABLE t_otherboth AS SELECT ('foo'::text)COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; -- ERROR: 42P21: collation mismatch betweenexplicit collations "sv_SE" and "tr_TR" COLLATE is getting recognized in various places where it does not actually do anything. Probably a few more places need SimpleTypenameWithoutCollation: CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); ALTER OPERATOR || (text COLLATE"id_ID", text) OWNER TO nm; CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; ALTER AGGREGATE min(textcollate "id_ID") OWNER TO nm; CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; PREPARE foo(text collate"id_ID") AS SELECT $1; -- maybe this one does something? didn't verify I ran this benchmark: -- Setup: table of 10M random Unicode characters SELECT setseed(0); CREATE TABLE t (c) AS SELECT chr(1 + (random() * 65534)::int)FROM generate_series(1,10000000) gen(n); -- en_US.UTF8, unpatched: 65.73s -- en_US.UTF8, patched: 74.43s -- id_ID.UTF8, unpatched: 65.77s -- id_ID.UTF8, patched:74.53s SELECT min(c) -- en_US.UTF8, patched: 86.46s -- id_ID.UTF8, patched: 83.70s --SELECT min(c COLLATE "id_ID") FROM ( SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t ) t_all; The 13% slowdown with the feature unused seems worrisome, but the additional 16% slowdown when using the feature to adopt a different locale seems fine. Here is an opreport (entries >1%) from master (actually 52713d0) running the first test SELECT under en_US.UTF8: Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 59288 12.7660 ExecProcNode 44183 9.5136 advance_transition_function 43211 9.3043 slot_deform_tuple 36019 7.7557 ExecProject 33495 7.2122 heapgettup_pagemode 26394 5.6832 varstr_cmp 26316 5.6664 advance_aggregates 23895 5.1451 ExecAgg 14472 3.1161 ExecStoreTuple 12939 2.7860 ExecScan 12750 2.7454 HeapTupleSatisfiesMVCC 12250 2.6377 heapgetpage 11695 2.5182 text_smaller 11559 2.4889 slot_getsomeattrs 10774 2.3199 ExecClearTuple 9092 1.9577 text_cmp 7724 1.6631 SeqNext 7354 1.5835 heap_getnext 7257 1.5626 ExecAppend 6647 1.4312 pg_detoast_datum_packed 6505 1.4007 AllocSetReset 5692 1.2256 XidInMVCCSnapshot Patched, same test case: Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 58488 12.9975 ExecProcNode 45430 10.0957 advance_transition_function 42843 9.5208 slot_deform_tuple 35877 7.9728 ExecProject 32926 7.3170 heapgettup_pagemode 27181 6.0403 varstr_cmp 26342 5.8539 advance_aggregates 22243 4.9430 ExecAgg 14452 3.2116 ExecStoreTuple 13171 2.9269 ExecScan 11708 2.6018 text_smaller 11417 2.5371 slot_getsomeattrs 10758 2.3907 heapgetpage 10710 2.3800 ExecClearTuple 10035 2.2300 HeapTupleSatisfiesMVCC 8998 1.9996 text_cmp 7721 1.7158 SeqNext 7329 1.6287 heap_getnext 7324 1.6276 ExecAppend 6534 1.4520 AllocSetReset 6481 1.4402 pg_detoast_datum_packed 5862 1.3027 XidInMVCCSnapshot 4608 1.0240 MemoryContextReset Nothing really pops out there. pg_dump successfully preserves collation for table columns, index columns, and domains. Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem to have a different concept of collation naming. This patch creates a "default" collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The standard has specific names for specific default collations based on the character repertoire (our server_encoding, I suppose). So, the most-conformant thing would be to set the name of the pg_collation of oid = 100 at CREATE DATABASE time, based on the chosen ENCODING. Note sure how valuable this is. The standard requires that collations be GRANTable objects, but that seems fairly useless in practice. Section 11.4 puts the COLLATE clause as the last part of the column definition. This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work, but it does not. It appears CREATE TABLE gets the COLLATE clause via SimpleTypname, but to support the standard syntax, it would need to arrive via ColConstraintElem or some such. I have not looked at the problems this might bring. A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE = 'foo') COLLATE "id_ID"" (11.24). Also, 11.4 syntax rule 9 seems to forbid the following, which the patch accepts. Seems just as well to part from the spec in this regard, though: CREATE DOMAIN dom AS text; CREATE TABLE t (c dom COLLATE "id_ID"); Overall, this patch implements a useful feature, and I could not find any deep unsoundness in the implementation. There's obviously plenty left to do (multi-platform support, DDL for collations, various TODOs in the patch), but one need not expect to cover all that in one patch. I'd be most concerned about the performance regression when not using the feature. I'm sure I've missed plenty, and even this review probably contains errors. Thanks, nm [1] http://archives.postgresql.org/pgsql-hackers/2011-01/msg02443.php [2] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=m4/locale_h.m4
pgsql-hackers by date: