Thread: Per-column collation, the finale
I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes. I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION support and associated things now.
Attachment
On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > I've been going over this patch with a fine-tooth comb for the last two > weeks, fixed some small bugs, added comments, made initdb a little > friendlier, but no substantial changes. > > I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION > support and associated things now. Maybe I'm all wet here, but isn't it time to commit what you've got and punt the things that aren't done to 9.2? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Em 14-01-2011 20:47, Robert Haas escreveu: > On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut<peter_e@gmx.net> wrote: >> I've been going over this patch with a fine-tooth comb for the last two >> weeks, fixed some small bugs, added comments, made initdb a little >> friendlier, but no substantial changes. >> >> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION >> support and associated things now. > > Maybe I'm all wet here, but isn't it time to commit what you've got > and punt the things that aren't done to 9.2? > I think Peter want another person to take a look at his patch. I personally would like to eyeball his patch (but it will be during the week). -- Euler Taveira de Oliveira http://www.timbira.com/
On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <peter_e@gmx.net> wrote: > I've been going over this patch with a fine-tooth comb for the last two > weeks, fixed some small bugs, added comments, made initdb a little > friendlier, but no substantial changes. What can I do to test the patch? No regression tests are included in it, and I have an almost empty pg_collation catalog with it: =# SELECT * FROM pg_collation;collname | collnamespace | collencoding | collcollate | collctype ----------+---------------+--------------+-------------+-----------default | 11 | 0 | | (1 row) -- Itagaki Takahiro
On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote: > On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <peter_e@gmx.net> wrote: > > I've been going over this patch with a fine-tooth comb for the last two > > weeks, fixed some small bugs, added comments, made initdb a little > > friendlier, but no substantial changes. > > What can I do to test the patch? > No regression tests are included in it, Please refer to the regress.sgml hunk about running the test. > and I have an almost empty pg_collation catalog with it: > > =# SELECT * FROM pg_collation; > collname | collnamespace | collencoding | collcollate | collctype > ----------+---------------+--------------+-------------+----------- > default | 11 | 0 | | > (1 row) The initdb output should say something about how it got there.
On Wed, Jan 26, 2011 at 12:35:07AM +0200, Peter Eisentraut wrote: > On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote: > > and I have an almost empty pg_collation catalog with it: > > > > =# SELECT * FROM pg_collation; > > collname | collnamespace | collencoding | collcollate | collctype > > ----------+---------------+--------------+-------------+----------- > > default | 11 | 0 | | > > (1 row) > > The initdb output should say something about how it got there. FWIW, I tried and had the same problem. initdb gave: creating collations ... not supported on this platform "configure" was failing to detect locale_t for me, and this fixed it: *** a/configure.in --- b/configure.in *************** *** 1116,1122 **** AC_TYPE_INTPTR_T AC_TYPE_UINTPTR_T AC_TYPE_LONG_LONG_INT ! AC_CHECK_TYPES([locale_t], [#include <locale.h>]) AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred], [],[], --- 1116,1122 ---- AC_TYPE_UINTPTR_T AC_TYPE_LONG_LONG_INT ! AC_CHECK_TYPES([locale_t], [], [], [#include <locale.h>]) AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred],[], [],
Hi Peter, I'm reviewing this patch as part of CommitFest 2011-01. On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote: > I've been going over this patch with a fine-tooth comb for the last two > weeks, fixed some small bugs, added comments, made initdb a little > friendlier, but no substantial changes. > > I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION > support and associated things now. This no longer applies cleanly against git master, so I've done my testing against 52713d0. I tested on Ubtunu 8.04 (GNU libc 2.7) and used the configure.in change I sent earlier[1] to get locale support detected. Since the time to seriously verify every patch hunk would be measured in weeks, I have not attempted that level of review detail. I've just commented on what stood out while reading and using the patch. If you go about supporting non-GNU libc systems, you may want to look at the gnulib locale module[2] for a starting point. The new documentation is helpful. It would be useful to document the implications of marking your user-defined type COLLATABLE. As best I can tell, you should only set COLLATABLE if functions that would be expected to respect collation, particularly members of a B-tree operator family, do check the fmgr collation. Otherwise, the user might specify collations for data of your type and be surprised to get no differentiated behavior. Are there other considerations? Also, should implementers of COLLATABLE types observe any restrictions on when to respect the collation? For example, is equality permitted to be sensitive to collation? On that note, the following is accepted; should it be? CREATE TABLE parent (key text COLLATE "sv_SE"); CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); CREATE TABLE child (keytext COLLATE "tr_TR" REFERENCES parent(key)); Does the system expect any invariants to hold on the platform implementation of collations? For example, that they be consistent with equality in some way? If so, would it be practical and/or helpful to provide a tool for testing a given platform locale for conformance? pg_attribute entries for an index currently retain the attcollation of the table columns. They should have the collation of the index columns. At that point, pg_index.indcollation would exist for optimization only. Though there's no reason it ought to happen with the first commit, it would be really excellent for "make check" to detect when the platform has sufficient support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR locales). Manual execution is fine for something like numeric_big, but this is so platform-sensitive that testing it widely is important. It would likewise be nice to have a way to display the collation of an arbitrary expression. A merger of pg_typeof and format_type would cover that. Four call sites gain code like this (example is from varstr_cmp): +#ifdef HAVE_LOCALE_T + locale_t mylocale = pg_newlocale_from_collation(collid); +#else + check_default_collation(collid); +#endif ... +#ifdef HAVE_LOCALE_T + result = strcoll_l(a1p, a2p, mylocale); +#else result = strcoll(a1p, a2p); +#endif Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation that merely calls check_default_collation, "#define strcoll_l(a, b, c) strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the abstraction would make the mainline code significantly cleaner. Even so, it might not pay off. Thoughts? Couldn't check_default_collation be a no-op except on assert-only builds? It's currently only called in !HAVE_LOCALE_T branches, in which case initdb will not have added non-default collations. CREATE COLLATION will presumably be made to fail unconditionally on such builds, too. wchar2char and char2wchar take a collation argument, but they only use it to "Assert(!lc_ctype_is_c(collation))". Maybe that's best, but it seems weird. The documentation for pg_type.typcollation marks it as a bool, but it's an oid. The documentation also says "An index can only support one collation.", but it's one collation per column. This surprised me; not sure how much can/should be done: [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID"; ERROR: collations are notsupported by type integer [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE"id_ID"; -- worked ... This too: [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; -- worked ... [local] test=# SELECTprosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; -- worked ... [local] test=# SELECTprosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; ERROR: column "pg_proc.prosrc" mustappear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUPBY prosrc COLLATE... You implement the standard's requirement to forbid "CAST(foo AS text COLLATE "id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead. You also reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"". Is that desirable, needed to avoid syntactic ambiguity, or just an unintended consequence? In the absence of other considerations, it seems undesirable to require the parentheses. UNION on disparate collations seems to error or strip them, depending on how it's written: CREATE TABLE t_sv (c text COLLATE "sv_SE"); CREATE TABLE t_tr (c text COLLATE "tr_TR"); CREATE TABLE t_both AS SELECT cFROM t_sv UNION ALL SELECT c FROM t_tr; -- t_both.c has default collation CREATE TABLE t_otherboth AS SELECT ('foo'::text)COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; -- ERROR: 42P21: collation mismatch betweenexplicit collations "sv_SE" and "tr_TR" COLLATE is getting recognized in various places where it does not actually do anything. Probably a few more places need SimpleTypenameWithoutCollation: CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); ALTER OPERATOR || (text COLLATE"id_ID", text) OWNER TO nm; CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; ALTER AGGREGATE min(textcollate "id_ID") OWNER TO nm; CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; PREPARE foo(text collate"id_ID") AS SELECT $1; -- maybe this one does something? didn't verify I ran this benchmark: -- Setup: table of 10M random Unicode characters SELECT setseed(0); CREATE TABLE t (c) AS SELECT chr(1 + (random() * 65534)::int)FROM generate_series(1,10000000) gen(n); -- en_US.UTF8, unpatched: 65.73s -- en_US.UTF8, patched: 74.43s -- id_ID.UTF8, unpatched: 65.77s -- id_ID.UTF8, patched:74.53s SELECT min(c) -- en_US.UTF8, patched: 86.46s -- id_ID.UTF8, patched: 83.70s --SELECT min(c COLLATE "id_ID") FROM ( SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t UNION ALL SELECT c FROM t ) t_all; The 13% slowdown with the feature unused seems worrisome, but the additional 16% slowdown when using the feature to adopt a different locale seems fine. Here is an opreport (entries >1%) from master (actually 52713d0) running the first test SELECT under en_US.UTF8: Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 59288 12.7660 ExecProcNode 44183 9.5136 advance_transition_function 43211 9.3043 slot_deform_tuple 36019 7.7557 ExecProject 33495 7.2122 heapgettup_pagemode 26394 5.6832 varstr_cmp 26316 5.6664 advance_aggregates 23895 5.1451 ExecAgg 14472 3.1161 ExecStoreTuple 12939 2.7860 ExecScan 12750 2.7454 HeapTupleSatisfiesMVCC 12250 2.6377 heapgetpage 11695 2.5182 text_smaller 11559 2.4889 slot_getsomeattrs 10774 2.3199 ExecClearTuple 9092 1.9577 text_cmp 7724 1.6631 SeqNext 7354 1.5835 heap_getnext 7257 1.5626 ExecAppend 6647 1.4312 pg_detoast_datum_packed 6505 1.4007 AllocSetReset 5692 1.2256 XidInMVCCSnapshot Patched, same test case: Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 58488 12.9975 ExecProcNode 45430 10.0957 advance_transition_function 42843 9.5208 slot_deform_tuple 35877 7.9728 ExecProject 32926 7.3170 heapgettup_pagemode 27181 6.0403 varstr_cmp 26342 5.8539 advance_aggregates 22243 4.9430 ExecAgg 14452 3.2116 ExecStoreTuple 13171 2.9269 ExecScan 11708 2.6018 text_smaller 11417 2.5371 slot_getsomeattrs 10758 2.3907 heapgetpage 10710 2.3800 ExecClearTuple 10035 2.2300 HeapTupleSatisfiesMVCC 8998 1.9996 text_cmp 7721 1.7158 SeqNext 7329 1.6287 heap_getnext 7324 1.6276 ExecAppend 6534 1.4520 AllocSetReset 6481 1.4402 pg_detoast_datum_packed 5862 1.3027 XidInMVCCSnapshot 4608 1.0240 MemoryContextReset Nothing really pops out there. pg_dump successfully preserves collation for table columns, index columns, and domains. Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem to have a different concept of collation naming. This patch creates a "default" collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The standard has specific names for specific default collations based on the character repertoire (our server_encoding, I suppose). So, the most-conformant thing would be to set the name of the pg_collation of oid = 100 at CREATE DATABASE time, based on the chosen ENCODING. Note sure how valuable this is. The standard requires that collations be GRANTable objects, but that seems fairly useless in practice. Section 11.4 puts the COLLATE clause as the last part of the column definition. This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work, but it does not. It appears CREATE TABLE gets the COLLATE clause via SimpleTypname, but to support the standard syntax, it would need to arrive via ColConstraintElem or some such. I have not looked at the problems this might bring. A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE = 'foo') COLLATE "id_ID"" (11.24). Also, 11.4 syntax rule 9 seems to forbid the following, which the patch accepts. Seems just as well to part from the spec in this regard, though: CREATE DOMAIN dom AS text; CREATE TABLE t (c dom COLLATE "id_ID"); Overall, this patch implements a useful feature, and I could not find any deep unsoundness in the implementation. There's obviously plenty left to do (multi-platform support, DDL for collations, various TODOs in the patch), but one need not expect to cover all that in one patch. I'd be most concerned about the performance regression when not using the feature. I'm sure I've missed plenty, and even this review probably contains errors. Thanks, nm [1] http://archives.postgresql.org/pgsql-hackers/2011-01/msg02443.php [2] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=m4/locale_h.m4
On Sat, Jan 29, 2011 at 2:52 AM, Noah Misch <noah@leadboat.com> wrote: > The 13% slowdown with the feature unused seems worrisome Very worrisome. This is a frequently-requested feature, but this seems like a potential show-stopper. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote: > I'm reviewing this patch as part of CommitFest 2011-01. Thank you very much for this thorough review. > This no longer applies cleanly against git master, so I've done my testing > against 52713d0. I will send an updated patch soon, when I have figured out the issues discussed below. > If you go about supporting non-GNU libc systems, you may want to look at the > gnulib locale module[2] for a starting point. I don't have a Mac system, but I'll see about figuring this out. > The new documentation is helpful. It would be useful to document the > implications of marking your user-defined type COLLATABLE. As best I can tell, > you should only set COLLATABLE if functions that would be expected to respect > collation, particularly members of a B-tree operator family, do check the fmgr > collation. Otherwise, the user might specify collations for data of your type > and be surprised to get no differentiated behavior. Are there other > considerations? Also, should implementers of COLLATABLE types observe any > restrictions on when to respect the collation? For example, is equality > permitted to be sensitive to collation? I improved the documentation in the CREATE TYPE man page about this. Note, however, that the questions you raise are not new. There already is a set of types that observe locale information in a set of associated functions. The answers are already hidden somewhere; this feature just exposes them in a different way. Over time, we may want to document more of this, but at the moment it's not clear how much to actually expose. > On that note, the following is accepted; should it be? > > CREATE TABLE parent (key text COLLATE "sv_SE"); > CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); > CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key)); > > Does the system expect any invariants to hold on the platform implementation of > collations? For example, that they be consistent with equality in some way? If > so, would it be practical and/or helpful to provide a tool for testing a given > platform locale for conformance? We discussed this previously in this thread. Currently, collation support does not affect equality. Values are only equal if they are bytewise equal. That's why the above is currently OK, but it should probably be restricted in the future. I'll make a note about it. > pg_attribute entries for an index currently retain the attcollation of the table > columns. They should have the collation of the index columns. At that point, > pg_index.indcollation would exist for optimization only. Fixed. > Though there's no reason it ought to happen with the first commit, it would be > really excellent for "make check" to detect when the platform has sufficient > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR > locales). Manual execution is fine for something like numeric_big, but this is > so platform-sensitive that testing it widely is important. I solicited ideas about this a while ago, but no one came up with a solution. If we do get Mac or Windows support, it would be a good time to revisit this and devise a platform-independent approach, if possible. > It would likewise be nice to have a way to display the collation of an arbitrary > expression. A merger of pg_typeof and format_type would cover that. Agreed. I have a follow-up patch pending for that. > Four call sites gain code like this (example is from varstr_cmp): > > +#ifdef HAVE_LOCALE_T > + locale_t mylocale = pg_newlocale_from_collation(collid); > +#else > + check_default_collation(collid); > +#endif > ... > +#ifdef HAVE_LOCALE_T > + result = strcoll_l(a1p, a2p, mylocale); > +#else > result = strcoll(a1p, a2p); > +#endif > > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation > that merely calls check_default_collation, "#define strcoll_l(a, b, c) > strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the > abstraction would make the mainline code significantly cleaner. Even so, it > might not pay off. Thoughts? I had considered that, but I'm mainly hesitant about introducing a fake locale_t type into the public namespace. Maybe it should be wrapped into a pg_locale_t; then we can do whatever we want. > Couldn't check_default_collation be a no-op except on assert-only builds? It's > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not > have added non-default collations. CREATE COLLATION will presumably be made to > fail unconditionally on such builds, too. Unless you register the HAVE_LOCALE_T state in pg_control, you need to be prepared to deal with schemas that contain nondefault collations even though the server binary doesn't support it. > wchar2char and char2wchar take a collation argument, but they only use it to > "Assert(!lc_ctype_is_c(collation))". Maybe that's best, but it seems weird. Yeah, it's weird, but that is the existing interface. > The documentation for pg_type.typcollation marks it as a bool, but it's an oid. > The documentation also says "An index can only support one collation.", but it's > one collation per column. Fixed, leftover from previous code. > This surprised me; not sure how much can/should be done: > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID"; > ERROR: collations are not supported by type integer > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > -- worked ... Yeah, this could be fixed. I'll make a note about this. Note that this is actually SQL92 syntax, removed in SQL99, so it's not essential. > This too: > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; > -- worked ... > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > -- worked ... > [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function > LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE... This is correct, I think. At least if you want to keep open the possibility that a collation could also determine equality. > You implement the standard's requirement to forbid "CAST(foo AS text COLLATE > "id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead. You also > reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"". > Is that desirable, needed to avoid syntactic ambiguity, or just an unintended > consequence? In the absence of other considerations, it seems undesirable to > require the parentheses. This is just how the parser interprets it. The :: syntax is full of odd ambiguities like that. > UNION on disparate collations seems to error or strip them, depending on how > it's written: > > CREATE TABLE t_sv (c text COLLATE "sv_SE"); > CREATE TABLE t_tr (c text COLLATE "tr_TR"); > CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr; > -- t_both.c has default collation > CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; > -- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR" This is correct per SQL standard. In the second case, the explicit COLLATE clauses cause a conflict, so you get an error. In the first case, you have implicit collation derivations that conflict, so the result has actually no collation (not default collation), which means you cannot sort the resulting value at all. It's actually a bit strange here that we allow the creation of a table based on a query that has a no collation derivation, but I did not see anything in the SQL standard that prohibits it. Maybe a notice would be useful, like when you create a column of type "unknown". > COLLATE is getting recognized in various places where it does not actually do > anything. Probably a few more places need SimpleTypenameWithoutCollation: > > CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); > ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm; > CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; > ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm; > CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1'; > CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; All of these places also accept useless typmods. The parser is currently pretty loose about that. It would need some major restructuring to fix that. > PREPARE foo(text collate "id_ID") AS SELECT $1; -- maybe this one does something? didn't verify It doesn't, but it probably could. I'll check it out. > The 13% slowdown with the feature unused seems worrisome, but the additional 16% > slowdown when using the feature to adopt a different locale seems fine. I figured out this problem. It needs some small restructuring in varstr_cmp to fix. It kind of relates to how the pg_newlocale_from_collation/check_default_collation restructuring is done. But it's definitely fixable. Well, my numbers were actually smaller than 13%, but it was noticeable. I'll give you something new to test later. > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem > to have a different concept of collation naming. This patch creates a "default" > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The > standard has specific names for specific default collations based on the > character repertoire (our server_encoding, I suppose). So, the most-conformant > thing would be to set the name of the pg_collation of oid = 100 at CREATE > DATABASE time, based on the chosen ENCODING. Note sure how valuable this is. The problem is, you can't reach across databases when you create a new one. This is really the root of all evil with this default collation business. > The standard requires that collations be GRANTable objects, but that seems > fairly useless in practice. Yeah, omitted for now. > Section 11.4 puts the COLLATE clause as the last part of the column definition. > This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work, > but it does not. It appears CREATE TABLE gets the COLLATE clause via > SimpleTypname, but to support the standard syntax, it would need to arrive via > ColConstraintElem or some such. I have not looked at the problems this might > bring. A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE = > 'foo') COLLATE "id_ID"" (11.24). This is only under the optional feature F692 "Extended collation support", which I chose to not bother with, because it doesn't provide any additional functionality. (We could consider it if it were required for compatibility with some other implementation.) > Also, 11.4 syntax rule 9 seems to forbid the > following, which the patch accepts. Seems just as well to part from the spec in > this regard, though: > > CREATE DOMAIN dom AS text; > CREATE TABLE t (c dom COLLATE "id_ID"); Good catch. Doesn't seem useful to restrict this, though. This is good stuff. I'll send you a new patch in a day or three for perhaps another round of performance tests. Some of the other issues above can perhaps be postponed for follow-up patches.
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote: > On l??r, 2011-01-29 at 02:52 -0500, Noah Misch wrote: > > The new documentation is helpful. It would be useful to document the > > implications of marking your user-defined type COLLATABLE. As best I can tell, > > you should only set COLLATABLE if functions that would be expected to respect > > collation, particularly members of a B-tree operator family, do check the fmgr > > collation. Otherwise, the user might specify collations for data of your type > > and be surprised to get no differentiated behavior. Are there other > > considerations? Also, should implementers of COLLATABLE types observe any > > restrictions on when to respect the collation? For example, is equality > > permitted to be sensitive to collation? > > I improved the documentation in the CREATE TYPE man page about this. > Note, however, that the questions you raise are not new. There already > is a set of types that observe locale information in a set of associated > functions. The answers are already hidden somewhere; this feature just > exposes them in a different way. Over time, we may want to document > more of this, but at the moment it's not clear how much to actually > expose. That's a helpful way to think about it. > > On that note, the following is accepted; should it be? > > > > CREATE TABLE parent (key text COLLATE "sv_SE"); > > CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); > > CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key)); > > > > Does the system expect any invariants to hold on the platform implementation of > > collations? For example, that they be consistent with equality in some way? If > > so, would it be practical and/or helpful to provide a tool for testing a given > > platform locale for conformance? > > We discussed this previously in this thread. Currently, collation > support does not affect equality. Values are only equal if they are > bytewise equal. That's why the above is currently OK, but it should > probably be restricted in the future. I'll make a note about it. I see that this is true for texteq, bpchareq and citext_eq. However, nothing stops a user from creating a COLLATABLE type and making the "=" operator for its B-tree operator class sensitive to collation, right? Could that break any system assumptions? That is, do we need to document that user-defined hash and B-tree operator classes must ignore locale in their "=" operators? > > Though there's no reason it ought to happen with the first commit, it would be > > really excellent for "make check" to detect when the platform has sufficient > > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR > > locales). Manual execution is fine for something like numeric_big, but this is > > so platform-sensitive that testing it widely is important. > > I solicited ideas about this a while ago, but no one came up with a > solution. If we do get Mac or Windows support, it would be a good time > to revisit this and devise a platform-independent approach, if possible. As a rough idea, we could introduce the notion of a "skip condition" SQL command for a pg_regress test file. When the command throws an error, pg_regress skips that file. For these tests, the skip condition would resemble: SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR" > > Four call sites gain code like this (example is from varstr_cmp): > > > > +#ifdef HAVE_LOCALE_T > > + locale_t mylocale = pg_newlocale_from_collation(collid); > > +#else > > + check_default_collation(collid); > > +#endif > > ... > > +#ifdef HAVE_LOCALE_T > > + result = strcoll_l(a1p, a2p, mylocale); > > +#else > > result = strcoll(a1p, a2p); > > +#endif > > > > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation > > that merely calls check_default_collation, "#define strcoll_l(a, b, c) > > strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the > > abstraction would make the mainline code significantly cleaner. Even so, it > > might not pay off. Thoughts? > > I had considered that, but I'm mainly hesitant about introducing a fake > locale_t type into the public namespace. Maybe it should be wrapped > into a pg_locale_t; then we can do whatever we want. True. There are advantages and disadvantages to both. No strong opinion here. > > Couldn't check_default_collation be a no-op except on assert-only builds? It's > > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not > > have added non-default collations. CREATE COLLATION will presumably be made to > > fail unconditionally on such builds, too. > > Unless you register the HAVE_LOCALE_T state in pg_control, you need to > be prepared to deal with schemas that contain nondefault collations even > though the server binary doesn't support it. Makes sense. > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; > > -- worked ... > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > > -- worked ... > > [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > > ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function > > LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE... > > This is correct, I think. At least if you want to keep open the > possibility that a collation could also determine equality. Yes; I agree, now. Replace "COLLATE "id_ID"" with "||''", and you get the same. > > UNION on disparate collations seems to error or strip them, depending on how > > it's written: > > > > CREATE TABLE t_sv (c text COLLATE "sv_SE"); > > CREATE TABLE t_tr (c text COLLATE "tr_TR"); > > CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr; > > -- t_both.c has default collation > > CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; > > -- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR" > > This is correct per SQL standard. In the second case, the explicit > COLLATE clauses cause a conflict, so you get an error. In the first > case, you have implicit collation derivations that conflict, so the > result has actually no collation (not default collation), which means > you cannot sort the resulting value at all. It's actually a bit strange > here that we allow the creation of a table based on a query that has a > no collation derivation, but I did not see anything in the SQL standard > that prohibits it. Maybe a notice would be useful, like when you create > a column of type "unknown". After rereading about collation derivation several times, I think I see why it's correct. MS SQL Server has the same behaviors, too. (The exception I found: SQL Server rejects the first UNION ALL above. Both implementations reject it with UNION ALL changed to UNION.) > > COLLATE is getting recognized in various places where it does not actually do > > anything. Probably a few more places need SimpleTypenameWithoutCollation: > > > > CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); > > ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm; > > CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; > > ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm; > > CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1'; > > CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; > > All of these places also accept useless typmods. The parser is > currently pretty loose about that. It would need some major > restructuring to fix that. Okay. Peripheral to this patch, then. > > The 13% slowdown with the feature unused seems worrisome, but the additional 16% > > slowdown when using the feature to adopt a different locale seems fine. > > I figured out this problem. It needs some small restructuring in > varstr_cmp to fix. It kind of relates to how the > pg_newlocale_from_collation/check_default_collation restructuring is > done. But it's definitely fixable. > > Well, my numbers were actually smaller than 13%, but it was noticeable. > I'll give you something new to test later. Great. > > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem > > to have a different concept of collation naming. This patch creates a "default" > > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The > > standard has specific names for specific default collations based on the > > character repertoire (our server_encoding, I suppose). So, the most-conformant > > thing would be to set the name of the pg_collation of oid = 100 at CREATE > > DATABASE time, based on the chosen ENCODING. Note sure how valuable this is. > > The problem is, you can't reach across databases when you create a new > one. This is really the root of all evil with this default collation > business. Understood. Doesn't seem important to jump through hoops for this. > This is good stuff. I'll send you a new patch in a day or three for > perhaps another round of performance tests. Some of the other issues > above can perhaps be postponed for follow-up patches. I agree -- if the performance-when-unused gets solid, none of my other comments ought to hold things up. Thanks, nm
On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote: > > This is good stuff. I'll send you a new patch in a day or three for > > perhaps another round of performance tests. Some of the other > issues > > above can perhaps be postponed for follow-up patches. > > I agree -- if the performance-when-unused gets solid, none of my other > comments ought to hold things up. Here is a new patch. The main change is in varstr_cmp(), avoiding the calls to pg_newlocale_from_collation() when the default locale is used. This accounts for the performance regression in my tests. It also addresses some of your refactoring ideas.
Attachment
On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote: > On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote: > > > This is good stuff. I'll send you a new patch in a day or three for > > > perhaps another round of performance tests. Some of the other > > issues > > > above can perhaps be postponed for follow-up patches. > > > > I agree -- if the performance-when-unused gets solid, none of my other > > comments ought to hold things up. > > Here is a new patch. > > The main change is in varstr_cmp(), avoiding the calls to > pg_newlocale_from_collation() when the default locale is used. This > accounts for the performance regression in my tests. It also addresses > some of your refactoring ideas. Looks good and tests well. I've attached the same benchmark script with updated timings, and I've marked the patch Ready for Committer. nm
Attachment
Noah Misch wrote: > On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote: > > On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote: > > > > This is good stuff. I'll send you a new patch in a day or three for > > > > perhaps another round of performance tests. Some of the other > > > issues > > > > above can perhaps be postponed for follow-up patches. > > > > > > I agree -- if the performance-when-unused gets solid, none of my other > > > comments ought to hold things up. > > > > Here is a new patch. > > > > The main change is in varstr_cmp(), avoiding the calls to > > pg_newlocale_from_collation() when the default locale is used. This > > accounts for the performance regression in my tests. It also addresses > > some of your refactoring ideas. > > Looks good and tests well. I've attached the same benchmark script with updated > timings, and I've marked the patch Ready for Committer. Nice to see that performance hit is removed now! -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote: > Looks good and tests well. I've attached the same benchmark script > with updated timings, and I've marked the patch Ready for Committer. Committed. Thanks everyone.
On 8 February 2011 21:08, Peter Eisentraut <peter_e@gmx.net> wrote: > On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote: >> Looks good and tests well. I've attached the same benchmark script >> with updated timings, and I've marked the patch Ready for Committer. > > Committed. Thanks everyone. Awesome work Peter. Few questions: postgres=# create table meow (id serial, stuff text collate "de_XX"); NOTICE: CREATE TABLE will create implicit sequence "meow_id_seq" for serial column "meow.id" ERROR: collation "de_XX" for current database encoding "UTF8" does not exist LINE 1: create table meow (id serial, stuff text collate "de_XX"); I wouldn't expect to see that first notice. Shouldn't that step come a bit later? A bit of weirdness, I'm allowed to specify more than one collation on a single column ordering... postgres=# select * from meow order by stuff collate "en_GB" collate "de_DE" desc;id | stuff ----+------- 2 | meow2 1 | meow (2 rows) Is this the same principal as casting, where they can be chained? Which one wins in this case? Although if anyone is actually doing this, then it's just silly anyway. (says Thom having just done it) Also, if a locale is installed after initdb, is it then impossible to get pg_collate to pick up that new locale? If a locale is somehow removed from the system, what happens on the database side when attempting to use a collated column? (I don't wish to TIAS on my own system) -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935
On tis, 2011-02-08 at 22:17 +0000, Thom Brown wrote: > postgres=# create table meow (id serial, stuff text collate "de_XX"); > NOTICE: CREATE TABLE will create implicit sequence "meow_id_seq" for > serial column "meow.id" > ERROR: collation "de_XX" for current database encoding "UTF8" does not exist > LINE 1: create table meow (id serial, stuff text collate "de_XX"); > > I wouldn't expect to see that first notice. Shouldn't that step come > a bit later? This isn't much different from writing create table meow (id serial, stuff nonsense); You'll still get the notice before it errors out on type-not-found. > A bit of weirdness, I'm allowed to specify more than one collation on > a single column ordering... > Is this the same principal as casting, where they can be chained? > Which one wins in this case? Yeah, last one wins. > Also, if a locale is installed after initdb, is it then impossible to > get pg_collate to pick up that new locale? Currently, you can insert it yourself into pg_collation. I have a CREATE COLLATION patch in the works. > If a locale is somehow > removed from the system, what happens on the database side when > attempting to use a collated column? Then you're hosed, but that has always been the case, with per-cluster and per-database locales.