Re: Per-column collation, the finale - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: Per-column collation, the finale |
Date | |
Msg-id | 20110203051053.GA713@tornado.gateway.2wire.net Whole thread Raw |
In response to | Re: Per-column collation, the finale (Peter Eisentraut <peter_e@gmx.net>) |
Responses |
Re: Per-column collation, the finale
|
List | pgsql-hackers |
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote: > On l??r, 2011-01-29 at 02:52 -0500, Noah Misch wrote: > > The new documentation is helpful. It would be useful to document the > > implications of marking your user-defined type COLLATABLE. As best I can tell, > > you should only set COLLATABLE if functions that would be expected to respect > > collation, particularly members of a B-tree operator family, do check the fmgr > > collation. Otherwise, the user might specify collations for data of your type > > and be surprised to get no differentiated behavior. Are there other > > considerations? Also, should implementers of COLLATABLE types observe any > > restrictions on when to respect the collation? For example, is equality > > permitted to be sensitive to collation? > > I improved the documentation in the CREATE TYPE man page about this. > Note, however, that the questions you raise are not new. There already > is a set of types that observe locale information in a set of associated > functions. The answers are already hidden somewhere; this feature just > exposes them in a different way. Over time, we may want to document > more of this, but at the moment it's not clear how much to actually > expose. That's a helpful way to think about it. > > On that note, the following is accepted; should it be? > > > > CREATE TABLE parent (key text COLLATE "sv_SE"); > > CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID"); > > CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key)); > > > > Does the system expect any invariants to hold on the platform implementation of > > collations? For example, that they be consistent with equality in some way? If > > so, would it be practical and/or helpful to provide a tool for testing a given > > platform locale for conformance? > > We discussed this previously in this thread. Currently, collation > support does not affect equality. Values are only equal if they are > bytewise equal. That's why the above is currently OK, but it should > probably be restricted in the future. I'll make a note about it. I see that this is true for texteq, bpchareq and citext_eq. However, nothing stops a user from creating a COLLATABLE type and making the "=" operator for its B-tree operator class sensitive to collation, right? Could that break any system assumptions? That is, do we need to document that user-defined hash and B-tree operator classes must ignore locale in their "=" operators? > > Though there's no reason it ought to happen with the first commit, it would be > > really excellent for "make check" to detect when the platform has sufficient > > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR > > locales). Manual execution is fine for something like numeric_big, but this is > > so platform-sensitive that testing it widely is important. > > I solicited ideas about this a while ago, but no one came up with a > solution. If we do get Mac or Windows support, it would be a good time > to revisit this and devise a platform-independent approach, if possible. As a rough idea, we could introduce the notion of a "skip condition" SQL command for a pg_regress test file. When the command throws an error, pg_regress skips that file. For these tests, the skip condition would resemble: SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR" > > Four call sites gain code like this (example is from varstr_cmp): > > > > +#ifdef HAVE_LOCALE_T > > + locale_t mylocale = pg_newlocale_from_collation(collid); > > +#else > > + check_default_collation(collid); > > +#endif > > ... > > +#ifdef HAVE_LOCALE_T > > + result = strcoll_l(a1p, a2p, mylocale); > > +#else > > result = strcoll(a1p, a2p); > > +#endif > > > > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation > > that merely calls check_default_collation, "#define strcoll_l(a, b, c) > > strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the > > abstraction would make the mainline code significantly cleaner. Even so, it > > might not pay off. Thoughts? > > I had considered that, but I'm mainly hesitant about introducing a fake > locale_t type into the public namespace. Maybe it should be wrapped > into a pg_locale_t; then we can do whatever we want. True. There are advantages and disadvantages to both. No strong opinion here. > > Couldn't check_default_collation be a no-op except on assert-only builds? It's > > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not > > have added non-default collations. CREATE COLLATION will presumably be made to > > fail unconditionally on such builds, too. > > Unless you register the HAVE_LOCALE_T state in pg_control, you need to > be prepared to deal with schemas that contain nondefault collations even > though the server binary doesn't support it. Makes sense. > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc; > > -- worked ... > > [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > > -- worked ... > > [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID"; > > ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function > > LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE... > > This is correct, I think. At least if you want to keep open the > possibility that a collation could also determine equality. Yes; I agree, now. Replace "COLLATE "id_ID"" with "||''", and you get the same. > > UNION on disparate collations seems to error or strip them, depending on how > > it's written: > > > > CREATE TABLE t_sv (c text COLLATE "sv_SE"); > > CREATE TABLE t_tr (c text COLLATE "tr_TR"); > > CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr; > > -- t_both.c has default collation > > CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR"; > > -- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR" > > This is correct per SQL standard. In the second case, the explicit > COLLATE clauses cause a conflict, so you get an error. In the first > case, you have implicit collation derivations that conflict, so the > result has actually no collation (not default collation), which means > you cannot sort the resulting value at all. It's actually a bit strange > here that we allow the creation of a table based on a query that has a > no collation derivation, but I did not see anything in the SQL standard > that prohibits it. Maybe a notice would be useful, like when you create > a column of type "unknown". After rereading about collation derivation several times, I think I see why it's correct. MS SQL Server has the same behaviors, too. (The exception I found: SQL Server rejects the first UNION ALL above. Both implementations reject it with UNION ALL changed to UNION.) > > COLLATE is getting recognized in various places where it does not actually do > > anything. Probably a few more places need SimpleTypenameWithoutCollation: > > > > CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text); > > ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm; > > CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION; > > ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm; > > CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1'; > > CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text'; > > All of these places also accept useless typmods. The parser is > currently pretty loose about that. It would need some major > restructuring to fix that. Okay. Peripheral to this patch, then. > > The 13% slowdown with the feature unused seems worrisome, but the additional 16% > > slowdown when using the feature to adopt a different locale seems fine. > > I figured out this problem. It needs some small restructuring in > varstr_cmp to fix. It kind of relates to how the > pg_newlocale_from_collation/check_default_collation restructuring is > done. But it's definitely fixable. > > Well, my numbers were actually smaller than 13%, but it was noticeable. > I'll give you something new to test later. Great. > > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem > > to have a different concept of collation naming. This patch creates a "default" > > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The > > standard has specific names for specific default collations based on the > > character repertoire (our server_encoding, I suppose). So, the most-conformant > > thing would be to set the name of the pg_collation of oid = 100 at CREATE > > DATABASE time, based on the chosen ENCODING. Note sure how valuable this is. > > The problem is, you can't reach across databases when you create a new > one. This is really the root of all evil with this default collation > business. Understood. Doesn't seem important to jump through hoops for this. > This is good stuff. I'll send you a new patch in a day or three for > perhaps another round of performance tests. Some of the other issues > above can perhaps be postponed for follow-up patches. I agree -- if the performance-when-unused gets solid, none of my other comments ought to hold things up. Thanks, nm
pgsql-hackers by date: