Re: Per-column collation, the finale - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Per-column collation, the finale
Date
Msg-id 20110203051053.GA713@tornado.gateway.2wire.net
Whole thread Raw
In response to Re: Per-column collation, the finale  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: Per-column collation, the finale
List pgsql-hackers
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote:
> On l??r, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> > The new documentation is helpful.  It would be useful to document the
> > implications of marking your user-defined type COLLATABLE.  As best I can tell,
> > you should only set COLLATABLE if functions that would be expected to respect
> > collation, particularly members of a B-tree operator family, do check the fmgr
> > collation.  Otherwise, the user might specify collations for data of your type
> > and be surprised to get no differentiated behavior.  Are there other
> > considerations?  Also, should implementers of COLLATABLE types observe any
> > restrictions on when to respect the collation?  For example, is equality
> > permitted to be sensitive to collation?
> 
> I improved the documentation in the CREATE TYPE man page about this.
> Note, however, that the questions you raise are not new.  There already
> is a set of types that observe locale information in a set of associated
> functions.  The answers are already hidden somewhere; this feature just
> exposes them in a different way.  Over time, we may want to document
> more of this, but at the moment it's not clear how much to actually
> expose.

That's a helpful way to think about it.

> > On that note, the following is accepted; should it be?
> > 
> >   CREATE TABLE parent (key text COLLATE "sv_SE");
> >   CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
> >   CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
> > 
> > Does the system expect any invariants to hold on the platform implementation of
> > collations?  For example, that they be consistent with equality in some way?  If
> > so, would it be practical and/or helpful to provide a tool for testing a given
> > platform locale for conformance?
> 
> We discussed this previously in this thread.  Currently, collation
> support does not affect equality.  Values are only equal if they are
> bytewise equal.  That's why the above is currently OK, but it should
> probably be restricted in the future.  I'll make a note about it.

I see that this is true for texteq, bpchareq and citext_eq.  However, nothing
stops a user from creating a COLLATABLE type and making the "=" operator for its
B-tree operator class sensitive to collation, right?  Could that break any
system assumptions?  That is, do we need to document that user-defined hash and
B-tree operator classes must ignore locale in their "=" operators?

> > Though there's no reason it ought to happen with the first commit, it would be
> > really excellent for "make check" to detect when the platform has sufficient
> > support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
> > locales).  Manual execution is fine for something like numeric_big, but this is
> > so platform-sensitive that testing it widely is important.
> 
> I solicited ideas about this a while ago, but no one came up with a
> solution.  If we do get Mac or Windows support, it would be a good time
> to revisit this and devise a platform-independent approach, if possible.

As a rough idea, we could introduce the notion of a "skip condition" SQL command
for a pg_regress test file.  When the command throws an error, pg_regress skips
that file.  For these tests, the skip condition would resemble: SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR"

> > Four call sites gain code like this (example is from varstr_cmp):
> > 
> > +#ifdef HAVE_LOCALE_T
> > +        locale_t    mylocale = pg_newlocale_from_collation(collid);
> > +#else
> > +        check_default_collation(collid);
> > +#endif
> > ...
> > +#ifdef HAVE_LOCALE_T
> > +        result = strcoll_l(a1p, a2p, mylocale);
> > +#else
> >          result = strcoll(a1p, a2p);
> > +#endif
> > 
> > Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
> > that merely calls check_default_collation, "#define strcoll_l(a, b, c)
> > strcoll(a, b)", etc.  I can see six TODO sites with similar needs, and the
> > abstraction would make the mainline code significantly cleaner.  Even so, it
> > might not pay off.  Thoughts?
> 
> I had considered that, but I'm mainly hesitant about introducing a fake
> locale_t type into the public namespace.  Maybe it should be wrapped
> into a pg_locale_t; then we can do whatever we want.

True.  There are advantages and disadvantages to both.  No strong opinion here.

> > Couldn't check_default_collation be a no-op except on assert-only builds?  It's
> > currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
> > have added non-default collations.  CREATE COLLATION will presumably be made to
> > fail unconditionally on such builds, too.
> 
> Unless you register the HAVE_LOCALE_T state in pg_control, you need to
> be prepared to deal with schemas that contain nondefault collations even
> though the server binary doesn't support it.

Makes sense.

> >   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
> >   -- worked ...
> >   [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> >   -- worked ...
> >   [local] test=# SELECT prosrc,                 count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> >   ERROR:  column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
> >   LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...
> 
> This is correct, I think.  At least if you want to keep open the
> possibility that a collation could also determine equality.

Yes; I agree, now.  Replace "COLLATE "id_ID"" with "||''", and you get the same.

> > UNION on disparate collations seems to error or strip them, depending on how
> > it's written:
> > 
> >   CREATE TABLE t_sv (c text COLLATE "sv_SE");
> >   CREATE TABLE t_tr (c text COLLATE "tr_TR");
> >   CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
> >   -- t_both.c has default collation
> >   CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
> >   -- ERROR:  42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"
> 
> This is correct per SQL standard.  In the second case, the explicit
> COLLATE clauses cause a conflict, so you get an error.  In the first
> case, you have implicit collation derivations that conflict, so the
> result has actually no collation (not default collation), which means
> you cannot sort the resulting value at all.  It's actually a bit strange
> here that we allow the creation of a table based on a query that has a
> no collation derivation, but I did not see anything in the SQL standard
> that prohibits it.  Maybe a notice would be useful, like when you create
> a column of type "unknown".

After rereading about collation derivation several times, I think I see why it's
correct.  MS SQL Server has the same behaviors, too.  (The exception I found:
SQL Server rejects the first UNION ALL above.  Both implementations reject it
with UNION ALL changed to UNION.)

> > COLLATE is getting recognized in various places where it does not actually do
> > anything.  Probably a few more places need SimpleTypenameWithoutCollation:
> > 
> >   CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
> >   ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
> >   CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
> >   ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
> >   CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
> >   CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';
> 
> All of these places also accept useless typmods.  The parser is
> currently pretty loose about that.  It would need some major
> restructuring to fix that.

Okay.  Peripheral to this patch, then.

> > The 13% slowdown with the feature unused seems worrisome, but the additional 16%
> > slowdown when using the feature to adopt a different locale seems fine.
> 
> I figured out this problem.  It needs some small restructuring in
> varstr_cmp to fix.  It kind of relates to how the
> pg_newlocale_from_collation/check_default_collation restructuring is
> done.  But it's definitely fixable.
> 
> Well, my numbers were actually smaller than 13%, but it was noticeable.
> I'll give you something new to test later.

Great.

> > Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
> > to have a different concept of collation naming.  This patch creates a "default"
> > collation that is implicitly the LC_COLLATE/LC_CTYPE of the database.  The
> > standard has specific names for specific default collations based on the
> > character repertoire (our server_encoding, I suppose).  So, the most-conformant
> > thing would be to set the name of the pg_collation of oid = 100 at CREATE
> > DATABASE time, based on the chosen ENCODING.  Note sure how valuable this is.
> 
> The problem is, you can't reach across databases when you create a new
> one.  This is really the root of all evil with this default collation
> business.

Understood.  Doesn't seem important to jump through hoops for this.

> This is good stuff.  I'll send you a new patch in a day or three for
> perhaps another round of performance tests.  Some of the other issues
> above can perhaps be postponed for follow-up patches.

I agree -- if the performance-when-unused gets solid, none of my other comments
ought to hold things up.

Thanks,
nm


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: sepgsql contrib module
Next
From: Robert Haas
Date:
Subject: Re: sepgsql contrib module