Re: Open issues for collations - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Open issues for collations
Date
Msg-id 21742.1303137667@sss.pgh.pa.us
Whole thread Raw
In response to Re: Open issues for collations  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Open issues for collations
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> [ assorted comments on original issues ]

I believe that all the collation issues I complained about on 26 March
are now resolved, except for the question of getting some more test
cases, and the question of adding a cares-about-collation flag to
pg_proc.  I've added the latter to the TODO list since everybody agreed
it was too late to consider it for 9.1.

However, I've come across a new issue that maybe requires discussion:
what collation should be associated with a multi-row VALUES in FROM?
For instance, in
SELECT ... FROM  (VALUES (1, 'foo'), (2, 'bar' COLLATE "C")) v(a,b),  ...

what collation should be imputed to references to v.b?

The way the code currently works is that the first row of the VALUES
list is inspected to determine what collation to report --- so in this
example, you'd get default collation, ignoring the COLLATE clause in the
second row.  There are several problems with this:

1. I think it violates the SQL spec.  SQL:2008 7.3 <table value
constructor> says that the column types of a VALUES construct are
resolved per the rules of section 9.3, and 9.3 is the one with the
standard verbiage about resolving a common collation, so it's hard
to see how "use the first row and ignore the rest" satisfies the spec.

2. It doesn't seem to satisfy the POLA --- in the above example,
ignoring the explicit COLLATE clause is rather surprising.  We could
document that that's what it does, perhaps, but if you can attach
COLLATE to any input of an operator or function and get the same
results, it's hard to explain why the same isn't true of a VALUES
column.  Especially when VALUES resolves data types in a much more
symmetrical fashion, eg this works:
  (VALUES (1, '1'), (2, '2'::real)) v(a,b)

3. It's not hard to imagine people thinking they can get row-by-row-
varying collation behavior from something like
SELECT ... FROM  (VALUES (1, 'foo' COLLATE "en_US"), (2, 'bar' COLLATE "C")) v(a,b),  ...

So not throwing an error, but rather silently doing something other than
what this SQL seems to say, seems pretty unfriendly to me.

The reason I'm expending so much verbiage on this is that fixing it
seems to require an addition to struct RangeTblEntry, ie, a catversion
bump and forced initdb, so that we can store a list of the resolved
column collations for an RTE_VALUES RTE.  I don't really think there's
much choice though.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Bernd Helmle
Date:
Subject: Re: JDBC connections to 9.1
Next
From: Noah Misch
Date:
Subject: Re: Typed table DDL loose ends