Thread: Making view dump/restore safe at the column-alias level

Making view dump/restore safe at the column-alias level

From
Tom Lane
Date:
In commit 11e131854f8231a21613f834c40fe9d046926387 we rearranged
ruleutils.c's handling of relation aliases to ensure that views can
always be dumped and reloaded even in the face of confusing table
renamings.  I was reminded by
http://archives.postgresql.org/pgsql-general/2012-12/msg00654.php
that this is only half of the problem: you can still get burnt by
ambiguous column references, and pretty easily at that.

Aside from plain old ambiguity, there is a nastier problem: JOIN USING
and NATURAL JOIN depend on particular column names matching up, which
they might not do anymore after a column rename.  We have discussed
this previously (though I can't find the archives reference right now),
and the best anybody came up with was to invent some syntax extension
that would allow matching differently-named columns in USING, perhaps
along the lines of USING (leftcol = rightcol, ...).  But that's pretty
ugly and nobody volunteered to actually do it.

I had an idea though about how we might fix this without that.  Assume
that the problem is strictly ruleutils' to fix, ie we are not going to
invent new syntax and we are not going to change the existing methods
of assigning aliases to subselect columns.  We clearly will need to let
ruleutils assign new column aliases that are unique within each RTE
entry.  I think though that we can fix the JOIN USING problem if we
introduce an additional idea that alias choices can be forced top-down.
So a JOIN USING RTE would force the two columns being merged to be given
the same alias already assigned to the merged column in the JOIN RTE.
(If we ever get around to implementing the CORRESPONDING clause in
UNION/INTERSECT/EXCEPT, it would have to do something similar.)  We'd
similarly force the output aliases at the top level of a view to be the
view's known result column names (which presumably are distinct thanks
to pg_attribute's unique constraint).  Otherwise, as we descend the
query tree, we can assign distinct column aliases to each column of an
RTE, preferring the original name when possible but otherwise making it
unique by adding a number, as we already did with the relation aliases.

In the case of view-printing, once these aliases are all assigned we can
represent them in the SQL output easily enough; that code is already
there.  I'm not sure whether it's a good idea for EXPLAIN to use this
same kind of logic, since there's not currently anyplace in EXPLAIN
output to show nondefault column aliases.  It might be more confusing
than otherwise to use generated aliases in EXPLAIN, even if the original
aliases conflict.

If we're going to do something like this, now (9.3) would be a good time
since we already made changes in alias-assignment in the earlier commit.

Comments, better ideas?
        regards, tom lane



Re: Making view dump/restore safe at the column-alias level

From
Robert Haas
Date:
On Fri, Dec 21, 2012 at 6:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 11e131854f8231a21613f834c40fe9d046926387 we rearranged
> ruleutils.c's handling of relation aliases to ensure that views can
> always be dumped and reloaded even in the face of confusing table
> renamings.  I was reminded by
> http://archives.postgresql.org/pgsql-general/2012-12/msg00654.php
> that this is only half of the problem: you can still get burnt by
> ambiguous column references, and pretty easily at that.
>
> Aside from plain old ambiguity, there is a nastier problem: JOIN USING
> and NATURAL JOIN depend on particular column names matching up, which
> they might not do anymore after a column rename.  We have discussed
> this previously (though I can't find the archives reference right now),
> and the best anybody came up with was to invent some syntax extension
> that would allow matching differently-named columns in USING, perhaps
> along the lines of USING (leftcol = rightcol, ...).  But that's pretty
> ugly and nobody volunteered to actually do it.
>
> I had an idea though about how we might fix this without that.  Assume
> that the problem is strictly ruleutils' to fix, ie we are not going to
> invent new syntax and we are not going to change the existing methods
> of assigning aliases to subselect columns.  We clearly will need to let
> ruleutils assign new column aliases that are unique within each RTE
> entry.  I think though that we can fix the JOIN USING problem if we
> introduce an additional idea that alias choices can be forced top-down.
> So a JOIN USING RTE would force the two columns being merged to be given
> the same alias already assigned to the merged column in the JOIN RTE.
> (If we ever get around to implementing the CORRESPONDING clause in
> UNION/INTERSECT/EXCEPT, it would have to do something similar.)  We'd
> similarly force the output aliases at the top level of a view to be the
> view's known result column names (which presumably are distinct thanks
> to pg_attribute's unique constraint).  Otherwise, as we descend the
> query tree, we can assign distinct column aliases to each column of an
> RTE, preferring the original name when possible but otherwise making it
> unique by adding a number, as we already did with the relation aliases.
>
> In the case of view-printing, once these aliases are all assigned we can
> represent them in the SQL output easily enough; that code is already
> there.  I'm not sure whether it's a good idea for EXPLAIN to use this
> same kind of logic, since there's not currently anyplace in EXPLAIN
> output to show nondefault column aliases.  It might be more confusing
> than otherwise to use generated aliases in EXPLAIN, even if the original
> aliases conflict.
>
> If we're going to do something like this, now (9.3) would be a good time
> since we already made changes in alias-assignment in the earlier commit.
>
> Comments, better ideas?
>
>                         regards, tom lane

I'm having a hard time following this.  Can you provide a concrete example?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making view dump/restore safe at the column-alias level

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm having a hard time following this.  Can you provide a concrete example?

regression=# create table t1 (x int, y int);
CREATE TABLE
regression=# create table t2 (x int, z int);
CREATE TABLE
regression=# create view v1 as select * from t1 join t2 using (x);
CREATE VIEW
regression=# \d+ v1                  View "public.v1"Column |  Type   | Modifiers | Storage | Description 
--------+---------+-----------+---------+-------------x      | integer |           | plain   | y      | integer |
   | plain   | z      | integer |           | plain   | 
 
View definition:SELECT t1.x, t1.y, t2.z  FROM t1  JOIN t2 USING (x);
regression=# alter table t2 rename column x to q;
ALTER TABLE
regression=# \d+ v1                  View "public.v1"Column |  Type   | Modifiers | Storage | Description 
--------+---------+-----------+---------+-------------x      | integer |           | plain   | y      | integer |
   | plain   | z      | integer |           | plain   | 
 
View definition:SELECT t1.x, t1.y, t2.z  FROM t1  JOIN t2 USING (x);

At this point the dumped view definition is wrong: if you try to execute
it you get

regression=# SELECT t1.x, t1.y, t2.z
regression-#    FROM t1
regression-#    JOIN t2 USING (x);
ERROR:  column "x" specified in USING clause does not exist in right table

I'm suggesting that we could fix this by emitting something that forces
the right alias to be assigned to t2.q:

SELECT t1.x, t1.y, t2.z  FROM t1  JOIN t2 AS t2(x,z)  USING (x);

The implementation I have in mind is to recurse down the join tree and
have any JOIN USING item forcibly propagate the common column name as
the alias-to-use for each of the two input columns.

Also consider

regression=# create view v2 as select * from (select 1,2) as a(x,y)
regression-# union select * from (select 3,4) as b;
CREATE VIEW
regression=# \d+ v2                  View "public.v2"Column |  Type   | Modifiers | Storage | Description 
--------+---------+-----------+---------+-------------x      | integer |           | plain   | y      | integer |
   | plain   | 
 
View definition:        SELECT a.x, a.y          FROM ( SELECT 1, 2) a(x, y)
UNION         SELECT b."?column?" AS x, b."?column?" AS y          FROM ( SELECT 3, 4) b;

That view definition doesn't work either, as complained of today in
pgsql-general.  To fix this we just need to force the columns of b
to be given distinct aliases.  The minimum-new-code solution would
probably be to produce
        SELECT a.x, a.y          FROM ( SELECT 1, 2) a(x, y)
UNION         SELECT b."?column?" AS x, b."?column?_1" AS y          FROM ( SELECT 3, 4) b("?column?", "?column?_1")

using the same add-some-digits-until-unique logic we are using for
relation aliases.  This could be done by considering all the column
aliases of each RTE when we arrive at it during the recursive scan.

On further reflection I think my worry about the top-level aliases
was unfounded --- we prevent views from being created at all unless
the top-level column names are all distinct.   But we definitely
have got issues for lower-level aliases, as these examples show.
        regards, tom lane



Re: Making view dump/restore safe at the column-alias level

From
Robert Haas
Date:
On Fri, Dec 21, 2012 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm having a hard time following this.  Can you provide a concrete example?
>
> regression=# create table t1 (x int, y int);
> CREATE TABLE
> regression=# create table t2 (x int, z int);
> CREATE TABLE
> regression=# create view v1 as select * from t1 join t2 using (x);
> CREATE VIEW
> regression=# \d+ v1
>                    View "public.v1"
>  Column |  Type   | Modifiers | Storage | Description
> --------+---------+-----------+---------+-------------
>  x      | integer |           | plain   |
>  y      | integer |           | plain   |
>  z      | integer |           | plain   |
> View definition:
>  SELECT t1.x, t1.y, t2.z
>    FROM t1
>    JOIN t2 USING (x);
> regression=# alter table t2 rename column x to q;
> ALTER TABLE
> regression=# \d+ v1
>                    View "public.v1"
>  Column |  Type   | Modifiers | Storage | Description
> --------+---------+-----------+---------+-------------
>  x      | integer |           | plain   |
>  y      | integer |           | plain   |
>  z      | integer |           | plain   |
> View definition:
>  SELECT t1.x, t1.y, t2.z
>    FROM t1
>    JOIN t2 USING (x);
>
> At this point the dumped view definition is wrong: if you try to execute
> it you get
>
> regression=# SELECT t1.x, t1.y, t2.z
> regression-#    FROM t1
> regression-#    JOIN t2 USING (x);
> ERROR:  column "x" specified in USING clause does not exist in right table
>
> I'm suggesting that we could fix this by emitting something that forces
> the right alias to be assigned to t2.q:
>
> SELECT t1.x, t1.y, t2.z
>    FROM t1
>    JOIN t2 AS t2(x,z)
>    USING (x);

Sneaky.  I didn't know that would even work, but it seems like a
sensible approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making view dump/restore safe at the column-alias level

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Dec 21, 2012 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm suggesting that we could fix this by emitting something that forces
>> the right alias to be assigned to t2.q:
>>
>> SELECT t1.x, t1.y, t2.z
>> FROM t1
>> JOIN t2 AS t2(x,z)
>> USING (x);

> Sneaky.  I didn't know that would even work, but it seems like a
> sensible approach.

I've been idly hacking away at this over Christmas break, and attached
is a draft patch.  The problems turned out to be considerably more
extensive than I'd realized --- in addition to the stated issue with
forcing input aliases for JOIN USING colums to match, I found that:

* For joins without aliases, we can't qualify join column references,
since obviously there's no relation name to use.  (And we can't add one
without breaking queries, because SQL specifies that an aliased JOIN hides
relation names within it.)  That's okay in simple cases because we can
just print the name of the referenced input column instead.  However,
that doesn't work for merged columns in FULL JOIN USING, because in a full
join a merged output column doesn't behave the same as either input.  The
only solution I can see for this is to force the column aliases for such
columns to be unique query-wide, not just within the join RTE.  Then they
can be referenced without a relation name and still be unambiguous.

* When printing a join's column alias list, we were just blindly printing
the user's original alias list.  However, addition or removal of columns
from either input table invalidates that: we have to be able to add or
remove aliases to match the new column set.

I believe the attached patch covers all cases arising from column
additions, deletions, or renames.  It's awfully large though --- about
1100 lines added to ruleutils.c.  We could possibly make it a bit smaller
if we changed the parser to save more information about JOIN USING
columns, so that we don't need the grotty flatten_join_using_qual hack.
But that would only save about 100 lines, and it would add more elsewhere,
so I'm not sure it's worth the trouble.  (It would also prevent anyone
from trying to use the patch in the back branches, not that I plan to
take the risk of back-patching.)

On the whole I think this is a "must fix" bug, so we don't have a lot of
choice, unless someone has a proposal for a different and more compact
way of solving the problem.

            regards, tom lane


Attachment

Re: Making view dump/restore safe at the column-alias level

From
Greg Stark
Date:
On Mon, Dec 31, 2012 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On the whole I think this is a "must fix" bug, so we don't have a lot of
> choice, unless someone has a proposal for a different and more compact
> way of solving the problem.

The only more compact way of handling things that I can see is adding
syntax to let us explicitly select exactly the columns we need. But
then the resulting view definitions would be Postgres-specific instead
of standard SQL which would defeat a large part of the motivation to
going to such lengths.

I do wonder whether the SQL standard will do something obtuse enough
that that's the only option for a large swathe of queries. Or is that
the case already? The query syntax you're using here, is it standard
SQL? Is it widely supported?

-- 
greg



Re: Making view dump/restore safe at the column-alias level

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I do wonder whether the SQL standard will do something obtuse enough
> that that's the only option for a large swathe of queries. Or is that
> the case already? The query syntax you're using here, is it standard
> SQL? Is it widely supported?

Yeah, it's standard --- there's nothing here that wasn't in SQL92.
(Although I notice that SQL still hasn't got any ALTER TABLE RENAME
command, much less a column rename command.  I wonder whether the
committee is aware of these difficulties and has shied away from adding
RENAME because of them?)

As for widely supported, I can't imagine that the big boys don't have
this, although a quick test shows that mysql only has table aliases
not column aliases, ie you can do "FROM t1 AS t1x" but not
"FROM t1 AS t1x(y)".  Still, if that's a consideration, inventing
our own syntax would be even further away from the goal.  Also, the
patch goes to some lengths to not print column aliases unnecessarily
--- in fact, there are cases where the old code would print column
aliases but the patch will not.
        regards, tom lane