Thread: Making view dump/restore safe at the column-alias level
In commit 11e131854f8231a21613f834c40fe9d046926387 we rearranged ruleutils.c's handling of relation aliases to ensure that views can always be dumped and reloaded even in the face of confusing table renamings. I was reminded by http://archives.postgresql.org/pgsql-general/2012-12/msg00654.php that this is only half of the problem: you can still get burnt by ambiguous column references, and pretty easily at that. Aside from plain old ambiguity, there is a nastier problem: JOIN USING and NATURAL JOIN depend on particular column names matching up, which they might not do anymore after a column rename. We have discussed this previously (though I can't find the archives reference right now), and the best anybody came up with was to invent some syntax extension that would allow matching differently-named columns in USING, perhaps along the lines of USING (leftcol = rightcol, ...). But that's pretty ugly and nobody volunteered to actually do it. I had an idea though about how we might fix this without that. Assume that the problem is strictly ruleutils' to fix, ie we are not going to invent new syntax and we are not going to change the existing methods of assigning aliases to subselect columns. We clearly will need to let ruleutils assign new column aliases that are unique within each RTE entry. I think though that we can fix the JOIN USING problem if we introduce an additional idea that alias choices can be forced top-down. So a JOIN USING RTE would force the two columns being merged to be given the same alias already assigned to the merged column in the JOIN RTE. (If we ever get around to implementing the CORRESPONDING clause in UNION/INTERSECT/EXCEPT, it would have to do something similar.) We'd similarly force the output aliases at the top level of a view to be the view's known result column names (which presumably are distinct thanks to pg_attribute's unique constraint). Otherwise, as we descend the query tree, we can assign distinct column aliases to each column of an RTE, preferring the original name when possible but otherwise making it unique by adding a number, as we already did with the relation aliases. In the case of view-printing, once these aliases are all assigned we can represent them in the SQL output easily enough; that code is already there. I'm not sure whether it's a good idea for EXPLAIN to use this same kind of logic, since there's not currently anyplace in EXPLAIN output to show nondefault column aliases. It might be more confusing than otherwise to use generated aliases in EXPLAIN, even if the original aliases conflict. If we're going to do something like this, now (9.3) would be a good time since we already made changes in alias-assignment in the earlier commit. Comments, better ideas? regards, tom lane
On Fri, Dec 21, 2012 at 6:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > In commit 11e131854f8231a21613f834c40fe9d046926387 we rearranged > ruleutils.c's handling of relation aliases to ensure that views can > always be dumped and reloaded even in the face of confusing table > renamings. I was reminded by > http://archives.postgresql.org/pgsql-general/2012-12/msg00654.php > that this is only half of the problem: you can still get burnt by > ambiguous column references, and pretty easily at that. > > Aside from plain old ambiguity, there is a nastier problem: JOIN USING > and NATURAL JOIN depend on particular column names matching up, which > they might not do anymore after a column rename. We have discussed > this previously (though I can't find the archives reference right now), > and the best anybody came up with was to invent some syntax extension > that would allow matching differently-named columns in USING, perhaps > along the lines of USING (leftcol = rightcol, ...). But that's pretty > ugly and nobody volunteered to actually do it. > > I had an idea though about how we might fix this without that. Assume > that the problem is strictly ruleutils' to fix, ie we are not going to > invent new syntax and we are not going to change the existing methods > of assigning aliases to subselect columns. We clearly will need to let > ruleutils assign new column aliases that are unique within each RTE > entry. I think though that we can fix the JOIN USING problem if we > introduce an additional idea that alias choices can be forced top-down. > So a JOIN USING RTE would force the two columns being merged to be given > the same alias already assigned to the merged column in the JOIN RTE. > (If we ever get around to implementing the CORRESPONDING clause in > UNION/INTERSECT/EXCEPT, it would have to do something similar.) We'd > similarly force the output aliases at the top level of a view to be the > view's known result column names (which presumably are distinct thanks > to pg_attribute's unique constraint). Otherwise, as we descend the > query tree, we can assign distinct column aliases to each column of an > RTE, preferring the original name when possible but otherwise making it > unique by adding a number, as we already did with the relation aliases. > > In the case of view-printing, once these aliases are all assigned we can > represent them in the SQL output easily enough; that code is already > there. I'm not sure whether it's a good idea for EXPLAIN to use this > same kind of logic, since there's not currently anyplace in EXPLAIN > output to show nondefault column aliases. It might be more confusing > than otherwise to use generated aliases in EXPLAIN, even if the original > aliases conflict. > > If we're going to do something like this, now (9.3) would be a good time > since we already made changes in alias-assignment in the earlier commit. > > Comments, better ideas? > > regards, tom lane I'm having a hard time following this. Can you provide a concrete example? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm having a hard time following this. Can you provide a concrete example? regression=# create table t1 (x int, y int); CREATE TABLE regression=# create table t2 (x int, z int); CREATE TABLE regression=# create view v1 as select * from t1 join t2 using (x); CREATE VIEW regression=# \d+ v1 View "public.v1"Column | Type | Modifiers | Storage | Description --------+---------+-----------+---------+-------------x | integer | | plain | y | integer | | plain | z | integer | | plain | View definition:SELECT t1.x, t1.y, t2.z FROM t1 JOIN t2 USING (x); regression=# alter table t2 rename column x to q; ALTER TABLE regression=# \d+ v1 View "public.v1"Column | Type | Modifiers | Storage | Description --------+---------+-----------+---------+-------------x | integer | | plain | y | integer | | plain | z | integer | | plain | View definition:SELECT t1.x, t1.y, t2.z FROM t1 JOIN t2 USING (x); At this point the dumped view definition is wrong: if you try to execute it you get regression=# SELECT t1.x, t1.y, t2.z regression-# FROM t1 regression-# JOIN t2 USING (x); ERROR: column "x" specified in USING clause does not exist in right table I'm suggesting that we could fix this by emitting something that forces the right alias to be assigned to t2.q: SELECT t1.x, t1.y, t2.z FROM t1 JOIN t2 AS t2(x,z) USING (x); The implementation I have in mind is to recurse down the join tree and have any JOIN USING item forcibly propagate the common column name as the alias-to-use for each of the two input columns. Also consider regression=# create view v2 as select * from (select 1,2) as a(x,y) regression-# union select * from (select 3,4) as b; CREATE VIEW regression=# \d+ v2 View "public.v2"Column | Type | Modifiers | Storage | Description --------+---------+-----------+---------+-------------x | integer | | plain | y | integer | | plain | View definition: SELECT a.x, a.y FROM ( SELECT 1, 2) a(x, y) UNION SELECT b."?column?" AS x, b."?column?" AS y FROM ( SELECT 3, 4) b; That view definition doesn't work either, as complained of today in pgsql-general. To fix this we just need to force the columns of b to be given distinct aliases. The minimum-new-code solution would probably be to produce SELECT a.x, a.y FROM ( SELECT 1, 2) a(x, y) UNION SELECT b."?column?" AS x, b."?column?_1" AS y FROM ( SELECT 3, 4) b("?column?", "?column?_1") using the same add-some-digits-until-unique logic we are using for relation aliases. This could be done by considering all the column aliases of each RTE when we arrive at it during the recursive scan. On further reflection I think my worry about the top-level aliases was unfounded --- we prevent views from being created at all unless the top-level column names are all distinct. But we definitely have got issues for lower-level aliases, as these examples show. regards, tom lane
On Fri, Dec 21, 2012 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I'm having a hard time following this. Can you provide a concrete example? > > regression=# create table t1 (x int, y int); > CREATE TABLE > regression=# create table t2 (x int, z int); > CREATE TABLE > regression=# create view v1 as select * from t1 join t2 using (x); > CREATE VIEW > regression=# \d+ v1 > View "public.v1" > Column | Type | Modifiers | Storage | Description > --------+---------+-----------+---------+------------- > x | integer | | plain | > y | integer | | plain | > z | integer | | plain | > View definition: > SELECT t1.x, t1.y, t2.z > FROM t1 > JOIN t2 USING (x); > regression=# alter table t2 rename column x to q; > ALTER TABLE > regression=# \d+ v1 > View "public.v1" > Column | Type | Modifiers | Storage | Description > --------+---------+-----------+---------+------------- > x | integer | | plain | > y | integer | | plain | > z | integer | | plain | > View definition: > SELECT t1.x, t1.y, t2.z > FROM t1 > JOIN t2 USING (x); > > At this point the dumped view definition is wrong: if you try to execute > it you get > > regression=# SELECT t1.x, t1.y, t2.z > regression-# FROM t1 > regression-# JOIN t2 USING (x); > ERROR: column "x" specified in USING clause does not exist in right table > > I'm suggesting that we could fix this by emitting something that forces > the right alias to be assigned to t2.q: > > SELECT t1.x, t1.y, t2.z > FROM t1 > JOIN t2 AS t2(x,z) > USING (x); Sneaky. I didn't know that would even work, but it seems like a sensible approach. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Dec 21, 2012 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm suggesting that we could fix this by emitting something that forces >> the right alias to be assigned to t2.q: >> >> SELECT t1.x, t1.y, t2.z >> FROM t1 >> JOIN t2 AS t2(x,z) >> USING (x); > Sneaky. I didn't know that would even work, but it seems like a > sensible approach. I've been idly hacking away at this over Christmas break, and attached is a draft patch. The problems turned out to be considerably more extensive than I'd realized --- in addition to the stated issue with forcing input aliases for JOIN USING colums to match, I found that: * For joins without aliases, we can't qualify join column references, since obviously there's no relation name to use. (And we can't add one without breaking queries, because SQL specifies that an aliased JOIN hides relation names within it.) That's okay in simple cases because we can just print the name of the referenced input column instead. However, that doesn't work for merged columns in FULL JOIN USING, because in a full join a merged output column doesn't behave the same as either input. The only solution I can see for this is to force the column aliases for such columns to be unique query-wide, not just within the join RTE. Then they can be referenced without a relation name and still be unambiguous. * When printing a join's column alias list, we were just blindly printing the user's original alias list. However, addition or removal of columns from either input table invalidates that: we have to be able to add or remove aliases to match the new column set. I believe the attached patch covers all cases arising from column additions, deletions, or renames. It's awfully large though --- about 1100 lines added to ruleutils.c. We could possibly make it a bit smaller if we changed the parser to save more information about JOIN USING columns, so that we don't need the grotty flatten_join_using_qual hack. But that would only save about 100 lines, and it would add more elsewhere, so I'm not sure it's worth the trouble. (It would also prevent anyone from trying to use the patch in the back branches, not that I plan to take the risk of back-patching.) On the whole I think this is a "must fix" bug, so we don't have a lot of choice, unless someone has a proposal for a different and more compact way of solving the problem. regards, tom lane
Attachment
On Mon, Dec 31, 2012 at 12:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > On the whole I think this is a "must fix" bug, so we don't have a lot of > choice, unless someone has a proposal for a different and more compact > way of solving the problem. The only more compact way of handling things that I can see is adding syntax to let us explicitly select exactly the columns we need. But then the resulting view definitions would be Postgres-specific instead of standard SQL which would defeat a large part of the motivation to going to such lengths. I do wonder whether the SQL standard will do something obtuse enough that that's the only option for a large swathe of queries. Or is that the case already? The query syntax you're using here, is it standard SQL? Is it widely supported? -- greg
Greg Stark <stark@mit.edu> writes: > I do wonder whether the SQL standard will do something obtuse enough > that that's the only option for a large swathe of queries. Or is that > the case already? The query syntax you're using here, is it standard > SQL? Is it widely supported? Yeah, it's standard --- there's nothing here that wasn't in SQL92. (Although I notice that SQL still hasn't got any ALTER TABLE RENAME command, much less a column rename command. I wonder whether the committee is aware of these difficulties and has shied away from adding RENAME because of them?) As for widely supported, I can't imagine that the big boys don't have this, although a quick test shows that mysql only has table aliases not column aliases, ie you can do "FROM t1 AS t1x" but not "FROM t1 AS t1x(y)". Still, if that's a consideration, inventing our own syntax would be even further away from the goal. Also, the patch goes to some lengths to not print column aliases unnecessarily --- in fact, there are cases where the old code would print column aliases but the patch will not. regards, tom lane