Re: New CORRESPONDING clause design - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: New CORRESPONDING clause design |
Date | |
Msg-id | 30942.1491069285@sss.pgh.pa.us Whole thread Raw |
In response to | Re: New CORRESPONDING clause design (Pavel Stehule <pavel.stehule@gmail.com>) |
List | pgsql-hackers |
Pavel Stehule <pavel.stehule@gmail.com> writes: > [ corresponding_clause_v12.patch ] I worked on this for awhile but eventually decided that it's not very close to being committable. The main thing that's scaring me off is a realization that there are a *lot* of places that assume that the output columns of a set operation are one-for-one with the columns of the inputs. One such place is the subquery qual pushdown logic in allpaths.c, which this patch hasn't touched, resulting in regression=# create table t1 (a int, b int, c int); CREATE TABLE regression=# create table t2 (a int, b int, c int); CREATE TABLE regression=# create view vv2 as regression-# select * from t1 union all corresponding by (b,c) select * from t2; CREATE VIEW regression=# select * from vv2 where c = 44; ERROR: wrong number of tlist entries Another is the code that converts a UNION ALL subquery into an appendrel. The best thing there might be to just reject CORRESPONDING in is_simple_union_all, but this patch doesn't, which means we get the wrong results from regression=# create table t3 (b int, a int, c int); CREATE TABLE regression=# explain verbose select * from t1 union all corresponding select * from t3; QUERYPLAN --------------------------------------------------------------------Append (cost=0.00..60.80 rows=4080 width=12) -> SeqScan on public.t1 (cost=0.00..30.40 rows=2040 width=12) Output: t1.a, t1.b, t1.c -> Seq Scan on public.t3 (cost=0.00..30.40rows=2040 width=12) Output: t3.b, t3.a, t3.c (5 rows) Notice it's failed to rearrange the columns to match. There are also such assumptions in ruleutils.c. Trying to reverse-list the above view gives regression=# \d+ vv2 View "public.vv2"Column | Type | Collation | Nullable | Default | Storage| Description --------+---------+-----------+----------+---------+---------+-------------b | integer | | | | plain | c | integer | | | | plain | View definition:SELECT t1.a AS b, t1.b AS c, t1.c FROM t1 UNION ALL CORRESPONDING BY (b, c)SELECT t2.a AS b, t2.b AS c, t2.c FROM t2; which is obviously wrong. The reason for that is that the code is trying to ensure that the SELECT's output column names match the view's column names, so it sticks AS clauses on the select-list expressions. If we go a little further: regression=# alter table vv2 rename column c to ceetwo; ALTER TABLE regression=# \d+ vv2 View "public.vv2"Column | Type | Collation | Nullable | Default | Storage| Description --------+---------+-----------+----------+---------+---------+-------------b | integer | | | | plain | ceetwo | integer | | | | plain | View definition:SELECT t1.a AS b, t1.b AS ceetwo, t1.c FROM t1 UNION ALL CORRESPONDING BY (b, c)SELECT t2.a AS b, t2.b AS ceetwo, t2.c FROM t2; Now things are a *complete* mess, because this view definition would successfully parse during a dump-and-reload, but it means something else than it did before; the CORRESPONDING BY list has failed to track the change in names of the columns. In general, there's a lot of subtle logic in ruleutils.c about what we need to do to ensure that reverse-listed views keep the same semantic meaning in the face of column renamings or additions in their input tables. CORRESPONDING is really scary in this situation because its semantics depend on column names. I tried to break it by adding/renaming columns of the input tables to see if I could get ruleutils to print something that didn't mean what it meant before. I didn't succeed immediately, but I am thinking it would be smart for us always to reverse-list the construct as CORRESPONDING BY with an explicit list of column names. We don't ever reverse-list a NATURAL JOIN as such, preferring to use an explicit JOIN USING list, for closely related reasons. (You might care to study the logic in ruleutils.c around the usingNames list, which exists to ensure that JOIN USING doesn't get broken by column renames. It's entirely likely that we're going to need something equivalently complicated for CORRESPONDING BY. Or if we don't, I'd want to have some comments in there that clearly explain why we don't, because otherwise somebody will break it in future.) I also found that WITH RECURSIVE fails immediately if you stick CORRESPONDING into the recursive union step; eg in this lightly modified version of a query from with.sql, CREATE TEMPORARY TABLE tree( id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES tree(id) ); WITH RECURSIVE t(id, path) AS ( select 1 as id, ARRAY[]::integer[] as path UNION ALL CORRESPONDING SELECT tree.id, t.path || tree.id as path FROM tree JOIN t ON (tree.parent_id = t.id) ) SELECT t1.*, t2.* FROM t AS t1 JOIN t AS t2 ON(t1.path[1] = t2.path[1] ANDarray_upper(t1.path,1) = 1 ANDarray_upper(t2.path,1)> 1)ORDER BY t1.id, t2.id; ERROR: column t.id does not exist LINE 5: FROM tree JOIN t ON (tree.parent_id = t.id) ^ HINT: Perhaps you meant to reference the column "tree.id". I was expecting to find trouble if the CORRESPONDING had to rearrange columns (breaking the 1-to-1 assumption), but it fails even without that. There may well be other places in the system that are making similar assumptions about 1-to-1 mapping between setop inputs and outputs. This comment in plan_set_operations is pretty scary, for instance: * Find the leftmost component Query. We need to useits column names for * all generated tlists (else SELECT INTO won't work right). because the column names in the component query certainly won't be 1-to-1 with upper tlists if CORRESPONDING is active. I wonder whether it wouldn't be a smarter plan to abandon this implementation approach and rely on inserting an explicit level of sub-selects during the parse analysis transformation. That is, SELECT * FROM foo UNION CORRESPONDING BY (x, y) SELECT * FROM bar would get converted into SELECT x, y FROM (SELECT * FROM foo) ss1 UNION SELECT x, y FROM (SELECT * FROM bar) ss2 This isn't very pretty, but I would have a lot fewer worries about how long we'd be finding bugs induced by the feature. If we do stick with trying to represent CORRESPONDING BY explicitly in the parse tree, I'm not very satisfied with the representation you've chosen for SetOperationStmt.correspondingColumns. The patch says + List *correspondingColumns; /* list of corresponding column names */ which is a flat out lie. debug_print_parse and some study of the code showed me that it's actually a list of TargetEntry nodes containing copies of the left-hand input's target expression trees. This seems bulky, bizarre, and asymmetric; among other things it requires the planner to re-derive the column matching that the parser had already figured out. There certainly doesn't seem to be any value in carrying around an extra copy of the left-hand expressions. I'd be inclined to suggest that the best representation in SetOperationStmt is two integer lists of the left-hand and right-hand column numbers of the CORRESPONDING columns. This would probably simplify matters greatly for the planner. It would mean that ruleutils.c would have to do a bit more work to find out the column names to print in CORRESPONDING BY; but as I showed above, just printing the names that appeared at parse time is a doomed strategy anyway. Also, just in passing --- while it may well be a good idea to add a location field to struct Value, I'm not on board with doing so and then using it in only one place. As I said earlier, I think that CORRESPONDING's column name list should act like every other column name list in the grammar, and that includes making use of error cursor locations where appropriate. And changing struct Value has repercussions way beyond that, for instance it's not clear why we'd keep A_Const as a separate node type if Value has a location field. I think if you want to push that forward, it would be a good idea to submit a separate patch that just focuses on adding error cursor location reports everywhere that adding locations to Values would enable. I'll set this back to Waiting on Author, but I think the chances of getting to a committable patch before the end of the commitfest are about nil. regards, tom lane
pgsql-hackers by date: