[HACKERS] Pulling up more complicated subqueries - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | [HACKERS] Pulling up more complicated subqueries |
Date | |
Msg-id | 67e353e8-c20e-7980-a282-538779edf25b@iki.fi Whole thread Raw |
Responses |
Re: [HACKERS] Pulling up more complicated subqueries
Re: [HACKERS] Pulling up more complicated subqueries |
List | pgsql-hackers |
Hi, I spent some time staring at TPC-DS benchmark's query 6. It contains a somewhat complicated subquery, and most of the time spent on that query is currently spent on executing the subquery again and again. The essence of the query boils down to this: CREATE TABLE foo (i int4, j int4); CREATE TABLE bar (i int4, j int4); INSERT INTO foo SELECT g%10, g FROM generate_series(1, 10000) g; INSERT INTO bar SELECT g%10, g FROM generate_series(1, 10000) g; SELECT * FROM foo WHERE foo.j >= 1.2 * (SELECT avg(bar.j) FROM bar WHERE foo.i = bar.i); The planner can pull up simpler subqueries, converting them to joins, but unfortunately this case is beyond its capabilities. If it was smart enough, it could be transformed into something like this: SELECT * FROM foo LEFT JOIN (SELECT avg(bar.j) as avg, bar.i FROM bar GROUP BY bar.i) as avg_bar ON foo.i = avg_bar.i WHERE foo.j >= 1.2 * avg_bar.avg There was some brief discussion on doing this kind of transformation back in 2011, at https://www.postgresql.org/message-id/4E2F163B.6060105%40gmail.com. I think we're in a better position to do that now than back then, thanks to all the other planner work that's been done since. So how do we get there? I've identified a number of smaller steps that, when all put together, can handle this, as well as many other queries of course. 0. This simple case is already handled: SELECT * FROM foo WHERE foo.j IN (select bar.j from bar) It's turned into a semi-join. The next steps will extend this basic capability, to handle more complicated cases. postgres=# explain SELECT * FROM foo WHERE foo.j IN (select bar.j from bar); QUERY PLAN ------------------------------------------------------------------------- Hash Join (cost=42.75..93.85 rows=1130 width=8) Hash Cond: (foo.j = bar.j) -> Seq Scan on foo (cost=0.00..32.60 rows=2260 width=8) -> Hash (cost=40.25..40.25rows=200 width=4) -> HashAggregate (cost=38.25..40.25 rows=200 width=4) Group Key:bar.j -> Seq Scan on bar (cost=0.00..32.60 rows=2260 width=4) (7 rows) 1. Currently, the IN/EXISTS pullup is only done when the IN/EXISTS is a top-level qual. For example, this isn't currently pulled-up: SELECT * FROM foo WHERE foo.j IN (select bar.j from bar) OR foo.j < 10; That's not a straight semi-join, but we could still turn it into a new kind of LEFT-SEMI join. A left-semi join is like a left join, in that it returns all rows from the left side, and NULLs for any non-matches. And like a semi-join, it returns only one matching row from the right-side, if there are duplicates. In the qual, replace the SubLink with an IS NOT NULL test. So logically, the above would be converted into: SELECT * FROM foo /* SEMI- */ LEFT JOIN (select bar.j from bar) ON foo.j = bar.j WHERE bar.j IS NOT NULL OR foo.j < 10 This transformation can be applied to IN/EXIST type SubPlan references anywhere in the query, not only those in the WHERE clause. 2. Using a similar trick, we can transform subqueries that are not of the IN/EXISTS kind. (This isn't useful on its own, because this example would be handled as an InitPlan, which performs well. But it's a stepping stone in explaining this.) For example: SELECT * FROM foo WHERE foo.j > (select avg(bar.j) from bar) This can be implemented using yet another new join type, a LEFT-UNIQUE join. It's like a LEFT JOIN, but it must check that there are no duplicates in the right-hand-side, and throw an error if there are (ERROR: more than one row returned by a subquery used as an expression). SELECT * FROM foo /* unique-checking */ LEFT JOIN (select avg(bar.j) as min_j from bar) as subq ON TRUE WHERE foo.j > subq.min_j 3. All the previous examples are uncorrelated subqueries, but all these transformations can also be performed on correlated subqueries. The difference is that the subquery that's pulled up to the range table must be marked as LATERAL. For example: SELECT * FROM foo WHERE foo.j > (select avg(bar.j) from bar where bar.i = foo.i) -> SELECT * FROM foo /* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select avg(bar.j) from bar where bar.i = foo.i) AS subq(j) ON true WHERE foo.j > subq.j 4. The previous transformation isn't very interesting on its own, because the only way to implement a LATERAL join is a nested loop join, which is essentially the same as executing the subplan the naive way. But we can potentially de-correlate it, by pulling up the correlation qual: SELECT * FROM foo /* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select bar.j, bar.i from bar WHERE bar.i = foo.i) AS subq(j, i) ON true WHERE foo.j > subq.j -> SELECT * FROM foo /* UNIQUE-CHECKING */ LEFT JOIN (select bar.j, bar.i from bar) AS subq(j, i) ON subq.i = foo.i WHERE foo.j > subq.j This will get inlined into the parent query, getting rid of the subquery altogether. But even if that cannot be done, this is potentially much better. (I think we already do such de-correlation, at least in simple cases, actually). 5. If the subquery contains aggregation, we can still perform the previous transformations, by adding the correlation vars to the GROUP BY. Like this: SELECT * FROM foo /* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select AVG(bar.j) from bar WHERE bar.i = foo.i) AS subq(j) ON true WHERE foo.j > subq.j -> SELECT * FROM foo /* UNIQUE-CHECKING */ LEFT JOIN (select AVG(bar.j), bar.i from bar GROUP BY bar.i) AS subq(j, i) ON foo.i = subq.i WHERE foo.j > subq.j 6. De-correlating the subquery is not necessary a win. Consider the previous query, if 'foo' is very small, and 'bar' is a huge. The parameterized plan we generate for a correlated LATERAL subquery, with a nested loop join, is actually better in that case. If we teach the planner to consider parameterized nested loop plans for non-lateral subqueries, by pushing down join quals back to the subquery. There's a comment on this in set_subquery_pathlist: * We don't currently support generating parameterized paths for subqueries * by pushing join clauses down into them; itseems too expensive to re-plan * the subquery multiple times to consider different alternatives. * (XXX that could stand to be reconsidered, nowthat we use Paths.) In the old thread from 2011 that I mentioned above, Tom said: > This leads me to think that we need to represent both cases as the same > sort of query and make a cost-based decision as to which way to go. > Thinking of it as a pull-up or push-down transformation is the wrong > approach because those sorts of transformations are done too early to > be able to use cost comparisons. This last step, to push down join quals to a subquery, allows the planner to make a cost-based decision. A nested-loop, with the join qual pushed back down to the subquery, isn't exactly the same as the SubPlan, but it should be comparable in performance. Thoughts? I'm planning to work on this in the next release cycle, although I'm not too familiar with the planner, and I don't know if I'll be distracted with something more shiny, so no promises on any results. As a final note, I found an interesting paper called "Unnesting Arbitrary Queries", by Thomas Neumann and Alfons Kemper (http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It describes a series of transformations that can be applied to de-correlate (de-lateralize?) any lateral subquery join. The trick is to build a temporary relation that contains all the distinct outer values of the join, and pass that relation down to the subquery. So instead of "executing" the subquery for every outer row, passing the outer values as parameters (using PostgreSQL terminology), you collect a list of all the outer values first, and then execute the subquery only once. In the subquery, you perform a join with the temporary relation. And then the paper describes a bunch of rules and optimizations, that can optimize away the temporary relation in many cases. That might be one way to implement the de-lateralization steps in PostgreSQL. - Heikki
pgsql-hackers by date: