Hi,
I spent some time staring at TPC-DS benchmark's query 6. It contains a
somewhat complicated subquery, and most of the time spent on that query
is currently spent on executing the subquery again and again. The
essence of the query boils down to this:
CREATE TABLE foo (i int4, j int4);
CREATE TABLE bar (i int4, j int4);
INSERT INTO foo SELECT g%10, g FROM generate_series(1, 10000) g;
INSERT INTO bar SELECT g%10, g FROM generate_series(1, 10000) g;
SELECT * FROM foo
WHERE foo.j >= 1.2 * (SELECT avg(bar.j) FROM bar WHERE foo.i = bar.i);
The planner can pull up simpler subqueries, converting them to joins,
but unfortunately this case is beyond its capabilities. If it was smart
enough, it could be transformed into something like this:
SELECT * FROM foo
LEFT JOIN (SELECT avg(bar.j) as avg, bar.i FROM bar GROUP BY bar.i) as
avg_bar ON foo.i = avg_bar.i
WHERE foo.j >= 1.2 * avg_bar.avg
There was some brief discussion on doing this kind of transformation
back in 2011, at
https://www.postgresql.org/message-id/4E2F163B.6060105%40gmail.com. I
think we're in a better position to do that now than back then, thanks
to all the other planner work that's been done since.
So how do we get there? I've identified a number of smaller steps that,
when all put together, can handle this, as well as many other queries of
course.
0. This simple case is already handled:
SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar)
It's turned into a semi-join. The next steps will extend this basic
capability, to handle more complicated cases.
postgres=# explain SELECT * FROM foo WHERE foo.j IN (select bar.j from bar); QUERY PLAN
------------------------------------------------------------------------- Hash Join (cost=42.75..93.85 rows=1130
width=8) Hash Cond: (foo.j = bar.j) -> Seq Scan on foo (cost=0.00..32.60 rows=2260 width=8) -> Hash
(cost=40.25..40.25rows=200 width=4) -> HashAggregate (cost=38.25..40.25 rows=200 width=4) Group
Key:bar.j -> Seq Scan on bar (cost=0.00..32.60 rows=2260 width=4)
(7 rows)
1. Currently, the IN/EXISTS pullup is only done when the IN/EXISTS is a
top-level qual. For example, this isn't currently pulled-up:
SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar) OR foo.j < 10;
That's not a straight semi-join, but we could still turn it into a new
kind of LEFT-SEMI join. A left-semi join is like a left join, in that it
returns all rows from the left side, and NULLs for any non-matches. And
like a semi-join, it returns only one matching row from the right-side,
if there are duplicates. In the qual, replace the SubLink with an IS NOT
NULL test. So logically, the above would be converted into:
SELECT * FROM foo
/* SEMI- */ LEFT JOIN (select bar.j from bar) ON foo.j = bar.j
WHERE bar.j IS NOT NULL OR foo.j < 10
This transformation can be applied to IN/EXIST type SubPlan references
anywhere in the query, not only those in the WHERE clause.
2. Using a similar trick, we can transform subqueries that are not of
the IN/EXISTS kind. (This isn't useful on its own, because this example
would be handled as an InitPlan, which performs well. But it's a
stepping stone in explaining this.)
For example:
SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar)
This can be implemented using yet another new join type, a LEFT-UNIQUE
join. It's like a LEFT JOIN, but it must check that there are no
duplicates in the right-hand-side, and throw an error if there are
(ERROR: more than one row returned by a subquery used as an expression).
SELECT * FROM foo
/* unique-checking */ LEFT JOIN (select avg(bar.j) as min_j from bar) as
subq ON TRUE
WHERE foo.j > subq.min_j
3. All the previous examples are uncorrelated subqueries, but all these
transformations can also be performed on correlated subqueries. The
difference is that the subquery that's pulled up to the range table must
be marked as LATERAL. For example:
SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar where bar.i = foo.i)
->
SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select avg(bar.j) from bar
where bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j
4. The previous transformation isn't very interesting on its own,
because the only way to implement a LATERAL join is a nested loop join,
which is essentially the same as executing the subplan the naive way.
But we can potentially de-correlate it, by pulling up the correlation qual:
SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select bar.j, bar.i from bar
WHERE bar.i = foo.i) AS subq(j, i) ON true
WHERE foo.j > subq.j
->
SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select bar.j, bar.i from bar) AS
subq(j, i) ON subq.i = foo.i
WHERE foo.j > subq.j
This will get inlined into the parent query, getting rid of the subquery
altogether. But even if that cannot be done, this is potentially much
better. (I think we already do such de-correlation, at least in simple
cases, actually).
5. If the subquery contains aggregation, we can still perform the
previous transformations, by adding the correlation vars to the GROUP
BY. Like this:
SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select AVG(bar.j) from bar
WHERE bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j
->
SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select AVG(bar.j), bar.i from bar GROUP
BY bar.i) AS subq(j, i) ON foo.i = subq.i
WHERE foo.j > subq.j
6. De-correlating the subquery is not necessary a win. Consider the
previous query, if 'foo' is very small, and 'bar' is a huge. The
parameterized plan we generate for a correlated LATERAL subquery, with a
nested loop join, is actually better in that case.
If we teach the planner to consider parameterized nested loop plans for
non-lateral subqueries, by pushing down join quals back to the subquery.
There's a comment on this in set_subquery_pathlist:
* We don't currently support generating parameterized paths for subqueries * by pushing join clauses down into them;
itseems too expensive to
re-plan * the subquery multiple times to consider different alternatives. * (XXX that could stand to be reconsidered,
nowthat we use Paths.)
In the old thread from 2011 that I mentioned above, Tom said:
> This leads me to think that we need to represent both cases as the same
> sort of query and make a cost-based decision as to which way to go.
> Thinking of it as a pull-up or push-down transformation is the wrong
> approach because those sorts of transformations are done too early to
> be able to use cost comparisons.
This last step, to push down join quals to a subquery, allows the
planner to make a cost-based decision. A nested-loop, with the join qual
pushed back down to the subquery, isn't exactly the same as the SubPlan,
but it should be comparable in performance.
Thoughts? I'm planning to work on this in the next release cycle,
although I'm not too familiar with the planner, and I don't know if I'll
be distracted with something more shiny, so no promises on any results.
As a final note, I found an interesting paper called "Unnesting
Arbitrary Queries", by Thomas Neumann and Alfons Kemper
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf).
It describes a series of transformations that can be applied to
de-correlate (de-lateralize?) any lateral subquery join. The trick is to
build a temporary relation that contains all the distinct outer values
of the join, and pass that relation down to the subquery. So instead of
"executing" the subquery for every outer row, passing the outer values
as parameters (using PostgreSQL terminology), you collect a list of all
the outer values first, and then execute the subquery only once. In the
subquery, you perform a join with the temporary relation. And then the
paper describes a bunch of rules and optimizations, that can optimize
away the temporary relation in many cases. That might be one way to
implement the de-lateralization steps in PostgreSQL.
- Heikki