Thread: [HACKERS] Pulling up more complicated subqueries

[HACKERS] Pulling up more complicated subqueries

From
Heikki Linnakangas
Date:
Hi,

I spent some time staring at TPC-DS benchmark's query 6. It contains a 
somewhat complicated subquery, and most of the time spent on that query 
is currently spent on executing the subquery again and again. The 
essence of the query boils down to this:

CREATE TABLE foo (i int4, j int4);
CREATE TABLE bar (i int4, j int4);

INSERT INTO foo SELECT g%10, g FROM generate_series(1, 10000) g;
INSERT INTO bar SELECT g%10, g FROM generate_series(1, 10000) g;


SELECT * FROM foo
WHERE foo.j >= 1.2 *  (SELECT avg(bar.j) FROM bar WHERE foo.i = bar.i);


The planner can pull up simpler subqueries, converting them to joins, 
but unfortunately this case is beyond its capabilities. If it was smart 
enough, it could be transformed into something like this:

SELECT * FROM foo
LEFT JOIN (SELECT avg(bar.j) as avg, bar.i FROM bar GROUP BY bar.i) as 
avg_bar ON foo.i = avg_bar.i
WHERE foo.j >= 1.2 * avg_bar.avg


There was some brief discussion on doing this kind of transformation 
back in 2011, at 
https://www.postgresql.org/message-id/4E2F163B.6060105%40gmail.com. I 
think we're in a better position to do that now than back then, thanks 
to all the other planner work that's been done since.

So how do we get there? I've identified a number of smaller steps that, 
when all put together, can handle this, as well as many other queries of 
course.

0. This simple case is already handled:

SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar)

It's turned into a semi-join. The next steps will extend this basic 
capability, to handle more complicated cases.

postgres=# explain SELECT * FROM foo WHERE foo.j IN (select bar.j from bar);                               QUERY PLAN
------------------------------------------------------------------------- Hash Join  (cost=42.75..93.85 rows=1130
width=8)  Hash Cond: (foo.j = bar.j)   ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)   ->  Hash
(cost=40.25..40.25rows=200 width=4)         ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)               Group
Key:bar.j               ->  Seq Scan on bar  (cost=0.00..32.60 rows=2260 width=4)
 
(7 rows)


1. Currently, the IN/EXISTS pullup is only done when the IN/EXISTS is a 
top-level qual. For example, this isn't currently pulled-up:

SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar) OR foo.j < 10;

That's not a straight semi-join, but we could still turn it into a new 
kind of LEFT-SEMI join. A left-semi join is like a left join, in that it 
returns all rows from the left side, and NULLs for any non-matches. And 
like a semi-join, it returns only one matching row from the right-side, 
if there are duplicates. In the qual, replace the SubLink with an IS NOT 
NULL test. So logically, the above would be converted into:

SELECT * FROM foo
/* SEMI- */ LEFT JOIN (select bar.j from bar) ON foo.j = bar.j
WHERE bar.j IS NOT NULL OR foo.j < 10

This transformation can be applied to IN/EXIST type SubPlan references 
anywhere in the query, not only those in the WHERE clause.


2. Using a similar trick, we can transform subqueries that are not of 
the IN/EXISTS kind. (This isn't useful on its own, because this example 
would be handled as an InitPlan, which performs well. But it's a 
stepping stone in explaining this.)

For example:

SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar)

This can be implemented using yet another new join type, a LEFT-UNIQUE 
join. It's like a LEFT JOIN, but it must check that there are no 
duplicates in the right-hand-side, and throw an error if there are 
(ERROR:  more than one row returned by a subquery used as an expression).

SELECT * FROM foo
/* unique-checking */ LEFT JOIN (select avg(bar.j) as min_j from bar) as 
subq ON TRUE
WHERE foo.j > subq.min_j


3. All the previous examples are uncorrelated subqueries, but all these 
transformations can also be performed on correlated subqueries. The 
difference is that the subquery that's pulled up to the range table must 
be marked as LATERAL. For example:

SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar where bar.i = foo.i)

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select avg(bar.j) from bar 
where bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j


4. The previous transformation isn't very interesting on its own, 
because the only way to implement a LATERAL join is a nested loop join, 
which is essentially the same as executing the subplan the naive way. 
But we can potentially de-correlate it, by pulling up the correlation qual:


SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select bar.j, bar.i from bar 
WHERE bar.i = foo.i) AS subq(j, i) ON true
WHERE foo.j > subq.j

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select bar.j, bar.i from bar) AS 
subq(j, i) ON subq.i = foo.i
WHERE foo.j > subq.j


This will get inlined into the parent query, getting rid of the subquery 
altogether. But even if that cannot be done, this is potentially much 
better. (I think we already do such de-correlation, at least in simple 
cases, actually).


5. If the subquery contains aggregation, we can still perform the 
previous transformations, by adding the correlation vars to the GROUP 
BY. Like this:

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select AVG(bar.j) from bar 
WHERE bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select AVG(bar.j), bar.i from bar GROUP 
BY bar.i) AS subq(j, i) ON foo.i = subq.i
WHERE foo.j > subq.j


6. De-correlating the subquery is not necessary a win. Consider the 
previous query, if 'foo' is very small, and 'bar' is a huge. The 
parameterized plan we generate for a correlated LATERAL subquery, with a 
nested loop join, is actually better in that case.

If we teach the planner to consider parameterized nested loop plans for 
non-lateral subqueries, by pushing down join quals back to the subquery. 
There's a comment on this in set_subquery_pathlist:
 * We don't currently support generating parameterized paths for subqueries * by pushing join clauses down into them;
itseems too expensive to 
 
re-plan * the subquery multiple times to consider different alternatives. * (XXX that could stand to be reconsidered,
nowthat we use Paths.)
 


In the old thread from 2011 that I mentioned above, Tom said:

> This leads me to think that we need to represent both cases as the same
> sort of query and make a cost-based decision as to which way to go.
> Thinking of it as a pull-up or push-down transformation is the wrong
> approach because those sorts of transformations are done too early to
> be able to use cost comparisons.

This last step, to push down join quals to a subquery, allows the 
planner to make a cost-based decision. A nested-loop, with the join qual 
pushed back down to the subquery, isn't exactly the same as the SubPlan, 
but it should be comparable in performance.


Thoughts? I'm planning to work on this in the next release cycle, 
although I'm not too familiar with the planner, and I don't know if I'll 
be distracted with something more shiny, so no promises on any results.

As a final note, I found an interesting paper called "Unnesting 
Arbitrary Queries", by Thomas Neumann and Alfons Kemper 
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). 
It describes a series of transformations that can be applied to 
de-correlate (de-lateralize?) any lateral subquery join. The trick is to 
build a temporary relation that contains all the distinct outer values 
of the join, and pass that relation down to the subquery. So instead of 
"executing" the subquery for every outer row, passing the outer values 
as parameters (using PostgreSQL terminology), you collect a list of all 
the outer values first, and then execute the subquery only once. In the 
subquery, you perform a join with the temporary relation. And then the 
paper describes a bunch of rules and optimizations, that can optimize
away the temporary relation in many cases. That might be one way to 
implement the de-lateralization steps in PostgreSQL.

- Heikki



Re: [HACKERS] Pulling up more complicated subqueries

From
Robert Haas
Date:
On Wed, May 17, 2017 at 11:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> That's not a straight semi-join, but we could still turn it into a new kind
> of LEFT-SEMI join. A left-semi join is like a left join, in that it returns
> all rows from the left side, and NULLs for any non-matches. And like a
> semi-join, it returns only one matching row from the right-side, if there
> are duplicates. In the qual, replace the SubLink with an IS NOT NULL test.
...
> This can be implemented using yet another new join type, a LEFT-UNIQUE join.
> It's like a LEFT JOIN, but it must check that there are no duplicates in the
> right-hand-side, and throw an error if there are (ERROR:  more than one row
> returned by a subquery used as an expression).

It seems like we might want to split what is currently called JoinType
into two separate things -- one that is INNER/LEFT/RIGHT/FULL and the
other that says what to do about multiple matches, which could be that
they are expected, they are to be ignored (as in your LEFT-SEMI case),
or they should error out (as in your LEFT-UNIQUE case).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pulling up more complicated subqueries

From
David Rowley
Date:
On 18 May 2017 at 04:30, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 17, 2017 at 11:08 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> That's not a straight semi-join, but we could still turn it into a new kind
>> of LEFT-SEMI join. A left-semi join is like a left join, in that it returns
>> all rows from the left side, and NULLs for any non-matches. And like a
>> semi-join, it returns only one matching row from the right-side, if there
>> are duplicates. In the qual, replace the SubLink with an IS NOT NULL test.
> ...
>> This can be implemented using yet another new join type, a LEFT-UNIQUE join.
>> It's like a LEFT JOIN, but it must check that there are no duplicates in the
>> right-hand-side, and throw an error if there are (ERROR:  more than one row
>> returned by a subquery used as an expression).
>
> It seems like we might want to split what is currently called JoinType
> into two separate things -- one that is INNER/LEFT/RIGHT/FULL and the
> other that says what to do about multiple matches, which could be that
> they are expected, they are to be ignored (as in your LEFT-SEMI case),
> or they should error out (as in your LEFT-UNIQUE case).

I just wanted to mention that I almost got sucked down that hole with
unique joins. Instead, I'd recommend following the pattern of passing
bool flags down to the executor from the planner just like what is
done for inner_unique in, say make_hashjoin()

I did change unique joins at one point to overload the JoinType, but
it was a much more scary thing to do as there's lots of code in the
planner that does special things based on the join type, and the
footprint of your patch and risk factor start to grow pretty rapidly
once you do that.

I mention something around this in [1].

[1] https://www.postgresql.org/message-id/CAKJS1f_jRki1PQ4X-9UGKa-wnBhECQLnrxCX5haQzu4SDR_r2Q%40mail.gmail.com

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Pulling up more complicated subqueries

From
Ashutosh Bapat
Date:
On Wed, May 17, 2017 at 8:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

>
> As a final note, I found an interesting paper called "Unnesting Arbitrary
> Queries", by Thomas Neumann and Alfons Kemper
> (http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf).
> It describes a series of transformations that can be applied to de-correlate
> (de-lateralize?) any lateral subquery join. The trick is to build a
> temporary relation that contains all the distinct outer values of the join,
> and pass that relation down to the subquery. So instead of "executing" the
> subquery for every outer row, passing the outer values as parameters (using
> PostgreSQL terminology), you collect a list of all the outer values first,
> and then execute the subquery only once. In the subquery, you perform a join
> with the temporary relation. And then the paper describes a bunch of rules
> and optimizations, that can optimize away the temporary relation in many
> cases. That might be one way to implement the de-lateralization steps in
> PostgreSQL.
>

IIUC what you describe here, in a query involving foreign scan, we
could pass down a list of parameters to the foreign server instead of
running foreign scan on every parameter in the list. That would
improve performance of such queries by a lot.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company