Thread: Improving non-joinable EXISTS subqueries

Improving non-joinable EXISTS subqueries

From: Tom Lane
The examples that Kevin Grittner put up awhile back included several
uses of EXISTS() in places where it couldn't be turned into a semijoin,
eg in the query's targetlist.  I was musing a bit about whether we could
improve those scenarios.  I would like to get 8.4 to the point where we
could say as a blanket performance recommendation "prefer EXISTS over
IN".  The semantic gotchas associated with NOT IN make it hard to
optimize well, not to mention being a perennial bane of novices; so if
we could just point people in the other direction without qualification
I think we'd be better off.  But how much work would it be to get there?
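
The NOT IN gotcha, for anyone who hasn't hit it, is that a single NULL in
the subquery's output makes the whole NOT IN test evaluate to NULL rather
than true, so the outer query silently returns nothing.  A minimal
illustration, using a couple of made-up tables (the same shapes as the
tab1/tab2 example further down):

    -- returns no rows at all once any tab1.a IS NULL: for a z with no
    -- match, the test "z NOT IN (...)" yields NULL, not true
    SELECT * FROM tab2 WHERE z NOT IN (SELECT a FROM tab1);

    -- the NOT EXISTS spelling has no such surprise; NULL values of
    -- tab1.a simply never match
    SELECT * FROM tab2 WHERE NOT EXISTS
        (SELECT 1 FROM tab1 WHERE tab1.a = tab2.z);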

The single place where IN wins over EXISTS as of CVS HEAD is that a
non-join-optimizable IN clause can still be turned into a "hashed
subplan", which greatly reduces the cost of making IN tests for a large
number of upper-query rows.  It looks to me that with the current
planning infrastructure it wouldn't be very difficult to turn EXISTS
(with hashable comparisons to upper variables in its WHERE) into a
similar kind of plan.  The problem is that *that isn't necessarily a win*.
Consider something like
    SELECT x, y, EXISTS(SELECT * FROM tab1 WHERE tab1.a = tab2.z)
      FROM tab2 WHERE ...;

Given that there's an index on tab1.a, the current planning for this
will produce what's essentially a nestloop-with-inner-indexscan plan:
for each tab2 row selected by the outer query, we'll do an indexscan
probe into tab1 to see if there's a match.  This is an ideal plan as
long as the outer query doesn't select very many tab2 rows.

We could transform this into the equivalent of a hashed implementation
of
    SELECT x, y, z IN (SELECT a FROM tab1)
      FROM tab2 WHERE ...;

which would result in loading all of tab1 into a hashtable and then
probing the hashtable for each tab2 row.  Now, that wins big if there
are many selected tab2 rows (and tab1 isn't too big to fit in an
in-memory hashtable).  But it loses big if the outer query only needs
to probe for a few values --- we won't repay the cost of building
the hashtable.
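
To spell the tradeoff out as the user sees it today (same made-up tables,
and assuming an index on tab1.a):

    -- correlated EXISTS: planned as an indexscan probe into tab1 for
    -- each outer row; cheap when few tab2 rows survive the WHERE
    SELECT x, y, EXISTS(SELECT * FROM tab1 WHERE tab1.a = tab2.z)
      FROM tab2 WHERE ...;

    -- IN form: eligible for a hashed subplan, i.e. scan tab1 once,
    -- build a hash table, probe it per outer row; cheap when many
    -- tab2 rows survive, as long as tab1 fits in an in-memory hash table
    SELECT x, y, z IN (SELECT a FROM tab1)
      FROM tab2 WHERE ...;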

So I think it'd be a really bad idea to make this change blindly.
For everyone whose query got speeded up, someone else's would be slowed
down --- in fact, for apps that are tuned to PG's existing behavior,
you could expect that it'd mostly be the latter case.

The problem then is to make the choice of plan with some intelligence.
The bit of information that we lack in order to do that is an idea of
how many times the outer query will call the EXISTS subquery.  Right
now, all subqueries that can't be pulled up as joins are planned fully
during SS_process_sublinks(), which is called long before we can have
any idea about that.  I looked into whether it's feasible to postpone
planning subqueries till later on in planning.  I think it's probably
structurally possible without an enormous amount of work, but it's not
exactly trivial either.

Even given that we postpone planning/costing subqueries until we really
need to know the cost, we're not out of the woods.  For an EXISTS
appearing in a join condition, it's entirely possible that different
join sequences will result in executing the EXISTS wildly different
numbers of times.  Re-planning the EXISTS subquery each time we consider
a different upper-query join sequence seems entirely impractical on
speed grounds.

So it seems like what we'd need to do is

* During planner startup, generate Paths (we'd need no more detail than
that) for both the "retail" and hashed version of each EXISTS
subquery.  From these, estimate the startup cost of the hashed version
(ie, time to execute the un-qualified subquery once and load the hash
table) and the per-upper-row costs of each approach.  Stash these costs
somewhere handy.

* While forming upper-query paths, estimate the costs of each approach
on-the-fly for every path, based on the estimated number of rows in the
input paths.  Use the cheaper case while figuring the cost of that
upper path.  (A rough worked example of this comparison follows the list.)

* While building the final Plan, instantiate whichever subquery Plan
is a winner in the context of the chosen upper path.
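
To make the second step concrete, the on-the-fly comparison is just a
crossover test.  With S the startup cost of the hashed version, h its
per-upper-row cost, r the per-upper-row cost of the "retail" version, and
N the estimated number of rows from the input path, the hashed plan wins
whenever

    S + N*h < N*r,    i.e.    N > S / (r - h)

So, with invented numbers, a hash table that costs 1000 units to build and
0.01 units per probe beats a 1.0-unit indexscan probe only once the upper
path is expected to call the subquery more than about 1000 / 0.99, or
roughly 1010, times.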

I don't think any of this is out of reach, but it'd be a nontrivial
bit of implementation effort (maybe a week or three) and it also looks
like there might be a measurable planning slowdown for any query
involving subqueries.  I'm not sure yet how much of this is just moving
existing work around and how much will actually represent new planning
expense.  But certainly, costing EXISTS subqueries two different ways
isn't going to be free.

So ... I'm wondering if this actually touches anyone's hot-button,
or if we should just file it in the overflowing pile of Things That
Might Be Nice To Do Someday.

Comments?
        regards, tom lane


Re: Improving non-joinable EXISTS subqueries

From: Alvaro Herrera
Tom Lane wrote:

> I don't think any of this is out of reach, but it'd be a nontrivial
> bit of implementation effort (maybe a week or three) and it also looks
> like there might be a measurable planning slowdown for any query
> involving subqueries.  I'm not sure yet how much of this is just moving
> existing work around and how much will actually represent new planning
> expense.  But certainly, costing EXISTS subqueries two different ways
> isn't going to be free.

The typical comment around here is that it's usually a huge win when a
bit of CPU time is spent in buying I/O savings.  Since most servers are
I/O bound anyway, and since most servers nowadays are typically
oversized in CPU terms, this strikes me as a good tradeoff to be
making.

In any case, most of the time EXISTS queries are expensive queries, so
spending more time planning them is probably good.  It'd be a shame to
spend a lot of time planning queries that are trivial in execution
costs, but I wouldn't expect this to be the case here.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Improving non-joinable EXISTS subqueries

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> So ... I'm wondering if this actually touches anyone's hot-button,
> or if we should just file it in the overflowing pile of Things That
> Might Be Nice To Do Someday.

What bugs me the most about having IN() be faster than EXISTS() in
certain situations is that it ends up being counter-intuitive and not
really what you'd expect to happen.  That being said, we can always tell
people that they can use IN() as a work-around for these situations.  In
the long run, I think it's definitely worth it to spend a bit of extra
time in planning the query for this case.  Not knowing what else is on
your plate for 8.4, I don't know where I'd rank this, but it wouldn't be
at the top.
Thanks,
    Stephen

Re: Improving non-joinable EXISTS subqueries

From: "Kevin Grittner"
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The examples that Kevin Grittner put up awhile back included several
> uses of EXISTS() in places where it couldn't be turned into a semijoin,
> eg in the query's targetlist.  I was musing a bit about whether we could
> improve those scenarios.  I would like to get 8.4 to the point where we
> could say as a blanket performance recommendation "prefer EXISTS over
> IN".  The semantic gotchas associated with NOT IN make it hard to
> optimize well, not to mention being a perennial bane of novices; so if
> we could just point people in the other direction without qualification
> I think we'd be better off.

Agreed.
> So ... I'm wondering if this actually touches anyone's hot-button,
> or if we should just file it in the overflowing pile of Things That
> Might Be Nice To Do Someday.
> 
> Comments?
I'm in the position of trying to influence programmers here to write
queries using set logic.  Way too many of the queries here are coded
with a cursor for a "primary" select, with a bunch of lower level
cursors to navigate around and get the related rows one at a time. 
Results are often stuck into a work table as this progresses, with the
work table massaged a bit here and there in this procedural process,
and the final results selected out.  It should surprise nobody here
that this is not fast to write, easy to maintain, efficient to run, or
generally free from subtle errors.  I point out that they should write
queries which state what they want, regardless of how complex those
rules are, instead of writing how to get it.  The optimizer, I argue,
has tricks available which they don't.
Usually, a rewrite into set logic has a fraction of the number of
lines, runs much faster, and loses a bug or two that was hidden within
the procedural spaghetti.  On the other hand, sometimes they write a
perfectly good "set logic" query (from the point of view of stating
what they want), and the optimizer falls down, and I have to come in
and say "Oh, it has trouble with EXISTS; you can use IN here." When I
tell them to use IN instead of EXISTS, then I need to have all these
caveats about the sizes of tables and the possibilities of NULL on the
NOT EXISTS.  At this point I tend to lose a big part of my audience.
So I'd be very happy to see this work done, not because I can't find a
workaround, but because trying to teach all the programmers tricky
hand-optimizations is a losing battle, and if I lose that battle the
queries degenerate into spaghetti-land.
As with others, I can't say where this fits on a priority list, but I
would hate to see it drift off onto a "someday" list.
-Kevin


Re: Improving non-joinable EXISTS subqueries

From: Tom Lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> [ complicated scheme for improving planning of EXISTS ]
> So I'd be very happy to see this work done, not because I can't find a
> workaround, but because trying to teach all the programmers tricky
> hand-optimizations is a losing battle, and if I lose that battle the
> queries degenerate into spaghetti-land.

I spent some time looking at this, and soon grew rather discouraged:
even the very first step of what I'd had in mind, which was to delay
replacement of uplevel Vars with Params until late in the planning
process, looks like it will destabilize large amounts of code that
aren't particularly related to the problem at hand.  (Most of the
planner blithely assumes that it will never see an uplevel Var, and
tends to just treat any Var as being of the current query level.)

So I backed off and thought some more, and eventually came to this
conclusion: when we have an EXISTS that could be done both ways,
why not just generate plans for both ways, and leave the decision
which to use until later?  Like maybe even execution time?

We have speculated in the past about having alternative plans that
could be conditionally executed based on information not available
at planning time.  This could be seen as a first experiment in that
direction.  I am not thinking of a general-purpose AlternativePlan
kind of execution node, because SubPlans aren't actually part of the
main plan-node tree, but an AlternativeSubPlans expression node
type might work.

The two issues that would obviously have to be faced to make this
work are:

1. While the planner is estimating evaluation costs of the qual
conditions for the upper query, which EXISTS implementation do we assume
will be used?  It might be that we could still use my original idea of
providing cost_qual_eval() with some context about the likely number of
calls, but what I'm thinking at the moment is that it's not worth the
trouble, because it isn't going to matter that much.  Either possibility
is expensive enough compared to ordinary qual conditions that the
planner will be driven in the direction of plans that minimize the
number of EXISTS evaluations, and that's all that we really care about.
So I'd be inclined to just use the numbers for the base (non hashed)
implementation and be done with it.

2. How will the executor make the decision which to use?  Well, it's
got access to the overall rowcount estimates that the planner made.
What I'm thinking of doing is having the AlternativeSubPlans node
look at the rowcount estimate of its immediate parent Plan node.
This is actually exactly the right number for a subplan in the
targetlist of the Plan node.  For a subplan in the qual list, it's
an underestimate, but probably not an enormous underestimate.
(Assuming that the subplan is at the end of the qual list, which is
where it'd normally be, the expected number of calls of the subplan
would be the output rowcount estimate divided by the estimated
selectivity of the subplan qual --- but at present the latter is always
0.5 ...)
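
(Worked out with invented numbers: if the parent Plan node is expected to
output 1000 rows and the subplan qual gets the default 0.5 selectivity,
the expected number of subplan calls would be estimated as 1000 / 0.5 =
2000.)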

Another technique that we could play with is to have the
AlternativeSubPlans node track the actual number of calls it gets,
and switch from the "retail" implementation to the "hashed"
implementation if that exceeds a threshold.  This'd provide some
robustness in the face of bad estimates, although of course it's
not optimal compared to having made the right choice to start with.

Thoughts?
        regards, tom lane


Re: Improving non-joinable EXISTS subqueries

From: "Kevin Grittner"
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> when we have an EXISTS that could be done both ways,
> why not just generate plans for both ways, and leave the decision
> which to use until later?
That seems good to me.  The costs for the slower plan generally come
out much higher.  When the run times are close, the one that edges out
the other doesn't always win, but that's to be expected.  EXISTS is
hardly unique in that respect.  Competing on costs seems better than
some more mechanical approach.
> Like maybe even execution time?
> 
> We have speculated in the past about having alternative plans that
> could be conditionally executed based on information not available
> at planning time.  This could be seen as a first experiment in that
> direction.  I am not thinking of a general-purpose AlternativePlan
> kind of execution node, because SubPlans aren't actually part of the
> main plan-node tree, but an AlternativeSubPlans expression node
> type might work.
> 
> The two issues that would obviously have to be faced to make this
> work are:
> 
> 1. While the planner is estimating evaluation costs of the qual
> conditions for the upper query, which EXISTS implementation do we assume
> will be used?  It might be that we could still use my original idea of
> providing cost_qual_eval() with some context about the likely number of
> calls, but what I'm thinking at the moment is that it's not worth the
> trouble, because it isn't going to matter that much.  Either possibility
> is expensive enough compared to ordinary qual conditions that the
> planner will be driven in the direction of plans that minimize the
> number of EXISTS evaluations, and that's all that we really care about.
> So I'd be inclined to just use the numbers for the base (non hashed)
> implementation and be done with it.
Seems reasonable from this point of view: it seems like you'd never
choose a plan worse than the current releases do, although you might
sometimes miss a plan that would be even faster than the suggested
improvement finds.  I think it makes sense to defer this until such
time (if ever) as it is shown to be worth the effort.
> 2. How will the executor make the decision which to use?  Well, it's
> got access to the overall rowcount estimates that the planner made.
> What I'm thinking of doing is having the AlternativeSubPlans node
> look at the rowcount estimate of its immediate parent Plan node.
> This is actually exactly the right number for a subplan in the
> targetlist of the Plan node.  For a subplan in the qual list, it's
> an underestimate, but probably not an enormous underestimate.
> (Assuming that the subplan is at the end of the qual list, which is
> where it'd normally be, the expected number of calls of the subplan
> would be the output rowcount estimate divided by the estimated
> selectivity of the subplan qual --- but at present the latter is always
> 0.5 ...)
If you meant multiplied by 0.5, I think I followed that.  Made sense.
> Another technique that we could play with is to have the
> AlternativeSubPlans node track the actual number of calls it gets,
> and switch from the "retail" implementation to the "hashed"
> implementation if that exceeds a threshold.  This'd provide some
> robustness in the face of bad estimates, although of course it's
> not optimal compared to having made the right choice to start with.
That sounds interesting, but unless it has value as a prototype for
other runtime adaptivity, it sounds like a lot of work for the
benefit.  I'm not that unhappy with the estimates I'm getting in a
properly tuned database.  And the execution-time work to process some
number of rows this way seems likely to far exceed the work to refine
the estimates and costing used to choose a plan.
-Kevin


Re: Improving non-joinable EXISTS subqueries

From: "Robert Haas"
> Another technique that we could play with is to have the
> AlternativeSubPlans node track the actual number of calls it gets,
> and switch from the "retail" implementation to the "hashed"
> implementation if that exceeds a threshold.  This'd provide some
> robustness in the face of bad estimates, although of course it's
> not optimal compared to having made the right choice to start with.

Ideally you'd want to set that threshold dynamically.  If you expect x
calls and midway through execution notice that you're already up to 2x
calls, the right thing to do depends a lot on whether you're 1% done
or 99% done.

Logic of this type also opens a bit of a can of worms, in that there
are probably many other situations in which it's possible to notice
that your estimates are off and shift gears in mid-query, but how much
are you willing to slow down the queries where there isn't a problem
to speed up the ones where there is?

...Robert


Re: Improving non-joinable EXISTS subqueries

From: Decibel!
On Aug 20, 2008, at 12:43 PM, Tom Lane wrote:
> We have speculated in the past about having alternative plans that
> could be conditionally executed based on information not available
> at planning time.  This could be seen as a first experiment in that
> direction.  I am not thinking of a general-purpose AlternativePlan
> kind of execution node, because SubPlans aren't actually part of the
> main plan-node tree, but an AlternativeSubPlans expression node
> type might work.

Something I think we could also use is the ability to grab certain
information before planning takes place.  The big case that comes to
mind is:

    SELECT ... FROM big_table b
      JOIN small_lookup_table s USING (small_lookup_id)
     WHERE s.some_name = 'alpha';

... or where we're doing s.some_name IN ('a','b','c').  In many cases,
translating the some_name lookup into actual _id values that you can
then look at in pg_stats for big_table results in a huge improvement
in rowcount estimates.  If this is then joining to 5 other tables,
that rowcount information can have a huge impact on the query plan.
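
The manual workaround for that today is to run the lookup yourself and
inline the result, something like this (same made-up table and column
names as above; the literal 42 is just a placeholder for whatever the
first query returns):

    -- step 1: resolve the name to its id ahead of time
    SELECT small_lookup_id FROM small_lookup_table WHERE some_name = 'alpha';

    -- step 2: feed the literal id in, so the planner can use the
    -- pg_stats entries for big_table.small_lookup_id directly when
    -- estimating rowcounts for the joins to the other tables
    SELECT ... FROM big_table b WHERE b.small_lookup_id = 42;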

> Another technique that we could play with is to have the
> AlternativeSubPlans node track the actual number of calls it gets,
> and switch from the "retail" implementation to the "hashed"
> implementation if that exceeds a threshold.  This'd provide some
> robustness in the face of bad estimates, although of course it's
> not optimal compared to having made the right choice to start with.


In many systems, having the most optimal plan isn't that important;  
not having a really bad plan is. I expect that giving the executor  
the ability to decide the planner made a mistake and shift gears  
would go a long way to reducing the impact of bad plans. I wonder if  
any other databases have that ability... maybe this will be a first. :)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828