Re: Limiting the number of parameterized indexpaths created - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Limiting the number of parameterized indexpaths created
Date
Msg-id 15514.1351712659@sss.pgh.pa.us
Whole thread Raw
In response to Re: Limiting the number of parameterized indexpaths created  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Limiting the number of parameterized indexpaths created  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Simon Riggs <simon@2ndQuadrant.com> writes:
> We should be able to do a better job of recognising some other
> aspect/symmetry of this situation and then simplifying the problem
> from that. Calculating all paths treats the problem as a complete
> unknown and we must be able to do better than that. If we look at the
> problem from the perspective of how we would handle it if one of the
> tables was very large, how would that change things? Can we use a
> greedy algorithm starting with largest table?

Doesn't seem particularly relevant.  The issue is which parameterized
paths to generate for the current index of the current table.  It could
possibly make sense to try harder if the current table is large than
if it's small --- but no matter how large it is, we can't afford to
enumerate all 2^N possibilities if we have join clauses linking to N
other relations and N is more than a handful.

The reason this is a problem is that you might have indexes, or even
individual join clauses, that require multiple other rels to be used
effectively.  Suppose we have an index t1ab on t1(a, b) and a query
WHERE t1.a = t2.x AND t1.b = t3.y

If t1 is large while t2 and t3 are both small, the best plan is to join
t2 and t3 first (even if that's a cartesian join) and then use both t2.x
and t3.y as parameters for an indexscan on t1ab.  The same observation
can apply even for single index columns and single join clauses:
WHERE t1.a = (t2.x + t3.y)

So the context where this is an issue is that we're considering the
specific index t1ab and wondering what parameterized indexscan paths
to create using it.  To find a good plan in these examples, we have to
create a path that's parameterized by both t2 and t3.  It's not so hard
to consider that possibility here, but what if there are 30 different
join clauses (linking to 30 other rels) that could be used with the same
index?  We can't afford to try all the combinations exhaustively, and
even if we could, it's just about certain that most of them aren't worth
our time.

Before 9.2, we didn't have this problem in this guise because we didn't
generate parameterized paths bottom-up; instead, given that we had
already decided to join t2 and t3 for some reason, we would look to see
what indexpaths we could make for t1 using values from t2 and/or t3.
However, that approach had big problems of its own: it couldn't find
plans that involved pushing parameters down through an intervening join.

Another point here is that no such path can get chosen unless the
planner decides to consider joining t2 and t3 to begin with --- and if
that's a cartesian join, it will reject the idea out of hand.  In the
previous coding there was basically no realistic fix for that; if we let
the thing try useless cartesian joins, planning time on large queries
will go to hell in a handbasket.  With the new code, we could in
principle make note of interesting parameterized paths that require
multiple outer relations, and then allow those specific combinations of
relations to get cartesian-joined.  I've not had time to pursue this yet
but I think it should happen sometime.

So I think bottom-up creation of parameterized paths is better than what
we were doing pre-9.2, but we do have to stop it from going crazy when
there are too many join clauses for the same index.

Now the $64 question is how hard do we have to try to find the "best"
alternative when there are too many.  I think queries in which there are
a lot of join clauses touching the same index but they are not related
by equivalence classes are probably pretty pathological.  It's
questionable whether we should spend a lot of skull sweat on such cases,
or for that matter that we'd even recognize an especially good solution
if we saw it.

One idea that occurred to me while typing this up is to restrict the
maximum number of other-rels that we'll consider together, ie, when we
have 30 other rels, don't try to form all 2^30 subsets but only those
containing at most K rels, for some pretty small K (say, the number of
index columns).  But I'm not really sure it's worth the effort.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: foreign key locks
Next
From: Simon Riggs
Date:
Subject: Re: Limiting the number of parameterized indexpaths created