Thread: Parameterized paths vs index clauses extracted from OR clauses

Parameterized paths vs index clauses extracted from OR clauses

From

Tom Lane

Date:

28 February 2013, 20:03:48

I looked into the behavior complained of in
http://www.postgresql.org/message-id/CAMkU=1xLiwDkfemkDWjznr_JmzUYbZzRZ4F22kXa3Vg6Pz9hCQ@mail.gmail.com
I'm still not sure whether anything else is going on in the original
problem, but I now understand Jeff's simplified query.  The planner
does actually generate the desired plan, but it doesn't pick it because
it misestimates the row count and hence the cost.  The issue is that
it generates an indexable OR clause by pulling a couple of sub-clauses
out of the messy OR condition in the original query:
      template2."id" = product2."id"      or      case when product1."id" is not null then 1           when
template2."id"is not null then 0      end <> 1      and      product2."id" = 2916353
 

If we focus our attention on just
      template2."id" = product2."id"      or      product2."id" = 2916353

we have something that can be used in a parameterized (by template2)
bitmap scan of product2.  We do get as far as finding that out and
building a bitmap scan path, but the path is marked as yielding 2918
rows (the whole product table), not the 2 rows it actually will
produce.  That's because the parameterized path code is designed to
assume that all identically-parameterized scans will produce the same
number of rows, and it's already computed that row estimate without
the benefit of the extracted OR clause.  So this results in a
factor-of-1400 overestimate of the cost of the nestloop, resulting
in a wrong plan choice.

We didn't have this problem in 9.1 or earlier because the planner did
not assume that inner indexscan paths all produced the same number of
rows.  As of 9.2 there's an expectation that if two paths have the same
parameterization then they will enforce the same set of pushed-down join
clauses and hence yield the same number of rows.

We could drop that assumption, but doing so would destroy the basis
of add_path_precheck(), because a more expensive path could be worth
keeping if it yields fewer rows.  So we'd either have to give up
two-phase estimation of join path costs entirely, or do rowcount
estimation in the first phase which'd likely destroy most of its
advantage anyway.  I would not be totally sad to get rid of the
two-phase estimates because they greatly complicate the API and logic
in costsize.c, but they did seem to produce a useful speedup in planning
of complex joins.

If we don't do that, we need some less klugy way of dealing with
indexclauses extracted from OR clauses.  The idea I have about this is
to go ahead and put such clauses into the regular join clause lists, but
mark them as being derived from their original clauses (say by adding a
field to RestrictInfo that links back to the original RestrictInfo).
The effect of the marking would be that the derived clause would be
ignored for cost and selectivity estimates anytime it is present in the
same list as its parent clause, so that we don't end up double-counting
it.  I think this would let us get rid of the very klugy selectivity
hacking that's currently done in orindxpath.c to solve a similar problem
in the non-join case.  A downside of this approach is that to preserve
the same-number-of-rows assumption, we'd end up having to enforce the
extracted clauses as filter clauses in parameterized paths, even if
they'd not proved to be of any use as index quals.

Whichever way we go, the resulting patch is likely to be too large and
invasive for me to feel terribly comfortable about back-patching it into
9.2.  AFAICT this issue only arises for indexquals extracted out of
larger OR conditions, so maybe it's not worth taking such a risk for.

Thoughts?
        regards, tom lane

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Robert Haas

Date:

04 March 2013, 16:33:20

On Thu, Feb 28, 2013 at 3:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Whichever way we go, the resulting patch is likely to be too large and
> invasive for me to feel terribly comfortable about back-patching it into
> 9.2.  AFAICT this issue only arises for indexquals extracted out of
> larger OR conditions, so maybe it's not worth taking such a risk for.

EnterpriseDB has received a number of complaints from our customers
resulting from planner behavior changes which were back-patched; so I
am not sanguine about back-patching unless the situation is pretty
darn dire and the fix is pretty darn near certain to be an improvement
in every case.  As we have discussed many times here and on
pgsql-performance, DBAs seem to prefer a query plan that's the same
every time, even if it's bad.  Updating to get a security fix and
having plans change under you proves to be an unpleasant surprise.

> A downside of this approach is that to preserve
> the same-number-of-rows assumption, we'd end up having to enforce the
> extracted clauses as filter clauses in parameterized paths, even if
> they'd not proved to be of any use as index quals.

I'm not sure I fully grasp why this is a downside.  Explain further?

Since there's little point in using a paramaterized path in the first
place unless it enables you to drastically reduce the number of rows
being processed, I would anticipate that maybe the consequences aren't
too bad, but I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Craig Ringer

Date:

05 March 2013, 01:10:58

<div class="moz-cite-prefix">On 03/05/2013 12:33 AM, Robert Haas wrote:<br /></div><blockquote
cite="mid:CA+TgmoZoRySYSvrSYLbiR3eCYgWAaZzPNHr+2vAq7JWFqF2f+Q@mail.gmail.com"type="cite"><pre wrap="">On Thu, Feb 28,
2013at 3:03 PM, Tom Lane <a class="moz-txt-link-rfc2396E" href="mailto:tgl@sss.pgh.pa.us"><tgl@sss.pgh.pa.us></a>
wrote:
</pre><blockquote type="cite"><pre wrap="">Whichever way we go, the resulting patch is likely to be too large and
invasive for me to feel terribly comfortable about back-patching it into
9.2.  AFAICT this issue only arises for indexquals extracted out of
larger OR conditions, so maybe it's not worth taking such a risk for.
</pre></blockquote><pre wrap="">
EnterpriseDB has received a number of complaints from our customers
resulting from planner behavior changes which were back-patched; so I
am not sanguine about back-patching unless the situation is pretty
darn dire and the fix is pretty darn near certain to be an improvement
in every case.  As we have discussed many times here and on
pgsql-performance, DBAs seem to prefer a query plan that's the same
every time, even if it's bad.  Updating to get a security fix and
having plans change under you proves to be an unpleasant surprise.</pre></blockquote> It also weakens our argument (per
versioningpolicy) that patch releases are *totally safe* and are something you can apply without extensive testing and
QAper <a href="http://www.postgresql.org/support/versioning/">http://www.postgresql.org/support/versioning/</a> .<br
/><br/> I have enough trouble getting people to update as it is.<br /><br /> I'd be interested in being able to turn on
fixes/enhancements,but having behaviour changes that aren't clear wins in all cases turned on by default seems likely
toweaken trust from users who're already way too leery of updating.<br /><br /><pre class="moz-signature" cols="72">--
CraigRinger                   <a class="moz-txt-link-freetext"
href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQLDevelopment, 24x7 Support, Training &
Services</pre>

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Tom Lane

Date:

05 March 2013, 20:44:14

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 28, 2013 at 3:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Whichever way we go, the resulting patch is likely to be too large and
>> invasive for me to feel terribly comfortable about back-patching it into
>> 9.2.  AFAICT this issue only arises for indexquals extracted out of
>> larger OR conditions, so maybe it's not worth taking such a risk for.

> EnterpriseDB has received a number of complaints from our customers
> resulting from planner behavior changes which were back-patched; so I
> am not sanguine about back-patching unless the situation is pretty
> darn dire and the fix is pretty darn near certain to be an improvement
> in every case.

Well, the point is not so much about whether it's an improvement as that
9.2's current behavior is a regression from 9.1 and earlier.  People may
not like changes in minor releases, but they don't like regressions
either.

>> A downside of this approach is that to preserve
>> the same-number-of-rows assumption, we'd end up having to enforce the
>> extracted clauses as filter clauses in parameterized paths, even if
>> they'd not proved to be of any use as index quals.

> I'm not sure I fully grasp why this is a downside.  Explain further?

Because we'd be checking redundant clauses.  You'd get something like
Nested Loop    Filter: (foo OR (bar AND baz))
    ... some outer scan here ...
    Index Scan:        Filter: (foo OR bar)

If "foo OR bar" is useful as an indexqual condition in the inner scan,
that's one thing.  But if it isn't, the cycles expended to check it in
the inner scan are possibly wasted, because we'll still have to check
the full original OR clause later.  It's possible that the filter
condition removes enough rows from the inner scan's result to justify
the redundant checks, but it's at least as possible that it doesn't.

> Since there's little point in using a paramaterized path in the first
> place unless it enables you to drastically reduce the number of rows
> being processed, I would anticipate that maybe the consequences aren't
> too bad, but I'm not sure.

Yeah, we could hope that the inner scan is already producing few enough
rows that it doesn't matter much.  But I think that we'd end up checking
the added qual even in a non-parameterized scan; there's no mechanism
for pushing quals into the general qual lists and then retracting them
later.  (Hm, maybe what we need is a marker for "enforce this clause
only if you feel like it"?)
        regards, tom lane

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Robert Haas

Date:

06 March 2013, 04:20:23

On Tue, Mar 5, 2013 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the point is not so much about whether it's an improvement as that
> 9.2's current behavior is a regression from 9.1 and earlier.  People may
> not like changes in minor releases, but they don't like regressions
> either.

That's true, but I'm still worried that we're just moving the
unhappiness around from one group of people to another group of
people, and I don't have a lot of confidence about which group is
larger.

>>> A downside of this approach is that to preserve
>>> the same-number-of-rows assumption, we'd end up having to enforce the
>>> extracted clauses as filter clauses in parameterized paths, even if
>>> they'd not proved to be of any use as index quals.
>
>> I'm not sure I fully grasp why this is a downside.  Explain further?
>
> Because we'd be checking redundant clauses.  You'd get something like
>
>         Nested Loop
>                 Filter: (foo OR (bar AND baz))
>
>                 ... some outer scan here ...
>
>                 Index Scan:
>                         Filter: (foo OR bar)
>
> If "foo OR bar" is useful as an indexqual condition in the inner scan,
> that's one thing.  But if it isn't, the cycles expended to check it in
> the inner scan are possibly wasted, because we'll still have to check
> the full original OR clause later.  It's possible that the filter
> condition removes enough rows from the inner scan's result to justify
> the redundant checks, but it's at least as possible that it doesn't.

Yeah, that's pretty unappealing.  It probably doesn't matter much if
foo is just a column reference, but what if it's an expensive
function?  For that matter, what if it's a volatile function that we
can't execute twice without changing the results?

>> Since there's little point in using a paramaterized path in the first
>> place unless it enables you to drastically reduce the number of rows
>> being processed, I would anticipate that maybe the consequences aren't
>> too bad, but I'm not sure.
>
> Yeah, we could hope that the inner scan is already producing few enough
> rows that it doesn't matter much.  But I think that we'd end up checking
> the added qual even in a non-parameterized scan; there's no mechanism
> for pushing quals into the general qual lists and then retracting them
> later.  (Hm, maybe what we need is a marker for "enforce this clause
> only if you feel like it"?)

Not sure I get the parenthesized bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Tom Lane

Date:

06 March 2013, 04:53:24

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 5, 2013 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If "foo OR bar" is useful as an indexqual condition in the inner scan,
>> that's one thing.  But if it isn't, the cycles expended to check it in
>> the inner scan are possibly wasted, because we'll still have to check
>> the full original OR clause later.  It's possible that the filter
>> condition removes enough rows from the inner scan's result to justify
>> the redundant checks, but it's at least as possible that it doesn't.

> Yeah, that's pretty unappealing.  It probably doesn't matter much if
> foo is just a column reference, but what if it's an expensive
> function?  For that matter, what if it's a volatile function that we
> can't execute twice without changing the results?

Well, *that* worry at least is a nonissue: if the clause contains
volatile functions then it's not a candidate to be an indexqual anyway.
The code that pulls out these simplified OR clauses is only looking for
clauses that can be shown to match existing indexes, so it won't pick
anything like that.  But expensive functions could be a hazard.

>> (Hm, maybe what we need is a marker for "enforce this clause
>> only if you feel like it"?)

> Not sure I get the parenthesized bit.

I was thinking that we'd extract the simplified clause and then mark it
as something to be used only if it can be used as an indexqual.
However, on second thought that still leaves us with the problem that
some parameterized paths yield more rows than others.  Maybe that
assumption simply has to go ...
        regards, tom lane

Re: Parameterized paths vs index clauses extracted from OR clauses

From

Robert Haas

Date:

08 March 2013, 14:20:52

On Tue, Mar 5, 2013 at 11:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Mar 5, 2013 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If "foo OR bar" is useful as an indexqual condition in the inner scan,
>>> that's one thing.  But if it isn't, the cycles expended to check it in
>>> the inner scan are possibly wasted, because we'll still have to check
>>> the full original OR clause later.  It's possible that the filter
>>> condition removes enough rows from the inner scan's result to justify
>>> the redundant checks, but it's at least as possible that it doesn't.
>
>> Yeah, that's pretty unappealing.  It probably doesn't matter much if
>> foo is just a column reference, but what if it's an expensive
>> function?  For that matter, what if it's a volatile function that we
>> can't execute twice without changing the results?
>
> Well, *that* worry at least is a nonissue: if the clause contains
> volatile functions then it's not a candidate to be an indexqual anyway.
> The code that pulls out these simplified OR clauses is only looking for
> clauses that can be shown to match existing indexes, so it won't pick
> anything like that.  But expensive functions could be a hazard.
>
>>> (Hm, maybe what we need is a marker for "enforce this clause
>>> only if you feel like it"?)
>
>> Not sure I get the parenthesized bit.
>
> I was thinking that we'd extract the simplified clause and then mark it
> as something to be used only if it can be used as an indexqual.
> However, on second thought that still leaves us with the problem that
> some parameterized paths yield more rows than others.

OK, that's what I was thinking, and why I was confused.

> Maybe that
> assumption simply has to go ...

That assumption is pretty darn important from a planning-time
perspective, though.  I think if we rip it out we'd better have some
idea what we're going to do instead.  This parameterized path stuff is
very cool, but just like any other planner trick it has to be worth
the cycles spent at runtime.

On further reflection, I'm wondering about this part from your original email:

> We do get as far as finding that out and
> building a bitmap scan path, but the path is marked as yielding 2918
> rows (the whole product table), not the 2 rows it actually will
> produce.  That's because the parameterized path code is designed to
> assume that all identically-parameterized scans will produce the same
> number of rows, and it's already computed that row estimate without
> the benefit of the extracted OR clause.

Can't we fix this in some more-localized way?  I mean, bitmap index
scans are a bit of a special case, because the individual quals get
their own plan tree nodes.   With a sequential scan, index scan, or
index-only scan, any scan of the same relation can incorporate all of
the same quals.  But if we implement a scan for a = 1 or b = 1 as a
bitmap-or of two bitmap-scans, then suddenly there's a node in the
tree that can only accommodate only the a = 1 qual and another that
can accommodate only the b = 1 qual.  Expecting those to have the the
same row-count is clearly nonsense, but it feels like an artifact of
the representational choice.  One can imagine a database where a
sequential scan with a complex filter condition is represented as a
filtering node atop a sequential-scan node rather than as a sequential
scan node with a filter condition glued on to it; conversely, one
could imagine a representation for bitmap scans where a bitmap-or
operation comes out as a single node with a more-complex internal
structure.  I'm not actually proposing that we change the
representational choices we've made here (though I wouldn't be averse
to it), but the point I'm making is that it seems like the behavior
that's causing the problem is a representational artifact, and
thinking about the problem that way might suggest a narrower fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company