Re: Parameterized-path cost comparisons need some work - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Parameterized-path cost comparisons need some work
Date
Msg-id 5189.1330838449@sss.pgh.pa.us
Whole thread Raw
In response to Re: Parameterized-path cost comparisons need some work  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parameterized-path cost comparisons need some work
Re: Parameterized-path cost comparisons need some work
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 29, 2012 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (... thinks some more ...)  No, that doesn't get us there, because that
>> doesn't establish that a more-parameterized path produces fewer rows
>> than some path that requires less parameterization, yet not none at
>> all.  You really want add_path carrying out those comparisons.  In your
>> previous example, it's entirely possible that path D is dominated by B
>> or C because of poor choices of join quals.

> I'm not following this part.  Can you explain further?  It seems to me
> at any rate that we could get pretty far if we could just separate
> parameterized paths and unparameterized paths into separate buckets.

To try to get some definitive info about this, I instrumented add_path
to emit a log message any time it compared two paths for which one's
parameterization was a strict subset of the other's, and yet the first
was not estimated to return more rows.  Sure enough, I got a lot of
messages, just by running the regression tests (and even more with some
of the other test cases I'd been using for parameterized-path testing).
All of the hits were for equal numbers of rows, though -- there were
no cases with a rows difference in the opposite direction from
expectations.

After looking at the results, I think that the fallacy in what we've
been discussing is this: a parameterized path may well have some extra
selectivity over a less-parameterized one, but perhaps *not enough to be
meaningful*.  The cases I was getting hits on were where the rowcount
estimate got rounded off to be the same as for the less-parameterized
path.  (In this connection it's worth noting that most of the hits were
for rowcount estimates of only 1 or 2 rows.)  So basically, the scenario
is where you have restriction clauses that are already enough to get
down to a small number of rows retrieved, and then you have some join
clauses that are not very selective and don't reduce the rowcount any
further.  Or maybe you have some nicely selective join clauses, and then
adding more joins to some other relations doesn't help any further.

In situations like this, we want add_path to reject the ineffective
more-parameterized path as not being an improvement over the
less-parameterized path.  Not having it do so might save cycles in
add_path itself, but we'd be being penny-wise and pound-foolish, because
not getting rid of the useless paths will cost us a lot more at the next
join level up.

So I'm back to thinking we need to look explicitly at the rowcount
comparison as well as all the existing conditions in add_path.

One annoying thing about that is that it will reduce the usefulness of
add_path_precheck, because that's called before we compute the rowcount
estimates (and indeed not having to make the rowcount estimates is one
of the major savings from the precheck).  I think what we'll have to do
is assume that a difference in parameterization could result in a
difference in rowcount, and hence only a dominant path with exactly the
same parameterization can result in failing the precheck.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Daniele Varrazzo
Date:
Subject: Re: Patch: improve selectivity estimation for IN/NOT IN
Next
From: Brendan Jurd
Date:
Subject: Our regex vs. POSIX on "longest match"