Thread: Idea about estimating selectivity for single-column expressions

Idea about estimating selectivity for single-column expressions

From: Tom Lane
There were two different complaints today about the planner's inability
to make good estimates for WHERE conditions involving bit AND tests,
such as
        (field & 8) = 0
        (field & 8) <> 0
        (field & 127) = field

I'm not particularly eager to try to put in special-case knowledge
for this sort of thing, but I had an idea that might cover a wide
enough variety of cases to be worth the trouble.  It's this: if we
have a WHERE clause that is an immutable (or maybe even just stable)
function of a single table column, and we don't have any special-purpose
estimator that works for it, we could estimate the selectivity by
actually evaluating the clause for each MCV and histogram entry for
the column, then combining the results with appropriate weightings.
We do something comparable already for "field LIKE pattern" expressions,
but this would generalize it to any expression.  With the larger
histogram sizes that are default as of 8.4, I think this could be
expected to provide a pretty good estimate most of the time ---
certainly far better than the hardwired default estimates are.
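
To be a bit more concrete, the combination I have in mind is something
like the sketch below --- this is only pseudo-C: clause_matches() stands
for actually evaluating the expression at a given stats value, and the
mcv_*/hist_*/null_frac variables stand for what we'd pull out of the
column's pg_statistic entry.

/* very rough sketch --- data access and names are approximate */
Selectivity
clause_selectivity_from_stats(Node *clause)
{
    double      mcv_match_freq = 0.0;   /* summed freqs of MCVs passing the clause */
    double      mcv_total_freq = 0.0;   /* summed freqs of all MCVs */
    int         hist_matches = 0;
    int         i;

    /* evaluate the clause at each MCV, weighting by its frequency */
    for (i = 0; i < num_mcv; i++)
    {
        mcv_total_freq += mcv_freqs[i];
        if (clause_matches(clause, mcv_values[i]))
            mcv_match_freq += mcv_freqs[i];
    }

    /* evaluate it at each histogram entry, all weighted equally */
    for (i = 0; i < num_hist; i++)
    {
        if (clause_matches(clause, hist_values[i]))
            hist_matches++;
    }

    /* the histogram describes the non-MCV, non-null part of the column */
    return mcv_match_freq +
        ((double) hist_matches / num_hist) *
        (1.0 - mcv_total_freq - null_frac);
}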

In the long run this might be workable for multi-column expressions
too, but not before we have multi-column statistics so we know which
combinations of values to try.

I can see a couple of possible downsides:

* If the expression is expensive to evaluate, we could waste a lot
of time making the estimate.  This would only be a serious issue
if the query were designed to avoid evaluating the WHERE clause
very often, but that's certainly possible.  We could perhaps check
the estimated cost of the expression and not do this if it's too
high ... but expression cost estimates are generally pretty bogus.

* The expression might throw an error for some inputs, for instance
        (1 / field) < 0.5
which would fail on zero.  We could recover by wrapping the whole
estimation process in a subtransaction, but that seems really expensive.
I thought about arguing that the failure would happen anyway at runtime,
but that doesn't hold water --- for example, the user might have just
deleted all the rows with field = 0, and would have grounds to complain
if the query failed; but there would still be an entry for zero in the
histogram.

Comments?
        regards, tom lane


Re: Idea about estimating selectivity for single-column expressions

From: Robert Haas
On Tue, Aug 18, 2009 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There were two different complaints today about the planner's inability
> to make good estimates for WHERE conditions involving bit AND tests,
> such as
>        (field & 8) = 0
>        (field & 8) <> 0
>        (field & 127) = field
>
> I'm not particularly eager to try to put in special-case knowledge
> for this sort of thing, but I had an idea that might cover a wide
> enough variety of cases to be worth the trouble.  It's this: if we
> have a WHERE clause that is an immutable (or maybe even just stable)
> function of a single table column, and we don't have any special-purpose
> estimator that works for it, we could estimate the selectivity by
> actually evaluating the clause for each MCV and histogram entry for
> the column, then combining the results with appropriate weightings.
> We do something comparable already for "field LIKE pattern" expressions,
> but this would generalize it to any expression.  With the larger
> histogram sizes that are default as of 8.4, I think this could be
> expected to provide a pretty good estimate most of the time ---
> certainly far better than the hardwired default estimates are.

Trying it on the MCVs makes a lot of sense.  I'm not so sure about
trying it on the histogram entries.  There's no reason to assume that
those cluster in any way that will be useful.  (For example, suppose
that I have the numbers 1 through 10,000 in some particular column and
my expression is col % 100.)

> In the long run this might be workable for multi-column expressions
> too, but not before we have multi-column statistics so we know which
> combinations of values to try.
>
> I can see a couple of possible downsides:
>
> * If the expression is expensive to evaluate, we could waste a lot
> of time making the estimate.  This would only be a serious issue
> if the query were designed to avoid evaluating the WHERE clause
> very often, but that's certainly possible.  We could perhaps check
> the estimated cost of the expression and not do this if it's too
> high ... but expression cost estimates are generally pretty bogus.

It's hard to know whether this is going to be an issue without
implementing it and seeing if anyone yells.  Generally, I suspect that
it will help more often than it will hurt, because bad selectivity
estimates are the #1 source of bad query plans, and the selectivity
estimates for expressions of this kind are currently quite terrible.

> * The expression might throw an error for some inputs, for instance
>        (1 / field) < 0.5
> which would fail on zero.  We could recover by wrapping the whole
> estimation process in a subtransaction, but that seems really expensive.
> I thought about arguing that the failure would happen anyway at runtime,
> but that doesn't hold water --- for example, the user might have just
> deleted all the rows with field = 0, and would have grounds to complain
> if the query failed; but there would still be an entry for zero in the
> histogram.

This seems like a show-stopper unless someone has a clever idea to
address it.  Do we have any way of marking which functions are not
capable of raising an exception, so that we just don't evaluate anything
that can?

...Robert


Re: Idea about estimating selectivity for single-column expressions

From: Greg Stark
On Wed, Aug 19, 2009 at 3:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * The expression might throw an error for some inputs, for instance
>        (1 / field) < 0.5
> which would fail on zero.  We could recover by wrapping the whole
> estimation process in a subtransaction, but that seems really expensive.
> I thought about arguing that the failure would happen anyway at runtime,
> but that doesn't hold water --- for example, the user might have just
> deleted all the rows with field = 0, and would have grounds to complain
> if the query failed; but there would still be an entry for zero in the
> histogram.

We could add another flag in pg_proc for functions which cannot throw
an error. Perhaps all index operator class operators could be required
to use such functions too?


--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Idea about estimating selectivity for single-column expressions

From: Greg Stark
On Wed, Aug 19, 2009 at 3:16 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Wed, Aug 19, 2009 at 3:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * The expression might throw an error for some inputs, for instance
>>        (1 / field) < 0.5
>> which would fail on zero.  We could recover by wrapping the whole
>> estimation process in a subtransaction, but that seems really expensive.
>> I thought about arguing that the failure would happen anyway at runtime,
>> but that doesn't hold water --- for example, the user might have just
>> deleted all the rows with field = 0, and would have grounds to complain
>> if the query failed; but there would still be an entry for zero in the
>> histogram.
>
> We could add another flag in pg_proc for functions which cannot throw
> an error. Perhaps all index operator class operators could be required
> to use such functions too?


Another thought: in the above case it would actually be fine to catch
the error with PG_TRY() without a subtransaction. As long as no shared
database state has been modified when the error is thrown, a
subtransaction isn't needed to roll it back.

In theory "immutable" should mean that's true, though there's the
traditional question of whether invisible internal state changes count
as mutations, and it would be too fragile a thing to depend on.

What I'm wondering is whether we can reliably detect any state changes
that would need to be rolled back. If the PG_CATCH() fires and no state
changes have occurred since the PG_TRY(), then we can just continue and
ignore that entry or treat it as false. If any have occurred, then we
have to let the error propagate; but if that only happens for very
unusual kinds of functions, then perhaps that's OK.

I'm not sure if it's sufficient, but does the lazy xid assignment
provide a place to hook this in? We could add a counter to indicate
how many times an xid was needed -- if that counter doesn't change
between the PG_TRY() and the PG_CATCH(), then we know there's nothing
that would have needed a proper subtransaction to roll back.

As I said, I'm not sure that's actually sufficient. Is there other
shared state that subtransactions are needed to roll back? Is it a
small, enumerable set, or are we going to be constantly discovering new
kinds of state that need to be protected against?

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Idea about estimating selectivity for single-column expressions

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> We could add another flag in pg_proc for functions which cannot throw
> an error. Perhaps all index operator class operators could be required
> to use such functions too?

It would be an extremely small set.  For starters, anything that returns
a pass-by-reference data type would be out automatically --- the palloc
to return the result could hit out-of-memory.

Now maybe we could define the flag as meaning "if this does throw an
error it's okay for the calling query to fail too", which I think is
reasonable for out-of-memory type errors.  But it's getting pretty
squishy at that point, and I think we'd have a lot of problems with
mislabeled functions.
        regards, tom lane


Re: Idea about estimating selectivity for single-column expressions

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Another thought: in the above case it would actually be fine to catch
> the error with PG_TRY() without a subtransaction. As long as no shared
> database state has been modified when the error is thrown, a
> subtransaction isn't needed to roll it back.

I think catching general errors coming out of possibly user-defined code
without using a subtransaction is just too scary to contemplate.
There's too much stuff that could need to be cleaned up and would not
have been.

It may be that a subtransaction will actually be acceptable overhead,
as long as we don't go overboard and invoke this mechanism for "simple"
WHERE clauses.  Hard to tell without coding it up and testing it though.
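
If someone does code it up, I'd expect the per-entry evaluation to be
wrapped roughly the way plpgsql wraps an exception block.  Untested
sketch: BeginInternalSubTransaction and friends are the existing calls,
while eval_clause_on_value() is just a placeholder for running the
expression at one stats value.

/* untested sketch of evaluating the clause at one stats value */
MemoryContext oldcontext = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;
bool        matched = false;

BeginInternalSubTransaction(NULL);

PG_TRY();
{
    matched = eval_clause_on_value(clause, stats_value);
    ReleaseCurrentSubTransaction();
}
PG_CATCH();
{
    /* undo anything the expression may have done, then keep going */
    RollbackAndReleaseCurrentSubTransaction();
    FlushErrorState();
    matched = false;        /* count the entry as not matching */
}
PG_END_TRY();

MemoryContextSwitchTo(oldcontext);
CurrentResourceOwner = oldowner;
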
        regards, tom lane


Re: Idea about estimating selectivity for single-column expressions

From: Greg Stark
On Wed, Aug 19, 2009 at 7:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It may be that a subtransaction will actually be acceptable overhead,
> as long as we don't go overboard and invoke this mechanism for "simple"
> WHERE clauses.  Hard to tell without coding it up and testing it though.

Are you looking for volunteers? Or have you already nearly finished?

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Idea about estimating selectivity for single-column expressions

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> On Wed, Aug 19, 2009 at 7:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It may be that a subtransaction will actually be acceptable overhead,
>> as long as we don't go overboard and invoke this mechanism for "simple"
>> WHERE clauses.  Hard to tell without coding it up and testing it though.

> Are you looking for volunteers? Or have you already nearly finished?

I have not started on it, and have a few other things on my plate first.
I was planning to do it, but if you'd like to try, be my guest.
        regards, tom lane


Re: Idea about estimating selectivity for single-column expressions

From: Josh Berkus
Tom, Greg, Robert,

Here's my suggestion:

1. First, estimate the cost of the node with a very pessimistic (50%?)
selectivity for the calculation.

2. If the cost hits a certain threshold, then run the expression-based
estimation against the histogram.

That way, we avoid the subtransaction and other overhead on very small sets.


also:

> Trying it on the MCVs makes a lot of sense.  I'm not so sure about
> trying it on the histogram entries.  There's no reason to assume that
> those cluster in any way that will be useful.  (For example, suppose
> that I have the numbers 1 through 10,000 in some particular column and
> my expression is col % 100.)

Yes, but for seriously skewed column distributions, the difference in
frequency between the MCVs and a sample "random" distribution will be
huge.  And it's precisely those distributions which are currently
failing in the query planner.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Idea about estimating selectivity for single-column expressions

From: Robert Haas
On Wed, Aug 19, 2009 at 3:00 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Tom, Greg, Robert,
>
> Here's my suggestion:
>
> 1. First, estimate the cost of the node with a very pessimistic (50%?)
> selectivity for the calculation.

There is no such thing as a pessimistic selectivity estimation.  Right
now a lot of things use nested loops when they should hash, because we
use 0.5%.  If we changed it to 50%, then we'd have the opposite
problem (and maybe some merge joins where we should have hashing).

Unfortunately, there is no substitute for accurate estimates.  Of
course, if the executor had the ability to switch from a nested loop to
a hash join in mid-query it would help a great deal, but that's a much
bigger project, I think.

...Robert