From: Robert Haas
Subject: Re: Idea about estimating selectivity for single-column expressions
Msg-id: 603c8f070908182009t4dec8710pf8e27088653950fb@mail.gmail.com
In response to: Idea about estimating selectivity for single-column expressions (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Tue, Aug 18, 2009 at 10:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There were two different complaints today about the planner's inability
> to make good estimates for WHERE conditions involving bit AND tests,
> such as
>        (field & 8) = 0
>        (field & 8) <> 0
>        (field & 127) = field
>
> I'm not particularly eager to try to put in special-case knowledge
> for this sort of thing, but I had an idea that might cover a wide
> enough variety of cases to be worth the trouble.  It's this: if we
> have a WHERE clause that is an immutable (or maybe even just stable)
> function of a single table column, and we don't have any special-purpose
> estimator that works for it, we could estimate the selectivity by
> actually evaluating the clause for each MCV and histogram entry for
> the column, then combining the results with appropriate weightings.
> We do something comparable already for "field LIKE pattern" expressions,
> but this would generalize it to any expression.  With the larger
> histogram sizes that are default as of 8.4, I think this could be
> expected to provide a pretty good estimate most of the time ---
> certainly far better than the hardwired default estimates are.

Trying it on the MCVs makes a lot of sense.  I'm not so sure about
trying it on the histogram entries.  There's no reason to assume that
those cluster in any way that will be useful.  (For example, suppose
that I have the numbers 1 through 10,000 in some particular column and
my WHERE clause tests (col % 100) = 0.  The evenly spaced histogram
bounds sample the same residue classes over and over, so the estimate
could come out near 0% or near 100% when the true selectivity is 1%.)
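
For concreteness, here's roughly the arithmetic I take Tom to be
proposing, computed by hand from pg_stats for a hypothetical table t
with an integer column "field" (made-up names -- the real thing would
of course live in the planner, not in SQL):

    WITH mcv AS (
        SELECT unnest(most_common_vals::text::int[]) AS val,
               unnest(most_common_freqs)             AS freq
        FROM pg_stats
        WHERE tablename = 't' AND attname = 'field'
    ),
    hist AS (
        SELECT unnest(histogram_bounds::text::int[]) AS val
        FROM pg_stats
        WHERE tablename = 't' AND attname = 'field'
    )
    SELECT
        -- MCV part: each value's frequency is known exactly
        (SELECT coalesce(sum(freq), 0) FROM mcv WHERE (val & 8) = 0)
        -- histogram part: apply the fraction of matching entries to
        -- the non-MCV, non-null remainder of the table
      + (1
         - (SELECT coalesce(sum(freq), 0) FROM mcv)
         - coalesce((SELECT null_frac FROM pg_stats
                     WHERE tablename = 't' AND attname = 'field'), 0))
        * coalesce((SELECT avg(((val & 8) = 0)::int) FROM hist), 0)
          AS estimated_selectivity;

The MCV part is exact; the histogram part assumes the fraction of
matching histogram entries carries over to the rest of the table,
which is exactly the assumption I'm doubting above.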

> In the long run this might be workable for multi-column expressions
> too, but not before we have multi-column statistics so we know which
> combinations of values to try.
>
> I can see a couple of possible downsides:
>
> * If the expression is expensive to evaluate, we could waste a lot
> of time making the estimate.  This would only be a serious issue
> if the query were designed to avoid evaluating the WHERE clause
> very often, but that's certainly possible.  We could perhaps check
> the estimated cost of the expression and not do this if it's too
> high ... but expression cost estimates are generally pretty bogus.

It's hard to know whether this is going to be an issue without
implementing it and seeing if anyone yells.  Generally, I suspect that
it will help more often than it will hurt, because bad selectivity
estimates are the #1 source of bad query plans, and the selectivity
estimates for expressions of this kind are currently quite terrible.
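
Just to put a number on the cost worry: with the 8.4 default
statistics target of 100, a clause like the one below (contrived,
made-up names) could get evaluated a couple of hundred times at plan
time -- once per MCV and histogram entry -- before the query proper
even starts:

    -- contrived: an immutable function that does a fixed million
    -- rows' worth of work on every call
    CREATE FUNCTION slow_check(x int) RETURNS boolean
    LANGUAGE sql IMMUTABLE AS $$
        SELECT $1 > 0
           AND (SELECT count(*) FROM generate_series(1, 1000000)) > 0
    $$;

    -- under the proposal, planning this query would call slow_check()
    -- for each MCV and histogram entry of t.field
    SELECT * FROM t WHERE slow_check(field);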

> * The expression might throw an error for some inputs, for instance
>        (1 / field) < 0.5
> which would fail on zero.  We could recover by wrapping the whole
> estimation process in a subtransaction, but that seems really expensive.
> I thought about arguing that the failure would happen anyway at runtime,
> but that doesn't hold water --- for example, the user might have just
> deleted all the rows with field = 0, and would have grounds to complain
> if the query failed; but there would still be an entry for zero in the
> histogram.

This seems like a show-stopper unless someone has a clever idea to
address it.  Do we have any way of marking which functions are
incapable of raising an exception, so that we could skip evaluating
any expression that might throw?
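
To spell out the scenario Tom describes (hypothetical table again):

    -- every zero is gone, so the query itself can no longer fail
    -- at runtime...
    DELETE FROM t WHERE field = 0;

    -- ...but until the next ANALYZE, 0 can still appear in t.field's
    -- MCV list or histogram, so evaluating the clause against those
    -- stats entries would raise division_by_zero during planning
    SELECT * FROM t WHERE (1 / field) < 0.5;

And note that any whitelist would have to cover operators as well as
functions -- even integer division can throw -- so I suspect it would
end up pretty small.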

...Robert

