Re: estimating # of distinct values - Mailing list pgsql-hackers

From Nathan Boley
Subject Re: estimating # of distinct values
Msg-id AANLkTimaLvNSOfYzfsnXzjUVZoTtqQi_L4v=W3HGtJFO@mail.gmail.com
In response to Re: estimating # of distinct values  (Florian Pflug <fgp@phlo.org>)
List pgsql-hackers
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan20, 2011, at 02:41 , Nathan Boley wrote:
>>>> If you think about it, it's a bit ridiculous to look at the whole table
>>>> *just* to "estimate" ndistinct - if we go that far why don't we just
>>>> store the full distribution and be done with it?
>>>
>>> - the best you could do is to average the
>>> individual probabilities which gives ... well, 1/ndistinct.
>>
>> That is certainly *not* the best you could do in every case. The mean
>> is only the best estimate under L2 (squared-error) loss, which is
>> definitely not the loss we face here.
>
> No, not in every case. But on average it comes out best, no?

In the sense of minimizing the average plan cost over all values?
Definitely not. Consider the trivial case where a bitmap scan costs the
same as a sequential scan plus epsilon.
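
To make that concrete, here's a toy sketch (Python, with a completely
made-up cost model; the saturating bitmap cost and all the numbers are
my assumptions, not the planner's). Plan on the 1/ndistinct point
estimate and the bitmap scan looks like a seq scan plus epsilon, so you
take the seq scan every time; plan on the actual distribution and the
bitmap scan wins by a wide margin on average:

    # Toy example: planning on the mean selectivity (1/ndistinct) vs. the
    # actual per-value selectivities.  All cost numbers are invented.

    N = 10_000                      # rows in the table
    seq_cost = float(N)             # read every row

    def bitmap_cost(rows):
        # Hypothetical bitmap scan cost: per-row overhead that saturates
        # near the seq scan cost once most of the table is touched.
        return min(100 + 12 * rows, 1.05 * seq_cost)

    # Skewed column: value 1 appears 9,990 times, values 2..10 once each,
    # so ndistinct = 10 and the averaged selectivity estimate is 1/10.
    freqs = {1: 9_990, **{v: 1 for v in range(2, 11)}}

    # Planner's view: selectivity = 1/ndistinct -> 1,000 rows expected.
    est_rows = N / len(freqs)
    print(bitmap_cost(est_rows), seq_cost)   # ~10,500 vs 10,000: pick seq scan

    # Reality, if each distinct value is equally likely to be queried:
    avg_bitmap = sum(bitmap_cost(r) for r in freqs.values()) / len(freqs)
    print(avg_bitmap)                        # ~1,150: bitmap wins on average

The particular numbers don't matter; the point is that once plan cost is
a nonlinear function of the true selectivity, the estimate that minimizes
squared error (the mean) need not be the one that minimizes expected plan
cost.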

>
>> Consider a table with 10K rows, 9,990 of which hold the value 1 and the
>> rest of which hold 2, 3, ..., 10, versus a table with the same 10
>> distinct values evenly distributed. For a simple equality query, a
>> bitmap scan might be best in the first case, while a sequential scan
>> would always be best in the second.
>
> True. But selectivity estimates alone don't show the difference there.

Yet the full distribution would, which supports my argument that even in
cases where we don't know the specific value being queried, the full
distribution is more informative.
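
Concretely (toy Python over two hypothetical columns): both columns
below have ndistinct = 10, so a 1/ndistinct estimate treats them
identically, while the raw frequencies tell them apart immediately:

    from collections import Counter

    # Two hypothetical columns, both with exactly 10 distinct values.
    skewed  = [1] * 9_990 + list(range(2, 11))          # 9,990 ones, then 2..10
    uniform = [v for v in range(1, 11) for _ in range(1_000)]

    for name, col in (("skewed", skewed), ("uniform", uniform)):
        dist = Counter(col)
        print(name,
              "ndistinct =", len(dist),                     # 10 for both
              "actual sel(col = 1) =", dist[1] / len(col),  # ~0.999 vs 0.1
              "1/ndistinct estimate =", 1 / len(dist))      # 0.1 for both

Same ndistinct, wildly different answers to "how many rows does col = 1
return", which is exactly what the access-path choice needs to know.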

>
> Also, in my personal experience this isn't really the area where we've got
> problems now. The problem cases for me have always been queries with a
> rather large number of joins (7 to 10 or so) plus rather complex search
> conditions. The join order, not the access strategy, then becomes the
> deciding factor in plan performance. And the join order *is* driven purely
> off the selectivity estimates, unlike the access strategy, which even today
> takes other factors into account (like clusteredness, I believe).
>

I think I'm losing you. Why would ndistinct alone be sufficient for
estimating the optimal join order?
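
As I understand the mechanism you're describing: the per-key ndistinct
feeds the equality-join selectivity (roughly 1/ndistinct of the join key
when nothing better is known), and those estimates get multiplied up the
join tree. A toy sketch of how the errors compound (Python, invented
numbers, both sides of each join assumed to share the same key domain):

    # Toy sketch: a modest ndistinct error compounding through a join chain.
    def eqjoin_rows(rows_l, rows_r, nd_key):
        return rows_l * rows_r / nd_key          # selectivity = 1/nd_key

    TABLE_ROWS = 1_000_000
    ND_TRUE, ND_EST = 1_000_000, 500_000         # ndistinct off by only 2x

    est = act = TABLE_ROWS
    for depth in range(1, 8):                    # a 7-way join chain
        est = eqjoin_rows(est, TABLE_ROWS, ND_EST)
        act = eqjoin_rows(act, TABLE_ROWS, ND_TRUE)
        print(f"after join {depth}: estimated {est:,.0f} rows, actual {act:,.0f}")

    # A 2x error per key becomes a 2**7 = 128x cardinality error at the top
    # of the tree -- easily enough to flip the chosen join order.

But my question stands: is ndistinct alone enough of a summary of the
joint distribution to make those intermediate estimates reliable?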

>> This is precisely the point I was trying to make in my email: the loss
>> function is very complicated. Improving the ndistinct estimator could
>> significantly improve the estimates of ndistinct (in the quadratic-loss
>> sense) while only marginally improving the plans.
>
>
> The point of this exercise isn't really to improve the ndistinct estimates
> for single columns. Rather, we'd like to get ndistinct estimates for
> groups of columns because that allows us to use the uniform bayesian
> approach to multi-column selectivity estimation that Tomas found literature
> on. Which on the whole seems like it *does* have a chance of greatly
> improving the plans in some cases. We could, of course, estimate multi-column
> ndistinct the same way we estimate single-column ndistincts, but quite a few
> people raised concerns that this wouldn't work due to the large error our
> current ndistinct estimates have.

Sure. But estimating multi-column ndistinct is surely not the only way
to approach this problem.
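
To be clear, I'm not disputing that the group-wise number matters. A toy
illustration (Python, synthetic data) of why the multi-column ndistinct
can't just be guessed from the per-column values:

    # Two perfectly correlated columns: each has 100 distinct values, so an
    # independence assumption would allow up to 100 * 100 = 10,000 distinct
    # (a, b) pairs, but the real number is 100.
    rows = [(i % 100, (i % 100) * 2) for i in range(10_000)]  # b is a function of a

    nd_a  = len({a for a, _ in rows})
    nd_b  = len({b for _, b in rows})
    nd_ab = len(set(rows))
    print(nd_a, nd_b, nd_a * nd_b, nd_ab)        # 100 100 10000 100

That correlation is exactly what the estimator you describe is meant to
exploit; my point is only that the group-wise distribution carries even
more information than the group-wise ndistinct.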

The comment I made, which you objected to, was that it's silly to look
at the whole table and then throw away all information *except*
ndistinct. If we really think that looking at the whole table is the
best approach, then shouldn't we be able to get better statistics? Is
this really an open question?
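
To spell out what I mean by "better statistics": a single pass over a
column already yields the exact value distribution, and ndistinct falls
out of it for free. A sketch (Python, standing in for whatever bounded
one-pass summary we would actually keep):

    from collections import Counter

    def one_pass_stats(column):
        # One scan of the column gives the full value -> frequency map;
        # ndistinct is then a one-line summary of that map.
        dist = Counter(column)
        return dist, len(dist)

    dist, ndistinct = one_pass_stats([1] * 9_990 + list(range(2, 11)))
    print(ndistinct)            # 10
    print(dist.most_common(3))  # what we throw away by keeping only ndistinct

In practice you would keep a bounded summary (an MCV list plus a
histogram, say) rather than the raw map, but if we are going to pay for
a full scan, ndistinct seems like the least we could get out of it.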

-Nathan

