From: Florian Pflug
Subject: Re: estimating # of distinct values
Msg-id: ADDE6675-4ADE-4516-8FC9-E10E73A016C4@phlo.org
In response to: Re: estimating # of distinct values (Nathan Boley <npboley@gmail.com>)
List: pgsql-hackers
On Jan 20, 2011, at 02:41, Nathan Boley wrote:
>>> If you think about it, it's a bit ridiculous to look at the whole table
>>> *just* to "estimate" ndistinct - if we go that far, why don't we just
>>> store the full distribution and be done with it?
>> 
>> - the best you could do is to average the
>> individual probabilities which gives ... well, 1/ndistinct.
> 
> That is certainly *not* the best you could do in every case. The mean
> is only the best estimate under L2 (quadratic) loss, which is
> definitely not the loss we're dealing with here.

No, not in every case. But on average it comes out best, no?
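
(For concreteness, the arithmetic behind that 1/ndistinct figure: suppose
the column has d distinct values with frequencies p_1, ..., p_d summing
to 1, and an equality probe is equally likely to hit any of them. The
expected selectivity is then

    \frac{1}{d} \sum_{i=1}^{d} p_i \;=\; \frac{1}{d} \;=\; \frac{1}{\mathrm{ndistinct}}

i.e. averaging the individual probabilities washes out the skew entirely.)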

> Consider a table with 10K values, 9,990 of which are 1 and the rest of
> which are 2, 3, ..., 10, versus a table that has the same 10 distinct
> values evenly distributed. For a simple equality query, in the first
> case, a bitmap scan might be best. In the second case, a sequential
> scan would always be best.

True. But selectivity estimates alone don't show the difference there.
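
To put numbers on that, here's a quick sketch of Nathan's two tables
(illustrative Python, not planner code): both have ndistinct = 10, so a
1/ndistinct estimator reports selectivity 0.1 for either one, yet the
true selectivities differ by orders of magnitude.

    # Nathan's two tables: both have ndistinct = 10.
    skewed  = [1] * 9990 + list(range(2, 11))   # 9,990 ones, one each of 2..10
    uniform = [v for v in range(1, 11)] * 1000  # each value 1,000 times

    def true_selectivity(rows, value):
        return rows.count(value) / len(rows)

    est = 1 / len(set(skewed))                  # 0.1 - same for both tables

    print(est, true_selectivity(skewed, 1))     # 0.1 vs ~0.999
    print(est, true_selectivity(skewed, 2))     # 0.1 vs ~0.0001
    print(est, true_selectivity(uniform, 1))    # 0.1 vs 0.1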

Also, in my personal experience this isn't really the area where we have
problems today. The problem cases for me have always been queries with a
rather large number of joins (7 to 10 or so) plus rather complex search
conditions. There, the join order, not the access strategy, becomes the
deciding factor in plan performance. And the join order *is* driven purely
by the selectivity estimates, unlike the access strategy, which even today
takes other factors into account (clusteredness, I believe).

> This is precisely the point I was trying to make in my email: the loss
> function is very complicated. Improving the ndistinct estimator could
> significantly improve the estimates of ndistinct (in the quadratic
> loss sense) while only marginally improving the plans.


The point of this exercise isn't really to improve the ndistinct estimates
for single columns. Rather, we'd like ndistinct estimates for groups of
columns, because those allow us to use the uniform Bayesian approach to
multi-column selectivity estimation that Tomas found literature on. That
approach, on the whole, *does* seem to have a chance of greatly improving
the plans in some cases. We could, of course, estimate multi-column
ndistinct the same way we estimate single-column ndistinct, but quite a
few people have raised concerns that this won't work, due to the large
error of our current ndistinct estimates.
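
(A toy example of why group ndistinct buys us something - my own numbers,
and only the uniformity idea, not the estimator from the paper: for an
equality predicate on two columns, assuming uniformity over the column
group gives selectivity 1/ndistinct(a,b), while assuming independence
gives 1/(ndistinct(a) * ndistinct(b)). For correlated columns these
differ hugely.

    # Columns a and b perfectly correlated (b = a): 100 distinct values,
    # each appearing in 10 of the 1,000 rows.
    rows = [(v, v) for v in range(100) for _ in range(10)]

    nd_a  = len({a for a, b in rows})   # 100
    nd_b  = len({b for a, b in rows})   # 100
    nd_ab = len(set(rows))              # also 100, not 100 * 100

    independence  = 1 / (nd_a * nd_b)   # 0.0001 -> expects ~0.1 matching rows
    group_uniform = 1 / nd_ab           # 0.01   -> expects 10 matching rows

    true_sel = sum(1 for r in rows if r == (42, 42)) / len(rows)
    print(independence, group_uniform, true_sel)   # 0.0001 0.01 0.01

Here the group-based estimate is exactly right, while the independence
assumption is off by a factor of 100.)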

best regards,
Florian Pflug


