Re: estimating # of distinct values - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: estimating # of distinct values
Date
Msg-id 4D377495.5040606@fuzzy.cz
Whole thread Raw
In response to Re: estimating # of distinct values  (Nathan Boley <npboley@gmail.com>)
Responses Re: estimating # of distinct values  (Nathan Boley <npboley@gmail.com>)
List pgsql-hackers
Dne 19.1.2011 23:44, Nathan Boley napsal(a):
> 1) The distribution of values in a table is rarely pathological, and
> usually follows one of several common patterns. ( IOW, we have good
> heuristics )

Not true. You're missing the goal of this effort - to get ndistinct
estimate for combination of multiple columns. Which is usually
pathological in case of dependent columns. Which is exactly the case
when the user will explicitly enable this feature to get better
estimates (and thus better plans).

> 2) The choice of plan is fairly robust to mis-estimation of ndistinct,
> because there are only a few things the executer can choose. It
> doesn't usually matter if a value composes 5% or 20%  ( or 99% ) of
> the table, we still usually need to scan the entire table.

Again, don't think about a single column (although even in that case
there are known fail cases). Think about multiple columns and the number
of distinct combinations. With attribute value independence assumption,
this is usually significantly underestimated. And that's a problem as it
often leads to an index scan instead of sequential scan (or something
like that).

> Again, in my *very* humble opinion, If the ultimate goal is to improve
> the planner, we should try to cut down on the number of cases in which
> a poor estimate of ndistinct gives a really bad plan instead of
> chasing after marginal gains from a better ndistinct estimator. Maybe

What is a marginal gain? The ultimate goal is to build cross-column
stats, which in case of dependent columns usually results is poor plans.
How is fixing that a marginal gain?

Yes, it may turn out it's not worth it, but it's a bit early to say that
without an implementation and some hard data.

> ( as I've advocated for before ) this means parameterizing the
> distribution of values ( ie, find that the values are roughly uniform
> ), maybe this means estimating the error of our statistics and using
> the most robust rather than the best plan, or maybe it means something
> totally different. But: ( and the literature seems to support me )

Which literature? I've read an awful lot of papers on this topic lately,
and I don't remember anything like that. If there's an interesting
paper, I'd like to read it. Yes, all the papers state estimating a
ndistinct is a particularly tricky task, but that's somehow expected?

I've even checked how other databases do this - e.g. Oracle does it very
similarly to what I proposed (the user has to enable the feature, it has
to be rebuilt from time to time, etc.). I'm not saying we should do
everything like Oracle, but they are not clueless idiots.

> pounding away at the ndistinct estimator seems like a dead end. If you
> think about it, it's a bit ridiculous to look at the whole table
> *just* to "estimate" ndistinct - if we go that far why dont we just
> store the full distribution and be done with it?

No, I'm not trying to do this just to improve the ndistinct estimate.
The goal is to improve the cardinality estimate in case of dependent
columns, and one of the papers depends on good ndistinct estimates.

And actually it does not depend on ndistinct for the columns only, it
depends on ndistinct estimates for the combination of columns. So
improving the ndistinct estimates for columns is just a necessary first
step (and only if it works reasonably well, we can do the next step).

regards
Tomas


pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: REVIEW: WIP: plpgsql - foreach in
Next
From: Florian Pflug
Date:
Subject: Re: estimating # of distinct values