Re: estimating # of distinct values - Mailing list pgsql-hackers

From: Nathan Boley
Subject: Re: estimating # of distinct values
Msg-id: AANLkTi=wNuG0p4gvLMXo+uXhCvspHPq340r0y98tNofj@mail.gmail.com
In response to: Re: estimating # of distinct values (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
>
> Not true. You're missing the goal of this effort - to get an ndistinct
> estimate for a combination of multiple columns, which is usually
> pathological in the case of dependent columns.

<snip>

> Again, don't think about a single column (although even in that case
> there are known fail cases). Think about multiple columns and the
> number of distinct combinations. Under the attribute value independence
> assumption, the resulting cardinality estimate is usually significantly
> underestimated. And that's a problem, as it often leads to an index
> scan instead of a sequential scan (or something like that).

My point was that, in the case of single columns, we've done pretty
well despite the poor ndistinct estimates. I suspect the same will be
true in the case of multiple columns; good heuristics will trump
minimax estimators.
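
To make the dependent-column failure mode concrete, here's a toy
illustration ( made-up data, nothing from an actual patch ): when one
column is a function of another, treating the columns as independent
amounts to assuming the product of the per-column ndistincts as the
number of combinations, which drives the estimated rows-per-combination
( and hence the cardinality of an equality match ) far too low.

    import random

    # Toy example of dependent columns: b is a deterministic function
    # of a, so each distinct a yields exactly one (a, b) combination.
    rows = [(a, a % 10)
            for a in (random.randrange(1000) for _ in range(100_000))]

    ndistinct_a = len({a for a, _ in rows})   # ~1000
    ndistinct_b = len({b for _, b in rows})   # 10
    ndistinct_ab = len(set(rows))             # ~1000: same as ndistinct_a

    # Attribute value independence effectively assumes
    # ndistinct_a * ndistinct_b (~10000) combinations, so rows matching
    # "a = x AND b = y" get estimated at len(rows) / 10000 -- a 10x
    # underestimate here, exactly the index-vs-seqscan trap above.
    assumed = ndistinct_a * ndistinct_b
    print(f"combinations: actual {ndistinct_ab}, assumed {assumed}")
    print(f"rows per combination: actual ~{len(rows) / ndistinct_ab:.0f}, "
          f"assumed ~{len(rows) / assumed:.0f}")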

>> ( as I've advocated for before ) this means parameterizing the
>> distribution of values ( i.e., finding that the values are roughly
>> uniform ), maybe this means estimating the error of our statistics
>> and using the most robust rather than the best plan, or maybe it
>> means something totally different. But: ( and the literature seems
>> to support me )
>
> Which literature? I've read an awful lot of papers on this topic lately,
> and I don't remember anything like that. If there's an interesting
> paper, I'd like to read it. Yes, all the papers state that estimating
> ndistinct is a particularly tricky task, but that's to be expected,
> isn't it?

It is definitely expected - non-parametric minimax results are always
*very* weak. The papers that you've cited seem to confirm this;
bounding ndistinct estimation error is tricky in the general case (
even with very simple loss functions, which we do not have ). The same
is true for other non-parametric methods. Think about kernel density
estimation: how many data points do you need to estimate the density
of a normal distribution well? What about if you *know* that the data
is normal ( or even that it comes from a big family? ).
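
To put rough numbers on that analogy ( a quick sketch using numpy and
scipy, with sample sizes I picked arbitrarily ): fit the same draws
from a standard normal both ways and compare the worst-case error
against the true density. The parametric fit, which only has to learn
a mean and a standard deviation, converges much faster than the kernel
estimate, which assumes nothing about the shape.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    grid = np.linspace(-4, 4, 200)
    true_pdf = stats.norm.pdf(grid)

    for n in (20, 200, 2000):
        sample = rng.standard_normal(n)

        # Parametric: *know* the data is normal, estimate mean/stddev only.
        parametric = stats.norm.pdf(grid, loc=sample.mean(),
                                    scale=sample.std())

        # Non-parametric: Gaussian KDE, no shape assumption at all.
        kde = stats.gaussian_kde(sample)(grid)

        print(f"n={n:4d}  parametric max error: "
              f"{np.abs(parametric - true_pdf).max():.3f}  "
              f"KDE max error: {np.abs(kde - true_pdf).max():.3f}")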

> No, I'm not trying to do this just to improve the ndistinct estimate.
> The goal is to improve the cardinality estimate in case of dependent
> columns, and one of the papers depends on good ndistinct estimates.
>
> And actually it does not depend on ndistinct for the columns only, it
> depends on ndistinct estimates for the combination of columns. So
> improving the ndistinct estimates for columns is just a necessary first
> step (and only if it works reasonably well, we can do the next step).

I think that any approach which depends on precise estimates of
ndistinct is not practical.
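
For reference, here's a standalone sketch of the Duj1 estimator from
Haas & Stokes ( IIRC that's the formula analyze.c uses; the toy data
below is mine, not anything from the backend ). Even at a 3% sample it
can be badly off once the value distribution is skewed, which is why
I'd rather we not lean on it being precise.

    import random
    from collections import Counter

    def duj1(sample, total_rows):
        # Haas & Stokes Duj1 estimator:  n*d / (n - f1 + f1*n/N)
        # where n = sample size, N = total rows, d = distinct values in
        # the sample, f1 = values seen exactly once in the sample.
        n, counts = len(sample), Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / total_rows)

    random.seed(1)
    N = 1_000_000
    datasets = {
        "uniform": [random.randrange(10_000) for _ in range(N)],
        # Zipf-ish: a few very frequent values, a long tail of rare ones.
        "skewed": [min(int(1 / random.random()), 10_000)
                   for _ in range(N)],
    }
    for name, data in datasets.items():
        sample = random.sample(data, 30_000)
        print(f"{name}: true ndistinct = {len(set(data))}, "
              f"Duj1 estimate = {duj1(sample, N):.0f}")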

I am very happy that you've spent so much time on this, and I'm sorry
if my previous email came off as combative. My point was only that
simple heuristics have served us well in the past and, before we go to
the effort of new, complicated schemes, we should see how well similar
heuristics work in the multiple column case. I am worried that if the
initial plan is too complicated then nothing will happen and, even if
something does happen, it will be tough to get it committed ( check
the archives for cross-column stats threads - there are a lot ).

Best,
Nathan Boley

