Re: PATCH: adaptive ndistinct estimator v4 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: adaptive ndistinct estimator v4
Date
Msg-id 55207A7A.1050802@2ndquadrant.com
Whole thread Raw
In response to Re: PATCH: adaptive ndistinct estimator v4  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Hi,

On 04/03/15 15:46, Greg Stark wrote:
>  > The simple workaround for this was adding a fallback to GEE when f[1]
> or f[2] is 0. GEE is another estimator described in the paper, behaving
> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1 
in that paper. The exact formulation of the theorem is a bit complex, 
but it essentially says that knowing just the sizes of the data set and 
sample, there's an accuracy limit.

Or put another way, it's possible to construct the data set so that the 
estimator gives you estimates with error exceeding some limit (with a 
certain probability).

Knowledge of how much the data set is skewed gives us opportunity to 
improve the estimates by choosing an estimator performing better with 
such data sets. The problem is we don't know the skew - we can only 
estimate it from the sample, which is what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch) 
is an example of such hybrid estimators, but based on my experiments it 
does not work terribly well with one particular type of skew that I'd 
expect to be relatively common (few very common values, many very rare 
values).

Luckily, GEE performs pretty well in this case, but we can use the AE 
otherwise (ISTM it gives really good estimates).

But of course - there'll always be data sets that are poorly estimated 
(pretty much as Theorem 1 in the paper says). I'd be nice to do more 
testing on real-world data sets, to see if this performs better or worse 
than our current estimator.

regards
Tomas



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)
Next
From: Tomas Vondra
Date:
Subject: Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)