Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Msg-id 54E553B1.6030601@2ndquadrant.com
In response to Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>>      values   current    adaptive
>>      ------------------------------
>>      106           99         107
>>      106            8     6449190
>>      1006          38     6449190
>>      10006        327       42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000, i.e. ~0.03%).
> 
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?

Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly specify an upper bound for a single estimate. It's
formulated like this:

 Assume a table with N rows, D distinct values and a sample of r rows
 (all those values are fixed). Then there exists a dataset with those
 features, such that the "ratio error"

     error(D, D') = max(D'/D, D/D')

 is greater than f(N, r, P) with probability at least P. I.e. if you
 randomly choose a sample of r rows, you'll get an error exceeding that
 boundary with probability at least P.
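
For reference, written out as a formula - this is just my reading of the
paper, and it's the same expression the err() python function further
down computes:

    % ratio-error boundary f(N, r, P) from Theorem 1 (same expression
    % as the err() python function below)
    \[
        f(N, r, P) = \sqrt{\frac{N - r}{2r} \, \ln\!\left(\frac{1}{P}\right)}
    \]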
 

So it's not a hard limit, but a statement about the probability of the
estimation error on some (unknown) dataset. In other words, it describes
what you can achieve at best - if you think your estimator is better,
there'll always be a dataset hiding in the shadows hissing "Theorem 1".


Let's say we're looking for a boundary that's crossed only in 1% (or 5%)
of measurements. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes 'r' (3.000, 30.000 and 300.000 rows),
the ratio error boundary per the paper is
              3.000     30.000    300.000
    ----------------------------------------
     1%          88         28          9
     5%          70         22          7
 


At least that's what I get if I compute it using this python function:

    from math import log, sqrt

    def err(N, r, p):
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
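
And a quick sanity check of the table above, just feeding the three
sample sizes and the two probabilities through err() - nothing new here,
the small differences against the table are just rounding:

    # boundary table for N = 10M rows, r = 3.000 / 30.000 / 300.000
    N = 10 * 1000 * 1000
    for p in (0.01, 0.05):
        print(p, [err(N, r, p) for r in (3000, 30000, 300000)])
    # 0.01 -> ~87.6, ~27.7, ~8.6
    # 0.05 -> ~70.7, ~22.3, ~7.0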


So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.
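
Just to put a number on that - assuming the "values" column in the table
above is the true number of distinct values (106 and 1006 for the rows
returning 6449190), the ratio error of those estimates is roughly:

    # ratio error of the 6449190 estimates, vs. the ~88 boundary
    # for r = 3.000 and P = 1%
    for D in (106, 1006):
        print(D, max(6449190.0 / D, D / 6449190.0))
    # ~60841 and ~6411, i.e. orders of magnitude above the boundary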

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


