Hi all,
attached is v4 of the patch implementing adaptive ndistinct estimator.
I've been looking into the strange estimates, mentioned on 2014/12/07:
> values current adaptive
> ------------------------------
> 106 99 107
> 106 8 6449190
> 1006 38 6449190
> 10006 327 42441
I suspected this might be some sort of rounding error in the numerical
optimization (looking for 'm' solving the equation from paper), but
turns out that's not the case.
The adaptive estimator is a bit unstable for skewed distributions, that
are not sufficiently smooth. Whenever f[1] or f[2] was 0 (i.e. there
were no values occuring exactly once or twice in the sample), the result
was rather off.
The simple workaround for this was adding a fallback to GEE when f[1] or
f[2] is 0. GEE is another estimator described in the paper, behaving
much better in those cases.
With the current version, I do get this (with statistics_target=10):
values current adaptive
------------------------------
106 99 108
106 8 178
1006 38 2083
10006 327 11120
The results do change a bit based on the sample, but these values are a
good example of the values I'm getting.
The other examples (with skewed but smooth distributions) work as good
as before.
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services