Re: PATCH: adaptive ndistinct estimator v4 - Mailing list pgsql-hackers

Hi all,

attached is v4 of the patch implementing the adaptive ndistinct estimator.

I've been looking into the strange estimates, mentioned on 2014/12/07:

>      values   current    adaptive
>      ------------------------------
>      106           99         107
>      106            8     6449190
>      1006          38     6449190
>      10006        327       42441

I suspected this might be some sort of rounding error in the numerical
optimization (looking for the 'm' that solves the equation from the paper),
but it turns out that's not the case.

The adaptive estimator is a bit unstable for skewed distributions that
are not sufficiently smooth. Whenever f[1] or f[2] was 0 (i.e. there
were no values occurring exactly once or twice in the sample), the result
was rather off.

The simple workaround for this was adding a fallback to GEE when f[1] or
f[2] is 0. GEE is another estimator described in the paper, behaving
much better in those cases.
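
To make the fallback concrete, here is a rough Python sketch (not the
patch code, which is C). It assumes the usual statement of GEE,
sqrt(N/n) * f[1] + sum of f[k] for k >= 2, where N is the table row count
and n the sample size. The function names are made up for illustration,
and the adaptive estimator itself is passed in as a placeholder callable
rather than implemented here.

```python
import math
from collections import Counter


def frequency_counts(sample):
    """Return f, where f[k] is the number of distinct values seen exactly k times."""
    per_value = Counter(sample)          # value -> occurrences in the sample
    return Counter(per_value.values())   # occurrences -> number of such values


def gee_estimate(f, sample_size, total_rows):
    """GEE as assumed above: sqrt(N/n) * f[1] + sum of f[k] for k >= 2."""
    scaled_singletons = math.sqrt(total_rows / sample_size) * f.get(1, 0)
    repeated = sum(count for k, count in f.items() if k >= 2)
    return scaled_singletons + repeated


def estimate_ndistinct(sample, total_rows, adaptive_estimate):
    """Use the adaptive estimator unless f[1] or f[2] is 0, then fall back to GEE.

    adaptive_estimate is a hypothetical callable (f, sample_size, total_rows)
    standing in for the estimator implemented by the patch.
    """
    f = frequency_counts(sample)
    n = len(sample)
    if f.get(1, 0) == 0 or f.get(2, 0) == 0:
        return gee_estimate(f, n, total_rows)
    return adaptive_estimate(f, n, total_rows)
```

For a sample like [1, 1, 1, 2, 2, 2] there are no values occurring once
or twice, so the sketch takes the GEE branch and simply counts the two
values that were seen.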

With the current version, I do get this (with statistics_target=10):

      values   current    adaptive
      ------------------------------
      106           99         108
      106            8         178
      1006          38        2083
      10006        327       11120

The results vary a bit from sample to sample, but these values are
representative of what I'm getting.

The other examples (with skewed but smooth distributions) work as well
as before.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
