Re: PATCH: adaptive ndistinct estimator v4 - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: PATCH: adaptive ndistinct estimator v4
Msg-id CAMkU=1wxHZEWGh98rg-2wq1HPbWxs6Cf_MTwHTg92gT=QOdZcQ@mail.gmail.com
In response to Re: PATCH: adaptive ndistinct estimator v4  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Tue, Apr 14, 2015 at 11:45 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,

attached is v4 of the patch implementing adaptive ndistinct estimator.

Hi Tomas,

I have a case here where the adaptive algorithm underestimates ndistinct by a factor of 7 while the default estimator is pretty close.

5MB file:


# create table foo2 (x text);
# \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
# analyze verbose foo2;
INFO:  analyzing "public.foo2"
INFO:  "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
WARNING: ndistinct estimate current=998951.78 adaptive=135819.00
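(For anyone wanting to cross-check, the estimate that actually ends up in the catalogs for foo2.x can be pulled from pg_stats after the analyze; something like this should do it:)

-- what the last analyze stored for foo2.x
select tablename, attname, n_distinct
from pg_stats
where schemaname = 'public' and tablename = 'foo2' and attname = 'x';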

I've done a more complete analysis with a real world database I have access to.

I've analyzed patched and current with default_statistics_target set to 100 and to 10000. (I also have some of the same data under 9.2, but it is not meaningfully different from unpatched head.)

For easier interpretation, I hacked up the analyzer so that it always reports the estimated number directly, never converting it to the negative fraction.
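(For context: stock ANALYZE stores n_distinct as a negative value, -ndistinct/reltuples, when it decides the number of distinct values probably scales with the table size. To put everything on the same absolute scale without hacking the analyzer, a query along these lines should recover an absolute estimate from pg_stats; just a sketch, nothing patch-specific:)

-- convert pg_stats.n_distinct back to an absolute estimate;
-- negative values mean "this fraction of the estimated row count"
select s.schemaname, s.tablename, s.attname,
       case when s.n_distinct >= 0 then s.n_distinct
            else -s.n_distinct * c.reltuples
       end as ndistinct_estimate
from pg_stats s
join pg_namespace n on n.nspname = s.schemaname
join pg_class c on c.relnamespace = n.oid and c.relname = s.tablename;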

See the spreadsheet here:

https://docs.google.com/spreadsheets/d/1qUcBoQkRFFcSDq7GtkiQkHqlLTbxQYl5hh6S0byff2M/edit?usp=sharing

The 10000-target data was initially collected in an attempt to discern the truth when the 100-target methods disagreed, but I decided to just collect the gold-standard truth instead.

The truth is given by:
select count(*) from (select distinct column from schema.table where column is not null) foo;

And number_not_null is given by:
select count(*) from schema.table where column is not null;

It looks like the proposed method sometimes overestimates, although never by a huge amount, while the old one never overestimated. Overall it mostly seems to be more accurate, but occasionally it does substantially worse than the current method. I suspect most of the problems are related to the same issue reported in the last email. There are a lot of places where both underestimate, but the new method does so by less than head does.

If there are any columns anyone wants to examine further, give me the tokens for them and I'll try to create a generator that produces data with the same distribution (and clustering, if that seems relevant).
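(Purely as a hypothetical strawman, not modeled on any particular column in the spreadsheet, something like the following gives a heavily skewed, long-tailed distribution of text values; clustering would need an explicit ordering on top of it:)

-- hypothetical generator: ~1M rows, a few very common values, long tail of rare ones
create table gen_test (x text);
insert into gen_test
select md5(((random() ^ 4) * 200000)::int::text)
from generate_series(1, 1000000);
analyze verbose gen_test;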

Cheers,

Jeff
