Re: Problem with n_distinct being consistently inaccurate. - Mailing list pgsql-admin

From Nick Fankhauser
Subject Re: Problem with n_distinct being consistently inaccurate.
Date
Msg-id NEBBLAAHGLEEPCGOBHDGOEIOIJAA.nickf@ontko.com
Whole thread Raw
In response to Re: Problem with n_distinct being consistently inaccurate.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Problem with n_distinct being consistently inaccurate.
List pgsql-admin
> AFAIK, estimating number of distinct values from a small sample is
> inherently an ill-conditioned problem.

If I had been getting estimates all over the map, I'd have been a bit more
unconcerned, but what I'm seeing is a very consistent number that also
increases and tends to be more consistent in proportion to the stats
"target" number. This makes me think that there is more at work here than
the inaccuracy likely to occur from small samples. It's as if the algorithm
and sample size (even at default) are pretty reasonable for returning
consistent results in this case, but a multiplier needs to be changed.

For instance, with the various values for statistics, if I do an analyze on
the table and then look at n_distinct 6 times, these are the results I get:

(actual number is 92,000)

set statistics = -1 (default):
13549
14268
14772
14518
13863
13526

mean = 14083
std dev = 518 or 3.7% of mean


set statistics = 100
22457
22598
22566
22580
22767
22490

mean = 22576
std dev = 108 or .5% of mean


set statistics = 500
39878
39984
40018
39977
39885
40070

mean = 39968
std dev = 75 or .2% of mean


set statistics = 1000
51428
51503
51486
51658
51625
51589

mean = 51548
std dev = 74 or .1% of mean


> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan

OK... but I'm a bit confused on how to get to the right row in pg_statistic.

when I do a \d on pg_stats, it appears that pg_statistic.starelid matches up
with pg_class.oid, but apparently this is not the case. Is there a place I
can look to find which keys correspond among the pg_catalog tables?


> but I'd like to think
> that getting within a factor of 2 is good enough.

Probably so... but with the default stats, it is more like a factor of 6,
which seems significant to me, and if my conjecture is correct, it might be
an easy fix. (Easy for me to say, since I'm not a developer. <grin>)

-Nick




pgsql-admin by date:

Previous
From: Tom Lane
Date:
Subject: Re: Problem with n_distinct being consistently inaccurate.
Next
From: Tom Lane
Date:
Subject: Re: Problem with n_distinct being consistently inaccurate.