> AFAIK, estimating number of distinct values from a small sample is
> inherently an ill-conditioned problem.
If I had been getting estimates all over the map, I'd have been less
concerned, but what I'm seeing is a very consistent number that increases,
and becomes even more consistent, as the stats "target" number goes up. This
makes me think that there is more at work here than the inaccuracy you'd
expect from a small sample. It's as if the algorithm and sample size (even at
the default) are perfectly capable of returning consistent results in this
case, but a multiplier somewhere needs to be changed.
For instance, for each of several statistics targets, I ran an analyze on the
table and then looked at n_distinct, repeating the cycle six times. These are
the results I get (the actual number of distinct values is 92,000; a sketch
of the commands is below, after the numbers):
set statistics = -1 (default):
  13549  14268  14772  14518  13863  13526
  mean = 14083, std dev = 518 (3.7% of mean)
set statistics = 100:
  22457  22598  22566  22580  22767  22490
  mean = 22576, std dev = 108 (0.5% of mean)
set statistics = 500:
  39878  39984  40018  39977  39885  40070
  mean = 39968, std dev = 75 (0.2% of mean)
set statistics = 1000:
  51428  51503  51486  51658  51625  51589
  mean = 51548, std dev = 74 (0.1% of mean)
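For reference, each cycle was something like the following; "mytable" and
"mycol" here are just placeholders for my actual table and column names:

  -- set the per-column statistics target (or -1, 500, 1000), then re-analyze
  ALTER TABLE mytable ALTER COLUMN mycol SET STATISTICS 100;
  ANALYZE mytable;
  -- read back the resulting estimate
  SELECT n_distinct FROM pg_stats
   WHERE tablename = 'mytable' AND attname = 'mycol';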
> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan
OK... but I'm a bit confused about how to get to the right row in
pg_statistic. When I do a \d on pg_stats, it appears that
pg_statistic.starelid should match up with pg_class.oid, but apparently this
is not the case. Is there a place I can look to find which keys correspond
among the pg_catalog tables?
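For concreteness, this is the kind of lookup I imagined should find the row,
and the hand edit I'd then try ("mytable" and "mycol" are placeholders
again):

  -- locate the pg_statistic row for a given table/column
  SELECT s.starelid, s.staattnum, s.stadistinct
    FROM pg_statistic s, pg_class c, pg_attribute a
   WHERE c.relname = 'mytable'
     AND s.starelid = c.oid
     AND a.attrelid = c.oid
     AND a.attnum = s.staattnum
     AND a.attname = 'mycol';

  -- force the correct distinct-value count by hand (92,000 in my case)
  UPDATE pg_statistic
     SET stadistinct = 92000
   WHERE starelid = (SELECT oid FROM pg_class WHERE relname = 'mytable')
     AND staattnum = (SELECT attnum FROM pg_attribute
                       WHERE attrelid = (SELECT oid FROM pg_class
                                          WHERE relname = 'mytable')
                         AND attname = 'mycol');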
> but I'd like to think
> that getting within a factor of 2 is good enough.
Probably so... but with the default stats it is more like a factor of 6
(92,000 actual vs. a mean estimate of about 14,000), which seems significant
to me, and if my conjecture is correct, it might be an easy fix. (Easy for me
to say, since I'm not a developer. <grin>)
-Nick