Re: On Distributions In 7.2 (Longish) - Mailing list pgsql-general
From:           Tom Lane
Subject:        Re: On Distributions In 7.2 (Longish)
Msg-id:         3445.1004208603@sss.pgh.pa.us
In response to: On Distributions In 7.2 (Longish) (Mark Kirkwood <markir@slingshot.co.nz>)
Responses:      Re: On Distributions In 7.2 (Longish)
                More On 7.2 Distributions - Estimates For Number Distinct < 0
List:           pgsql-general
Mark Kirkwood <markir@slingshot.co.nz> writes:
> I have been examining the behaviour of these additions, using the dataset
> from my "warehouse comparison" as a test bed - as it has some large-ish
> tables with controllable data distributions.

Many thanks for this! I haven't yet gotten much feedback on real-world
performance of the new statistics code.

> --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
> most_common_vals
> {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> most_common_freqs
> {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
> n_distinct 36511
>
> Note we are out by an order of magnitude here for the number distinct -
> it should be 3000 - and the frequencies are a little overestimated -
> they should be 0.001.

The factor-of-2 error for the "most common" frequency doesn't bother me;
that's to be expected, considering that we're using a random sampling of
the table rows. However, the factor-of-10 error in the n_distinct
estimate is more troubling. Could you trace through the code that
produces that estimate (in current sources, near line 1310 of
src/backend/commands/analyze.c) and see if you can tell why it's so far
off?

Another interesting thing to look at is how much the stats change in
repeated runs of ANALYZE. Since we're taking a random sample, successive
runs can be expected to produce slightly different answers. I would like
to think that the answers won't change too much ... but maybe this
n_distinct estimate was an outlier.

> ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
> most_common_vals
> {"328","1242","524","1515","2058","2168", (94 more values) ...
> most_common_freqs
> {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333", ...
> n_distinct 3027
>
> Now the number of distinct values is very accurate and the frequencies
> are a little underestimated.

This is a little strange as well. If the true frequency of every d0key
value is 0.001, then the random sample should produce a few estimates
larger than that as well as a few smaller; so I'd expect the "most
common" entries to be larger than 0.001. Furthermore, there is code in
there that's supposed to suppress "most common" entries that aren't
really much more common than the estimated average (cf lines 1360-1380).
Why didn't that trigger? Looks like I may have some bugs to fix here.

> The conclusion here seems to be that 10 quantiles are not quite enough
> for accurate distributional data where (large) tables have a few
> thousand distinct values. However, 100 quantiles were sufficient to get
> accurate statistics.
>
> Experimentation showed that accurate estimates (+/- 10%) of the number
> of distinct values did not begin to appear until about 75 quantiles
> were used.

One of the things I would like to establish before we finish beta is
whether the default stats target of 10 is large enough. I am not very
comfortable with raising it as far as 75-100, but would not be fazed by
numbers around 20-30. I appreciate your feedback on this point.

I wonder, though, whether what we're looking at isn't just a problem
with the number-of-distinct-values estimation equation. The one that's
in there seemed a little phony to me, but I couldn't find anything else
in a cursory literature search.

			regards, tom lane
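
For readers following up on the n_distinct question: the standard
sample-based estimator in the literature is the Haas/Stokes "Duj1"
formula (the one later releases of analyze.c settled on),

    D_est = n*d / (n - f1 + f1*n/N)

where N is the total row count, n the sample size, d the number of
distinct values seen in the sample, and f1 the number of values seen
exactly once. Whether this is the exact equation near line 1310 of the
7.2-era code is not something this thread confirms. The sketch below
uses invented table and sample sizes (not Mark's real counts) and is
only meant to illustrate the arithmetic, in particular how sensitive
the estimate is to f1.

    /*
     * duj1_sketch.c -- illustrative only, NOT the analyze.c source.
     *
     * Computes the Haas/Stokes "Duj1" estimate of the number of
     * distinct values in a column:
     *
     *     D_est = n*d / (n - f1 + f1*n/N)
     *
     *   N  = total rows in the table
     *   n  = rows in the random sample
     *   d  = distinct values seen in the sample
     *   f1 = values seen exactly once in the sample
     */
    #include <stdio.h>

    static double
    duj1(double N, double n, double d, double f1)
    {
        double  denom = n - f1 + f1 * n / N;

        /* If every sampled value was unique, fall back to "all distinct". */
        if (denom <= 0.0)
            return N;
        return n * d / denom;
    }

    int
    main(void)
    {
        double  N = 1000000.0;      /* invented table size */
        double  n = 3000.0;         /* invented sample size */

        /*
         * Roughly what a uniform 3000-key column would give in such a
         * sample (about one hit per key): d ~ 1900, f1 ~ 1100.
         * The estimate lands near the true 3000.
         */
        printf("d=1900, f1=1100 -> D_est = %.0f\n", duj1(N, n, 1900.0, 1100.0));

        /*
         * The formula is very sensitive to f1: if most sampled values
         * appear only once, the estimate blows up by an order of
         * magnitude.
         */
        printf("d=2900, f1=2800 -> D_est = %.0f\n", duj1(N, n, 2900.0, 2800.0));

        return 0;
    }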
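
On the most-common-value suppression Tom mentions (lines 1360-1380): the
exact condition in the 7.2 code is not reproduced here, but a plausible
shape for such a filter is "keep a candidate only if its sample count is
noticeably above the average count implied by the distinct-value
estimate". The sketch below assumes a 1.25 cutoff factor purely for
illustration; the point is that the cutoff scales with the n_distinct
estimate, so when n_distinct is overestimated the implied average
frequency drops and values seen only once or twice can still clear the
bar.

    /*
     * mcv_filter_sketch.c -- illustrative only, not the actual 7.2 logic.
     *
     * Keep a candidate as a "most common value" only if its sample count
     * is at least 25% above the average count a value would get if the
     * sampled rows were spread evenly over the estimated number of
     * distinct values.  (The 1.25 factor is an assumption.)
     */
    #include <stdio.h>
    #include <stdbool.h>

    static bool
    keep_as_mcv(int value_count, int sample_rows, double est_ndistinct)
    {
        double  avg_count = (double) sample_rows / est_ndistinct;

        return (double) value_count > 1.25 * avg_count;
    }

    int
    main(void)
    {
        int     sample_rows = 3000;

        /*
         * With ~3000 distinct keys the average sample count is ~1, so a
         * value seen 5 times clears the bar easily ...
         */
        printf("count=5, n_distinct=3000  -> keep = %d\n",
               keep_as_mcv(5, sample_rows, 3000.0));

        /*
         * ... but if n_distinct is overestimated (say 36511), the implied
         * average drops and even a count of 2 still clears the bar.
         */
        printf("count=2, n_distinct=36511 -> keep = %d\n",
               keep_as_mcv(2, sample_rows, 36511.0));

        return 0;
    }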