Re: On Distributions In 7.2 (Longish) - Mailing list pgsql-general

From Mark kirkwood
Subject Re: On Distributions In 7.2 (Longish)
Date
Msg-id 01102812504700.01068@spikey.slithery.org
Whole thread Raw
In response to Re: On Distributions In 7.2 (Longish)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Tom

1)  I will look at analyze.c and
2)  do more analyze runs with 10 buckets to see if I am getting some bad
probabilistic effects.

With respect to 2), I will also try the same examination on a Freebsd 4.4 box
that I have got here ( just to check that the Linux Mandrake 8.0 machine I
use is not the source of any randomness problems)

regards

Mark

>On Sunday 28 October 2001 07:50, Tom Lane wrote:
> Mark kirkwood <markir@slingshot.co.nz> writes:
> > I have been examining the behaviour of these additions, using the dataset
> > from my "warehouse comparison" as a test bed - as it has some large-ish
> > tables with controllable data distributions.
>
> Many thanks for this!  I haven't yet gotten much feedback on real-world
> performance of the new statistics code.
>
> > --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
> >
> > most_common_vals
> > {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> > Most_common_freqs
> > {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667"
> >,"0.00166667","0.00166667","0.00166667","0.00133333"} n_distinct
> > 36511
> >
> > Note we are out by an order of magnitude here for number distinct -
> > should be 3000, and the frequencies are a little overestimated -
> > should be 0.001
>
> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?
>
> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distict estimate was an outlier.
>
> > ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
> >
> > most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more
> > values)...
> > most_common_freqs
> > {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667
> >","0.000666667","0.000666667","0.000633333",.... n_distinct        3027
> >
> > Now the number of distinct values is very accurate and frequencies are a
> > little underestimated.
>
> This is a little strange as well.  If the true frequency of every d0key
> value is 0.001, then the random sample should produce a few estimates
> larger than that as well as a few smaller; so I'd expect the "most
> common" entries to be larger than 0.001.  Furthermore, there is code in
> there that's supposed to suppress "most common" entries that aren't
> really much more common than the estimated average (cf lines 1360-1380).
> Why didn't that trigger?
>
> Looks like I may have some bugs to fix here.
>
> > The conclusion here seems to be that the 10 quantiles are not quite
> > enough for accurate distributional data where (large) tables have a few
> > thousand distinct values. However 100 quantiles was sufficient to get
> > accurate statistics.
> > Experimentation showed that accurate estimates (+/- 10%) of number of
> > distinct values did not begin to appear until about 75 quantiles were
> > used.
>
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.  I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30.  I appreciate your feedback on this point.
>
> I wonder, though, whether what we're looking at isn't just a problem
> with the number-of-distinct-values estimation equation.  The one that's
> in there seemed a little phony to me, but I couldn't find anything else
> in a cursory literature search.
>
>             regards, tom lane

pgsql-general by date:

Previous
From: Sheer El-Showk
Date:
Subject: Re: performance tuning
Next
From: Uros Gruber
Date:
Subject: Problem with some function