Re: On Distributions In 7.2 (Longish) - Mailing list pgsql-general
From: Mark kirkwood
Subject: Re: On Distributions In 7.2 (Longish)
Date:
Msg-id: 01102812504700.01068@spikey.slithery.org
In response to: Re: On Distributions In 7.2 (Longish) (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Tom,

1) I will look at analyze.c, and 2) do more ANALYZE runs with 10 buckets to
see whether I am getting some bad probabilistic effects.

With respect to 2), I will also try the same examination on a FreeBSD 4.4 box
that I have here (just to check that the Linux Mandrake 8.0 machine I use is
not the source of any randomness problems). A rough sketch of these checks is
at the end of this message.

regards

Mark

On Sunday 28 October 2001 07:50, Tom Lane wrote:
> Mark kirkwood <markir@slingshot.co.nz> writes:
> > I have been examining the behaviour of these additions, using the dataset
> > from my "warehouse comparison" as a test bed - as it has some large-ish
> > tables with controllable data distributions.
>
> Many thanks for this!  I haven't yet gotten much feedback on real-world
> performance of the new statistics code.
>
> > --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
> >
> > most_common_vals
> > {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> > most_common_freqs
> > {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667",
> >  "0.00166667","0.00166667","0.00166667","0.00133333"}
> > n_distinct 36511
> >
> > Note we are out by an order of magnitude here for number distinct -
> > should be 3000, and the frequencies are a little overestimated -
> > should be 0.001
>
> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?
>
> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distinct estimate was an outlier.
>
> > ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
> >
> > most_common_vals {"328","1242","524","1515","2058","2168",( 94 more
> > values)...
> > most_common_freqs
> > {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667",
> >  "0.000666667","0.000666667","0.000633333",....
> > n_distinct 3027
> >
> > Now the number of distinct values is very accurate and frequencies are a
> > little underestimated.
>
> This is a little strange as well.  If the true frequency of every d0key
> value is 0.001, then the random sample should produce a few estimates
> larger than that as well as a few smaller; so I'd expect the "most
> common" entries to be larger than 0.001.  Furthermore, there is code in
> there that's supposed to suppress "most common" entries that aren't
> really much more common than the estimated average (cf lines 1360-1380).
> Why didn't that trigger?
>
> Looks like I may have some bugs to fix here.
>
> > The conclusion here seems to be that the 10 quantiles are not quite
> > enough for accurate distributional data where (large) tables have a few
> > thousand distinct values.  However 100 quantiles was sufficient to get
> > accurate statistics.
> > Experimentation showed that accurate estimates (+/- 10%) of number of
> > distinct values did not begin to appear until about 75 quantiles were
> > used.
>
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.
> I am not very comfortable with raising it as far as 75-100, but would
> not be fazed with numbers around 20-30.  I appreciate your feedback on
> this point.
>
> I wonder, though, whether what we're looking at isn't just a problem
> with the number-of-distinct-values estimation equation.  The one that's
> in there seemed a little phony to me, but I couldn't find anything else
> in a cursory literature search.
>
> regards, tom lane
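P.S. Here is the rough sketch of the checks mentioned above. It assumes only
the fact0/d0key column from the warehouse-comparison dataset and the standard
pg_stats view; nothing else about the schema is implied.

-- Repeated-ANALYZE check at the default target of 10: take several
-- fresh samples and watch how far n_distinct and most_common_freqs
-- drift between runs (to be repeated on both the Mandrake and the
-- FreeBSD boxes).
ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'fact0' AND attname = 'd0key';
-- ... then run the ANALYZE + SELECT pair a few more times ...

-- The same check at an intermediate target (e.g. the 20-30 range Tom
-- mentions) just changes the first statement:
ALTER TABLE fact0 ALTER d0key SET STATISTICS 30;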
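For what it's worth, one sample-based estimator for the number of distinct
values from the literature is the Haas/Stokes "Duj1" estimator. I am only
assuming it resembles the equation near line 1310 of analyze.c, but it does
show how touchy such estimates can be:

    n_distinct_est = (n * d) / (n - f1 + f1 * n / N)

where (in my notation) n is the number of sampled rows, N the total number of
rows in the table, d the number of distinct values seen in the sample, and f1
the number of values seen exactly once in the sample. When nearly every
sampled value turns up only once (f1 close to d and to n), the denominator
shrinks towards f1 * n / N and the result becomes extremely sensitive to small
changes in f1 - the sort of behaviour that could plausibly produce a
factor-of-10 swing like the 36511 figure above.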
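Finally, a brute-force way to get the ground truth to compare the statistics
against (a full scan of fact0, so it will be slow on the big fact table):

-- True number of distinct d0key values:
SELECT count(DISTINCT d0key) AS true_n_distinct FROM fact0;

-- True per-value frequencies, most common first, to line up against
-- most_common_vals / most_common_freqs from pg_stats:
SELECT d0key,
       count(*)::float8 / (SELECT count(*) FROM fact0) AS true_freq
FROM fact0
GROUP BY d0key
ORDER BY true_freq DESC
LIMIT 10;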