Re: On Distributions In 7.2 (Longish) - Mailing list pgsql-general
From:           Tom Lane
Subject:        Re: On Distributions In 7.2 (Longish)
Msg-id:         3445.1004208603@sss.pgh.pa.us
In response to: On Distributions In 7.2 (Longish) (Mark Kirkwood <markir@slingshot.co.nz>)
Responses:      Re: On Distributions In 7.2 (Longish)
                More On 7.2 Distributions - Estimates For Number Distinct < 0
List:           pgsql-general
Mark Kirkwood <markir@slingshot.co.nz> writes:
> I have been examining the behaviour of these additions, using the dataset
> from my "warehouse comparison" as a test bed - as it has some large-ish
> tables with controllable data distributions.

Many thanks for this! I haven't yet gotten much feedback on real-world
performance of the new statistics code.

> --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
> most_common_vals
> {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> most_common_freqs
> {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
> n_distinct 36511
>
> Note we are out by an order of magnitude here for the number distinct -
> it should be 3000 - and the frequencies are a little overestimated -
> they should be 0.001.

The factor-of-2 error for the "most common" frequency doesn't bother me;
that's to be expected, considering that we're using a random sampling of
the table rows. However, the factor-of-10 error in the n_distinct
estimate is more troubling. Could you trace through the code that
produces that estimate (in current sources, near line 1310 of
src/backend/commands/analyze.c) and see if you can tell why it's so far
off?

Another interesting thing to look at is how much the stats change in
repeated runs of ANALYZE. Since we're taking a random sample, successive
runs can be expected to produce slightly different answers. I would like
to think that the answers won't change too much ... but maybe this
n_distinct estimate was an outlier.

> ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
> most_common_vals
> {"328","1242","524","1515","2058","2168", (94 more values) ...
> most_common_freqs
> {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333", ...
> n_distinct 3027
>
> Now the number of distinct values is very accurate and the frequencies
> are a little underestimated.

This is a little strange as well. If the true frequency of every d0key
value is 0.001, then the random sample should produce a few estimates
larger than that as well as a few smaller; so I'd expect the "most
common" entries to be larger than 0.001. Furthermore, there is code in
there that's supposed to suppress "most common" entries that aren't
really much more common than the estimated average (cf lines 1360-1380).
Why didn't that trigger? Looks like I may have some bugs to fix here.

> The conclusion here seems to be that 10 quantiles are not quite enough
> for accurate distributional data where (large) tables have a few
> thousand distinct values. However, 100 quantiles were sufficient to get
> accurate statistics.
>
> Experimentation showed that accurate estimates (+/- 10%) of the number
> of distinct values did not begin to appear until about 75 quantiles
> were used.

One of the things I would like to establish before we finish beta is
whether the default stats target of 10 is large enough. I am not very
comfortable with raising it as far as 75-100, but would not be fazed by
numbers around 20-30. I appreciate your feedback on this point.

I wonder, though, whether what we're looking at isn't just a problem
with the number-of-distinct-values estimation equation. The one that's
in there seemed a little phony to me, but I couldn't find anything else
in a cursory literature search.

			regards, tom lane
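
For readers following up on the n_distinct question: the standard
sample-based estimator in the literature is the Haas/Stokes "Duj1"
formula (the one later releases of analyze.c settled on),

    D_est = n*d / (n - f1 + f1*n/N)

where N is the total row count, n the sample size, d the number of
distinct values seen in the sample, and f1 the number of values seen
exactly once. Whether this is the exact equation near line 1310 of the
7.2-era code is not something this thread confirms. The sketch below
uses invented table and sample sizes (not Mark's real counts) and is
only meant to illustrate the arithmetic, in particular how sensitive
the estimate is to f1.

    /*
     * duj1_sketch.c -- illustrative only, NOT the analyze.c source.
     *
     * Computes the Haas/Stokes "Duj1" estimate of the number of
     * distinct values in a column:
     *
     *     D_est = n*d / (n - f1 + f1*n/N)
     *
     *   N  = total rows in the table
     *   n  = rows in the random sample
     *   d  = distinct values seen in the sample
     *   f1 = values seen exactly once in the sample
     */
    #include <stdio.h>

    static double
    duj1(double N, double n, double d, double f1)
    {
        double  denom = n - f1 + f1 * n / N;

        /* If every sampled value was unique, fall back to "all distinct". */
        if (denom <= 0.0)
            return N;
        return n * d / denom;
    }

    int
    main(void)
    {
        double  N = 1000000.0;      /* invented table size */
        double  n = 3000.0;         /* invented sample size */

        /*
         * Roughly what a uniform 3000-key column would give in such a
         * sample (about one hit per key): d ~ 1900, f1 ~ 1100.
         * The estimate lands near the true 3000.
         */
        printf("d=1900, f1=1100 -> D_est = %.0f\n", duj1(N, n, 1900.0, 1100.0));

        /*
         * The formula is very sensitive to f1: if most sampled values
         * appear only once, the estimate blows up by an order of
         * magnitude.
         */
        printf("d=2900, f1=2800 -> D_est = %.0f\n", duj1(N, n, 2900.0, 2800.0));

        return 0;
    }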
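
On the most-common-value suppression Tom mentions (lines 1360-1380): the
exact condition in the 7.2 code is not reproduced here, but a plausible
shape for such a filter is "keep a candidate only if its sample count is
noticeably above the average count implied by the distinct-value
estimate". The sketch below assumes a 1.25 cutoff factor purely for
illustration; the point is that the cutoff scales with the n_distinct
estimate, so when n_distinct is overestimated the implied average
frequency drops and values seen only once or twice can still clear the
bar.

    /*
     * mcv_filter_sketch.c -- illustrative only, not the actual 7.2 logic.
     *
     * Keep a candidate as a "most common value" only if its sample count
     * is at least 25% above the average count a value would get if the
     * sampled rows were spread evenly over the estimated number of
     * distinct values.  (The 1.25 factor is an assumption.)
     */
    #include <stdio.h>
    #include <stdbool.h>

    static bool
    keep_as_mcv(int value_count, int sample_rows, double est_ndistinct)
    {
        double  avg_count = (double) sample_rows / est_ndistinct;

        return (double) value_count > 1.25 * avg_count;
    }

    int
    main(void)
    {
        int     sample_rows = 3000;

        /*
         * With ~3000 distinct keys the average sample count is ~1, so a
         * value seen 5 times clears the bar easily ...
         */
        printf("count=5, n_distinct=3000  -> keep = %d\n",
               keep_as_mcv(5, sample_rows, 3000.0));

        /*
         * ... but if n_distinct is overestimated (say 36511), the implied
         * average drops and even a count of 2 still clears the bar.
         */
        printf("count=2, n_distinct=36511 -> keep = %d\n",
               keep_as_mcv(2, sample_rows, 36511.0));

        return 0;
    }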