Thread: On Distributions In 7.2 (Longish)
The current sources (7.2) have introduced distributional factors into the
system statistics. I have been examining the behaviour of these additions,
using the dataset from my "warehouse comparison" as a test bed - as it has
some large-ish tables with controllable data distributions. I think the
results are quite interesting...

Tables
------

Table "dim0"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 d0key  | integer                  |
 f1     | timestamp with time zone |
 f2     | character varying(20)    |
 f3     | character varying(20)    |
Indexes: dim0_q1 (on f1 - UNIQUE)
Unique keys: dim0_pk
Rows: 3000

Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Indexes: fact0_q1 (on d0key only)
Rows: 3000000
Distribution: d0key uniformly distributed -
  1000 occurrences for each value of d0key (i.e. 0.001 frequency for each value)
  3000 distinct values of d0key

Query
-----

The query to be examined is:

SELECT d0.f3, count(f.val)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND   d0.f1 BETWEEN '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;

This will join 88 rows from the dim0 table with 88000 from the fact0 table
and group them into 3 "month" buckets.

Case 1:
-------

Consider using the default distributional sampling settings (10 quantiles):

-- ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;

System Stats
------------

SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'fact0' AND attname = 'd0key';

most_common_vals  {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
most_common_freqs {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
n_distinct        36511

Note that we are out by an order of magnitude here for the number of distinct
values - it should be 3000 - and the frequencies are a little overestimated -
they should be 0.001.

QUERY PLAN

Aggregate  (cost=29712.88..29749.00 rows=722 width=18)
  ->  Group  (cost=29712.88..29730.94 rows=7225 width=18)
        ->  Sort  (cost=29712.88..29712.88 rows=7225 width=18)
              ->  Nested Loop  (cost=0.00..29249.81 rows=7225 width=18)
                    ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)
                    ->  Index Scan using fact0_q1 on fact0 f  (cost=0.00..326.33 rows=81 width=8)

RESULTS
-------

 f3 | count
----+-------
 01 | 30000
 02 | 28000
 12 | 30000
(3 rows)

ELAPSED: 19s

Clearly the query statistics are underestimating the number of rows for the
nested loop join, by about a factor of 10. The estimated rows from dim0 are ok.

Case 2:
-------

Let's try 100 quantiles:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

System Stats
------------

most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more values)...
most_common_freqs {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",...
n_distinct        3027

Now the number of distinct values is very accurate and the frequencies are a
little underestimated.
QUERY PLAN

Aggregate  (cost=118518.96..118958.65 rows=8794 width=18)
  ->  Group  (cost=118518.96..118738.80 rows=87937 width=18)
        ->  Sort  (cost=118518.96..118518.96 rows=87937 width=18)
              ->  Hash Join  (cost=4.65..111297.50 rows=87937 width=18)
                    ->  Seq Scan on fact0 f  (cost=0.00..87693.36 rows=3000036 width=8)
                    ->  Hash  (cost=4.43..4.43 rows=89 width=10)
                          ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)

ELAPSED: 60s

The query statistics are now very accurate (e.g. 89 rows from dim0 and 87937
rows joined) - note that, ironically, the better execution plan is chosen with
the poorer statistical data!

The conclusion here seems to be that 10 quantiles are not quite enough for
accurate distributional data where (large) tables have a few thousand distinct
values. However, 100 quantiles were sufficient to get accurate statistics.

Further Notes
-------------

Experimentation showed that accurate estimates (+/- 10%) of the number of
distinct values did not begin to appear until about 75 quantiles were used.
On the other hand, reducing the number of distinct entries by a factor of 10,
while keeping the number of rows constant at 3000000, gave rise to accurate
statistics with 10 quantiles.

Given that the distribution used for fact0 is relatively benign (uniform),
the test is hopefully fair (i.e. not constructed specially to fox the
analyzer). However, the most common value for fact0 is non-unique (since all
d0key values have the same frequency) - I am uncertain whether this is
significant...

Tests were performed with the 7.2 snapshot of 16 Oct.

regards

Mark
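It is worth noting how directly the n_distinct error propagates into the join
estimates above. The following is illustrative arithmetic only, not planner
code, and the variable names are mine; it assumes the planner's per-key
estimate for fact0 is roughly reltuples / n_distinct, multiplied by the number
of qualifying dim0 rows:

/*
 * Illustration only: back-of-envelope reproduction of the join-size
 * arithmetic from the two n_distinct estimates.  Assumption (mine, not
 * taken from the sources): expected fact0 rows per dim0 key is roughly
 * reltuples / n_distinct, and the join estimate is that times the number
 * of qualifying dim0 rows.
 */
#include <stdio.h>

int
main(void)
{
	double	fact0_rows = 3000036.0;		/* reltuples for fact0 */
	double	dim0_rows = 89.0;		/* estimated qualifying dim0 rows */
	double	ndistinct[] = {36511.0, 3027.0};	/* case 1, case 2 */

	for (int i = 0; i < 2; i++)
	{
		double	rows_per_key = fact0_rows / ndistinct[i];
		double	join_rows = dim0_rows * rows_per_key;

		printf("n_distinct = %.0f: ~%.0f fact0 rows per key, ~%.0f join rows\n",
			   ndistinct[i], rows_per_key, join_rows);
	}
	return 0;
}

This reproduces the 81 / 7225 figures of case 1 and the 87937 figure of case 2
to within a few percent, which suggests the case 1 plan-size error is driven
almost entirely by the n_distinct overestimate rather than by the frequency
estimates.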
Mark kirkwood <markir@slingshot.co.nz> writes:
> I have been examining the behaviour of these additions, using the dataset
> from my "warehouse comparison" as a test bed - as it has some large-ish
> tables with controllable data distributions.

Many thanks for this!  I haven't yet gotten much feedback on real-world
performance of the new statistics code.

> -- ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;

> most_common_vals
> {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> most_common_freqs
> {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
> n_distinct 36511

> Note that we are out by an order of magnitude here for the number of
> distinct values - it should be 3000 - and the frequencies are a little
> overestimated - they should be 0.001.

The factor-of-2 error for the "most common" frequency doesn't bother me;
that's to be expected, considering that we're using a random sampling
of the table rows.  However, the factor-of-10 error in the n_distinct
estimate is more troubling.  Could you trace through the code that
produces that estimate (in current sources, near line 1310 of
src/backend/commands/analyze.c) and see if you can tell why it's so
far off?

Another interesting thing to look at is how much the stats change in
repeated runs of ANALYZE.  Since we're taking a random sample,
successive runs can be expected to produce slightly different answers.
I would like to think that the answers won't change too far ... but
maybe this n_distinct estimate was an outlier.

> ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;

> most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more
> values)...
> most_common_freqs
> {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",...
> n_distinct 3027

> Now the number of distinct values is very accurate and the frequencies
> are a little underestimated.

This is a little strange as well.  If the true frequency of every d0key
value is 0.001, then the random sample should produce a few estimates
larger than that as well as a few smaller; so I'd expect the "most
common" entries to be larger than 0.001.  Furthermore, there is code in
there that's supposed to suppress "most common" entries that aren't
really much more common than the estimated average (cf lines 1360-1380).
Why didn't that trigger?

Looks like I may have some bugs to fix here.

> The conclusion here seems to be that 10 quantiles are not quite enough
> for accurate distributional data where (large) tables have a few
> thousand distinct values.  However, 100 quantiles were sufficient to get
> accurate statistics.

> Experimentation showed that accurate estimates (+/- 10%) of the number
> of distinct values did not begin to appear until about 75 quantiles were
> used.

One of the things I would like to establish before we finish beta is
whether the default stats target of 10 is large enough.  I am not very
comfortable with raising it as far as 75-100, but would not be fazed
with numbers around 20-30.  I appreciate your feedback on this point.

I wonder, though, whether what we're looking at isn't just a problem
with the number-of-distinct-values estimation equation.  The one that's
in there seemed a little phony to me, but I couldn't find anything else
in a cursory literature search.

			regards, tom lane
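The "factor-of-2 is to be expected" remark can be checked with simple binomial
arithmetic. The sketch below is my own illustration, not anything in
analyze.c; it assumes the stats-target-10 sample was about 3000 rows, which
matches the trace output posted later in the thread:

/*
 * Illustration only (not analyze.c code).  With a ~3000-row sample and a
 * true per-value frequency of 0.001, each d0key's sample count is roughly
 * Binomial(3000, 0.001): mean 3, standard deviation about 1.7.  Seeing top
 * counts of 5-7 (frequencies 0.0017-0.0023) among 3000 values is therefore
 * just sampling noise.
 */
#include <stdio.h>
#include <math.h>

int
main(void)
{
	double	sample_rows = 3000.0;		/* rows sampled at stats target 10 */
	double	true_freq = 0.001;		/* 1000 of each d0key in ~3,000,000 rows */
	double	mean = sample_rows * true_freq;
	double	stddev = sqrt(sample_rows * true_freq * (1.0 - true_freq));
	double	top_freq_seen = 0.00233333;	/* from pg_stats, case 1 */

	printf("expected sample count per value: %.1f +/- %.1f\n", mean, stddev);
	printf("observed top count: %.1f (%.1f sigma above the mean)\n",
		   top_freq_seen * sample_rows,
		   (top_freq_seen * sample_rows - mean) / stddev);
	return 0;
}

With 3000 values each drawn this way, a handful of counts a couple of standard
deviations above the mean is exactly what one expects, which is the point
about the case 1 frequencies; the case 2 frequencies sitting below 0.001 are
the part that is harder to explain this way.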
Tom,

1) I will look at analyze.c, and 2) I will do more ANALYZE runs with 10
buckets to see if I am getting some bad probabilistic effects.

With respect to 2), I will also try the same examination on a FreeBSD 4.4
box that I have here (just to check that the Linux Mandrake 8.0 machine I
use is not the source of any randomness problems).

regards

Mark
Tom Lane wrote (edited...):

> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?

My limited tracing of analyze.c (adding various elog calls in lines
1288-1344 of the Beta 1 sources):

ANALYZE fact0;
DEBUG:  Analyzing fact0
DEBUG:  Analyze : beginning a column
DEBUG:  Have 3000 values in relation (?) (numrows)
DEBUG:  Have 3000036 total values in relation (totalrows)
DEBUG:  Have 3000 values in sample (values_cnt)
DEBUG:  Have 1875 distinct values in sample (ndistinct)
DEBUG:  Have 813 multiple values in sample (nmultiple)
DEBUG:  calc 1116.000000 f1 for Chaudhuri rule
DEBUG:  calc 35291.230469 term1 for Chaudhuri rule
DEBUG:  calc 34397.000000 distinct via Chaudhuri rule

That term1 seems like the problem.  I am thinking of a sort of modifier
function that tames the extreme values for term1 when sample size << rows
in relation, but tends to 1 as sample size ~ rows in relation.

A quick and dirty experiment with a vaguely suitable such function in
analyze.c (approx line 1333):

	/* term1 = sqrt(totalrows / (double) numrows) * f1; */
	term1 = sqrt(totalrows / (double) numrows) * f1
			/ log10(log10(1000) * totalrows / numrows);

ANALYZE fact0;
DEBUG:  Analyzing fact0
DEBUG:  Analyze : beginning a column
DEBUG:  Have 3000 values in relation
DEBUG:  Have 3000036 total values in relation
DEBUG:  Have 3000 values in sample
DEBUG:  Have 1898 distinct values in sample
DEBUG:  Have 798 multiple values in sample
DEBUG:  calc 1100.000000 f1 for Modified Chaudhuri rule
DEBUG:  calc 10004.025391 term1 for Modified Chaudhuri rule
DEBUG:  calc 10802.000000 distinct via Modified Chaudhuri rule

which is better (only a factor of 3 out now!).  Now to see if I have killed
the larger-sample behaviour:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;
DEBUG:  Analyzing fact0
DEBUG:  Analyze : beginning a column
DEBUG:  Have 30000 values in relation
DEBUG:  Have 3000036 total values in relation
DEBUG:  Have 30000 values in sample
DEBUG:  Have 3000 distinct values in sample
DEBUG:  Have 2998 multiple values in sample
DEBUG:  calc 2.000000 f1 for Modified Chaudhuri rule
DEBUG:  calc 8.073919 term1 for Modified Chaudhuri rule
DEBUG:  calc 3006.000000 distinct via Modified Chaudhuri rule

So it is kind of doing the right thing.  However, a modifier function that
_actually_ tends to 1 as numrows -> totalrows (and is more accurate at small
sample sizes) would be much better!  (Where are those 2nd-year math books
when you want them...)

> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distinct estimate was an outlier.

Repeated tests on the FreeBSD 4.4 machine obtained values consistently in
the 34000-36000 range, so the sampling method itself looks sound - the
estimates are consistent, just consistently too high.

> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.  I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30.  I appreciate your feedback on this point.
Agreed - 100 seems too big; 10->30 *feels* much better.

regards

Mark
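A side note on the "modifier function" idea: the estimator literature does
contain formulas with exactly the limiting behaviour asked for above. The
sketch below is illustrative only - it is not the code in analyze.c - and
implements what I believe is the smoothed jackknife ("Duj1") estimator of
Haas and Stokes, fed with the sample figures from the traces above:

/*
 * Illustrative sketch only - not the analyze.c code.  The smoothed
 * jackknife estimator is
 *
 *      D = n * d / (n - f1 + f1 * n / N)
 *
 * where n = sample rows, N = total rows, d = distinct values seen in the
 * sample, f1 = values seen exactly once.  When n == N the denominator
 * collapses to n, so the estimate is exactly d - it tends to the truth as
 * the sample approaches the whole table, which is the property the ad hoc
 * log10() modifier is reaching for.
 */
#include <stdio.h>

static double
smoothed_jackknife(double n, double N, double d, double f1)
{
	return n * d / (n - f1 + f1 * n / N);
}

int
main(void)
{
	/* traced figures at stats target 10: d = 1875, f1 = 1116 */
	printf("target 10 : %.0f\n",
		   smoothed_jackknife(3000.0, 3000036.0, 1875.0, 1116.0));
	/* traced figures at stats target 100: d = 3000, f1 = 2 */
	printf("target 100: %.0f\n",
		   smoothed_jackknife(30000.0, 3000036.0, 3000.0, 2.0));
	return 0;
}

Against the traced samples this returns roughly 2984 and 3000 respectively,
versus the true 3000, which supports the suspicion that the estimation
equation, rather than the sampling, is the weak point. Whether such an
estimator behaves as well on skewed distributions is a separate question.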
In the process of attempting to understand the data in pg_stats, I created
a (very) simple example:

CREATE TABLE test(id integer);

INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
INSERT INTO test VALUES(5);

ANALYZE test;

SELECT * FROM pg_stats WHERE tablename = 'test';

tablename         test
attname           id
null_frac         0
avg_width         4
n_distinct        -0.5
most_common_vals  {"1","2"}
most_common_freqs {"0.4","0.3"}
histogram_bounds  {"3","4","5"}
correlation       1

Everything looks good except for n_distinct (it is negative - it should
be 5). (I wasn't too worried about avg_width.)

Using fairly crude tracing (adding elog calls) in
src/backend/commands/analyze.c:

DEBUG:  Analyzing test
DEBUG:  Analyze : beginning a column
DEBUG:  Have 10 total values in relation (totalrows)
DEBUG:  Have 10 values in relation (numrows)
DEBUG:  Have 10 values in sample (values_cnt)
DEBUG:  Have 5 distinct values in sample (ndistinct)
DEBUG:  Have 2 multiple values in sample (nmultiple)
DEBUG:  calc 5.000000 distinct via Chaudhuri rule
DEBUG:  calc -0.500000 distinct via >10 percent rowcount rule

So we had the correct answer before applying the 10-percent-rowcount code,
this being line 1340 or thereabouts:

	if (stats->stadistinct > 0.1 * totalrows)
	{
		stats->stadistinct = -(stats->stadistinct / totalrows);
	}

My example is pretty contrived, but I wonder if I have "stumbled" on a bug
here.

regards

Mark
Mark kirkwood <markir@slingshot.co.nz> writes:
> Everything looks good except for n_distinct (it is negative - it should
> be 5).

This is okay; see the documentation about what n_distinct means.

I'd be interested in a better rule-of-thumb for estimating whether the
number of distinct values in a column should be assumed to scale with the
total table size or not.  That 10% threshold was just pulled out of the
air...

			regards, tom lane
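To spell out what the negative value means with the numbers from the example:
the sketch below is illustrative only (the function name is mine, not the
planner's), and assumes the documented convention that a negative n_distinct
is minus the fraction of rows that are distinct, to be multiplied by the
current row count at plan time:

/*
 * Illustration only - not planner code.  Convention assumed: a positive
 * stadistinct is an absolute count of distinct values; a negative one is
 * minus the fraction of the table that is distinct, so the estimate scales
 * as the table grows.
 */
#include <stdio.h>

static double
distinct_estimate(double stadistinct, double reltuples)
{
	if (stadistinct < 0.0)
		return -stadistinct * reltuples;	/* scales with table size */
	return stadistinct;				/* fixed count */
}

int
main(void)
{
	/* the test table: stadistinct stored as -0.5, 10 rows today */
	printf("at 10 rows   : %.0f distinct\n", distinct_estimate(-0.5, 10.0));
	/* the same stored value if the table later grows to 1000 rows */
	printf("at 1000 rows : %.0f distinct\n", distinct_estimate(-0.5, 1000.0));
	return 0;
}

At 10 rows this reproduces the original 5, and the stored fraction is the
whole reason the 10% threshold exists: the open question is when "half the
rows are distinct" is likely to keep holding as a table grows, which is the
rule-of-thumb being asked about.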