Re: On Distributions In 7.2 - Mailing list pgsql-general
From: Mark Kirkwood
Subject: Re: On Distributions In 7.2
Date:
Msg-id: 01102921535600.04563@spikey.slithery.org
In response to: Re: On Distributions In 7.2 (Longish) (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Tom Lane wrote (edited):

> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?

My limited tracing of analyze.c (adding various elog calls in lines
1288-1344 of the Beta 1 sources):

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation (?) (numrows)
DEBUG: Have 3000036 total values in relation (totalrows)
DEBUG: Have 3000 values in sample (values_cnt)
DEBUG: Have 1875 distinct values in sample (ndistinct)
DEBUG: Have 813 multiple values in sample (nmultiple)
DEBUG: calc 1116.000000 f1 for Chaudhuri rule
DEBUG: calc 35291.230469 term1 for Chaudhuri rule
DEBUG: calc 34397.000000 distinct via Chaudhuri rule

That term1 seems to be the problem.

I am thinking of a sort of modifier function that tames the extreme values
of term1 when sample size << rows in relation, but tends to 1 as sample
size ~ rows in relation.

A quick and dirty experiment with a vaguely suitable such function in
analyze.c (approx line 1333):

/* term1 = sqrt(totalrows / (double) numrows) * f1; */
term1 = sqrt(totalrows / (double) numrows) * f1
        / log10(log10(1000) * totalrows / numrows);

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 3000 values in sample
DEBUG: Have 1898 distinct values in sample
DEBUG: Have 798 multiple values in sample
DEBUG: calc 1100.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 10004.025391 term1 for Modified Chaudhuri rule
DEBUG: calc 10802.000000 distinct via Modified Chaudhuri rule

which is better (only a factor of 3 out now!).

Now to see whether I have killed the larger-sample behaviour:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 30000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 30000 values in sample
DEBUG: Have 3000 distinct values in sample
DEBUG: Have 2998 multiple values in sample
DEBUG: calc 2.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 8.073919 term1 for Modified Chaudhuri rule
DEBUG: calc 3006.000000 distinct via Modified Chaudhuri rule

So it is kind of doing the right thing; however, a modifier function that
_actually_ tends to 1 as numrows -> totalrows (and is more accurate at
small sample sizes) would be much better! (Where are those 2nd-year math
books when you want them....)

> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distinct estimate was an outlier.

Repeated tests on the FreeBSD 4.4 machine obtained values consistently in
the 34000-36000 range, so the sampling method looks great.
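For anyone who wants to play with the damping outside the backend, here is a
small standalone C sketch (not a patch: the estimate_distinct() wrapper and
its name are mine, and the final "term1 + nmultiple, rounded" step is inferred
from the distinct figures reported above rather than copied from analyze.c)
comparing the stock rule with the modified one on the sample statistics from
the elog output:

#include <math.h>
#include <stdio.h>

/*
 * Standalone sketch: stock Chaudhuri-style estimate versus the damped
 * variant tried above, fed with the sample figures from the elog output.
 * estimate_distinct() is my own wrapper; the final rounding step is
 * inferred from the reported "distinct" values.
 */
static double
estimate_distinct(double totalrows, double numrows,
                  double f1, double nmultiple, int damped)
{
    double term1 = sqrt(totalrows / numrows) * f1;

    if (damped)                 /* the modifier from the experiment above */
        term1 /= log10(log10(1000.0) * totalrows / numrows);

    return floor(term1 + nmultiple + 0.5);
}

int
main(void)
{
    /* 3000-row sample (modified-rule run above): f1 = 1100, nmultiple = 798 */
    printf("stock : %.0f\n", estimate_distinct(3000036, 3000, 1100, 798, 0));
    printf("damped: %.0f\n", estimate_distinct(3000036, 3000, 1100, 798, 1));

    /* 30000-row sample: f1 = 2, nmultiple = 2998 */
    printf("damped, larger sample: %.0f\n",
           estimate_distinct(3000036, 30000, 2, 2998, 1));
    return 0;
}

Built with something like cc sketch.c -lm, the damped rule reproduces the
10802 and 3006 figures above, while the stock rule on the same 3000-row
sample figures still lands in the mid-30000s.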
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.  I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30.  I appreciate your feedback on this point.

Agreed, 100 seems too big, 10->30 *feels* much better.

regards

Mark