Re: On Distributions In 7.2 - Mailing list pgsql-general
From: Mark Kirkwood
Subject: Re: On Distributions In 7.2
Date:
Msg-id: 01102921535600.04563@spikey.slithery.org
In response to: Re: On Distributions In 7.2 (Longish) (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Tom Lane wrote (edited):

> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?

My limited tracing of analyze.c (adding various elog calls in lines
1288-1344 of the Beta 1 sources):

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation (?) (numrows)
DEBUG: Have 3000036 total values in relation (totalrows)
DEBUG: Have 3000 values in sample (values_cnt)
DEBUG: Have 1875 distinct values in sample (ndistinct)
DEBUG: Have 813 multiple values in sample (nmultiple)
DEBUG: calc 1116.000000 f1 for Chaudhuri rule
DEBUG: calc 35291.230469 term1 for Chaudhuri rule
DEBUG: calc 34397.000000 distinct via Chaudhuri rule

That term1 seems to be the problem.

I am thinking of a sort of modifier function that tames the extreme values
of term1 when sample size << rows in relation, but tends to 1 as sample
size ~ rows in relation.

A quick and dirty experiment with a vaguely suitable such function in
analyze.c (approx line 1333):

/* term1 = sqrt(totalrows / (double) numrows) * f1; */
term1 = sqrt(totalrows / (double) numrows) * f1
        / log10(log10(1000) * totalrows / numrows);

ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 3000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 3000 values in sample
DEBUG: Have 1898 distinct values in sample
DEBUG: Have 798 multiple values in sample
DEBUG: calc 1100.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 10004.025391 term1 for Modified Chaudhuri rule
DEBUG: calc 10802.000000 distinct via Modified Chaudhuri rule

which is better (only a factor of 3 out now!).

Now to see whether I have killed the larger-sample behaviour:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

DEBUG: Analyzing fact0
DEBUG: Analyze : beginning a column
DEBUG: Have 30000 values in relation
DEBUG: Have 3000036 total values in relation
DEBUG: Have 30000 values in sample
DEBUG: Have 3000 distinct values in sample
DEBUG: Have 2998 multiple values in sample
DEBUG: calc 2.000000 f1 for Modified Chaudhuri rule
DEBUG: calc 8.073919 term1 for Modified Chaudhuri rule
DEBUG: calc 3006.000000 distinct via Modified Chaudhuri rule

So it is kind of doing the right thing; however, a modifier function that
_actually_ tends to 1 as numrows -> totalrows (and is more accurate at
small sample sizes) would be much better! (Where are those 2nd-year math
books when you want them....)

> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distinct estimate was an outlier.

Repeated tests on the FreeBSD 4.4 machine obtained values consistently in
the 34000-36000 range, so the sampling method looks great.
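For anyone who wants to play with the damping outside the backend, here is a
small standalone C sketch (not a patch: the estimate_distinct() wrapper and
its name are mine, and the final "term1 + nmultiple, rounded" step is inferred
from the distinct figures reported above rather than copied from analyze.c)
comparing the stock rule with the modified one on the sample statistics from
the elog output:

#include <math.h>
#include <stdio.h>

/*
 * Standalone sketch: stock Chaudhuri-style estimate versus the damped
 * variant tried above, fed with the sample figures from the elog output.
 * estimate_distinct() is my own wrapper; the final rounding step is
 * inferred from the reported "distinct" values.
 */
static double
estimate_distinct(double totalrows, double numrows,
                  double f1, double nmultiple, int damped)
{
    double term1 = sqrt(totalrows / numrows) * f1;

    if (damped)                 /* the modifier from the experiment above */
        term1 /= log10(log10(1000.0) * totalrows / numrows);

    return floor(term1 + nmultiple + 0.5);
}

int
main(void)
{
    /* 3000-row sample (modified-rule run above): f1 = 1100, nmultiple = 798 */
    printf("stock : %.0f\n", estimate_distinct(3000036, 3000, 1100, 798, 0));
    printf("damped: %.0f\n", estimate_distinct(3000036, 3000, 1100, 798, 1));

    /* 30000-row sample: f1 = 2, nmultiple = 2998 */
    printf("damped, larger sample: %.0f\n",
           estimate_distinct(3000036, 30000, 2, 2998, 1));
    return 0;
}

Built with something like cc sketch.c -lm, the damped rule reproduces the
10802 and 3006 figures above, while the stock rule on the same 3000-row
sample figures still lands in the mid-30000s.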
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.  I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30.  I appreciate your feedback on this point.

Agreed, 100 seems too big, 10->30 *feels* much better.

regards

Mark