Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Odd statistics behaviour in 7.2
Date
Msg-id 21967.1013968953@sss.pgh.pa.us
In response to Odd statistics behaviour in 7.2  ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
List pgsql-hackers
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It would seem that if you could determine if the number of distinct
> values is _increasing_ as you scan more rows, that an increase in table
> size would also cause an increase, e.g. if you have X distinct values
> looking at N rows, and 2X distinct values looking at 2N rows, that
> clearly would show a scale.

[ thinks for a while... ]  I don't think that'll help.  You could not
expect an exact 2:1 increase, except in the case of a simple unique
column, which isn't the problem anyway.  So the above would really
have to be coded as "count the number of distinct values in the sample
(d1) and the number in half of the sample (d2); then if d1/d2 >= X
assume the number of distinct values scales with table size".  X is a constant somewhere
between 1 and 2, but where?  I think you've only managed to trade one
arbitrary threshold for another one.
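
A minimal sketch of the d1/d2 check described above, to make the threshold problem concrete. The function names (distinct_ratio, looks_scaling) and the toy columns are hypothetical and exist only for illustration; this is not the code used by ANALYZE.

```python
def distinct_ratio(sample):
    """Return d1/d2, where d1 is the number of distinct values in the
    whole sample and d2 is the number in the first half of the sample."""
    if len(sample) < 2:
        raise ValueError("need at least two rows to split the sample")
    half = sample[: len(sample) // 2]
    d1 = len(set(sample))
    d2 = len(set(half))
    return d1 / d2


def looks_scaling(sample, threshold):
    """Apply the heuristic: assume the distinct count scales with table
    size when d1/d2 >= threshold (the constant X, between 1 and 2)."""
    return distinct_ratio(sample) >= threshold


# Toy columns (assumed data, 1000 sampled rows each).
unique_col = list(range(1000))                 # every value distinct
flag_col = [i % 2 for i in range(1000)]        # only two values
partial_col = [i % 750 for i in range(1000)]   # repeats after 750 rows

print(distinct_ratio(unique_col))    # 2.0
print(distinct_ratio(flag_col))      # 1.0
print(distinct_ratio(partial_col))   # 1.5

# The in-between column is classified differently depending on X.
print(looks_scaling(partial_col, 1.4))   # True
print(looks_scaling(partial_col, 1.6))   # False
```

The two extremes are easy; the in-between column flips its classification depending on where X is set, and the sample alone gives no principled way to choose that value.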

A more serious problem is that the above could easily be fooled by a
distribution that contains a few very-popular values and a larger number
of seldom-seen ones.  Consider for example a column "number of children"
over a database of families.  In a sample of a thousand or so, you might
well see only values 0..4 (or so); if you double the size of the sample,
and find a few rows with 5 to 10 kids, are you then correct to label the
column as scaling with the size of the database?
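
The "number of children" case can be sketched with made-up data. The sample below is an assumption constructed to match the description (values 0..4 in the first thousand rows, a handful of rows with 5 to 10 kids in the second thousand); distinct_ratio is the same hypothetical helper as in the half-sample check.

```python
def distinct_ratio(sample):
    """d1/d2: distinct values in the whole sample over distinct values
    in its first half (hypothetical helper, for illustration only)."""
    half = sample[: len(sample) // 2]
    return len(set(sample)) / len(set(half))


# First 1000 sampled families: only 0..4 children show up.
first_half = [i % 5 for i in range(1000)]

# Next 1000 sampled families: mostly 0..4 again, plus six rare rows
# with 5 through 10 children.
second_half = [i % 5 for i in range(994)] + [5, 6, 7, 8, 9, 10]

sample = first_half + second_half

d2 = len(set(first_half))   # 5 distinct values in the half sample
d1 = len(set(sample))       # 11 distinct values in the full sample
print(d1, d2, distinct_ratio(sample))   # 11 5 2.2
```

Here d1/d2 comes out above 2, so any threshold X between 1 and 2 would label the column as scaling with table size, even though the set of plausible values is small and bounded no matter how many families are stored.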
        regards, tom lane

