Thread: On Distributions In 7.2 (Longish)

On Distributions In 7.2 (Longish)

From
Mark Kirkwood
Date:
The current sources (7.2) have introduced distributional factors into the
system statistics.

I have been examining the behaviour of these additions, using the dataset
from my "warehouse comparison" as a test bed - as it has some large-ish
tables with controllable data distributions.

I think the results are quite interesting....


Tables
------
                 Table "dim0"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 d0key  | integer                  |
 f1     | timestamp with time zone |
 f2     | character varying(20)    |
 f3     | character varying(20)    |
Indexes: dim0_q1 ( on f1 - UNIQUE)
Unique keys: dim0_pk
Rows : 3000

        Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Indexes: fact0_q1 (on d0key ONLY)
Rows : 3000000
Distribution : d0key uniformly distributed
               1000 occurrences for each value of d0key (i.e.
               0.001 frequency for each value of d0key)
               3000 distinct values of d0key

Query
-----

The query to be examined is :


SELECT
       d0.f3,
       count(f.val)
FROM dim0 d0,
     fact0 f
WHERE d0.d0key = f.d0key
AND   d0.f1 between '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;

This will join 88 rows from the dim0 table with 88000 from the fact0 table
and group them into 3 "month" buckets.


Case 1 :
--------

Consider using the default distributional sampling settings (10 quantiles) -

--ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;

System Stats
------------

SELECT most_common_vals, most_common_freqs, n_distinct FROM pg_stats WHERE
tablename = 'fact0' AND attname='d0key';


most_common_vals  {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
most_common_freqs {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
n_distinct        36511


Note that we are out by an order of magnitude here for the number of distinct
values - it should be 3000 - and the frequencies are a little overestimated
(they should be 0.001).



QUERY PLAN

Aggregate  (cost=29712.88..29749.00 rows=722 width=18)
  ->  Group  (cost=29712.88..29730.94 rows=7225 width=18)
        ->  Sort  (cost=29712.88..29712.88 rows=7225 width=18)
              ->  Nested Loop  (cost=0.00..29249.81 rows=7225 width=18)
                    ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)
                    ->  Index Scan using fact0_q1 on fact0 f  (cost=0.00..326.33 rows=81 width=8)


RESULTS
-------

 f3 | count
----+-------
 01 | 30000
 02 | 28000
 12 | 30000
(3 rows)

ELAPSED : 19s

Clearly the query statistics are underestimating the number of rows for the
nested loop join by about a factor of 10. The estimated row count from dim0
is fine.

Case 2 :
--------

Let's try 100 quantiles:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

System Stats
------------

most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more
values)...
most_common_freqs

{"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",....
n_distinct        3027

Now the number of distinct values is very accurate and frequencies are a
little underestimated.

QUERY PLAN

Aggregate  (cost=118518.96..118958.65 rows=8794 width=18)
  ->  Group  (cost=118518.96..118738.80 rows=87937 width=18)
        ->  Sort  (cost=118518.96..118518.96 rows=87937 width=18)
              ->  Hash Join  (cost=4.65..111297.50 rows=87937 width=18)
                    ->  Seq Scan on fact0 f  (cost=0.00..87693.36 rows=3000036 width=8)
                    ->  Hash  (cost=4.43..4.43 rows=89 width=10)
                          ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)

ELAPSED : 60s

The query statistics are now very accurate (e.g. 89 rows from dim0 and 87937
rows joined) - note that, ironically, the better execution plan was chosen
with the poorer statistical data!


The conclusion here seems to be that 10 quantiles are not quite enough for
accurate distributional data where (large) tables have a few thousand
distinct values. However, 100 quantiles were sufficient to get accurate
statistics.

Further Notes
-------------

Experimentation showed that accurate estimates (+/- 10%) of number of
distinct values did not begin to appear until about 75 quantiles were used.

On the other hand, reducing the number of distinct entries by a factor of 10,
while keeping the number of rows constant at 3000000, gave rise to accurate
statistics with 10 quantiles.

Given that the distribution used for fact0 is relatively benign (uniform),
the test is hopefully fair (i.e. not constructed specially to fox the
analyzer). However, the most common value for fact0 is non-unique (since all
d0key values have the same frequency) - I am uncertain whether this is
significant...

Tests were performed with the 7.2 snapshot of 16 Oct.

regards

Mark


Re: On Distributions In 7.2 (Longish)

From
Tom Lane
Date:
Mark Kirkwood <markir@slingshot.co.nz> writes:
> I have been examining the behaviour of these additions, using the dataset
> from my "warehouse comparison" as a test bed - as it has some large-ish
> tables with controllable data distributions.

Many thanks for this!  I haven't yet gotten much feedback on real-world
performance of the new statistics code.

> --ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;

> most_common_vals
> {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
> most_common_freqs
> {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
> n_distinct        36511

> Note we are out by an order of magnitude here for number distinct -
> should be 3000, and the frequencies are a little overestimated -
> should be 0.001

The factor-of-2 error for the "most common" frequency doesn't bother me;
that's to be expected, considering that we're using a random sampling
of the table rows.  However, the factor-of-10 error in the n_distinct
estimate is more troubling.  Could you trace through the code that
produces that estimate (in current sources, near line 1310 of
src/backend/commands/analyze.c) and see if you can tell why it's so
far off?
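
(For reference, the estimate in question - as it can be reconstructed from
the code fragment and debug traces quoted later in the thread - has roughly
the shape sketched below. This is an illustrative sketch, not the actual
analyze.c source, and the function name is invented here.)

#include <math.h>

/*
 * Sketch of the Chaudhuri-style distinct-value estimate discussed in this
 * thread, reconstructed from the fragment and debug output quoted further
 * down; not the actual analyze.c code.
 *
 *   totalrows - estimated rows in the whole table (N)
 *   numrows   - rows in the ANALYZE sample (n)
 *   d         - distinct values observed in the sample
 *   f1        - values observed exactly once in the sample
 */
static double
estimate_ndistinct(double totalrows, double numrows, double d, double f1)
{
    double term1 = sqrt(totalrows / numrows) * f1;

    return term1 + (d - f1);
}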

Another interesting thing to look at is how much the stats change in
repeated runs of ANALYZE.  Since we're taking a random sample,
successive runs can be expected to produce slightly different answers.
I would like to think that the answers won't change too far ... but
maybe this n_distinct estimate was an outlier.

> ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;

> most_common_vals  {"328","1242","524","1515","2058","2168",( 94 more
> values)...
> most_common_freqs
> {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",....
> n_distinct        3027

> Now the number of distinct values is very accurate and frequencies are a
> little underestimated.

This is a little strange as well.  If the true frequency of every d0key
value is 0.001, then the random sample should produce a few estimates
larger than that as well as a few smaller; so I'd expect the "most
common" entries to be larger than 0.001.  Furthermore, there is code in
there that's supposed to suppress "most common" entries that aren't
really much more common than the estimated average (cf lines 1360-1380).
Why didn't that trigger?
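
(A sketch of the kind of threshold test being described; the constants and
the function name here are illustrative guesses, not the actual 7.2 code.)

#include <stdbool.h>

/*
 * Keep a candidate "most common" value only if it showed up in the sample
 * noticeably more often than an average value would.  The 25% margin and
 * the minimum of two occurrences are illustrative, not the 7.2 constants.
 */
static bool
keep_as_mcv(double value_count, double samplerows, double ndistinct_est)
{
    double avgcount = samplerows / ndistinct_est;  /* expected hits per value */
    double mincount = avgcount * 1.25;             /* require notably more */

    if (mincount < 2.0)
        mincount = 2.0;                            /* and at least two hits */

    return value_count >= mincount;
}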

Looks like I may have some bugs to fix here.

> The conclusion here seems to be that the 10 quantiles are not quite enough
> for accurate distributional data where (large) tables have a few thousand
> distinct values. However 100 quantiles was sufficient to get accurate
> statistics.
> Experimentation showed that accurate estimates (+/- 10%) of number of
> distinct values did not begin to appear until about 75 quantiles were used.

One of the things I would like to establish before we finish beta is
whether the default stats target of 10 is large enough.  I am not very
comfortable with raising it as far as 75-100, but would not be fazed
with numbers around 20-30.  I appreciate your feedback on this point.

I wonder, though, whether what we're looking at isn't just a problem
with the number-of-distinct-values estimation equation.  The one that's
in there seemed a little phony to me, but I couldn't find anything else
in a cursory literature search.

            regards, tom lane

Re: On Distributions In 7.2 (Longish)

From
Mark Kirkwood
Date:
Tom

1)  I will look at analyze.c and
2)  do more analyze runs with 10 buckets to see if I am getting some bad
probabilistic effects.

With respect to 2), I will also try the same examination on a FreeBSD 4.4 box
that I have here (just to check that the Linux Mandrake 8.0 machine I use is
not the source of any randomness problems).

regards

Mark


Re: On Distributions In 7.2

From
Mark Kirkwood
Date:
Tom Lane wrote (edited....)
> The factor-of-2 error for the "most common" frequency doesn't bother me;
> that's to be expected, considering that we're using a random sampling
> of the table rows.  However, the factor-of-10 error in the n_distinct
> estimate is more troubling.  Could you trace through the code that
> produces that estimate (in current sources, near line 1310 of
> src/backend/commands/analyze.c) and see if you can tell why it's so
> far off?

My limited tracing of analyze.c (adding various elog calls at lines
1288-1344 of the Beta 1 sources):

ANALYZE fact0;

DEBUG:  Analyzing fact0
DEBUG:    Analyze : beginning a column
DEBUG:      Have 3000 values in relation (?)         (numrows)
DEBUG:      Have 3000036 total values in relation (totalrows)
DEBUG:      Have 3000 values in sample            (values_cnt)
DEBUG:      Have 1875 distinct values in sample   (ndistinct)
DEBUG:      Have 813 multiple values in sample    (nmultiple)
DEBUG:      calc 1116.000000 f1 for Chaudhuri rule
DEBUG:      calc 35291.230469 term1 for Chaudhuri rule
DEBUG:      calc 34397.000000 distinct via Chaudhuri rule

That term1 seems like the problem. I am thinking of a sort of modifier
function that tames the extreme values for term1 when sample size << rows in
relation, but tends to 1 as sample size ~ rows in relation.

A quick and dirty experiment with a vaguely suitable such function in
analyze.c (approx. line 1333):

/*                      term1 = sqrt(totalrows / (double) numrows) * f1; */
                        term1 = sqrt(totalrows / (double) numrows) * f1 /
                                   log10(log10(1000) * totalrows/numrows);

ANALYZE fact0;

DEBUG:  Analyzing fact0
DEBUG:    Analyze : beginning a column
DEBUG:      Have 3000 values in relation
DEBUG:      Have 3000036 total values in relation
DEBUG:      Have 3000 values in sample
DEBUG:      Have 1898 distinct values in sample
DEBUG:      Have 798 multiple values in sample
DEBUG:      calc 1100.000000 f1 for Modified Chaudhuri rule
DEBUG:      calc 10004.025391 term1 for Modified Chaudhuri rule
DEBUG:      calc 10802.000000 distinct via Modified Chaudhuri rule

which is better (only a factor of 3 out now!).

Now to see whether I have killed the larger-sample behaviour:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

DEBUG:  Analyzing fact0
DEBUG:    Analyze : beginning a column
DEBUG:      Have 30000 values in relation
DEBUG:      Have 3000036 total values in relation
DEBUG:      Have 30000 values in sample
DEBUG:      Have 3000 distinct values in sample
DEBUG:      Have 2998 multiple values in sample
DEBUG:      calc 2.000000 f1 for Modified Chaudhuri rule
DEBUG:      calc 8.073919 term1 for Modified Chaudhuri rule
DEBUG:      calc 3006.000000 distinct via Modified Chaudhuri rule

So it is kind of doing the right thing; however, a modifier function that
_actually_ tends to 1 as numrows -> totalrows (and is more accurate at small
sample sizes) would be much better! (Where are those 2nd-year math books when
you want them...)
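
(For anyone wanting to reproduce the arithmetic, here is a small standalone
check of the two formulas against the sample counts from the second trace
above. It is only a sketch - the variable names follow the debug output, but
this is not analyze.c itself.)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double totalrows = 3000036.0;  /* rows in fact0 */
    double numrows   = 3000.0;     /* sample size at the default target of 10 */
    double f1        = 1100.0;     /* values seen exactly once in the sample */
    double ndistinct = 1898.0;     /* distinct values seen in the sample */

    /* original rule: sqrt(N/n) * f1 */
    double term1 = sqrt(totalrows / numrows) * f1;

    /* modified rule from the experiment above */
    double term1_mod = sqrt(totalrows / numrows) * f1 /
                       log10(log10(1000) * totalrows / numrows);

    printf("original term1 = %.1f -> ndistinct ~ %.0f\n",
           term1, term1 + (ndistinct - f1));
    printf("modified term1 = %.1f -> ndistinct ~ %.0f\n",
           term1_mod, term1_mod + (ndistinct - f1));
    return 0;
}

With these inputs the original rule gives roughly 35600 distinct values and
the modified rule roughly 10800, in line with the 36511 estimate from the
first post and the 10802 figure in the trace above.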

>
> Another interesting thing to look at is how much the stats change in
> repeated runs of ANALYZE.  Since we're taking a random sample,
> successive runs can be expected to produce slightly different answers.
> I would like to think that the answers won't change too far ... but
> maybe this n_distict estimate was an outlier.

Repeated tests on the FreeBSD 4.4 machine produced values consistently in the
34000-36000 range, so the sampling itself looks consistent - the large
estimate is not an outlier.
>
>
> One of the things I would like to establish before we finish beta is
> whether the default stats target of 10 is large enough.  I am not very
> comfortable with raising it as far as 75-100, but would not be fazed
> with numbers around 20-30.  I appreciate your feedback on this point.
>
Agreed, 100 seems too big, 10->30 *feels* much better.

regards

Mark

More On 7.2 Distributions - Estimates For Number Distinct < 0

From
Mark Kirkwood
Date:
In the process of attempting to understand the data in pg_stats, I created a
(very) simple example :

CREATE TABLE test(id integer);

INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(1);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
INSERT INTO test VALUES(5);

ANALYZE test;

SELECT * FROM pg_stats WHERE tablename='test';

tablename        test
attname          id
null_frac        0
avg_width        4
n_distinct       -0.5
most_common_vals {"1","2"}
most_common_vals {"0.4","0.3"}
histogram_bounds {"3","4","5"}
correlation      1

Everything looks good except for n_distinct (it's negative - it should be 5).
(I wasn't too worried about avg_width.)

Using fairly crude tracing (adding elog calls) in
src/backend/commands/analyze.c :

DEBUG:  Analyzing test
DEBUG:    Analyze : beginning a column
DEBUG:      Have 10 total values in relation (totalrows)
DEBUG:      Have 10 values in relation       (numrows)
DEBUG:      Have 10 values in sample         (values_cnt)
DEBUG:      Have 5 distinct values in sample (ndistinct)
DEBUG:      Have 2 multiple values in sample (nmultiple)
DEBUG:      calc 5.000000 distinct via Chaudhuri rule
DEBUG:      calc -0.500000 distinct via >10 percent rowcount rule

So we had the correct answer before applying the 10 percent rowcount code.

This 10 percent rowcount code is at line 1340 or thereabouts:

if (stats->stadistinct > 0.1 * totalrows)
{
    stats->stadistinct = -(stats->stadistinct / totalrows);
}

My example is pretty contrived, but I wonder if I have "stumbled" on a bug
here.

regards

Mark


Re: More On 7.2 Distributions - Estimates For Number Distinct < 0

From
Tom Lane
Date:
Mark Kirkwood <markir@slingshot.co.nz> writes:
> everything looks good except for n_distinct ( its negative - should be 5)

This is okay; see the documentation about what n_distinct means.
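
(For illustration: a negative n_distinct is the convention for "scales with
the table" - the stored value is the negated fraction of rows that are
distinct, so -0.5 on this 10-row table still means an estimate of 5 distinct
values, and the estimate grows with the table. A minimal sketch of how such
a value is turned back into a number - not the actual planner code, and the
function name is invented:)

/*
 * Positive values are absolute counts of distinct values; negative values
 * are a fraction of the current row count, so the estimate scales as the
 * table grows.
 */
static double
distinct_estimate(double stadistinct, double ntuples)
{
    if (stadistinct > 0.0)
        return stadistinct;            /* absolute number of distinct values */
    return -stadistinct * ntuples;     /* e.g. -0.5 * 10 rows = 5 distinct */
}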

I'd be interested in a better rule-of-thumb for estimating whether the
number of distinct values in a column should be assumed to scale with
the total table size or not.  That 10% threshold was just pulled out
of the air...

            regards, tom lane