On Distributions In 7.2 (Longish) - Mailing list pgsql-general

From:    Mark Kirkwood
Subject: On Distributions In 7.2 (Longish)
Msg-id:  01102716083800.01060@spikey.slithery.org
List:    pgsql-general
The current sources (7.2) have introduced distributional factors into the
system statistics.
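
For reference, the new per-column distributional data is visible in the
pg_stats view; a minimal query (the same view is used for the measurements
below) looks like:

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'fact0';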

I have been examining the behaviour of these additions, using the dataset
from my "warehouse comparison" as a test bed, since it has some large-ish
tables with controllable data distributions.

I think the results are quite interesting....


Tables
------
                 Table "dim0"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 d0key  | integer                  |
 f1     | timestamp with time zone |
 f2     | character varying(20)    |
 f3     | character varying(20)    |
Indexes: dim0_q1 ( on f1 - UNIQUE)
Unique keys: dim0_pk
Rows : 3000

        Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Indexes: fact0_q1 (on d0key ONLY)
Rows : 3000000
Distribution : d0key uniformly distributed
               1000  occurrences for each value of d0key - i.e.
               0.001 frequency for each value of d0key
               3000  distinct values of d0key
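
The tables themselves came from my warehouse-comparison load scripts. For
anyone wanting to reproduce a similar uniform distribution without them, a
sketch along these lines would do on a recent server - generate_series() did
not exist in 7.2, so this is purely illustrative:

CREATE TABLE fact0 (
    d0key  integer,
    d1key  integer,
    d2key  integer,
    val    integer,
    filler text
);

-- 3000000 rows: d0key uniform over 3000 distinct values,
-- 1000 occurrences each, i.e. frequency 0.001.
-- d1key and d2key are left NULL here, since only d0key's
-- distribution matters for this test.
INSERT INTO fact0 (d0key, val, filler)
SELECT (n % 3000) + 1, n % 1000, repeat('x', 20)
FROM generate_series(0, 2999999) AS g(n);

CREATE INDEX fact0_q1 ON fact0 (d0key);
ANALYZE fact0;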

Query
-----

The query to be examined is:


SELECT
       d0.f3,
       count(f.val)
FROM dim0 d0,
     fact0 f
WHERE d0.d0key = f.d0key
AND   d0.f1 between '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;

This will join 88 rows from the dim0 table with 88000 rows from the fact0
table (each qualifying dim0 row matches 1000 fact0 rows) and group them into
3 "month" buckets.


Case 1 :
--------

Consider using the default distributional sampling settings (10 quantiles) -

--ALTER TABLE fact0 ALTER d0key SET STATISTICS 10;
ANALYZE fact0;
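
If I have it right, the per-column setting simply overrides the new
default_statistics_target parameter, which defaults to 10 - so the
commented-out ALTER above is effectively a no-op. The default can be checked
with:

SHOW default_statistics_target;   -- expect 10 on a stock 7.2 install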

System Stats
------------

SELECT most_common_vals, most_common_freqs, n_distinct FROM pg_stats WHERE
tablename = 'fact0' AND attname = 'd0key';


most_common_vals  {"2243","2751","105","250","525","1623","2112","2331","2983","28"}
most_common_freqs {"0.00233333","0.002","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00166667","0.00133333"}
n_distinct        36511


Note that we are out by an order of magnitude here for the number of distinct
values - it should be 3000 - and the frequencies are a little overestimated -
they should be 0.001.
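
One way to sanity-check the true figure, at the cost of a full scan of fact0,
is simply:

SELECT count(DISTINCT d0key) FROM fact0;   -- 3000 for this dataset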



QUERY PLAN

Aggregate  (cost=29712.88..29749.00 rows=722 width=18)
  ->  Group  (cost=29712.88..29730.94 rows=7225 width=18)
        ->  Sort  (cost=29712.88..29712.88 rows=7225 width=18)
              ->  Nested Loop  (cost=0.00..29249.81 rows=7225 width=18)
                    ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)
                    ->  Index Scan using fact0_q1 on fact0 f  (cost=0.00..326.33 rows=81 width=8)
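
As an aside, the plans shown here are plain EXPLAIN output; EXPLAIN ANALYZE
(also new in 7.2, as far as I remember) is handy for this kind of comparison,
since it prints the actual row counts next to the estimates:

EXPLAIN ANALYZE
SELECT d0.f3, count(f.val)
FROM dim0 d0, fact0 f
WHERE d0.d0key = f.d0key
AND   d0.f1 between '1999-12-01' AND '2000-02-29'
GROUP BY d0.f3;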


RESULTS
-------

 f3 | count
----+-------
 01 | 30000
 02 | 28000
 12 | 30000
(3 rows)

ELAPSED : 19s

Clearly the statistics lead the planner to underestimate the number of rows
for the nested loop join by about a factor of 10. The estimated rows from
dim0 are ok.
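
This is consistent with the overestimated n_distinct: the planner presumably
expects around 3000036 / 36511 (about 82) fact0 rows per d0key value - hence
the rows=81 on the inner index scan - giving roughly 89 * 82 (about 7300)
joined rows rather than the true 88 * 1000 = 88000.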

Case 2 :
--------

Let's try 100 quantiles:

ALTER TABLE fact0 ALTER d0key SET STATISTICS 100;
ANALYZE fact0;

System Stats
------------

most_common_vals  {"328","1242","524","1515","2058","2168",(94 more values)...
most_common_freqs {"0.0007","0.0007","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000666667","0.000633333",...
n_distinct        3027

Now the number of distinct values is very accurate and the frequencies are a
little underestimated.

QUERY PLAN

Aggregate  (cost=118518.96..118958.65 rows=8794 width=18)
  ->  Group  (cost=118518.96..118738.80 rows=87937 width=18)
        ->  Sort  (cost=118518.96..118518.96 rows=87937 width=18)
              ->  Hash Join  (cost=4.65..111297.50 rows=87937 width=18)
                    ->  Seq Scan on fact0 f  (cost=0.00..87693.36 rows=3000036 width=8)
                    ->  Hash  (cost=4.43..4.43 rows=89 width=10)
                          ->  Index Scan using dim0_q1 on dim0 d0  (cost=0.00..4.43 rows=89 width=10)

ELAPSED : 60s

The planner estimates are now very accurate (e.g. 89 rows from dim0 and 87937
rows joined) - note that, ironically, the better execution plan is chosen with
the poorer statistical data!
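
For what it's worth, the two plans can be compared under the same statistics
by nudging the planner with the enable_* settings - a quick, session-local
sketch, not something to leave on in production:

SET enable_hashjoin = off;
-- re-run the EXPLAIN / query from Case 1 here to see how the nested loop
-- plan is costed under the 100-quantile statistics
RESET enable_hashjoin;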


The conclusion here seems to be that 10 quantiles are not quite enough for
accurate distributional data where (large) tables have a few thousand distinct
values. However, 100 quantiles were sufficient to get accurate statistics.

Further Notes
-------------

Experimentation showed that accurate estimates (+/- 10%) of the number of
distinct values did not begin to appear until about 75 quantiles were used.

On the other hand, reducing the number of distinct values by a factor of 10,
while keeping the number of rows constant at 3000000, gave accurate statistics
with 10 quantiles.
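
For the record, the sweep over statistics targets described above is easy to
script. On a modern server something along the following lines does the job
(DO blocks, format() and FOREACH all postdate 7.2, so this is purely
illustrative, and the list of targets is just an example):

DO $$
DECLARE
    target int;
    nd     real;
BEGIN
    FOREACH target IN ARRAY ARRAY[10, 25, 50, 75, 100] LOOP
        -- raise the per-column target, re-gather stats, report the estimate
        EXECUTE format('ALTER TABLE fact0 ALTER COLUMN d0key SET STATISTICS %s', target);
        EXECUTE 'ANALYZE fact0 (d0key)';
        SELECT n_distinct INTO nd
        FROM   pg_stats
        WHERE  tablename = 'fact0' AND attname = 'd0key';
        RAISE NOTICE 'statistics target = %, estimated n_distinct = %', target, nd;
    END LOOP;
END
$$;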

Given that the distribution used for fact0 is relatively benign (uniform), the
test is hopefully fair (i.e. it was not constructed specially to fox the
analyzer). However, the most common value for fact0 is non-unique (since all
d0key values have the same frequency) - I am uncertain whether this is
significant...

Tests were performed with the 7.2 snapshot of 16 Oct.

regards

Mark