On Clusters - Mailing list pgsql-sql

From: Mark Kirkwood
Subject: On Clusters
Date:
Msg-id: 01030718574700.02422@spikey.slithery.org
List: pgsql-sql
A previous posting mentioning clusters prompted me to revisit some earlier
tests done on clustered and unclustered data.

It appears that currently (7.1beta5) the optimizer is unaware of any
clustering on a table - how important is that?

To answer this question I used my "pet" data warehouse tables:
        Table "fact1"   3000000 rows ~ 350Mb

 Attribute |  Type   | Distribution
-----------+---------+-----------------------------------------------
 d0key     | integer | 3000 distinct values 0-9000, clustered
 d1key     | integer |
 val       | integer |
 filler    | text    |

Index: fact1_pk (d0key, d0key)   cluster "key"

        Table "fact2"   3000000 rows ~ 350Mb

 Attribute |  Type   | Distribution
-----------+---------+-----------------------------------------------
 d0key     | integer | 3000 distinct values 0-9000, uniformly spread
 d1key     | integer |
 val       | integer |
 filler    | text    |

Index: fact2_pk (d0key, d0key)
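
For anyone wanting to recreate the setup, something along these lines will
build tables of this shape (a sketch only: the bulk data load is not shown,
I am reading the index columns as (d0key, d1key), and CLUSTER is written in
the 7.1 "CLUSTER indexname ON table" form):

create table fact1 (
    d0key   integer,
    d1key   integer,
    val     integer,
    filler  text        -- padding to bring the table up to ~350Mb
);
create index fact1_pk on fact1 (d0key, d1key);
-- ... load 3000000 rows, with d0key arriving in order ...
cluster fact1_pk on fact1;      -- physically order fact1 by the index

create table fact2 (
    d0key   integer,
    d1key   integer,
    val     integer,
    filler  text
);
create index fact2_pk on fact2 (d0key, d1key);
-- ... load the same 3000000 rows, with d0key uniformly spread ...

vacuum analyze fact1;
vacuum analyze fact2;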

The sample queries used to shed some light on the nature of the difference
are: firstly, the index scan -

explain select count(*) 
from fact1 where d0key between 200 and 279;

Aggregate  (cost=58664.62..58664.62 rows=1 width=0)
  ->  Index Scan using fact1_pk on fact1  (cost=0.00..58598.72 rows=26360 width=0)

and the sequential scan -

explain select count(*) 
from fact1 where d0key between 200 and 280;

Aggregate  (cost=59020.73..59020.73 rows=1 width=0)
  ->  Seq Scan on fact1  (cost=0.00..58954.00 rows=26693 width=0)

and analogous versions for fact2 (with the same execution plans).

On the unclustered table fact2 the optimizer correctly assesses the point at
which to switch from an index scan to a sequential scan - both queries take
about 30s.

However on the clustered table fact1, the (same) choice results in a jump
from 1s for the index scan to 30s for the sequential scan.

(this was the guts of the previous research... bear with me, those of you who
read the last article)

So how long should an index scan be used for? Some experimentation led me to
adjust the "where" clause in my queries to

where d0key between 0 and 4500

This produces a query plan of:

Aggregate  (cost=62692.75..62692.75 rows=1 width=0)
  ->  Seq Scan on fact1  (cost=0.00..58954.00 rows=1495498 width=0)

Coercing the optimizer with a brutal "set cpu_tuple_cost = 0.4" gives:

Aggregate  (cost=868673.82..868673.82 rows=1 width=0)
  ->  Index Scan using fact1_pk on fact1  (cost=0.00..864935.08 rows=1495498 width=0)

(note that these scan 1500000 rows, i.e. half the data)
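
For the record, the coercion itself is just a session-level SET before
re-running the explain, roughly:

show cpu_tuple_cost;        -- 0.01 by default
set cpu_tuple_cost = 0.4;
explain select count(*)
from fact1 where d0key between 0 and 4500;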

Testing these queries on fact1 gives run times of about 35s for both -

thus it is worthwhile to keep using index scans for up to 50%
of the (clustered) table data.

I found this result interesting, as I was thinking more like 15-20% of the 
table data would be the limit.

The answer to the original question (finally) is "it is pretty important",
as knowledge of the clustering drastically changes the optimal access path.

So what to do if you know you have clustered data? (either via CLUSTER or
"it just happens to go in that way").

Tampering with the various *_cost parameters to encourage index scans
seems to be the only solution (other suggestions welcome here), but this
tends to be too global in effect (for example, trying the previous query on
the (unclustered) fact2 with cpu_tuple_cost = 0.4 takes more than 300s - I
got tired of waiting...).
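
One way to at least limit the damage is to bracket only the clustered-table
queries with SET and RESET in the session, for example:

set cpu_tuple_cost = 0.4;       -- favour index scans on the clustered table
select count(*) from fact1 where d0key between 0 and 4500;
reset cpu_tuple_cost;           -- back to the default before touching fact2
select count(*) from fact2 where d0key between 200 and 280;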

So be careful out there...

Cheers

Mark



