Home > mailing lists

Re: Our CLUSTER implementation is pessimal - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: Our CLUSTER implementation is pessimal
Date	January 7, 2009 23:25:47
Msg-id	200901080024.n080OCt27964@momjian.us Whole thread Raw
In response to	Our CLUSTER implementation is pessimal (Gregory Stark <stark@enterprisedb.com>)
List	pgsql-hackers

Tree view

Added to TODO:
Improve CLUSTER performance by sorting to reduce random I/O    *
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
 

---------------------------------------------------------------------------

Gregory Stark wrote:
> 
> One thing that's been annoying me for a while is that our CLUSTER
> implementation is really very slow. When I say very slow I mean it's really
> very very very slow.
> 
> The problem is that it does a full index scan and looks up each tuple in the
> order of the index. That means it a) is doing a lot of random i/o and b) has
> to access the same pages over and over again.
> 
> Our query planner knows better and normally does a sequential heap scan and
> sort for queries of this type. But CLUSTER has to use the index.
> 
> Just the other day I wanted to randomize the order of a table so I added a
> random column, indexed it and started cluster running. After 6 hours I looked
> at how far it had gotten and found it was only about 20% done rewriting the
> heap (not including rebuilding the indexes). I cancelled it and built a new
> table in the desired orderwith CREATE TABLE AS ... ORDER BY in 45 minutes!
> 
> There are a couple problems with this:
> 
> a) We need some way to decide *when* to do a sort and when to do an index
> scan. The planner has all this machinery but we don't really have all the
> pieces handy to use it in a utility statement. This is especially important
> for the case where we're doing a cluster operation on a table that's already
> clustered. In that case an index scan could conceivably actually win (though I
> kind of doubt it). I don't really have a solution for this.
> 
> b) tuplesort no longer has the pieces needed to sort whole tuples including
> visibility info. And actually even the old pieces that were removed had not
> quite the right interface and behaviour. We need to preserve t_self for the
> heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
> generate the scan keys that match the index. The cleanest approach I think is
> to just add back in the old raw tuple methods to tuplesort and tweak them to
> preserve t_self and take Scankeys instead of attnums etc.
> 
> c) expression indexes are a problem. We would have to start with constructing
> new tuples to hold the expression values but retain the entire original heap
> tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store
> both in the sorttuple? For now I figure we can just punt on expression indexes
> and always do an index sacn.
> 
> I tested out the idea and it works reasonably well. The code might need some
> cleanup including, I think refactoring cluster_rel() and possibly tweaking the
> interface to tuplesort. But I'm fairly happy with it.
> 
> The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop
> using maintenance_work_mem of 768MB and work_mem of 256MB:
> 
> postgres=# \d x
> 
>            Table "public.x"
>  Column |       Type       | Modifiers 
> --------+------------------+-----------
>  i      | integer          | 
>  r      | double precision | 
>  repeat | text             | 
> Indexes:
>     "xi" btree (i)
>     "xi2" btree ((i + 0))
>     "xir" btree (r) CLUSTER
> 
> postgresql=# CLUSTER xi ON x;
> CLUSTER
> Time: 736029.322 ms
> 
> postgresql=# CLUSTER xir ON x;
> CLUSTER
> Time: 723610.507 ms
> 
> postgresql=# CLUSTER xi2 ON x;
> CLUSTER
> Time: 12842793.346 ms
> 
> 
> That's 3.5 hours for traditional cluster and 12 minutes for cluster using a
> sort...
> 

[ Attachment, skipping... ]

> 
> -- 
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>   Ask me about EnterpriseDB's RemoteDBA services!
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

pgsql-hackers by date:

From: Tom Lane
Date: 07 January 2009, 23:16:57
Subject: Re: Updates of SE-PostgreSQL 8.4devel patches (r1389)

From: Tom Lane
Date: 07 January 2009, 23:25:57
Subject: Re: parallel restore

Re: Our CLUSTER implementation is pessimal - Mailing list pgsql-hackers

Previous

Next