Our CLUSTER implementation is pessimal - Mailing list pgsql-hackers

One thing that's been annoying me for a while is that our CLUSTER
implementation is really very slow. When I say very slow I mean it's really
very very very slow.

The problem is that it does a full index scan and looks up each tuple in the
order of the index. That means it a) is doing a lot of random i/o and b) has
to access the same pages over and over again.

Our query planner knows better and normally does a sequential heap scan and
sort for queries of this type. But CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table so I added a
random column, indexed it and started cluster running. After 6 hours I looked
at how far it had gotten and found it was only about 20% done rewriting the
heap (not including rebuilding the indexes). I cancelled it and built a new
table in the desired orderwith CREATE TABLE AS ... ORDER BY in 45 minutes!

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement. This is especially important
for the case where we're doing a cluster operation on a table that's already
clustered. In that case an index scan could conceivably actually win (though I
kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole tuples including
visibility info. And actually even the old pieces that were removed had not
quite the right interface and behaviour. We need to preserve t_self for the
heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
generate the scan keys that match the index. The cleanest approach I think is
to just add back in the old raw tuple methods to tuplesort and tweak them to
preserve t_self and take Scankeys instead of attnums etc.

c) expression indexes are a problem. We would have to start with constructing
new tuples to hold the expression values but retain the entire original heap
tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store
both in the sorttuple? For now I figure we can just punt on expression indexes
and always do an index sacn.

I tested out the idea and it works reasonably well. The code might need some
cleanup including, I think refactoring cluster_rel() and possibly tweaking the
interface to tuplesort. But I'm fairly happy with it.

The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop
using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x

           Table "public.x"
 Column |       Type       | Modifiers
--------+------------------+-----------
 i      | integer          |
 r      | double precision |
 repeat | text             |
Indexes:
    "xi" btree (i)
    "xi2" btree ((i + 0))
    "xir" btree (r) CLUSTER

postgresql=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms

postgresql=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms

postgresql=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms


That's 3.5 hours for traditional cluster and 12 minutes for cluster using a
sort...


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Attachment

pgsql-hackers by date:

Previous
From: "David Rowley"
Date:
Subject: Re: TODO item: Implement Boyer-Moore searching (First time hacker)
Next
From: Alvaro Herrera
Date:
Subject: Re: posix advises ...