Our CLUSTER implementation is pessimal - Mailing list pgsql-hackers
| From | Gregory Stark |
|---|---|
| Subject | Our CLUSTER implementation is pessimal |
| Date | |
| Msg-id | 87vdxg6a3d.fsf@oxford.xeocode.com |
| Responses | Re: Our CLUSTER implementation is pessimal |
| List | pgsql-hackers |
One thing that's been annoying me for a while is that our CLUSTER implementation is really very slow. When I say very slow I mean it's really very very very slow. The problem is that it does a full index scan and looks up each tuple in the order of the index. That means it a) does a lot of random I/O and b) has to access the same heap pages over and over again. Our query planner knows better and normally does a sequential heap scan and sort for queries of this type, but CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table, so I added a random column, indexed it, and started CLUSTER running. After 6 hours I looked at how far it had gotten and found it was only about 20% done rewriting the heap (not including rebuilding the indexes). I cancelled it and built a new table in the desired order with CREATE TABLE AS ... ORDER BY in 45 minutes!

There are a couple of problems with this:

a) We need some way to decide *when* to do a sort and when to do an index scan. The planner has all this machinery, but we don't really have all the pieces handy to use it in a utility statement. This is especially important for the case where we're clustering a table that's already clustered; in that case an index scan could conceivably actually win (though I kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole heap tuples including visibility info, and even the old pieces that were removed didn't have quite the right interface and behaviour. We need to preserve t_self for the heap rewrite tools, and we need to be able to use _bt_mkscankey_nodata() to generate scan keys that match the index. The cleanest approach, I think, is to add the old raw-tuple methods back into tuplesort and tweak them to preserve t_self and take ScanKeys instead of attnums, etc.

c) Expression indexes are a problem. We would have to construct new tuples to hold the expression values while retaining the entire original heap tuple. Perhaps pass both an index tuple and a heap tuple to tuplesort and store both in the SortTuple? For now I figure we can just punt on expression indexes and always do an index scan.

I tested out the idea and it works reasonably well. The code might need some cleanup, including, I think, refactoring cluster_rel() and possibly tweaking the interface to tuplesort, but I'm fairly happy with it.

The results of clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop, using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x
        Table "public.x"
 Column |       Type       | Modifiers
--------+------------------+-----------
 i      | integer          |
 r      | double precision |
 repeat | text             |
Indexes:
    "xi" btree (i)
    "xi2" btree ((i + 0))
    "xir" btree (r)

postgres=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms
postgres=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms
postgres=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms

That's 3.5 hours for traditional cluster (the expression index xi2, which falls back to the index scan) and 12 minutes for cluster using a sort...

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
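For concreteness, here is a minimal SQL sketch of the two approaches compared in the post. The table name big_table, the column shuffle_key, and the index name are invented for illustration; the CLUSTER syntax follows the same "CLUSTER index ON table" form used in the session above (newer releases spell it CLUSTER table USING index).

-- Randomize via an indexed random column plus CLUSTER
-- (the approach that was still only ~20% done after 6 hours):
ALTER TABLE big_table ADD COLUMN shuffle_key double precision;
UPDATE big_table SET shuffle_key = random();
CREATE INDEX big_table_shuffle_idx ON big_table (shuffle_key);
CLUSTER big_table_shuffle_idx ON big_table;

-- Rewrite the table with an explicit sort instead
-- (the workaround that finished in 45 minutes):
CREATE TABLE big_table_shuffled AS
    SELECT * FROM big_table ORDER BY shuffle_key;

Note that CLUSTER also rebuilds the table's indexes and keeps its constraints and other dependent objects, while the CREATE TABLE AS rewrite produces a bare copy, so the comparison is really about the heap rewrite itself.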