Thread: Our CLUSTER implementation is pessimal
One thing that's been annoying me for a while is that our CLUSTER implementation is really very slow. When I say very slow I mean it's really very, very, very slow.

The problem is that it does a full index scan and looks up each tuple in the order of the index. That means it a) is doing a lot of random I/O and b) has to access the same pages over and over again.

Our query planner knows better and normally does a sequential heap scan and sort for queries of this type. But CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table, so I added a random column, indexed it and started CLUSTER running. After 6 hours I looked at how far it had gotten and found it was only about 20% done rewriting the heap (not including rebuilding the indexes). I cancelled it and built a new table in the desired order with CREATE TABLE AS ... ORDER BY in 45 minutes!

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index scan. The planner has all this machinery but we don't really have all the pieces handy to use it in a utility statement. This is especially important for the case where we're doing a cluster operation on a table that's already clustered. In that case an index scan could conceivably actually win (though I kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole tuples including visibility info. And actually even the old pieces that were removed had not quite the right interface and behaviour. We need to preserve t_self for the heap rewrite tools, and we need to be able to use _bt_mkscankey_nodata() to generate the scan keys that match the index. The cleanest approach, I think, is to just add back in the old raw tuple methods to tuplesort and tweak them to preserve t_self and take ScanKeys instead of attnums etc.

c) expression indexes are a problem. We would have to start by constructing new tuples to hold the expression values but retain the entire original heap tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store both in the sorttuple? For now I figure we can just punt on expression indexes and always do an index scan.

I tested out the idea and it works reasonably well. The code might need some cleanup including, I think, refactoring cluster_rel() and possibly tweaking the interface to tuplesort. But I'm fairly happy with it.

The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x
        Table "public.x"
 Column |       Type       | Modifiers
--------+------------------+-----------
 i      | integer          |
 r      | double precision |
 repeat | text             |
Indexes:
    "xi" btree (i)
    "xi2" btree ((i + 0))
    "xir" btree (r) CLUSTER

postgresql=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms

postgresql=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms

postgresql=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms

That's 3.5 hours for traditional cluster and 12 minutes for cluster using a sort...

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
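The raw-tuple sort in point (b) boils down to one invariant: whatever the comparator does, t_self must come out the other side untouched so the heap rewrite tools can use it. A toy sketch of that property (the struct and function names here are simplified stand-ins, not PostgreSQL's actual HeapTupleData or tuplesort API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for backend types; the real code would use
 * HeapTupleData, ItemPointerData and ScanKeyData. */
typedef struct { unsigned block; unsigned offset; } ItemPointer;
typedef struct {
    ItemPointer t_self;   /* must survive the sort for heap rewrite */
    int         key;      /* stands in for the indexed column(s) */
} RawTuple;

/* Comparator standing in for a ScanKey-driven comparison on the
 * index key; it never looks at t_self. */
static int rawtuple_cmp(const void *a, const void *b)
{
    const RawTuple *ta = a;
    const RawTuple *tb = b;
    return (ta->key > tb->key) - (ta->key < tb->key);
}

/* Sort whole tuples by index key; t_self rides along untouched,
 * which is exactly the property CLUSTER's heap rewrite needs. */
static void sort_raw_tuples(RawTuple *tuples, size_t n)
{
    qsort(tuples, n, sizeof(RawTuple), rawtuple_cmp);
}
```

In the real patch the comparator would be driven by the ScanKeys built with _bt_mkscankey_nodata(), but the t_self-preservation requirement is the same.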
On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:
> The problem is that it does a full index scan and looks up each tuple in the
> order of the index. That means it a) is doing a lot of random I/O and b) has
> to access the same pages over and over again.

<snip>

> a) We need some way to decide *when* to do a sort and when to do an index
> scan. The planner has all this machinery but we don't really have all the
> pieces handy to use it in a utility statement. This is especially important
> for the case where we're doing a cluster operation on a table that's already
> clustered. In that case an index scan could conceivably actually win (though I
> kind of doubt it). I don't really have a solution for this.

The case I had recently was a table that was hugely bloated: 300MB of data and only 110 live rows. A cluster was instant; a seqscan/sort would probably be much slower. A VACUUM FULL probably worse :)

Isn't there some compromise? Like, say, scanning the index to collect a few thousand records and then sorting them, the way a bitmap index scan does. It should be much more efficient than what we have now.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
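The compromise Martijn suggests, collecting a window of TIDs in index order and sorting each window into heap order before fetching (much as a bitmap heap scan localizes its I/O), might look roughly like this toy simulation (not backend code; the names are invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* A TID as (block, offset); fetching in block order turns random
 * heap I/O into mostly-sequential I/O within each batch. */
typedef struct { unsigned block; unsigned offset; } Tid;

static int tid_cmp(const void *a, const void *b)
{
    const Tid *ta = a;
    const Tid *tb = b;
    if (ta->block != tb->block)
        return (ta->block > tb->block) ? 1 : -1;
    return (ta->offset > tb->offset) - (ta->offset < tb->offset);
}

/* Take TIDs in index order, but sort each window of `batch` of them
 * into heap order before fetching, like a per-window bitmap scan.
 * Output order still follows the index between batches. */
static void batch_sort_tids(Tid *tids, size_t n, size_t batch)
{
    for (size_t i = 0; i < n; i += batch) {
        size_t len = (n - i < batch) ? n - i : batch;
        qsort(tids + i, len, sizeof(Tid), tid_cmp);
    }
}
```

The tradeoff is that a fetch batch no longer yields tuples in strict index order, so the rewriter would need to re-sort (or buffer) each batch's tuples by key before writing them out.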
Martijn van Oosterhout wrote:
> On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:
>> The problem is that it does a full index scan and looks up each tuple in the
>> order of the index. That means it a) is doing a lot of random I/O and b) has
>> to access the same pages over and over again.
>
> <snip>
>
>> a) We need some way to decide *when* to do a sort and when to do an index
>> scan. The planner has all this machinery but we don't really have all the
>> pieces handy to use it in a utility statement. This is especially important
>> for the case where we're doing a cluster operation on a table that's already
>> clustered. In that case an index scan could conceivably actually win (though I
>> kind of doubt it). I don't really have a solution for this.
>
> The case I had recently was a table that was hugely bloated: 300MB of data
> and only 110 live rows. A cluster was instant; a seqscan/sort would
> probably be much slower. A VACUUM FULL probably worse :)
>
> Isn't there some compromise? Like, say, scanning the index to collect a
> few thousand records and then sorting them, the way a bitmap index scan
> does. It should be much more efficient than what we have now.

Ideally we would use the planner, and the planner would choose the best plan for a bloated table like that as well (it probably does; I'm not sure). However, I'm not sure how much we can trust the statistics for a table we're about to CLUSTER. Often you run CLUSTER on a table you've just loaded or mass-updated.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> There are a couple problems with this:
> a) We need some way to decide *when* to do a sort and when to do an index
> scan. The planner has all this machinery but we don't really have all the
> pieces handy to use it in a utility statement.

Why not? You don't even need any quals when trying to cost a full-index scan.

> b) tuplesort no longer has the pieces needed to sort whole tuples including
> visibility info. And actually even the old pieces that were removed had not
> quite the right interface and behaviour. We need to preserve t_self for the
> heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
> generate the scan keys that match the index.

So you just broke it irredeemably for non-btree indexes, no?

			regards, tom lane
On Mon, 2008-09-01 at 00:25 +0100, Gregory Stark wrote:
> One thing that's been annoying me for a while is that our CLUSTER
> implementation is really very slow. When I say very slow I mean it's really
> very very very slow.

Does this implementation work towards being able to do

  CREATE INDEX ... CLUSTER TABLE

so that we can do both actions with just one sort of the data?

I think there needs to be an option to force this to do either sorts or index scans. On a large table you may not have the space to perform a full table sort, plus on a multi-column index we may not accurately predict the cost of an index scan.

(What is the change to elog.c about?)

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> I think there needs to be an option to force this to do either sorts or
> indexscans.

If we use the planner, "set enable_indexscan=off" or "set enable_sort=off" ought to work.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2008-09-08 at 13:52 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I think there needs to be an option to force this to do either sorts or
> > indexscans.
>
> If we use the planner, "set enable_indexscan=off" or "set
> enable_sort=off" ought to work.

Agreed - as long as that is explicitly in the docs.

I'm wondering whether we should put a limit on the size of each temp tablespace. This change will cause old admin jobs to break on disks that aren't big enough for the new way of doing it.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> I think there needs to be an option to force this to do either sorts or
>> indexscans.
>
> If we use the planner, "set enable_indexscan=off" or "set enable_sort=off"
> ought to work.

Yeah, I've been thinking about how to use the planner to do this. It seems to me it would be a much better solution because it would allow us to take advantage of other access paths as well (such as other indexes) and even new features that don't currently exist (index-only scans...).

To do that, it seems to me what we would need to do is add a function _pg_get_rawtuple_header() which returns the visibility information that HTSV needs. Then we would construct an SPI query like

  SELECT _pg_get_rawtuple_header(), * FROM tab ORDER BY col1, col2, col3, ...

For each tuple we'll have to deform it, reform it using the new tuple descriptor with just the columns (excluding the header), and pass that to the heap rewrite module, passing the header separately. Heap rewrite would have to call HTSV on just the header (with the same hack I put in for this patch to allow passing InvalidBuffer to HTSV).

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
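The header-as-extra-column scheme can be sketched with stand-in types; note that _pg_get_rawtuple_header() is itself hypothetical, and nothing below is real backend code:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the visibility fields HTSV would inspect. */
typedef struct { unsigned xmin; unsigned xmax; } RawHeader;

/* A "wide" executor tuple: the header column plus one user column. */
typedef struct { RawHeader hdr; int val; } WideTuple;

/* Stand-in for calling HTSV on just the header: treat a tuple whose
 * xmax is set as dead.  The real check is far more involved. */
static int header_is_live(const RawHeader *hdr)
{
    return hdr->xmax == 0;
}

/* The rewrite loop Gregory sketches: strip the header off each wide
 * tuple, test visibility on the header alone, and hand only the live
 * user columns on to the (simulated) heap rewrite. */
static size_t rewrite_live(const WideTuple *in, size_t n, int *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (header_is_live(&in[i].hdr))
            out[kept++] = in[i].val;
    return kept;
}
```

The point of the split is that visibility is decided from the header column alone, so the rest of the tuple can flow through the ordinary executor machinery untouched.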
Gregory Stark <stark@enterprisedb.com> writes:
> Yeah, I've been thinking about how to use the planner to do this.

I thought the answer to that was going to be more or less "call cost_sort() and cost_index() and compare the answers".

> To do that it seems to me what we would need to do is add a function
> _pg_get_rawtuple_header() which returns the visibility information that HTSV
> needs.

You seem to be confusing "use the planner" with "use the executor". All that we need here is a decision about which code path to take within CLUSTER. We don't need to bring in boatloads of irrelevant infrastructure --- especially not infrastructure that's going to be fighting us every step of the way. The executor isn't designed to return raw tuples and no magic function is going to change that.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Yeah, I've been thinking about how to use the planner to do this.
>
> I thought the answer to that was going to be more or less "call
> cost_sort() and cost_index() and compare the answers".

That was the way I was headed. But the idea of using the executor is looking more and more attractive the more I think about it. It would solve this problem as well as future-proof us against any other new features which could speed up CLUSTER.

In particular I'm thinking of people clustering on a covering index (which isn't as uncommon as it sounds; if you have a covering index you probably do want to cluster it -- consider many-to-many join tables). We should be able to do an index-only scan which might be even faster than sorting.

Also, it would cleanly solve the expression index case, which my current solution can't (not without much larger changes to tuplesort, because the sort key isn't present in the sort tuple).

Also, if your table is very bloated it could be faster to do a bitmap index scan and sort. Or scan another, newer and better-organized index instead of the index you're clustering.

Basically I feel like what I've done is reproduce a small part of the planner and executor, and it would be a cleaner and more general solution to just do it properly.

>> To do that it seems to me what we would need to do is add a function
>> _pg_get_rawtuple_header() which returns the visibility information that HTSV
>> needs.
>
> You seem to be confusing "use the planner" with "use the executor".
> All that we need here is a decision about which code path to take
> within CLUSTER. We don't need to bring in boatloads of irrelevant
> infrastructure --- especially not infrastructure that's going to be
> fighting us every step of the way. The executor isn't designed to
> return raw tuples and no magic function is going to change that.
Actually, thinking about it, couldn't we just add a new system column to do this?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes:
> In particular I'm thinking of people clustering on a covering index (which
> isn't as uncommon as it sounds, if you have a covering index you probably do
> want to cluster it -- consider many-to-many join tables). We should be able to
> do an index-only scan which might be even faster than sorting.

[ scratches head... ] You need *all* the data from the heap. Or by "covering index" do you mean an index that contains the entire table contents? That doesn't really sound like a case we need to focus on; or at least this version of clustering isn't what it needs -- it wants an implementation where the table and the index are the same thing.

			regards, tom lane
Added to TODO:

* Improve CLUSTER performance by sorting to reduce random I/O

  http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php

---------------------------------------------------------------------------

Gregory Stark wrote:
> [ Original message quoted in full above; attachment skipped. ]

-- 
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +