Thread: Our CLUSTER implementation is pessimal
One thing that's been annoying me for a while is that our CLUSTER implementation is really very slow. When I say very slow I mean it's really very, very, very slow.

The problem is that it does a full index scan and looks up each tuple in the order of the index. That means it a) is doing a lot of random I/O and b) has to access the same pages over and over again.

Our query planner knows better and normally does a sequential heap scan and sort for queries of this type. But CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table, so I added a random column, indexed it and started CLUSTER running. After 6 hours I looked at how far it had gotten and found it was only about 20% done rewriting the heap (not including rebuilding the indexes). I cancelled it and built a new table in the desired order with CREATE TABLE AS ... ORDER BY in 45 minutes!

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index scan. The planner has all this machinery but we don't really have all the pieces handy to use it in a utility statement. This is especially important for the case where we're doing a cluster operation on a table that's already clustered. In that case an index scan could conceivably actually win (though I kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole tuples including visibility info. And actually even the old pieces that were removed had not quite the right interface and behaviour. We need to preserve t_self for the heap rewrite tools, and we need to be able to use _bt_mkscankey_nodata() to generate the scan keys that match the index. The cleanest approach, I think, is to just add back in the old raw tuple methods to tuplesort and tweak them to preserve t_self and take ScanKeys instead of attnums etc.

c) expression indexes are a problem. We would have to start by constructing new tuples to hold the expression values but retain the entire original heap tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store both in the sorttuple? For now I figure we can just punt on expression indexes and always do an index scan.

I tested out the idea and it works reasonably well. The code might need some cleanup including, I think, refactoring cluster_rel() and possibly tweaking the interface to tuplesort. But I'm fairly happy with it.

The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x
        Table "public.x"
 Column |       Type       | Modifiers
--------+------------------+-----------
 i      | integer          |
 r      | double precision |
 repeat | text             |
Indexes:
    "xi" btree (i)
    "xi2" btree ((i + 0))
    "xir" btree (r) CLUSTER

postgresql=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms

postgresql=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms

postgresql=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms

That's 3.5 hours for traditional cluster and 12 minutes for cluster using a sort...

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
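The raw-tuple sort in point (b) boils down to one invariant: whatever the comparator does, t_self must come out the other side untouched so the heap rewrite tools can use it. A toy sketch of that property (the struct and function names here are simplified stand-ins, not PostgreSQL's actual HeapTupleData or tuplesort API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for backend types; the real code would use
 * HeapTupleData, ItemPointerData and ScanKeyData. */
typedef struct { unsigned block; unsigned offset; } ItemPointer;
typedef struct {
    ItemPointer t_self;   /* must survive the sort for heap rewrite */
    int         key;      /* stands in for the indexed column(s) */
} RawTuple;

/* Comparator standing in for a ScanKey-driven comparison on the
 * index key; it never looks at t_self. */
static int rawtuple_cmp(const void *a, const void *b)
{
    const RawTuple *ta = a;
    const RawTuple *tb = b;
    return (ta->key > tb->key) - (ta->key < tb->key);
}

/* Sort whole tuples by index key; t_self rides along untouched,
 * which is exactly the property CLUSTER's heap rewrite needs. */
static void sort_raw_tuples(RawTuple *tuples, size_t n)
{
    qsort(tuples, n, sizeof(RawTuple), rawtuple_cmp);
}
```

In the real patch the comparator would be driven by the ScanKeys built with _bt_mkscankey_nodata(), but the t_self-preservation requirement is the same.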
On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:
> The problem is that it does a full index scan and looks up each tuple in the
> order of the index. That means it a) is doing a lot of random I/O and b) has
> to access the same pages over and over again.

<snip>

> a) We need some way to decide *when* to do a sort and when to do an index
> scan. The planner has all this machinery but we don't really have all the
> pieces handy to use it in a utility statement. This is especially important
> for the case where we're doing a cluster operation on a table that's already
> clustered. In that case an index scan could conceivably actually win (though I
> kind of doubt it). I don't really have a solution for this.

The case I had recently was a table that was hugely bloated: 300MB of data and only 110 live rows. A cluster was instant; a seqscan/sort would probably be much slower. A VACUUM FULL probably worse :)

Isn't there some compromise? Like, say, scanning the index to collect a few thousand records and then sorting them, the way a bitmap index scan does. It should be much more efficient than what we have now.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
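The compromise Martijn suggests, collecting a window of TIDs in index order and sorting each window into heap order before fetching (much as a bitmap heap scan localizes its I/O), might look roughly like this toy simulation (not backend code; the names are invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* A TID as (block, offset); fetching in block order turns random
 * heap I/O into mostly-sequential I/O within each batch. */
typedef struct { unsigned block; unsigned offset; } Tid;

static int tid_cmp(const void *a, const void *b)
{
    const Tid *ta = a;
    const Tid *tb = b;
    if (ta->block != tb->block)
        return (ta->block > tb->block) ? 1 : -1;
    return (ta->offset > tb->offset) - (ta->offset < tb->offset);
}

/* Take TIDs in index order, but sort each window of `batch` of them
 * into heap order before fetching, like a per-window bitmap scan.
 * Output order still follows the index between batches. */
static void batch_sort_tids(Tid *tids, size_t n, size_t batch)
{
    for (size_t i = 0; i < n; i += batch) {
        size_t len = (n - i < batch) ? n - i : batch;
        qsort(tids + i, len, sizeof(Tid), tid_cmp);
    }
}
```

The tradeoff is that a fetch batch no longer yields tuples in strict index order, so the rewriter would need to re-sort (or buffer) each batch's tuples by key before writing them out.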
Martijn van Oosterhout wrote:
> On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:
>> The problem is that it does a full index scan and looks up each tuple in the
>> order of the index. That means it a) is doing a lot of random I/O and b) has
>> to access the same pages over and over again.
>
> <snip>
>
>> a) We need some way to decide *when* to do a sort and when to do an index
>> scan. The planner has all this machinery but we don't really have all the
>> pieces handy to use it in a utility statement. This is especially important
>> for the case where we're doing a cluster operation on a table that's already
>> clustered. In that case an index scan could conceivably actually win (though I
>> kind of doubt it). I don't really have a solution for this.
>
> The case I had recently was a table that was hugely bloated: 300MB of data
> and only 110 live rows. A cluster was instant; a seqscan/sort would
> probably be much slower. A VACUUM FULL probably worse :)
>
> Isn't there some compromise? Like, say, scanning the index to collect a
> few thousand records and then sorting them, the way a bitmap index scan
> does. It should be much more efficient than what we have now.

Ideally we would use the planner, and the planner would choose the best plan for a bloated table like that as well (it probably does; I'm not sure). However, I'm not sure how much we can trust the statistics for a table we're about to CLUSTER. Often you run CLUSTER on a table you've just loaded or mass-updated.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> There are a couple problems with this:
> a) We need some way to decide *when* to do a sort and when to do an index
> scan. The planner has all this machinery but we don't really have all the
> pieces handy to use it in a utility statement.

Why not? You don't even need any quals when trying to cost a full-index scan.

> b) tuplesort no longer has the pieces needed to sort whole tuples including
> visibility info. And actually even the old pieces that were removed had not
> quite the right interface and behaviour. We need to preserve t_self for the
> heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
> generate the scan keys that match the index.

So you just broke it irredeemably for non-btree indexes, no?

			regards, tom lane
On Mon, 2008-09-01 at 00:25 +0100, Gregory Stark wrote:
> One thing that's been annoying me for a while is that our CLUSTER
> implementation is really very slow. When I say very slow I mean it's really
> very very very slow.

Does this implementation work towards being able to do

  CREATE INDEX ... CLUSTER TABLE

so that we can do both actions with just one sort of the data?

I think there needs to be an option to force this to do either sorts or index scans. On a large table you may not have the space to perform a full table sort, plus on a multi-column index we may not accurately predict the cost of an index scan.

(What is the change to elog.c about?)

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> I think there needs to be an option to force this to do either sorts or
> indexscans.

If we use the planner, "set enable_indexscan=off" or "set enable_sort=off" ought to work.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2008-09-08 at 13:52 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I think there needs to be an option to force this to do either sorts or
> > indexscans.
>
> If we use the planner, "set enable_indexscan=off" or "set
> enable_sort=off" ought to work.

Agreed - as long as that is explicitly in the docs.

I'm wondering whether we should put a limit on the size of each temp tablespace. This change will cause old admin jobs to break on disks that aren't big enough for the new way of doing it.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> I think there needs to be an option to force this to do either sorts or
>> indexscans.
>
> If we use the planner, "set enable_indexscan=off" or "set enable_sort=off"
> ought to work.

Yeah, I've been thinking about how to use the planner to do this. It seems to me it would be a much better solution because it would allow us to take advantage of other access paths as well (such as other indexes) and even new features that don't currently exist (index-only scans...).

To do that, it seems to me what we would need to do is add a function _pg_get_rawtuple_header() which returns the visibility information that HTSV needs. Then we would construct an SPI query like

  SELECT _pg_get_rawtuple_header(), * FROM tab ORDER BY col1, col2, col3, ...

For each tuple we'll have to deform it, reform it using the new tuple descriptor with just the columns (excluding the header), and pass that to the heap rewrite module, passing the header separately. Heap rewrite would have to call HTSV on just the header (with the same hack I put in for this patch to allow passing InvalidBuffer to HTSV).

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
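The header-as-extra-column scheme can be sketched with stand-in types; note that _pg_get_rawtuple_header() is itself hypothetical, and nothing below is real backend code:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the visibility fields HTSV would inspect. */
typedef struct { unsigned xmin; unsigned xmax; } RawHeader;

/* A "wide" executor tuple: the header column plus one user column. */
typedef struct { RawHeader hdr; int val; } WideTuple;

/* Stand-in for calling HTSV on just the header: treat a tuple whose
 * xmax is set as dead.  The real check is far more involved. */
static int header_is_live(const RawHeader *hdr)
{
    return hdr->xmax == 0;
}

/* The rewrite loop Gregory sketches: strip the header off each wide
 * tuple, test visibility on the header alone, and hand only the live
 * user columns on to the (simulated) heap rewrite. */
static size_t rewrite_live(const WideTuple *in, size_t n, int *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (header_is_live(&in[i].hdr))
            out[kept++] = in[i].val;
    return kept;
}
```

The point of the split is that visibility is decided from the header column alone, so the rest of the tuple can flow through the ordinary executor machinery untouched.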
Gregory Stark <stark@enterprisedb.com> writes:
> Yeah, I've been thinking about how to use the planner to do this.

I thought the answer to that was going to be more or less "call cost_sort() and cost_index() and compare the answers".

> To do that it seems to me what we would need to do is add a function
> _pg_get_rawtuple_header() which returns the visibility information that HTSV
> needs.

You seem to be confusing "use the planner" with "use the executor". All that we need here is a decision about which code path to take within CLUSTER. We don't need to bring in boatloads of irrelevant infrastructure --- especially not infrastructure that's going to be fighting us every step of the way. The executor isn't designed to return raw tuples and no magic function is going to change that.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> Yeah, I've been thinking about how to use the planner to do this.
>
> I thought the answer to that was going to be more or less "call
> cost_sort() and cost_index() and compare the answers".

That was the way I was headed. But the idea of using the executor is looking more and more attractive the more I think about it. It would solve this problem as well as future-proof us against any other new features which could speed up CLUSTER.

In particular I'm thinking of people clustering on a covering index (which isn't as uncommon as it sounds; if you have a covering index you probably do want to cluster it -- consider many-to-many join tables). We should be able to do an index-only scan which might be even faster than sorting.

Also, it would cleanly solve the expression index case, which my current solution can't (not without much larger changes to tuplesort, because the sort key isn't present in the sort tuple).

Also, if your table is very bloated it could be faster to do a bitmap index scan and sort. Or scan another, newer and better-organized index instead of the index you're clustering.

Basically I feel like what I've done is reproduce a small part of the planner and executor, and it would be a cleaner and more general solution to just do it properly.

>> To do that it seems to me what we would need to do is add a function
>> _pg_get_rawtuple_header() which returns the visibility information that HTSV
>> needs.
>
> You seem to be confusing "use the planner" with "use the executor".
> All that we need here is a decision about which code path to take
> within CLUSTER. We don't need to bring in boatloads of irrelevant
> infrastructure --- especially not infrastructure that's going to be
> fighting us every step of the way. The executor isn't designed to
> return raw tuples and no magic function is going to change that.
Actually, thinking about it, couldn't we just add a new system column to do this?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes:
> In particular I'm thinking of people clustering on a covering index (which
> isn't as uncommon as it sounds, if you have a covering index you probably do
> want to cluster it -- consider many-to-many join tables). We should be able to
> do an index-only scan which might be even faster than sorting.

[ scratches head... ] You need *all* the data from the heap. Or by "covering index" do you mean an index that contains the entire table contents? That doesn't really sound like a case we need to focus on; or at least this version of clustering isn't what it needs -- it wants an implementation where the table and the index are the same thing.

			regards, tom lane
Added to TODO:

* Improve CLUSTER performance by sorting to reduce random I/O

  http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php

---------------------------------------------------------------------------

Gregory Stark wrote:
> [ Original message quoted in full above; attachment skipped. ]

-- 
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +