Thread: Disadvantage to CLUSTER?

Disadvantage to CLUSTER?

From

Robert James

Date:

15 May 2012, 18:02:43

Besides the one time spent CLUSTERing, do I loose anything by doing it
for every table?  Does a CLUSTER slow anything down?

It would seem to me that a) a CLUSTER should never have worse
performance than a random order b) may have better performance and c)
has the benefits of a VACUUM and REINDEX.  So, I may as well cluster
every table, if only by the primary key.

Re: Disadvantage to CLUSTER?

From

Merlin Moncure

Date:

15 May 2012, 18:18:48

On Tue, May 15, 2012 at 4:02 PM, Robert James <srobertjames@gmail.com> wrote:
> Besides the one time spent CLUSTERing, do I loose anything by doing it
> for every table?  Does a CLUSTER slow anything down?
>
> It would seem to me that a) a CLUSTER should never have worse
> performance than a random order b) may have better performance and c)
> has the benefits of a VACUUM and REINDEX.  So, I may as well cluster
> every table, if only by the primary key.

CLUSTER has huge disadvantages over VACUUM in that it's a full table
rewrite and (especially) has to take out a full table lock while it
operates.  CLUSTER by the way is not a persistent operation -- table
will drift way from index order over time.  CLUSTER is however a
general improvement over VACUUM FULL which also packs and reorganizes
the table.

For relatively small tales clustering for performance is pointless.
The reason to cluster is an artifact of the way postgres organizes
data into pages and is an attempt to make it so that you can get
related tuples that are frequently queried together physically
grouped.

If your primary key is a surrogate, it may not be that useful to
CLUSTER on it. For natural primary keys, it almost always makes sense
to cluster on the primary key.

merlin

Re: Disadvantage to CLUSTER?

From

Steve Crawford

Date:

15 May 2012, 18:27:58

On 05/15/2012 02:02 PM, Robert James wrote:
> Besides the one time spent CLUSTERing, do I loose anything by doing it
> for every table?  Does a CLUSTER slow anything down?
>
> It would seem to me that a) a CLUSTER should never have worse
> performance than a random order b) may have better performance and c)
> has the benefits of a VACUUM and REINDEX.  So, I may as well cluster
> every table, if only by the primary key.
>
Cluster requires an exclusive lock - you need a sufficiently long
maintenance window. (But so does vacuum full which is typically slower
and vacuum-full doesn't make shiny new indexes.)

Cluster requires sufficient disk space to create the new copies of the
table.

Cluster only reorganizes data at the time you do the cluster -
additional use after the cluster will cause data to lose its order.

Since you are clustering based on an index you can only choose one
cluster order.

Cluster should have better performance but it depends on the index you
choose relative to the queries you typically run. Let's say that you
have an accounting system where you most often grab the most recent
month worth of data. Clustering that keeps that data together will be
beneficial but you could easily have a different index, item-number for
instance, that would, if used for clustering, leave the commonly used
data scattered throughout the table. If that table was an append-only
detail table the most commonly used data would naturally clump together
which clustering would then destroy.

With the exception of disk-space constraints there are few reasons to
use vacuum-full/reindex instead of cluster.

Cheers,
Steve

Re: Disadvantage to CLUSTER?

From

Robert James

Date:

15 May 2012, 18:44:51

On 5/15/12, Steve Crawford <scrawford@pinpointresearch.com> wrote:
> On 05/15/2012 02:02 PM, Robert James wrote:
>> Besides the one time spent CLUSTERing, do I loose anything by doing it
>> for every table?  Does a CLUSTER slow anything down?

> Cluster should have better performance but it depends on the index you
> choose relative to the queries you typically run. Let's say that you
> have an accounting system where you most often grab the most recent
> month worth of data. Clustering that keeps that data together will be
> beneficial but you could easily have a different index, item-number for
> instance, that would, if used for clustering, leave the commonly used
> data scattered throughout the table. If that table was an append-only
> detail table the most commonly used data would naturally clump together
> which clustering would then destroy.

Okay, I understand why we still need VACUUM and why we can't always
CLUSTER.  But my question remains: assuming I have some down time, do
I loose anything by CLUSTER.  Your answer is, I believe: Not normally,
but there is one case where you do.  That's an append-only table,
where you're generally interested in the most recent data, but you
cluster on something else.

Does clustering really hurt in that case? Is the planner smart enough
to realize that the data you want is towards the end only? I would
think that it doesn't know this, and will, let's say, assume it is
scattered regardless and perform a full scan.  I guess the question
is: Does the natural order of data help if there's no explicit means
for the planner to prove it?

Re: Disadvantage to CLUSTER?

From

Merlin Moncure

Date:

15 May 2012, 19:20:13

On Tue, May 15, 2012 at 4:44 PM, Robert James <srobertjames@gmail.com> wrote:
> On 5/15/12, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>> On 05/15/2012 02:02 PM, Robert James wrote:
>>> Besides the one time spent CLUSTERing, do I loose anything by doing it
>>> for every table?  Does a CLUSTER slow anything down?
>
>> Cluster should have better performance but it depends on the index you
>> choose relative to the queries you typically run. Let's say that you
>> have an accounting system where you most often grab the most recent
>> month worth of data. Clustering that keeps that data together will be
>> beneficial but you could easily have a different index, item-number for
>> instance, that would, if used for clustering, leave the commonly used
>> data scattered throughout the table. If that table was an append-only
>> detail table the most commonly used data would naturally clump together
>> which clustering would then destroy.
>
>
> Okay, I understand why we still need VACUUM and why we can't always
> CLUSTER.  But my question remains: assuming I have some down time, do
> I loose anything by CLUSTER.  Your answer is, I believe: Not normally,
> but there is one case where you do.  That's an append-only table,
> where you're generally interested in the most recent data, but you
> cluster on something else.
>
> Does clustering really hurt in that case? Is the planner smart enough
> to realize that the data you want is towards the end only? I would
> think that it doesn't know this, and will, let's say, assume it is
> scattered regardless and perform a full scan.  I guess the question
> is: Does the natural order of data help if there's no explicit means
> for the planner to prove it?

If you have regular daily downtime then sure, you can cluster and
there is little disadvantage to doing so.  It will pack indexes and do
all kinds of other things that 24x7 shops have to do in a much more
complicated fashion.

the planner makes no assumptions about how the data is ordered in the table.

merlin

Re: Disadvantage to CLUSTER?

From

Steve Crawford

Date:

15 May 2012, 19:49:41

On 05/15/2012 02:44 PM, Robert James wrote:
>
> Okay, I understand why we still need VACUUM and why we can't always
> CLUSTER.  But my question remains: assuming I have some down time, do
> I loose anything by CLUSTER.  Your answer is, I believe: Not normally,
> but there is one case where you do.  That's an append-only table,
> where you're generally interested in the most recent data, but you
> cluster on something else.
>
> Does clustering really hurt in that case? Is the planner smart enough
> to realize that the data you want is towards the end only? I would
> think that it doesn't know this, and will, let's say, assume it is
> scattered regardless and perform a full scan.  I guess the question
> is: Does the natural order of data help if there's no explicit means
> for the planner to prove it?
>
The speed benefit of clustering (aside from compacting the files and
reindexing) comes from keeping data that is likely to be used together
stored together. I.e. all the data for this week or all the records for
invoice 12345. If you are really lucky, the data a query needs might be
in one or a few contiguous disk blocks.

But say you clustered your invoice-detail data on item number. Gathering
the detail records for one invoice could require many disk seeks and a
disk-block read per record. In that case clustering would hurt so you
should choose your cluster order intelligently.

BTW, be sure to run ANALYZE after clustering a table.

See http://www.postgresql.org/docs/current/static/sql-cluster.html

Cheers,
Steve