Thread: regarding CLUSTER and HUGE work_mem / maintenance_work_mem

regarding CLUSTER and HUGE work_mem / maintenance_work_mem

From

Jon Nelson

Date:

27 January 2012, 16:43:51

Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
Furthermore, let's say I have a machine with sufficient memory for me
to set the work_mem  and maintenance_work_mem to 20GB (just for this
session).
When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
way of strace) performing an index scan which amounts to large
quantities of random I/O.
In my case, that means it takes a very, very long time. PostgreSQL is
largely at defaults, except for a 2GB shared_buffers and a few
unrelated changes. The system itself has 32GB of physical RAM and has
plenty free.
Why didn't PostgreSQL just read the table into memory (and the
interesting index) as a sequential scan, sort, and then write it out?
It seems like there would be more than enough memory for that. The
sequential I/O rate on this machine is 50-100x the random I/O rate.

I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux 6.1.

--
Jon

Re: regarding CLUSTER and HUGE work_mem / maintenance_work_mem

From

Heikki Linnakangas

Date:

27 January 2012, 17:05:31

On 27.01.2012 19:43, Jon Nelson wrote:
> Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
> Furthermore, let's say I have a machine with sufficient memory for me
> to set the work_mem  and maintenance_work_mem to 20GB (just for this
> session).
> When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
> way of strace) performing an index scan which amounts to large
> quantities of random I/O.
> In my case, that means it takes a very, very long time. PostgreSQL is
> largely at defaults, except for a 2GB shared_buffers and a few
> unrelated changes. The system itself has 32GB of physical RAM and has
> plenty free.
> Why didn't PostgreSQL just read the table into memory (and the
> interesting index) as a sequential scan, sort, and then write it out?
> It seems like there would be more than enough memory for that. The
> sequential I/O rate on this machine is 50-100x the random I/O rate.
>
> I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux 6.1.

The suppport for doing a seqscan+sort in CLUSTER was introduced in
version 9.1. Before that, CLUSTER always did an indexscan. See release
notes: http://www.postgresql.org/docs/9.1/static/release-9-1.html#AEN107416

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: regarding CLUSTER and HUGE work_mem / maintenance_work_mem

From

Jon Nelson

Date:

28 January 2012, 01:34:59

On Fri, Jan 27, 2012 at 12:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 27.01.2012 19:43, Jon Nelson wrote:
>>
>> Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
>> Furthermore, let's say I have a machine with sufficient memory for me
>> to set the work_mem  and maintenance_work_mem to 20GB (just for this
>> session).
>> When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
>> way of strace) performing an index scan which amounts to large
>> quantities of random I/O.
>> In my case, that means it takes a very, very long time. PostgreSQL is
>> largely at defaults, except for a 2GB shared_buffers and a few
>> unrelated changes. The system itself has 32GB of physical RAM and has
>> plenty free.
>> Why didn't PostgreSQL just read the table into memory (and the
>> interesting index) as a sequential scan, sort, and then write it out?
>> It seems like there would be more than enough memory for that. The
>> sequential I/O rate on this machine is 50-100x the random I/O rate.
>>
>> I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux
>> 6.1.
>
>
> The suppport for doing a seqscan+sort in CLUSTER was introduced in version
> 9.1. Before that, CLUSTER always did an indexscan. See release notes:
> http://www.postgresql.org/docs/9.1/static/release-9-1.html#AEN107416

That's what I get for digging through the source (git) but working
with 8.4.10, on a Friday, at the end of a long week.
Thanks for pointing that out to somebody that should have known better.


--
Jon

Re: regarding CLUSTER and HUGE work_mem / maintenance_work_mem

From

Scott Marlowe

Date:

28 January 2012, 02:04:28

On Fri, Jan 27, 2012 at 7:34 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> On Fri, Jan 27, 2012 at 12:05 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 27.01.2012 19:43, Jon Nelson wrote:
>>>
>>> Let's say I have a 7GB table with 3-4 indices for a total of 10-12GB.
>>> Furthermore, let's say I have a machine with sufficient memory for me
>>> to set the work_mem  and maintenance_work_mem to 20GB (just for this
>>> session).
>>> When I issue a CLUSTER using one of the indices, I see PostgreSQL (by
>>> way of strace) performing an index scan which amounts to large
>>> quantities of random I/O.
>>> In my case, that means it takes a very, very long time. PostgreSQL is
>>> largely at defaults, except for a 2GB shared_buffers and a few
>>> unrelated changes. The system itself has 32GB of physical RAM and has
>>> plenty free.
>>> Why didn't PostgreSQL just read the table into memory (and the
>>> interesting index) as a sequential scan, sort, and then write it out?
>>> It seems like there would be more than enough memory for that. The
>>> sequential I/O rate on this machine is 50-100x the random I/O rate.
>>>
>>> I'm using 8.4.10 (with the 'inet' de-toasting patch) on Scientific Linux
>>> 6.1.
>>
>>
>> The suppport for doing a seqscan+sort in CLUSTER was introduced in version
>> 9.1. Before that, CLUSTER always did an indexscan. See release notes:
>> http://www.postgresql.org/docs/9.1/static/release-9-1.html#AEN107416
>
> That's what I get for digging through the source (git) but working
> with 8.4.10, on a Friday, at the end of a long week.
> Thanks for pointing that out to somebody that should have known better.

But if you're stuck on < 9.1 for a while, the workaround is to cluster
the table yourself by using a select * ... order by pkey.  For
randomly distributed tables this is far faster for a first time
cluster.  After that, subsequent clusters won't have as much work to
do and the older method for clustering should work ok.

It's kinda funny to have a complaint against pgsql for NOT using a
sequential scan.  Most DBAs that come from other DBAs are upset when
it doesn't use an index.