Home > mailing lists

Re: Feature suggestion : FAST CLUSTER - Mailing list pgsql-performance

From	Jim Nasby
Subject	Re: Feature suggestion : FAST CLUSTER
Date	May 29, 2007 04:57:53
Msg-id	8EFDCEBD-76AA-4A26-A8A8-10B5B3C8DC03@decibel.org Whole thread Raw
In response to	Re: Feature suggestion : FAST CLUSTER (PFC <lists@peufeu.com>)
Responses	Re: Feature suggestion : FAST CLUSTER
List	pgsql-performance

Tree view

On May 27, 2007, at 12:34 PM, PFC wrote:
> On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby
> <decibel@decibel.org> wrote:
>> On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
>>>     This does not run a complete sort on the table. It would be
>>> about as
>>>     fast  as your seq scan disk throughput. Obviously, the end
>>> result is not as
>>> good  as a real CLUSTER since the table will be made up of
>>> several ordered
>>> chunks and a range lookup. Therefore, a range lookup on the
>>> clustered
>>> columns would need at most N seeks, versus 1 for a really
>>> clustered table.
>>> But it only scans the table once and writes it once, even
>>> counting index
>>> rebuild.
>>
>> Do you have any data that indicates such an arrangement would be
>> substantially better than less-clustered data?
>     While the little benchmark that will answer your question is
> running, I'll add a few comments :
>
>     I have been creating a new benchmark for PostgreSQL and MySQL,
> that I will call the Forum Benchmark. It mimics the activity of a
> forum.
>     So far, I have got interesting results about Postgres and InnoDB
> and will publish an extensive report with lots of nasty stuff in
> it, in, say, 2 weeks, since I'm doing this in spare time.
>
>     Anyway, forums like clustered tables, specifically clusteriing
> posts on (topic_id, post_id), in order to be able to display a page
> with one disk seek, instead of one seek per post.
>     PostgreSQL humiliates InnoDB on CPU-bound workloads (about 2x
> faster since I run it on dual core ; InnoDB uses only one core).
> However, InnoDB can automatically cluster tables without
> maintenance. This means InnoDB will, even though it sucks and is
> awfully bloated, run a lot faster than postgres if things become IO-
> bound, ie. if the dataset is larger than RAM.
>     Postgres needs to cluster the posts table in order to keep going.
> CLUSTER is very slow. I tried inserting into a new posts table,
> ordering by (post_id, topic_id), then renaming the new table in
> place of the old. It is faster, but still slow when handling lots
> of data.
>     I am trying other approaches, some quite hack-ish, and will report
> my findings.

I assume you meant topic_id, post_id. :)

The problem with your proposal is that it does nothing to ensure that
posts for a topic stay together as soon as the table is large enough
that you can't sort it in a single pass. If you've got a long-running
thread, it's still going to get spread out throughout the table.

What you really want is CLUSTER CONCURRENTLY, which I believe is on
the TODO list. BUT... there's another caveat here: for any post where
the row ends up being larger than 2k, the text is going to get
TOASTed anyway, which means it's going to be in a separate table, in
a different ordering. I don't know of a good way to address that; you
can cluster the toast table, but you'll be clustering on an OID,
which isn't going to help you.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

pgsql-performance by date:

From: PFC
Date: 29 May 2007, 03:21:23
Subject: Re: Feature suggestion : FAST CLUSTER

From: PFC
Date: 29 May 2007, 05:41:35
Subject: Re: Feature suggestion : FAST CLUSTER

Re: Feature suggestion : FAST CLUSTER - Mailing list pgsql-performance

Previous

Next