Re: Scaling SELECT:s with the number of disks on a stripe

From: Peter Schuller
Subject: Re: Scaling SELECT:s with the number of disks on a stripe
Date: ,
Msg-id: 20070402105355.GB57038@hyperion.scode.org
(view: Whole thread, Raw)
In response to: Re: Scaling SELECT:s with the number of disks on a stripe  (Andrew - Supernews)
List: pgsql-performance

Tree view

Scaling SELECT:s with the number of disks on a stripe  (Peter Schuller, )
 Re: Scaling SELECT:s with the number of disks on a stripe  ("Marc Mamin", )
  Re: Scaling SELECT:s with the number of disks on a stripe  (Peter Schuller, )
 Re: Scaling SELECT:s with the number of disks on a stripe  (Andrew - Supernews, )
  Re: Scaling SELECT:s with the number of disks on a stripe  (Peter Schuller, )
 Re: Scaling SELECT:s with the number of disks on a stripe  (Andrew - Supernews, )
  Re: Scaling SELECT:s with the number of disks on a stripe  (Peter Schuller, )
   Re: Scaling SELECT:s with the number of disks on a stripe  (Dave Cramer, )
 Re: Scaling SELECT:s with the number of disks on a stripe  (Andrew - Supernews, )
  Re: Scaling SELECT:s with the number of disks on a stripe  (Peter Schuller, )

Hello,

> > SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;
>
> I tested this on a 14-way software raid10 on freebsd, using pg 8.1.6, and
> couldn't reproduce anything like it. With one client I get about 200 disk
> requests per second, scaling almost exactly linearly for the first 5 or so
> clients, as expected. At 14 clients it was down to about 150 reqs/sec per
> client, but the total throughput continued to increase with additional
> concurrency up to about 60 clients, giving about 3600 reqs/sec (260 per
> disk, which is about right for 10krpm scsi disks under highly concurrent
> random loads).

Ok. That is very intersting; so there is definitely nothing
fundamental in PG that prevents the scaling (even if on FreeBSD).

> A good question. Have you tried testing the disks directly? e.g. create
> some huge files, and run a few concurrent random readers on them? That
> would test the array and the filesystem without involving postgres.

I have confirmed that I am seeing expected performance for random
short and highly concurrent reads in one large (> 200 GB) file. The
I/O is done using libaio however, so depending on implementation I
suppose the I/O scheduling behavior of the fs/raid driver might be
affected compared to having a number of concurrent threads doing
synchronous reads. I will try to confirm performance in a way that
will more closely match PostgreSQL's behavior.

I have to say though that I will be pretty surprised if the
performance is not matched in that test.

Is there any chance there is some operation system conditional code in
pg itself that might affect this behavior? Some kind of purposeful
serialization of I/O for example (even if that sounds like an
extremely strange thing to do)?

> This is entirely expected. With the larger row count, it is more likely
> (or so the planner estimates) that rows will need to be fetched from
> adjacent or at least nearby blocks, thus a plan which fetches rows in
> physical table order rather than index order would be expected to be
> superior. The planner takes into account the estimated startup cost and
> per-row cost when planning LIMIT queries; therefore it is no surprise
> that for larger limits, it switches to a plan with a higher startup cost
> but lower per-row cost.

Roger that, makes sense. I had misunderstood the meaning of the heap
scan.

> Most likely your index is small enough that large parts of it will be
> cached in RAM, so that the scan of the index to build the bitmap does
> not need to hit the disk much if at all.

Even so however, several seconds of CPU activity to scan the index for
a few tens of thousands of entries sounds a bit excessive. Or does it
not? Because at that level, the CPU bound period alone is approaching
the time it would take to seek for each entry instead. But then I
presume the amount of work is similar/the same for the other case,
except it's being done at the beginning of the query instead of before
each seek.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <>'
Key retrieval: Send an E-Mail to 
E-Mail:  Web: http://www.scode.org


Attachment

pgsql-performance by date:

From: "Alex Deucher"
Date:
Subject: postgres 7.4 vs. 8.x redux
From: Andrew - Supernews
Date:
Subject: Re: Scaling SELECT:s with the number of disks on a stripe