Re: Hardware/OS recommendations for large databases ( - Mailing list pgsql-performance

From	Bruce Momjian
Subject	Re: Hardware/OS recommendations for large databases (
Date	November 22, 2005 20:13:38
Msg-id	200511230013.jAN0DKV10698@candle.pha.pa.us
In response to	Re: Hardware/OS recommendations for large databases ( (Greg Stark <gsstark@mit.edu>)
Responses	Re: Hardware/OS recommendations for large databases ( Re: Hardware/OS recommendations for large databases (
List	pgsql-performance

Greg Stark wrote:
>
> Alan Stange <stange@rentec.com> writes:
>
> > The point you're making doesn't match my experience with *any* storage or program
> > I've ever used, including postgresql.   Your point suggests that the storage
> > system is idle  and that postgresql is broken because it isn't able to use the
> > resources available...even when the cpu is very idle.  How can that make sense?
>
> Well I think what he's saying is that Postgres is issuing a read, then waiting
> for the data to return. Then it does some processing, and goes back to issue
> another read. The CPU is idle half the time because Postgres isn't capable of
> doing any work while waiting for i/o, and the i/o system is idle half the time
> while the CPU intensive part happens.
>
> (Consider as a pathological example a program that reads 8k then sleeps for
> 10ms, and loops doing that 1,000 times. Now consider the same program
> optimized to read 8M asynchronously and sleep for 10s. By the time it's
> finished sleeping it has probably read in all 8M. Whereas the program that
> read 8k in little chunks interleaved with small sleeps would probably take
> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
> idle.)
>
> It's a reasonable theory and it's not inconsistent with the results you sent.
> But it's not exactly proven either. Nor is it clear how to improve matters.
> Adding additional threads to handle the i/o adds an enormous amount of
> complexity and creates lots of opportunity for other contention that could
> easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader, that reads-ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.
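
The arithmetic behind the pathological example quoted above can be sketched as a small model. It assumes each 8k read takes about 10ms, a figure the example implies ("twice as long") but never states, and the function names are invented for illustration only. With those numbers the read-then-process loop comes out at about 20s and the fully overlapped version at about 10s.

```python
def serial_time(n_blocks, io_s, cpu_s):
    """Read a block, wait for it, process it, repeat: nothing overlaps."""
    return n_blocks * (io_s + cpu_s)


def overlapped_time(n_blocks, io_s, cpu_s):
    """Idealized read-ahead: block i+1 is fetched while block i is processed."""
    if n_blocks == 0:
        return 0.0
    # The first read and the last processing step cannot be hidden;
    # in between, the slower of the two stages sets the pace.
    return io_s + (n_blocks - 1) * max(io_s, cpu_s) + cpu_s


N = 1000      # loop count from the quoted example
IO = 0.010    # ASSUMPTION: one 8k read takes about 10ms
CPU = 0.010   # the 10ms "sleep" standing in for CPU work

serial = serial_time(N, IO, CPU)
overlapped = overlapped_time(N, IO, CPU)
print(f"serial:     {serial:.2f}s")
print(f"overlapped: {overlapped:.2f}s")
```

The model ignores the cost of whatever does the overlapping (threads, context switches, contention), which is exactly the part that could eat the gain.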

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.
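
For the index-scan case, where the kernel cannot guess the next block, one mechanism that avoids extra threads is to hint the kernel about blocks that will be wanted soon. The sketch below uses `posix_fadvise()` with `POSIX_FADV_WILLNEED` through Python's `os` module. It is only an illustration of the idea, not something PostgreSQL is stated to do in this thread; `prefetch_blocks` and `read_block` are hypothetical helpers, and the scratch file stands in for a table's data file. The later `pread()` still copies the block from the kernel cache into user memory, so this addresses the wait, not the memcpy() cost.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def prefetch_blocks(fd, block_numbers):
    """Ask the kernel to start reading these blocks; returns without waiting."""
    if not hasattr(os, "posix_fadvise"):
        return False  # no posix_fadvise on this platform; caller just reads normally
    for blkno in block_numbers:
        os.posix_fadvise(fd, blkno * BLOCK_SIZE, BLOCK_SIZE,
                         os.POSIX_FADV_WILLNEED)
    return True


def read_block(fd, blkno):
    """Ordinary synchronous read of one block (still a kernel-to-user copy)."""
    return os.pread(fd, BLOCK_SIZE, blkno * BLOCK_SIZE)


# Usage: block i of the scratch file is filled with byte value i.
with tempfile.NamedTemporaryFile() as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK_SIZE)
    f.flush()
    wanted = [3, 1]  # non-sequential order, as an index scan might request
    hinted = prefetch_blocks(f.fileno(), wanted)
    # ... CPU work on earlier blocks would go here while the I/O proceeds ...
    blocks = [read_block(f.fileno(), b) for b in wanted]
```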

So what are our options?  mmap()?  I have no idea.  It seems a larger
page size does help.
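
To make the mmap() question concrete, here is a minimal sketch of the difference between reading a block and mapping it. With a read, the kernel copies the block into a buffer owned by the process; with a mapping, the process looks at the kernel's cached pages directly. Whether that could work with PostgreSQL's shared buffers is left open here, as it is above; the scratch file and the 8k block size are assumptions for the example.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size for the example

with tempfile.NamedTemporaryFile() as f:
    # Two blocks: the first full of "A", the second full of "B".
    f.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
    f.flush()

    # read path: the kernel copies block 1 into a new bytes object.
    copied = os.pread(f.fileno(), BLOCK_SIZE, BLOCK_SIZE)

    # mmap path: the file's pages are mapped into our address space.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        whole = memoryview(m)
        view = whole[BLOCK_SIZE:2 * BLOCK_SIZE]  # slicing a view copies nothing
        same = (view == copied)   # both paths see the same bytes
        first_byte = view[0]
        # Views must be released before the mapping can be closed.
        view.release()
        whole.release()

print(same, chr(first_byte))
```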

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
