Buffer Management - Mailing list pgsql-hackers

From Curt Sampson
Subject Buffer Management
Date
Msg-id Pine.NEB.4.43.0206251232130.17448-100000@angelic.cynic.net
In response to Re: Index Scans become Seq Scans after VACUUM ANALYSE  ("J. R. Nield" <jrnield@usol.com>)
Responses Re: Buffer Management  (Curt Sampson <cjs@cynic.net>)
List pgsql-hackers
I'm splitting off this buffer management stuff into a separate thread.

On 24 Jun 2002, J. R. Nield wrote:

> I'll back off on that. I don't know if we want to use the OS buffer
> manager, but shouldn't we try to have our buffer manager group writes
> together by files, and pro-actively get them out to disk?

The only way the postgres buffer manager can "get [data] out to disk"
is to do an fsync(). For data files (as opposed to log files), this can
only slow down overall system throughput, as this would only disrupt the
OS's write management.
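To make the write()/fsync() distinction concrete, here is a minimal sketch in Python rather than the backend's C. The helper name write_block and the 8192-byte block size are assumptions for illustration only, not PostgreSQL's actual code. A plain write() only hands the data to the OS cache; fsync() is the call that blocks until the OS has pushed it to disk.

```python
import os
import tempfile


def write_block(path, data, force=False):
    """Write data to path; with force=True, also fsync() it to stable storage.

    Hypothetical helper for illustration, not PostgreSQL's buffer manager.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        while written < len(data):
            # write() returns once the OS has accepted the bytes into its
            # cache; the OS decides when they actually reach the disk.
            written += os.write(fd, data[written:])
        if force:
            # fsync() blocks until the OS reports the file's data is on disk.
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "block")
    write_block(path, b"\0" * 8192)              # left to the OS scheduler
    write_block(path, b"\0" * 8192, force=True)  # forced out immediately
```

The unforced call is the case where the OS is free to reorder and batch the write with everything else it has pending.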

> Right now, it
> looks like all our write requests are delayed as long as possible and
> the order in which they are written is pretty-much random, as is the
> backend that writes the block, so there is no locality of reference even
> when the blocks are adjacent on disk, and the write calls are spread-out
> over all the backends.

It doesn't matter. The OS will introduce locality of reference with its
write algorithms. Take a look at
   http://www.cs.wisc.edu/~solomon/cs537/disksched.html

for an example. Most OSes use the elevator or one-way elevator
algorithm.  So it doesn't matter whether it's one back-end or many
writing, and it doesn't matter in what order they do the write.
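As a rough sketch of why arrival order doesn't matter, here are the two algorithms in a few lines of Python. The function names and the block numbers are made up for illustration; a real kernel scheduler works on a queue that keeps changing while the head moves, which this ignores.

```python
def elevator_order(pending, head, direction=1):
    """Order in which an elevator (SCAN) scheduler services pending blocks.

    pending:   block numbers, in whatever order they arrived
    head:      current head position
    direction: +1 to sweep toward higher blocks first, -1 toward lower
    """
    ahead = sorted(b for b in pending if (b - head) * direction >= 0)
    behind = sorted(b for b in pending if (b - head) * direction < 0)
    if direction > 0:
        # Sweep up through the blocks ahead, then come back down.
        return ahead + behind[::-1]
    # Sweep down through the blocks ahead, then come back up.
    return ahead[::-1] + behind


def one_way_elevator_order(pending, head):
    """One-way elevator (C-SCAN): sweep upward, jump back, sweep upward again."""
    ahead = sorted(b for b in pending if b >= head)
    behind = sorted(b for b in pending if b < head)
    return ahead + behind


if __name__ == "__main__":
    # The same requests, as issued by one backend or interleaved from several.
    one_backend = [98, 183, 37, 122, 14, 124, 65, 67]
    many_backends = [14, 124, 98, 67, 183, 65, 37, 122]
    print(elevator_order(one_backend, head=53))
    print(elevator_order(many_backends, head=53))
    print(one_way_elevator_order(many_backends, head=53))
```

Both calls to elevator_order produce the same service order, because the result depends only on the set of pending blocks and the head position, not on who issued the requests or when.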

> Would it not be the case that things like read-ahead, grouping writes,
> and caching written data are probably best done by PostgreSQL, because
> only our buffer manager can understand when they will be useful or when
> they will thrash the cache?

Operating systems these days are not too bad at guessing what
you're doing. Pretty much every OS I've seen will do read-ahead when
it detects you're doing sequential reads, at least in the forward
direction. And Solaris is even smart enough to mark the pages you've
read as "not needed" so that they quickly get flushed from the cache,
rather than blowing out your entire cache if you go through a large
file.

> Would O_DSYNC|O_RSYNC turn off the cache?

No. I suppose there's nothing to stop it doing so, in some
implementations, but the interface is not designed for direct I/O.
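For reference, a minimal Python sketch of what opening a file with those flags looks like. The helper name open_synchronous is hypothetical, and the getattr fallbacks are there because not every platform defines both constants. The flags make each write() wait for the data to reach the disk; they say nothing about bypassing the cache, which is why a read straight after the write can still be served from it.

```python
import os
import tempfile

# O_DSYNC and O_RSYNC are not defined on every platform; fall back to 0
# (no extra flag) where they are missing. This fallback is an assumption
# made so the sketch runs everywhere, not a recommendation.
SYNC_FLAGS = getattr(os, "O_DSYNC", 0) | getattr(os, "O_RSYNC", 0)


def open_synchronous(path):
    """Open path for synchronous I/O (hypothetical helper for illustration)."""
    return os.open(path, os.O_RDWR | os.O_CREAT | SYNC_FLAGS, 0o600)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "datafile")
    fd = open_synchronous(path)
    try:
        os.write(fd, b"hello")      # returns only after the data is on disk
        os.lseek(fd, 0, os.SEEK_SET)
        print(os.read(fd, 5))       # the OS may still serve this from its cache
    finally:
        os.close(fd)
```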

> Since you know a lot about NetBSD internals, I'd be interested in
> hearing about what postgresql looks like to the NetBSD buffer manager.

Well, it looks like pretty much any program, or group of programs,
doing a lot of I/O. :-)

> Am I right that strings of successive writes get randomized?

No; as I pointed out, they in fact get de-randomized as much as
possible. The more processes you have throwing out requests, the better
the throughput will be, in fact.

> What do our cache-hit percentages look like? I'm going to do some
> experimenting with this.

Well, that depends on how much memory you have and what your working
set is. :-)

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 




