Re: Buffer Management - Mailing list pgsql-hackers

From	Curt Sampson
Subject	Re: Buffer Management
Date	June 26, 2002 00:13:54
Msg-id	Pine.NEB.4.43.0206261149170.670-100000@angelic.cynic.net
In response to	Re: Buffer Management (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Buffer Management
List	pgsql-hackers

On Tue, 25 Jun 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
>
> > I don't understand why there would be any loss of visibility of changes.
> > If two backends mmap the same block of a file, and it's shared, that's
> > the same block of physical memory that they're accessing.
>
> Is it?  You have a mighty narrow conception of the range of
> implementations that's possible for mmap.

It's certainly possible to implement something called mmap that
isn't really mmap. But if you are using the POSIX-defined MAP_SHARED flag,
the behaviour above is what you see. It might be implemented slightly
differently internally, but that's no concern of postgres. And I find
it pretty unlikely that it would be implemented otherwise without good
reason.
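
A minimal sketch of the behaviour described above, under some
assumptions: the helper name shared_write_is_visible and the temporary
file path are made up for illustration, and nothing here is taken from
the postgres source. A child process creates its own MAP_SHARED mapping
of a file block and writes one byte; the parent then reads that byte
through a separate mapping of the same block, with no msync() issued.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper. Returns 1 if a byte written by a child through
 * its own MAP_SHARED mapping is seen by the parent through a separate
 * mapping of the same file block, 0 if it is not, -1 on any error.
 */
int shared_write_is_visible(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";   /* illustrative location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                  /* file persists until fd is closed */

    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t) len) != 0) {
        close(fd);
        return -1;
    }

    /* The parent's view of the block. */
    unsigned char *parent_view =
        mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (parent_view == MAP_FAILED) {
        close(fd);
        return -1;
    }
    assert(parent_view[0] == 0);   /* ftruncate zero-fills the new block */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(parent_view, len);
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: map the same block again rather than reuse the parent's. */
        unsigned char *child_view =
            mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (child_view == MAP_FAILED)
            _exit(1);
        child_view[0] = 0x5A;      /* plain store, no msync() */
        _exit(0);
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (parent_view[0] == 0x5A);

    munmap(parent_view, len);
    close(fd);
    return result;
}
```

On a system that implements MAP_SHARED as POSIX describes, calling
shared_write_is_visible() should return 1. This only shows visibility
between processes; it says nothing about when the page reaches disk.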

Note that your proposal of using mmap to replace sysv shared memory
relies on the behaviour I've described too. As well, if you're replacing
sysv shared memory with an mmap'd file, you may end up doing excessive
disk I/O on systems without the MAP_NOSYNC option. (Without this option,
the update thread/daemon may ensure that every buffer is flushed to the
backing store on disk every 30 seconds or so. You might be able to get
around this by using a small file-backed area for things that need to
persist after a crash, and a larger anonymous area for things that don't
need to persist after a crash.)

> But the main problem is that mmap doesn't let us control when changes to
> the memory buffer will get reflected back to disk --- AFAICT, the OS is
> free to do the write-back at any instant after you dirty the page, and
> that completely breaks the WAL algorithm.  (WAL = write AHEAD log;
> the log entry describing a change must hit disk before the data page
> change itself does.)

Hm. Well, we could try not to write the data to the page until
after we receive notification that our WAL data is committed to
stable storage. However, the new data has to be available to all of
the backends at the exact time that the commit happens. Perhaps a
shared list of pending writes?
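
One way such a list might look, as a rough single-process sketch only.
Every name here (PendingWrite, PendingList, pending_add, pending_apply,
pending_read_byte, the DEMO_ sizes) is hypothetical, and a real version
would live in shared memory with locking, which is left out. Changes are
queued with the WAL position they depend on, copied into the mapped page
only once that position is known to be on stable storage, and readers
overlay the queue on the page so they still see the new data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE   8192   /* assumed page size for the sketch */
#define DEMO_MAX_PENDING 64
#define DEMO_MAX_CHANGE  32

/* One deferred change to a mapped data page. */
typedef struct {
    uint64_t wal_lsn;   /* WAL position that must be durable first */
    size_t   offset;    /* byte offset within the page */
    size_t   len;       /* number of bytes changed */
    unsigned char data[DEMO_MAX_CHANGE];
} PendingWrite;

typedef struct {
    size_t       count;
    PendingWrite entries[DEMO_MAX_PENDING];  /* ascending wal_lsn order */
} PendingList;

/* Queue a change instead of touching the page. Returns 0 on success. */
int pending_add(PendingList *pl, uint64_t wal_lsn, size_t offset,
                const void *data, size_t len)
{
    if (pl->count == DEMO_MAX_PENDING || len > DEMO_MAX_CHANGE ||
        offset > DEMO_PAGE_SIZE || len > DEMO_PAGE_SIZE - offset)
        return -1;
    /* Keep the list ordered so changes are applied in WAL order. */
    if (pl->count > 0 && pl->entries[pl->count - 1].wal_lsn > wal_lsn)
        return -1;

    PendingWrite *pw = &pl->entries[pl->count++];
    pw->wal_lsn = wal_lsn;
    pw->offset = offset;
    pw->len = len;
    memcpy(pw->data, data, len);
    return 0;
}

/*
 * Copy into the page every queued change whose WAL record is durable
 * (wal_lsn <= flushed_lsn) and drop it from the list.
 * Returns the number of changes applied.
 */
size_t pending_apply(PendingList *pl, uint64_t flushed_lsn,
                     unsigned char *page)
{
    assert(page != NULL);
    size_t n = 0;
    while (n < pl->count && pl->entries[n].wal_lsn <= flushed_lsn) {
        const PendingWrite *pw = &pl->entries[n];
        memcpy(page + pw->offset, pw->data, pw->len);
        n++;
    }
    memmove(pl->entries, pl->entries + n,
            (pl->count - n) * sizeof pl->entries[0]);
    pl->count -= n;
    return n;
}

/* What a backend would read: the page overlaid with queued changes. */
unsigned char pending_read_byte(const PendingList *pl,
                                const unsigned char *page, size_t pos)
{
    unsigned char value = page[pos];
    for (size_t i = 0; i < pl->count; i++) {
        const PendingWrite *pw = &pl->entries[i];
        if (pos >= pw->offset && pos < pw->offset + pw->len)
            value = pw->data[pos - pw->offset];   /* later entries win */
    }
    return value;
}
```

The cost of this approach shows up in pending_read_byte: every read of a
page has to consult the list, and the list can fill up if WAL flushes
fall behind.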

Another option would be to just let it write, but on startup, scan
all of the data blocks in the database for tuples that have a
transaction ID later than the last one we updated to, and remove
them. That could be pretty darn expensive on a large database, though.
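
To make the cost concrete, here is a toy sketch of that startup scan.
The DemoTuple layout and the function name discard_unlogged_tuples are
invented for illustration and do not match the real on-disk tuple
format. The plain comparison also ignores transaction ID wraparound, and
the sketch only handles newly written tuples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified tuple header. */
typedef struct {
    uint32_t xid;    /* transaction that wrote the tuple */
    int      live;   /* set to 0 once the tuple is discarded */
} DemoTuple;

/*
 * Walk every tuple and discard those written by a transaction later
 * than the last one known to be in the log. Returns the number of
 * tuples discarded.
 */
size_t discard_unlogged_tuples(DemoTuple *tuples, size_t ntuples,
                               uint32_t last_logged_xid)
{
    assert(tuples != NULL || ntuples == 0);
    size_t removed = 0;
    for (size_t i = 0; i < ntuples; i++) {
        if (tuples[i].live && tuples[i].xid > last_logged_xid) {
            tuples[i].live = 0;
            removed++;
        }
    }
    return removed;
}
```

The loop touches every tuple regardless of how few need removing, so the
work grows with the size of the database rather than with the amount of
unlogged data.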

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC