Re: Buffer Management - Mailing list pgsql-hackers
From | Curt Sampson |
---|---|
Subject | Re: Buffer Management |
Date | |
Msg-id | Pine.NEB.4.43.0206261149170.670-100000@angelic.cynic.net |
In response to | Re: Buffer Management (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Buffer Management |
List | pgsql-hackers |
On Tue, 25 Jun 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
>
> > I don't understand why there would be any loss of visibility of changes.
> > If two backends mmap the same block of a file, and it's shared, that's
> > the same block of physical memory that they're accessing.
>
> Is it? You have a mighty narrow conception of the range of
> implementations that's possible for mmap.

It's certainly possible to implement something that you call mmap
that is not. But if you are using the POSIX-defined MAP_SHARED flag,
the behaviour above is what you see. It might be implemented slightly
differently internally, but that's no concern of postgres. And I find
it pretty unlikely that it would be implemented otherwise without
good reason.

Note that your proposal of using mmap to replace sysv shared memory
relies on the behaviour I've described too. As well, if you're
replacing sysv shared memory with an mmap'd file, you may end up doing
excessive disk I/O on systems without the MAP_NOSYNC option. (Without
this option, the update thread/daemon may ensure that every buffer is
flushed to the backing store on disk every 30 seconds or so. You might
be able to get around this by using a small file-backed area for things
that need to persist after a crash, and a larger anonymous area for
things that don't need to persist after a crash.)

> But the main problem is that mmap doesn't let us control when changes to
> the memory buffer will get reflected back to disk --- AFAICT, the OS is
> free to do the write-back at any instant after you dirty the page, and
> that completely breaks the WAL algorithm. (WAL = write AHEAD log;
> the log entry describing a change must hit disk before the data page
> change itself does.)

Hm. Well, we could try not to write the data to the page until after
we receive notification that our WAL data is committed to stable
storage. However, the new data has to be available to all of the
backends at the exact time that the commit happens.
Perhaps a shared list of pending writes?

Another option would be to just let it write, but on startup, scan
all of the data blocks in the database for tuples that have a
transaction ID later than the last one we updated to, and remove
them. That could be pretty darn expensive on a large database, though.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
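
For illustration only, here is a rough C sketch of the first idea in the message: hold a change in private memory, force its log record to stable storage, and only then touch the MAP_SHARED page, so that the OS cannot write the page back ahead of the log. The names `pending_write`, `wal_then_apply`, `run_wal_first_demo`, and `BLOCK_BYTES` are hypothetical and are not PostgreSQL code; the 8192-byte block size and the scratch files under /tmp are assumptions too. The sketch does not address the visibility problem raised above, since other backends see the change only once it has been copied into the mapping.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_BYTES 8192    /* assumed block size, for illustration only */

/* Hypothetical pending change, kept in private memory until the log is durable. */
struct pending_write {
    size_t offset;          /* byte offset within the mapped block */
    size_t len;             /* number of bytes to change */
    char   data[64];        /* the new bytes */
};

/*
 * Append the change to the log, fsync the log, and only then modify the
 * MAP_SHARED block. Returns 0 on success, -1 on error.
 */
int wal_then_apply(int wal_fd, char *page, const struct pending_write *pw)
{
    if (pw->len > sizeof pw->data || pw->offset + pw->len > BLOCK_BYTES)
        return -1;                              /* change does not fit */
    if (write(wal_fd, pw, sizeof *pw) != (ssize_t) sizeof *pw)
        return -1;                              /* log record not fully written */
    if (fsync(wal_fd) != 0)
        return -1;                              /* log not on stable storage */
    memcpy(page + pw->offset, pw->data, pw->len);   /* page is dirtied only now */
    return 0;
}

/*
 * Map one block of a scratch data file and push a single change through
 * wal_then_apply(). Returns 0 if the log record was written and the change
 * is present in the mapping afterwards, -1 otherwise.
 */
int run_wal_first_demo(void)
{
    char wal_path[]  = "/tmp/wal_sketch_XXXXXX";
    char data_path[] = "/tmp/data_sketch_XXXXXX";
    int  result = -1;

    int wal_fd = mkstemp(wal_path);
    if (wal_fd < 0)
        return -1;
    int data_fd = mkstemp(data_path);
    if (data_fd < 0) {
        close(wal_fd);
        unlink(wal_path);
        return -1;
    }

    char *page = MAP_FAILED;
    if (ftruncate(data_fd, BLOCK_BYTES) == 0)
        page = mmap(NULL, BLOCK_BYTES, PROT_READ | PROT_WRITE,
                    MAP_SHARED, data_fd, 0);

    if (page != MAP_FAILED) {
        struct pending_write pw = { .offset = 16, .len = 5 };
        memcpy(pw.data, "hello", 5);

        if (wal_then_apply(wal_fd, page, &pw) == 0
            && lseek(wal_fd, 0, SEEK_END) == (off_t) sizeof pw
            && memcmp(page + 16, "hello", 5) == 0)
            result = 0;

        munmap(page, BLOCK_BYTES);
    }

    close(data_fd);
    close(wal_fd);
    unlink(data_path);
    unlink(wal_path);
    return result;
}
```

Calling `run_wal_first_demo()` from a small `main` should return 0 on a POSIX system with a writable /tmp. A real design would still need something like the shared list of pending writes so that every backend sees the new data at commit time.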