Thread: Re: Buffer Management

Re: Buffer Management

From: "Mario Weilguni"
Isn't that what msync() is for? Or is this not portable?

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, 25 June 2002 16:30
To: Curt Sampson
Cc: J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] Buffer Management


Curt Sampson <cjs@cynic.net> writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
>> The other discussion seemed to be considering how to mmap individual
>> data files right into backends' address space.  I do not believe this
>> can possibly work, because of loss of control over visibility of data
>> changes to other backends, timing of write-backs, etc.

> I don't understand why there would be any loss of visibility of changes.
> If two backends mmap the same block of a file, and it's shared, that's
> the same block of physical memory that they're accessing.

Is it?  You have a mighty narrow conception of the range of
implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to
the memory buffer will get reflected back to disk --- AFAICT, the OS is
free to do the write-back at any instant after you dirty the page, and
that completely breaks the WAL algorithm.  (WAL = write AHEAD log;
the log entry describing a change must hit disk before the data page
change itself does.)
        regards, tom lane



---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org






Re: Buffer Management

From: Tom Lane
"Mario Weilguni" <mario.weilguni@icomedias.com> writes:
> Isn't that what msync() is for? Or is this not portable?

msync can force not-yet-written changes down to disk.  It does not
prevent the OS from choosing to write changes *before* you invoke msync.
For example, the HPUX man page for msync says:
    Normal system activity can cause pages to be written to disk.
    Therefore, there are no guarantees that msync() is the only control
    over when pages are or are not written to disk.
 

Our problem is that we want to enforce the write ordering "WAL before
data file".  To do that, we write and fsync (or DSYNC, or something)
a WAL entry before we issue the write() against the data file.  We
don't really care if the kernel delays the data file write beyond that
point, but we can be certain that the data file write did not occur
too early.
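
The ordering described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the actual backend code:
the helper names (wal_append_and_sync, write_page_after_wal), the
DEMO_PAGE_SIZE constant, and the use of plain fsync() on a WAL file opened
with O_APPEND are all made up for the example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 8192   /* illustrative page size (assumption) */

/* Hypothetical helper: append one WAL record and force it to disk.
 * wal_fd is assumed to have been opened with O_APPEND.
 * Returns 0 on success, -1 on error. */
static int wal_append_and_sync(int wal_fd, const void *rec, size_t len)
{
    assert(rec != NULL && len > 0);
    ssize_t n = write(wal_fd, rec, len);
    if (n < 0 || (size_t) n != len)
        return -1;
    return fsync(wal_fd);     /* the record is on disk once this returns 0 */
}

/* Hypothetical helper: write one data page, but only after the WAL
 * record describing the change has been synced. */
static int write_page_after_wal(int wal_fd, int data_fd, off_t page_no,
                                const char *page,
                                const void *rec, size_t rec_len)
{
    if (wal_append_and_sync(wal_fd, rec, rec_len) != 0)
        return -1;            /* never touch the data file without the WAL */

    /* The kernel may delay this write as long as it likes; what matters
     * is that it cannot reach disk before the fsync() above returned. */
    ssize_t n = pwrite(data_fd, page, DEMO_PAGE_SIZE,
                       page_no * (off_t) DEMO_PAGE_SIZE);
    return n == DEMO_PAGE_SIZE ? 0 : -1;
}
```

The point of the sketch is that the process itself decides when write() on
the data file is issued, which is exactly the control that a shared mmap of
the data file gives up.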

msync is designed to ensure exactly the opposite constraint: it can
guarantee that no changes remain unwritten after time T, but it can't
guarantee that changes aren't written before time T.
        regards, tom lane




Re: Buffer Management

From: Bradley McLean
* Tom Lane (tgl@sss.pgh.pa.us) [020625 11:00]:
> 
> msync can force not-yet-written changes down to disk.  It does not
> prevent the OS from choosing to write changes *before* you invoke msync.
> 
> Our problem is that we want to enforce the write ordering "WAL before
> data file".  To do that, we write and fsync (or DSYNC, or something)
> a WAL entry before we issue the write() against the data file.  We
> don't really care if the kernel delays the data file write beyond that
> point, but we can be certain that the data file write did not occur
> too early.
> 
> msync is designed to ensure exactly the opposite constraint: it can
> guarantee that no changes remain unwritten after time T, but it can't
> guarantee that changes aren't written before time T.

Okay, so instead of looking for constraints from the OS on the data file,
use the constraints on the WAL file.  It would work at the cost of a buffer
copy?  Er, maybe two:

mmap the data file and WAL separately.
Copy the data file page to the WAL mmap area.
Modify the page.
msync() the WAL.
Copy the page to the data file mmap area.
msync() or not the data file.

(This is half baked, just thought I'd see if it stirred further thought).
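
For concreteness, the steps above might look something like the sketch
below. It is one possible reading of the list, not a tested design: the
helper names (open_and_map, update_page_via_wal), the DEMO_PAGE_SIZE
constant, and the choice to apply the modification to the WAL copy of the
page are assumptions made for the example. The whole WAL mapping is passed
to msync() because msync() wants a page-aligned start address.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 8192   /* illustrative page size (assumption) */

/* Hypothetical helper: open (or create) a file, size it to len bytes and
 * map it shared and read/write. Returns NULL on failure. */
static char *open_and_map(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : (char *) p;
}

/* Hypothetical helper following the listed steps for one page.
 * The change is "store n bytes at offset off within the page". */
static int update_page_via_wal(char *wal_map, size_t wal_len, size_t wal_slot,
                               char *data_map, size_t page_no,
                               size_t off, const void *bytes, size_t n)
{
    assert(off + n <= DEMO_PAGE_SIZE);
    char *wal_page  = wal_map + wal_slot * DEMO_PAGE_SIZE;
    char *data_page = data_map + page_no * DEMO_PAGE_SIZE;

    memcpy(wal_page, data_page, DEMO_PAGE_SIZE);  /* copy 1: data -> WAL   */
    memcpy(wal_page + off, bytes, n);             /* modify the WAL copy   */
    if (msync(wal_map, wal_len, MS_SYNC) != 0)    /* WAL must be on disk   */
        return -1;
    memcpy(data_page, wal_page, DEMO_PAGE_SIZE);  /* copy 2: WAL -> data   */
    /* msync() of the data mapping is optional at this point. */
    return 0;
}
```

The data file mapping is not dirtied until after the WAL msync() returns,
so the kernel has nothing to write early; the price is the two memcpy()
calls and one msync() per page change.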

As another approach, how expensive is re-mmapping portions of the files
compared to the copies?

-Brad





Re: Buffer Management

From: Tom Lane
Bradley McLean <brad@bradm.net> writes:
> Okay, so instead of looking for constraints from the OS on the data file,
> use the constraints on the WAL file.  It would work at the cost of a buffer
> copy?  Er, maybe two:

> mmap the data file and WAL separately.
> Copy the data file page to the WAL mmap area.
> Modify the page.
> msync() the WAL.
> Copy the page to the data file mmap area.
> msync() or not the data file.

Huh?  The primary argument in favor of mmap is to avoid buffer copies;
seems like you are paying that price anyway.  Also, we do not want to
msync WAL for every single WAL record, but I think you'd have to with
the above scheme.  (Assuming you have adequate shared buffer space,
the present scheme only has to fsync WAL at transaction commit and
checkpoints, because it won't actually push out data pages except at
checkpoint time.)
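
A much-simplified sketch of the "fsync only at commit" idea, for
illustration only: it is not how the backend's WAL code is actually
written, and the WalBuffer struct, its fsync_calls counter, and the
wal_open/wal_append/wal_commit helpers are hypothetical names invented for
the example. Records accumulate in memory, and the single fsync() happens
when a transaction commits.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_BUF_SIZE 8192     /* illustrative buffer size (assumption) */

typedef struct {
    int           fd;                 /* WAL file, opened with O_APPEND   */
    char          buf[WAL_BUF_SIZE];  /* records not yet handed to kernel */
    size_t        used;
    unsigned long fsync_calls;        /* counted only for illustration    */
} WalBuffer;

static int wal_open(WalBuffer *w, const char *path)
{
    w->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
    w->used = 0;
    w->fsync_calls = 0;
    return w->fd >= 0 ? 0 : -1;
}

/* Hand buffered records to the kernel without forcing them to disk. */
static int wal_write_out(WalBuffer *w)
{
    if (w->used > 0) {
        ssize_t n = write(w->fd, w->buf, w->used);
        if (n < 0 || (size_t) n != w->used)
            return -1;
        w->used = 0;
    }
    return 0;
}

/* Add a record; no fsync() here, so ordinary updates stay cheap. */
static int wal_append(WalBuffer *w, const void *rec, size_t len)
{
    assert(len <= WAL_BUF_SIZE);
    if (w->used + len > WAL_BUF_SIZE && wal_write_out(w) != 0)
        return -1;
    memcpy(w->buf + w->used, rec, len);
    w->used += len;
    return 0;
}

/* At commit: write out everything buffered, then one fsync(). */
static int wal_commit(WalBuffer *w)
{
    if (wal_write_out(w) != 0 || fsync(w->fd) != 0)
        return -1;
    w->fsync_calls++;
    return 0;
}
```

Any number of wal_append() calls cost a single fsync() at wal_commit(),
whereas the mmap scheme quoted above would need an msync() of the WAL
before each page could be copied into the data file mapping.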
        regards, tom lane