Thread: Re: Buffer Management
Isn't that what msync() is for? Or is this not portable?

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, 25 June 2002 16:30
To: Curt Sampson
Cc: J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] Buffer Management

Curt Sampson <cjs@cynic.net> writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
>> The other discussion seemed to be considering how to mmap individual
>> data files right into backends' address space. I do not believe this
>> can possibly work, because of loss of control over visibility of data
>> changes to other backends, timing of write-backs, etc.

> I don't understand why there would be any loss of visibility of changes.
> If two backends mmap the same block of a file, and it's shared, that's
> the same block of physical memory that they're accessing.

Is it? You have a mighty narrow conception of the range of
implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to
the memory buffer will get reflected back to disk --- AFAICT, the OS is
free to do the write-back at any instant after you dirty the page, and
that completely breaks the WAL algorithm. (WAL = write AHEAD log; the
log entry describing a change must hit disk before the data page change
itself does.)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
"Mario Weilguni" <mario.weilguni@icomedias.com> writes: > Isn't that what msync() is for? Or is this not portable? msync can force not-yet-written changes down to disk. It does not prevent the OS from choosing to write changes *before* you invoke msync. For example, the HPUX man page for msync says: Normal system activity can cause pages to be written to disk. Therefore, there are no guarantees that msync() is theonly control over when pages are or are not written to disk. Our problem is that we want to enforce the write ordering "WAL before data file". To do that, we write and fsync (or DSYNC, or something) a WAL entry before we issue the write() against the data file. We don't really care if the kernel delays the data file write beyond that point, but we can be certain that the data file write did not occur too early. msync is designed to ensure exactly the opposite constraint: it can guarantee that no changes remain unwritten after time T, but it can't guarantee that changes aren't written before time T. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) [020625 11:00]:
>
> msync can force not-yet-written changes down to disk. It does not
> prevent the OS from choosing to write changes *before* you invoke msync.
>
> Our problem is that we want to enforce the write ordering "WAL before
> data file". To do that, we write and fsync (or DSYNC, or something)
> a WAL entry before we issue the write() against the data file. We
> don't really care if the kernel delays the data file write beyond that
> point, but we can be certain that the data file write did not occur
> too early.
>
> msync is designed to ensure exactly the opposite constraint: it can
> guarantee that no changes remain unwritten after time T, but it can't
> guarantee that changes aren't written before time T.

Okay, so instead of looking for constraints from the OS on the data file,
use the constraints on the WAL file. It would work at the cost of a buffer
copy? Er, maybe two:

    mmap the data file and WAL separately.
    Copy the data file page to the WAL mmap area.
    Modify the page.
    msync() the WAL.
    Copy the page to the data file mmap area.
    msync() or not the data file.

(This is half baked, just thought I'd see if it stirred further thought.)

As another approach, how expensive is re-mmap()ing portions of the files
compared to the copies?

-Brad
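For concreteness, the six steps proposed above could look roughly like the C sketch below. It is only an illustration under stated assumptions: `mmap_logged_update`, `PAGE_SZ`, and the single-page file layout are hypothetical, the files are created and sized by the function purely for demonstration, and nothing here is taken from PostgreSQL's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192            /* assumed page size for this sketch */

/*
 * Hypothetical helper: apply `len` bytes at offset `off` of page 0,
 * following the proposed order. Returns 0 on success, -1 on error.
 * Demo files only: both files are forced to exactly PAGE_SZ bytes.
 */
int mmap_logged_update(const char *wal_path, const char *data_path,
                       size_t off, const void *bytes, size_t len,
                       int sync_data)
{
    assert(off + len <= PAGE_SZ);

    int wal_fd = open(wal_path, O_RDWR | O_CREAT, 0600);
    int data_fd = open(data_path, O_RDWR | O_CREAT, 0600);
    char *wal = MAP_FAILED, *data = MAP_FAILED;
    int rc = -1;

    if (wal_fd < 0 || data_fd < 0)
        goto done;
    if (ftruncate(wal_fd, PAGE_SZ) != 0 || ftruncate(data_fd, PAGE_SZ) != 0)
        goto done;

    /* Step 1: mmap the data file and the WAL separately. */
    wal = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, wal_fd, 0);
    data = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, data_fd, 0);
    if (wal == MAP_FAILED || data == MAP_FAILED)
        goto done;

    memcpy(wal, data, PAGE_SZ);          /* Step 2: buffer copy number one */
    memcpy(wal + off, bytes, len);       /* Step 3: modify the WAL copy */
    if (msync(wal, PAGE_SZ, MS_SYNC) != 0)   /* Step 4: force WAL to disk */
        goto done;
    memcpy(data, wal, PAGE_SZ);          /* Step 5: buffer copy number two */
    if (sync_data && msync(data, PAGE_SZ, MS_SYNC) != 0)  /* Step 6: optional */
        goto done;
    rc = 0;

done:
    if (wal != MAP_FAILED) munmap(wal, PAGE_SZ);
    if (data != MAP_FAILED) munmap(data, PAGE_SZ);
    if (wal_fd >= 0) close(wal_fd);
    if (data_fd >= 0) close(data_fd);
    return rc;
}
```

Note that the data mapping is only dirtied after the WAL msync() returns, so the kernel has nothing to write back too early; the price is two full-page memcpy() calls and one msync() per change.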
Bradley McLean <brad@bradm.net> writes:
> Okay, so instead of looking for constraints from the OS on the data file,
> use the constraints on the WAL file. It would work at the cost of a buffer
> copy? Er, maybe two:

> mmap the data file and WAL separately.
> Copy the data file page to the WAL mmap area.
> Modify the page.
> msync() the WAL.
> Copy the page to the data file mmap area.
> msync() or not the data file.

Huh? The primary argument in favor of mmap is to avoid buffer copies;
seems like you are paying that price anyway.

Also, we do not want to msync WAL for every single WAL record, but I
think you'd have to with the above scheme. (Assuming you have adequate
shared buffer space, the present scheme only has to fsync WAL at
transaction commit and checkpoints, because it won't actually push out
data pages except at checkpoint time.)

regards, tom lane
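For comparison, the ordering that the write()-based scheme relies on (append and fsync the WAL record, and only then issue the write() against the data file) can be sketched in C as follows. This is a minimal illustration, not PostgreSQL's actual code: `logged_page_write` and `write_all` are hypothetical names, and the sketch fsyncs on every call for simplicity, whereas the scheme described above defers the WAL fsync to transaction commit and checkpoints.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical helper: log a change, then apply it.
 * wal_fd is positioned at the end of the log; the page is written to
 * data_fd at byte offset page_off. Returns 0 on success, -1 on error.
 */
int logged_page_write(int wal_fd, const void *rec, size_t rec_len,
                      int data_fd, const void *page, size_t page_len,
                      off_t page_off)
{
    assert(rec_len > 0 && page_len > 0);

    /* 1. Append the WAL record describing the change. */
    if (write_all(wal_fd, rec, rec_len) != 0)
        return -1;

    /* 2. Force the WAL record to disk before touching the data file. */
    if (fsync(wal_fd) != 0)
        return -1;

    /* 3. Only now hand the data page to the kernel. The kernel may
     *    delay this write, but it cannot have happened too early,
     *    because the kernel did not have the page until this point. */
    if (lseek(data_fd, page_off, SEEK_SET) == (off_t) -1)
        return -1;
    return write_all(data_fd, page, page_len);
}
```

The guarantee comes from program order alone: the data page stays in private or shared buffers until step 3, so no OS write-back policy can push it out ahead of the log. With a MAP_SHARED mapping of the data file there is no equivalent of step 3 to hold back.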