Re: WAL and commit_delay - Mailing list pgsql-hackers

From Tom Lane
Subject Re: WAL and commit_delay
Msg-id 4540.982439095@sss.pgh.pa.us
In response to Re: WAL and commit_delay  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: WAL and commit_delay  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Another thing I am wondering about is why we're not using fdatasync(),
>> where available, instead of fsync().  The whole point of preallocating
>> the WAL files is to make fdatasync safe, no?

> Don't we have to fsync the inode too?  Actually, I was hoping sequential
> fsync's could sit on the WAL disk track, but I can imagine it has to
> seek around to hit both areas.

That's the point: we're trying to get things set up so that successive
writes/fsyncs in the WAL file do the minimum amount of seeking.  The WAL
code tries to preallocate the whole log file (incorrectly, but that's
easily fixed, see below) so that we should not need to update the file
metadata when we write into the file.

> I don't have fdatasync() here.  How does it compare to fsync()?

HPUX's man page says

:     fdatasync() causes all modified data and file attributes of fildes
:     required to retrieve the data to be written to disk.

:     fsync() causes all modified data and all file attributes of fildes
:     (including access time, modification time and status change time) to
:     be written to disk.

The implication is that the only thing you can lose after fdatasync is
the highly-inessential file mod time.  However, I have been told that
on some implementations, fdatasync only flushes data blocks, and never
writes the inode or indirect blocks.  That would mean that if you had
allocated new disk space to the file, fdatasync would not guarantee
that that allocation was reflected on disk.  This is the reason for
preallocating the WAL log file (and doing a full fsync *at that time*).
Then you know the inode block pointers and indirect blocks are down
on disk, and so fdatasync is sufficient even if you have the cheesy
version of fdatasync.
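
To make the sequence above concrete, here is a minimal sketch of the
pattern: allocate the whole file and do one full fsync() up front, then
use fdatasync() for every later write that stays inside the allocated
range.  This is not the actual WAL code; create_segment, write_record,
flush_data and CHUNK are hypothetical names made up for the illustration,
and it assumes a Unix-like system where os.pwrite is available.

```python
import os

CHUNK = 8192  # write granularity; an arbitrary choice for this sketch


def flush_data(fd):
    """Use fdatasync() where the platform has it, else fall back to fsync()."""
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
    else:
        os.fsync(fd)


def create_segment(path, size):
    """Create a file of `size` bytes, physically written, then fsync it once.

    After this returns, the block pointers for the whole file should be on
    disk, so later in-place writes do not change the file's allocation.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        zeros = bytes(CHUNK)
        remaining = size
        while remaining > 0:
            remaining -= os.write(fd, zeros[:min(CHUNK, remaining)])
        os.fsync(fd)  # full fsync: data plus all file metadata
    except BaseException:
        os.close(fd)
        raise
    return fd


def write_record(fd, offset, payload):
    """Overwrite bytes inside the preallocated range and flush data only."""
    size = os.fstat(fd).st_size
    if offset < 0 or offset + len(payload) > size:
        # Growing the file would need new allocation, which is exactly the
        # case where a data-only flush might not be enough.
        raise ValueError("write would extend past the preallocated space")
    view = memoryview(payload)
    written = 0
    while written < len(view):
        written += os.pwrite(fd, view[written:], offset + written)
    flush_data(fd)
    return offset + len(payload)
```

A caller would create the segment once, then call write_record repeatedly
with the offset returned by the previous call.  The bounds check is the
part that matters for the argument: it refuses any write that would force
the filesystem to allocate new space after the initial fsync().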

Right now the WAL preallocation code (XLogFileInit) is not good enough
because it does lseek to the 16MB position and then writes 1 byte there.
On an implementation that supports holes in files (which is most Unixen)
that doesn't cause physical allocation of the intervening space.  We'd
have to actually write zeroes into all 16MB to ensure the space is
allocated ... but that's just a couple more lines of code.
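
The difference between the two approaches can be sketched as follows.
This is an illustration only, not the XLogFileInit code: the function
names, CHUNK, and the use of st_blocks to inspect allocation are
assumptions of the sketch, and the result depends on the filesystem
(one without hole support, or one that compresses zero blocks, will not
show a difference).

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024  # the 16MB segment size mentioned above
CHUNK = 8192  # write granularity; an arbitrary choice for this sketch


def preallocate_by_seek(path, size=SEGMENT_SIZE):
    """Seek near the end and write one byte; the rest may be a hole."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Seek to size - 1 so that the file ends up exactly `size` bytes long.
        os.lseek(fd, size - 1, os.SEEK_SET)
        os.write(fd, b"\0")
        os.fsync(fd)
    finally:
        os.close(fd)


def preallocate_by_zero_fill(path, size=SEGMENT_SIZE):
    """Write zeroes over the whole range so the writes cover every block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        zeros = bytes(CHUNK)
        remaining = size
        while remaining > 0:
            remaining -= os.write(fd, zeros[:min(CHUNK, remaining)])
        os.fsync(fd)
    finally:
        os.close(fd)


def allocated_bytes(path):
    """Space the filesystem reports as allocated (st_blocks is in 512-byte units)."""
    return os.stat(path).st_blocks * 512
```

Both functions produce a file whose reported length is the same, so
st_size cannot tell them apart.  Comparing allocated_bytes() for the two
files is one way to check whether the seek-based version left a hole on
a given system.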
        regards, tom lane
