Re: sync() - Mailing list pgsql-hackers

From Kevin Brown
Subject Re: sync()
Date
Msg-id 20030113073207.GI20180@filer
In response to Re: sync()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > So the backends have to keep a common list of all the files they
> > touch.  Admittedly, that could be a problem if it means using a bunch
> > of shared memory, and it may have additional performance implications
> > depending on the implementation ...
> 
> It would have to be a list of all files that have been touched since the
> last checkpoint.  That's a serious problem for storage in shared memory,
> which is by definition fixed-size.

Of course, the file list needn't be stored in SysV shared memory.  It
could be stored in a file that's later read by the checkpointing
process.  The backends could serialize their writes via fcntl() or
flock() style locks, whichever is appropriate.  Locking might even be
avoided entirely if the individual writes are small enough.

> Right.  "Portably" was the key word in my comment (sorry for not
> emphasizing this more clearly).  The real problem here is how to know
> what is the actual behavior of each platform?  I'm certainly not
> prepared to trust reading-between-the-lines-of-some-man-pages.  

Reading between the lines isn't necessarily required, just literal
interpretation.  :-)

> And I can't think of a simple yet reliable direct test.  You'd
> really have to invest detailed study of the kernel source code to
> know for sure ...  and many of our platforms don't have open-source
> kernels.

Linux appears to do the right thing with the file data itself, even if
it doesn't handle the directory entry simultaneously.  Others claim,
in messages written to pgsql-general and elsewhere (via Google
search), that FreeBSD does the right thing for sure.

I certainly agree that non-open-source kernels are uncertain.  That's
why it wouldn't be a bad idea to control this via a GUC variable.

> > Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
> > directories leading to the file as well, so that the state of the
> > filesystem on disk is consistent and safe in the event that the files
> > in question are newly-created.
> 
> AFAIK, all Unix implementations are paranoid about consistency of
> filesystem metadata, including directory contents.  

Not ext2 under Linux!  By default, it writes everything
asynchronously.  I don't know how many people use ext2 to do serious
tasks under Linux, so this may not be that much of an issue.

> So fsync'ing directories from a user process strikes me as a waste
> of time, even assuming that it were portable, which I doubt.  What
> we need to worry about is whether fsync'ing a bunch of our own data
> files is a practical substitute for a global sync() call.

I'm positive that under certain operating systems, fsyncing the data
is a better option than a global sync(), especially since sync() isn't
guaranteed to wait until the buffers are flushed.  Right now the state
of the data on disk immediately after a checkpoint is just a guess
because of that.  I don't see that using fsync() would introduce
significantly more uncertainty on systems where the manpage explicitly
says that the buffers associated with the file referenced by the file
descriptor are the ones written to disk.  For instance, the FreeBSD
manpage says:
   Fsync() causes all modified data and attributes of fd to be moved
   to a permanent storage device.  This normally results in all
   in-core modified copies of buffers for the associated file to be
   written to a disk.

   Fsync() should be used by programs that require a file to be in a
   known state, for example, in building a simple transaction
   facility.


and the Linux manpage says:
   fsync copies all in-core parts of a file to disk, and waits until
   the device reports that all parts are on stable storage.  It also
   updates metadata stat information.  It does not necessarily ensure
   that the entry in the directory containing the file has also
   reached disk.  For that an explicit fsync on the file descriptor
   of the directory is also needed.


Both are rather unambiguous, and a cursory review of the Linux source
confirms what its manpage says, at least.  The FreeBSD manpage might
be ambiguous, but the fact that they also have an fsync command line
utility essentially proves that FreeBSD's fsync() flushes all buffers
associated with the file.

Conversely, the Solaris manpage says:
   The fsync() function moves all modified data and attributes of the
   file descriptor fildes to a storage device.  When fsync() returns,
   all in-memory modified copies of buffers associated with fildes
   have been written to the physical medium.


It's pretty clear from the Solaris description that its fsync()
concerns itself only with the buffers associated with a file
descriptor and not with the file itself.  The fact that it's
implemented as a library call (the manpage is in section 3 instead of
section 2) convinces me further that its fsync() implementation is as
described.


The PostgreSQL default for checkpoints should probably be sync(), but
I think fsync() should be an available option, just as the
administrator can already control whether synchronous writes are used
for the transaction log and which synchronization mechanism is used
for it.  Yes, it's another parameter for the administrator to concern
himself with, but it seems to me that a significant amount of speed
could be gained under certain (perhaps quite common) circumstances
with such a mechanism.




-- 
Kevin Brown                          kevin@sysexperts.com

