Thread: sync()

sync()

From
Tatsuo Ishii
Date:
I noticed sync() is used in PostgreSQL.

CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync()

Can someone tell me why we need sync() here?
--
Tatsuo Ishii


Re: sync()

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> I noticed sync() is used in PostgreSQL.
> 
> CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync()
> 
> Can someone tell me why we need sync() here?

As part of checkpoint, we discard some WAL files.  To do that, we must
first be sure that all the dirty buffers we have written to the kernel
are actually on the disk.  That is why the sync() is required.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: sync()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tatsuo Ishii wrote:
>> Can someone tell me why we need sync() here?

> As part of checkpoint, we discard some WAL files.  To do that, we must
> first be sure that all the dirty buffers we have written to the kernel
> are actually on the disk.  That is why the sync() is required.

What we really need is something better than sync(), viz flush all dirty
buffers to disk *and* wait till they're written.  But sync() and sleep
for awhile is the closest portable approximation.
        regards, tom lane
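
As a rough illustration of the "sync() and sleep" approximation described
above, a minimal sketch in C might look like the following.  The function
name and the delay parameter are hypothetical, chosen only for this
example, and the sleep is a guess at how long the kernel needs; it is not
a guarantee that the writes have completed.

```c
#include <unistd.h>

/*
 * Hypothetical helper: approximate "flush all dirty buffers and wait".
 * sync() may return as soon as the writes are scheduled, so the caller
 * sleeps for a while afterwards and hopes the I/O has finished by then.
 * Always returns 0, because sync() itself reports no errors.
 */
int
approximate_global_flush(unsigned int delay_seconds)
{
    sync();                     /* schedule all dirty kernel buffers */
    sleep(delay_seconds);       /* crude wait for the writes to land */
    return 0;
}
```

A checkpoint routine would call something like approximate_global_flush(2)
before declaring older WAL unnecessary.  On a machine with a very large
buffer cache, any fixed delay may be too short.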


Re: sync()

From
Tatsuo Ishii
Date:
> Tatsuo Ishii wrote:
> > I noticed sync() is used in PostgreSQL.
> > 
> > CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync()
> > 
> > Can someone tell me why we need sync() here?
> 
> As part of checkpoint, we discard some WAL files.  To do that, we must
> first be sure that all the dirty buffers we have written to the kernel
> are actually on the disk.  That is why the sync() is required.

?? I thought WAL files are synced by pg_fsync() (if needed).
--
Tatsuo Ishii


Re: sync()

From
Tatsuo Ishii
Date:
> > As part of checkpoint, we discard some WAL files.  To do that, we must
> > first be sure that all the dirty buffers we have written to the kernel
> > are actually on the disk.  That is why the sync() is required.
> 
> What we really need is something better than sync(), viz flush all dirty
> buffers to disk *and* wait till they're written.  But sync() and sleep
> for awhile is the closest portable approximation.

Are you saying that fsync() might not wait until the IO completes?
--
Tatsuo Ishii


Re: sync()

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Can someone tell me why we need sync() here?

> ?? I thought WAL files are synced by pg_fsync() (if needed).

They are.  But to write a checkpoint record --- which implies that the
WAL records before it need no longer be replayed --- we have to ensure
that all the changes-so-far in the regular database files are written
down to disk.  That is what we need sync() for.
        regards, tom lane


Re: sync()

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> What we really need is something better than sync(), viz flush all dirty
>> buffers to disk *and* wait till they're written.  But sync() and sleep
>> for awhile is the closest portable approximation.

> Are you saying that fsync() might not wait until the IO completes?

No, I said that sync() might not.  Read the man pages.  HPUX's man
page for sync(2) says

    sync() causes all information in memory that should be on disk to be
    written out.
    ...
    The writing, although scheduled, is not necessarily complete upon
    return from sync.

        regards, tom lane


Re: sync()

From
Tatsuo Ishii
Date:
> > Are you saying that fsync() might not wait until the IO completes?
> 
> No, I said that sync() might not.  Read the man pages.  HPUX's man
> page for sync(2) says
> 
>      sync() causes all information in memory that should be on disk to be
>      written out.
>      ...
>      The writing, although scheduled, is not necessarily complete upon
>      return from sync.

I'm just wondering why we do not use fsync() to flush data/index
pages.
--
Tatsuo Ishii


Re: sync()

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I'm just wondering why we do not use fsync() to flush data/index
> pages.

There isn't any efficient way to do that AFAICS.  The process that wants
to do the checkpoint hasn't got any way to know just which files need to
be sync'd.  Even if it did know, it's not clear to me that we can
portably assume that process A issuing an fsync on a file descriptor F
it's opened for file X will force to disk previous writes issued against
the same physical file X by a different process B using a different file
descriptor G.

sync() is surely overkill, in that it writes out dirty kernel buffers
that might have nothing at all to do with Postgres.  But I don't see how
to do better.
        regards, tom lane
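
To make the open question concrete, here is a minimal sketch in C of what
"process A fsyncs file X through its own descriptor" would look like.  The
helper name is hypothetical.  Whether the fsync() call also forces out
writes that other processes issued through different descriptors is
exactly the platform-dependent behaviour under discussion, and this code
does not settle it.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the file with a fresh descriptor, fsync it,
 * and close it again.  Returns 0 on success and -1 on any failure.
 * O_RDWR is used because some platforms refuse fsync() on a descriptor
 * that was not opened for writing.
 */
int
fsync_by_path(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A checkpointing process would call this once per data file it believes
has been modified, which is why it first needs to know which files those
are.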


Re: sync()

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I'm just wondering why we do not use fsync() to flush data/index
> > pages.
> 
> There isn't any efficient way to do that AFAICS.  The process that wants
> to do the checkpoint hasn't got any way to know just which files need to
> be sync'd.  Even if it did know, it's not clear to me that we can
> portably assume that process A issuing an fsync on a file descriptor F
> it's opened for file X will force to disk previous writes issued against
> the same physical file X by a different process B using a different file
> descriptor G.
> 
> sync() is surely overkill, in that it writes out dirty kernel buffers
> that might have nothing at all to do with Postgres.  But I don't see how
> to do better.

Thanks for a good summary. Maybe this is yet another reason to have
a separate IO process like Oracle...
--
Tatsuo Ishii


Re: sync()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> What we really need is something better than sync(), viz flush all dirty
> >> buffers to disk *and* wait till they're written.  But sync() and sleep
> >> for awhile is the closest portable approximation.
> 
> > Are you saying that fsync() might not wait until the IO completes?
> 
> No, I said that sync() might not.  Read the man pages.  HPUX's man
> page for sync(2) says
> 
>      sync() causes all information in memory that should be on disk to be
>      written out.
>      ...
>      The writing, although scheduled, is not necessarily complete upon
>      return from sync.

Yep, BSD/OS says:

    BUGS
         Sync() may return before the buffers are completely flushed.

At least they classify it as a bug.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: sync()

From
Kevin Brown
Date:
Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I'm just wondering why we do not use fsync() to flush data/index
> > pages.
> 
> There isn't any efficient way to do that AFAICS.  The process that wants
> to do the checkpoint hasn't got any way to know just which files need to
> be sync'd.

So the backends have to keep a common list of all the files they
touch.  Admittedly, that could be a problem if it means using a bunch
of shared memory, and it may have additional performance implications
depending on the implementation ...

>  Even if it did know, it's not clear to me that we can
> portably assume that process A issuing an fsync on a file descriptor F
> it's opened for file X will force to disk previous writes issued against
> the same physical file X by a different process B using a different file
> descriptor G.

If the manpages are to be believed, then under FreeBSD, Linux, and
HP-UX, calling fsync() will force to disk *all* unwritten buffers
associated with the file pointed to by the filedescriptor.

Sadly, however, the Solaris and IRIX manpages suggest that only
buffers associated with the specific file descriptor itself are
written, not necessarily all buffers associated with the file pointed
at by the file descriptor (and interestingly, the Solaris version
appears to be implemented as a library function and not a system call,
if the manpage's section is any indication).

> sync() is surely overkill, in that it writes out dirty kernel buffers
> that might have nothing at all to do with Postgres.  But I don't see how
> to do better.

It's obvious to me that sync() can have some very significant
performance implications on a system that is acting as more than just
a database server.  So it should probably be used only when there's no
good alternative.

So: this is probably one of those cases where it's important to
distinguish between operating systems and use the sync() approach only
when it's uncertain that fsync() will do the job.  So FreeBSD (and
probably all the other BSD derivatives) definitely should use fsync()
since they have known-good implementations.  Linux and HP-UX 11 should
use fsync() as well, if the manpages' wording can be trusted (I'm not
sure about earlier versions).  Solaris and IRIX should use
sync() since their manpages indicate that only data associated with
the file descriptor will be written to disk.

Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
directories leading to the file as well, so that the state of the
filesystem on disk is consistent and safe in the event that the files
in question are newly-created.  Whether that's truly necessary or not
appears to be filesystem-dependent.  A quick perusal of the Linux
source shows that ext2 appears to only sync the data and metadata
associated with the inode of the specific file and not any parent
directories, so it's probably a safe bet to fsync() any ancestor
directories that matter as well as the file even if the system is
running on top of a journalled filesystem.  Since all the files in
question probably reside in the same set of directories, the directory
fsync()s can be deferred until the very end.


-- 
Kevin Brown                          kevin@sysexperts.com
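
The directory fsync() suggested above could be sketched in C roughly as
follows.  The helper name is hypothetical, and the sketch assumes a
platform that allows a directory to be opened read-only and passed to
fsync(); that works on Linux but is not guaranteed everywhere.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush a directory's entries to disk by opening
 * the directory itself and calling fsync() on that descriptor.
 * Returns 0 on success and -1 on failure.
 */
int
fsync_directory(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return (rc == 0) ? 0 : -1;
}
```

Since the data files live in a small set of directories, a caller could
fsync() every file first and then call this once per directory at the
very end.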


Re: sync()

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> So the backends have to keep a common list of all the files they
> touch.  Admittedly, that could be a problem if it means using a bunch
> of shared memory, and it may have additional performance implications
> depending on the implementation ...

It would have to be a list of all files that have been touched since the
last checkpoint.  That's a serious problem for storage in shared memory,
which is by definition fixed-size.

>> Even if it did know, it's not clear to me that we can
>> portably assume that process A issuing an fsync on a file descriptor F
>> it's opened for file X will force to disk previous writes issued against
>> the same physical file X by a different process B using a different file
>> descriptor G.

> If the manpages are to be believed, then under FreeBSD, Linux, and
> HP-UX, calling fsync() will force to disk *all* unwritten buffers
> associated with the file pointed to by the filedescriptor.

> Sadly, however, the Solaris and IRIX manpages suggest that only
> buffers associated with the specific file descriptor itself are
> written, not necessarily all buffers associated with the file pointed
> at by the file descriptor (and interestingly, the Solaris version
> appears to be implemented as a library function and not a system call,
> if the manpage's section is any indication).

Right.  "Portably" was the key word in my comment (sorry for not
emphasizing this more clearly).  The real problem here is how to know
what is the actual behavior of each platform?  I'm certainly not
prepared to trust reading-between-the-lines-of-some-man-pages.  And I
can't think of a simple yet reliable direct test.  You'd really have to
invest in a detailed study of the kernel source code to know for sure ...
and many of our platforms don't have open-source kernels.

> Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
> directories leading to the file as well, so that the state of the
> filesystem on disk is consistent and safe in the event that the files
> in question are newly-created.

AFAIK, all Unix implementations are paranoid about consistency of
filesystem metadata, including directory contents.  So fsync'ing
directories from a user process strikes me as a waste of time, even
assuming that it were portable, which I doubt.  What we need to worry
about is whether fsync'ing a bunch of our own data files is a practical
substitute for a global sync() call.
        regards, tom lane


Re: sync()

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > So the backends have to keep a common list of all the files they
> > touch.  Admittedly, that could be a problem if it means using a bunch
> > of shared memory, and it may have additional performance implications
> > depending on the implementation ...
> 
> It would have to be a list of all files that have been touched since the
> last checkpoint.  That's a serious problem for storage in shared memory,
> which is by definition fixed-size.

Of course, the file list needn't be stored in SysV shared memory.  It
could be stored in a file that's later read by the checkpointing
process.  The backends could serialize their writes via fcntl() or
ioctl() style locks, whichever is appropriate.  Locking might even be
avoided entirely if the individual writes are small enough.
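
A minimal sketch of the lock-free variant, in C, might look like this.
The list file path and the function name are hypothetical.  It assumes
each entry is short and is written with a single write() call on a
descriptor opened with O_APPEND; whether such small appends from several
processes are never interleaved depends on the platform and filesystem,
so treat that as an assumption rather than a guarantee.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one "touched file" entry to a shared list
 * file.  The whole line goes out in one write() on an O_APPEND
 * descriptor, so no explicit lock is taken.
 * Returns 0 on success and -1 on failure.
 */
int
record_touched_file(const char *listpath, const char *touched)
{
    char    line[1024];
    int     len;
    int     fd;
    ssize_t written;

    len = snprintf(line, sizeof(line), "%s\n", touched);
    if (len < 0 || (size_t) len >= sizeof(line))
        return -1;              /* entry too long for one small write */

    fd = open(listpath, O_WRONLY | O_APPEND | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    written = write(fd, line, (size_t) len);
    close(fd);
    return (written == (ssize_t) len) ? 0 : -1;
}
```

The checkpointing process would then read the list, fsync() each named
file, and truncate the list once the checkpoint completes.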

> Right.  "Portably" was the key word in my comment (sorry for not
> emphasizing this more clearly).  The real problem here is how to know
> what is the actual behavior of each platform?  I'm certainly not
> prepared to trust reading-between-the-lines-of-some-man-pages.  

Reading between the lines isn't necessarily required, just literal
interpretation.  :-)

> And I can't think of a simple yet reliable direct test.  You'd
> really have to invest detailed study of the kernel source code to
> know for sure ...  and many of our platforms don't have open-source
> kernels.

Linux appears to do the right thing with the file data itself, even if
it doesn't handle the directory entry simultaneously.  Others claim,
in messages written to pgsql-general and elsewhere (via Google
search), that FreeBSD does the right thing for sure.

I certainly agree that non-open-source kernels are uncertain.  That's
why it wouldn't be a bad idea to control this via a GUC variable.

> > Under Linux (and perhaps HP-UX), it may be necessary to fsync() the
> > directories leading to the file as well, so that the state of the
> > filesystem on disk is consistent and safe in the event that the files
> > in question are newly-created.
> 
> AFAIK, all Unix implementations are paranoid about consistency of
> filesystem metadata, including directory contents.  

Not ext2 under Linux!  By default, it writes everything
asynchronously.  I don't know how many people use ext2 to do serious
tasks under Linux, so this may not be that much of an issue.

> So fsync'ing directories from a user process strikes me as a waste
> of time, even assuming that it were portable, which I doubt.  What
> we need to worry about is whether fsync'ing a bunch of our own data
> files is a practical substitute for a global sync() call.

I'm positive that under certain operating systems, fsyncing the data
is a better option than a global sync(), especially since sync() isn't
guaranteed to wait until the buffers are flushed.  Right now the state
of the data on disk immediately after a checkpoint is just a guess
because of that.  I don't see that using fsync() would introduce
significantly more uncertainty on systems where the manpage explicitly
says that the buffers associated with the file referenced by the file
descriptor are the ones written to disk.  For instance, the FreeBSD
manpage says:

    Fsync() causes all modified data and attributes of fd to be moved
    to a permanent storage device.  This normally results in all
    in-core modified copies of buffers for the associated file to be
    written to a disk.

    Fsync() should be used by programs that require a file to be in a
    known state, for example, in building a simple transaction
    facility.

and the Linux manpage says:

    fsync copies all in-core parts of a file to disk, and waits until
    the device reports that all parts are on stable storage. It also
    updates metadata stat information.  It does not necessarily ensure
    that the entry in the directory containing the file has also
    reached disk.  For that an explicit fsync on the file descriptor
    of the directory is also needed.

Both are rather unambiguous, and a cursory review of the Linux source
confirms what its manpage says, at least.  The FreeBSD manpage might
be ambiguous, but the fact that they also have an fsync command line
utility essentially proves that FreeBSD's fsync() flushes all buffers
associated with the file.

Conversely, the Solaris manpage says:

    The fsync() function moves all modified data and attributes of the
    file descriptor fildes to a storage device.  When fsync() returns,
    all in-memory modified copies of buffers associated with fildes
    have been written to the physical medium.

It's pretty clear from the Solaris description that its fsync()
concerns itself only with the buffers associated with a file
descriptor and not with the file itself.  The fact that it's
implemented as a library call (the manpage is in section 3 instead of
section 2) convinces me further that its fsync() implementation is as
described.


The PostgreSQL default for checkpoints should probably be sync(), but
I think fsync() should be an available option, just as it's possible
to control whether or not synchronous writes are used for the
transaction log as well as the type of synchronization mechanism used
for it.  Yes, it's another parameter for the administrator to concern
himself with, but it seems to me that a significant amount of speed
could be gained under certain (perhaps quite common) circumstances
with such a mechanism.




-- 
Kevin Brown                          kevin@sysexperts.com


Re: sync()

From
Giles Lean
Date:
Tom Lane writes:

> Right.  "Portably" was the key word in my comment (sorry for not
> emphasizing this more clearly).  The real problem here is how to know
> what is the actual behavior of each platform?  I'm certainly not
> prepared to trust reading-between-the-lines-of-some-man-pages.  And I
> can't think of a simple yet reliable direct test.

Is the "Single UNIX Specification, version 2" (aka UNIX98) any better?
It says for fsync():

    "The fsync() function forces all currently queued I/O operations
    associated with the file indicated by file descriptor fildes to
    the synchronised I/O completion state. All I/O operations are
    completed as defined for synchronised I/O file integrity
    completion."

This to me clearly says that changes to the file must be written,
not just changes made via this file descriptor.

I did have to test this behaviour once (for a customer, strange
situation) but I couldn't find a portable way to do it, either.

What I did was read the appropriate disk block from the raw device to
bypass the buffer cache.  As this required low level knowledge of the
on-disk filesystem layout it was not very portable.  For anyone
interested Tom Christiansen's "icat" program can be ported to UFS
derived filesystems fairly easily:
   http://www.rosat.mpe-garching.mpg.de/mailing-lists/perl5-porters/1997-04/msg00487.html

> AFAIK, all Unix implementations are paranoid about consistency of
> filesystem metadata, including directory contents.  So fsync'ing
> directories from a user process strikes me as a waste of time, ...

There is one variant where this is not the case: Linux using ext2fs
and possibly other filesystems.

There was a flame fest of great entertainment value a few years ago
between Linus Torvalds and Dan Bernstein.  Of course, neither was able
to influence the opinion of the other to any noticeable degree, but it
made fun reading. I think this might be a starting point:
   http://www.ornl.gov/cts/archives/mailing-lists/qmail/1998/05/msg00667.html

A more recent posting from Linus where he continues to recommend
fsync() is this:
   http://www.cs.helsinki.fi/linux/linux-kernel/2001-29/0659.html

I've not heard that any other Unix-like OS has abandoned the
traditional and POSIX semantics.

> assuming that it were portable, which I doubt.  What we need to worry
> about is whether fsync'ing a bunch of our own data files is a practical
> substitute for a global sync() call.

I wish that it were.  There are situations (several GB buffer caches,
for example) where I mistrust the current use of sync() to have all
writes completed before the sleep() returns.  My concern is
theoretical at the moment -- I never get to play with machines that
large!

Regards,

Giles




Re: sync()

From
Kurt Roeckx
Date:
On Mon, Jan 13, 2003 at 07:31:08PM +1100, Giles Lean wrote:
> 
> Is the "Single UNIX Specification, version 2" (aka UNIX98) any better?
> It says for fsync():
> 
>     "The fsync() function forces all currently queued I/O operations
>     associated with the file indicated by file descriptor fildes to
>     the synchronised I/O completion state. All I/O operations are
>     completed as defined for synchronised I/O file integrity
>     completion."

In version 3 it says:

    The fsync() function shall request that all data for the open file
    descriptor named by fildes is to be transferred to the storage
    device associated with the file described by fildes in an
    implementation-defined manner. The fsync() function shall not
    return until the system has completed that action or until an error
    is detected.

    [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
    fsync() function shall force all currently queued I/O operations
    associated with the file indicated by file descriptor fildes to the
    synchronized I/O completion state. All I/O operations shall be
    completed as defined for synchronized I/O file integrity
    completion. [Option End]


Kurt



Re: sync()

From
Kevin Brown
Date:
Kurt Roeckx wrote:
>      [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
>      fsync() function shall force all currently queued I/O operations
>      associated with the file indicated by file descriptor fildes to the
>      synchronized I/O completion state. All I/O operations shall be
>      completed as defined for synchronized I/O file integrity
>      completion. [Option End]

Hmmm....so if I consistently want these semantics out of fsync() I
have to #define _POSIX_SYNCHRONIZED_IO?  Or does the above mean that
you'll get those semantics if and only if the OS defines the above for
you?

I certainly hope the former is the case, because the newer semantics
which you mentioned in the section I cut don't do us any good at all
and we can't rely on the OS to define something like
_POSIX_SYNCHRONIZED_IO for us...

Being able to open a file, do an fsync(), and have the kernel actually
write all the buffers associated with that file to disk could be, I
think, a significant performance win compared with the "flush
everything known to the kernel" approach we take now, at least on
systems that do something other than PostgreSQL...



-- 
Kevin Brown                          kevin@sysexperts.com


Re: sync()

From
Kurt Roeckx
Date:
On Sat, Feb 01, 2003 at 08:15:17AM -0800, Kevin Brown wrote:
> Kurt Roeckx wrote:
> >      [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
> >      fsync() function shall force all currently queued I/O operations
> >      associated with the file indicated by file descriptor fildes to the
> >      synchronized I/O completion state. All I/O operations shall be
> >      completed as defined for synchronized I/O file integrity
> >      completion. [Option End]
> 
> Hmmm....so if I consistently want these semantics out of fsync() I
> have to #define _POSIX_SYNCHRONIZED_IO?  Or does the above mean that
> you'll get those semantics if and only if the OS defines the above for
> you?

It's something that will be defined in unistd.h.  Depending on
its value, you know whether the option is not supported, has to be
checked at run time with sysconf(), or is always supported.

You know that this standard is freely available on the internet?
(http://www.unix-systems.org/version3/online.html)

There are other comments in it about its usage.

Note that there is also a function call fdatasync() in the
Synchronized IO extension.


Kurt
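
As a sketch of how a program might test for the option, the following C
fragment checks the _POSIX_SYNCHRONIZED_IO macro from unistd.h and falls
back to sysconf(_SC_SYNCHRONIZED_IO) when the macro is defined as 0.  The
function name is hypothetical, and the interpretation of the macro values
follows the general POSIX convention for option macros (greater than zero
means always supported, zero means ask at run time, -1 or undefined means
unsupported).

```c
#include <unistd.h>

/*
 * Hypothetical helper: report whether the Synchronized I/O option
 * appears to be available.  Returns 1 if it is, 0 if it is not.
 */
int
synchronized_io_supported(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && (_POSIX_SYNCHRONIZED_IO > 0)
    /* Always supported on this platform. */
    return 1;
#elif defined(_POSIX_SYNCHRONIZED_IO) && (_POSIX_SYNCHRONIZED_IO == 0)
    /* Headers are present; actual support must be queried at run time. */
    return (sysconf(_SC_SYNCHRONIZED_IO) > 0) ? 1 : 0;
#else
    /* Undefined or -1: the option is not supported. */
    return 0;
#endif
}
```

Note that the macro is provided by the system; an application defining it
by itself would not change what fsync() does.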



Re: sync()

From
Kevin Brown
Date:
Kurt Roeckx wrote:
> On Sat, Feb 01, 2003 at 08:15:17AM -0800, Kevin Brown wrote:
> > Kurt Roeckx wrote:
> > >      [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
> > >      fsync() function shall force all currently queued I/O operations
> > >      associated with the file indicated by file descriptor fildes to the
> > >      synchronized I/O completion state. All I/O operations shall be
> > >      completed as defined for synchronized I/O file integrity
> > >      completion. [Option End]
> > 
> > Hmmm....so if I consistently want these semantics out of fsync() I
> > have to #define _POSIX_SYNCHRONIZED_IO?  Or does the above mean that
> > you'll get those semantics if and only if the OS defines the above for
> > you?
> 
> It's something that will be defined in unistd.h.  Depending on
> the value you know if the system supports it always, you can turn
> it on per application, or it's always on.
> 
> You know that this standard is freely available on internet?
> (http://www.unix-systems.org/version3/online.html)
> 
> There are other comments in about the usage of it.
> 
> Note that there also is a function call fdatasync() in the
> Synchronized IO extension.

Ah, excellent, thank you.  Yes, fdatasync() is *exactly* what we need,
since it's defined thusly: "The functionality shall be equivalent to
fsync() with the symbol _POSIX_SYNCHRONIZED_IO defined, with the
exception that all I/O operations shall be completed as defined for
synchronized I/O data integrity completion".

Looks to me like we have a winner.  Question is, can we bank on its
existence and, if so, is it properly implemented on all platforms that
support it?


Since we've been talking about porting to rather different platforms
(win32 in particular), it seems logical to build a PGFileSync()
function or something (perhaps a single PGSync() which synchronizes
all relevant PG files to disk, with sync() if necessary) and which
would thus use fdatasync() or its equivalent.



-- 
Kevin Brown                          kevin@sysexperts.com