Thread: sync()
I noticed sync() is used in PostgreSQL. CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync() Can someone tell me why we need sync() here? -- Tatsuo Ishii
Tatsuo Ishii wrote: > I noticed sync() is used in PostgreSQL. > > CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync() > > Can someone tell me why we need sync() here? As part of checkpoint, we discard some WAL files. To do that, we must first be sure that all the dirty buffers we have written to the kernel are actually on the disk. That is why the sync() is required. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tatsuo Ishii wrote: >> Can someone tell me why we need sync() here? > As part of checkpoint, we discard some WAL files. To do that, we must > first be sure that all the dirty buffers we have written to the kernel > are actually on the disk. That is why the sync() is required. What we really need is something better than sync(), viz flush all dirty buffers to disk *and* wait till they're written. But sync() and sleep for awhile is the closest portable approximation. regards, tom lane
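The "sync() and sleep for a while" approximation described above can be sketched in C roughly as follows. This is only an illustration, not code taken from the PostgreSQL source; the function name `flush_all_approx` and the `settle_seconds` parameter are assumptions made for the example, and the right pause length is a guess on any given system.

```c
#include <unistd.h>

/*
 * Hypothetical helper: approximate "flush all dirty buffers and wait".
 * On some systems sync() only schedules the writes, so we pause and
 * then call it a second time. Nothing here guarantees completion.
 * Returns the number of seconds left unslept (0 if the pause ran fully).
 */
unsigned int flush_all_approx(unsigned int settle_seconds)
{
    unsigned int unslept = 0;

    sync();                                 /* ask the kernel to write everything */
    if (settle_seconds > 0)
        unslept = sleep(settle_seconds);    /* give scheduled writes time to land */
    sync();                                 /* second pass for anything still dirty */

    return unslept;
}
```

A caller would invoke something like `flush_all_approx(2)` before treating the data files as durable, accepting that this remains a best-effort wait.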
> Tatsuo Ishii wrote: > > I noticed sync() is used in PostgreSQL. > > > > CHECKPOINT -> FlushBufferPool() -> smgrsync() -> mdsync() -> sync() > > > > Can someone tell me why we need sync() here? > > As part of checkpoint, we discard some WAL files. To do that, we must > first be sure that all the dirty buffers we have written to the kernel > are actually on the disk. That is why the sync() is required. ?? I thought WAL files are synced by pg_fsync() (if needed). -- Tatsuo Ishii
> > As part of checkpoint, we discard some WAL files. To do that, we must > > first be sure that all the dirty buffers we have written to the kernel > > are actually on the disk. That is why the sync() is required. > > What we really need is something better than sync(), viz flush all dirty > buffers to disk *and* wait till they're written. But sync() and sleep > for awhile is the closest portable approximation. Are you saying that fsync() might not wait until the IO completes? -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > Can someone tell me why we need sync() here? > ?? I thought WAL files are synced by pg_fsync() (if needed). They are. But to write a checkpoint record --- which implies that the WAL records before it need no longer be replayed --- we have to ensure that all the changes-so-far in the regular database files are written down to disk. That is what we need sync() for. regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >> What we really need is something better than sync(), viz flush all dirty >> buffers to disk *and* wait till they're written. But sync() and sleep >> for awhile is the closest portable approximation. > Are you saying that fsync() might not wait until the IO completes? No, I said that sync() might not. Read the man pages. HPUX's man page for sync(2) says sync() causes all information in memory that should be on disk to be written out. ... The writing, although scheduled, is not necessarily complete upon return from sync. regards, tom lane
> > Are you saying that fsync() might not wait until the IO completes? > > No, I said that sync() might not. Read the man pages. HPUX's man > page for sync(2) says > > sync() causes all information in memory that should be on disk to be > written out. > ... > The writing, although scheduled, is not necessarily complete upon > return from sync. I'm just wondering why we do not use fsync() to flush data/index pages. -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > I'm just wondering why we do not use fsync() to flush data/index > pages. There isn't any efficient way to do that AFAICS. The process that wants to do the checkpoint hasn't got any way to know just which files need to be sync'd. Even if it did know, it's not clear to me that we can portably assume that process A issuing an fsync on a file descriptor F it's opened for file X will force to disk previous writes issued against the same physical file X by a different process B using a different file descriptor G. sync() is surely overkill, in that it writes out dirty kernel buffers that might have nothing at all to do with Postgres. But I don't see how to do better. regards, tom lane
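The scenario in question (writes issued through one descriptor, fsync() issued through another descriptor on the same file) has roughly the following shape in C. This is a sketch only: `fsync_by_path` is a hypothetical name, and a successful return says nothing about whether a given platform actually flushed the other descriptor's writes, which is exactly the portability doubt raised above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` on a fresh descriptor, fsync() it,
 * and close it again. Returns 0 on success, -1 on error (errno set).
 * Whether this also forces out writes made through *other* descriptors
 * for the same file is platform-dependent.
 */
int fsync_by_path(const char *path)
{
    int fd = open(path, O_RDWR);    /* new descriptor, unrelated to the writer's */
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;        /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return rc;
}
```

A checkpointing process would call this once per file it believes was modified, which presupposes that it knows that list of files in the first place.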
> Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > I'm just wondering why we do not use fsync() to flush data/index > > pages. > > There isn't any efficient way to do that AFAICS. The process that wants > to do the checkpoint hasn't got any way to know just which files need to > be sync'd. Even if it did know, it's not clear to me that we can > portably assume that process A issuing an fsync on a file descriptor F > it's opened for file X will force to disk previous writes issued against > the same physical file X by a different process B using a different file > descriptor G. > > sync() is surely overkill, in that it writes out dirty kernel buffers > that might have nothing at all to do with Postgres. But I don't see how > to do better. Thanks for a good summary. Maybe this is yet another reason to have a separate IO process like Oracle... -- Tatsuo Ishii
Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >> What we really need is something better than sync(), viz flush all dirty > >> buffers to disk *and* wait till they're written. But sync() and sleep > >> for awhile is the closest portable approximation. > > > Are you saying that fsync() might not wait until the IO completes? > > No, I said that sync() might not. Read the man pages. HPUX's man > page for sync(2) says > > sync() causes all information in memory that should be on disk to be > written out. > ... > The writing, although scheduled, is not necessarily complete upon > return from sync. Yep, BSD/OS says: BUGS Sync() may return before the buffers are completely flushed. At least they classify it as a bug. -- Bruce Momjian
Tom Lane wrote: > Tatsuo Ishii <t-ishii@sra.co.jp> writes: > > I'm just wondering why we do not use fsync() to flush data/index > > pages. > > There isn't any efficient way to do that AFAICS. The process that wants > to do the checkpoint hasn't got any way to know just which files need to > be sync'd. So the backends have to keep a common list of all the files they touch. Admittedly, that could be a problem if it means using a bunch of shared memory, and it may have additional performance implications depending on the implementation ... > Even if it did know, it's not clear to me that we can > portably assume that process A issuing an fsync on a file descriptor F > it's opened for file X will force to disk previous writes issued against > the same physical file X by a different process B using a different file > descriptor G. If the manpages are to be believed, then under FreeBSD, Linux, and HP-UX, calling fsync() will force to disk *all* unwritten buffers associated with the file pointed to by the filedescriptor. Sadly, however, the Solaris and IRIX manpages suggest that only buffers associated with the specific file descriptor itself are written, not necessarily all buffers associated with the file pointed at by the file descriptor (and interestingly, the Solaris version appears to be implemented as a library function and not a system call, if the manpage's section is any indication). > sync() is surely overkill, in that it writes out dirty kernel buffers > that might have nothing at all to do with Postgres. But I don't see how > to do better. It's obvious to me that sync() can have some very significant performance implications on a system that is acting as more than just a database server. So it should probably be used only when there's no good alternative. So: this is probably one of those cases where it's important to distinguish between operating systems and use the sync() approach only when it's uncertain that fsync() will do the job. 
So FreeBSD (and probably all the other BSD derivatives) definitely should use fsync() since they have known-good implementations. Linux and HP-UX 11 (if the manpage's wording can be trusted; I'm not sure about earlier versions) should use fsync() as well. Solaris and IRIX should use sync() since their manpages indicate that only data associated with the file descriptor will be written to disk. Under Linux (and perhaps HP-UX), it may be necessary to fsync() the directories leading to the file as well, so that the state of the filesystem on disk is consistent and safe in the event that the files in question are newly-created. Whether that's truly necessary or not appears to be filesystem-dependent. A quick perusal of the Linux source shows that ext2 appears to only sync the data and metadata associated with the inode of the specific file and not any parent directories, so it's probably a safe bet to fsync() any ancestor directories that matter as well as the file even if the system is running on top of a journalled filesystem. Since all the files in question probably reside in the same set of directories, the directory fsync()s can be deferred until the very end. -- Kevin Brown kevin@sysexperts.com
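Syncing a directory, as suggested above, might look roughly like the following in C. This is a minimal sketch under the assumption that the platform lets a directory be opened read-only and passed to fsync(); the name `fsync_directory` and its three-way return convention are inventions for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync() a directory so that newly created
 * entries in it reach disk.
 * Returns 0 on success, 1 if fsync() reports EINVAL (the object does
 * not support the operation), and -1 on any other error (errno set).
 */
int fsync_directory(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;        /* close() may overwrite errno */
    close(fd);

    if (rc != 0 && saved_errno == EINVAL)
        return 1;                   /* platform will not fsync a directory */

    errno = saved_errno;
    return rc;
}
```

Because the data files live in a small set of directories, a checkpoint could fsync each file first and then call this once per directory at the very end.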
Kevin Brown <kevin@sysexperts.com> writes: > So the backends have to keep a common list of all the files they > touch. Admittedly, that could be a problem if it means using a bunch > of shared memory, and it may have additional performance implications > depending on the implementation ... It would have to be a list of all files that have been touched since the last checkpoint. That's a serious problem for storage in shared memory, which is by definition fixed-size. >> Even if it did know, it's not clear to me that we can >> portably assume that process A issuing an fsync on a file descriptor F >> it's opened for file X will force to disk previous writes issued against >> the same physical file X by a different process B using a different file >> descriptor G. > If the manpages are to be believed, then under FreeBSD, Linux, and > HP-UX, calling fsync() will force to disk *all* unwritten buffers > associated with the file pointed to by the filedescriptor. > Sadly, however, the Solaris and IRIX manpages suggest that only > buffers associated with the specific file descriptor itself are > written, not necessarily all buffers associated with the file pointed > at by the file descriptor (and interestingly, the Solaris version > appears to be implemented as a library function and not a system call, > if the manpage's section is any indication). Right. "Portably" was the key word in my comment (sorry for not emphasizing this more clearly). The real problem here is how to know what is the actual behavior of each platform? I'm certainly not prepared to trust reading-between-the-lines-of-some-man-pages. And I can't think of a simple yet reliable direct test. You'd really have to invest detailed study of the kernel source code to know for sure ... and many of our platforms don't have open-source kernels. 
> Under Linux (and perhaps HP-UX), it may be necessary to fsync() the > directories leading to the file as well, so that the state of the > filesystem on disk is consistent and safe in the event that the files > in question are newly-created. AFAIK, all Unix implementations are paranoid about consistency of filesystem metadata, including directory contents. So fsync'ing directories from a user process strikes me as a waste of time, even assuming that it were portable, which I doubt. What we need to worry about is whether fsync'ing a bunch of our own data files is a practical substitute for a global sync() call. regards, tom lane
Tom Lane wrote: > Kevin Brown <kevin@sysexperts.com> writes: > > So the backends have to keep a common list of all the files they > > touch. Admittedly, that could be a problem if it means using a bunch > > of shared memory, and it may have additional performance implications > > depending on the implementation ... > > It would have to be a list of all files that have been touched since the > last checkpoint. That's a serious problem for storage in shared memory, > which is by definition fixed-size. Of course, the file list needn't be stored in SysV shared memory. It could be stored in a file that's later read by the checkpointing process. The backends could serialize their writes via fcntl() or ioctl() style locks, whichever is appropriate. Locking might even be avoided entirely if the individual writes are small enough. > Right. "Portably" was the key word in my comment (sorry for not > emphasizing this more clearly). The real problem here is how to know > what is the actual behavior of each platform? I'm certainly not > prepared to trust reading-between-the-lines-of-some-man-pages. Reading between the lines isn't necessarily required, just literal interpretation. :-) > And I can't think of a simple yet reliable direct test. You'd > really have to invest detailed study of the kernel source code to > know for sure ... and many of our platforms don't have open-source > kernels. Linux appears to do the right thing with the file data itself, even if it doesn't handle the directory entry simultaneously. Others claim, in messages written to pgsql-general and elsewhere (via Google search), that FreeBSD does the right thing for sure. I certainly agree that non-open-source kernels are uncertain. That's why it wouldn't be a bad idea to control this via a GUC variable. 
> > Under Linux (and perhaps HP-UX), it may be necessary to fsync() the > > directories leading to the file as well, so that the state of the > > filesystem on disk is consistent and safe in the event that the files > > in question are newly-created. > > AFAIK, all Unix implementations are paranoid about consistency of > filesystem metadata, including directory contents. Not ext2 under Linux! By default, it writes everything asynchronously. I don't know how many people use ext2 to do serious tasks under Linux, so this may not be that much of an issue. > So fsync'ing directories from a user process strikes me as a waste > of time, even assuming that it were portable, which I doubt. What > we need to worry about is whether fsync'ing a bunch of our own data > files is a practical substitute for a global sync() call. I'm positive that under certain operating systems, fsyncing the data is a better option than a global sync(), especially since sync() isn't guaranteed to wait until the buffers are flushed. Right now the state of the data on disk immediately after a checkpoint is just a guess because of that. I don't see that using fsync() would introduce significantly more uncertainty on systems where the manpage explicitly says that the buffers associated with the file referenced by the file descriptor are the ones written to disk. For instance, the FreeBSD manpage says: Fsync() causes all modified data and attributes of fd to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk. Fsync() should be used by programs that require a file to be in a known state, for example, in building a simple transaction facility. and the Linux manpage says: fsync copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage. It also updates metadata stat information.
It does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync on the file descriptor of the directory is also needed. Both are rather unambiguous, and a cursory review of the Linux source confirms what its manpage says, at least. The FreeBSD manpage might be ambiguous, but the fact that they also have an fsync command line utility essentially proves that FreeBSD's fsync() flushes all buffers associated with the file. Conversely, the Solaris manpage says: The fsync() function moves all modified data and attributes of the file descriptor fildes to a storage device. When fsync() returns, all in-memory modified copies of buffers associated with fildes have been written to the physical medium. It's pretty clear from the Solaris description that its fsync() concerns itself only with the buffers associated with a file descriptor and not with the file itself. The fact that it's implemented as a library call (the manpage is in section 3 instead of section 2) convinces me further that its fsync() implementation is as described. The PostgreSQL default for checkpoints should probably be sync(), but I think fsync() should be an available option, just as it's possible to control whether or not synchronous writes are used for the transaction log as well as the type of synchronization mechanism used for it. Yes, it's another parameter for the administrator to concern himself with, but it seems to me that a significant amount of speed could be gained under certain (perhaps quite common) circumstances with such a mechanism. -- Kevin Brown kevin@sysexperts.com
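The proposal to make the checkpoint flush method selectable could be sketched in C along these lines. Everything here is hypothetical: the enum, its values, and `checkpoint_flush` are illustrative names, not an existing PostgreSQL setting, and the sketch simply assumes the caller already holds descriptors for every file touched since the last checkpoint, which is the part identified earlier in the thread as the hard problem.

```c
#include <unistd.h>

/* Hypothetical setting: how a checkpoint forces data files to disk. */
enum checkpoint_flush_method {
    FLUSH_WITH_SYNC,    /* global sync(); the conservative default */
    FLUSH_WITH_FSYNC    /* fsync() each known data file individually */
};

/*
 * Hypothetical dispatcher. `fds` lists descriptors for the files that
 * must be flushed; it is ignored when the method is FLUSH_WITH_SYNC.
 * Returns 0 on success, -1 if any fsync() fails (errno set).
 */
int checkpoint_flush(enum checkpoint_flush_method method,
                     const int *fds, int nfds)
{
    if (method == FLUSH_WITH_SYNC) {
        sync();         /* may return before the writes complete */
        return 0;
    }

    for (int i = 0; i < nfds; i++) {
        if (fsync(fds[i]) != 0)
            return -1;
    }
    return 0;
}
```

An administrator on a platform whose fsync() is trusted to cover the whole file would pick `FLUSH_WITH_FSYNC`; everyone else would stay on `FLUSH_WITH_SYNC`.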
Tom Lane writes: > Right. "Portably" was the key word in my comment (sorry for not > emphasizing this more clearly). The real problem here is how to know > what is the actual behavior of each platform? I'm certainly not > prepared to trust reading-between-the-lines-of-some-man-pages. And I > can't think of a simple yet reliable direct test. Is the "Single Unix Standard, version 2" (aka UNIX98) any better? It says for fsync(): "The fsync() function forces all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronised I/O completion state. All I/O operations are completed as defined for synchronised I/O file integrity completion." This to me clearly says that changes to the file must be written, not just changes made via this file descriptor. I did have to test this behaviour once (for a customer, strange situation) but I couldn't find a portable way to do it, either. What I did was read the appropriate disk block from the raw device to bypass the buffer cache. As this required low level knowledge of the on-disk filesystem layout it was not very portable. For anyone interested Tom Christiansen's "icat" program can be ported to UFS derived filesystems fairly easily: http://www.rosat.mpe-garching.mpg.de/mailing-lists/perl5-porters/1997-04/msg00487.html > AFAIK, all Unix implementations are paranoid about consistency of > filesystem metadata, including directory contents. So fsync'ing > directories from a user process strikes me as a waste of time, ... There is one variant where this is not the case: Linux using ext2fs and possibly other filesystems. There was a flame fest of great entertainment value a few years ago between Linus Torvalds and Dan Bernstein. Of course, neither was able to influence the opinion of the other to any noticeable degree, but it made fun reading.
I think this might be a starting point: http://www.ornl.gov/cts/archives/mailing-lists/qmail/1998/05/msg00667.html A more recent posting from Linus where he continues to recommend fsync() is this: http://www.cs.helsinki.fi/linux/linux-kernel/2001-29/0659.html I've not heard that any other Unix-like OS has abandoned the traditional and POSIX semantic. > assuming that it were portable, which I doubt. What we need to worry > about is whether fsync'ing a bunch of our own data files is a practical > substitute for a global sync() call. I wish that it were. There are situations (several GB buffer caches, for example) where I mistrust the current use of sync() to have all writes completed before the sleep() returns. My concern is theoretical at the moment -- I never get to play with machines that large! Regards, Giles
On Mon, Jan 13, 2003 at 07:31:08PM +1100, Giles Lean wrote: > > Is the "Single Unix Standard, version 2" (aka UNIX98) any better? > It says for fsync(): > > "The fsync() function forces all currently queued I/O operations > associated with the file indicated by file descriptor fildes to > the synchronised I/O completion state. All I/O operations are > completed as defined for synchronised I/O file integrity > completion." In version 3 it says: The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes in an implementation-defined manner. The fsync() function shall not return until the system has completed that action or until an error is detected. [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion. [Option End] Kurt
Kurt Roeckx wrote: > [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the > fsync() function shall force all currently queued I/O operations > associated with the file indicated by file descriptor fildes to the > synchronized I/O completion state. All I/O operations shall be > completed as defined for synchronized I/O file integrity > completion. [Option End] Hmmm....so if I consistently want these semantics out of fsync() I have to #define _POSIX_SYNCHRONIZED_IO? Or does the above mean that you'll get those semantics if and only if the OS defines the above for you? I certainly hope the former is the case, because the newer semantics which you mentioned in the section I cut don't do us any good at all and we can't rely on the OS to define something like _POSIX_SYNCHRONIZED_IO for us... Being able to open a file, do an fsync(), and have the kernel actually write all the buffers associated with that file to disk could be, I think, a significant performance win compared with the "flush everything known to the kernel" approach we take now, at least on systems that do something other than PostgreSQL... -- Kevin Brown kevin@sysexperts.com
On Sat, Feb 01, 2003 at 08:15:17AM -0800, Kevin Brown wrote: > Kurt Roeckx wrote: > > [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the > > fsync() function shall force all currently queued I/O operations > > associated with the file indicated by file descriptor fildes to the > > synchronized I/O completion state. All I/O operations shall be > > completed as defined for synchronized I/O file integrity > > completion. [Option End] > > Hmmm....so if I consistently want these semantics out of fsync() I > have to #define _POSIX_SYNCHRONIZED_IO? Or does the above mean that > you'll get those semantics if and only if the OS defines the above for > you? It's something that will be defined in unistd.h. Depending on the value you know if the system supports it always, you can turn it on per application, or it's always on. You know that this standard is freely available on the Internet? (http://www.unix-systems.org/version3/online.html) There are other comments in it about its usage. Note that there also is a function call fdatasync() in the Synchronized IO extension. Kurt
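Checking for the option from application code might be sketched as below. The program does not define `_POSIX_SYNCHRONIZED_IO` itself; it reads what `<unistd.h>` provides and, when the compile-time value is not conclusive, asks `sysconf()` at run time. The function name `have_synchronized_io` is an assumption for the example.

```c
#include <unistd.h>

/*
 * Hypothetical helper: report whether the Synchronized I/O option is
 * available. Returns 1 if supported, 0 if not.
 */
int have_synchronized_io(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return 1;                                   /* supported at compile time */
#elif defined(_SC_SYNCHRONIZED_IO)
    return sysconf(_SC_SYNCHRONIZED_IO) > 0;    /* ask the running system */
#else
    return 0;                                   /* no way to ask; assume absent */
#endif
}
```

A build could run this once at startup and fall back to a global sync() when it returns 0.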
Kurt Roeckx wrote: > On Sat, Feb 01, 2003 at 08:15:17AM -0800, Kevin Brown wrote: > > Kurt Roeckx wrote: > > > [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the > > > fsync() function shall force all currently queued I/O operations > > > associated with the file indicated by file descriptor fildes to the > > > synchronized I/O completion state. All I/O operations shall be > > > completed as defined for synchronized I/O file integrity > > > completion. [Option End] > > > > Hmmm....so if I consistently want these semantics out of fsync() I > > have to #define _POSIX_SYNCHRONIZED_IO? Or does the above mean that > > you'll get those semantics if and only if the OS defines the above for > > you? > > It's something that will be defined in unistd.h. Depending on > the value you know if the system supports it always, you can turn > it on per application, or it's always on. > > You know that this standard is freely available on the Internet? > (http://www.unix-systems.org/version3/online.html) > > There are other comments in it about its usage. > > Note that there also is a function call fdatasync() in the > Synchronized IO extension. Ah, excellent, thank you. Yes, fdatasync() is *exactly* what we need, since it's defined thusly: "The functionality shall be equivalent to fsync() with the symbol _POSIX_SYNCHRONIZED_IO defined, with the exception that all I/O operations shall be completed as defined for synchronized I/O data integrity completion". Looks to me like we have a winner. Question is, can we bank on its existence and, if so, is it properly implemented on all platforms that support it? Since we've been talking about porting to rather different platforms (win32 in particular), it seems logical to build a PGFileSync() function or something (perhaps a single PGSync() which synchronizes all relevant PG files to disk, with sync() if necessary) and which would thus use fdatasync() or its equivalent. -- Kevin Brown kevin@sysexperts.com
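One possible shape for the per-file half of the PGFileSync() idea is sketched below. It is a hedged illustration, not an existing PostgreSQL function: the name `pg_file_sync_sketch` is hypothetical, the choice between fdatasync() and fsync() is made purely from the compile-time value of `_POSIX_SYNCHRONIZED_IO`, and no win32 branch is attempted.

```c
#include <unistd.h>

/*
 * Hypothetical wrapper: flush one file's data to disk using
 * fdatasync() where the Synchronized I/O option is advertised at
 * compile time, and plain fsync() otherwise.
 * Returns 0 on success, -1 on error (errno set).
 */
int pg_file_sync_sketch(int fd)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return fdatasync(fd);   /* data integrity completion */
#else
    return fsync(fd);       /* fall back to the older call */
#endif
}
```

Whether either call is correctly implemented on a given platform is still the open question; the wrapper only gives the rest of the code a single place to change once that is known.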