Thread: fsync or fdatasync
Hi all, apparently the default value for wal_sync_method is fsync, and apparently the best method is fdatasync. Is there any particular reason not to use fdatasync as the default? Ciao Gaetano
On Mon, 9 Sep 2002, Gaetano Mendola wrote: > Hi all, > apparently the default value for wal_sync_method is fsync, > and apparently the best method is fdatasync. > There is any special consideration for not use the fdatasync > as default ? IIRC, there were systems that did not have a functional fdatasync.
On Mon, 9 Sep 2002, Gaetano Mendola wrote: GM> apparently the default value for wal_sync_method is fsync, GM> and apparently the best method is fdatasync. GM> There is any special consideration for not use the fdatasync GM> as default ?
#wal_sync_method = fsync   # the default varies across platforms:
#                          # fsync, fdatasync, open_sync, or open_datasync
I suppose fdatasync *is* the default on platforms where it exists. On *BSD it does not. Sincerely, D.Marck [DM5020, DM268-RIPE, DM3-RIPN] ------------------------------------------------------------------------ *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- marck@rinet.ru *** ------------------------------------------------------------------------
On Mon, Sep 09, 2002 at 11:19:50AM +0200, Gaetano Mendola wrote: > Hi all, > apparently the default value for wal_sync_method is fsync, > and apparently the best method is fdatasync. ^^^^^^^^^^ On which platform? There are all sorts of variables to consider here, most notably whether the given platform happens to support fdatasync. I have found that on some platforms, open_datasync is faster than anything. But you can certainly use fdatasync if your platform supports it. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
"Gaetano Mendola" <mendola@bigfoot.com> writes: > apparently the default value for wal_sync_method is fsync, > and apparently the best method is fdatasync. Best on what platform, and according to what evidence? We'll be glad to consider changing the default for a specific platform if we have a reasonably convincing argument that the other value is better. So far, not much study has been done of which method is best on which platforms (and under what load conditions). regards, tom lane
Dmitry Morozovsky <marck@rinet.ru> writes: > #wal_sync_method = fsync # the default varies across platforms: > # # fsync, fdatasync, open_sync, or open_datasync > I suppose fdatasync *is* the default on platforms where it exists. On > *BSD it does not. [ looks at code... ] Actually, the current algorithm for choosing the default is "open_datasync if it exists, else fdatasync if it exists, else fsync". There probably are platforms where this method yields a non-optimal answer, but we need more data before fooling with it. regards, tom lane
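The selection order Tom describes ("open_datasync if it exists, else fdatasync if it exists, else fsync") can be sketched in a few lines. This is illustrative Python, not PostgreSQL's own C code: the names `default_wal_sync_method` and `probe_platform` are hypothetical, and using `hasattr` on the `os` module as the availability check is an assumption made for the example.

```python
import os


def default_wal_sync_method(have_open_datasync, have_fdatasync):
    """Hypothetical helper: apply the fallback order quoted in the thread."""
    if have_open_datasync:
        return "open_datasync"
    if have_fdatasync:
        return "fdatasync"
    return "fsync"


def probe_platform():
    """Assumed probe: Python only exposes these names where the OS has them."""
    return hasattr(os, "O_DSYNC"), hasattr(os, "fdatasync")


if __name__ == "__main__":
    print(default_wal_sync_method(*probe_platform()))
```

The point of writing it this way is that the default is a property of the platform, not a single fixed value, which is why the comment in postgresql.conf says it "varies across platforms".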
> > apparently the default value for wal_sync_method is fsync, > > and apparently the best method is fdatasync. > > Best on what platform, and according to what evidence? > > We'll be glad to consider changing the default for a specific platform > if we have a reasonably convincing argument that the other value is > better. So far, not much study has been done of which method is best > on which platforms (and under what load conditions). Heh, just a quick note: I had one of FreeBSD's kernel gurus point out that fsync() on Linux is a no-op. I haven't had that src tree in years so I can't confirm, but I'm inclined to believe him. Just an FYI. -sc -- Sean Chittenden
On Mon, Sep 09, 2002 at 04:43:04PM -0700, Sean Chittenden wrote: > > We'll be glad to consider changing the default for a specific platform > > if we have a reasonably convincing argument that the other value is > > better. So far, not much study has been done of which method is best > > on which platforms (and under what load conditions). > > heh, just a quick note: I had one of FreeBSD's kernel guru's point out > that fsync() on linux is a no-op. I haven't had that src tree in > years so I can't confirm, but I'm inclined to believe him. Just an No, fsync() is not a no-op on linux. Unless the filesystem is mounted with o_sync, I suppose - then everything is written at write() so fsync() is not needed. But generally, it does sync. -- Ragnar Kjørstad Big Storage
> > > We'll be glad to consider changing the default for a specific > > > platform if we have a reasonably convincing argument that the > > > other value is better. So far, not much study has been done of > > > which method is best on which platforms (and under what load > > > conditions). > > > > heh, just a quick note: I had one of FreeBSD's kernel guru's point > > out that fsync() on linux is a no-op. I haven't had that src tree > > in years so I can't confirm, but I'm inclined to believe him. > > Just an > > No, fsync() is not a no-op on linux. > Unless the filesystem is mounted with o_sync, I suppose - then > everything is written at write() so fsync() is not needed. But > generally, it does sync. Hrm, alright. From what I've figured out, about ~6wk ago fsync() was added to linux to have it actually fsync()... mind you someone quickly turned around and created a new patchset that ripped the functionality out and added it to an extreme linux distro. ::shrug:: <opinion>Linux is out of control.</opinion> -sc PS This wasn't a flame/troll, just trying to figure out what the best recommendation would be for FreeBSD and it is fsync(). -- Sean Chittenden
On Mon, Sep 09, 2002 at 05:11:27PM -0700, Sean Chittenden wrote: > > No, fsync() is not a no-op on linux. > > Unless the filesystem is mounted with o_sync, I suppose - then > > everything is written at write() so fsync() is not needed. But > > generally, it does sync. > > Hrm, alright. From what I've figured out, about ~6wk ago fsync() was > added to linux to have it actually fsync()... mind you someone quickly > turned around and created a new patchset that ripped the functionality > out and added it to an extreme linux distro. ::shrug:: <opinion>Linux > is out of control.</opinion> -sc "6wk"? Linux has had fsync for as long as I can remember. Maybe you have it confused with fsync() over NFS? The NFSv2 implementation on linux used to have "async" flag for nfs as default - making it non NFS-compliant without reconfiguration. -- Ragnar Kjørstad Big Storage
> > > No, fsync() is not a no-op on linux. Unless the filesystem is > > > mounted with o_sync, I suppose - then everything is written at > > > write() so fsync() is not needed. But generally, it does sync. > > > > Hrm, alright. From what I've figured out, about ~6wk ago fsync() > > was added to linux to have it actually fsync()... mind you someone > > quickly turned around and created a new patchset that ripped the > > functionality out and added it to an extreme linux distro. > > ::shrug:: <opinion>Linux is out of control.</opinion> -sc > > "6wk"? > > Linux has had fsync for as long as I can remember. > > Maybe you have it confused with fsync() over NFS? The NFSv2 > implementation on linux used to have "async" flag for nfs as default > - making it non NFS-compliant without reconfiguration. The fsync() call has existed, but in the kernel it didn't actually do anything is what I've been told. -sc -- Sean Chittenden
"Tom Lane" <tgl@sss.pgh.pa.us> wrote in message news:11753.1031590251@sss.pgh.pa.us... > "Gaetano Mendola" <mendola@bigfoot.com> writes: > > apparently the default value for wal_sync_method is fsync, > > and apparently the best method is fdatasync. > > Best on what platform, and according to what evidence? Well, the man page says (Linux): fdatasync flushes all data buffers of a file to disk (before the system call returns). It resembles fsync but is not required to update the metadata such as access time. Applications that access databases or log files often write a tiny data fragment (e.g., one line in a log file) and then call fsync immediately in order to ensure that the written data is physically stored on the harddisk. Unfortunately, fsync will always initiate two write operations: one for the newly written data and another one in order to update the modification time stored in the inode. If the modification time is not a part of the transaction concept fdatasync can be used to avoid unnecessary inode disk write operations. So, what is wrong here? It seems clear that one write is better than two. Ciao Gaetano
The original poster was wrong about the default. We use fdatasync where available, and fsync when it is not. We also use O_SYNC on open if it is available. --------------------------------------------------------------------------- Gaetano Mendola wrote: > > "Tom Lane" <tgl@sss.pgh.pa.us> wrote in message > news:11753.1031590251@sss.pgh.pa.us... > > "Gaetano Mendola" <mendola@bigfoot.com> writes: > > > apparently the default value for wal_sync_method is fsync, > > > and apparently the best method is fdatasync. > > > > Best on what platform, and according to what evidence? > > Well, the man say ( Linux ): > > > fdatasync flushes all data buffers of a file to disk (before the system call > returns). It resembles fsync but is > not required to update the metadata such as access time. > > Applications that access databases or log files often write a tiny > data fragment (e.g., one line in a log file) > and then call fsync immediately in order to ensure that the > written data is physically stored on the harddisk. > Unfortunately, fsync will always initiate two write operations: one > for the newly written data and another one in > order to update the modification time stored in the inode. If the > modification time is not a part of the transaction > concept fdatasync can be used to avoid unnecessary inode disk > write operations. > > > So, what is wrong here ? Seems clear that one write is better than two. > > Ciao > Gaetano -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
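The man page excerpt quoted above contrasts fsync with fdatasync for the "append a small record, then flush" pattern. A minimal Python sketch of that pattern follows; `append_and_flush` is a hypothetical helper, and falling back to `os.fsync` where `os.fdatasync` is missing is an assumption chosen so the sketch runs on platforms without fdatasync. It shows the calls only and says nothing about which is faster on a given system.

```python
import os
import tempfile


def append_and_flush(path, payload, data_only):
    """Append payload to path and flush it; return the flush call used."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        if data_only and hasattr(os, "fdatasync"):
            # Flushes file data; metadata such as timestamps may be skipped.
            os.fdatasync(fd)
            return "fdatasync"
        # Flushes file data and the file's metadata.
        os.fsync(fd)
        return "fsync"
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "example.log")
        print(append_and_flush(log, b"one log line\n", data_only=True))
```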
On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: > > The original poster was wrong about the default. > > We use fdatasync where available, and fsync when it is not. Makes sense. > We also use > O_SYNC on open if it is available. Why? That will slow things down... -- Ragnar Kjørstad
Ragnar Kjørstad wrote: > On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: > > > > The original poster was wrong about the default. > > > > We use fdatasync where available, and fsync when it is not. > > Makes sense. > > > We also use > > O_SYNC on open if it is available. > > Why? That will slow tings down... Actually, no, we are only O_SYNC'ing the WAL writes, and sometimes that is faster because you are not writing and then fsyncing, you are just writing. fdatasync is only better than O_SYNC when you are doing multiple WAL writes before an fdatasync, and we normally don't do that. -- Bruce Momjian
On Tue, Sep 10, 2002 at 01:17:56PM -0400, Bruce Momjian wrote: > Ragnar Kjørstad wrote: > > On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: > > > We also use > > > O_SYNC on open if it is available. > > > > Why? That will slow tings down... > > Actually, no, we are only O_SYNC'ing the WAL writes and sometimes that > is faster because you are not writing then fsyncing, you are just > writing. The fdatasync only is better than O_SYNC when you are doing > multiple WAL writes before an fdatasync and we normally don't do that. OK, if it is a single write it makes sense. (But I doubt it makes much of a difference: the overhead of a system call is almost nothing compared to a write to disk...) -- Ragnar Kjørstad
On Mon, Sep 09, 2002 at 05:33:18PM -0700, Sean Chittenden wrote: > > Linux has had fsync for as long as I can remember. > > > > Maybe you have it confused with fsync() over NFS? The NFSv2 > > implementation on linux used to have "async" flag for nfs as default > > - making it non NFS-compliant without reconfiguration. > > The fsync() call has existed, but in the kernel it didn't actually do > anything is what I've been told. -sc I think that's just wrong. I just downloaded linux-2.0.1 (from 1996) to check, and it _does_ fsync. OK, so it does it pretty inefficiently, but it still does the job. -- Ragnar Kjørstad Big Storage
Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes: > On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: >> We use fdatasync where available, and fsync when it is not. > Makes sense. >> We also use O_SYNC on open if it is available. s/also/instead/ ... open_datasync is the first choice if available. > Why? That will slow tings down... On what evidence do you assert that? In theory open_datasync can be the fastest alternative for WAL writing, because it should cause the kernel to force each WAL write() request down to disk immediately. fdatasync will result in the same amount of I/O, but it will also require the kernel to scan its disk cache to see if there are any other dirty blocks that need to be written. On many kernels this check is not very efficient and can chew substantial amounts of CPU time. The tradeoff is that open_datasync syncs each WAL block individually, which is unnecessary if you are committing multiple blocks worth of WAL entries at once --- but there's no hard evidence that that slows things down, especially not when the WAL logs are on their own disk spindle. Giving the kernel scheduling freedom for a small number of blocks doesn't help much anyway in that case. Check the pghackers archives (a year or two back) for lots and lots of discussion, but I recall we demonstrated that the current default choices are reasonable for at least some set of Unixen. If you've got more information showing that the present default is wrong on some kernel, let's have it ... but don't waste our time with blanket assertions that "X is the right (or wrong) choice", because we know that's not so across all the platforms we support. We'd not have bothered with four sync methods if there weren't good evidence that each is the best available choice on some platforms. regards, tom lane
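The tradeoff Tom describes (one implicit sync per write() with open_datasync, versus several buffered writes followed by a single fdatasync) can be made concrete with a small sketch. This is an illustration in Python, not PostgreSQL code: the function names, the 8192-byte `BLOCK` size, and the fallbacks to `O_SYNC` or `os.fsync` on platforms lacking the preferred call are all assumptions for the example.

```python
import os
import tempfile

BLOCK = b"\0" * 8192  # assumed block size, for illustration only


def write_open_datasync(path, nblocks):
    """Open with a sync flag so every write() is pushed to disk itself."""
    # Prefer O_DSYNC; fall back to O_SYNC, or to no flag where neither exists.
    sync_flag = getattr(os, "O_DSYNC", None) or getattr(os, "O_SYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o600)
    try:
        for _ in range(nblocks):
            os.write(fd, BLOCK)
    finally:
        os.close(fd)
    return nblocks  # one implicit sync per block written


def write_then_fdatasync(path, nblocks):
    """Write all blocks through the cache, then flush them with one call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(nblocks):
            os.write(fd, BLOCK)
        getattr(os, "fdatasync", os.fsync)(fd)
    finally:
        os.close(fd)
    return 1  # a single explicit sync for the whole batch


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_open_datasync(os.path.join(tmp, "a"), 4))
        print(write_then_fdatasync(os.path.join(tmp, "b"), 4))
```

With a single block per commit the two approaches do the same amount of I/O, which is the case the thread says is the common one; the difference only appears when several blocks are written before the flush.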
On Tue, Sep 10, 2002 at 03:17:00PM -0400, Tom Lane wrote: > Ragnar Kjørstad <postgres@ragnark.vestdata.no> writes: > > On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: > >> We use fdatasync where available, and fsync when it is not. > > > Makes sense. > > >> We also use O_SYNC on open if it is available. > > s/also/instead/ ... Yes, I understood that... > open_datasync is the first choice if available. I assume open_datasync means open with O_SYNC flag.. > > Why? That will slow tings down... > > On what evidence do you assert that? > > In theory open_datasync can be the fastest alternative for WAL writing, > because it should cause the kernel to force each WAL write() request > down to disk immediately. fdatasync will result in the same amount of > I/O, but it will also require the kernel to scan its disk cache to see > if there are any other dirty blocks that need to be written. On many > kernels this check is not very efficient and can chew substantial > amounts of CPU time. Yes, I see your argument. However, I've just checked the Linux implementation of fsync() and I can't really see how it could chew substantial amounts of CPU time. The way it works, every inode has a list of dirty data buffers; all it does is traverse that list and do a write on each. Anyway, I'm sure this is not enough to convince you, so I'll have to set up a test instead. But not tonight. > The tradeoff is that open_datasync syncs each WAL > block individually, which is unnecessary if you are committing > multiple blocks worth of WAL entries at once --- but there's no hard > evidence that that slows things down, especially not when the WAL logs > are on their own disk spindle. Well, in theory fsync() will allow the disk to reorder the writes, and that should give significantly better performance, because it will reduce the required number of seeks. If the WAL is on a separate spindle there will be very few seeks in the first place, so there is less to gain, but for the case with the WAL on the same disk as something else there is probably some gain. But it makes sense to optimize for the WAL-on-separate-disk case... Another advantage is that fsync() would allow the elevator to merge multiple I/O requests. Still the same number of bytes to write, but fewer, bigger requests are typically faster. But again, numbers speak. I'll get back to you once I find the time to test it. > Check the pghackers archives (a year or two back) for lots and lots of > discussion, but I recall we demonstrated that the current default > choices are reasonable for at least some set of Unixen. If you've got > more information showing that the present default is wrong on some > kernel, let's have it ... but don't waste our time with blanket > assertions that "X is the right (or wrong) choice", because we know > that's not so across all the platforms we support. We'd not have > bothered with four sync methods if there weren't good evidence that each > is the best available choice on some platforms. No argument there; I'm sure there are applications for all of them. My point is that I think fdatasync() would be the fastest choice for the Linux kernel. -- Ragnar Kjørstad
Ragnar Kjørstad wrote: > > open_datasync is the first choice if available. > > I assume open_datasync means open with O_SYNC flag.. Yes. > > > Why? That will slow tings down... > > > > On what evidence do you assert that? > > > > In theory open_datasync can be the fastest alternative for WAL writing, > > because it should cause the kernel to force each WAL write() request > > down to disk immediately. fdatasync will result in the same amount of > > I/O, but it will also require the kernel to scan its disk cache to see > > if there are any other dirty blocks that need to be written. On many > > kernels this check is not very efficient and can chew substantial > > amounts of CPU time. > > Yes, I see your argument. > However, I've just checked the linux-implementation of fsync() and I > can't really see how it could chew substantial amounts of CPU time. The > way it works every inode has a list of dirty data buffers - all it does > it traverse that list and do a write on each. Remember we support >15 platforms, and I know there is at least one (HPUX?) which does the fsync/fdatasync block finding inefficiently. It may have even been old Linux; I cannot remember. > Anyway - I'm sure this is not enough to convince you, so I'll have to > set up a test instead. But not tonight. Again, that is a test case for only one OS. It is helpful if we are going to start doing per-OS defaults, which is something we have talked about. What would be great is a test program we can run on different OS's to find out which is more efficient. > > > > The tradeoff is that open_datasync syncs each WAL > > block individually, which is unnecessary if you are committing > > multiple blocks worth of WAL entries at once --- but there's no hard > > evidence that that slows things down, especially not when the WAL logs > > are on their own disk spindle. > > Well, in theory fsync() will allow the disk to reorder the writes, and > that should give significantly better performance, because it will > reduce the required number of seeks. If the WAL is on a seperate spindel > there will very few seeks in the first place, so there is less to gain, > but for the case with the WAL on the same disk as something else there > is probably some gain. But it makes sense to optimize for the > WAL-on-seperate-disk case... Remember, in most cases, we are fsync'ing only one block so there is no _gathering_ to do. -- Bruce Momjian
On Tue, Sep 10, 2002 at 05:07:30PM -0400, Bruce Momjian wrote: > > Anyway - I'm sure this is not enough to convince you, so I'll have to > > set up a test instead. But not tonight. > > Again, that is a test case for only one OS. It is helpful if we are > going to start doing per-OS defaults, which is something we have talked > about. Oh; I assumed that was already the case. > What would be great is a test program we can run on different > OS's to find out which is more efficient. Yes. Bear in mind, though, that this is as much a filesystem issue as a kernel issue. Two different filesystems on the same kernel may behave very differently. Of course one can't distribute separate versions of PostgreSQL optimized for different filesystems, so perhaps it's just as good to leave the default as it is and instead provide some info (e.g. benchmarks for different settings on different filesystems on different operating systems, and the benchmark script itself so people can do their own tests). This way the default is all right, and users who need to tweak a little extra have the info they need. > Remember, in most cases, we are fsync'ing only one block so there is no > _gathering_ to do. Yes, I know you said so. But if that's the case for only most cases, there are some cases where it's not, so there is still some potential. -- Ragnar Kjørstad
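The kind of test program discussed here, one that can be run on a given OS and filesystem to compare the sync methods, might look roughly like the following. It is a rough sketch under stated assumptions, not an existing PostgreSQL tool: the function names, the 8192-byte block, and the default of 20 writes are all made up for the example, and a method is only timed if Python exposes the underlying call or flag on the platform.

```python
import os
import tempfile
import time

BLOCK = b"x" * 8192  # assumed block size, for illustration only


def available_methods():
    """List the sync methods this platform appears to support."""
    methods = ["fsync"]
    if hasattr(os, "fdatasync"):
        methods.append("fdatasync")
    if hasattr(os, "O_SYNC"):
        methods.append("open_sync")
    if hasattr(os, "O_DSYNC"):
        methods.append("open_datasync")
    return methods


def _open_flags(method):
    flags = os.O_WRONLY | os.O_CREAT
    if method == "open_sync":
        flags |= os.O_SYNC
    elif method == "open_datasync":
        flags |= os.O_DSYNC
    return flags


def time_method(method, directory, writes=20):
    """Time `writes` single-block writes, each made durable via `method`."""
    path = os.path.join(directory, method + ".dat")
    fd = os.open(path, _open_flags(method), 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, BLOCK)
            if method == "fsync":
                os.fsync(fd)
            elif method == "fdatasync":
                os.fdatasync(fd)
            # open_sync / open_datasync: the open flag syncs each write.
        return time.perf_counter() - start
    finally:
        os.close(fd)


def benchmark(directory, writes=20):
    return {m: time_method(m, directory, writes) for m in available_methods()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = benchmark(tmp)
        for method, seconds in sorted(results.items(), key=lambda kv: kv[1]):
            print(f"{method:14s} {seconds:.4f}s")
```

Two caveats apply to any numbers it prints. The temporary directory may sit on a memory-backed filesystem, so for a meaningful comparison `directory` should point at the filesystem that would actually hold the WAL. The sketch also appends to a growing file, so every flush has to record the new file size as well; writing into a preallocated file could give different relative results.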
Ragnar Kjørstad wrote: > On Tue, Sep 10, 2002 at 05:07:30PM -0400, Bruce Momjian wrote: > > > Anyway - I'm sure this is not enough to convince you, so I'll have to > > > set up a test instead. But not tonight. > > > > Again, that is a test case for only one OS. It is helpful if we are > > going to start doing per-OS defaults, which is something we have talked > > about. > > Oh; I assumed that was already the case. > > > What would be great is a test program we can run on different > > OS's to find out which is more efficient. > > Yes. Bare in mind though, that this is as much a filesystem issue as a > kernel issue. Two different filesystems on the same kernel may behave > very differently. > > Of course one can't distribute seperate postgresql in different versions > optimized for differet filesystems, so perhaps it's just as good to > leave the default as it is and rather put some info (e.g. benchmarks for > different setting on different filesystems on different operating > systems, and the benchmark-script itself so people can do their own > tests). This way the default is allright, and users that need to tweek a > little extra have the info they need. > What we could do ideally is set the default by running some test during initdb perhaps. > > Remember, in most cases, we are fsync'ing only one block so there is no > > _gathering_ to do. > > Yes, I know you said so. But if that's the case for only most cases > there are some cases were it's not - so there is still some potential. Yes, but the cases are so rare it is probably not worth bothering about, especially since O_SYNC has to be set on file open, so you can't switch between that and fdatasync depending on how many blocks you have. -- Bruce Momjian
pgman@candle.pha.pa.us (Bruce Momjian) writes: > Ragnar Kjørstad wrote: > > On Tue, Sep 10, 2002 at 11:40:24AM -0400, Bruce Momjian wrote: > > > > > > The original poster was wrong about the default. > > > > > > We use fdatasync where available, and fsync when it is not. > > > > Makes sense. > > > > > We also use > > > O_SYNC on open if it is available. > > > > Why? That will slow tings down... > > Actually, no, we are only O_SYNC'ing the WAL writes and sometimes that > is faster because you are not writing then fsyncing, you are just > writing. The fdatasync only is better than O_SYNC when you are doing > multiple WAL writes before an fdatasync and we normally don't do that. > I may be wrong on this, but my understanding is that the difference between fsync() and O_SYNC on the one hand and fdatasync() and O_DSYNC on the other hand is that the latter don't have to sync metadata (e.g. file access times), which saves a write to the inode that is more or less guaranteed to require an extra seek. Iff this is true, you never want to use fsync() or O_SYNC when fdatasync() and O_DSYNC are available (unless you really need the metadata to be synced too). _ Mats Lofkvist mal@algonet.se
Mats Lofkvist wrote: > > Actually, no, we are only O_SYNC'ing the WAL writes and sometimes that > > is faster because you are not writing then fsyncing, you are just > > writing. The fdatasync only is better than O_SYNC when you are doing > > multiple WAL writes before an fdatasync and we normally don't do that. > > > > I may be wrong on this, but my understanding is that the difference > between fsync() and O_SYNC on the one hand and fdatasync() and O_DSYNC > on the other hand is that the latter don't have to sync metadata > (e.g. file access times) which saves a write to the inode that is > more or less guarantied to require an extra seek. > > Iff this is true you never want to use fsync() or O_SYNC when > fdatasync() and O_DSYNC is available (unless you really need the > metadata to be synced too). Yes, I didn't mention O_DSYNC. It is in the cards. If you are interested, look at the code and how the defaults are chosen. postgresql.conf says:
#wal_sync_method = fsync   # the default varies across platforms:
#                          # fsync, fdatasync, open_sync, or open_datasync
Which means exactly that: it varies based on the platform. -- Bruce Momjian
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote in message news:200209101540.g8AFeOo21086@candle.pha.pa.us... > > The original poster was wrong about the default. > > We use fdatasync where available, and fsync when it is not. We also use > O_SYNC on open if it is available. The original poster is me! :-) I was pointed to a document (which I can't find anymore) from which I understood that the default was fsync. I'm sorry for the little flame.... :-( PS. Are there any plans to use raw disks? Ciao Gaetano
Gaetano Mendola wrote: > > "Bruce Momjian" <pgman@candle.pha.pa.us> wrote in message > news:200209101540.g8AFeOo21086@candle.pha.pa.us... > > > > The original poster was wrong about the default. > > > > We use fdatasync where available, and fsync when it is not. We also use > > O_SYNC on open if it is available. > > > The original poster is me! :-) > > I was pointed to a document ( that I don't find anymore ) > where I understood that the default was fsync. > I'm sorry for the little flame.... :-( > > > PS. It is in the plans to use raw disks ? No plans. Raw disk is only marginally faster and a lot more complicated. See the TODO performance link for details. -- Bruce Momjian