Thread: Maximum transaction rate
I'm using postgresql 8.3.6 through JDBC, and trying to measure the maximum transaction rate on a given Linux box. I wrote a test program that: - Creates a table with two int columns and no indexes, - loads the table through a configurable number of threads, with each transaction writing one row and then committing, (auto commit is false), and - reports transactions/sec. The postgres configuration regarding syncing is standard: fsync = on, synchronous_commit = on, wal_sync_method = fsync. My linux kernel is 2.6.27.19-78.2.30.fc9.i686. The transaction rates I'm getting seem way too high: 2800-2900 with one thread, 5000-7000 with ten threads. I'm guessing that writes aren't really reaching the disk. Can someone suggest how to figure out where, below postgres, someone is lying about writes reaching the disk? Jack
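A quick way to sanity-check this from outside PostgreSQL - a minimal sketch only, with an arbitrary file name and loop count, not the JDBC test described above - is to time a bare write()/fsync() loop on the same filesystem. If the reported rate is far above the drive's rotations per second, something below the filesystem is absorbing the flushes:
==========================================================
/*
** fsync_rate.c - rough fsyncs-per-second probe (sketch only).
** Build with: cc -o fsync_rate fsync_rate.c
*/
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: fsync_rate <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) { perror("open"); exit(1); }

    const int loops = 1000;
    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < loops; i++) {
        char byte = 0;
        /* rewrite the same byte and force it out, like a tiny commit */
        if (pwrite(fd, &byte, 1, 0) != 1) { perror("pwrite"); exit(1); }
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    /* a 7200 RPM disk that really waits for the platters can only do ~120/sec */
    printf("%.0f fsyncs/sec\n", loops / secs);
    return 0;
}
==========================================================
On hardware that honors fsync, the result should be near the disk's rotation rate; thousands per second on a single plain disk means some layer is caching the flush.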
Jack Orenstein <jack.orenstein@hds.com> writes: > The transaction rates I'm getting seem way too high: 2800-2900 with > one thread, 5000-7000 with ten threads. I'm guessing that writes > aren't really reaching the disk. Can someone suggest how to figure out > where, below postgres, someone is lying about writes reaching the > disk? AFAIK there are two trouble sources in recent Linux machines: LVM and the disk drive itself. LVM is apparently broken by design --- it simply fails to pass fsync requests. If you're using it you have to stop. (Which sucks, because it's exactly the kind of thing DBAs tend to want.) Otherwise you need to reconfigure your drive to not cache writes. I forget the incantation for that but it's in the PG list archives. regards, tom lane
On Fri, 6 Mar 2009, Tom Lane wrote: > Otherwise you need to reconfigure your drive to not cache writes. > I forget the incantation for that but it's in the PG list archives. There's a discussion of this in the docs now, http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html hdparm -I lets you check if write caching is on, hdparm -W lets you toggle it off. That's for ATA disks; SCSI ones can use sdparm instead, but usually it's something you can adjust more permanently in the card configuration or BIOS for those. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 6 Mar 2009, Greg Smith wrote: > On Fri, 6 Mar 2009, Tom Lane wrote: > >> Otherwise you need to reconfigure your drive to not cache writes. >> I forget the incantation for that but it's in the PG list archives. > > There's a dicussion of this in the docs now, > http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html How does turning off write caching on the disk stop the problem with LVM? It still seems like you have to get the data out of the OS buffer, and if fsync() doesn't do that for you....
On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote: > On Fri, 6 Mar 2009, Greg Smith wrote: > >> On Fri, 6 Mar 2009, Tom Lane wrote: >> >>> Otherwise you need to reconfigure your drive to not cache writes. >>> I forget the incantation for that but it's in the PG list archives. >> >> There's a dicussion of this in the docs now, >> http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html > > How does turning off write caching on the disk stop the problem with LVM? It > still seems like you have to get the data out of the OS buffer, and if > fsync() doesn't do that for you.... I think he was saying otherwise (if you're not using LVM and you still have this super high transaction rate) you'll need to turn off the drive's write caches. I kinda wondered at it for a second too.
On Fri, 6 Mar 2009, Ben Chobot wrote: > How does turning off write caching on the disk stop the problem with LVM? It doesn't. Linux LVM is awful and broken, I was just suggesting more details on what you still need to check even when it's not involved. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Scott Marlowe wrote: > On Fri, Mar 6, 2009 at 2:22 PM, Ben Chobot <bench@silentmedia.com> wrote: >> On Fri, 6 Mar 2009, Greg Smith wrote: >> >>> On Fri, 6 Mar 2009, Tom Lane wrote: >>> >>>> Otherwise you need to reconfigure your drive to not cache writes. >>>> I forget the incantation for that but it's in the PG list archives. >>> There's a dicussion of this in the docs now, >>> http://www.postgresql.org/docs/8.3/interactive/wal-reliability.html >> How does turning off write caching on the disk stop the problem with LVM? It >> still seems like you have to get the data out of the OS buffer, and if >> fsync() doesn't do that for you.... > > I think he was saying otherwise (if you're not using LVM and you still > have this super high transaction rate) you'll need to turn off the > drive's write caches. I kinda wondered at it for a second too. > And I'm still wondering. The problem with LVM, AFAIK, is missing support for write barriers. Once you disable the write-back cache on the disk, you no longer need write barriers. So I'm missing something, what else does LVM do to break fsync()? It was my understanding that disabling disk caches was enough. .TM.
Marco Colombo <pgsql@esiway.net> writes: > And I'm still wondering. The problem with LVM, AFAIK, is missing support > for write barriers. Once you disable the write-back cache on the disk, > you no longer need write barriers. So I'm missing something, what else > does LVM do to break fsync()? I think you're imagining that the disk hardware is the only source of write reordering, which isn't the case ... various layers in the kernel can reorder operations before they get sent to the disk. regards, tom lane
Tom Lane wrote: > Marco Colombo <pgsql@esiway.net> writes: >> And I'm still wondering. The problem with LVM, AFAIK, is missing support >> for write barriers. Once you disable the write-back cache on the disk, >> you no longer need write barriers. So I'm missing something, what else >> does LVM do to break fsync()? > > I think you're imagining that the disk hardware is the only source of > write reordering, which isn't the case ... various layers in the kernel > can reorder operations before they get sent to the disk. > > regards, tom lane You mean some layer (LVM) is lying about the fsync()?
write(A); fsync(); ... write(B); fsync(); ... write(C); fsync();
You mean that the process may be awakened after the first fsync() while A is still somewhere in OS buffers and not sent to disk yet, so it's possible that B gets to the disk BEFORE A. And if the system crashes, A never hits the platters while B (possibly) does. Is this what you mean by "write reordering"? But doesn't this break any application with transactional-like behavior, such as sendmail? The problem being third parties: if sendmail declares "ok, I saved the message" (*after* a fsync()) to the SMTP client, it's actually lying 'cause the message hasn't hit the platters yet. The same applies to an IMAP/POP server, say. Well, it applies to anything using fsync(). I mean, all this with disk caches in write-thru mode? It's the OS lying, not the disks? Wait, this breaks all journaled FSes as well: a DM device is just a block device to them, and if it's lying about synchronous writes the whole purpose of the journal is defeated... I find it hard to believe, I have to say. .TM.
Marco Colombo <pgsql@esiway.net> writes: > You mean some layer (LVM) is lying about the fsync()? Got it in one. regards, tom lane
On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: > Marco Colombo <pgsql@esiway.net> writes: > > You mean some layer (LVM) is lying about the fsync()? > > Got it in one. > I wouldn't think this would be a problem with the proper battery backed raid controller correct? Joshua D. Drake > regards, tom lane > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 13 Mar 2009, Joshua D. Drake wrote: > On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: >> Marco Colombo <pgsql@esiway.net> writes: >>> You mean some layer (LVM) is lying about the fsync()? >> >> Got it in one. >> > > I wouldn't think this would be a problem with the proper battery backed > raid controller correct? It seems to me that all you get with a BBU-enabled card is the ability to get bursts of writes out of the OS faster. So you still have the problem, it's just less likely to be encountered.
On Fri, 2009-03-13 at 11:17 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > > > On Fri, 2009-03-13 at 14:00 -0400, Tom Lane wrote: > >> Marco Colombo <pgsql@esiway.net> writes: > >>> You mean some layer (LVM) is lying about the fsync()? > >> > >> Got it in one. > >> > > > > I wouldn't think this would be a problem with the proper battery backed > > raid controller correct? > > It seems to me that all you get with a BBU-enabled card is the ability to > get burts of writes out of the OS faster. So you still have the problem, > it's just less like to be encountered. A BBU controller is about more than that. It is also supposed to be about data integrity. The ability to have unexpected outages and have the drives stay consistent because the controller remembers the state (if that is a reasonable way to put it). Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 13 Mar 2009, Joshua D. Drake wrote: >> It seems to me that all you get with a BBU-enabled card is the ability to >> get burts of writes out of the OS faster. So you still have the problem, >> it's just less like to be encountered. > > A BBU controller is about more than that. It is also supposed to be > about data integrity. The ability to have unexpected outages and have > the drives stay consistent because the controller remembers the state > (if that is a reasonable way to put it). Of course. But if you can't reliably flush the OS buffers (because, say, you're using LVM so fsync() doesn't work), then you can't say what actually has made it to the safety of the raid card.
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > Of course. But if you can't reliably flush the OS buffers (because, say, > you're using LVM so fsync() doesn't work), then you can't say what > actually has made it to the safety of the raid card. Good point. So the next question of course is, does EVMS do it right? http://evms.sourceforge.net/ This is actually a pretty significant issue. Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Fri, 2009-03-13 at 11:41 -0700, Ben Chobot wrote: > On Fri, 13 Mar 2009, Joshua D. Drake wrote: > > >> It seems to me that all you get with a BBU-enabled card is the ability to > >> get burts of writes out of the OS faster. So you still have the problem, > >> it's just less like to be encountered. > > > > A BBU controller is about more than that. It is also supposed to be > > about data integrity. The ability to have unexpected outages and have > > the drives stay consistent because the controller remembers the state > > (if that is a reasonable way to put it). > > Of course. But if you can't reliably flush the OS buffers (because, say, > you're using LVM so fsync() doesn't work), then you can't say what > actually has made it to the safety of the raid card. Wait, actually a good BBU RAID controller will disable the cache on the drives. So everything that is cached is already on the controller vs. the drives itself. Or am I missing something? Joshua D. Drake > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote: > Wait, actually a good BBU RAID controller will disable the cache on > the > drives. So everything that is cached is already on the controller vs. > the drives itself. > > Or am I missing something? Maybe I'm missing something, but a BBU controller moves the "safe point" from the platters to the controller, but it doesn't move it all the way into the OS. So, if the software calls fsync, but fsync doesn't actually push the data to the controller, you are still at risk... right?
On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote: > > On Mar 13, 2009, at 11:59 AM, Joshua D. Drake wrote: >> >> Wait, actually a good BBU RAID controller will disable the cache on the >> drives. So everything that is cached is already on the controller vs. >> the drives itself. >> >> Or am I missing something? > > Maybe I'm missing something, but a BBU controller moves the "safe point" > from the platters to the controller, but it doesn't move it all the way into > the OS. > > So, if the software calls fsync, but fsync doesn't actually push the data to > the controller, you are still at risk... right? Ding!
Scott Marlowe wrote: > On Fri, Mar 13, 2009 at 1:09 PM, Christophe <xof@thebuild.com> wrote: >> So, if the software calls fsync, but fsync doesn't actually push the data to >> the controller, you are still at risk... right? > > Ding! > I've been doing some googling; now I'm not sure that not supporting barriers implies not supporting (or lying about) blkdev_issue_flush(). It seems that it's pretty common (and well-defined) for block devices to report -EOPNOTSUPP at BIO_RW_BARRIER requests. Device mapper apparently falls in this category. See: http://lkml.org/lkml/2007/5/25/71 - this is an interesting discussion on barriers and flushing. It seems to me that PostgreSQL needs both ordered and synchronous writes, maybe at different times (not that EVERY write must be both ordered and synchronous). You can emulate ordered writes with single, synchronous ones, although at a price. You can't emulate synchronous writes with just barriers.
OPTIMAL: write-barrier-write-barrier-write-barrier-flush
SUBOPTIMAL: write-flush-write-flush-write-flush
As I understand it, fsync() should always issue a real flush: it's unrelated to the barriers issue. There's no API to issue ordered writes (or barriers) at user level, AFAIK. (Uhm... O_DIRECT maybe implies that?) FS code may internally issue barrier requests to the block device, for its own purposes (e.g. journal updates), but there's no userland API for that. Yet, there's no reference to DM not supporting flush correctly in the whole thread... actually there are references to the opposite. DM devices are defined as FLUSHABLE. Also see: http://lkml.org/lkml/2008/2/26/41 - but it seems to me that all this discussion is under the assumption that disks have write-back caches. "The alternative is to disable the disk write cache." says it all. .TM.
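To put those two patterns in userland terms - this is only a sketch, and the barrier() call mentioned in the comment is deliberately fictional, since no such system call is exposed to applications - the SUBOPTIMAL pattern is the only portable one available today:
==========================================================
/* ordered_writes.c - sketch: ordering records from userland today */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* SUBOPTIMAL but possible: every record pays for a full cache flush,
 * even when all we need is "don't reorder me with the next record". */
static int append_record(int fd, const char *rec)
{
    if (write(fd, rec, strlen(rec)) != (ssize_t) strlen(rec))
        return -1;
    return fsync(fd);   /* flush = ordering + durability; no cheaper call exists */
}

int main(void)
{
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    /* OPTIMAL would be: write A; barrier(); write B; barrier(); write C; fsync();
     * but there is no barrier() visible to applications, so we do this instead: */
    if (append_record(fd, "record A\n") != 0) return 1;
    if (append_record(fd, "record B\n") != 0) return 1;
    if (append_record(fd, "record C\n") != 0) return 1;

    close(fd);
    return 0;
}
==========================================================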
On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: > Scott Marlowe wrote: > Also see: > http://lkml.org/lkml/2008/2/26/41 > but it seems to me that all this discussion is under the assuption that > disks have write-back caches. > "The alternative is to disable the disk write cache." says it all. If this applies to RAID-based caches as well, then performance is going to completely tank for users of Linux + PostgreSQL using LVM. Joshua D. Drake > > .TM. > -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Joshua D. Drake wrote: > On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: >> Scott Marlowe wrote: > >> Also see: >> http://lkml.org/lkml/2008/2/26/41 >> but it seems to me that all this discussion is under the assuption that >> disks have write-back caches. >> "The alternative is to disable the disk write cache." says it all. > > If this applies to raid based cache as well then performance is going to > completely tank. For users of Linux + PostgreSQL using LVM. > > Joshua D. Drake Yet that's not the point. The point is safety. I may have a lightly loaded database, with low write rate, but still I want it to be reliable. I just want to know if disabling the caches makes it reliable or not. People on LK seem to think it does. And it seems to me they may have a point. fsync() is a flush operation on the block device, not a write barrier. LVM doesn't pass write barriers down, but that doesn't mean it doesn't perform a flush when requested to. .TM.
On Sun, 2009-03-15 at 01:48 +0100, Marco Colombo wrote: > Joshua D. Drake wrote: > > On Sat, 2009-03-14 at 05:25 +0100, Marco Colombo wrote: > >> Scott Marlowe wrote: > > > >> Also see: > >> http://lkml.org/lkml/2008/2/26/41 > >> but it seems to me that all this discussion is under the assuption that > >> disks have write-back caches. > >> "The alternative is to disable the disk write cache." says it all. > > > > If this applies to raid based cache as well then performance is going to > > completely tank. For users of Linux + PostgreSQL using LVM. > > > > Joshua D. Drake > > Yet that's not the point. The point is safety. I may have a lightly loaded > database, with low write rate, but still I want it to be reliable. I just > want to know if disabling the caches makes it reliable or not. I understand but disabling cache is not an option for anyone I know. So I need to know the other :) Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Joshua D. Drake wrote: > > I understand but disabling cache is not an option for anyone I know. So > I need to know the other :) > > Joshua D. Drake > Come on, how many people/organizations do you know who really need 30+ MB/s sustained write throughput in the disk subsystem but can't afford a battery-backed controller at the same time? Something must take care of writing the data in the disk cache to permanent storage; write-thru caches, battery-backed controllers, and write barriers are all alternatives; choose the one you like most. The problem here is fsync(). We know that not fsync()'ing gives you a big performance boost, but that's not the point. I want to choose, and I want a true fsync() when I ask for one. Because if the data doesn't even make it to the disk cache, the whole point about wt, bb and wb is moot. .TM.
Tom Lane wrote: > Jack Orenstein <jack.orenstein@hds.com> writes: >> The transaction rates I'm getting seem way too high: 2800-2900 with >> one thread, 5000-7000 with ten threads. I'm guessing that writes >> aren't really reaching the disk. Can someone suggest how to figure out >> where, below postgres, someone is lying about writes reaching the >> disk? > > AFAIK there are two trouble sources in recent Linux machines: LVM and > the disk drive itself. LVM is apparently broken by design --- it simply > fails to pass fsync requests. If you're using it you have to stop. > (Which sucks, because it's exactly the kind of thing DBAs tend to want.) > Otherwise you need to reconfigure your drive to not cache writes. > I forget the incantation for that but it's in the PG list archives. Hmm, are you sure this is what is happening? In my understanding LVM is not passing down barriers (generally - it seems to in some limited circumstances), which means it is not safe on any storage drive that has write cache enabled. This seems to be the very same issue Linux had for ages before ext3 got barrier support (not sure if even today all filesystems have that). So in my understanding LVM is safe on disks that have write cache disabled or "behave" as one (like a controller with a battery backed cache). For storage with write caches it seems to be unsafe, even if the filesystem supports barriers and it has them enabled (which I don't think all have), which is basically where all of Linux was not too long ago. Stefan
On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > So in my understanding LVM is safe on disks that have write cache disabled > or "behave" as one (like a controller with a battery backed cache). > For storage with write caches it seems to be unsafe, even if the filesystem > supports barriers and it has them enabled (which I don't think all have) > which is basically what all of linux was not too long ago. I definitely didn't have this problem with SCSI drives directly attached to a machine under pgsql on ext2 back in the day (way back, like 5 to 10 years ago). IDE / PATA drives, on the other hand, definitely suffered with having write caches enabled.
Stefan Kaltenbrunner wrote: > So in my understanding LVM is safe on disks that have write cache > disabled or "behave" as one (like a controller with a battery backed > cache). What about drive write caches on battery backed raid controllers? Do the controllers ensure the drive cache gets flushed prior to releasing the cached write blocks?
Scott Marlowe wrote: > On Mon, Mar 16, 2009 at 2:03 PM, Stefan Kaltenbrunner > <stefan@kaltenbrunner.cc> wrote: >> So in my understanding LVM is safe on disks that have write cache disabled >> or "behave" as one (like a controller with a battery backed cache). >> For storage with write caches it seems to be unsafe, even if the filesystem >> supports barriers and it has them enabled (which I don't think all have) >> which is basically what all of linux was not too long ago. > > I definitely didn't have this problem with SCSI drives directly > attached to a machine under pgsql on ext2 back in the day (way back, > like 5 to 10 years ago). IDE / PATA drives, on the other hand, > definitely suffered with having write caches enabled. I guess that's likely because most SCSI drives (at least back in the day) had write caches turned off by default (whereas IDE drives had them turned on). The Linux kernel docs actually have some stuff on the barrier implementation ( http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=Documentation/block/barrier.txt;hb=HEAD) which seems to explain some of the issues related to that. Stefan
John R Pierce wrote: > Stefan Kaltenbrunner wrote: >> So in my understanding LVM is safe on disks that have write cache >> disabled or "behave" as one (like a controller with a battery backed >> cache). > > what about drive write caches on battery backed raid controllers? do > the controllers ensure the drive cache gets flushed prior to releasing > the cached write blocks ? If LVM/dm is lying about fsync(), all this is moot. There's no point talking about disk caches. BTW, this discussion is continuing on the linux-lvm mailing list: https://www.redhat.com/archives/linux-lvm/2009-March/msg00025.html I have some PG databases on LVM systems, so I need to know for sure whether I have to move them elsewhere. It seemed to me the right place for asking about the issue. Someone there pointed out that fsync() is not LVM's responsibility. Correct. For sure, there's an API (or more than one) a filesystem uses to force a flush on the underlying block device, and for sure it has to be called while inside the fsync() system call. So "lying to fsync()" is maybe more correct than "lying about fsync()". .TM.
On Tue, 17 Mar 2009, Marco Colombo wrote: > If LVM/dm is lying about fsync(), all this is moot. There's no point > talking about disk caches. I decided to run some tests to see what's going on there, and it looks like some of my quick criticism of LVM might not actually be valid--it's only the performance that is problematic, not necessarily the reliability. Appears to support fsync just fine. I tested with kernel 2.6.22, so certainly not before the recent changes to LVM behavior improving this area, but with the bugs around here from earlier kernels squashed (like crummy HPA support circa 2.6.18-2.6.19, see https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 ) You can do a quick test of fsync rate using sysbench; got the idea from http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ (their command has some typos, fixed one below) If fsync is working properly, you'll get something near the RPM rate of the disk. If it's lying, you'll see a much higher number. I couldn't get the current sysbench-0.4.11 to compile (bunch of X complaints from libtool), but the old 0.4.8 I had around still works fine. Let's start with a regular ext3 volume. Here's what I see against a 7200 RPM disk (=120 rotations/second) with the default caching turned on:
$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep \"Requests/sec\""
$ fsynctest
6469.36 Requests/sec executed
That's clearly lying as expected (and I ran all these a couple of times, just reporting one for brevity's sake; snipped some other redundant stuff too). I followed the suggestions at http://www.postgresql.org/docs/current/static/wal-reliability.html to turn off the cache and tested again:
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
* Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf
/dev/sdf: setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
Write cache
$ fsynctest
106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6469.36 Requests/sec executed
Great: I was expecting ~120 commits/sec from a 7200 RPM disk, and that's what I get when caching is off. Now, let's switch to using an LVM volume on a different partition of that disk, and run the same test to see if anything changes.
$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
6499.11 Requests/sec executed
Based on this test, it looks to me like fsync works fine on LVM. It must be passing that down to the physical disk correctly or I'd still be seeing inflated rates. If you've got a physical disk that lies about fsync, and you put a database on it, you're screwed whether or not you use LVM; nothing different on LVM than in the regular case. A battery-backed caching controller should also handle fsync fine if it turns off the physical disk cache, which most of them do--and, again, you're no more or less exposed to that particular problem with LVM than a regular filesystem. The thing that barriers help with is that they make it possible to optimize flushing ext3 journal metadata when combined with hard drives that support the appropriate cache flushing mechanism (what hdparm calls "FLUSH CACHE EXT"; see http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html ).
That way you can prioritize flushing just the metadata needed to prevent filesystem corruption while still fully caching less critical regular old writes. In that situation, performance could be greatly improved over turning off caching altogether. However, in the PostgreSQL case, the fsync hammer doesn't appreciate this optimization anyway--all the database writes are going to get forced out by that no matter what before the database considers them reliable. Proper barriers support might be helpful in the case where you're using a database on a shared disk that has other files being written to as well, basically allowing caching on those while forcing the database blocks to physical disk, but that presumes the Linux fsync implementation is more sophisticated than I believe it currently is. Far as I can tell, the main open question I didn't directly test here is whether LVM does any write reordering that can impact database use because it doesn't handle write barriers properly. According to https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does not, and I never got the impression that was impacted by the LVM layer before. The concern is nicely summarized by the comment from Xman at http://lwn.net/Articles/283161/ : "fsync will block until the outstanding requests have been sync'd do disk, but it doesn't guarantee that subsequent I/O's to the same fd won't potentially also get completed, and potentially ahead of the I/O's submitted prior to the fsync. In fact it can't make such guarantees without functioning barriers." Since we know LVM does not have functioning barriers, this would seem to be one area where PostgreSQL would be vulnerable. But since ext3 doesn't have barriers turned on by default either (except some recent SuSE systems), it's not unique to an LVM setup, and if this were really a problem it would be nailing people everywhere. I believe the WAL design handles this situation. There are some known limitations to Linux fsync that I remain somewhat concerned about, independently of LVM, like "ext3 fsync() only does a journal commit when the inode has changed" (see http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The way files are preallocated, the PostgreSQL WAL is supposed to function just fine even if you're using fdatasync after WAL writes, which also wouldn't touch the journal (last time I checked fdatasync was implemented as a full fsync on Linux). Since the new ext4 is more aggressive at delaying writes than ext3, it will be interesting to see if that uncovers some subtle race conditions here that have been lying dormant so far. I leave it as an exercise to the dedicated reader to modify the sysbench test to use O_SYNC/O_DIRECT in order to re-test LVM for the situation if you changed wal_sync_method=open_sync; how to do that is mentioned briefly at http://sysbench.sourceforge.net/docs/ -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > There are some known limitations to Linux fsync that I remain somewhat > concerned about, independantly of LVM, like "ext3 fsync() only does a > journal commit when the inode has changed" (see > http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The > way files are preallocated, the PostgreSQL WAL is supposed to function > just fine even if you're using fdatasync after WAL writes, which also > wouldn't touch the journal (last time I checked fdatasync was > implemented as a full fsync on Linux). Since the new ext4 is more Indeed it does. I wonder if there should be an optional fsync mode in postgres that would turn fsync() into fchmod (fd, 0644); fchmod (fd, 0664); to work around this issue. For example, the program below will show one write per disk revolution if you leave the fchmod() calls in there, and run many times faster (i.e. lying) if you remove them. This is with ext3 on a standard IDE drive with the write cache enabled, and no LVM or anything between them.
==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("usage: fs <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    int i;
    for (i = 0; i < 100; i++) {
        char byte = 0;
        pwrite(fd, &byte, 1, 0);
        /* dirty the inode so ext3's fsync is forced into a journal commit */
        fchmod(fd, 0644); fchmod(fd, 0664);
        fsync(fd);
    }
    return 0;
}
==========================================================
On Tue, 17 Mar 2009, Ron Mayer wrote: > I wonder if there should be an optional fsync mode > in postgres should turn fsync() into > fchmod (fd, 0644); fchmod (fd, 0664); > to work around this issue. The test I haven't had time to run yet is to turn the bug exposing program you were fiddling with into a more accurate representation of WAL activity, to see if that chmod still changes the behavior there. I think the most dangerous possibility here is if you create a new WAL segment and immediately fill it, all in less than a second. Basically, what XLogFileInit does: -Open with O_RDWR | O_CREAT | O_EXCL -Write XLogSegSize (16MB) worth of zeros -fsync Followed by simulating what XLogWrite would do if you fed it enough data to force a segment change: -Write a new 16MB worth of data -fsync If you did all that in under a second, would you still get a filesystem flush each time? From the description of the problem I'm not so sure anymore. I think that's how tight the window would have to be for this issue to show up right now, you'd only be exposed if you filled a new WAL segment faster than the associated journal commit happened (basically, a crash when WAL write volume >16MB/s in a situation where new segments are being created). But from what I've read about ext4 I think that window for mayhem might widen on that filesystem--that's what got me reading up on this whole subject recently, before this thread even started. The other ameliorating factor here is that in order for this to bite you, I think you'd need to have another, incorrectly ordered write somewhere else that could happen before the delayed write. Not sure where that might be possible in the PostgreSQL WAL implementation yet. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
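A rough sketch of that two-step sequence - the file name and the hard-coded 16MB size are just stand-ins, and no real WAL record formatting is attempted - might look like this:
==========================================================
/* wal_fill_test.c - mimic XLogFileInit followed by an immediate segment fill */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEGSIZE (16 * 1024 * 1024)      /* stand-in for XLogSegSize */

static char buf[SEGSIZE];

int main(void)
{
    /* step 1: create the segment the way XLogFileInit does and zero-fill it */
    int fd = open("fake_wal_segment", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0, sizeof buf);
    if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf) { perror("write"); return 1; }
    if (fsync(fd) != 0) { perror("fsync (init)"); return 1; }

    /* step 2: immediately overwrite the whole segment, as a fast WAL writer would */
    memset(buf, 'x', sizeof buf);
    if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) { perror("pwrite"); return 1; }
    if (fsync(fd) != 0) { perror("fsync (fill)"); return 1; }

    puts("segment created and refilled; both fsyncs returned");
    close(fd);
    return 0;
}
==========================================================
The interesting question is whether both fsyncs still turn into real flushes when the whole thing completes inside the filesystem's journal commit interval.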
Greg Smith wrote: > On Tue, 17 Mar 2009, Marco Colombo wrote: > >> If LVM/dm is lying about fsync(), all this is moot. There's no point >> talking about disk caches. > > I decided to run some tests to see what's going on there, and it looks > like some of my quick criticism of LVM might not actually be valid--it's > only the performance that is problematic, not necessarily the > reliability. Appears to support fsync just fine. I tested with kernel > 2.6.22, so certainly not before the recent changes to LVM behavior > improving this area, but with the bugs around here from earlier kernels > squashed (like crummy HPA support circa 2.6.18-2.6.19, see > https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 ) I've run tests too, you can seen them here: https://www.redhat.com/archives/linux-lvm/2009-March/msg00055.html in case you're looking for something trivial (write/fsync loop). > You can do a quick test of fsync rate using sysbench; got the idea from > http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/ > (their command has some typos, fixed one below) > > If fsync is working properly, you'll get something near the RPM rate of > the disk. If it's lying, you'll see a much higher number. Same results. -W1 gives x50 speedup, it must be waiting for something at disk level with -W0. [...] > Based on this test, it looks to me like fsync works fine on LVM. It > must be passing that down to the physical disk correctly or I'd still be > seeing inflated rates. If you've got a physical disk that lies about > fsync, and you put a database on it, you're screwed whether or not you > use LVM; nothing different on LVM than in the regular case. A > battery-backed caching controller should also handle fsync fine if it > turns off the physical disk cache, which most of them do--and, again, > you're no more or less exposed to that particular problem with LVM than > a regular filesystem. That was my initial understanding. > The thing that barriers helps out with is that it makes it possible to > optimize flushing ext3 journal metadata when combined with hard drives > that support the appropriate cache flushing mechanism (what hdparm calls > "FLUSH CACHE EXT"; see > http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html > ). That way you can prioritize flushing just the metadata needed to > prevent filesystem corruption while still fully caching less critical > regular old writes. In that situation, performance could be greatly > improved over turning off caching altogether. However, in the > PostgreSQL case, the fsync hammer doesn't appreciate this optimization > anyway--all the database writes are going to get forced out by that no > matter what before the database considers them reliable. Proper > barriers support might be helpful in the case where you're using a > database on a shared disk that has other files being written to as well, > basically allowing caching on those while forcing the database blocks to > physical disk, but that presumes the Linux fsync implementation is more > sophisticated than I believe it currently is. This is the same conclusion I came to. Moreover, once you have barriers passed down to the disks, it would be nice to have a userland API to send them to the kernel. Any application managing a 'journal' or 'log' type of object, would benefit from that. 
I'm not familiar with PG internals, but it's likely you can have some records you just want to be ordered, and you can do something like write-barrier-write-barrier-...-fsync instead of write-fsync-write-fsync-... Currently fsync() (and friends, O_SYNC, fdatasync(), O_DSYNC) is the only way to enforce ordering on writes from userland. > Far as I can tell, the main open question I didn't directly test here is > whether LVM does any write reordering that can impact database use > because it doesn't handle write barriers properly. According to > https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it > does not, and I never got the impression that was impacted by the LVM > layer before. The concern is nicely summarized by the comment from Xman > at http://lwn.net/Articles/283161/ : > > "fsync will block until the outstanding requests have been sync'd do > disk, but it doesn't guarantee that subsequent I/O's to the same fd > won't potentially also get completed, and potentially ahead of the I/O's > submitted prior to the fsync. In fact it can't make such guarantees > without functioning barriers." Sure, but from userland you can't set barriers. If you fsync() after each write you want ordered, there can't be any "subsequent I/O" (unless there are many different processes concurrently writing to the file w/o synchronization). > Since we know LVM does not have functioning barriers, this would seem to > be one area where PostgreSQL would be vulnerable. But since ext3 > doesn't have barriers turned by default either (except some recent SuSE > system), it's not unique to a LVM setup, and if this were really a > problem it would be nailing people everywhere. I believe the WAL design > handles this situation. Well well. Ext3 is definitely in the lucky area. The journal on most ext3 instances is contiguous on disk. The disk won't reorder requests, simply because they are already ordered... only when the journal wraps around is there an (extremely) small window of vulnerability. You need to write a carefully crafted torture program to get any chance of observing that... such a program exists, and it triggers the problem (leaving an inconsistent fs) almost 50% of the time. But it's extremely unlikely you can see it happen in real workloads. http://lwn.net/Articles/283168/ .TM.
Ron Mayer wrote: > Greg Smith wrote: >> There are some known limitations to Linux fsync that I remain somewhat >> concerned about, independantly of LVM, like "ext3 fsync() only does a >> journal commit when the inode has changed" (see >> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The >> way files are preallocated, the PostgreSQL WAL is supposed to function >> just fine even if you're using fdatasync after WAL writes, which also >> wouldn't touch the journal (last time I checked fdatasync was >> implemented as a full fsync on Linux). Since the new ext4 is more > > Indeed it does. > > I wonder if there should be an optional fsync mode > in postgres should turn fsync() into > fchmod (fd, 0644); fchmod (fd, 0664); > to work around this issue. Question is... why do you care if the journal is not flushed on fsync? Only the file data blocks need to be, if the inode is unchanged. > For example this program below will show one write > per disk revolution if you leave the fchmod() in there, > and run many times faster (i.e. lying) if you remove it. > This with ext3 on a standard IDE drive with the write > cache enabled, and no LVM or anything between them. > > ========================================================== > /* > ** based on http://article.gmane.org/gmane.linux.file-systems/21373 > ** http://thread.gmane.org/gmane.linux.kernel/646040 > */ > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <unistd.h> > #include <stdio.h> > #include <stdlib.h> > > int main(int argc,char *argv[]) { > if (argc<2) { > printf("usage: fs <filename>\n"); > exit(1); > } > int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666); > int i; > for (i=0;i<100;i++) { > char byte; > pwrite (fd, &byte, 1, 0); > fchmod (fd, 0644); fchmod (fd, 0664); > fsync (fd); > } > } > ========================================================== > I ran the program above, w/o the fchmod()s. $ time ./test2 testfile real 0m0.056s user 0m0.001s sys 0m0.008s This is with ext3+LVM+raid1+sata disks with hdparm -W1. With -W0 I get: $ time ./test2 testfile real 0m1.014s user 0m0.000s sys 0m0.008s Big difference. The fsync() there does its job. The same program runs with a x3 slowdown with the fsyncs, but that's expected, it's doing twice the writes, and in different places. .TM.
On Wed, 18 Mar 2009, Marco Colombo wrote: > If you fsync() after each write you want ordered, there can't be any > "subsequent I/O" (unless there are many different processes cuncurrently > writing to the file w/o synchronization). Inside PostgreSQL, each of the database backend processes ends up writing blocks to the database disk, if they need to allocate a new buffer and the one they are handed is dirty. You can easily have several of those writing to the same 1GB underlying file on disk. So that prerequisite is there. The main potential for a problem here would be if a stray unsynchronized write from one of those backends happened in a way that wasn't accounted for by the WAL+checkpoint design. What I was suggesting is that the way that synchronization happens in the database provides some defense from running into problems in this area. The way backends handle writes themselves is also why your suggestion about the database being able to utilize barriers isn't really helpful. Those trickle out all the time, and normally you don't even have to care about ordering them. The only time you do need to care is at checkpoint time, and there only a hard line is really practical--all writes up to that point, period. Trying to implement ordered writes for everything that happened before then would complicate the code base, which isn't going to happen for such a platform+filesystem specific feature, one that really doesn't offer much acceleration from the database's perspective. > only when the journal wraps around there's a (extremely) small window of > vulnerability. You need to write a careful crafted torture program to > get any chance to observe that... such program exists, and triggers the > problem Yeah, I've been following all that. The PostgreSQL WAL design works on ext2 filesystems with no journal at all. Some people even put their pg_xlog directory onto ext2 filesystems for best performance, relying on the WAL to be the journal. As long as fsync is honored correctly, the WAL writes should be re-writing already allocated space, which makes this category of journal mayhem not so much of a problem. But when I read about fsync doing unexpected things, that gets me more concerned. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Marco Colombo wrote: > Ron Mayer wrote: >> Greg Smith wrote: >>> There are some known limitations to Linux fsync that I remain somewhat >>> concerned about, independantly of LVM, like "ext3 fsync() only does a >>> journal commit when the inode has changed" (see >>> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).... >> I wonder if there should be an optional fsync mode >> in postgres should turn fsync() into >> fchmod (fd, 0644); fchmod (fd, 0664); 'course I meant: "fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);" >> to work around this issue. > Question is... why do you care if the journal is not flushed on fsync? > Only the file data blocks need to be, if the inode is unchanged. You don't - but ext3 fsync won't even push the file data blocks through the disk cache unless the inode was changed. The point is that ext3 only does the "write barrier" processing that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) commands on inode changes, not data changes. And with no FLUSH CACHE or SYNCHRONIZE CACHE, the data blocks may sit in the disk's cache after the fsync() as well. PS: not sure if this is still true - last time I tested it was Nov 2006. Ron
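If such an optional mode were ever added, the workaround itself is tiny; a sketch (the function name is invented here, and the two mode values only need to differ so the inode gets dirtied before the flush):
==========================================================
#include <sys/stat.h>
#include <unistd.h>

/* fsync variant that dirties the inode first, so ext3 is pushed into a
 * journal commit (and hence a drive cache flush) even when only data
 * blocks changed.  Sketch only; no error handling on the fchmod calls. */
int fsync_inode_dirty(int fd)
{
    (void) fchmod(fd, 0644);    /* flip the mode bits ...                     */
    (void) fchmod(fd, 0664);    /* ... and flip them back: inode is now dirty */
    return fsync(fd);
}
==========================================================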
Greg Smith wrote: > On Wed, 18 Mar 2009, Marco Colombo wrote: > >> If you fsync() after each write you want ordered, there can't be any >> "subsequent I/O" (unless there are many different processes >> cuncurrently writing to the file w/o synchronization). > > Inside PostgreSQL, each of the database backend processes ends up > writing blocks to the database disk, if they need to allocate a new > buffer and the one they are handed is dirty. You can easily have > several of those writing to the same 1GB underlying file on disk. So > that prerequisite is there. The main potential for a problem here would > be if a stray unsynchronized write from one of those backends happened > in a way that wasn't accounted for by the WAL+checkpoint design. Wow, that would be quite a bug. That's why I wrote "w/o synchronization". "stray" + "unaccounted" + "concurrent" smells like the recipe for an explosive to me :) > What I > was suggesting is that the way that synchronization happens in the > database provides some defense from running into problems in this area. I hope it's "full defence". If you have two processes doing write(); fsync(); on the same file at the same time, either there are no order requirements, or it will go boom sooner or later... fsync() works inside a single process, but any system call may put the process to sleep, and who knows when it will be awakened and what other processes did to that file meanwhile. I'm pretty confident that PG code protects access to shared resources with synchronization primitives. Anyway, I was referring to WAL writes... due to the nature of a log, it's hard to think of many unordered writes and of concurrent access w/o synchronization. But inside a critical region there can be more than one single write, and you may need to enforce an order, but no more than that before the final fsync(). If so, userland-originated barriers instead of full fsync()'s may help with performance. But I'm speculating. > The way backends handle writes themselves is also why your suggestion > about the database being able to utilize barriers isn't really helpful. > Those trickle out all the time, and normally you don't even have to care > about ordering them. The only you do need to care, at checkpoint time, > only a hard line is really practical--all writes up to that point, > period. Trying to implement ordered writes for everything that happened > before then would complicate the code base, which isn't going to happen > for such a platform+filesystem specific feature, one that really doesn't > offer much acceleration from the database's perspective. I don't know the internals of WAL writing, so I can't really reply on that. >> only when the journal wraps around there's a (extremely) small window >> of vulnerability. You need to write a careful crafted torture program >> to get any chance to observe that... such program exists, and triggers >> the problem > > Yeah, I've been following all that. The PostgreSQL WAL design works on > ext2 filesystems with no journal at all. Some people even put their > pg_xlog directory onto ext2 filesystems for best performance, relying on > the WAL to be the journal. As long as fsync is honored correctly, the > WAL writes should be re-writing already allocated space, which makes > this category of journal mayhem not so much of a problem. But when I > read about fsync doing unexpected things, that gets me more concerned. Well, that's highly dependent on your expectations :) I don't expect an fsync to trigger a journal commit if metadata hasn't changed.
That's obviously true for metadata-only journals (like most of them, with the notable exception of ext3 in data=journal mode). Yet, if you're referring to this: http://article.gmane.org/gmane.linux.file-systems/21373 - well, that seems to me to be the same usual thing/bug: fsync() allows disks to lie when it comes to caching writes. Nothing new under the sun. Barriers don't change much, because they don't replace a flush. They're about consistency, not durability. So even with full barriers support, an fsync implementation needs to end up in a disk cache flush, to be fully compliant with its own semantics. .TM.
On Wed, Mar 18, 2009 at 10:58:39PM +0100, Marco Colombo wrote: > I hope it's "full defence". If you have two processes doing at the > same time write(); fsycn(); on the same file, either there are no order > requirements, or it will boom sooner or later... fsync() works inside > a single process, but any system call may put the process to sleep, and > who knows when it will be awakened and what other processes did to that > file meanwhile. I'm pretty confident that PG code protects access to > shared resources with synchronization primitives. Generally PG uses O_SYNC on open, so it's only one system call, not two. And the file it's writing to is generally preallocated (not always though). > Well, that's highly dependant on your expectations :) I don't expect > a fsync to trigger a journal commit, if metadata hasn't changed. That's > obviuosly true for metadata-only journals (like most of them, with > notable exceptions of ext3 in data=journal mode). Really the only thing needed is that the WAL entry reaches disk before the actual data does. AIUI as long as you have that the situation is recoverable. Given that the actual data probably won't be written for a while it'd need to go pretty wonky before you see an issue. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Ron Mayer wrote: > Marco Colombo wrote: >> Ron Mayer wrote: >>> Greg Smith wrote: >>>> There are some known limitations to Linux fsync that I remain somewhat >>>> concerned about, independantly of LVM, like "ext3 fsync() only does a >>>> journal commit when the inode has changed" (see >>>> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).... >>> I wonder if there should be an optional fsync mode >>> in postgres should turn fsync() into >>> fchmod (fd, 0644); fchmod (fd, 0664); > 'course I meant: "fchmod (fd, 0644); fchmod (fd, 0664); fsync(fd);" >>> to work around this issue. >> Question is... why do you care if the journal is not flushed on fsync? >> Only the file data blocks need to be, if the inode is unchanged. > > You don't - but ext3 fsync won't even push the file data blocks > through a disk cache unless the inode was changed. > > The point is that ext3 only does the "write barrier" processing > that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) > commands on inode changes, not data changes. And with no FLUSH > CACHE or SYNCHRONINZE IDE the data blocks may sit in the disks > cache after the fsync() as well. Yes, but we knew it already, didn't we? It's always been like that: with IDE disks and write-back cache enabled, fsync just waits for the disk to report completion, and disks lie about that. Write barriers enforce ordering: WHEN writes are committed to disk, they will be in order, but that doesn't mean NOW. Ordering is enough for a FS journal, the only requirement being consistency. Anyway, it's the block device's job to control disk caches. A filesystem is just a client to the block device: it posts a flush request, and what happens depends on the block device code. The FS doesn't talk to disks directly. And a write barrier is not a flush request, it is a "please do not reorder" request. On fsync(), ext3 issues a flush request to the block device; that's all it's expected to do. Now, some block devices may implement write barriers by issuing FLUSH commands to the disk, but that's another matter. A FS shouldn't rely on that. You can replace a barrier with a flush (not as efficiently), but not the other way around. If a block device driver issues FLUSH for a barrier, and doesn't issue a FLUSH for a flush, well, it's a buggy driver, IMHO. .TM.
On Wed, 18 Mar 2009, Martijn van Oosterhout wrote: > Generally PG uses O_SYNC on open Only if you change wal_sync_method=open_sync. That's the very last option PostgreSQL will try--only if none of the others are available will it use that. Last time I checked the default value for that parameter broke down like this by platform:
open_datasync (O_DSYNC): Solaris, Windows (I think there's a PG wrapper involved for Win32)
fdatasync: Linux (even though the OS just provides a fake wrapper around fsync for that call)
fsync_writethrough: Mac OS X
fsync: FreeBSD
That makes Solaris the only UNIX{-ish} OS where the default is a genuine sync write. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Martijn van Oosterhout wrote: > Generally PG uses O_SYNC on open, so it's only one system call, not > two. And the file it's writing to is generally preallocated (not > always though). It has to wait for I/O completion on write() then, so it has to go to sleep. If two different processes do a write(), you don't know which will be awakened first. Preallocation doesn't mean much here, since with O_SYNC you expect a physical write to be done (with the whole sleep/HW interrupt/SW interrupt/awake dance). It's true that you may expect the writes to be carried out in order, and that might be enough. I'm not sure tho. >> Well, that's highly dependant on your expectations :) I don't expect >> a fsync to trigger a journal commit, if metadata hasn't changed. That's >> obviuosly true for metadata-only journals (like most of them, with >> notable exceptions of ext3 in data=journal mode). > > Really the only thing needed is that the WAL entry reaches disk before > the actual data does. AIUI as long as you have that the situation is > recoverable. Given that the actual data probably won't be written for a > while it'd need to go pretty wonky before you see an issue. You're giving up Durability here. In a closed system, that doesn't mean much, but when you report "payment accepted" to third parties, you can't forget about it later. The requirement you stated is for Consistency only. That's what a journaled FS cares about, i.e. no need for fsck (internal consistency checks) after a crash. It may be acceptable for a remote standby backup: you replay as much of the WAL as is available after the crash (the part you managed to copy, that is). But you know there can be lost transactions. It may be acceptable or not. Sometimes it's not. Sometimes you must be sure the data is on the platters before you report "committed". Sometimes when you say "fsync!" you mean "I want data flushed to disk NOW, and I really mean it!". :) .TM.
Hello, As a continued follow-up to this thread, Tim Post replied on the LVM list to this effect: " If a logical volume spans physical devices where write caching is enabled, the results of fsync() can not be trusted. This is an issue with device mapper, lvm is one of a few possible customers of DM. Now it gets interesting: Enter virtualization. When you have something like this: fsync -> guest block device -> block tap driver -> CLVM -> iscsi -> storage -> physical disk. Even if device mapper passed along the write barrier, would it be reliable? Is every part of that chain going to pass the same along, and how many opportunities for re-ordering are presented in the above? So, even if its fixed in DM, can fsync() still be trusted? I think, at the least, more testing should be done with various configurations even after a suitable patch to DM is merged. What about PGSQL users using some kind of elastic hosting? Given the craze in 'cloud' technology, its an important question to ask (and research). Cheers, --Tim " Joshua D. Drake -- PostgreSQL - XMPP: jdrake@jabber.postgresql.org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Marco Colombo wrote: > Yes, but we knew it already, didn't we? It's always been like > that, with IDE disks and write-back cache enabled, fsync just > waits for the disk reporting completion and disks lie about I've looked hard, and I have yet to see a disk that lies. ext3, OTOH, seems to lie. IDE drives happily report whether they support write barriers or not, which you can see with the command:
%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
I've tested about a dozen drives, and I've never seen one that claims to support flushing but doesn't. And I haven't seen one that doesn't support it that was made less than half a decade ago. IIRC, ATA-5 specs from 2000 made supporting this mandatory. Linux kernels since 2005 or so check for this feature. It'll happily tell you which of your devices don't support it:
%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
And for devices that do, it will happily send IDE FLUSH CACHE commands to IDE drives that support the feature. At the same time Linux kernels started sending the very similar SCSI SYNCHRONIZE CACHE commands. > Anyway, it's the block device job to control disk caches. A > filesystem is just a client to the block device, it posts a > flush request, what happens depends on the block device code. > The FS doesn't talk to disks directly. And a write barrier is > not a flush request, is a "please do not reorder" request. > On fsync(), ext3 issues a flush request to the block device, > that's all it's expected to do. But AFAICT ext3 fsync() only tells the block device to flush disk caches if the inode was changed. Or, at least empirically, if I modify a file and do fsync(fd) on ext3, it does not wait until the disk has spun to where it's supposed to spin. But if I put a couple of fchmod()'s right before the fsync() it does.
I am jumping into this thread late, and maybe this has already been
stated clearly, but from my experience benchmarking, LVM does *not* lie
about fsync() on the servers I've configured. An fsync() goes to the
physical device. You can see it clearly by setting the write cache on
the RAID controller to write-through policy: performance decreases to
what the disks can do. And my colleagues and clients have tested yanking
the power plug and checking that the data got to the RAID controller's
battery-backed cache, many many times. In other words, the data is safe
and durable, even on LVM.

However, I have never tried to do this on volumes that span multiple
physical devices, because LVM can't take an atomic snapshot across them,
which completely negates the benefit of LVM for my purposes. So I always
create one logical disk in the RAID controller, and then carve that up
with LVM, partitions, etc. however I please.

I almost surely know less about this topic than anyone on this thread.

Baron
On Thu, Mar 19, 2009 at 12:49:52AM +0100, Marco Colombo wrote:
> It has to wait for I/O completion on write(), then, so it has to go to
> sleep. If two different processes do a write(), you don't know which
> will be awakened first. Preallocation doesn't mean much here, since with
> O_SYNC you expect a physical write to be done (with the whole sleep/
> HW interrupt/SW interrupt/awake dance). It's true that you may expect
> the writes to be carried out in order, and that might be enough. I'm
> not sure, though.

True, but the relative wakeup order of two different processes is not
important, since by definition they are working on different
transactions. As long as the WAL writes for a single transaction (in a
single process) are not reordered, you're fine.

The benefit of a non-overwriting storage manager is that you don't need
to worry about undos. Any incomplete transaction is uncommitted, and so
any data produced by that transaction is ignored.

> It may be acceptable or not. Sometimes it's not. Sometimes you must be
> sure the data is on platters before you report "committed". Sometimes
> when you say "fsync!" you mean "I want the data flushed to disk NOW,
> and I really mean it!". :)

Of course. Committing a transaction comes down to flipping a single bit.
Before you flip it, all the WAL data for that transaction must have hit
disk. And you don't tell the client the transaction has committed until
the flipped bit has hit disk. And fsync had better do what you're asking
(how fast is just a performance issue, as long as it's done).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
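The commit ordering described here can be written down as a deliberately
over-explicit sketch (hypothetical helper names, not PostgreSQL
internals; error handling omitted). The point is only the ordering: the
transaction's WAL must be durable before the commit record, and the
commit record must be durable before the client hears "committed". In
practice the flushes can often be combined, since WAL is written
sequentially.

/* Sketch of the commit ordering; all guarantees hang on fsync()
 * actually reaching stable storage. */
#include <stddef.h>
#include <unistd.h>

void commit_transaction(int wal_fd,
                        const void *wal_records, size_t wal_len,
                        const void *commit_record, size_t commit_len,
                        void (*reply_to_client)(void))
{
    /* 1. The transaction's WAL data goes out first...            */
    write(wal_fd, wal_records, wal_len);   /* error handling omitted */
    fsync(wal_fd);                         /* ...and must be on disk... */

    /* 2. ...before the commit record ("flipping the bit") is written
     *    and made durable in turn.                               */
    write(wal_fd, commit_record, commit_len);
    fsync(wal_fd);

    /* 3. Only now may the client be told the transaction committed. */
    reply_to_client();
}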
Martijn van Oosterhout wrote:
> True, but the relative wakeup order of two different processes is not
> important, since by definition they are working on different
> transactions. As long as the WAL writes for a single transaction (in a
> single process) are not reordered, you're fine.

I'm not totally sure, but I think I understand what you mean here:
independent transactions, by definition, don't care about relative
ordering.

.TM.
Ron Mayer wrote:
> Marco Colombo wrote:
>> Yes, but we knew it already, didn't we? It's always been like
>> that, with IDE disks and write-back cache enabled, fsync just
>> waits for the disk reporting completion and disks lie about
>
> I've looked hard, and I have yet to see a disk that lies.

No, "lie" in the sense that they report completion before the data hits
the platters. Of course, that's the expected behaviour with write-back
caches.

> ext3, OTOH, seems to lie.

ext3 simply doesn't know: it interfaces with a block device, which does
the caching (OS level) and the reordering (e.g. the elevator algorithm).
ext3 doesn't directly send commands to the disk, nor does it manage the
OS cache.

When software RAID and device mapper come into play, you have "virtual"
block devices built on top of other block devices. My home desktop has
ext3 on top of a dm device (/dev/mapper/something, a LV set up by LVM in
this case), on top of a raid1 device (/dev/mdX), on top of /dev/sdaX and
/dev/sdbX, which, in a way, are themselves block devices built on
others, /dev/sda and /dev/sdb (you don't actually send commands to
partitions, do you? although the mapping "sector offset relative to
partition -> real sector on disk" is trivial). Each of these layers
potentially caches writes and reorders them; that's the job of a block
device, although it really makes sense only for the last one, the one
that controls the disk. Anyway, there isn't much ext3 can do beyond
posting write-barrier and flush requests to the block device at the top
of the "stack".

> IDE drives happily report whether they support write barriers
> or not, which you can see with the command:
> %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT

Of course, a write barrier is not a cache flush. A flush is synchronous,
a write barrier asynchronous. The disk supports flushing, not write
barriers. Well, technically if you can control the ordering of the
requests, that's barriers proper; with SCSI you can, IIRC. But a cache
flush is, well, a flush.

> Linux kernels since 2005 or so check for this feature. They'll
> happily tell you which of your devices don't support it:
> %dmesg | grep 'disabling barriers'
> JBD: barrier-based sync failed on md1 - disabling barriers
> And for devices that do, the kernel will happily send IDE FLUSH CACHE
> commands to IDE drives that support the feature. At the same time,
> Linux kernels started sending the very similar SCSI SYNCHRONIZE CACHE
> commands.
>> Anyway, it's the block device's job to control disk caches. A
>> filesystem is just a client to the block device; it posts a
>> flush request, and what happens depends on the block device code.
>> The FS doesn't talk to disks directly. And a write barrier is
>> not a flush request, it's a "please do not reorder" request.
>> On fsync(), ext3 issues a flush request to the block device,
>> that's all it's expected to do.
>
> But AFAICT ext3's fsync() only tells the block device to
> flush disk caches if the inode was changed.

No, ext3 posts a write barrier request when the inode changes and it
commits the journal, which is not a flush. [*]

> Or, at least empirically: if I modify a file and do
> fsync(fd) on ext3, it does not wait for the platter to spin
> around to where the data needs to be written. But if I put
> a couple of fchmod()'s right before the fsync(), it does.

If you were right, and ext3 didn't wait, it would make no difference on
fsync whether the disk cache is enabled or not. My test shows a 50x
speedup when turning the disk cache on. So for sure ext3 is waiting for
the block device to report completion.
It's the block device that - on flush - doesn't issue a FLUSH command to
the disk.

.TM.

[*] A barrier ends up in a FLUSH for the disk, but that doesn't mean it's
synchronous, like a real flush. Even journal updates done with barriers
don't mean "hit the disk now"; they just mean "keep order" when writing.
If you turn off automatic page cache flushing and you have zero memory
pressure, a write request with a barrier may stay forever in the OS
cache, at least in theory. Imagine you don't have bdflush and nothing
reclaims resources: days of activity may stay in RAM, as far as write
barriers are concerned.

Now someone types 'sync' as root. The block device starts flushing dirty
pages, reordering writes but honoring barriers: it reorders anything up
to the first barrier, posts write requests to the disk, issues a FLUSH
command, then waits until the flush is completed. Then it "consumes" the
barrier and starts processing writes again, reordering them up to the
next barrier, and so on. So yes, a barrier turns into a FLUSH command
for the disk. But in this scenario, days have passed since the original
write/barrier request from the filesystem.

Compare with an fsync(). Even in the above scenario, an fsync() should
end up in a FLUSH command to the disk, and wait for the request to
complete, before awakening the process that issued it. So the filesystem
has to request a flush operation from the block device, not a barrier.
And so it does. If it turns out that the block device just issues writes
but no FLUSH command to the disks, that's not the FS's fault. And issuing
barrier requests won't change anything.

All this in theory. In practice there may be implementation details that
make things different. I've read that in the Linux kernel at some time
(maybe even now) only one outstanding write barrier is possible in the
stack of block devices. So I guess that a second write barrier request
triggers a real disk flush. That's why when you use fchmod() repeatedly,
you see all those flushes. But technically it's a side effect, and I
think on closer analysis you may notice it's always lagging one request
behind, which you don't see just by looking at numbers or listening to
disk noise. So, multiple journal commits may really help in having the
disk cache flushed as a side effect, but I think the bug is elsewhere.
The day Linux supports multiple outstanding write-barrier requests, that
stops working. It's the block device that should be fixed, so that it
performs a cache FLUSH when the filesystem asks for a flush. Why block
devices don't do that today is a mystery to me, but I think there must
be something that I'm missing.

Anyway, the point here is that using LVM is no less safe than using an
IDE block device directly. There may be filesystems that on fsync issue
not only a flush request but also a journal commit, with attached
barrier requests, thus getting the Right Thing done by double side
effect. And yes, ext3 is NOT among them, unless you trigger those
commits with the fchmod() dance.
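A crude way to check which case you are in, and to see the kind of 50x
difference mentioned above, is to time a loop of fsync()'d writes and
compare the rate against the drive's rotational limit (for a 7200 rpm
disk, roughly 120 synchronous writes per second at best). The following
is a sketch with an arbitrary test file name; run it once with the
write-back cache enabled and once after disabling it with hdparm -W0 on
the drive under test. Rates in the thousands mean some layer is caching
the writes rather than flushing them.

/* Sketch: measure how many fsync()'d one-byte writes complete per second. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 500;
    int fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    struct timeval t0, t1;
    char c = 'x';

    if (fd < 0) {
        perror("open");
        return 1;
    }

    gettimeofday(&t0, NULL);
    for (int i = 0; i < iterations; i++) {
        write(fd, &c, 1);   /* error handling omitted for brevity */
        fsync(fd);
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fsync'd writes in %.2f s = %.0f/sec\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}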
Hi,

Martijn van Oosterhout wrote:
> And fsync had better do what you're asking
> (how fast is just a performance issue, as long as it's done).

Where are we on this issue? I've read all of this thread and the one on
the lvm-linux mailing list as well, but still don't feel confident.

In the following scenario:

fsync -> filesystem -> physical disk

I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
the physical disk upon fsync(), to do what it's told: flush the cache(s)
to disk. Further, I'm also assuming the physical disk is flushable (i.e.
it correctly implements the blkdev_issue_flush() call). Here we can be
pretty certain that fsync works as advertised, I think.

The unanswered question to me is: what happens if I add LVM in between,
as follows?

fsync -> filesystem -> device mapper (lvm) -> physical disk(s)

Again, assume the filesystem issues a blkdev_issue_flush() to the lower
layer and the physical disks are all flushable (and implement that
correctly). How does the device mapper behave?

I'd expect it to forward the blkdev_issue_flush() call to all affected
devices and only return after the last one has confirmed and completed
flushing its caches. Is that the case?

I've also read about the newish write barriers and about filesystems
implementing fsync with such write barriers. That seems fishy to me and
would of course break in combination with LVM (which doesn't completely
support write barriers, AFAIU). However, that's clearly the filesystem
side of the story and has not much to do with whether fsync lies on top
of LVM or not.

Help in clarifying this issue is greatly appreciated.

Kind Regards

Markus Wanner
Markus Wanner wrote:
> Hi,
>
> Martijn van Oosterhout wrote:
>> And fsync had better do what you're asking
>> (how fast is just a performance issue, as long as it's done).
>
> Where are we on this issue? I've read all of this thread and the one on
> the lvm-linux mailing list as well, but still don't feel confident.
>
> In the following scenario:
>
> fsync -> filesystem -> physical disk
>
> I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
> the physical disk upon fsync(), to do what it's told: flush the cache(s)
> to disk. Further, I'm also assuming the physical disk is flushable (i.e.
> it correctly implements the blkdev_issue_flush() call). Here we can be
> pretty certain that fsync works as advertised, I think.
>
> The unanswered question to me is: what happens if I add LVM in between,
> as follows?
>
> fsync -> filesystem -> device mapper (lvm) -> physical disk(s)
>
> Again, assume the filesystem issues a blkdev_issue_flush() to the lower
> layer and the physical disks are all flushable (and implement that
> correctly). How does the device mapper behave?
>
> I'd expect it to forward the blkdev_issue_flush() call to all affected
> devices and only return after the last one has confirmed and completed
> flushing its caches. Is that the case?
>
> I've also read about the newish write barriers and about filesystems
> implementing fsync with such write barriers. That seems fishy to me and
> would of course break in combination with LVM (which doesn't completely
> support write barriers, AFAIU). However, that's clearly the filesystem
> side of the story and has not much to do with whether fsync lies on top
> of LVM or not.
>
> Help in clarifying this issue is greatly appreciated.
>
> Kind Regards
>
> Markus Wanner

Well, AFAIK, the summary would be:

1) adding LVM to the chain makes no difference;

2) you still need to disable the write-back cache on IDE/SATA disks for
fsync() to work properly;

3) without LVM and with the write-back cache enabled, due to current(?)
limitations in the Linux kernel, with some journaled filesystems (but
not ext3 in data=writeback or data=ordered mode; I'm not sure about
data=journal), you may be less vulnerable if you use fsync() (or
O_SYNC). "Less vulnerable" means that all pending changes are committed
to disk except the very last one.

So:

- write-back cache + ext3 = unsafe
- write-back cache + other fs = (depending on the fs)[*] safer, but not 100% safe
- write-back cache + LVM + any fs = unsafe
- write-through cache + any fs = safe
- write-through cache + LVM + any fs = safe

[*] the fs must use (directly or indirectly via journal commit) a write
barrier on fsync(). Ext3 doesn't (it does when the inode changes, but
that happens only once a second).

If you want both speed and safety, use a battery-backed controller (and
write-through cache on the disks, but the controller should enforce that
when you plug the disks in). It's the usual "Fast, Safe, Cheap: choose
two".

This is an interesting article:
http://support.microsoft.com/kb/234656/en-us/

Note how for all three kinds of disk (IDE/SATA/SCSI) they say: "Disk
caching should be disabled in order to use the drive with SQL Server".
They don't mention write barriers.

.TM.