Thread: Fwd: Apple Darwin disabled fsync?
>Date: Sat, 19 Feb 2005 17:59:21 -0800
>From: Dominic Giampaolo <dbg@apple.com>
>Subject: Re: bad fsync? (A.M.)
>To: darwin-dev@lists.apple.com
>
>>MySQL makes the following claim at:
>>http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
>>
>>"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
>>and up. Apple had disabled fsync() in Mac OS X for internal disk
>>drives, which caused corruption at power outages."
>>
>>First of all, is this accurate? A pointer to some docs or a tech note
>>on this would be helpful.
>>
>The comments about fsync() are wrong...
>
>On MacOS X, fsync() always has and always will flush all file data
>from host memory to the drive on which the file resides. The behavior
>of fsync() on MacOS X is the same as it is on every other version of
>Unix since the dawn of time (well, since the introduction of fsync
>anyway :-).
>
>I believe that what the above comment refers to is the fact that
>fsync() is not sufficient to guarantee that your data is on stable
>storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>to ask the drive to flush all buffered data to stable storage.
>
>Let me explain in more detail. With fsync() even though the OS
>writes the data through to the disk and the disk says "yes I wrote
>the data", the data is not actually on permanent storage. Unless
>you explicitly disable it, all disks have a write buffer which holds
>data you've written. The disk buffers the data you wrote until it
>decides to flush it to the platters (and the writes may not be in
>the order you wrote them). If you lose power or the system crashes
>before the data is written, you can wind up in a situation where only
>some of your data is actually on disk. What is worse is that even if
>you write blocks A, B and C, call fsync() and then write block D you
>may find after rebooting that blocks A and D are on disk but B and C
>are not (in fact any ordering of A, B, C, and D is possible).
>
>While this may seem like a rare case it is not. In fact if you sit
>down and pull the plug on a system you can make it happen in one or
>two plug pulls. I have even gone so far as to watch this behavior
>with a logic analyzer on the ATA bus: I saw the data for two writes
>come across the ATA cable, the drive replied and said the writes were
>successful and then when we rebooted the data from the second write
>was correct on disk but the data from the first write was not.
>
>To deal with this we introduced the F_FULLFSYNC fcntl which will ask
>the drive to flush all of its buffered data to disk. When an app
>needs to guarantee that data is on disk it should use F_FULLFSYNC.
>In most cases you do not need such a heavy handed operation and
>fsync() is good enough. But in an app like a database, it is
>essential if you want transactional integrity.
>
>Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
>with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
>honor this command. Unfortunately quite a few firewire drive vendors
>disable this command and do not pass it to the drive. This means that
>most external firewire drives are not reliable if you lose power or
>the system crashes. We can't work-around that unless we ask the drive
>to disable the write cache completely (which hurts performance quite
>badly -- and even that may not be enough as some drives will ignore
>that request too).
>
>So in summary, I believe that the comments in the MySQL news posting
>are slightly confused. On MacOS X fsync() behaves the same as it does
>on all Unices. That's not good enough if you really care about data
>integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
>know, MacOS X is the only OS to provide this feature for apps that
>need to truly guarantee their data is on disk.
>
>Hope this clears things up.
>
>--dominic
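
For readers following along, here is a minimal C sketch of the call described
in the forwarded message. It assumes F_FULLFSYNC is exposed by <fcntl.h> on
Darwin and is simply absent elsewhere. The function name durable_sync and the
fallback to plain fsync() when the request fails are illustrative assumptions,
not something stated in the message.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush fd's data to the drive and, where the platform offers F_FULLFSYNC,
 * also ask the drive to empty its write cache.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 * The name durable_sync is hypothetical.
 */
int durable_sync(int fd)
{
    assert(fd >= 0);

#ifdef F_FULLFSYNC
    /* Darwin: fsync plus a request that the drive flush to the media. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /*
     * Assumption: if the device or filesystem rejects the request,
     * settle for ordinary fsync() semantics rather than failing outright.
     */
#endif
    /* Everywhere else: data reaches the drive, but may sit in its cache. */
    return fsync(fd);
}
```

A database would call something like this at commit time, after write(), and
treat a -1 return as a failed commit. Whether the drive actually honors the
flush is outside the program's control, as the message notes for many
firewire enclosures.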
Peter Bierman <bierman@apple.com> writes:
>> I believe that what the above comment refers to is the fact that
>> fsync() is not sufficient to guarantee that your data is on stable
>> storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>> to ask the drive to flush all buffered data to stable storage.

I've been looking for documentation on this without a lot of luck
("man fcntl" on OS X 10.3.8 has certainly never heard of it).
It's not completely clear whether this subsumes fsync() or whether
you're supposed to fsync() and then use the fcntl.

Also, isn't it fundamentally at the wrong level? One would suppose that
the drive flush operation is going to affect everything the drive
currently has queued, not just the one file. That makes it difficult
if not impossible to use efficiently.

regards, tom lane
Peter Bierman <bierman@apple.com> writes:

> > In most cases you do not need such a heavy handed operation and fsync() is
> > good enough.

Really? Can you think of a single application for which this definition of
fsync is useful?

Kernel buffers are transparent to the application, just as the disk buffer is.
It doesn't matter to an application whether the data is sitting in a kernel
buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
the writes actually end up on non-volatile disk then as far as the application
is concerned it's just an expensive noop.

--
greg
At 12:38 AM -0500 2/20/05, Tom Lane wrote:
>Dominic Giampaolo <dbg@apple.com> writes:
>>> I believe that what the above comment refers to is the fact that
>>> fsync() is not sufficient to guarantee that your data is on stable
>>> storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
>>> to ask the drive to flush all buffered data to stable storage.
>
>I've been looking for documentation on this without a lot of luck
>("man fcntl" on OS X 10.3.8 has certainly never heard of it).
>It's not completely clear whether this subsumes fsync() or whether
>you're supposed to fsync() and then use the fcntl.

My understanding is that you're supposed to fsync() and then use the
fcntl, but I'm not the filesystems expert. (Dominic, who wrote the
original message that I forwarded, is.)

I've filed a bug report asking for better documentation about this to
be placed in the fsync man page. <radar://4012378>

>Also, isn't it fundamentally at the wrong level? One would suppose that
>the drive flush operation is going to affect everything the drive
>currently has queued, not just the one file. That makes it difficult
>if not impossible to use efficiently.

I think the intent is to make the fcntl more accurate in time, as the
ability to do so appears in hardware. One of the advantages Apple has
is the ability to set very specific requirements for our hardware. So
if a block specific flush command becomes part of the ATA spec, Apple
can require vendors to support it, and support it correctly, before
using those drives. On the other hand, as Dominic described, once the
hardware is external (like a firewire enclosure), we lose that
leverage.

At 12:42 PM -0500 2/20/05, Greg Stark wrote:
>Dominic Giampaolo <dbg@apple.com> writes:
>
>> > In most cases you do not need such a heavy handed operation and fsync() is
>> > good enough.
>
>Really? Can you think of a single application for which this definition of
>fsync is useful?
>
>Kernel buffers are transparent to the application, just as the disk buffer is.
>It doesn't matter to an application whether the data is sitting in a kernel
>buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
>the writes actually end up on non-volatile disk then as far as the application
>is concerned it's just an expensive noop.

I think the intent of fsync() is closer to what you describe, but the
convention is that fsync() hands responsibility to the disk hardware.
That's how every other Unix seems to handle fsync() too. This gives
you good performance, and if you combine a smart fsync()ing
application with reliable storage hardware (like an XServe RAID that
battery backs its own write caches), you get the best combination.

If you know you have unreliable hardware, and critical reliability
issues, then you can use the fcntl, which seems to be more control
than other OSes give.

-pmb
Peter Bierman <bierman@apple.com> writes:

> I think the intent of fsync() is closer to what you describe, but the
> convention is that fsync() hands responsibility to the disk hardware.

The "convention" was also that the hardware didn't confirm the command until
it had actually been executed...

None of this matters to the application. A specification for fsync(2) that
says it forces the data to be shuffled around under the hood but fundamentally
doesn't change the semantics (that the data isn't guaranteed to be in
non-volatile storage) means that fsync didn't really do anything.

--
greg
On Sun, Feb 20, 2005 at 10:50:35PM -0500, Greg Stark wrote:
>
> Peter Bierman <bierman@apple.com> writes:
>
> > I think the intent of fsync() is closer to what you describe, but the
> > convention is that fsync() hands responsibility to the disk hardware.
>
> The "convention" was also that the hardware didn't confirm the command until
> it had actually been executed...
>
> None of this matters to the application. A specification for fsync(2) that
> says it forces the data to be shuffled around under the hood but fundamentally
> doesn't change the semantics (that the data isn't guaranteed to be in
> non-volatile storage) means that fsync didn't really do anything.

The real issue is this isn't specific to OS X. I know FreeBSD enables
write-caching on IDE drives by default, and I suspect Linux does as
well. It's probably worth adding a big, fat WARNING in the docs in
strategic places about this.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
I think we should add a new wal_sync_method that will use Darwin's
F_FULLFSYNC fcntl().

From <sys/fcntl.h>:

    #define F_FULLFSYNC 51 /* fsync + ask the drive to flush to the media */

This fcntl() will basically perform an fsync() on the file, then flush
the write cache of the disk.

I'll attempt to work up the patch. It should be trivial. Might need
some help on the configure tests though (it should #include
<sys/fcntl.h> and make sure F_FULLFSYNC is defined).

What's an appropriate name? It seems equivalent to
"fsync_writethrough". I suggest "fsync_full", "fsync_flushdisk", or
something. Is there a reason we're not indicating the supported
platform in the name of the method? Would "fsync_darwinfull" be
better? Let users know that it's only available for Darwin? Should we
do the same thing with win32-specific methods?

I think fsync() and F_FULLFSYNC should both be available as options on
Darwin. Currently in the code, "fsync" and "fsync_writethrough" set
sync_method to SYNC_METHOD_FSYNC, so there's no way to distinguish
between them.

Unsure which one would be the best default. fsync() matches the
semantics on other platforms. And conscientious users could specify
the F_FULLFSYNC fcntl() method if they want to make sure the data gets
past the drive's write cache.

Comments?

Thanks!
- Chris
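
A rough C sketch of how such a setting could be wired up, offered only as an
illustration of the proposal above and not as the actual patch. Only
SYNC_METHOD_FSYNC and the candidate name "fsync_full" come from the message.
SYNC_METHOD_FSYNC_FULL, SYNC_METHOD_INVALID, parse_sync_method and issue_sync
are hypothetical names, and the #ifdef stands in for the configure test.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

enum sync_method {
    SYNC_METHOD_FSYNC,       /* plain fsync() */
    SYNC_METHOD_FSYNC_FULL,  /* hypothetical: fcntl(fd, F_FULLFSYNC) */
    SYNC_METHOD_INVALID      /* hypothetical: unknown or unsupported name */
};

/* Map a wal_sync_method string to a method; reject what this build lacks. */
enum sync_method parse_sync_method(const char *name)
{
    assert(name != NULL);

    if (strcmp(name, "fsync") == 0)
        return SYNC_METHOD_FSYNC;
#ifdef F_FULLFSYNC
    /* Offered only where the header defines F_FULLFSYNC (Darwin). */
    if (strcmp(name, "fsync_full") == 0)
        return SYNC_METHOD_FSYNC_FULL;
#endif
    return SYNC_METHOD_INVALID;
}

/* Perform the selected sync. Returns 0 on success, -1 with errno set. */
int issue_sync(int fd, enum sync_method method)
{
    switch (method) {
    case SYNC_METHOD_FSYNC:
        return fsync(fd);
#ifdef F_FULLFSYNC
    case SYNC_METHOD_FSYNC_FULL:
        return fcntl(fd, F_FULLFSYNC) == -1 ? -1 : 0;
#endif
    default:
        errno = EINVAL;
        return -1;
    }
}
```

Keeping the two methods as distinct enum values is what lets "fsync" and a
full-flush method be told apart, which is the gap the message points out for
"fsync" and "fsync_writethrough".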