Thread: POSIX file updates
(Declaration of interest: I'm researching for a publication on OLTP
system design)

I have a question about file writes, particularly on POSIX. This arose
while considering the extent to which cache memory and command queueing
on disk drives can help improve performance.

Is it correct that POSIX requires that the updates to a single file are
serialised in the filesystem layer? So, if we have a number of dirty
pages to write back to a single file in the database (whether a table
or index) then we cannot pass these through the POSIX filesystem layer
into the TCQ/NCQ system on the disk drive, so it can reorder them?

I have seen suggestions that on Solaris this can be relaxed.

I *assume* that PostgreSQL's lack of threads or AIO and the single
bgwriter means that PostgreSQL 8.x does not normally attempt to make
any use of such a relaxation but could do so if the bgwriter fails to
keep up and other backends initiate flushes.

Does anyone know (perhaps from other systems) whether it is valuable to
attempt to take advantage of such a relaxation where it is available?

Does the serialisation for file update apply in the case where the file
contents have been memory-mapped and we try an msync (or equivalent)?
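[For concreteness, the msync case in the last question looks like the
following. This is a minimal Python sketch purely for illustration -
the file name is invented and this is not PostgreSQL code; mmap.flush()
calls msync() underneath on POSIX systems.]

```python
import mmap
import os
import tempfile

# Hypothetical scratch file, just for the demonstration.
path = os.path.join(tempfile.gettempdir(), "msync_demo.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

f = open(path, "r+b")
m = mmap.mmap(f.fileno(), 4096)

# Dirty a page through the mapping, then ask the kernel to write it
# back: mmap.flush() issues msync() on the mapped region.
m[0:5] = b"hello"
m.flush()

# A plain read() on the same file sees the update.
with open(path, "rb") as check:
    data = check.read(5)

m.close()
f.close()
os.remove(path)
```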
James Mansion wrote:
> (Declaration of interest: I'm researching for a publication
> on OLTP system design)
>
> I have a question about file writes, particularly on POSIX.
> This arose while considering the extent to which cache memory
> and command queueing on disk drives can help improve performance.
>
> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?

Is there anything in POSIX that seems to suggest this? :-) (i.e. why
are you going under the assumption that the answer is yes - did you
read something?)

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?

I don't believe POSIX has any restriction such as you describe - or if
it does, and I don't know about it, then most UNIX file systems (if
not most file systems on any platform) are not POSIX compliant.

Linux itself, even without NCQ, might choose to reorder the writes. If
you use ext2, the pressure to push pages out is based upon last used
time rather than last write time. It can choose to push out pages at
any time, and it's only every 5 seconds or so that the system task
(bdflush?) tries to force out all dirty file system pages. NCQ
exaggerates the situation, but I believe the issue pre-exists NCQ or
the SCSI equivalent of years past.

The rest of your email relies on the premise that POSIX enforces such
a thing, or that systems are POSIX compliant. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> Is there anything in POSIX that seems to suggest this? :-) (i.e. why
> are you going under the assumption that the answer is yes - did you
> read something?)

It was something somewhere on the Sun web site, relating to tuning
Solaris filesystems. Or databases. Or ZFS. :-( Needless to say I can't
find a search string that finds it now. I remember being surprised
though, since I wasn't aware of it either.

> I don't believe POSIX has any restriction such as you describe - or if
> it does, and I don't know about it, then most UNIX file systems (if
> not most file systems on any platform) are not POSIX compliant.

That, I can believe.

> Linux itself, even without NCQ, might choose to reorder the writes. If
> you use ext2, the pressure to push pages out is based upon last used
> time rather than last write time. It can choose to push out pages at
> any time, and it's only every 5 seconds or so that the system task
> (bdflush?) tries to force out all dirty file system pages. NCQ
> exaggerates the situation, but I believe the issue pre-exists NCQ or
> the SCSI equivalent of years past.

Indeed there do seem to be issues with Linux and fsync. It's one of the
things I'm trying to get a handle on as well - the relationship between
fsync and flushes of controller and/or disk caches.

> The rest of your email relies on the premise that POSIX enforces such
> a thing, or that systems are POSIX compliant. :-)

True. I'm hoping someone (Jignesh?) will be prompted to remember. It
may have been something in a blog related to ZFS vs other filesystems,
but so far I'm coming up empty in Google. It doesn't feel like
something I imagined, though.

James
> I don't believe POSIX has any restriction such as you describe - or if
> it does, and I don't know about it, then most UNIX file systems (if
> not most file systems on any platform) are not POSIX compliant.

I suspect that indeed there are two different issues here, in that the
file mutex relates to updates to the file, not passing the buffers
through into the drive, which indeed might be delayed.

Been using direct IO too much recently. :-(
Mark Mielke wrote:
> Is there anything in POSIX that seems to suggest this? :-) (i.e. why
> are you going under the assumption that the answer is yes - did you
> read something?)

Perhaps it was just this:

http://kevinclosson.wordpress.com/2007/01/18/yes-direct-io-means-concurrent-writes-oracle-doesnt-need-write-ordering/

Which of course isn't on Sun.
On Mon, 31 Mar 2008, James Mansion wrote:

> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?

Quoting from Lewine's "POSIX Programmer's Guide":

"After a write() to a regular file has successfully returned, any
successful read() from each byte position in the file that was modified
by that write() will return the data that was written by the
write()...a similar requirement applies to multiple write operations to
the same file position"

That's the "contract" that has to be honored. How your filesystem
actually implements this contract is none of a POSIX write() call's
business, so long as it does. It is the case that multiple writers to
the same file can get serialized somewhere because of how this call is
implemented though, so you're correct about that aspect of the
practical impact being a possibility.

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?

As long as the reordering mechanism also honors that any reads that
come after a write to a block reflect that write, they can be
reordered. The filesystem and drives are already doing elevator sorting
and similar mechanisms underneath you to optimize things. Unless you
use a sync operation or some sort of write barrier, you don't really
know what has happened.

> I have seen suggestions that on Solaris this can be relaxed.

There's some good notes in this area at:

http://www.solarisinternals.com/wiki/index.php/Direct_I/O
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance

It's clear that such relaxation has benefits with some of Oracle's
mechanisms as described.
But amusingly, PostgreSQL doesn't even support Solaris's direct I/O
method right now unless you override the filesystem mounting options,
so you end up needing to split it out and hack at that level
regardless.

> I *assume* that PostgreSQL's lack of threads or AIO and the
> single bgwriter means that PostgreSQL 8.x does not normally
> attempt to make any use of such a relaxation but could do so if the
> bgwriter fails to keep up and other backends initiate flushes.

PostgreSQL writes transactions to the WAL. When they have reached disk,
confirmed by a successful f[data]sync or a completed synchronous write,
that transaction is now committed. Eventually the impacted items in the
buffer cache will be written as well. At checkpoint time, things are
reconciled such that all dirty buffers at that point have been written,
and now f[data]sync is called on each touched file to make sure those
changes have made it to disk. Writes are assumed to be lost in some
memory (kernel, filesystem or disk cache) until they've been confirmed
to be written to disk via the sync mechanism.

When a backend flushes a buffer out, as soon as the OS caches that
write the database backend moves on without being concerned about how
it's eventually going to get to disk one day. As long as the newly
written version comes back again if it's read, the database doesn't
worry about what's happening until it specifically asks for a sync that
proves everything is done. So if the backends or the background writer
are spewing updates out, they don't care if the OS doesn't guarantee
the order they hit disk until checkpoint time; it's only the
synchronous WAL writes that do.

Also note that it's usually the case that backends write a substantial
percentage of the buffers out themselves. You should assume that's the
case unless you've done some work to prove the background writer is
handling most writes (which is difficult to even know before 8.3, much
less tune for).
That's how I understand everything to work, at least. I will add the
disclaimer that I haven't looked at the archive recovery code much yet.
Maybe there's some expectation it has for general database write
ordering in order for the WAL replay mechanism to work correctly; I
can't imagine how that could work, though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
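[The write-ahead pattern described above can be reduced to a few lines.
This is a toy Python sketch of the idea only - the class and method
names are invented and this is nothing like PostgreSQL's actual
implementation: a commit is durable once its WAL record is fsync'd,
while data-file writes trail behind until checkpoint.]

```python
import os

class MiniWAL:
    """Toy write-ahead log: commit() blocks on fsync of the log;
    data-file writes are left to the OS cache until checkpoint()."""

    def __init__(self, wal_path, data_path):
        self.wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT)
        self.committed = 0

    def commit(self, record: bytes):
        # Synchronous path: the transaction counts as committed only
        # after fsync on the WAL returns successfully.
        os.write(self.wal_fd, record + b"\n")
        os.fsync(self.wal_fd)
        self.committed += 1

    def lazy_data_write(self, offset: int, block: bytes):
        # Buffer-cache style write: the OS may cache and reorder these;
        # nobody waits on them here.
        os.pwrite(self.data_fd, block, offset)

    def checkpoint(self):
        # Reconcile: force every touched data file to disk.
        os.fsync(self.data_fd)
```

The point of the split is visible in the structure: only commit() pays
the cost of a physical flush; everything else rides the OS cache until
the checkpoint sync.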
Greg Smith <gsmith@gregsmith.com> writes:
> Quoting from Lewine's "POSIX Programmer's Guide":
> "After a write() to a regular file has successfully returned, any
> successful read() from each byte position in the file that was modified by
> that write() will return the data that was written by the write()...a
> similar requirement applies to multiple write operations to the same file
> position"

Yeah, I imagine this is what the OP is thinking of. But note that what
it describes is the behavior of concurrent write() and read() calls
within a normally-functioning system. I see nothing there that
constrains the order in which writes hit physical disk, nor (to put it
another way) that promises anything much about the state of the
filesystem after a crash.

As you stated, PG is largely independent of these issues anyway. As
long as the filesystem honors its spec-required contract that it won't
claim fsync() is complete before all the referenced data is safely on
persistent store, we are OK.

			regards, tom lane
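[The in-memory contract being discussed is easy to see directly. A toy
Python sketch, with an invented file name: a read after overlapping
writes must return the merged result, while nothing is implied about
when, or in what order, the blocks reach the platter.]

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "posix_contract_demo.dat")
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)

# Two overlapping writes.  POSIX promises only that a later successful
# read observes both (last writer wins on the overlap); it says nothing
# about crash-state ordering on disk.
os.pwrite(fd, b"AAAA", 0)
os.pwrite(fd, b"BB", 2)

# Read-after-write: the merged result, b"AABB".
observed = os.pread(fd, 4, 0)

os.close(fd)
os.remove(path)
```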
Greg Smith wrote:
> "After a write() to a regular file has successfully returned, any
> successful read() from each byte position in the file that was
> modified by that write() will return the data that was written by the
> write()...a similar requirement applies to multiple write operations
> to the same file position"

Yes, but that doesn't say anything about simultaneous read and write
from multiple threads from the same or different processes with
descriptors on the same file. No matter, I was thinking about a case
with direct unbuffered IO. Too many years using Sybase on raw devices.
:-(

Though, some of the performance studies relating to UFS directio
suggest that there are indeed benefits to managing the write-through
rather than using the OS as a poor man's background thread to do it.
SQL Server allows config based on deadline scheduling for checkpoint
completion, I believe. This seems to me a very desirable feature, but
it does need more active scheduling of the write-back.

> It's clear that such relaxation has benefits with some of Oracle's
> mechanisms as described. But amusingly, PostgreSQL doesn't even
> support Solaris's direct I/O method right now unless you override the
> filesystem mounting options, so you end up needing to split it out and
> hack at that level regardless.

Indeed that's a shame. Why doesn't it use the directio?

> PostgreSQL writes transactions to the WAL. When they have reached
> disk, confirmed by a successful f[data]sync or a completed synchronous
> write, that transaction is now committed. Eventually the impacted
> items in the buffer cache will be written as well. At checkpoint
> time, things are reconciled such that all dirty buffers at that point
> have been written, and now f[data]sync is called on each touched file
> to make sure those changes have made it to disk.

Yes, but fsync and stable-on-disk aren't the same thing if there is a
cache anywhere, are they? Hence the fuss a while back about Apple's
control of disk caches.
Solaris and Windows do it too.

Isn't allowing the OS to accumulate an arbitrary number of dirty blocks
without control of the rate at which they spill to media just exposing
a possibility of an IO storm when it comes to checkpoint time? Does
bgwriter attempt to control this with intermediate fsync (and push to
media if available)?

It strikes me as odd that fsync_writethrough isn't the most preferred
option where it is implemented. The postgres approach of *requiring*
that there be no cache below the OS is problematic, especially since
the battery backup on internal array controllers is hardly the
handiest solution when you find the mobo has died. And especially when
the inability to flush caches on modern SATA and SAS drives would
appear to be more a failing in some operating systems than in the
drives themselves...

The links I've been accumulating into my bibliography include:

http://www.h2database.com/html/advanced.html#transaction_isolation
http://lwn.net/Articles/270891/
http://article.gmane.org/gmane.linux.kernel/646040
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html
http://brad.livejournal.com/2116715.html

And your handy document on WAL tuning, of course.

James
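[One way to bound the dirty backlog being worried about here is to
fsync incrementally rather than letting the OS accumulate everything
until checkpoint time. A hedged Python sketch of the idea only - the
function name and the 1 MB threshold are invented, and real systems
would use something finer-grained than whole-file fsync:]

```python
import os

def write_with_bounded_dirty(fd, blocks, flush_every_bytes=1 << 20):
    """Write blocks to fd, forcing an fsync whenever more than
    flush_every_bytes of unsynced data has accumulated, so the final
    sync never has to drain a huge backlog in one storm."""
    dirty = 0
    syncs = 0
    for block in blocks:
        os.write(fd, block)
        dirty += len(block)
        if dirty >= flush_every_bytes:
            os.fsync(fd)   # push to stable storage early and often
            dirty = 0
            syncs += 1
    os.fsync(fd)           # final sync covers the remainder
    return syncs + 1       # total number of fsync calls issued
```

The trade-off is the usual one: more frequent syncs smooth out the
checkpoint spike at the cost of losing some of the OS's freedom to
coalesce and reorder writes in between.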
On Wednesday, 2008-04-02 at 20:10 +0100, James Mansion wrote:
> It strikes me as odd that fsync_writethrough isn't the most preferred
> option where it is implemented. The postgres approach of *requiring*
> that there be no cache below the OS is problematic, especially since
> the battery backup on internal array controllers is hardly the
> handiest solution when you find the mobo has died.

Well, that might sound brutal, but I'm having a brute day today. There
are some items here.

1.) PostgreSQL relies on filesystem semantics. Which might be better or
worse than the raw devices other RDBMSs use as an interface, but in the
end it is just an interface. How well that works out depends strongly
on your hardware selection, your OS selection and so on. DB tuning is a
scientific art form ;) Worse, the fact that raw devices work better on
hardware X/OS Y than, say, filesystems is only of limited interest -
only if you happen to have an investment in X or Y already. In the end
the questions are: is the performance good enough, is the data safety
good enough, and at what cost (in money, work, ...).

2.) Data safety requirements vary strongly. In many (if not most)
cases, the recovery of the data on failed hardware is not what is
critical. Hint: being down until somebody figures out what failed,
whether the rest of the system is still stable, and so on is not
acceptable at all. Meaning the moment that the database server has any
problem, one of the hot standbys takes over. The thing you worry about
is whether all data has made it to the replication servers, not whether
some data might get lost in the hardware cache of a controller.
(Actually, talk to your local computer forensics guru; there are a
number of ways to keep the current flowing to electronics while moving
them.)

3.) A controller cache is an issue whether you have a filesystem in
your data path or not. If you do raw IO and the stupid hardware does
cache writes, well, that's about as stupid as it would be if it cached
filesystem writes.
Andreas

> And especially when the inability to flush caches on modern SATA and SAS
> drives would appear to be more a failing in some operating systems than in
> the drives themselves...
>
> The links I've been accumulating into my bibliography include:
>
> http://www.h2database.com/html/advanced.html#transaction_isolation
> http://lwn.net/Articles/270891/
> http://article.gmane.org/gmane.linux.kernel/646040
> http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html
> http://brad.livejournal.com/2116715.html
>
> And your handy document on WAL tuning, of course.
>
> James
Andreas Kostyrka wrote:
> takes over. The thing you worry about is if all data has made it to the
> replication servers, not if some data might get lost in the hardware
> cache of a controller. (Actually, talk to your local computer forensics
> guru, there are a number of ways to keep the current to electronics
> while moving them.)

But it doesn't, unless you use a synchronous rep at block level - which
is why we have SRDF. Log-based reps are async and will lose committed
transactions. Even if you failed over, it's still extraordinarily
useful to be able to see what the primary tried to do - it's the only
place the e-comm transactions are stored, and the customer will still
expect delivery.

I'm well aware that there are battery-backed caches that can be
detached from controllers and moved. But you'd better make darn sure
you move all the drives and plug them in in exactly the right order and
make sure they all spin up OK with the replaced cache, because it's
expecting them to be exactly as they were last time they were on the
bus.

> 3.) a controller cache is an issue if you have a filesystem in your data
> path or not. If you do raw io, and the stupid hardware do cache writes,
> well it's about as stupid as it would be if it would have cached
> filesystem writes.

Only if the OS doesn't know how to tell the cache to flush. SATA and
SAS both have that facility. But the semantics of *sync don't seem to
be defined to require it being exercised, at least as far as many
operating systems implement it.

You would think hard drives could have enough capacitor store to dump
cache to flash or to the drive - if only to a special dump zone near
where the heads park. They are spinning already, after all.

On small systems in SMEs it's inevitable that large drives will be
shared with filesystem use, even if the database files are on their own
slice.
If you can allow the drive to run with writeback cache turned on, then
the users will be a lot happier, even if DBMS commits force *all* that
cache to flush to the platters.
On Wed, 2 Apr 2008, James Mansion wrote:

>> But amusingly, PostgreSQL doesn't even support Solaris's direct I/O
>> method right now unless you override the filesystem mounting options,
>> so you end up needing to split it out and hack at that level
>> regardless.
> Indeed that's a shame. Why doesn't it use the directio?

You turn on direct I/O differently under Solaris than everywhere else,
and nobody has bothered to write the patch (trivial) and OS-specific
code to turn it on only when appropriate (slightly trickier) to handle
this case. There's not a lot of pressure on PostgreSQL to handle this
case correctly when Solaris admins are used to doing direct I/O tricks
on filesystems already, so they don't complain about it much.

> Yes but fsync and stable on disk isn't the same thing if there is a
> cache anywhere is it? Hence the fuss a while back about Apple's control
> of disk caches. Solaris and Windows do it too.

If your caches don't honor fsync by making sure it's on disk or a
battery-backed cache, you can't use them and expect PostgreSQL to
operate reliably. Back to that "doesn't honor the contract" case. The
code that implements fsync_writethrough on both Windows and Mac OS
handles those two cases by writing with the appropriate flags to not
get cached in a harmful way. I'm not aware of Solaris doing anything
stupid here--the last two Solaris x64 systems I've tried that didn't
have a real controller write cache ignored the drive cache and blocked
at fsync just as expected, limiting commits to the RPM of the drive.
Seen it on UFS and ZFS, both seem to do the right thing here.

> Isn't allowing the OS to accumulate an arbitrary number of dirty blocks
> without control of the rate at which they spill to media just exposing a
> possibility of an IO storm when it comes to checkpoint time? Does
> bgwriter attempt to control this with intermediate fsync (and push to
> media if available)?

It can cause exactly such a storm.
If you haven't noticed my other paper at
http://www.westnet.com/~gsmith/content/linux-pdflush.htm yet, it goes
over this exact issue as far as how Linux handles it. Now that it's
easy to get even a home machine to have 8GB of RAM in it, Linux will
gladly buffer ~800MB worth of data for you and cause a serious storm at
fsync time. It's not pretty when that happens into a single SATA drive
because there's typically plenty of seeks in that write storm too.

There was a prototype implementation plan that wasn't followed
completely through in 8.3 to spread fsyncs out a bit better to keep
this from being as bad. That optimization might make it into 8.4 but I
don't know that anybody is working on it. The spread checkpoints in 8.3
are so much better than 8.2 that many are happy to at least have that.

> It strikes me as odd that fsync_writethrough isn't the most preferred
> option where it is implemented.

It's only available on Win32 and Mac OS X (the OSes that might get it
wrong without that nudge). I believe every path through the code uses
it by default on those platforms; there's a lot of remapping in there.
You can get an idea of what code was touched by looking at the patch
that added the OS X version of fsync_writethrough (it was previously
only Win32):

http://archives.postgresql.org/pgsql-patches/2005-05/msg00208.php

> The postgres approach of *requiring* that there be no cache below the OS
> is problematic, especially since the battery backup on internal array
> controllers is hardly the handiest solution when you find the mobo has
> died.

If the battery backup cache doesn't survive being moved to another
machine after a motherboard failure, it's not very good. The real risk
to be concerned about is what happens if the card itself dies. If that
happens, you can't help but lose transactions. You seem to feel that
there is an alternative here that PostgreSQL could take but doesn't.
There is not.
You either wait until writes hit disk, which by physical limitations
only happens at RPM speed and therefore is too slow to commit for many
cases, or you cache in the most reliable memory you've got and hope for
the best. No software approach can change any of that.

> And especially when the inability to flush caches on modern SATA and SAS
> drives would appear to be more a failing in some operating systems than
> in the drives themselves..

I think you're extrapolating too much from the Win32/Apple cases here.
There are plenty of cases where the so-called "lying" drives themselves
are completely stupid on their own regardless of operating system.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
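[The Mac OS X side of the fsync_writethrough behaviour discussed above
maps to fcntl(F_FULLFSYNC), which asks the drive itself to flush its
cache. A portable caller can probe for it at runtime; this is a hedged
Python sketch with an invented function name, not PostgreSQL's code:]

```python
import fcntl
import os

def full_fsync(fd) -> str:
    """Force data through the drive's write cache where the platform
    exposes a way to do so.  On Mac OS X, fcntl(F_FULLFSYNC) asks the
    drive itself to flush; elsewhere fall back to plain fsync() and
    rely on it (or the admin's cache settings) being honest."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
        return "F_FULLFSYNC"
    os.fsync(fd)
    return "fsync"
```

The return value just reports which mechanism was used, so callers (or
a test) can see which path the platform took.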
On Wed, 2 Apr 2008, James Mansion wrote:

> I'm well aware that there are battery-backed caches that can be detached
> from controllers and moved. But you'd better make darn sure you move
> all the drives and plug them in in exactly the right order and make sure
> they all spin up OK with the replaced cache, because it's expecting them
> to be exactly as they were last time they were on the bus.

The better controllers tag the drives with a unique ID number so they
can route pending writes correctly even after such a disaster. This
falls into the category of tests people should do more often but don't:
write something into the cache, pull the power, rearrange the drives,
and see if everything still recovers.

> You would think hard drives could have enough capacitor store to dump
> cache to flash or the drive - if only to a special dump zone near where
> the heads park. They are spinning already after all.

The free market seems to have established that the preferred design
model for hard drives is that they be cheap and fast rather than
focused on reliability. I rather doubt the tiny percentage of the world
who cares as much about disk write integrity as database professionals
do can possibly make a big enough market to bother increasing the cost
and design complexity of the drive to do this.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
> You turn on direct I/O differently under Solaris than everywhere else,
> and nobody has bothered to write the patch (trivial) and OS-specific
> code to turn it on only when appropriate (slightly trickier) to handle
> this case. There's not a lot of pressure on PostgreSQL to handle this
> case correctly when Solaris admins are used to doing direct I/O tricks
> on filesystems already, so they don't complain about it much.

I'm not sure that this will survive use of PostgreSQL on Solaris with
more users on Indiana though, which I'm hoping will happen.

> RPM of the drive. Seen it on UFS and ZFS, both seem to do the right
> thing here.

But ZFS *is* smart enough to manage the cache, albeit sometimes with
unexpected consequences, as with the 2530 here:
http://milek.blogspot.com/

> You seem to feel that there is an alternative here that PostgreSQL
> could take but doesn't. There is not. You either wait until writes
> hit disk, which by physical limitations only happens at RPM speed and
> therefore is too slow to commit for many cases, or you cache in the
> most reliable memory you've got and hope for the best. No software
> approach can change any of that.

Indeed I do, but the issue I have is that some popular operating
systems (let's try to avoid the flame war) fail to expose control of
disk caches, and so the code assumes that the onus is on the admin, and
the documentation rightly says so. But this is as much a failure of the
POSIX API and operating systems to expose something that's necessary,
and it seems to me rather valuable that the application be able to work
with such facilities as they become available. Exposing the flush-cache
mechanisms isn't dangerous and can improve performance for non-DBMS
users of the same drives.

I think manipulation of this stuff is a major concern for a DBMS that
might be used by amateur SAs, and if at all possible it should work out
of the box on common hardware.
So far as I can tell, SQL Server Express makes a pretty good attempt at
it, for example.

It might be enough for initdb to whinge and fail if it thinks the disks
are behaving insanely, unless the would-be DBA sets a
'my_disks_really_are_that_fast' flag in the config. At the moment
anyone can apt-get themselves a DBMS which may become a liability.

At the moment:
- casual use is likely to be unreliable
- uncontrolled deferred IO can result in almost DOS-like checkpoints

These affect other systems than PostgreSQL too - but would be avoidable
if the drive cache flush was better exposed and the IO was staged to
use it. There's no reason to block on anything but the final IO in a
WAL commit after all, and with the deferred commit feature (which I
really like for workflow engines) intermediate WAL writes of configured
chunk size could let the WAL drives get on with it. Admittedly I'm
assuming a non-blocking write-through - direct IO from a background
thread (process if you must) or AIO.

> There are plenty of cases where the so-called "lying" drives
> themselves are completely stupid on their own regardless of operating
> system.

With modern NCQ-capable drive firmware? Or just with older PATA stuff?
There's an awful lot of FUD out there about SCSI vs IDE still.

James
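[The "block only on the final IO in a WAL commit" idea above is
essentially group commit: batch several transactions' log records and
pay for a single flush. A toy Python sketch with an invented function
name, purely to illustrate the shape of it:]

```python
import os

def group_commit(wal_fd, records):
    """Append all pending commit records, then issue a single fsync.
    Every transaction in the batch becomes durable at the same moment,
    so the per-transaction flush cost is divided by the batch size."""
    for rec in records:
        os.write(wal_fd, rec + b"\n")
    os.fsync(wal_fd)     # one physical flush covers the whole batch
    return len(records)  # number of transactions made durable
```

With a writeback-capable drive cache that can be flushed on demand, the
intermediate os.write() calls could complete into the cache, and only
this final flush would need to wait on the platters.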
On Mon, 31 Mar 2008, James Mansion wrote:

> I have a question about file writes, particularly on POSIX.

In other reading I just came across this informative article on this
issue, which amusingly was written the same day you asked about this:

http://jeffr-tech.livejournal.com/20707.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD