Thread: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
Hi pgsql-performance,

I was doing mass insertions on my desktop machine and getting at most 1 MB/s
of disk writes (apart from occasional bursts of 16 MB). Inserting 1 million
rows with a single integer (data + index 56 MB total) took over 2 MINUTES!
The only tuning I had done was shared_buffers=256MB.

So I got around to tuning the WAL writer and found that wal_buffers=16MB
works MUCH better. wal_sync_method=fdatasync also got similar results.

First of all, I'm running PostgreSQL 9.0.1 on Arch Linux:
* Linux kernel 2.6.36 (also tested with 2.6.35)
* Quad-core Phenom II
* a single Seagate 7200 RPM SATA drive (write caching on)
* ext4 FS over LVM, with noatime, data=writeback

I am creating a table like:
  create table foo(id integer primary key);

Then measuring performance with the query:
  insert into foo (id) select generate_series(1, 1000000);

130438.011 ms  wal_buffers=64kB, wal_sync_method=open_datasync (all defaults)
 29306.847 ms  wal_buffers=1MB,  wal_sync_method=open_datasync
  4641.113 ms  wal_buffers=16MB, wal_sync_method=open_datasync
^ from 130 s to 4.6 seconds just by changing wal_buffers

  5528.534 ms  wal_buffers=64kB, wal_sync_method=fdatasync
  4856.712 ms  wal_buffers=16MB, wal_sync_method=fdatasync
^ fdatasync works well even with small wal_buffers

  2911.265 ms  wal_buffers=16MB, fsync=off
^ not bad, getting 60% of ideal throughput

These defaults are not just hurting bulk-insert performance, but also
everyone who uses synchronous_commit=off.

Unless fdatasync is unsafe, I'd very much want to see it as the default for
9.1 on Linux (I don't know about other platforms). I can't see any reason
why each write would need to be synced if I don't commit that often.
Increasing wal_buffers probably has the same effect wrt data safety.

Also, the tuning guide on the wiki understates the importance of these
tunables. Reading it, I got the impression that some people change
wal_sync_method but that it's dangerous, and it even literally claims about
wal_buffers that "1MB is enough for some large systems". But the truth is
that if you want any write throughput AT ALL on a regular Linux desktop, you
absolutely have to change one of these. If the defaults were better, it
would be enough to set synchronous_commit=off to get all that your hardware
has to offer.

I read the mailing list archives and didn't find anything against it either.
Can anyone clarify the safety of wal_sync_method=fdatasync? Are there any
reasons why it shouldn't be the default?

Regards,
Marti
Marti Raudsepp wrote:
> Unless fdatasync is unsafe, I'd very much want to see it as the default for
> 9.1 on Linux (I don't know about other platforms). I can't see any reason
> why each write would need to be synced if I don't commit that often.
> Increasing wal_buffers probably has the same effect wrt data safety.

Writes are only sync'd out when you do a commit, or the database does a
checkpoint.

This issue is a performance difference introduced by a recent change to
Linux. open_datasync support was just added to Linux itself very recently.
It may be safer than fdatasync on your platform. As new code, it may have
bugs so that it doesn't really work at all under heavy load. No one has
really run those tests yet. See
http://wiki.postgresql.org/wiki/Reliable_Writes for some background, and
welcome to the fun of being an early adopter. The warnings in the tuning
guide are there for a reason--you're in untested territory now. I haven't
finished validating whether I consider 2.6.32 safe for production use yet,
and 2.6.36 is a solid year away from being on my list for even considering
it as a production database kernel. You should proceed presuming that all
writes are unreliable until proven otherwise.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
> Writes are only sync'd out when you do a commit, or the database does a
> checkpoint.

Hm? WAL is written out to disk after the space provided by wal_buffers
(def 8) * XLOG_BLCKSZ (def 8192) is used up. The default is 64kB, which you
reach pretty quickly - especially after a checkpoint.

With O_D?SYNC that write happens synchronously during a normal XLogInsert
whenever it hits a page boundary. *Additionally* it gets written out at
commit if synchronous commit is not off.

Not having a real O_DSYNC on Linux until recently makes it even more dubious
to have it as a default...

Andres
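To make that distinction concrete, here is a minimal C sketch - an
illustration only, not PostgreSQL's WAL code; the file names and block count
are made up - contrasting a descriptor opened with O_DSYNC, where every
write() is individually forced to stable storage, against a plain descriptor
where several writes are batched and flushed by one fdatasync() call:

/*
 * Minimal sketch (not PostgreSQL code): O_DSYNC forces each write() out on
 * its own, while a plain descriptor lets several writes share one
 * fdatasync().  File names and NBLOCKS are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 4

int main(void)
{
    char block[BLCKSZ];
    memset(block, 'x', sizeof(block));

    /* O_DSYNC path: each write() returns only after the data is on disk */
    int fd_dsync = open("wal_dsync.tmp", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd_dsync < 0) { perror("open O_DSYNC"); return 1; }
    for (int i = 0; i < NBLOCKS; i++)
        if (write(fd_dsync, block, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }
    close(fd_dsync);                 /* NBLOCKS synchronous flushes */

    /* fdatasync path: the same writes, but only one flush at the end */
    int fd_plain = open("wal_fdatasync.tmp", O_WRONLY | O_CREAT, 0600);
    if (fd_plain < 0) { perror("open"); return 1; }
    for (int i = 0; i < NBLOCKS; i++)
        if (write(fd_plain, block, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }
    if (fdatasync(fd_plain) != 0) { perror("fdatasync"); return 1; }
    close(fd_plain);                 /* one flush covers all NBLOCKS blocks */

    unlink("wal_dsync.tmp");
    unlink("wal_fdatasync.tmp");
    return 0;
}

The first pattern is roughly what each page-boundary write turns into under
O_DSYNC when wal_buffers is small; the second is what the fdatasync method
allows.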
On Sun, Oct 31, 2010 at 21:59, Greg Smith <greg@2ndquadrant.com> wrote:
> open_datasync support was just added to Linux itself very recently.

Oh, I didn't realize it was a new feature. Indeed, O_DSYNC support was added
in 2.6.33.

It seems like bad behavior on PostgreSQL's part to default to new, untested
features.

I have updated the tuning wiki page with my understanding of the problem:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server#wal_sync_method_wal_buffers

Regards,
Marti
On 01/11/10 08:59, Greg Smith wrote:
Greg,
Your reply is perhaps a bit confusingly worded - Marti was suggesting that
fdatasync be the default, so he wouldn't be an early adopter, since that call
has been implemented in the kernel for ages. I guess you wanted to stress
that *open_datasync* is the new kid, so watch out in case it bites...
Cheers
Mark
Andres Freund wrote:
> On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
>> Writes are only sync'd out when you do a commit, or the database does a
>> checkpoint.
>
> Hm? WAL is written out to disk after the space provided by wal_buffers
> (def 8) * XLOG_BLCKSZ (def 8192) is used up. The default is 64kB, which
> you reach pretty quickly - especially after a checkpoint.

Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that I
forget sometimes that people actually run with the default, where this
becomes an important consideration.

> Not having a real O_DSYNC on Linux until recently makes it even more
> dubious to have it as a default...

If Linux is now defining O_DSYNC, and it's buggy, that's going to break more
software than just PostgreSQL. It wasn't defined before because it didn't
work. If the kernel developers have made changes to claim it's working now,
but it doesn't really, I would think they'd consider any reports of actual
bugs here as important to fix. There's only so much the database can do in
the face of incorrect information reported by the operating system.

Anyway, I haven't actually seen reports that prove there's any problem here;
I was just pointing out that we haven't seen any positive reports about
database stress testing on these kernel versions yet either. The changes
here are theoretically the right ones, and defaulting to safe writes that
flush out write caches is a long-term good thing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Fri, Nov 5, 2010 at 23:10, Greg Smith <greg@2ndquadrant.com> wrote:
>> Not having a real O_DSYNC on Linux until recently makes it even more
>> dubious to have it as a default...
>
> If Linux is now defining O_DSYNC

Well, Linux has always defined both O_SYNC and O_DSYNC, but they used to
have the same value. The default changed due to an unfortunate heuristic in
PostgreSQL, which boils down to:

  #if O_DSYNC != O_SYNC
  #define DEFAULT_SYNC_METHOD   SYNC_METHOD_OPEN_DSYNC
  #else
  #define DEFAULT_SYNC_METHOD   SYNC_METHOD_FDATASYNC
  #endif

(see src/include/access/xlogdefs.h for details)

In fact, I was wrong in my earlier post. Linux has always offered O_DSYNC
behavior. What's new is POSIX-compliant O_SYNC, and the fact that these
flags are now distinguished.

Here's the change in Linux:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3

Regards,
Marti
On Friday 05 November 2010 22:10:36 Greg Smith wrote:
> Andres Freund wrote:
>> On Sunday 31 October 2010 20:59:31 Greg Smith wrote:
>>> Writes are only sync'd out when you do a commit, or the database does a
>>> checkpoint.
>>
>> Hm? WAL is written out to disk after the space provided by wal_buffers
>> (def 8) * XLOG_BLCKSZ (def 8192) is used up. The default is 64kB, which
>> you reach pretty quickly - especially after a checkpoint.
>
> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
> I forget sometimes that people actually run with the default, where this
> becomes an important consideration.

If you have relatively frequent checkpoints (quite sensible in some
environments, given the burstiness/response-time problems you can otherwise
get), even a 16MB wal_buffers can cause significantly more synchronous
writes with O_DSYNC, because of the amount of WAL traffic due to
full_page_writes. For one, the background wal writer won't keep up, and for
another, all its writes will be synchronous... It's simply a pointless
setting.

>> Not having a real O_DSYNC on Linux until recently makes it even more
>> dubious to have it as a default...
>
> If Linux is now defining O_DSYNC, and it's buggy, that's going to break
> more software than just PostgreSQL. It wasn't defined before because it
> didn't work. If the kernel developers have made changes to claim it's
> working now, but it doesn't really, I would think they'd consider any
> reports of actual bugs here as important to fix. There's only so much the
> database can do in the face of incorrect information reported by the
> operating system.

I don't see it being buggy so far. It's just doing what it should - which
is simply a terrible thing for our implementation. Generally. Independent
of Linux.

> Anyway, I haven't actually seen reports that prove there's any problem
> here; I was just pointing out that we haven't seen any positive reports
> about database stress testing on these kernel versions yet either. The
> changes here are theoretically the right ones, and defaulting to safe
> writes that flush out write caches is a long-term good thing.

I have seen several databases running under 2.6.33 with moderate to high
load for some time now. And two on 2.6.35. Loads of problems, but none
kernel-related so far ;-)

Andres
Marti Raudsepp wrote:
> In fact, I was wrong in my earlier post. Linux has always offered O_DSYNC
> behavior. What's new is POSIX-compliant O_SYNC, and the fact that these
> flags are now distinguished.

While I appreciate that you're trying to help here, I'm unconvinced you've
correctly diagnosed a couple of the components of what's going on here yet.
Please refrain from making changes to popular documents like the tuning
guide on the wiki based on speculation about what's happening. There's
definitely at least one mistake in what you wrote there, and I just reverted
the whole set of changes you made accordingly, until this is sorted out
better.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
> I forget sometimes that people actually run with the default, where this
> becomes an important consideration.

Do you have any testing in favor of 16MB vs. lower/higher?

-- 
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Sat, Nov 6, 2010 at 00:06, Greg Smith <greg@2ndquadrant.com> wrote:
> Please refrain from making changes to popular documents like the
> tuning guide on the wiki based on speculation about what's happening.

I will grant you that the details were wrong, but I stand by the conclusion.

I can state for a fact that PostgreSQL's default wal_sync_method varies
depending on the <fcntl.h> header. I have two PostgreSQL 9.0.1 builds, one
with an older /usr/include/bits/fcntl.h and one with a newer one. When I run
"show wal_sync_method;" on one instance, I get fdatasync. On the other one I
get open_datasync.

So let's get down to code. The older fcntl.h has:

  #define O_SYNC    010000
  # define O_DSYNC  O_SYNC    /* Synchronize data. */

The newer one has:

  #define O_SYNC    04010000
  # define O_DSYNC  010000    /* Synchronize data. */

So you can see that in the older header, O_DSYNC and O_SYNC are equal.

src/include/access/xlogdefs.h does:

  #if defined(O_SYNC)
  #define OPEN_SYNC_FLAG    O_SYNC
  ...
  #if defined(OPEN_SYNC_FLAG)
  /* O_DSYNC is distinct? */
  #if O_DSYNC != OPEN_SYNC_FLAG
  #define OPEN_DATASYNC_FLAG    O_DSYNC

^ it's comparing O_DSYNC != O_SYNC

  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD   SYNC_METHOD_OPEN_DSYNC
  #elif defined(HAVE_FDATASYNC)
  #define DEFAULT_SYNC_METHOD   SYNC_METHOD_FDATASYNC

^ so depending on whether O_DSYNC and O_SYNC are equal, the default
wal_sync_method will change.

Regards,
Marti
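For anyone who wants to check which way a particular build environment
falls, a trivial standalone program can be compiled against the local
<fcntl.h>. This is an illustration only, not part of PostgreSQL; it merely
mirrors the O_DSYNC != O_SYNC comparison quoted above:

/*
 * Print the O_SYNC/O_DSYNC values from the fcntl.h this is compiled
 * against, and which default the xlogdefs.h heuristic quoted above would
 * therefore pick.  Illustration only, not PostgreSQL code.
 */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    printf("O_SYNC  = 0%o\n", O_SYNC);
#ifdef O_DSYNC
    printf("O_DSYNC = 0%o\n", O_DSYNC);
    if (O_DSYNC != O_SYNC)
        printf("distinct values -> default would be open_datasync\n");
    else
        printf("same value      -> default would be fdatasync\n");
#else
    printf("O_DSYNC not defined -> default would be fdatasync\n");
#endif
    return 0;
}

Against the older header shown above, this prints the same octal value for
both flags, which corresponds to the fdatasync default; against the newer
header the values differ, which corresponds to open_datasync.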
>> Fair enough; I'm so used to bumping wal_buffers up to 16MB nowadays that
>> I forget sometimes that people actually run with the default, where this
>> becomes an important consideration.
>
> Do you have any testing in favor of 16MB vs. lower/higher?

From some tests I did a while ago, using separate spindles (RAID1) for xlog,
no battery-backed cache, on 8.4, with stuff that generates lots of xlog
(INSERT INTO ... SELECT):

When using a small wal_buffers, there was a problem when switching from one
xlog file to the next. Basically an fsync was issued, but most of the
previous log segment was still not written, so postgres was waiting for the
fsync to finish. Of course, the default 64 kB of wal_buffers is quickly
filled up, and all writes then wait for the end of this fsync. This caused
hiccups in the xlog traffic, and xlog throughput wasn't nearly as high as
the disks would allow. Sticking a stethoscope on the xlog hard drives
revealed a lot more random accesses than I would have liked (this is a much
simpler solution than tracing the IOs, lol).

I set wal_writer_delay to a very low setting (I don't remember which,
perhaps 1 ms) so the walwriter was in effect constantly flushing the wal
buffers to disk. I also used fdatasync instead of fsync.

Then I set wal_buffers to a rather high value, like 32-64 MB. Throughput and
performance were a lot better, and the xlog drives made a much more
"linear-access" noise.

What happened is that, since wal_buffers was larger than what the drives can
write in 1-2 rotations, it could absorb wal traffic during the time postgres
waits for the fdatasync / wal segment change, so the inserts would not have
to wait. And lowering the walwriter delay made it write something on each
disk rotation, so that when a COMMIT or segment switch came, most of the
time the WAL was already synced and there was no wait.

Just my 2c ;)
Marti Raudsepp wrote:
> I will grant you that the details were wrong, but I stand by the conclusion.
> I can state for a fact that PostgreSQL's default wal_sync_method varies
> depending on the <fcntl.h> header.

Yes; it's supposed to, and that logic works fine on some other platforms.
The question is exactly what the new Linux O_DSYNC behavior is doing, in
regards to whether it flushes drive caches out or not. Until you've
quantified which of the cases do that--which is required for reliable
operation of PostgreSQL--and which don't, you don't have any data that can
be used to draw a conclusion from. If some setups are faster because they
write less reliably, that doesn't automatically make them the better choice.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Monday 08 November 2010 00:35:29 Greg Smith wrote:
> Marti Raudsepp wrote:
>> I will grant you that the details were wrong, but I stand by the
>> conclusion. I can state for a fact that PostgreSQL's default
>> wal_sync_method varies depending on the <fcntl.h> header.
>
> Yes; it's supposed to, and that logic works fine on some other platforms.
> The question is exactly what the new Linux O_DSYNC behavior is doing, in
> regards to whether it flushes drive caches out or not. Until you've
> quantified which of the cases do that--which is required for reliable
> operation of PostgreSQL--and which don't, you don't have any data that
> can be used to draw a conclusion from. If some setups are faster because
> they write less reliably, that doesn't automatically make them the better
> choice.

I think that's FUD. Sorry.

Can you explain to me why fsync() may/should/could be *any* less reliable
than O_DSYNC? On *any* platform. Or fdatasync(), in the special way it's
used with pg, namely on completely preallocated files.

I think the reasons why O_DSYNC is slow in most circumstances - especially,
but not only, in combination with a small wal_buffers setting - are pretty
clear. Making a setting which is only supported on a small range of systems
the highest in the preference list is even more doubtful than the already
strange choice of making O_DSYNC the default, given the way it works (i.e.
no reordering, synchronous writes in the bgwriter, synchronous writes on
wal_buffers pressure, etc.).

Greetings,
Andres
Andres Freund wrote:
> I think that's FUD. Sorry.

Yes, there's plenty of uncertainty and doubt here, but not from me. The test
reports given so far have been so riddled with errors that I don't trust any
of them. As a counter example showing my expectations here, the "Testing
Sandforce SSD" tests done by Yeb Havinga:

http://archives.postgresql.org/message-id/4C4A9452.9070100@gmail.com

followed the right method for confirming both write integrity and
performance, including pull-the-plug situations. Those I trusted. What Marti
had posted, and what Phoronix investigated, just aren't that thorough.

> Can you explain to me why fsync() may/should/could be *any* less reliable
> than O_DSYNC? On *any* platform. Or fdatasync(), in the special way it's
> used with pg, namely on completely preallocated files.

If the Linux kernel has done extra work so that O_DSYNC writes are forced to
disk including a cache flush, but that isn't done for plain fdatasync()
calls, there could be a difference here. The database still wouldn't work
right in that case, because checkpoint writes are still going to be using
fdatasync. I'm not sure what the actual behavior is supposed to be, but
ultimately it doesn't matter. The history of the Linux kernel developers in
this area has been so completely full of bugs and incomplete implementations
that I am working from the assumption that we know nothing about what
actually works and what doesn't without doing careful real-world testing.

> I think the reasons why O_DSYNC is slow in most circumstances -
> especially, but not only, in combination with a small wal_buffers setting
> - are pretty clear.

Where are your benchmarks proving it, then? If you're right about this, and
I'm not saying you aren't, it should be obvious in simple benchmarks by
stepping through various sizes for wal_buffers and seeing the
throughput/latency situation improve. But since I haven't seen that done,
this one is still in the uncertainty & doubt bucket too. You're assuming one
of the observed problems corresponds to this theorized cause. But you can't
prove a performance change on theory. You have to isolate it, and then
you'll know. So long as there are multiple uncertainties going on here, I
don't have any conclusion yet, just a list of things to investigate that's
far longer than the list of what's been looked at so far.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
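The kind of pull-the-plug check being described can be approximated with a
very small program: write sequentially numbered records, report each one as
durable only after fdatasync() returns, cut power mid-run, then verify after
reboot that everything reported durable actually survived. The sketch below
is an illustration only - it is not the test referenced above and not a
PostgreSQL tool, and the file name "plugtest.dat" is made up:

/*
 * Simplified pull-the-plug durability check (illustration only).
 * Run it, cut power mid-run, reboot, then run it again with "verify".
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECSZ 512

static int run_writer(int fd)
{
    char rec[RECSZ];
    for (long i = 0; ; i++)
    {
        memset(rec, 0, sizeof(rec));
        snprintf(rec, sizeof(rec), "%ld", i);
        if (write(fd, rec, RECSZ) != RECSZ || fdatasync(fd) != 0)
        {
            perror("write/fdatasync");
            return 1;
        }
        /* Only claim the record durable after fdatasync() has returned. */
        printf("durable: %ld\n", i);
    }
}

static int run_verify(int fd)
{
    char rec[RECSZ];
    long expected = 0;
    while (read(fd, rec, RECSZ) == RECSZ && rec[0] != '\0')
    {
        if (atol(rec) != expected)
        {
            printf("record %ld is missing or corrupt\n", expected);
            return 1;
        }
        expected++;
    }
    printf("%ld records intact\n", expected);
    return 0;
}

int main(int argc, char **argv)
{
    int verify = (argc > 1 && strcmp(argv[1], "verify") == 0);
    int flags = verify ? O_RDONLY : (O_WRONLY | O_CREAT | O_TRUNC);
    int fd = open("plugtest.dat", flags, 0600);
    if (fd < 0) { perror("open"); return 1; }
    return verify ? run_verify(fd) : run_writer(fd);
}

If any record reported as durable before the power cut is missing afterward,
the write path being tested is not actually flushing the drive's cache.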
On Mon, Nov 8, 2010 at 01:35, Greg Smith <greg@2ndquadrant.com> wrote:
> Yes; it's supposed to, and that logic works fine on some other platforms.

No, the logic was broken to begin with. Linux technically supported O_DSYNC
all along. PostgreSQL used fdatasync as the default. Now, because Linux
added proper O_SYNC support, PostgreSQL suddenly prefers O_DSYNC over
fdatasync?

> Until you've quantified which of the cases do that--which is required for
> reliable operation of PostgreSQL--and which don't, you don't have any data
> that can be used to draw a conclusion from. If some setups are faster
> because they write less reliably, that doesn't automatically make them the
> better choice.

I don't see your point. If fdatasync worked on Linux, AS THE DEFAULT, all
the time until recently, then how does it all of a sudden need proof NOW?

If anything, the new open_datasync should be scrutinized, because it WASN'T
the default before and it hasn't gotten as much testing on Linux.

Regards,
Marti
On Mon, Nov 8, 2010 at 02:05, Greg Smith <greg@2ndquadrant.com> wrote:
> Where are your benchmarks proving it, then? If you're right about this,
> and I'm not saying you aren't, it should be obvious in simple benchmarks
> by stepping through various sizes for wal_buffers and seeing the
> throughput/latency situation improve.

Since benchmarking is the easy part, I did that. I plotted the time taken by
inserting 2 million rows into a table with a single integer column and no
indexes (70 MB total). The entire script is attached. If you don't agree
with something in this benchmark, please suggest improvements.

Chart: http://ompldr.org/vNjNiNQ/wal_sync_method1.png
Spreadsheet: http://ompldr.org/vNjNiNg/wal_sync_method1.ods
(the 2nd worksheet has the exact measurements)

This is a different machine from the original post, but with a similar
configuration: one 1TB 7200RPM Seagate Barracuda, no disk controller cache,
4G RAM, Phenom X4, Linux 2.6.36, PostgreSQL 9.0.1, Arch Linux. This time I
created a separate 20GB ext4 partition specially for PostgreSQL, with all
default settings (shared_buffers=32MB). The partition is near the end of the
disk, so hdparm gives a sequential read throughput of ~72 MB/s. I'm getting
frequent checkpoint warnings; should I try larger checkpoint_segments too?

The partition is re-created and initdb is re-run for each test, to prevent
file system allocation from affecting the results. I did two runs of all
benchmarks. The points on the graph show the sum of INSERT time + COMMIT
time in seconds.

One surprising thing on the graph is a "plateau", where open_datasync
performs almost equally with wal_buffers=128kB and 256kB.

Another noteworthy difference (not visible on the graph) is that with
open_datasync -- but not fdatasync -- and wal_buffers=128MB, INSERT time
keeps shrinking, but COMMIT takes longer. The total INSERT+COMMIT time
remains the same, however.

----

I have a few expendable hard drives here, so I can test reliability by
pulling the SATA cable as well. Is this kind of testing useful? What
workloads do you suggest?

Regards,
Marti
On Nov 7, 2010, at 6:35 PM, Marti Raudsepp wrote:
> On Mon, Nov 8, 2010 at 01:35, Greg Smith <greg@2ndquadrant.com> wrote:
>> Yes; it's supposed to, and that logic works fine on some other platforms.
>
> No, the logic was broken to begin with. Linux technically supported
> O_DSYNC all along. PostgreSQL used fdatasync as the default. Now, because
> Linux added proper O_SYNC support, PostgreSQL suddenly prefers O_DSYNC
> over fdatasync?
>
> I don't see your point. If fdatasync worked on Linux, AS THE DEFAULT, all
> the time until recently, then how does it all of a sudden need proof NOW?
>
> If anything, the new open_datasync should be scrutinized, because it
> WASN'T the default before and it hasn't gotten as much testing on Linux.

I agree. In my opinion, the burden of proof lies with those contending that
the default value should _change_ from fdatasync to O_DSYNC on Linux. If the
default changes, all power-fail testing and other reliability tests done
prior on a hardware configuration may become invalid without users even
knowing.

Unfortunately, a code change in postgres is required to _prevent_ the
default from changing when compiled and run against the latest kernels.

Summary:
Until recently, there was code with a comment in the Linux kernel that said
"For now, when the user asks for O_SYNC, we'll actually give O_DSYNC". Linux
has had O_DSYNC forever and ever, but not O_SYNC.

If O_DSYNC is preferred over fdatasync for the Postgres xlog (as the code
indicates), it should have been the preferred method for years on Linux as
well. If fdatasync has been the preferred method on Linux, and the
O_SYNC = O_DSYNC test was there for that reason, then the purpose behind the
test has broken.

No matter how you slice it, the default on Linux is implicitly changing, and
the choice is to either:
* Return the default to fdatasync
* Let it implicitly change to O_DSYNC

The latter choice is the one that requires testing to prove that it is the
proper and preferred default from the performance and data reliability POV.
The former is the status quo -- but requires a code change.

> Regards,
> Marti
Scott Carey <scott@richrelevance.com> writes:
> No matter how you slice it, the default on Linux is implicitly changing,
> and the choice is to either:
> * Return the default to fdatasync
> * Let it implicitly change to O_DSYNC
> The latter choice is the one that requires testing to prove that it is the
> proper and preferred default from the performance and data reliability
> POV.

And, in fact, the game plan is to do that testing and see which default we
want. I think it's premature to argue further about this until we have some
test results.

regards, tom lane
Scott Carey wrote:
> In my opinion, the burden of proof lies with those contending that the
> default value should _change_ from fdatasync to O_DSYNC on Linux. If the
> default changes, all power-fail testing and other reliability tests done
> prior on a hardware configuration may become invalid without users even
> knowing.

This seems to be ignoring the fact that unless you either added a
non-volatile cache or specifically turned off all write caching on your
drives, the result of all power-fail testing done on earlier versions of
Linux was that it failed. The default configuration of PostgreSQL on Linux
has been that any user who has a simple SATA drive gets unsafe writes,
unless they go out of their way to prevent them.

Whatever newer kernels do by default cannot be worse. The open question is
whether it's still broken, in which case we might as well favor the known
buggy behavior rather than the new one, or whether everything has improved
enough to no longer be unsafe with the new defaults.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Hi,

On Monday 08 November 2010 23:12:57 Greg Smith wrote:
> This seems to be ignoring the fact that unless you either added a
> non-volatile cache or specifically turned off all write caching on your
> drives, the result of all power-fail testing done on earlier versions of
> Linux was that it failed. The default configuration of PostgreSQL on
> Linux has been that any user who has a simple SATA drive gets unsafe
> writes, unless they go out of their way to prevent them.

Which is *no* argument in favor of any of the options, right?

> Whatever newer kernels do by default cannot be worse. The open question
> is whether it's still broken, in which case we might as well favor the
> known buggy behavior rather than the new one, or whether everything has
> improved enough to no longer be unsafe with the new defaults.

Either I majorly misunderstand you, or... I don't know. There simply *is* no
new implementation relevant to this discussion. Full stop.

What changed is that O_DSYNC is now defined differently from O_SYNC, and
O_SYNC actually does what it should. Which causes pg to move open_datasync
first in the preference list, doing what the option with the lowest
preference did up to now.

That does not *at all* change the earlier fdatasync() or fsync()
implementations/tests. It simply makes open_datasync the default, doing what
open_sync did earlier. For that, note that open_sync was the method of
*least* preference till now... And that fdatasync() thus was the default
till now. Which it is not anymore.

I don't argue *at all* that we have to test the change moving fdatasync
before open_datasync on the *other* operating systems. What I completely
don't get is all the talking about data consistency on Linux. It's simply
irrelevant in that context.

Andres
On Mon, Nov 8, 2010 at 20:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The latter choice is the one that requires testing to prove that it is
>> the proper and preferred default from the performance and data
>> reliability POV.
>
> And, in fact, the game plan is to do that testing and see which default
> we want. I think it's premature to argue further about this until we have
> some test results.

Who will be doing that testing? You said you're relying on Greg Smith to
manage the testing, but he's obviously uninterested, so it seems unlikely
that this will go anywhere. I posted my results with the simple INSERT test,
but nobody cared. I could do some pgbench runs, but I have no idea what
parameters would give useful results.

Meanwhile, PostgreSQL performance is regressing and there's still no
evidence that open_datasync is any safer.

Regards,
Marti
Marti Raudsepp <marti@juffo.org> writes:
> On Mon, Nov 8, 2010 at 20:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> And, in fact, the game plan is to do that testing and see which default
>> we want. I think it's premature to argue further about this until we
>> have some test results.
>
> Who will be doing that testing? You said you're relying on Greg Smith to
> manage the testing, but he's obviously uninterested, so it seems unlikely
> that this will go anywhere.

What's your basis for asserting he's uninterested? Please have a little
patience.

regards, tom lane
On Sat, Nov 13, 2010 at 20:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What's your basis for asserting he's uninterested? Please have a little
> patience.

My apologies, I was under the impression that he hadn't answered your
request, but he did in the -hackers thread.

Regards,
Marti
Time for a deeper look at what's going on here... I installed RHEL6 Beta 2
yesterday, on the presumption that since the release version just came out
this week, it was likely the same version Marti tested against. Also, it was
the one I already had a DVD for. This was on a laptop with a 7200 RPM hard
drive, already containing an Ubuntu installation for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to get
a gross idea of which situations the drives involved were likely flushing
data to disk correctly in, and in which ones it was impossible for that to
be true. 7200 RPM = 120 rotations/second, which puts an upper limit of 120
true fsync executions per second. The test_fsync released with PostgreSQL
9.0 now reports its values on the right scale so you can directly compare
against that (earlier versions reported seconds/commit, not
commits/second).

First I built test_fsync from inside an existing PostgreSQL 9.1 HEAD
checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started by looking at the Ubuntu system running ext3, which represents
the status quo we've been seeing the past few years. Initially the drive
write cache was turned on:

Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 2010
i686 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                       88476.784/second

Compare file sync methods using one write:
        (unavailable: open_datasync)
        open_sync 8k write              1192.135/second
        8k write, fdatasync             1222.158/second
        8k write, fsync                 1097.980/second

Compare file sync methods using two writes:
        (unavailable: open_datasync)
        2 open_sync 8k writes            527.361/second
        8k write, 8k write, fdatasync   1105.204/second
        8k write, 8k write, fsync       1084.050/second

Compare open_sync with different sizes:
        open_sync 16k write              966.047/second
        2 open_sync 8k writes            529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close          1064.177/second
        8k write, close, fsync          1042.337/second

Two notable things here. One, there is no open_datasync defined in this
older kernel. Two, all methods of commit give equally inflated commit rates,
far faster than the drive is capable of. This proves this setup isn't
flushing the drive's write cache after commit.

You can get safe behavior out of the old kernel by disabling its write
cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

Loops = 10000

Simple write:
        8k write                       89023.413/second

Compare file sync methods using one write:
        (unavailable: open_datasync)
        open_sync 8k write               106.968/second
        8k write, fdatasync              108.106/second
        8k write, fsync                  104.238/second

Compare file sync methods using two writes:
        (unavailable: open_datasync)
        2 open_sync 8k writes             51.637/second
        8k write, 8k write, fdatasync    109.256/second
        8k write, 8k write, fsync        103.952/second

Compare open_sync with different sizes:
        open_sync 16k write              109.562/second
        2 open_sync 8k writes             52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close           107.179/second
        8k write, close, fsync           106.923/second

And now the results are as expected: just under 120/second.

Onto RHEL6.
Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to see a straight comparison against
the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                      104194.886/second

Compare file sync methods using one write:
        open_datasync 8k write            97.828/second
        open_sync 8k write               109.158/second
        8k write, fdatasync              109.838/second
        8k write, fsync                   20.872/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes         53.902/second
        2 open_sync 8k writes             53.721/second
        8k write, 8k write, fdatasync    109.731/second
        8k write, 8k write, fsync         20.918/second

Compare open_sync with different sizes:
        open_sync 16k write              109.552/second
        2 open_sync 8k writes             54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close            20.800/second
        8k write, close, fsync            20.868/second

A few changes then. open_datasync is available now. It looks slightly slower
than the alternatives in this test, but I didn't see that in the later
tests, so I'm thinking that's just occasional run-to-run variation. For some
reason regular fsync is dramatically slower in this kernel than in earlier
ones. Perhaps a lot more metadata is being flushed all the way to the disk
in that case now?

The issue that I think Marti has been concerned about is highlighted in this
interesting subset of the data:

Compare file sync methods using two writes:
        2 open_datasync 8k writes         53.902/second
        8k write, 8k write, fdatasync    109.731/second

The results here aren't surprising; if you do two dsync writes, that will
take two disk rotations, while two writes followed by a single sync only
take one. But that does mean that in the case of small values for
wal_buffers, like the default, you could easily end up paying a rotation
sync penalty more than once per commit.

Next question is what happens if I turn the drive's write cache back on:

$ sudo hdparm -W1 /dev/sda

/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                      104198.143/second

Compare file sync methods using one write:
        open_datasync 8k write           110.707/second
        open_sync 8k write               110.875/second
        8k write, fdatasync              110.794/second
        8k write, fsync                   28.872/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes         55.731/second
        2 open_sync 8k writes             55.618/second
        8k write, 8k write, fdatasync    110.551/second
        8k write, 8k write, fsync         28.843/second

Compare open_sync with different sizes:
        open_sync 16k write              110.176/second
        2 open_sync 8k writes             55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close            28.779/second
        8k write, close, fsync            28.855/second

This is nice to see from a reliability perspective. With all three of the
viable sync methods here, the speed seen suggests the drive's volatile write
cache is being flushed after every commit.
This is going to be bad for people who have gotten used to doing development
on systems where that's not honored and they don't care, because this looks
like a 90% drop in performance on those systems. But since the new behavior
is safe and the earlier one was not, it's hard to get mad about it.
Developers probably just need to be taught to turn synchronous_commit off to
speed things up when playing with test data.

test_fsync writes to /var/tmp/test_fsync.out by default, not paying
attention to what directory you're in. So to use it to test another
filesystem, you have to make sure to give it an explicit full path. Next I
tested against the old Ubuntu partition that was formatted with ext3, with
the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
        8k write                      100943.825/second

Compare file sync methods using one write:
        open_datasync 8k write           106.017/second
        open_sync 8k write               108.318/second
        8k write, fdatasync              108.115/second
        8k write, fsync                  105.270/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes         53.313/second
        2 open_sync 8k writes             54.045/second
        8k write, 8k write, fdatasync     55.291/second
        8k write, 8k write, fsync         53.243/second

Compare open_sync with different sizes:
        open_sync 16k write               54.980/second
        2 open_sync 8k writes             53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close           105.032/second
        8k write, close, fsync           103.987/second

Strange... it looks like ext3 is executing cache flushes, too. Note that all
of the "Compare file sync methods using two writes" results are half speed
now; it's as if ext3 is flushing the first write out immediately? This
result was unexpected, and I don't trust it yet; I want to validate it
elsewhere.

What about XFS? That's a first-class filesystem on RHEL6 too:

[root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
        8k write                       71878.324/second

Compare file sync methods using one write:
        open_datasync 8k write            36.303/second
        open_sync 8k write                35.714/second
        8k write, fdatasync               35.985/second
        8k write, fsync                   35.446/second

I stopped that there, sick of waiting for it, as there's obviously some
serious work (mount options or such, at a minimum) that needs to be done
before XFS matches the other two. Will return to that later.

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data out
through the drive write caches correctly.

2) On single writes, there's no performance difference between the main
three methods you might use, with the straight fsync method having a serious
regression in this use case.

3) WAL writes that are forced by wal_buffers filling will turn into a
commit-length write when using the new, default open_datasync. Using the
older default of fdatasync avoids that problem, in return for causing WAL
writes to pollute the OS cache. The main benefit of O_DSYNC writes over
fdatasync ones is avoiding the OS cache.

I want to next go through and replicate some of the actual database level
tests before giving a full opinion on whether this data proves it's worth
changing the wal_sync_method detection. So far I'm torn between whether
that's the right approach, or if we should just increase the default value
for wal_buffers to something more reasonable.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
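For anyone without a PostgreSQL source tree handy, the gross check is easy
to approximate. The following is a rough C sketch of the kind of loop
test_fsync times - an illustration only, not the actual test_fsync source;
the output file name and loop count are made up. On a 7200 RPM drive, a
result far above ~120 per second means the drive's write cache is not really
being flushed:

/*
 * Rough sketch of a test_fsync-style timing loop (illustration only):
 * write one 8 kB block, fdatasync it, repeat, report the rate.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLCKSZ 8192
#define LOOPS  1000

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "fsync_sketch.out";
    char block[BLCKSZ];
    memset(block, 'x', sizeof(block));

    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval start, stop;
    gettimeofday(&start, NULL);
    for (int i = 0; i < LOOPS; i++)
    {
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, block, BLCKSZ) != BLCKSZ ||
            fdatasync(fd) != 0)
        {
            perror("write/fdatasync");
            return 1;
        }
    }
    gettimeofday(&stop, NULL);

    double secs = (stop.tv_sec - start.tv_sec) +
                  (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("8k write, fdatasync: %.3f/second\n", LOOPS / secs);

    close(fd);
    unlink(path);
    return 0;
}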
On Tue, Nov 16, 2010 at 3:39 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I want to next go through and replicate some of the actual database level
> tests before giving a full opinion on whether this data proves it's worth
> changing the wal_sync_method detection. So far I'm torn between whether
> that's the right approach, or if we should just increase the default value
> for wal_buffers to something more reasonable.

How about both? open_datasync seems problematic for a number of reasons -
you get an immediate write-through whether you need it or not, including, as
you point out, the case where you want to write several blocks at once and
then force them all out together. And 64kB for a ring buffer just seems
awfully small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/16/10 12:39 PM, Greg Smith wrote:
> I want to next go through and replicate some of the actual database level
> tests before giving a full opinion on whether this data proves it's worth
> changing the wal_sync_method detection. So far I'm torn between whether
> that's the right approach, or if we should just increase the default value
> for wal_buffers to something more reasonable.

We'd love to, but wal_buffers uses SysV shmem.

-- 
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>> I want to next go through and replicate some of the actual database level
>> tests before giving a full opinion on whether this data proves it's worth
>> changing the wal_sync_method detection. So far I'm torn between whether
>> that's the right approach, or if we should just increase the default
>> value for wal_buffers to something more reasonable.
>
> We'd love to, but wal_buffers uses SysV shmem.

Well, we're not going to increase the default to gigabytes, but we could
very probably increase it by a factor of 10 or so without anyone squawking.
It's been a while since I heard of anyone trying to run PG in 4MB shmmax.
How much would a change of that size help?

regards, tom lane
On Wed, Nov 17, 2010 at 01:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, we're not going to increase the default to gigabytes, but we could
> very probably increase it by a factor of 10 or so without anyone
> squawking. It's been a while since I heard of anyone trying to run PG in
> 4MB shmmax. How much would a change of that size help?

In my testing, when running a large bulk insert query with fdatasync on
ext4, changing wal_buffers has very little effect:

http://ompldr.org/vNjNiNQ/wal_sync_method1.png

(More details at
http://archives.postgresql.org/pgsql-performance/2010-11/msg00094.php )

It would take some more testing to say this conclusively, but looking at the
raw data, there only seems to be an effect when moving from 8 to 16MB. Could
be different on other file systems though.

Regards,
Marti
Josh Berkus wrote:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>> I want to next go through and replicate some of the actual database level
>> tests before giving a full opinion on whether this data proves it's worth
>> changing the wal_sync_method detection. So far I'm torn between whether
>> that's the right approach, or if we should just increase the default
>> value for wal_buffers to something more reasonable.
>
> We'd love to, but wal_buffers uses SysV shmem.

Speaking of SysV shmem, is it possible to use huge pages?

-- 
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 11/16/10 12:39 PM, Greg Smith wrote:
>>> I want to next go through and replicate some of the actual database
>>> level tests before giving a full opinion on whether this data proves
>>> it's worth changing the wal_sync_method detection. So far I'm torn
>>> between whether that's the right approach, or if we should just increase
>>> the default value for wal_buffers to something more reasonable.
>>
>> We'd love to, but wal_buffers uses SysV shmem.
>
> Well, we're not going to increase the default to gigabytes

Especially not as I don't think it will have any effect beyond
wal_segment_size, as that will force a write-out anyway. Or am I
misremembering the implementation?

Andres
Andres Freund <andres@anarazel.de> writes:
> On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
>> Well, we're not going to increase the default to gigabytes
>
> Especially not as I don't think it will have any effect beyond
> wal_segment_size, as that will force a write-out anyway. Or am I
> misremembering the implementation?

Well, there's a forced fsync after writing the last page of an xlog file,
but I don't believe that proves that more than 16MB of xlog buffers is
useless. Other processes could still be busy filling the buffers.

regards, tom lane
On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On Wednesday 17 November 2010 00:31:34 Tom Lane wrote:
>>> Well, we're not going to increase the default to gigabytes
>>
>> Especially not as I don't think it will have any effect beyond
>> wal_segment_size, as that will force a write-out anyway. Or am I
>> misremembering the implementation?
>
> Well, there's a forced fsync after writing the last page of an xlog file,
> but I don't believe that proves that more than 16MB of xlog buffers is
> useless. Other processes could still be busy filling the buffers.

Maybe I am missing something, but I think the relevant
AdvanceXLInsertBuffer() is currently called with WALInsertLock held?

Andres
Andres Freund <andres@anarazel.de> writes:
> On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
>> Well, there's a forced fsync after writing the last page of an xlog file,
>> but I don't believe that proves that more than 16MB of xlog buffers is
>> useless. Other processes could still be busy filling the buffers.
>
> Maybe I am missing something, but I think the relevant
> AdvanceXLInsertBuffer() is currently called with WALInsertLock held?

The fsync is associated with the write, which is not done with the insert
lock held. We're not quite that dumb.

regards, tom lane
On Wednesday 17 November 2010 02:04:28 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On Wednesday 17 November 2010 01:51:28 Tom Lane wrote:
>>> Well, there's a forced fsync after writing the last page of an xlog
>>> file, but I don't believe that proves that more than 16MB of xlog
>>> buffers is useless. Other processes could still be busy filling the
>>> buffers.
>>
>> Maybe I am missing something, but I think the relevant
>> AdvanceXLInsertBuffer() is currently called with WALInsertLock held?
>
> The fsync is associated with the write, which is not done with the insert
> lock held. We're not quite that dumb.

Ah, I see. The XLogWrite in AdvanceXLInsertBuffer only happens if the head
of the buffer gets to the tail - which is more likely if the wal buffers
are small...

Andres
> Well, we're not going to increase the default to gigabytes, but we could
> very probably increase it by a factor of 10 or so without anyone
> squawking. It's been a while since I heard of anyone trying to run PG in
> 4MB shmmax. How much would a change of that size help?

Last I checked, though, this comes out of the allocation available to
shared_buffers. And there definitely are several OSes (several Linuxes,
OS X) still limited to 32MB by default.

-- 
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
I wrote:
> The fsync is associated with the write, which is not done with the insert
> lock held. We're not quite that dumb.

But wait --- are you thinking of the call path where a write (and possible
fsync) is forced during AdvanceXLInsertBuffer because there's no WAL buffer
space left? If so, that's *exactly* the scenario that can be expected to be
less common with more buffer space.

regards, tom lane
Josh Berkus <josh@agliodbs.com> writes:
>> Well, we're not going to increase the default to gigabytes, but we could
>> very probably increase it by a factor of 10 or so without anyone
>> squawking. It's been a while since I heard of anyone trying to run PG in
>> 4MB shmmax. How much would a change of that size help?
>
> Last I checked, though, this comes out of the allocation available to
> shared_buffers. And there definitely are several OSes (several Linuxes,
> OS X) still limited to 32MB by default.

Sure, but the current default is a measly 64kB. We could increase that 10x
for a relatively small percentage hit in the size of shared_buffers, if you
suppose that there's 32MB available. The current default is set to still
work if you've got only a couple of MB in SHMMAX. What we'd want is for
initdb to adjust the setting as part of its probing to see what SHMMAX is
set to.

regards, tom lane
On Tue, Nov 16, 2010 at 6:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 11/16/10 12:39 PM, Greg Smith wrote:
>> I want to next go through and replicate some of the actual database level
>> tests before giving a full opinion on whether this data proves it's worth
>> changing the wal_sync_method detection. So far I'm torn between whether
>> that's the right approach, or if we should just increase the default
>> value for wal_buffers to something more reasonable.
>
> We'd love to, but wal_buffers uses SysV shmem.

<places tongue firmly in cheek>

Gee, too bad there's not some other shared-memory implementation we could
use...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Nov 16, 2010, at 4:05 PM, Mladen Gogala wrote:
> Josh Berkus wrote:
>> We'd love to, but wal_buffers uses SysV shmem.
>
> Speaking of SysV shmem, is it possible to use huge pages?

RHEL 6 and friends have transparent hugepage support. I'm not sure if they
yet do it transparently for SysV shmem, but they do for most everything
else. Sequential traversal of a process heap is several times faster with
hugepages. Unfortunately, postgres doesn't organize its blocks in its
shared_mem to be sequential for a relation, so it might not matter much.
On Nov 16, 2010, at 12:39 PM, Greg Smith wrote:
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                       88476.784/second
>
> Compare file sync methods using one write:
>         (unavailable: open_datasync)
>         open_sync 8k write              1192.135/second
>         8k write, fdatasync             1222.158/second
>         8k write, fsync                 1097.980/second
> [...]
> Two notable things here. One, there is no open_datasync defined in this
> older kernel. Two, all methods of commit give equally inflated commit
> rates, far faster than the drive is capable of. This proves this setup
> isn't flushing the drive's write cache after commit.

Nit: there is no open_sync, only open_dsync. Prior to recent kernels, only
(semantically) open_dsync exists, labeled as open_sync. New kernels move
that code to open_datasync and have a NEW open_sync that supposedly flushes
metadata properly.

> You can get safe behavior out of the old kernel by disabling its write
> cache:
>
> $ sudo /sbin/hdparm -W0 /dev/sda
> [...]
> And now the results are as expected: just under 120/second.
>
> Onto RHEL6. Setup for this initial test was:
> [...]
> And I started with the write cache off to see a straight comparison
> against the above:
>
> $ sudo hdparm -W0 /dev/sda
> [...]
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                      104194.886/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write            97.828/second
>         open_sync 8k write               109.158/second
>         8k write, fdatasync              109.838/second
>         8k write, fsync                   20.872/second

fsync is working now! Flushing metadata properly reduces performance.
However, shouldn't open_sync slow down vs open_datasync too and be similar
to fsync? Did you recompile your test on the RHEL6 system?
Code compiled on newer kernels will see O_DSYNC and O_SYNC as two separate sentinel values, let's call them 1 and 2 respectively. Code compiled against earlier kernels will see both O_DSYNC and O_SYNC as the same value, 1. So code compiled against older kernels, asking for O_SYNC on a newer kernel, will actually get O_DSYNC behavior!

This was intended. I can't find the link to the mail, but it was Linus' idea to let old code that expected the 'faster but incorrect' behavior retain it on newer kernels. Only a recompile with newer header files will trigger the new behavior and expose the 'correct' open_sync behavior.

This will be 'fun' for postgres packagers and users -- data reliability behavior differs based on what kernel it is compiled against. Luckily, the xlogs only need open_datasync semantics.

> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.902/second
> 2 open_sync 8k writes 53.721/second
> 8k write, 8k write, fdatasync 109.731/second
> 8k write, 8k write, fsync 20.918/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 109.552/second
> 2 open_sync 8k writes 54.116/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 20.800/second
> 8k write, close, fsync 20.868/second
>
> A few changes then. open_datasync is available now.

Again, noting the detail that it is open_sync that is new (depending on where it is compiled). The old open_sync is relabeled to the new open_datasync.

> It looks slightly
> slower than the alternatives on this test, but I didn't see that on the
> later tests so I'm thinking that's just occasional run to run
> variation. For some reason regular fsync is dramatically slower in this
> kernel than earlier ones. Perhaps a lot more metadata being flushed all
> the way to the disk in that case now?
>
> The issue that I think Marti has been concerned about is highlighted in
> this interesting subset of the data:
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.902/second
> 8k write, 8k write, fdatasync 109.731/second
>
> The results here aren't surprising; if you do two dsync writes, that
> will take two disk rotations, while two writes followed by a single sync
> only takes one. But that does mean that in the case of small values for
> wal_buffers, like the default, you could easily end up paying a rotation
> sync penalty more than once per commit.
>
> Next question is what happens if I turn the drive's write cache back on:
>
> $ sudo hdparm -W1 /dev/sda
>
> /dev/sda:
> setting drive write-caching to 1 (on)
> write-caching = 1 (on)
>
> $ ./test_fsync
>
> [gsmith@meddle fsync]$ ./test_fsync
> Loops = 10000
>
> Simple write:
> 8k write 104198.143/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 110.707/second
> open_sync 8k write 110.875/second
> 8k write, fdatasync 110.794/second
> 8k write, fsync 28.872/second
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 55.731/second
> 2 open_sync 8k writes 55.618/second
> 8k write, 8k write, fdatasync 110.551/second
> 8k write, 8k write, fsync 28.843/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 110.176/second
> 2 open_sync 8k writes 55.785/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 28.779/second
> 8k write, close, fsync 28.855/second
>
> This is nice to see from a reliability perspective. On all three of the
> viable sync methods here, the speed seen suggests the drive's volatile
> write cache is being flushed after every commit. This is going to be
> bad for people who have gotten used to doing development on systems
> where that's not honored and they don't care, because this looks like a
> 90% drop in performance on those systems. But since the new behavior is
> safe and the earlier one was not, it's hard to get mad about it.

I would love to see the same tests in this detail for RHEL 5.5 (which has ext3, ext4, and xfs). I think this data reliability issue that requires turning off the write cache was in the kernel ~2.6.26 to 2.6.31 range. Ubuntu doesn't really care about this stuff, which is one reason I avoid it for a prod db. I know that xfs with the right settings on RHEL 5.5 does not require disabling the write cache.

> Developers probably just need to be taught to turn synchronous_commit
> off to speed things up when playing with test data.

Absolutely.

> test_fsync writes to /var/tmp/test_fsync.out by default, not paying
> attention to what directory you're in. So to use it to test another
> filesystem, you have to make sure to give it an explicit full path.
> Next I tested against the old Ubuntu partition that was formatted with
> ext3, with the write cache still on:
>
> # mount | grep /ext3
> /dev/sda5 on /ext3 type ext3 (rw)
> # ./test_fsync -f /ext3/test_fsync.out
> Loops = 10000
>
> Simple write:
> 8k write 100943.825/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 106.017/second
> open_sync 8k write 108.318/second
> 8k write, fdatasync 108.115/second
> 8k write, fsync 105.270/second
>
> Compare file sync methods using two writes:
> 2 open_datasync 8k writes 53.313/second
> 2 open_sync 8k writes 54.045/second
> 8k write, 8k write, fdatasync 55.291/second
> 8k write, 8k write, fsync 53.243/second
>
> Compare open_sync with different sizes:
> open_sync 16k write 54.980/second
> 2 open_sync 8k writes 53.563/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> 8k write, fsync, close 105.032/second
> 8k write, close, fsync 103.987/second
>
> Strange...it looks like ext3 is executing cache flushes, too. Note that
> all of the "Compare file sync methods using two writes" results are half
> speed now; it's as if ext3 is flushing the first write out immediately?
> This result was unexpected, and I don't trust it yet; I want to validate
> this elsewhere.
>
> What about XFS? That's a first class filesystem on RHEL6 too:

and available on later RHEL 5's.

> [root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
> Loops = 10000
>
> Simple write:
> 8k write 71878.324/second
>
> Compare file sync methods using one write:
> open_datasync 8k write 36.303/second
> open_sync 8k write 35.714/second
> 8k write, fdatasync 35.985/second
> 8k write, fsync 35.446/second
>
> I stopped that there, sick of waiting for it, as there's obviously some
> serious work (mounting options or such at a minimum) that needs to be
> done before XFS matches the other two. Will return to that later.

Yes, XFS requires some fiddling. Its metadata operations are also very slow.

> So, what have we learned so far:
>
> 1) On these newer kernels, both ext4 and ext3 seem to be pushing data
> out through the drive's write cache correctly.

I suspect that some older kernels are partially OK here too. The kernel not flushing properly appeared near 2.6.25-ish.

> 2) On single writes, there's no performance difference between the main
> three methods you might use, with the straight fsync method having a
> serious regression in this use case.

I'll ask again -- did you compile the test on RHEL6 for the RHEL6 tests? The behavior in later kernels for this depends on what kernel it was compiled against for open_sync. For fsync, it's not a regression; it's actually flushing metadata properly, and is therefore actually robust if there is a power failure during a write. Even the write-cache-disabled case on the Ubuntu kernel could leave a filesystem with corrupt data if the power failed in a metadata-intensive write situation.

> 3) WAL writes that are forced by wal_buffers filling will turn into a
> commit-length write when using the new, default open_datasync. Using
> the older default of fdatasync avoids that problem, in return for
> causing WAL writes to pollute the OS cache. The main benefit of O_DSYNC
> writes over fdatasync ones is avoiding the OS cache.
>
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection. So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.
>
> --
> Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
> PostgreSQL Training, Services and Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Scott Carey wrote:
> Did you recompile your test on the RHEL6 system?

On both systems I showed, I checked out a fresh copy of the PostgreSQL 9.1 HEAD from the git repo, and compiled that on the server, to make sure I was pulling in the appropriate kernel headers. I wasn't aware of exactly how the kernel sync stuff was refactored though, thanks for the concise update on that. I can do similar tests on a RHEL5 system, but not on the same hardware. Can only make my laptop boot so many operating systems at a time usefully.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, Nov 17, 2010 at 3:24 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Scott Carey wrote:
>>
>> Did you recompile your test on the RHEL6 system?
>
> On both systems I showed, I checked out a fresh copy of the PostgreSQL 9.1
> HEAD from the git repo, and compiled that on the server, to make sure I was
> pulling in the appropriate kernel headers. I wasn't aware of exactly how
> the kernel sync stuff was refactored though, thanks for the concise update
> on that. I can do similar tests on a RHEL5 system, but not on the same
> hardware. Can only make my laptop boot so many operating systems at a time
> usefully.

One thing to note is that where on a disk things sit can make a /huge/
difference - depending on if Ubuntu is /here/ and RHEL is /there/ and
so on can make a factor of 2 or more difference. The outside tracks
of most modern SATA disks can do around 120MB/s. The inside tracks
aren't even half of that.

--
Jon
Jon Nelson wrote:
> One thing to note is that where on a disk things sit can make a /huge/
> difference - depending on if Ubuntu is /here/ and RHEL is /there/ and
> so on can make a factor of 2 or more difference. The outside tracks
> of most modern SATA disks can do around 120MB/s. The inside tracks
> aren't even half of that.
>

You're talking about changes in sequential read and write speed due to Zone Bit Recording (ZBR) AKA Zone Constant Angular Velocity (ZCAV). What I was measuring was commit latency time on small writes. That doesn't change as you move around the disk, since it's tied to the raw rotation speed of the drive rather than density of storage in any zone.

If I get to something that's impacted by sequential transfers rather than rotation time, I'll be sure to use the same section of disk for that. It wasn't really necessary to get these initial gross numbers anyway. What I was looking for is the about 10:1 speedup seen on this hardware when the write cache is used, which could easily be seen even were there ZBR differences involved.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
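For anyone following along, the rotation argument is easy to check with back-of-the-envelope arithmetic; a trivial sketch, assuming the 7200RPM drive used for these tests:

/* Back-of-the-envelope check: a synchronous commit has to wait for the
 * platter to come around, so the ceiling is one commit per rotation. */
#include <stdio.h>

int main(void)
{
    double rpm = 7200.0;
    double rotations_per_sec = rpm / 60.0;                       /* 120/s */

    printf("ceiling, one flush per commit:   %.0f commits/s\n", rotations_per_sec);
    printf("ceiling, two flushes per commit: %.0f commits/s\n", rotations_per_sec / 2.0);
    /* Close to the ~110/second and ~55/second figures measured once the
       write cache is really being flushed; the shortfall is write overhead. */
    return 0;
}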
On Nov 17, 2010, at 1:24 PM, Greg Smith wrote:
> Scott Carey wrote:
>> Did you recompile your test on the RHEL6 system?
>
> On both systems I showed, I checked out a fresh copy of the PostgreSQL
> 9.1 HEAD from the git repo, and compiled that on the server, to make
> sure I was pulling in the appropriate kernel headers. I wasn't aware of
> exactly how the kernel sync stuff was refactored though, thanks for the
> concise update on that.

Thanks! So this could be another bug in Linux. Not entirely surprising. Since fsync/fdatasync relative performance isn't similar to open_sync/open_datasync relative performance on this test, there is probably a bug that either hurts fsync, or one that is preventing open_sync from dealing with metadata properly. Luckily for the xlog, both of those can be avoided -- the real choice is fdatasync vs open_datasync. And both work in newer kernels or break in certain older ones.

> I can do similar tests on a RHEL5 system, but
> not on the same hardware. Can only make my laptop boot so many
> operating systems at a time usefully.

Yeah, I understand. I might throw this at a RHEL5 system if I get a chance, but I need one without a RAID card that is not in use. Hopefully it doesn't turn out that fdatasync is write-cache safe but open_sync/open_datasync isn't on that platform. It could impact the choice of a default value.

> --
> Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
> PostgreSQL Training, Services and Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
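To make the fdatasync vs open_datasync comparison concrete, the two syscall patterns look roughly like this -- a minimal sketch, not PostgreSQL's actual xlog code, and the file name is made up:

/* Sketch of the two WAL flush strategies under discussion; "walfile.test"
 * is a made-up name for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

static char buf[BLCKSZ];   /* pretend this is an 8k WAL page */

int main(void)
{
    /* wal_sync_method = fdatasync: ordinary write, then one data-only flush */
    int fd = open("walfile.test", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
    close(fd);

    /* wal_sync_method = open_datasync: O_DSYNC makes each write synchronous,
       so every write waits for the platter (or for the cache flush) */
    fd = open("walfile.test", O_WRONLY | O_DSYNC);
    if (fd < 0) { perror("open O_DSYNC"); return 1; }
    if (write(fd, buf, BLCKSZ) != BLCKSZ) { perror("write O_DSYNC"); return 1; }
    close(fd);

    unlink("walfile.test");
    return 0;
}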
On Tue, Nov 16, 2010 at 8:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>> Well, we're not going to increase the default to gigabytes, but we could
>>> very probably increase it by a factor of 10 or so without anyone
>>> squawking. It's been awhile since I heard of anyone trying to run PG in
>>> 4MB shmmax. How much would a change of that size help?
>
>> Last I checked, though, this comes out of the allocation available to
>> shared_buffers. And there definitely are several OSes (several linuxes,
>> OSX) still limited to 32MB by default.
>
> Sure, but the current default is a measly 64kB. We could increase that
> 10x for a relatively small percentage hit in the size of shared_buffers,
> if you suppose that there's 32MB available. The current default is set
> to still work if you've got only a couple of MB in SHMMAX.
>
> What we'd want is for initdb to adjust the setting as part of its
> probing to see what SHMMAX is set to.
>
> regards, tom lane

In all the performance tests that I have done, I generally get a good bang for the buck with wal_buffers set to 512kB in low-memory cases, and mostly I set it to 1MB, which is probably enough for most cases even with lots of memory. That half a megabyte won't make a drastic change to shared_buffers anyway (except in edge cases), but it relieves the stress on the WAL buffers quite a bit.

Regards,
Jignesh
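For what it's worth, the SHMMAX probing Tom describes doesn't have to be elaborate. A rough sketch (not initdb's actual code) is just to request progressively smaller SysV segments until one succeeds:

/* Rough sketch only: probe how large a SysV shared memory segment the
 * kernel will hand out by trying shmget() with decreasing sizes. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size;

    for (size = 1UL << 30; size >= (1UL << 20); size >>= 1) {   /* 1 GB down to 1 MB */
        int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (shmid >= 0) {
            shmctl(shmid, IPC_RMID, NULL);      /* release it immediately */
            printf("largest segment obtained: %zu MB\n", size >> 20);
            break;
        }
    }
    return 0;
}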