Thread: O_DSYNC broken on MacOS X?
Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG came
out 5 times faster.  The benchmark isn't very thoroughly described, but
it turns out not to matter.

http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html

Apparently, the reason we're faster is that wal_sync_method =
open_datasync, which is the default on MacOS X, doesn't actually work.

[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 1292.258304 (including connections establishing)
tps = 1292.281493 (excluding connections establishing)

Clearly we're not getting 1292 (or even 1292/4) fsyncs per second out of
whatever HD is in my laptop.  So what happens if we change to
fsync_writethrough, which is the equivalent of what InnoDB apparently
does out of the box?

[rhaas pgsql]$ pg_ctl reload
server signaled
LOG:  received SIGHUP, reloading configuration files
LOG:  parameter "wal_sync_method" changed to "fsync_writethrough"
[rhaas pgbench]$ pgbench -t 100000 -j 4 -c 4 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 27.845797 (including connections establishing)
tps = 27.845809 (excluding connections establishing)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
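The sanity check behind that "clearly we're not getting 1292 fsyncs per
second" remark can be reproduced with a throwaway write-plus-fsync loop
like the one below.  This is just a sketch (the helper name is invented,
and it is not any tool mentioned in this thread):

```python
import os
import tempfile
import time

def fsyncs_per_second(seconds=2.0):
    """Time a write+fsync loop on a scratch file.

    If fsync() really waits for the platter, a 7200rpm drive can
    complete at most roughly 120 of these cycles per second (about
    one per revolution).  Four-digit results mean some cache is
    absorbing the flush instead.
    """
    fd, path = tempfile.mkstemp()
    try:
        deadline = time.time() + seconds
        count = 0
        while time.time() < deadline:
            os.write(fd, b"commit record\n")
            os.fsync(fd)  # may only reach a volatile cache, per this thread
            count += 1
        return count / seconds
    finally:
        os.close(fd)
        os.remove(path)
```

Comparing the number this reports against the drive's rotational speed
is the same plausibility argument applied to the pgbench tps figures
above.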
On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
> came out 5 times faster.  The benchmark isn't very thoroughly
> described, but it turns out not to matter.
>
> http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html
>
> Apparently, the reason we're faster is that wal_sync_method =
> open_datasync, which is the default on MacOS X, doesn't actually work.

That might be true, but if you check the comments, Jayant replied to say:

@Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
does not support fsync_writethrough
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

So I don't think that invalidates his benchmark.  Something else might,
of course...

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Sep 30, 2010 at 4:09 PM, Dave Page <dpage@pgadmin.org> wrote:
> On Thu, Sep 30, 2010 at 8:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Jayant Kumar did some benchmarking of InnoDB vs. PostgreSQL and PG
>> came out 5 times faster.  The benchmark isn't very thoroughly
>> described, but it turns out not to matter.
>>
>> http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html
>>
>> Apparently, the reason we're faster is that wal_sync_method =
>> open_datasync, which is the default on MacOS X, doesn't actually work.
>
> That might be true, but if you check the comments, Jayant replied to say:
>
> @Andrew : I am running linux - ubuntu 10.04 - kernel 2.6.32-24. Linux
> does not support fsync_writethrough
> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
>
> So I don't think that invalidates his benchmark.  Something else might,
> of course...

Oh, I missed that.  Actually, I wasn't really so concerned with whether
his benchmark is correct.  I *am* concerned about being broken out of
the box on MacOS X.

(I also suspect problems with the benchmark.  It's hard to believe we're
5x faster than InnoDB on an apples-to-apples comparison on trivial
queries.  I'd believe 20% either way, but 5x is a lot.  But that's a
question for another day.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, I missed that.  Actually, I wasn't really so concerned with
> whether his benchmark is correct.  I *am* concerned about being broken
> out of the box on MacOS X.

Actually, the problem with OSX is that OSX is broken out of the box, at
least by that standard.  The system's normal configuration is that
fsync() does nothing, so it's hardly surprising that O_DSYNC is no
better.  You have to use wal_sync_method = fsync_writethrough to get
actual bits-to-the-platter behavior.

I'm not sure whether we should select fsync_writethrough as the default
on OSX.  We don't make an equivalent attempt to prevent OS or storage
malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
ahead of the game in that you *can* force writethrough without resorting
to arcane hacks with hdparm or some such.

We could definitely stand to be a bit more verbose about documenting
the platform-specific issues in this area.

			regards, tom lane
On Sep 30, 2010, at 5:02 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh, I missed that.  Actually, I wasn't really so concerned with
>> whether his benchmark is correct.  I *am* concerned about being broken
>> out of the box on MacOS X.
>
> Actually, the problem with OSX is that OSX is broken out of the box,
> at least by that standard.  The system's normal configuration is that
> fsync() does nothing,

That is not correct.  fsync and friends on Darwin synchronize I/O and
flush dirty kernel caches to the disk, which meets the specification and
is distinctly different from doing nothing.

"The fsync() function can be used by an application to indicate that all
data for the open file description named by fildes is to be transferred
to the storage device associated with the file described by fildes in an
implementation-dependent manner."
http://opengroup.org/onlinepubs/007908799/xsh/fsync.html

"On MacOS X, fsync() always has and always will flush all file data from
host memory to the drive on which the file resides."
http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html

> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.
>
> We could definitely stand to be a bit more verbose about documenting
> the platform-specific issues in this area.

Not only is this issue platform-specific, it is also bus-, controller-
and disk-specific.  Luckily, hardware that ships from Apple responds
properly to F_FULLFSYNC.  It's too bad there is no cross-platform way to
ask what level of hardware syncing is available.

Cheers,
M
On Thu, Sep 30, 2010 at 5:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Oh, I missed that.  Actually, I wasn't really so concerned with
>> whether his benchmark is correct.  I *am* concerned about being broken
>> out of the box on MacOS X.
>
> Actually, the problem with OSX is that OSX is broken out of the box,
> at least by that standard.  The system's normal configuration is that
> fsync() does nothing, so it's hardly surprising that O_DSYNC is no
> better.  You have to use wal_sync_method = fsync_writethrough to get
> actual bits-to-the-platter behavior.
>
> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.
>
> We could definitely stand to be a bit more verbose about documenting
> the platform-specific issues in this area.

I think some documentation is definitely in order, at the least.  It's
certainly astonishing that the default settings aren't crash-safe, and
I'd really like to understand how this shakes out on different
platforms.

Whether we should try to work around the problem is a trickier question,
but I'm somewhat inclined to say yes.  If we're trying to have the
system be performant in the default config, turning off
synchronous_commit would be saner than failing to make use of a system
call which we know absolutely for sure to be necessary to avoid the
possibility of database corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Sep 30, 2010 at 2:22 PM, A.M. <agentm@themactionfaction.com> wrote:
> That is not correct.  fsync and friends on Darwin synchronize I/O and
> flush dirty kernel caches to the disk, which meets the specification
> and is distinctly different from doing nothing.

How exactly is it different from doing nothing?  That is, in what
situation does doing this change the behaviour from the user's point of
view versus doing nothing?

People keep saying it "meets the specification", but it's a useless
interpretation of the specification.  And it requires a pretty malicious
reading of "transferred to the storage device" to read it as
"transferred from one set of ram buffers to another more closely
associated with the actual persistent storage".

> It's too bad there is no cross-platform way to ask what level of
> hardware syncing is available.

Why would the user want to ask this?  As far as the user is concerned
there are only two "levels": synced or not synced.  If it's not
guaranteed to persist after a power failure, it's not synced.  It
doesn't matter whether it's in kernel buffers, drive buffers, or
anywhere else -- they're all the same from the user's point of view --
they're non-persistent.

The other useful syncing behaviour would be to get write barriers.  But
that's a whole different api, not just a behaviour that can be attached
to fsync.

--
greg
Tom Lane wrote:
> I'm not sure whether we should select fsync_writethrough as the default
> on OSX.  We don't make an equivalent attempt to prevent OS or storage
> malfeasance on other Unixoid platforms --- in fact, I'd say OSX is a bit
> ahead of the game in that you *can* force writethrough without resorting
> to arcane hacks with hdparm or some such.

The same issue exists on Windows too, with the same workaround: it's
normally unsafe by default, and you have to switch to fsync_writethrough
to get commit safety along with write caching for regular writes.  It's
still ahead of most Unixes because at least the switch is that easy.

The main downside to switching the default on either OS X or Windows is
that developers using those platforms for test deployments will suffer a
big performance drop for data they don't really care about.  As those
two in particular are much more likely to be client development
platforms, that's a scary thing to consider.

As for the documentation, I wrote
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm a
while ago trying to address this better, but never got around to merging
some of those notes into the official docs.  At the time, such heavy
linking to external URLs was particularly frowned upon in the official
docs.  While that's eased up a bit, what I think I could do now is
convert that article into something that lives on the wiki instead, and
we could point the "Reliability" section of the docs there, as something
that can stay live as platform-specific changes happen.  I'm thinking of
the whole ext4 behavior change when I say that; stuff like that will
only be addressable in a live document, since changes were happening in
between major releases in that area.

More recently, I've written a few things covering this area in what
approaches excruciating detail, for this little document you might have
been spammed about in somebody's signature.
I'm going to ask the publisher to make that specific material the public
sample chapter for the book, in hopes of making it easier for people to
find a detailed discussion of this topic.

Regardless of whether that works out, I have to clean up the
documentation in this whole area for the checkpoint changes I'm working
on for 9.1 anyway.  That work tinkers with the timing of fsync calls,
and whether it helps or not is very filesystem-specific.  I'm basically
stuck with documenting exactly what happens for most possibilities in
order to provide a comprehensive manual section covering that.

I have a deadline for getting my work in progress organized to share
with the world now:
https://www.postgresqlconference.org/content/righting-your-writes
so after I get back from that conference in November I'll see what I can
do about the docs too.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
A.M. wrote:
> That is not correct.  fsync and friends on Darwin synchronize I/O and
> flush dirty kernel caches to the disk, which meets the specification
> and is distinctly different from doing nothing...
>
> "On MacOS X, fsync() always has and always will flush all file data
> from host memory to the drive on which the file resides."
> http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html

You didn't quote the next part of that, which says "fsync() is not
sufficient to guarantee that your data is on stable storage and on MacOS
X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to flush
all buffered data to stable storage."  That's exactly what turning on
fsync_writethrough does in PostgreSQL.  See
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as the
first post on this topic that ultimately led to that behavior being
implemented.

From the perspective of the database, whether or not the behavior is
standards compliant isn't the issue.  Whether pages make it to physical
disk when fsync is called, or when O_DSYNC writes are done on platforms
that support them, is the important part.  If the OS doesn't do that, it
is doing nothing useful from the perspective of the database's
expectations.  And that's not true on Darwin unless you specify
F_FULLFSYNC, which doesn't happen by default in PostgreSQL; it only does
that when you switch wal_sync_method=fsync_writethrough.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
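The shape of the call being discussed can be sketched in a few lines of
Python.  This is only an illustration of the F_FULLFSYNC fcntl quoted
above, with an invented helper name; PostgreSQL's real implementation of
fsync_writethrough is in C, and on platforms without F_FULLFSYNC this
sketch can only fall back to a plain fsync():

```python
import fcntl
import os

def flush_to_stable_storage(fd):
    """Ask for a flush all the way to stable storage.

    On Darwin, the F_FULLFSYNC fcntl tells the drive itself to flush
    its write cache; a plain fsync() there only guarantees the data
    has left host memory.  Python's fcntl module exposes F_FULLFSYNC
    only where the OS defines it, so elsewhere we fall back to
    fsync() and rely on the OS/drive stack.
    """
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:       # Darwin (and little else)
        fcntl.fcntl(fd, full_fsync)
    else:
        os.fsync(fd)                 # best available fallback
```

As the thread notes, even F_FULLFSYNC only helps when the drive honors
the flush request, which is why hardware behavior matters as much as the
syscall used.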
Greg Smith wrote:
> You didn't quote the next part of that, which says "fsync() is not
> sufficient to guarantee that your data is on stable storage and on
> MacOS X we provide a fcntl(), called F_FULLFSYNC, to ask the drive to
> flush all buffered data to stable storage."  That's exactly what
> turning on fsync_writethrough does in PostgreSQL.  See
> http://archives.postgresql.org/pgsql-hackers/2005-04/msg00390.php as
> the first post on this topic that ultimately led to that behavior
> being implemented.
>
> From the perspective of the database, whether or not the behavior is
> standards compliant isn't the issue.  Whether pages make it to physical
> disk when fsync is called, or when O_DSYNC writes are done on platforms
> that support them, is the important part.  If the OS doesn't do that,
> it is doing nothing useful from the perspective of the database's
> expectations.  And that's not true on Darwin unless you specify
> F_FULLFSYNC, which doesn't happen by default in PostgreSQL; it only
> does that when you switch wal_sync_method=fsync_writethrough.

Greg Smith also wrote:
> The main downside to switching the default on either OS X or Windows
> is that developers using those platforms for test deployments will
> suffer a big performance drop for data they don't really care about.
> As those two in particular are much more likely to be client
> development platforms, that's a scary thing to consider.

I think that, bottom line, Postgres should default to whatever the
safest and most reliable behavior is on each platform, because data
integrity is the most important thing: ensuring that a commit which
returns has actually written the data to disk.  If performance is worse,
then so what?  Code that does nothing has the best performance of all,
and is also generally useless.  Whenever there is a tradeoff to be made,
reliability for speed, users should have to explicitly choose the less
reliable option, which would demonstrate they know what they're doing.
Let the testers explicitly choose a faster and less reliable option for
the data they don't care about; by default, users who don't know better
should get the safest option, for the data they likely do care about.
That is a DBMS priority.

This matter reminds me of a discussion on the SQLite list years ago
about whether pragma synchronous=normal or synchronous=full should be
the default, and thankfully 'full' won.

-- Darren Duncan
Darren Duncan wrote:
> This matter reminds me of a discussion on the SQLite list years ago
> about whether pragma synchronous=normal or synchronous=full should be
> the default, and thankfully 'full' won.

Right now, when I see deployments in the field, serious database servers
set up by professional DBAs tend to use the right hardware and setup to
do the correct thing with PostgreSQL.  And trivial installs done for
testing purposes cheat, but many of those users don't really care,
because they know they are not running a real server and expect that
their desktop is neither reliable nor fast at database work.

The SQLite situation has a slightly different context, because the
places it's put into don't so regularly have a DBA involved at all in
situations where the data is important.  It's often just system software
sitting in the background that nobody is even aware of.  I also remember
when SQLite did come out of the background: when it was crucified for
being the cause of Firefox slowdowns actually linked to changed kernel
fsync behavior.  That's the sort of bad press this project really
doesn't need right now, over something that actually doesn't matter on
so many production database servers.

You may not be aware that there's already such a change floating around
out there.  PostgreSQL installs on Linux kernel 2.6.32 or later using
ext4 are dramatically slower out of the box than they used to be,
because the OS started doing the right thing by default; no change in
the database code.  I remain in mild terror that this news is going to
break in a bad way and push this community into damage control.  So far
I've only seen it reported on Phoronix, and that report included
testimony from a kernel developer explaining that the "regression" was
introduced deliberately, which softened the story.  The next such
publicized report may not be so informed.

Some of this works out to when to change things rather than what to
change.  PostgreSQL is at a somewhat critical spot right now.
If people grab a new version and performance sucks compared to earlier
ones, they're not going to think "oh, maybe they changed an option and
the new version is tuned better for safety".  They're going to say
"performance sucks on this database now" and give up on it.  Many evals
are done on hardware that isn't representative of a real database
server, and a change that only hurts those people, while not actually
impacting production-quality hardware, needs to be made carefully.  And
that's exactly what I think would happen here if this were just changed
all of a sudden.

I don't think anyone is seriously opposed to changing the defaults for
safety instead of performance.  The problem is that said change would
need to be *preceded* by a major update to the database documentation,
and perhaps even some code changes to issue warnings when you create a
cluster with what is now going to be a slow configuration.  We'd need to
make it really obvious to people who upgrade and notice that performance
tanks that it's because of a configuration change made for safety
reasons, one that they can undo for test deployments.

That particular area, giving people better advice about how to properly
tune a new install for its intended workload, is something that's been
making slow progress but still needs a lot of work.  I think if some
better tools come along there, so that most people are expected to
follow a path that involves a tuning tool, it will be much easier to
stomach the idea of changing the default, knowing that the user is
likely to be shown something that can undo the change.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
>> It's too bad there is no cross-platform way to ask what level of
>> hardware syncing is available.
>
> Why would the user want to ask this?  As far as the user is concerned
> there are only two "levels": synced or not synced.  If it's not
> guaranteed to persist after a power failure, it's not synced.  It
> doesn't matter whether it's in kernel buffers, drive buffers, or
> anywhere else -- they're all the same from the user's point of view --
> they're non-persistent.

Well, it's not really useful, but that's how it works "everywhere".  On
Linux, fsync carries the stuff from the kernel's RAM to the disk
controller's RAM, and then it depends on some hdparm magic or something
what happens next.
On Fri, Oct 1, 2010 at 3:50 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On tor, 2010-09-30 at 15:07 -0700, Greg Stark wrote:
>>> It's too bad there is no cross-platform way to ask what level of
>>> hardware syncing is available.
>>
>> Why would the user want to ask this?  As far as the user is concerned
>> there are only two "levels": synced or not synced.  If it's not
>> guaranteed to persist after a power failure, it's not synced.  It
>> doesn't matter whether it's in kernel buffers, drive buffers, or
>> anywhere else -- they're all the same from the user's point of view --
>> they're non-persistent.
>
> Well, it's not really useful, but that's how it works "everywhere".  On
> Linux, fsync carries the stuff from the kernel's RAM to the disk
> controller's RAM, and then it depends on some hdparm magic or something
> what happens next.

That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
ensure that the log is written before database records are altered, but
this can be subverted by disk drives that falsely report a successful
write to the kernel, when in fact they have only cached the data and not
yet stored it on the disk.  A power failure in such a situation might
lead to irrecoverable data corruption.  Administrators should try to
ensure that disks holding PostgreSQL's WAL log files do not make such
false reports."  This leaves open the question of how they should
attempt to do this; we should say what we know about that.

I also notice the following sentence in our documentation, which now
appears to me to be flat-out wrong: "The wal_sync_method parameter
determines how PostgreSQL will ask the kernel to force WAL updates out
to disk.  All the options should be the same in terms of reliability,
but it's quite platform-specific which one will be the fastest."
Obviously, we know now (if we didn't before) that this isn't the case,
per my OP.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
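For readers following along, the wal_sync_method settings under
discussion differ mainly in which syscalls end up doing the flushing.
The sketch below is illustrative only (the function name is invented and
this is not PostgreSQL source); it shows the rough mapping, assuming
POSIX-style flags:

```python
import os

# Rough mapping of wal_sync_method values to syscalls, per this thread:
#   open_datasync      -> open(..., O_DSYNC); every write implies a sync
#   fdatasync / fsync  -> plain open; explicit f(data)sync after writing
#   fsync_writethrough -> fsync plus a drive-cache flush (F_FULLFSYNC on
#                         Darwin, a write-through request on Windows)

def append_wal_record(path, record, method="fsync"):
    """Append one record, flushing in the style named by `method`."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if method == "open_datasync":
        flags |= getattr(os, "O_DSYNC", 0)   # flag absent on some OSes
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, record)
        if method == "fdatasync" and hasattr(os, "fdatasync"):
            os.fdatasync(fd)                 # metadata may be skipped
        elif method != "open_datasync":
            os.fsync(fd)                     # fsync / writethrough base
    finally:
        os.close(fd)
```

The thread's point is that none of these calls, by itself, guarantees
the data survived a power failure; that depends on whether the drive
honors the flush.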
On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
>> Well, it's not really useful, but that's how it works "everywhere".  On
>> Linux, fsync carries the stuff from the kernel's RAM to the disk
>> controller's RAM, and then it depends on some hdparm magic or something
>> what happens next.
>
> That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
> ensure that the log is written before database records are altered,
> but this can be subverted by disk drives that falsely report a
> successful write to the kernel, when in fact they have only cached the
> data and not yet stored it on the disk.  A power failure in such a
> situation might lead to irrecoverable data corruption.  Administrators
> should try to ensure that disks holding PostgreSQL's WAL log files do
> not make such false reports."  This leaves open the question of how
> they should attempt to do this; we should say what we know about that.

That is explained in section 29.1 "Reliability".

> I also notice the following sentence in our documentation, which now
> appears to me to be flat-out wrong: "The wal_sync_method parameter
> determines how PostgreSQL will ask the kernel to force WAL updates
> out to disk.  All the options should be the same in terms of
> reliability, but it's quite platform-specific which one will be the
> fastest."  Obviously, we know now (if we didn't before) that this
> isn't the case, per my OP.

Right.  It was true before fsync_writethrough was invented.
On Tue, Oct 5, 2010 at 8:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On mån, 2010-10-04 at 23:41 -0400, Robert Haas wrote:
>>> Well, it's not really useful, but that's how it works "everywhere".  On
>>> Linux, fsync carries the stuff from the kernel's RAM to the disk
>>> controller's RAM, and then it depends on some hdparm magic or something
>>> what happens next.
>>
>> That's a bit vaguer than I'd like.  TFD says "The aim of WAL is to
>> ensure that the log is written before database records are altered,
>> but this can be subverted by disk drives that falsely report a
>> successful write to the kernel, when in fact they have only cached the
>> data and not yet stored it on the disk.  A power failure in such a
>> situation might lead to irrecoverable data corruption.  Administrators
>> should try to ensure that disks holding PostgreSQL's WAL log files do
>> not make such false reports."  This leaves open the question of how
>> they should attempt to do this; we should say what we know about that.
>
> That is explained in section 29.1 "Reliability".
>
>> I also notice the following sentence in our documentation, which now
>> appears to me to be flat-out wrong: "The wal_sync_method parameter
>> determines how PostgreSQL will ask the kernel to force WAL updates
>> out to disk.  All the options should be the same in terms of
>> reliability, but it's quite platform-specific which one will be the
>> fastest."  Obviously, we know now (if we didn't before) that this
>> isn't the case, per my OP.
>
> Right.  It was true before fsync_writethrough was invented.

Proposed doc patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
> Proposed doc patch attached.

Looks accurate to me.  I like the additional linking to the Reliability
page you put in there too.  Heavily referencing that important page from
related areas is a good thing, particularly now that it's got a lot more
details than it used to.

--
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Robert Haas <robertmhaas@gmail.com> writes:
> Proposed doc patch attached.

"discusesed"?  Otherwise +1

			regards, tom lane
On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Proposed doc patch attached.
>
> "discusesed"?  Otherwise +1

Woops, thanks.  Committed with that change.  I back-patched it back to
8.3, which is as far as it applied with only minor conflicts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Proposed doc patch attached.
>
> Looks accurate to me.  I like the additional linking to the Reliability
> page you put in there too.  Heavily referencing that important page from
> related areas is a good thing, particularly now that it's got a lot more
> details than it used to.

Cool, thanks for the fast review.  I suspect there are more details that
could stand to be added to the WAL reliability page as well, but I don't
know what they are, so I can't add them.

I still have the feeling that we have not put quite a large enough red,
blinking light around this issue, but I don't have a concrete
suggestion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Oct 7, 2010, at 12:26 PM, Robert Haas wrote:
> On Thu, Oct 7, 2010 at 11:45 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Robert Haas wrote:
>>> Proposed doc patch attached.
>>
>> Looks accurate to me.  I like the additional linking to the Reliability
>> page you put in there too.  Heavily referencing that important page from
>> related areas is a good thing, particularly now that it's got a lot more
>> details than it used to.
>
> Cool, thanks for the fast review.  I suspect there are more details
> that could stand to be added to the WAL reliability page as well, but
> I don't know what they are, so I can't add them.
>
> I still have the feeling that we have not put quite a large enough
> red, blinking light around this issue, but I don't have a concrete
> suggestion.

I think the general problem is that there is no simple way to verify
that a PostgreSQL commit is pushing the bits to persistent storage.  It
would be helpful if there were a platform-specific, volume-specific tool
to deduce this.  Currently, there is no warning light that goes on when
commits are not persistent.

On Linux, a tool could check filesystem parameters, hdparm (if
relevant), and hard drive and controller specs (possibly against a
blacklist of known liars).

Perhaps a simpler tool could run a basic fsyncs-per-second test and
prompt the DBA to check that the numbers are within the realm of
possibility.

How else can a DBA today ensure that a commit is a commit?

Cheers,
M
A.M. wrote:
> Perhaps a simpler tool could run a basic fsyncs-per-second test and
> prompt the DBA to check that the numbers are within the realm of
> possibility.

This is what the test_fsync utility that already ships with the database
should be useful for.  The way Bruce changed it to report numbers in
commits/second for 9.0 makes it a lot easier to use for this purpose
than it used to be.  I think there are still some additional
improvements that could be made there, but it's a tricky test to run
accurately.  The current code is probably too detailed in some ways (it
delivers a lot of output not relevant to this use-case) and not detailed
enough in others.  Providing a summary that understands things like
fsync_writethrough on platforms that support it was the first
refactoring I had in mind.  If that thing came back and said
"fsync_writethrough works for you, so don't even consider the other
possibilities if you want reliability, even though they are faster",
that would be nice, for example.

> How else can a DBA today ensure that a commit is a commit?

You can't ensure a commit is a commit without running a pull-the-plug
test.  And I think the best way to do that accurately is using a "remote
witness" server focused on finding this particular problem, looking for
glitches, rather than using the database as your test program and seeing
if you happen to hit corruption or not.

The documentation for 9.0 now suggests running the diskchecker.pl
program for this exact purpose.  I've seen enough reports of it finding
even subtle cache-loss situations to believe that encouraging heavier
use of it would be enough to make people much safer than they typically
are today.  What we probably need to do next is provide people with an
exact walkthrough of setting up and using the program, showing what a
passing result looks like and what a failing one looks like.
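The "remote witness" idea boils down to a small invariant, sketched
below with invented names (this has nothing to do with diskchecker.pl's
actual protocol): a second machine records each block id only after the
test machine's fsync() returns, so after the plug is pulled, everything
the witness recorded must still be on disk.

```python
class CommitWitness:
    """Runs on a second machine, so its memory survives the test
    box's power pull.  An id is recorded only after the writer
    reports that fsync() returned successfully for that block."""

    def __init__(self):
        self.acked = set()

    def record_ack(self, block_id):
        """Called over the network once a block's fsync() returns."""
        self.acked.add(block_id)

    def lost_commits(self, survived_on_disk):
        """After restart: any acked-but-missing block proves that
        something in the write path lied about durability."""
        return sorted(self.acked - set(survived_on_disk))
```

For example, if the witness acked blocks 0 through 4 but only 0 through
2 survive the power pull, lost_commits reports [3, 4]: two commits the
hardware claimed were durable but were not.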
--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
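[Editor's note: for readers wondering what fsync_writethrough actually adds on the platform that started this thread: on macOS it issues the F_FULLFSYNC fcntl after the ordinary fsync, which asks the drive itself to empty its write cache. A minimal sketch of that flush step follows; the helper name is invented for illustration.]

```python
import fcntl
import os

def flush_through_cache(fd):
    """Flush fd to stable storage. On macOS, plain fsync() may stop at the
    drive's volatile write cache; F_FULLFSYNC asks the drive to flush too."""
    os.fsync(fd)
    if hasattr(fcntl, "F_FULLFSYNC"):       # only defined on Darwin
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
```

On other platforms the F_FULLFSYNC branch is simply skipped, which mirrors why the option list and its reliability trade-offs are so platform-specific.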
Robert Haas wrote:
> On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> Proposed doc patch attached.
> >
> > "discusesed"?  Otherwise +1
>
> Woops, thanks. Committed with that change. I back-patched it back to
> 8.3, which is as far as it applied with only minor conflicts.

I have applied the attached patch which mentions tools/fsync for testing
fsync method performance, and clarified the new paragraph about sync
methods.

I am glad to see we are beefing up this area of the docs.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 87d182c..f178835 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1569,13 +1569,13 @@ SET ENABLE_SEQSCAN TO OFF;
       </itemizedlist>
       <para>
        Not all of these choices are available on all platforms.
-       The default is the first method in the above list that is supported
-       by the platform. The default is not necessarily best; it may be
-       necessary to change this setting, or other aspects of your system
-       configuration, in order to create a crash-safe configuration, as
-       discussed in <xref linkend="wal-reliability">, or to achieve best
-       performance.
        The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
+       The default is the first method in the above list that is supported
+       by the platform. The default is not necessarily ideal; it might be
+       necessary to change this setting or other aspects of your system
+       configuration in order to create a crash-safe configuration or
+       achieve optimal performance.
+       These aspects are discussed in <xref linkend="wal-reliability">.
+       The utility <filename>src/tools/fsync</> in the PostgreSQL source
+       tree can do performance testing of various fsync methods.
        This parameter can only be set in the <filename>postgresql.conf</>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 5678d6e..7b50bdd 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -530,11 +530,13 @@
   <para>
    The <xref linkend="guc-wal-sync-method"> parameter determines how
    <productname>PostgreSQL</productname> will ask the kernel to force
-   <acronym>WAL</acronym> updates out to disk.
-   With the exception of <literal>fsync_writethrough</>, which can sometimes
-   force a flush of the disk cache even when other options do not do so,
-   all the options should be the same in terms of reliability.
-   However, it's quite platform-specific which one will be the fastest.
+   <acronym>WAL</acronym> updates out to disk.
+   All the options should be the same in terms of reliability, with
+   the exception of <literal>fsync_writethrough</>, which can sometimes
+   force a flush of the disk cache even when other options do not do so.
+   However, it's quite platform-specific which one will be the fastest;
+   you can test option speeds using the utility <filename>src/tools/fsync</>
+   in the PostgreSQL source tree.
    Note that this parameter is irrelevant if <varname>fsync</varname>
    has been turned off.
   </para>
Greg Smith wrote:
> A.M. wrote:
> > Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
> >
>
> This is what the test_fsync utility that already ships with the database
> should be useful for. The way Bruce changed it to report numbers in
> commits/second for 9.0 makes it a lot easier to use for this purpose
> than it used to be. I think there's still some additional improvements
> that could be made there, but it's a tricky test to run accurately.

The test_fsync program was designed to test various things, like whether
several open-sync writes are better than two writes and an fsync, and
whether you can fsync data written on a different file descriptor. It is
really a catch-all test right now, not one specific to choosing sync
methods.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:

> Greg Smith wrote:
>> A.M. wrote:
>>> Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
>>>
>>
>> This is what the test_fsync utility that already ships with the database
>> should be useful for. The way Bruce changed it to report numbers in
>> commits/second for 9.0 makes it a lot easier to use for this purpose
>> than it used to be. I think there's still some additional improvements
>> that could be made there, but it's a tricky test to run accurately.
>
> The test_fsync was designed to test various things like whether several
> open-sync writes are better than two write and an fsync, and whether you
> can fsync data written on a different file descriptor. It is really a
> catch-all test right now, not one specific for choosing sync methods.

I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM - your fsync method may be wrong."

Currently, the test tool is not even compiled by default.

Thoughts?

Cheers,
M
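[Editor's note: the "rates between a 5400 RPM and a 7200 RPM drive" notes proposed above amount to comparing the measured rate against rotational-latency ceilings. One hypothetical way to phrase that check; the thresholds and wording are illustrative, not from any committed tool:]

```python
def rate_verdict(fsyncs_per_sec):
    """Classify a measured commit rate against rotational ceilings: a
    drive spinning at R RPM can complete at most R/60 cache-flushed
    commits per second, since each one costs at least a rotation."""
    ceilings = [
        (5400 / 60.0, "slower than a 5400 RPM SATA drive"),
        (7200 / 60.0, "between a 5400 RPM and a 7200 RPM SATA drive"),
        (15000 / 60.0, "between a 7200 RPM and a 15k RPM drive"),
    ]
    for ceiling, verdict in ceilings:
        if fsyncs_per_sec <= ceiling:
            return verdict
    return "as high as RAM - your fsync method may be wrong"

print(rate_verdict(100))   # -> between a 5400 RPM and a 7200 RPM SATA drive
print(rate_verdict(5000))  # -> as high as RAM - your fsync method may be wrong
```

The point of the final branch is exactly the situation that started this thread: a laptop drive reporting 1292 commits/sec can only mean the sync method is not flushing.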
A.M. wrote:
>
> On Oct 19, 2010, at 11:22 AM, Bruce Momjian wrote:
>
> > Greg Smith wrote:
> >> A.M. wrote:
> >>> Perhaps a simpler tool could run a basic fsyncs-per-second test and prompt the DBA to check that the numbers are within the realm of possibility.
> >>>
> >>
> >> This is what the test_fsync utility that already ships with the database
> >> should be useful for. The way Bruce changed it to report numbers in
> >> commits/second for 9.0 makes it a lot easier to use for this purpose
> >> than it used to be. I think there's still some additional improvements
> >> that could be made there, but it's a tricky test to run accurately.
> >
> > The test_fsync was designed to test various things like whether several
> > open-sync writes are better than two write and an fsync, and whether you
> > can fsync data written on a different file descriptor. It is really a
> > catch-all test right now, not one specific for choosing sync methods.
>
> I am working on simplifying the test_fsync tool and making it a contrib function which can be run by the superuser based on the configured fsync method. That way, the list can ask a user to run it to report fsyncs-per-second for suspiciousness. The goal is to make it more accessible. I was also thinking about adding some notes along the lines of "Your drive fsync speed rates between a 5400 RPM SATA drive and a 7200 RPM SATA drive." or "Your drive fsync speed rates as high as RAM - your fsync method may be wrong."
>
> Currently, the test tool is not even compiled by default.
>
> Thoughts?

Agreed. Let me know if you have any questions.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Tue, Oct 19, 2010 at 11:06 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Thu, Oct 7, 2010 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> Proposed doc patch attached.
>> >
>> > "discusesed"?  Otherwise +1
>>
>> Woops, thanks. Committed with that change. I back-patched it back to
>> 8.3, which is as far as it applied with only minor conflicts.
>
> I have applied the attached patch which mentions tools/fsync for testing
> fsync method performance, and clarified the new paragraph about sync
> methods.
>
> I am glad to see we are beefing up this area of the docs.

It seems we're still missing some relevant details, because hdparm
doesn't seem to work on SCSI devices. Is sdparm the right utility in
that case? Does anyone know what the correct incantations look like?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
> It seems we're still missing some relevant details, because hdparm
> doesn't seem to work on SCSI devices. Is sdparm the right utility in
> that case? Does anyone know what the correct incantations look like?

Search the sdparm man page for "Writeback Cache". It has detailed
examples.
On Mon, Oct 25, 2010 at 12:51 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On mån, 2010-10-25 at 09:33 -0400, Robert Haas wrote:
>> It seems we're still missing some relevant details, because hdparm
>> doesn't seem to work on SCSI devices. Is sdparm the right utility in
>> that case? Does anyone know what the correct incantations look like?
>
> Search the sdparm man page for "Writeback Cache". It has detailed
> examples.

Here's a patch. This adds a few more details about sdparm and makes it
clear that it applies to both FreeBSD and Linux. But, perhaps more
significantly, it rearranges what is currently a fairly long paragraph
into a bulleted list, which I think is more readable.

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
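[Editor's note: for the archives, the incantations in question look roughly like the following. The device name is a placeholder, and option spellings should be checked against your system's man pages; this is a sketch of the Linux commands, not text from the committed patch.]

```shell
DEV=/dev/sda   # placeholder: substitute the device holding your WAL

# ATA/SATA drives (Linux): query and disable the volatile write cache
hdparm -W "$DEV"      # with no argument, -W reports the current setting
hdparm -W0 "$DEV"     # turn drive write caching off

# SCSI/SAS drives: hdparm generally does not apply; use sdparm instead
sdparm --get=WCE "$DEV"     # WCE=1 means write-back caching is enabled
sdparm --clear=WCE "$DEV"   # clear the WCE bit to disable write-back caching
```

On FreeBSD the equivalent knob is reached through camcontrol rather than these tools, which is presumably part of what the patch's FreeBSD/Linux clarification covers.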