Thread: Final(?) proposal for wal_sync_method changes
After reviewing the two ongoing threads about fixing the wal_sync_method fiasco, I think there is general agreement on these two points: 1. open_datasync shouldn't be the default choice 2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC What I suggest we do about the latter is to invent two new wal_sync_method values,open_datasync_directopen_sync_direct which are defined only on platforms that define O_DIRECT (and O_DSYNC or O_SYNC respectively). That puts it in the hands of the DBA whether we try to use O_DIRECT or not. We'll still keep the hard-wired optimization of disabling O_DIRECT when archiving or walreceiver are active. Dropping open_datasync as the first-choice default is something we have to back-patch, but I'm less sure about it being a good idea to back-patch the rearrangement of O_DIRECT management. Somebody who'd explicitly specified open_sync or open_datasync as wal_sync_method would find its behavior changing under him, which might be bad. Comments? regards, tom lane
On Tuesday 07 December 2010 17:24:14 Tom Lane wrote: > After reviewing the two ongoing threads about fixing the wal_sync_method > fiasco, I think there is general agreement on these two points: > > 1. open_datasync shouldn't be the default choice > 2. O_DIRECT shouldn't be forcibly bundled in with O_DSYNC/O_SYNC > > What I suggest we do about the latter is to invent two new > wal_sync_method values, > open_datasync_direct > open_sync_direct > which are defined only on platforms that define O_DIRECT (and O_DSYNC > or O_SYNC respectively). That puts it in the hands of the DBA whether > we try to use O_DIRECT or not. We'll still keep the hard-wired > optimization of disabling O_DIRECT when archiving or walreceiver are > active. > > Dropping open_datasync as the first-choice default is something we have > to back-patch, but I'm less sure about it being a good idea to > back-patch the rearrangement of O_DIRECT management. Somebody who'd > explicitly specified open_sync or open_datasync as wal_sync_method > would find its behavior changing under him, which might be bad. I vote for changing the order but not doing the O_DIRECT stuff on the backbranches. As I am not seeing myself or clients of mine ever using any O_*SYNC variant I am not strongly opionated about what to do there. But I guess adding those two variants is not really much work. Thanks, Andres
On 12/07/2010 08:24 AM, Tom Lane wrote: > Dropping open_datasync as the first-choice default is something we have > to back-patch, but I'm less sure about it being a good idea to > back-patch the rearrangement of O_DIRECT management. Somebody who'd > explicitly specified open_sync or open_datasync as wal_sync_method > would find its behavior changing under him, which might be bad. I agree for the backpatch that we should just swap to fdatasync as default, and should not attempt to add the extra options. In addition to the concerns above, adding new GUCS values in an update release is something we should only do if required for a critical security or data-loss bug. And this is neither. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > I agree for the backpatch that we should just swap to fdatasync as > default, and should not attempt to add the extra options. I noticed while updating the documentation for this that the current documentation is a flat-out lie. It claims that the preference order for wal_sync_method is open_datasync fdatasync fsync_writethrough fsync open_sync ie you get the first-listed method that is supported on a given platform. But this is not so: actually, fsync_writethrough will be selected as default ONLY on Windows. There are other platforms where the option exists, OS X being the one I have at hand. The misstatement is masked on OS X because it also has open_datasync and fdatasync. But since we are about to delete open_datasync from the list, it's possible there are platforms where it will be exposed. Oh, and just to add insult to injury, the above is what config.sgml says, but postgresql.conf.sample says something different. So I'm wondering whether we should correct the code to match the docs, or vice versa. The former would just be a matter of saying#elif defined(HAVE_FSYNC_WRITETHROUGH)#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH in place of#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH To do the latter we'd have to say something like "The default is fdatasync if it exists, else fsync, except on Windows where it is fsync_writethrough". Changing the code would result in a sudden, massive performance change if there are any platforms for which fsync_writethrough exists but not fdatasync. But I'm not sure if there are any. Another point here is that it's not clear why we're selecting a known-to-be-insecure default on OS X (where in fact all methods except fsync_writethrough fail to push data to disk). We've been around on that before, of course, and maybe now is not the time to change it. Thoughts? regards, tom lane
On 12/7/10 2:28 PM, Tom Lane wrote: > Another point here is that it's not clear why we're selecting a > known-to-be-insecure default on OS X (where in fact all methods except > fsync_writethrough fail to push data to disk). We've been around on > that before, of course, and maybe now is not the time to change it. Because nobody sane uses OSX on the server? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > On 12/7/10 2:28 PM, Tom Lane wrote: >> Another point here is that it's not clear why we're selecting a >> known-to-be-insecure default on OS X (where in fact all methods except >> fsync_writethrough fail to push data to disk). We've been around on >> that before, of course, and maybe now is not the time to change it. > Because nobody sane uses OSX on the server? Some of us would make the same remark about Windows. But we go out of our way to provide a safe default on that platform anyhow. regards, tom lane
On 12/07/2010 06:11 PM, Tom Lane wrote: > Josh Berkus<josh@agliodbs.com> writes: >> On 12/7/10 2:28 PM, Tom Lane wrote: >>> Another point here is that it's not clear why we're selecting a >>> known-to-be-insecure default on OS X (where in fact all methods except >>> fsync_writethrough fail to push data to disk). We've been around on >>> that before, of course, and maybe now is not the time to change it. >> Because nobody sane uses OSX on the server? > Some of us would make the same remark about Windows. But we go out of > our way to provide a safe default on that platform anyhow. > > In practice, though, Windows is used a lot on servers and OSX isn't. That means we are probably going to have lots less push on this sort of thing from the OSX community, which is not to say that we shouldn't try to be just as safe on OSX as we try to be everywhere else. cheers andrew
On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > On 12/7/10 2:28 PM, Tom Lane wrote: > >> Another point here is that it's not clear why we're selecting a > >> known-to-be-insecure default on OS X (where in fact all methods except > >> fsync_writethrough fail to push data to disk). We've been around on > >> that before, of course, and maybe now is not the time to change it. > > > Because nobody sane uses OSX on the server? > > Some of us would make the same remark about Windows. But we go out of > our way to provide a safe default on that platform anyhow. Not to mention the assertion that people don't use OSX on a server is patently false. They don't use it on big servers, but it is very popular for the SMB (big time capital S). JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
I wrote: > Some of us would make the same remark about Windows. But we go out of > our way to provide a safe default on that platform anyhow. Oh, wait: that's not the case at all. As of recent releases, we support open_datasync on Windows, and that's the default despite being unsafe. You have to go and choose some nondefault drive setting to make it safe: http://developer.postgresql.org/pgdocs/postgres/wal-reliability.html So if we drop open_datasync from the preference list then Windows users *will* see a sudden huge change in the default behavior. Because open_datasync does now exist on Windows, this code in xlogdefs.h:#elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY)#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH is actually dead code at the moment --- it can never be reached on any platform. I am unclear as to the reason why there is a test for HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also leftover from a previous vision of how this all works? Or does an fsync() call actually fail on Windows? I am now tempted to suggest that HAVE_FSYNC_WRITETHROUGH_ONLY should go away altogether. The documented and implemented behavior ought to be that the default is "fdatasync if it exists, else fsync", full stop, on every platform. On both Windows and OS X, you would need to switch to fsync_writethrough or change OS-level options to get safe behavior; which is the same as it is today. regards, tom lane
"Joshua D. Drake" <jd@commandprompt.com> writes: > On Tue, 2010-12-07 at 18:11 -0500, Tom Lane wrote: >> Josh Berkus <josh@agliodbs.com> writes: >>> Because nobody sane uses OSX on the server? >> Some of us would make the same remark about Windows. But we go out of >> our way to provide a safe default on that platform anyhow. > Not to mention the assertion that people don't use OSX on a server is > patently false. They don't use it on big servers, but it is very popular > for the SMB (big time capital S). Actually the previous discussions about this are coming back to me now. With the current code, we don't actually guarantee safe flush behavior by default on ANY of the common consumer platforms. In the Linux case we can't, because we can't monkey with hdparm settings. (I think the same is true on BSDen ... anybody know?) On Windows and OS X we default to open_datasync despite its not being safe on either platform. We previously debated switching those to fsync_writethrough which would make them safe by default, but decided not to, partly on grounds of the inevitable ZOMG ITS SLOW backlash and partly on grounds of keeping cross-platform consistency. I don't think it's a good idea to reopen the fsync_writethrough debate right now, certainly not for something we're contemplating back-patching. I think what we'd better do is ensure that that is *not* selected as the default, on either Windows or OS X. So we need to get rid of the HAVE_FSYNC_WRITETHROUGH_ONLY hack. regards, tom lane
> I am unclear as to the reason why there is a test for > HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also > leftover from a previous vision of how this all works? Or does an > fsync() call actually fail on Windows? No, fsync responds fine. It just don't actually sync to disk. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote: > Because nobody sane uses OSX on the server? The XServe running 10.5 server and 9.0.1 at the other end of the office takes your remark personally. :) -- -- Christophe Pettus xof@thebuild.com
Josh Berkus <josh@agliodbs.com> writes: >> I am unclear as to the reason why there is a test for >> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also >> leftover from a previous vision of how this all works? Or does an >> fsync() call actually fail on Windows? > No, fsync responds fine. It just don't actually sync to disk. Right, which is also an accurate description of its behavior on OS X, as well as Linux (if you didn't change hdparm settings). So the real question here is what's the point of treating Windows differently. regards, tom lane
> Right, which is also an accurate description of its behavior on OS X, > as well as Linux (if you didn't change hdparm settings). So the real > question here is what's the point of treating Windows differently. So, sounds like we should continue treating fsync_writethrough the same as we have been, and maybe add a doc patch covering some of the above? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> Right, which is also an accurate description of its behavior on OS X, >> as well as Linux (if you didn't change hdparm settings). So the real >> question here is what's the point of treating Windows differently. > So, sounds like we should continue treating fsync_writethrough the same > as we have been, and maybe add a doc patch covering some of the above? Yeah, this patch is shaping up to be about five lines of code change and a hundred of docs ... regards, tom lane
xof@thebuild.com (Christophe Pettus) writes: > On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote: >> Because nobody sane uses OSX on the server? > > The XServe running 10.5 server and 9.0.1 at the other end of the > office takes your remark personally. :) I'd heard that Apple had cancelled XServe. [Poking back at that...] Yep, they won't be carrying anything for sale that's particularly rack-mountable after next January. Not precisely "dead," but definitely moving on smelling funny... -- (format nil "~S@~S" "cbbrowne" "gmail.com") http://www3.sympatico.ca/cbbrowne/emacs.html Photons have mass? I didn't know they were catholic!
On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote: > xof@thebuild.com (Christophe Pettus) writes: > > On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote: > >> Because nobody sane uses OSX on the server? > > > > The XServe running 10.5 server and 9.0.1 at the other end of the > > office takes your remark personally. :) > > I'd heard that Apple had cancelled XServe. [Poking back at that...] > > Yep, they won't be carrying anything for sale that's particularly > rack-mountable after next January. Not precisely "dead," but definitely > moving on smelling funny... A bit off topic but Apple is actually marketing OSX Server on the mini (with RAID 1). Which honestly for 95% of the businesses out there would make a very nice, reasonably performant database server for say.... PostBooks, or Drupal. JD > -- > (format nil "~S@~S" "cbbrowne" "gmail.com") > http://www3.sympatico.ca/cbbrowne/emacs.html > Photons have mass? I didn't know they were catholic! > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On 12/07/2010 07:32 PM, Joshua D. Drake wrote: > On Tue, 2010-12-07 at 19:00 -0500, Chris Browne wrote: >> xof@thebuild.com (Christophe Pettus) writes: >>> On Dec 7, 2010, at 2:43 PM, Josh Berkus wrote: >>>> Because nobody sane uses OSX on the server? >>> The XServe running 10.5 server and 9.0.1 at the other end of the >>> office takes your remark personally. :) >> I'd heard that Apple had cancelled XServe. [Poking back at that...] >> >> Yep, they won't be carrying anything for sale that's particularly >> rack-mountable after next January. Not precisely "dead," but definitely >> moving on smelling funny... > A bit off topic but Apple is actually marketing OSX Server on the mini > (with RAID 1). Which honestly for 95% of the businesses out there would > make a very nice, reasonably performant database server for say.... > PostBooks, or Drupal. > Given the constant overheating I have had on my mini, I wouldn't use it to host anything much ;-) But I guess YMMV. cheers andrerw
Josh Berkus <josh@agliodbs.com> writes: >>> I am unclear as to the reason why there is a test for >>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also >>> leftover from a previous vision of how this all works? Or does an >>> fsync() call actually fail on Windows? >> No, fsync responds fine. It just don't actually sync to disk. Sigh ... The closer I look at the Windows code path here, the more of an inconsistent, badly documented spaghetti-heap it appears to be. So far as a quick Google search unearths, there is no fsync() primitive on Windows. What we have actually got is this gem in port/win32.h: /** Even though we don't support 'fsync' as a wal_sync_method,* we do fsync() a few other places where _commit() isjust fine.*/ #define fsync(fd) _commit(fd) So actually, there is no difference between selecting fsync and fsync_writethrough on Windows, this comment and the SGML documentation to the contrary. Both settings result in invoking _commit() and presumably are safe. One wonders why we bothered to invent a separate fsync_writethrough setting on Windows. What this means is that switching to a simple preference order "fdatasync, then fsync" will result in choosing fsync on Windows (since it hasn't got fdatasync), meaning _commit, meaning Windows users see a behavioral change after all. I'm afraid that if we don't want a major behavioral change, there's no option except having a Windows-specific rule for the choice of default. It'll have to be "fdatasync, then fsync, except on Windows where open_datasync is the default". Grumble. But it's not like Windows hasn't got a hundred other special cases already. Would someone verify via pgbench or similar test (*not* test_fsync) that on Windows, wal_sync_method = fsync or fsync_writethrough perform the same (ie tps ~= disk rotation rate) while open_datasync is too fast to be real? I'm losing confidence that I've found all the spaghetti ends here, and I don't have a Windows setup to try it myself. regards, tom lane
Tom Lane wrote: > So actually, there is no difference between selecting fsync and > fsync_writethrough on Windows, this comment and the SGML documentation > to the contrary. Both settings result in invoking _commit() and > presumably are safe. One wonders why we bothered to invent a separate > fsync_writethrough setting on Windows. > Quite; I documented some the details about mapping to _commit() long ago at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm but forgot to suggest fixing the mistakes in the docs afterwards (Windows is not exactly my favorite platform). http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php explains some of the history I think you're looking for here. > Would someone verify via pgbench or similar test (*not* test_fsync) that > on Windows, wal_sync_method = fsync or fsync_writethrough perform the > same (ie tps ~= disk rotation rate) while open_datasync is too fast to > be real? I'm losing confidence that I've found all the spaghetti ends > here, and I don't have a Windows setup to try it myself. > I can look into this tomorrow. The laptop I posted Ubuntu/RHEL6 test_fsync numbers from before also boots into Vista, so I can compare all those platforms on the same hardware. I just need to be aware of the slightly different sequential speeds on each partition of the drive. As far as your major battle plan goes, I think what we should do is find the simplest possible patch that just fixes the newer Linux kernel problem, preferrably without changing any other platform, then commit that to HEAD and appropriate backports. Then the larger O_DIRECT remapping can proceed forward after that, along with cleanup to the writethrough choices and unifying test_fsync against the server. I wrote a patch that shuffled around a lot of this code last night, but the first thing I coded was junk because of some mistaken assumptions. Have been coming to same realizations about how messy this really is you have. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >>>> I am unclear as to the reason why there is a test for >>>> HAVE_FSYNC_WRITETHROUGH_ONLY in pg_fsync(). Perhaps that is also >>>> leftover from a previous vision of how this all works? Or does an >>>> fsync() call actually fail on Windows? > >>> No, fsync responds fine. It just don't actually sync to disk. First of all a warning - I'm writing this on way too little sleep :-) Blame pgday.eu... > Sigh ... The closer I look at the Windows code path here, the more of an > inconsistent, badly documented spaghetti-heap it appears to be. So far > as a quick Google search unearths, there is no fsync() primitive on > Windows. What we have actually got is this gem in port/win32.h: Correct. > /* > * Even though we don't support 'fsync' as a wal_sync_method, > * we do fsync() a few other places where _commit() is just fine. > */ > #define fsync(fd) _commit(fd) > > So actually, there is no difference between selecting fsync and > fsync_writethrough on Windows, this comment and the SGML documentation > to the contrary. Both settings result in invoking _commit() and > presumably are safe. One wonders why we bothered to invent a separate > fsync_writethrough setting on Windows. IIRC, using _commit(fd) *is* fsync_writethrough. That's what we shipped with. It even writes through the cache on a RAID controller that has BBU'ed write-cache. We had to implement the *other* options in order to "lower" the safety (it doesn't actually lower the safety *if* you have a BBU, which is a very good use-case for those options) > What this means is that switching to a simple preference order > "fdatasync, then fsync" will result in choosing fsync on Windows (since > it hasn't got fdatasync), meaning _commit, meaning Windows users see > a behavioral change after all. _commit() != fsync() I think this is the discussion and subsequent changes: http://archives.postgresql.org/pgsql-patches/2005-03/msg00230.php > Would someone verify via pgbench or similar test (*not* test_fsync) that > on Windows, wal_sync_method = fsync or fsync_writethrough perform the > same (ie tps ~= disk rotation rate) while open_datasync is too fast to > be real? I'm losing confidence that I've found all the spaghetti ends > here, and I don't have a Windows setup to try it myself. Please note that if you're re-verifying this, verify both on crappy disk *and* on a proper BBU'ed RAID-controller. The reason for this originally was that we performed about the same in those two, wihch made no sense... Merlin, IIRC you did a lot of the testing around this - do you recall any more details? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [ win32.h says ] >> #define fsync(fd) _commit(fd) >> What this means is that switching to a simple preference order >> "fdatasync, then fsync" will result in choosing fsync on Windows (since >> it hasn't got fdatasync), meaning _commit, meaning Windows users see >> a behavioral change after all. > _commit() != fsync() Um, the macro quoted above makes them the same, no? One of us is confused. regards, tom lane
Given my concerns around exactly what is going on in the Windows code, I'm now afraid to mess with an all-platforms change to fdatasync as the preferred default; if we do that it should probably just be in HEAD not the back branches. So I've come around to the idea that Marti's proposal of a PLATFORM_DEFAULT_SYNC_METHOD symbol is the best way. (One reason for adopting that rather than some other way is that it seems quite likely we'll end up needing it for Windows.) I haven't touched the documentation yet, but attached is a proposed code patch against HEAD. This forces the default to fdatasync on Linux, and makes some cosmetic cleanups around the HAVE_FSYNC_WRITETHROUGH_ONLY confusion. regards, tom lane diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index fd5ec78..4f7dc39 100644 *** a/src/backend/storage/file/fd.c --- b/src/backend/storage/file/fd.c *************** static bool looks_like_temp_rel_name(con *** 260,271 **** int pg_fsync(int fd) { ! #ifndef HAVE_FSYNC_WRITETHROUGH_ONLY ! if (sync_method != SYNC_METHOD_FSYNC_WRITETHROUGH) ! return pg_fsync_no_writethrough(fd); else #endif ! return pg_fsync_writethrough(fd); } --- 260,272 ---- int pg_fsync(int fd) { ! /* #if is to skip the sync_method test if there's no need for it */ ! #if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC) ! if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ! return pg_fsync_writethrough(fd); else #endif ! return pg_fsync_no_writethrough(fd); } diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h index 18b214e..072096d 100644 *** a/src/include/access/xlogdefs.h --- b/src/include/access/xlogdefs.h *************** typedef uint32 TimeLineID; *** 123,134 **** #endif #endif ! #if defined(OPEN_DATASYNC_FLAG) #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC #elif defined(HAVE_FDATASYNC) #define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC - #elif defined(HAVE_FSYNC_WRITETHROUGH_ONLY) - #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC_WRITETHROUGH #else #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC #endif --- 123,134 ---- #endif #endif ! #if defined(PLATFORM_DEFAULT_SYNC_METHOD) ! #define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD ! #elif defined(OPEN_DATASYNC_FLAG) #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC #elif defined(HAVE_FDATASYNC) #define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC #else #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC #endif diff --git a/src/include/port/linux.h b/src/include/port/linux.h index b9498b2..bcaa42d 100644 *** a/src/include/port/linux.h --- b/src/include/port/linux.h *************** *** 12,14 **** --- 12,22 ---- * to have a kernel version test here. */ #define HAVE_LINUX_EIDRM_BUG + + /* + * Set the default wal_sync_method to fdatasync. With recent Linux versions, + * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't + * perform better and (b) causes outright failures on ext4 data=journal + * filesystems, because those don't support O_DIRECT. + */ + #define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC diff --git a/src/include/port/win32.h b/src/include/port/win32.h index 3417ab5..9c2ae4d 100644 *** a/src/include/port/win32.h --- b/src/include/port/win32.h *************** *** 34,47 **** /* Must be here to avoid conflicting with prototype in windows.h */ #define mkdir(a,b) mkdir(a) - #define HAVE_FSYNC_WRITETHROUGH - #define HAVE_FSYNC_WRITETHROUGH_ONLY #define ftruncate(a,b) chsize(a,b) /* ! * Even though we don't support 'fsync' as a wal_sync_method, ! * we do fsync() a few other places where _commit() is just fine. */ ! #define fsync(fd) _commit(fd) #define USES_WINSOCK --- 34,51 ---- /* Must be here to avoid conflicting with prototype in windows.h */ #define mkdir(a,b) mkdir(a) #define ftruncate(a,b) chsize(a,b) + + /* Windows doesn't have fsync() as such, use _commit() */ + #define fsync(fd) _commit(fd) + /* ! * For historical reasons, we allow setting wal_sync_method to ! * fsync_writethrough on Windows, even though it's really identical to fsync ! * (both code paths wind up at _commit()). */ ! #define HAVE_FSYNC_WRITETHROUGH ! #define FSYNC_WRITETHROUGH_IS_FSYNC #define USES_WINSOCK
Since any Windows refactoring has been postponed for now (I'll get back to performance checks on that platform later), during my testing time this week instead I did a round of pre-release review of the change Tom has now committed. All looks good to me, including the docs changes. I confirmed that: -Ubuntu system with an older kernel still has the same wal_sync_method (fdatasync) and performance after pulling the update -RHEL6 system changes as planned from using open_datasync to fdatasync once I updated to a HEAD after the commit On the RHEL6 system, I also checked the commit rate using pgbench with the attached INSERT only script, rather than relying on test_fsync. This is 7200 RPM drive, so theoretical max of 120 commits/second, on ext4; this is the same test setup I described in more detail back in http://archives.postgresql.org/message-id/4CE2EBF8.4040602@2ndquadrant.com $ psql -c "show wal_sync_method" wal_sync_method ----------------- fdatasync (1 row) $ pgbench -i -s 10 pgbench [gsmith@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6733 tps = 112.208795 (including connections establishing) tps = 112.216904 (excluding connections establishing) And then manually switched over to test performance of the troublesome old default: [gsmith@meddle ~]$ psql -c "show wal_sync_method" wal_sync_method ----------------- open_datasync [gsmith@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6672 tps = 111.185802 (including connections establishing) tps = 111.195089 (excluding connections establishing) This is interesting, because test_fsync consistently reported a rate of about half this when using open_datasync instead of the equal performance I'm getting from the database. I'll see if I can reproduce that further, but it's no reason to be concerned about the change that's been made I think. Just more evidence that test_fsync has quirks left to be sorted out. But that's not backbranch material, it should be part of 9.1 only refactoring, already in progress via the patch Josh submitted. There's a bit of time left to get that done. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books \set nbranches :scale \set ntellers 10 * :scale \set naccounts 100000 * :scale \setrandom aid 1 :naccounts \setrandom bid 1 :nbranches \setrandom tid 1 :ntellers \setrandom delta -5000 5000 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
Greg, > This is interesting, because test_fsync consistently reported a rate of > about half this when using open_datasync instead of the equal > performance I'm getting from the database. I'll see if I can reproduce > that further, but it's no reason to be concerned about the change that's > been made I think. Just more evidence that test_fsync has quirks left > to be sorted out. But that's not backbranch material, it should be part > of 9.1 only refactoring, already in progress via the patch Josh > submitted. There's a bit of time left to get that done. Did you rerun test_sync with O_DIRECT entabled, using my patch? The figures you had from test_fsync earlier were without O_DIRECT. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus wrote: > Did you rerun test_sync with O_DIRECT entabled, using my patch? The > figures you had from test_fsync earlier were without O_DIRECT. > No--I was just focused on testing the changes that had already been committed. The use of O_DIRECT in the server but not test_fsync could very well be the reason for the difference; don't know yet. I'm trying to get through this CF before I start getting distracted by newer patches, I'll get to yours soon I hope. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, Dec 8, 2010 at 15:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Wed, Dec 8, 2010 at 02:07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> [ win32.h says ] >>> #define fsync(fd) _commit(fd) > >>> What this means is that switching to a simple preference order >>> "fdatasync, then fsync" will result in choosing fsync on Windows (since >>> it hasn't got fdatasync), meaning _commit, meaning Windows users see >>> a behavioral change after all. > >> _commit() != fsync() > > Um, the macro quoted above makes them the same, no? One of us > is confused. Uh, yeah. Sorry, that was the unclear:ness from being too preoccupied with the conference.. Pretty sure I'm the confused one. . _commit() is definitely the same as fsync() on the API level. And this correspond to fsync_writethrough, not fsync, when you talk about the wal_sync_method parameter. It will always sync through the write cache, even if it's hardware BBU'ed cache. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Josh Berkus wrote: > Greg, > > > This is interesting, because test_fsync consistently reported a rate of > > about half this when using open_datasync instead of the equal > > performance I'm getting from the database. I'll see if I can reproduce > > that further, but it's no reason to be concerned about the change that's > > been made I think. Just more evidence that test_fsync has quirks left > > to be sorted out. But that's not backbranch material, it should be part > > of 9.1 only refactoring, already in progress via the patch Josh > > submitted. There's a bit of time left to get that done. > > Did you rerun test_sync with O_DIRECT entabled, using my patch? The > figures you had from test_fsync earlier were without O_DIRECT. I have modified test_fsync with the attached, applied patch to report cases where we are testing without O_DIRECT when only O_DIRECT would be used by the server, and cases where O_DIRECT fails because of the file system type. Josh Berkus wanted the first case kept in case we decide to offer non-direct-io options on machines that support direct i/o. The new messages are: * This non-direct I/O option is not used by Postgres. ** This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. You can see the first one below in my output from Ubuntu: $ ./test_fsync Ops-per-test = 2000 Simple non-sync'ed write: 8k write 58.175 ops/sec Compare file sync methods using one write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a 8k write, fdatasync 68.425 ops/sec 8k write, fsync 63.932 ops/sec fsync_writethrough n/a open_sync 8k write* 73.785 ops/sec open_sync 8k direct I/O write 82.929 ops/sec * This non-direct I/O option is not used by Postgres. Compare file sync methods using two writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a 8k write, 8k write, fdatasync 42.728 ops/sec 8k write, 8k write, fsync 43.625 ops/sec fsync_writethrough n/a 2 open_sync 8k writes* 37.150 ops/sec 2 open_sync 8k direct I/O writes 43.722 ops/sec * This non-direct I/O option is not used by Postgres. Compare open_sync with different sizes: (This is designed to compare the cost of one large sync'ed write and two smaller sync'ed writes.) open_sync 16k write 46.428 ops/sec 2 open_sync 8k writes 38.703 ops/sec Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) 8k write, fsync, close 65.744 ops/sec 8k write, close, fsync 63.077 ops/sec I believe test_fsync now matches the backend code. If we decide to change things, it can be adjusted. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c index cd2b1f2..562da66 100644 *** /tmp/8wSNdd_test_fsync.c Sat Jan 15 18:42:58 2011 --- src/tools/fsync/test_fsync.c Sat Jan 15 18:38:57 2011 *************** void *** 163,169 **** test_sync(int writes_per_op) { int tmpfile, ops, writes; ! if (writes_per_op == 1) printf("\nCompare file sync methods using one write:\n"); else --- 163,170 ---- test_sync(int writes_per_op) { int tmpfile, ops, writes; ! bool fs_warning = false; ! if (writes_per_op == 1) printf("\nCompare file sync methods using one write:\n"); else *************** test_sync(int writes_per_op) *** 176,184 **** */ #ifdef OPEN_DATASYNC_FLAG if (writes_per_op == 1) ! printf(LABEL_FORMAT, "open_datasync 8k write"); else ! printf(LABEL_FORMAT, "2 open_datasync 8k writes"); fflush(stdout); if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1) --- 177,193 ---- */ #ifdef OPEN_DATASYNC_FLAG if (writes_per_op == 1) ! printf(LABEL_FORMAT, "open_datasync 8k write" ! #if PG_O_DIRECT != 0 ! "*" ! #endif ! ); else ! printf(LABEL_FORMAT, "2 open_datasync 8k writes" ! #if PG_O_DIRECT != 0 ! "*" ! #endif ! ); fflush(stdout); if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1) *************** test_sync(int writes_per_op) *** 201,207 **** */ #if PG_O_DIRECT != 0 if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1) ! printf(NA_FORMAT, "o_direct", "n/a on this filesystem\n"); else { if (writes_per_op == 1) --- 210,219 ---- */ #if PG_O_DIRECT != 0 if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1) ! { ! printf(NA_FORMAT, "o_direct", "n/a**\n"); ! fs_warning = true; ! } else { if (writes_per_op == 1) *************** test_sync(int writes_per_op) *** 321,329 **** */ #ifdef OPEN_SYNC_FLAG if (writes_per_op == 1) ! printf(LABEL_FORMAT, "open_sync 8k write"); else ! printf(LABEL_FORMAT, "2 open_sync 8k writes"); fflush(stdout); if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG, 0)) == -1) --- 333,349 ---- */ #ifdef OPEN_SYNC_FLAG if (writes_per_op == 1) ! printf(LABEL_FORMAT, "open_sync 8k write" ! #if PG_O_DIRECT != 0 ! "*" ! #endif ! ); else ! printf(LABEL_FORMAT, "2 open_sync 8k writes" ! #if PG_O_DIRECT != 0 ! "*" ! #endif ! ); fflush(stdout); if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG, 0)) == -1) *************** test_sync(int writes_per_op) *** 352,358 **** fflush(stdout); if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1) ! printf(NA_FORMAT, "o_direct", "n/a on this filesystem\n"); else { gettimeofday(&start_t, NULL); --- 372,381 ---- fflush(stdout); if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1) ! { ! printf(NA_FORMAT, "o_direct", "n/a**\n"); ! fs_warning = true; ! } else { gettimeofday(&start_t, NULL); *************** test_sync(int writes_per_op) *** 375,380 **** --- 398,414 ---- #else printf(NA_FORMAT, "open_sync", "n/a\n"); #endif + + #if defined(OPEN_DATASYNC_FLAG) || defined(OPEN_SYNC_FLAG) + if (PG_O_DIRECT != 0) + printf("* This non-direct I/O option is not used by Postgres.\n"); + #endif + + if (fs_warning) + { + printf("** This file system and its mount options do not support direct\n"); + printf("I/O, e.g. ext4 in journaled mode.\n"); + } } void *************** test_open_syncs(void) *** 389,394 **** --- 423,430 ---- printf("(This is designed to compare the cost of one large\n"); printf("sync'ed write and two smaller sync'ed writes.)\n"); + /* XXX no PG_O_DIRECT */ + /* * Test open_sync with different size files */