Thread: initdb and fsync
It looks like initdb doesn't fsync all the files it creates, e.g. the PG_VERSION file. While it's unlikely that it would cause any real data loss, it can be inconvenient in some testing scenarios involving VMs. Thoughts? Would a patch to add a few fsync calls to initdb be accepted? Is a platform-independent fsync be available at initdb time? Regards, Jeff Davis
On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote: > It looks like initdb doesn't fsync all the files it creates, e.g. the > PG_VERSION file. > > While it's unlikely that it would cause any real data loss, it can be > inconvenient in some testing scenarios involving VMs. > > Thoughts? Would a patch to add a few fsync calls to initdb be accepted? +1. If I'm piloting "strace -f" right, initdb currently issues *no* syncs. We'd probably, then, want a way to re-disable the fsyncs for hacker benefit. > Is a platform-independent fsync be available at initdb time? Not sure. Thanks, nm
On 01/27/2012 11:52 PM, Noah Misch wrote: >> Is a platform-independent fsync be available at initdb time? > Not sure. > It's a macro on Windows that calls _commit(fd), so it should be portable enough. I'm curious what problem we're actually solving here, though. I've run the buildfarm countless thousands of times on different VMs, and five of my seven current animals run in VMs, and I don't think I've ever seen a failure ascribable to inadequately synced files from initdb. cheers andrew
On Sat, 2012-01-28 at 10:31 -0500, Andrew Dunstan wrote: > I'm curious what problem we're actually solving here, though. I've run > the buildfarm countless thousands of times on different VMs, and five of > my seven current animals run in VMs, and I don't think I've ever seen a > failure ascribable to inadequately synced files from initdb. I believe I have seen such a problem second hand in a situation where the VM was known to be killed harshly (not sure if you do that regularly). It's a little difficult for me to _prove_ that this would have solved the problem, and I think it was only observed once (though I could probably reproduce it if I tried). The symptom was a log message indicating that PG_VERSION was missing or corrupt on a system that was previously started and online (albeit briefly for a test). Regards,Jeff Davis
Andrew Dunstan <andrew@dunslane.net> writes: > I'm curious what problem we're actually solving here, though. I've run > the buildfarm countless thousands of times on different VMs, and five of > my seven current animals run in VMs, and I don't think I've ever seen a > failure ascribable to inadequately synced files from initdb. Yeah. Personally I would be sad if initdb got noticeably slower, and I've never seen or heard of a failure that this would fix. I wonder whether it wouldn't be sufficient to call sync(2) at the end, anyway, rather than cluttering the entire initdb codebase with fsync calls. regards, tom lane
On Sat, Jan 28, 2012 at 7:31 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 01/27/2012 11:52 PM, Noah Misch wrote: >>> >>> Is a platform-independent fsync be available at initdb time? >> >> Not sure. >> > > It's a macro on Windows that calls _commit(fd), so it should be portable > enough. > > I'm curious what problem we're actually solving here, though. I've run the > buildfarm countless thousands of times on different VMs, and five of my > seven current animals run in VMs, and I don't think I've ever seen a failure > ascribable to inadequately synced files from initdb. I wouldn't expect you to ever see that problem on the buildfarm. If the OS gets thunked during the middle of a regression test, when it comes back up the code is not going to try to pick up where it left off, it is just going to blow away the entire install and start over from scratch. So any crash-recoverability problems will never be detected. I would guess the original poster is doing a more stringent kind of test. Cheers, Jeff
On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > I'm curious what problem we're actually solving here, though. I've run > > the buildfarm countless thousands of times on different VMs, and five of > > my seven current animals run in VMs, and I don't think I've ever seen a > > failure ascribable to inadequately synced files from initdb. > > Yeah. Personally I would be sad if initdb got noticeably slower, and > I've never seen or heard of a failure that this would fix. > > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. I can always add a "sync" call to the test, also (rather than modifying initdb). Or, it could be an initdb option, which might be a good compromise. I don't have a strong opinion here. As machines get more memory and filesystems get more lazy, I wonder if it will be a more frequent occurrence, however. On the other hand, if filesystems are more lazy, that also increases the cost associated with extra "sync" calls. I think there would be a surprise factor if sometimes initdb had a long pause at the end and caused 10GB of data to be written out. Regards,Jeff Davis
On Sat, Jan 28, 2012 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I'm curious what problem we're actually solving here, though. I've run >> the buildfarm countless thousands of times on different VMs, and five of >> my seven current animals run in VMs, and I don't think I've ever seen a >> failure ascribable to inadequately synced files from initdb. > > Yeah. Personally I would be sad if initdb got noticeably slower, and > I've never seen or heard of a failure that this would fix. > > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. > > regards, tom lane Does sync(2) behave like sync(8) and flush the entire system cache, or does it just flush the files opened by the process which called it? The man page didn't enlighten me on that. sometimes sync(8) never returns. It doesn't just flush what was dirty at the time it was called, it actually keeps running until there are simultaneously no dirty pages anywhere in the system. On busy systems, this condition might never be reached. And it can't be interrupted, not even with kill -9. Cheers, Jeff
On 01/28/2012 01:46 PM, Jeff Davis wrote: > On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: >> Andrew Dunstan<andrew@dunslane.net> writes: >>> I'm curious what problem we're actually solving here, though. I've run >>> the buildfarm countless thousands of times on different VMs, and five of >>> my seven current animals run in VMs, and I don't think I've ever seen a >>> failure ascribable to inadequately synced files from initdb. >> Yeah. Personally I would be sad if initdb got noticeably slower, and >> I've never seen or heard of a failure that this would fix. >> >> I wonder whether it wouldn't be sufficient to call sync(2) at the end, >> anyway, rather than cluttering the entire initdb codebase with fsync >> calls. > I can always add a "sync" call to the test, also (rather than modifying > initdb). Or, it could be an initdb option, which might be a good > compromise. I don't have a strong opinion here. > > As machines get more memory and filesystems get more lazy, I wonder if > it will be a more frequent occurrence, however. On the other hand, if > filesystems are more lazy, that also increases the cost associated with > extra "sync" calls. I think there would be a surprise factor if > sometimes initdb had a long pause at the end and caused 10GB of data to > be written out. > -1 for that. A very quick look at initdb.c suggests to me that there are only two places where we'd need to put fsync(), right before we call fclose() in write_file() and write_version_file(). If we're going to do anything that seems to be the least painful and most portable way to go. cheers andrew
* Tom Lane: > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. We tried to do this in the Debian package mananger. It works as expected on Linux systems, but it can cause a lot of data to hit the disk, and there are kernel versions where sync(2) never completes if the system is rather busy. initdb is much faster with 9.1 than with 8.4. It's so fast that you can use it in test suites, instead of reusing an existing cluster. I think this is a rather desirable property.
On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: > Yeah. Personally I would be sad if initdb got noticeably slower, and > I've never seen or heard of a failure that this would fix. I worked up a patch, and it looks like it does about 6 file fsync's and a 7th for the PGDATA directory. That degrades the time from about 1.1s to 1.4s on my workstation. pg_test_fsync says this about my workstation (one 8kB write): open_datasync 117.495 ops/sec fdatasync 117.949 ops/sec fsync 25.530 ops/sec fsync_writethrough n/a open_sync 24.666 ops/sec 25 ops/sec means about 40ms per fsync, times 7 is about 280ms, so that seems like about the right degradation for fsync. I tried with fdatasync as well to see if it improved things, and I wasn't able to realize any difference (not sure exactly why). So, is it worth it? Should we make it an option that can be specified? > I wonder whether it wouldn't be sufficient to call sync(2) at the end, > anyway, rather than cluttering the entire initdb codebase with fsync > calls. It looks like there are only a few places, so I don't think clutter is really the problem with the simple patch at this point (unless there is a portability problem with just calling fsync). Regards, Jeff Davis
Attachment
On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote: > On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: > > Yeah. Personally I would be sad if initdb got noticeably slower, and > > I've never seen or heard of a failure that this would fix. > > I worked up a patch, and it looks like it does about 6 file fsync's and > a 7th for the PGDATA directory. That degrades the time from about 1.1s > to 1.4s on my workstation. > So, is it worth it? Should we make it an option that can be specified? If we add fsync calls to the initdb process, they should cover the entire data directory tree. This patch syncs files that initdb.c writes, but we ought to also sync files that bootstrap-mode backends had written. An optimization like the pg_flush_data() call in copy_file() may reduce the speed penalty. initdb should do these syncs by default and offer an option to disable them.
On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote: > If we add fsync calls to the initdb process, they should cover the entire data > directory tree. This patch syncs files that initdb.c writes, but we ought to > also sync files that bootstrap-mode backends had written. It doesn't make sense for initdb to take responsibility to sync files created by the backend. If there are important files that the backend creates, it should be the backend's responsibility to fsync them (and their parent directory, if needed). And if they are unimportant to the backend, then there is no reason for initdb to fsync them. > An optimization > like the pg_flush_data() call in copy_file() may reduce the speed penalty. That worked pretty well. It took it down about 100ms on my machine, which closes the gap significantly. > initdb should do these syncs by default and offer an option to disable them. For test frameworks that run initdb often, that makes sense. But for developers, it doesn't make sense to spend 0.5s typing an option that saves you 0.3s. So, we'd need some more convenient way to choose the no-fsync option, like an environment variable that developers can set. Or maybe developers don't care about 0.3s? Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote: >> If we add fsync calls to the initdb process, they should cover the entire data >> directory tree. This patch syncs files that initdb.c writes, but we ought to >> also sync files that bootstrap-mode backends had written. > It doesn't make sense for initdb to take responsibility to sync files > created by the backend. No, but the more interesting question is whether bootstrap mode troubles to fsync its writes. I'm not too sure about that either way ... regards, tom lane
On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote: > On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote: > > If we add fsync calls to the initdb process, they should cover the entire data > > directory tree. This patch syncs files that initdb.c writes, but we ought to > > also sync files that bootstrap-mode backends had written. > > It doesn't make sense for initdb to take responsibility to sync files > created by the backend. If there are important files that the backend > creates, it should be the backend's responsibility to fsync them (and > their parent directory, if needed). And if they are unimportant to the > backend, then there is no reason for initdb to fsync them. I meant primarily to illustrate the need to be comprehensive, not comment on which executable should fsync a particular file. Bootstrap-mode backends do not sync anything during an initdb run on my system. With your patch, we'll fsync a small handful of files and leave nearly everything else vulnerable. That being said, having each backend fsync its own writes will mean syncing certain files several times within a single initdb run. If the penalty from that proves high enough, we may do well to instead have initdb.c sync everything just once. > > initdb should do these syncs by default and offer an option to disable them. > > For test frameworks that run initdb often, that makes sense. > > But for developers, it doesn't make sense to spend 0.5s typing an option > that saves you 0.3s. So, we'd need some more convenient way to choose > the no-fsync option, like an environment variable that developers can > set. Or maybe developers don't care about 0.3s? Developers have shell aliases/functions/scripts and command history. I wouldn't object to having an environment variable control it, but I would not personally find that more convenient than a command-line switch. Thanks, nm
On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote: > > initdb should do these syncs by default and offer an option to > disable them. > > For test frameworks that run initdb often, that makes sense. > > But for developers, it doesn't make sense to spend 0.5s typing an > option > that saves you 0.3s. So, we'd need some more convenient way to choose > the no-fsync option, like an environment variable that developers can > set. Or maybe developers don't care about 0.3s? > You can use https://launchpad.net/libeatmydata for those cases.
On Fri, Feb 10, 2012 at 3:57 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote: >> > initdb should do these syncs by default and offer an option to >> disable them. >> >> For test frameworks that run initdb often, that makes sense. >> >> But for developers, it doesn't make sense to spend 0.5s typing an >> option >> that saves you 0.3s. So, we'd need some more convenient way to choose >> the no-fsync option, like an environment variable that developers can >> set. Or maybe developers don't care about 0.3s? >> > You can use https://launchpad.net/libeatmydata for those cases. That's hilarious. But, a command-line option seems more convenient. It also seems entirely sufficient. The comments above suggest that it would take too long to type the option, but any PG developers who are worried about the speed difference surely know how to create shell aliases, shell functions, shell scripts, ... and if anyone's really concerned about it, we can provide a short form for the option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote: > I meant primarily to illustrate the need to be comprehensive, not comment on > which executable should fsync a particular file. Bootstrap-mode backends do > not sync anything during an initdb run on my system. With your patch, we'll > fsync a small handful of files and leave nearly everything else vulnerable. Thank you for pointing that out. With that in mind, I have a new version of the patch which just recursively fsync's the whole directory (attached). I also introduced a new option --nosync (-N) to disable this behavior. The bad news is that it introduces a lot more time to initdb -- it goes from about 1s to about 10s on my machine. I tried fsync'ing the whole directory twice just to make sure that the second was a no-op, and indeed it didn't make much difference (still about 10s). That's pretty inefficient considering that initdb -D data --nosync && sync only takes a couple seconds. Clearly batching the operation is a big help. Maybe there's some more efficient way to fsync a lot of files/directories? Or maybe I can mitigate it by avoiding files that don't really need to be fsync'd? Regards, Jeff Davis
Attachment
On Tuesday, March 13, 2012 04:49:40 AM Jeff Davis wrote: > On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote: > > I meant primarily to illustrate the need to be comprehensive, not comment > > on which executable should fsync a particular file. Bootstrap-mode > > backends do not sync anything during an initdb run on my system. With > > your patch, we'll fsync a small handful of files and leave nearly > > everything else vulnerable. > > Thank you for pointing that out. With that in mind, I have a new version > of the patch which just recursively fsync's the whole directory > (attached). > > I also introduced a new option --nosync (-N) to disable this behavior. > > The bad news is that it introduces a lot more time to initdb -- it goes > from about 1s to about 10s on my machine. I tried fsync'ing the whole > directory twice just to make sure that the second was a no-op, and > indeed it didn't make much difference (still about 10s). I suggest you try making it two loops: for recursively everything in dir: posix_fadvise(fd, POSIX_FADV_DONTNEED); for recursively everything in dir: fsync(fd); In my experience that gives way much better performance due to the fact that it does not force its own metadata/journal commit/transaction for every file but can be batched. copydir() does the same since some releases... Obviously its not that nice to use _DONTNEED but I havent found something that works equally well. You could try sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) in the first loop but my experience with that hasn't been that good. Andres
On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote: > for recursively everything in dir: > posix_fadvise(fd, POSIX_FADV_DONTNEED); > > for recursively everything in dir: > fsync(fd); Wow, that made a huge difference! no sync: ~ 1.0s sync: ~10.0s fadvise+sync: ~ 1.3s Patch attached. Now I feel much better about it. Most people will either have fadvise, a write cache (rightly or wrongly), or actually need the sync. Those that have none of those can use -N. Regards, Jeff Davis
Attachment
On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote: > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote: > > for recursively everything in dir: > > posix_fadvise(fd, POSIX_FADV_DONTNEED); > > > > for recursively everything in dir: > > fsync(fd); > > Wow, that made a huge difference! > > no sync: ~ 1.0s > sync: ~10.0s > fadvise+sync: ~ 1.3s > > Patch attached. > > Now I feel much better about it. Most people will either have fadvise, a > write cache (rightly or wrongly), or actually need the sync. Those that > have none of those can use -N. Well, while the positive effect of this are rather large it also has the bad effect of pushing the whole new database out of the cache. Which is not so nice if you want to run tests on it seconds later. How are the results with sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE)? Andres
On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote: > On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote: > > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote: > > > for recursively everything in dir: > > > posix_fadvise(fd, POSIX_FADV_DONTNEED); > > > > > > for recursively everything in dir: > > > fsync(fd); > > > > Wow, that made a huge difference! > > > > no sync: ~ 1.0s > > sync: ~10.0s > > fadvise+sync: ~ 1.3s I take that back. There was something wrong with my test -- fadvise helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I hoped. > Well, while the positive effect of this are rather large it also has the bad > effect of pushing the whole new database out of the cache. Which is not so nice > if you want to run tests on it seconds later. I was unable to see a regression when it comes to starting it up after the fadvise+fsync. My test just started the server, created a table, then stopped the server. It was actually a hair faster with the directory that had been fadvise'd and then fsync'd, but I assume that was noise. Regardless, this doesn't look like an issue. > How are the results with sync_file_range(fd, 0, 0, > SYNC_FILE_RANGE_WRITE)? That is much faster than using fadvise. It goes down to ~2s. Unfortunately, that's non-portable. Any other ideas? 6.5s a little on the annoying side (and causes some disconcerting sounds to come from my disk), especially when we _know_ it can be done in 2s. Anyway, updated patch attached. Regards, Jeff Davis
Attachment
On Thursday, March 15, 2012 07:38:38 AM Jeff Davis wrote: > On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote: > > On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote: > > > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote: > > > > for recursively everything in dir: > > > > posix_fadvise(fd, POSIX_FADV_DONTNEED); > > > > > > > > for recursively everything in dir: > > > > fsync(fd); > > > > > > Wow, that made a huge difference! > > > > > > no sync: ~ 1.0s > > > sync: ~10.0s > > > fadvise+sync: ~ 1.3s > > I take that back. There was something wrong with my test -- fadvise > helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I > hoped. Thats surprising. I wouldn't expect such a big difference between fadvise + sync_file_range. Rather strange. > > Well, while the positive effect of this are rather large it also has the > > bad effect of pushing the whole new database out of the cache. Which is > > not so nice if you want to run tests on it seconds later. > > I was unable to see a regression when it comes to starting it up after > the fadvise+fsync. My test just started the server, created a table, > then stopped the server. It was actually a hair faster with the > directory that had been fadvise'd and then fsync'd, but I assume that > was noise. Regardless, this doesn't look like an issue. Hm. Ok. > > How are the results with sync_file_range(fd, 0, 0, > > SYNC_FILE_RANGE_WRITE)? > That is much faster than using fadvise. It goes down to ~2s. > Unfortunately, that's non-portable. Any other ideas? 6.5s a little on > the annoying side (and causes some disconcerting sounds to come from my > disk), especially when we _know_ it can be done in 2s. Its not like posix_fadvise is actually portable. So I personally don't see a problem with that, but... Andres
On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote: >> > How are the results with sync_file_range(fd, 0, 0, >> > SYNC_FILE_RANGE_WRITE)? >> That is much faster than using fadvise. It goes down to ~2s. > >> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on >> the annoying side (and causes some disconcerting sounds to come from my >> disk), especially when we _know_ it can be done in 2s. > Its not like posix_fadvise is actually portable. So I personally don't see a > problem with that, but... Well, sync_file_range only works on Linux, and will probably never work anywhere else. posix_fadvise() at least has a chance of being supported on other platforms, being a standard and all that. Though I see that my Mac has neither. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Friday, March 16, 2012 04:47:06 PM Robert Haas wrote: > On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote: > >> > How are the results with sync_file_range(fd, 0, 0, > >> > SYNC_FILE_RANGE_WRITE)? > >> > >> That is much faster than using fadvise. It goes down to ~2s. > >> > >> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on > >> the annoying side (and causes some disconcerting sounds to come from my > >> disk), especially when we _know_ it can be done in 2s. > > > > Its not like posix_fadvise is actually portable. So I personally don't > > see a problem with that, but... > > Well, sync_file_range only works on Linux, and will probably never > work anywhere else. posix_fadvise() at least has a chance of being > supported on other platforms, being a standard and all that. Though I > see that my Mac has neither. :-( I would suggest adding a wrapper function like: pg_hint_writeback_flush(fd, off, len); which then is something like #if HAVE_SYNC_FILE_RANGE sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE); #elseif HAVE_POSIX_FADVISE posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED); #else #endif To my knowledge posix_fadvise currently is only supported on linux btw... Andres
On Fri, 2012-03-16 at 11:25 +0100, Andres Freund wrote: > > I take that back. There was something wrong with my test -- fadvise > > helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I > > hoped. > Thats surprising. I wouldn't expect such a big difference between fadvise + > sync_file_range. Rather strange. I discussed this with my colleague who knows linux internals, and he pointed me directly at the problem. fadvise and sync_file_range in this case are both trying to put the data in the io scheduler queue (still in the kernel, not on the device). The difference is that fadvise doesn't wait, and sync_file_range does (keep in mind, this is waiting to get in a queue to go to the device, not waiting for the device to write it or even receive it). He indicated that 4096 is a normal number that one might use for the queue size. But on my workstation at home (ubuntu 11.10), the queue is only 128. I bumped it up to 256 and now fadvise is just as fast! This won't be a problem on production systems, but that doesn't help us a lot. People setting up a production system don't care about 6.5 seconds of set-up time anyway. Casual users and developers do (the latter problem can be solved with the --nosync switch, but the former problem is the one we're discussing). So, it looks like fadvise is the "right" thing to do, but I expect we'll get some widely varying results from actual users. Then again, maybe casual users don't care much about ~10s for initdb anyway? It's a fairly rare operation for everyone except developers. Regards,Jeff Davis
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">Levendredi 16 mars 2012 16:51:04, Andres Freund a écrit :<p style=" margin-top:0px; margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> On Friday,March 16, 2012 04:47:06 PM Robert Haas wrote:<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > On Fri, Mar 16, 2012 at 6:25 AM, AndresFreund <andres@anarazel.de> wrote:<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> > How are the results withsync_file_range(fd, 0, 0,<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > >> > SYNC_FILE_RANGE_WRITE)?<p style=" margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>> >> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > >> That is much faster than using fadvise. It goesdown to ~2s.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;-qt-user-state:0;">> > >> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> Unfortunately, that's non-portable.Any other ideas? 6.5s a little on<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> the annoying side (and causessome disconcerting sounds to come from<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> my disk), especially when we_know_ it can be done in 2s.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > > <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > > Its not like posix_fadviseis actually portable. So I personally don't<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > > see a problem with that, but...<pstyle=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;-qt-user-state:0;">> > <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > Well, sync_file_range only works on Linux,and will probably never<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > work anywhere else. posix_fadvise() at least has a chanceof being<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;-qt-user-state:0;">> > supported on other platforms, being a standard and all that. Though I<p style="margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>> see that my Mac has neither. :-(<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> <p style=" margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>I would suggest adding a wrapper function like:<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> pg_hint_writeback_flush(fd,off, len);<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> which then is something like<pstyle=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;-qt-user-state:0;">> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> #if HAVE_SYNC_FILE_RANGE<p style=" margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);<p style=" margin-top:0px; margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> #elseifHAVE_POSIX_FADVISE<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);<p style=" margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>#else<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> #endif<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> <p style=" margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>To my knowledge posix_fadvise currently is only supported on linux btw...<p style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">I agree with Andres.<p style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">I believe we should use sync_file_range(_before?) with linux.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">And we can use posix_fadvise_dontneed on other kernels.<p style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">FADVISE_DONTNEED does start a writeback(it may decide not to do it too), but the primer objective of posix_fadvise_dontneed is not to make sync() faster.We just have writeback and sync() calls challenged together and we can face situation where linux does not handlethat so well. (depends on linux 2.6.18 or 32 or 3.2 or ...)<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;"><br /><p style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">-- <p style=" margin-top:0px; margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">Cédric Villemain+33 (0)6 20 30 22 52<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;text-indent:0px; -qt-user-state:0;">http://2ndQuadrant.fr/<p style=" margin-top:0px; margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">PostgreSQL:Support 24x7 - Développement, Expertise et Formation
On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote: > I agree with Andres. > > > I believe we should use sync_file_range (_before?) with linux. > > And we can use posix_fadvise_dontneed on other kernels. > OK, updated patch attached. sync_file_range() is preferred, posix_fadvise() is a fallback. Regards, Jeff Davis
Attachment
On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote: > On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote: > > I agree with Andres. > > > > > > I believe we should use sync_file_range (_before?) with linux. > > > > And we can use posix_fadvise_dontneed on other kernels. > > > OK, updated patch attached. sync_file_range() is preferred, > posix_fadvise() is a fallback. > Rebased patch attached. No other changes. Regards, Jeff Davis
Attachment
On tis, 2012-06-12 at 21:09 -0700, Jeff Davis wrote: > On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote: > > On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote: > > > I agree with Andres. > > > > > > > > > I believe we should use sync_file_range (_before?) with linux. > > > > > > And we can use posix_fadvise_dontneed on other kernels. > > > > > OK, updated patch attached. sync_file_range() is preferred, > > posix_fadvise() is a fallback. > > > > Rebased patch attached. No other changes. The --help output for the -N option was copy-and-pasted wrongly. The message issued when using -N is also a bit content-free. Maybe something like "Running in nosync mode. The data directory might become corrupt if the operating system crashes.\n" Which leads to the question, how does one get out of this state? Is running sync(1) enough? Is starting the postgres server enough? There are no updates to the initdb man page included in your patch, which would be a suitable place to discuss this briefly.
On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote: > The --help output for the -N option was copy-and-pasted wrongly. > > The message issued when using -N is also a bit content-free. Maybe > something like > > "Running in nosync mode. The data directory might become corrupt if the > operating system crashes.\n" Thank you, fixed. > Which leads to the question, how does one get out of this state? Is > running sync(1) enough? Is starting the postgres server enough? sync(1) calls sync(2), and the man page says: "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)" So it looks like sync is enough if you are on linux *and* you have any unprotected write cache disabled. I don't think starting the postgres server is enough. Before, I think we were safe because we could assume that the OS would flush the buffers before you had time to store any important data. But now, that window can be much larger. > There are no updates to the initdb man page included in your patch, > which would be a suitable place to discuss this briefly. Thank you, I added that. Regards, Jeff Davis
Attachment
On Wednesday, June 13, 2012 06:53:17 PM Jeff Davis wrote: > On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote: > > The --help output for the -N option was copy-and-pasted wrongly. > > > > The message issued when using -N is also a bit content-free. Maybe > > something like > > > > "Running in nosync mode. The data directory might become corrupt if the > > operating system crashes.\n" > > Thank you, fixed. > > > Which leads to the question, how does one get out of this state? Is > > running sync(1) enough? Is starting the postgres server enough? > > sync(1) calls sync(2), and the man page says: > > "According to the standard specification (e.g., POSIX.1-2001), sync() > schedules the writes, but may return before the actual writing is done. > However, since version 1.3.20 Linux does actually wait. (This still > does not guarantee data integrity: modern disks have large caches.)" > > So it looks like sync is enough if you are on linux *and* you have any > unprotected write cache disabled. Protection can include write barries, it doesn't need to be a BBU.... > I don't think starting the postgres server is enough. Agreed. > Before, I think we were safe because we could assume that the OS would > flush the buffers before you had time to store any important data. But > now, that window can be much larger. I think to a large degree we didn't see any problem because of the old ext3 "sync the whole world" type of behaviour which was common for a very long time. So on ext3 (with data=ordered, the default) any fsync (checkpoints, commit, ...) would lead to all the other files being synced as well which means starting the server once would have been enough on such a system... Quick review: - defaulting to initdb -N in the regression suite is not a good imo, because that way the buildfarm won't catch problems in that area... - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname instead of cluttering the main function with it Looks good otherwise! Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: > - defaulting to initdb -N in the regression suite is not a good imo, > because that way the buildfarm won't catch problems in that area... > The regression test suite also starts postgres with fsync off. This is good, and I don't want to have more overhead in the tests. It's not like we already have awesome test coverage for OS interaction.
On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote: > Quick review: > - defaulting to initdb -N in the regression suite is not a good imo, because > that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be > unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. If you feel there will be a maintainability problem, I can give it another shot. > - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname > instead of cluttering the main function with it Done. Regards, Jeff Davis
Attachment
On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote: > On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: > > - defaulting to initdb -N in the regression suite is not a good imo, > > because that way the buildfarm won't catch problems in that area... > > > The regression test suite also starts postgres with fsync off. This is > good, and I don't want to have more overhead in the tests. It's not > like we already have awesome test coverage for OS interaction. OK, changing back to using -N during regression tests due to performance concerns. Regards, Jeff Davis
Attachment
On Monday, June 18, 2012 08:39:47 PM Jeff Davis wrote: > On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote: > > Quick review: > > - defaulting to initdb -N in the regression suite is not a good imo, > > because that way the buildfarm won't catch problems in that area... > I removed the -N as you suggest. How much does performance matter on the > buildfarm? I don't think the difference in initdb cost is relevant when running the regression tests. Should it prove to be we can re-add -N after a week or two in the buildfarm machines. I just remember that there were several OS specific regression when adding the pre-syncing for createdb. > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al > > be unified? > There's a lot of backend-specific code in the copydir versions, like > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at > unifying them before, and concluded that it wouldn't add to the > readability, so I just commented where they came from. Ok. Sensible reasons. I dislike that we know have two files using different logic (copydir.c only using fadvise, initdb using sync_file_range if available). Maybe we should just move the fadvise and sync_file_range calls into its own common function? > If you feel there will be a maintainability problem, I can give it > another shot. Its not too bad yet I guess, so ... > > - I personally would find it way nicer to put USE_PRE_SYNC into > > pre_sync_fname instead of cluttering the main function with it > Done. Looks good to me. Btw, I just want to have said this, although I don't think its particularly relevant as it doesn't affect correctness: Its possible to have a system where sync_file_range is in the system headers but the kernel during runtime doesn't support it. It is relatively new (2.6.17). It would be possible to fallback to posix_fadvise which has been around far longer in that case... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: > > > - defaulting to initdb -N in the regression suite is not a good imo, > > > because that way the buildfarm won't catch problems in that area... > > I removed the -N as you suggest. How much does performance matter on the > > buildfarm? > I don't think the difference in initdb cost is relevant when running the > regression tests. Should it prove to be we can re-add -N after a week or two > in the buildfarm machines. I just remember that there were several OS specific > regression when adding the pre-syncing for createdb. That sounds reasonable to me. Both patches are out there, so we can figure out what the consensus is. > > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al > > > be unified? > > There's a lot of backend-specific code in the copydir versions, like > > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at > > unifying them before, and concluded that it wouldn't add to the > > readability, so I just commented where they came from. > Ok. Sensible reasons. I dislike that we know have two files using different > logic (copydir.c only using fadvise, initdb using sync_file_range if > available). Maybe we should just move the fadvise and sync_file_range calls > into its own common function? I don't see fadvise in copydir.c, it looks like it just uses fsync. It might speed it up to use a pre-sync call there, too -- database creation does take a while. If that's in the scope of this patch, I'll do it. > Btw, I just want to have said this, although I don't think its particularly > relevant as it doesn't affect correctness: Its possible to have a system where > sync_file_range is in the system headers but the kernel during runtime doesn't > support it. It is relatively new (2.6.17). It would be possible to fallback to > posix_fadvise which has been around far longer in that case... Interesting point, but I'm not too worried about it. Regards,Jeff Davis
On Monday, June 18, 2012 09:32:25 PM Jeff Davis wrote: > > > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et > > > > al be unified? > > > > > > There's a lot of backend-specific code in the copydir versions, like > > > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at > > > unifying them before, and concluded that it wouldn't add to the > > > readability, so I just commented where they came from. > > > > Ok. Sensible reasons. I dislike that we know have two files using > > different logic (copydir.c only using fadvise, initdb using > > sync_file_range if available). Maybe we should just move the fadvise and > > sync_file_range calls into its own common function? > > I don't see fadvise in copydir.c, it looks like it just uses fsync. It > might speed it up to use a pre-sync call there, too -- database creation > does take a while. > > If that's in the scope of this patch, I'll do it. It calls pg_flush_data inside of copy_file which does the posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Jeff Davis's message of lun jun 18 15:32:25 -0400 2012: > On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: > > Btw, I just want to have said this, although I don't think its particularly > > relevant as it doesn't affect correctness: Its possible to have a system where > > sync_file_range is in the system headers but the kernel during runtime doesn't > > support it. It is relatively new (2.6.17). It would be possible to fallback to > > posix_fadvise which has been around far longer in that case... > > Interesting point, but I'm not too worried about it. Yeah. 2.6.17 was released on June 2006. The latest stable release prior to 2.6.17 was 2.6.16.62 in 2008 and it was abandoned at that time in favor of 2.6.27 as the new stable branch. I don't think this is a problem any sane person is going to face; and telling them to upgrade to a newer kernel seems an acceptable answer. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: > It calls pg_flush_data inside of copy_file which does the posix_fadvise... So > maybe just put the sync_file_range in pg_flush_data? Oh, I didn't notice that, thank you. In that case, it may be good to combine them if possible. I will look into it. There may be performance implications when used one a larger amount of data though. I can do some brief testing. Regards,Jeff Davis
On Mon, Jun 18, 2012 at 09:34:30PM +0300, Peter Eisentraut wrote: > On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: > > - defaulting to initdb -N in the regression suite is not a good imo, > > because that way the buildfarm won't catch problems in that area... > > > The regression test suite also starts postgres with fsync off. This is > good, and I don't want to have more overhead in the tests. It's not > like we already have awesome test coverage for OS interaction. What sorts of coverage would we want? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: > It calls pg_flush_data inside of copy_file which does the posix_fadvise... So > maybe just put the sync_file_range in pg_flush_data? The functions in fd.c aren't linked to initdb, so it's a challenge to share that code (I remember now: that's why I didn't use pg_flush_data originally). I don't see a clean way to resolve that, do you? Also, it seems that sync_file_range() doesn't help with creating a database much (on my machine), so it doesn't seem important to use it there. Right now I'm inclined to leave the patch as-is. Regards,Jeff Davis
On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote: > On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: > > It calls pg_flush_data inside of copy_file which does the > > posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? > > The functions in fd.c aren't linked to initdb, so it's a challenge to > share that code (I remember now: that's why I didn't use pg_flush_data > originally). I don't see a clean way to resolve that, do you? > > Also, it seems that sync_file_range() doesn't help with creating a > database much (on my machine), so it doesn't seem important to use it > there. > > Right now I'm inclined to leave the patch as-is. Fine with that, I wanted to bring it up and see it documented. I have marked it with ready for committer. That committer needs to decide on - N in the regression tests or not, but that shouldn't be much of a problem ;) Cool! Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah@leadboat.com> wrote: > On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote: >> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote: >> > Yeah. Personally I would be sad if initdb got noticeably slower, and >> > I've never seen or heard of a failure that this would fix. >> >> I worked up a patch, and it looks like it does about 6 file fsync's and >> a 7th for the PGDATA directory. That degrades the time from about 1.1s >> to 1.4s on my workstation. > >> So, is it worth it? Should we make it an option that can be specified? > > If we add fsync calls to the initdb process, they should cover the entire data > directory tree. This patch syncs files that initdb.c writes, but we ought to > also sync files that bootstrap-mode backends had written. An optimization > like the pg_flush_data() call in copy_file() may reduce the speed penalty. > > initdb should do these syncs by default and offer an option to disable them. This may be a stupid question, by why is it initdb's job to fsync the files the server creates, rather than the server's job? Normally we rely on the server to make its own writes persistent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 22, 2012 at 10:04:23AM -0400, Robert Haas wrote: > On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah@leadboat.com> wrote: > > If we add fsync calls to the initdb process, they should cover the entire data > > directory tree. ?This patch syncs files that initdb.c writes, but we ought to > > also sync files that bootstrap-mode backends had written. ?An optimization > > like the pg_flush_data() call in copy_file() may reduce the speed penalty. > > > > initdb should do these syncs by default and offer an option to disable them. > > This may be a stupid question, by why is it initdb's job to fsync the > files the server creates, rather than the server's job? Normally we > rely on the server to make its own writes persistent. Modularity would dictate having the server fsync its own work product, but I expect that approach to perform materially worse. initdb runs many single-user server instances, and each would fsync independently. When N initdb steps change one file, it would see N fsyncs. Using sync_file_range to queue all writes is best for the typical interactive or quasi-interactive initdb user. It's not always a win for server fsyncs, so we would need to either define special cases such that it's used during initdb or forgo the optimization. On the other hand, the server could skip some files, like fsm forks, in a principled manner. Overall, I think it will hard to improve modularity while retaining the performance Jeff's approach achieves through exploiting initdb's big-picture perspective. So I favor how Jeff has implemented it. nm
On Fri, 2012-06-22 at 10:04 -0400, Robert Haas wrote: > This may be a stupid question, by why is it initdb's job to fsync the > files the server creates, rather than the server's job? Normally we > rely on the server to make its own writes persistent. That was my first reaction as well: http://archives.postgresql.org/message-id/1328468000.15224.32.camel@jdavis However, from the discussion it seems like it will be harder to do it that way (during normal operation, a checkpoint is what syncs the files; but as Tom points out, bootstrap mode does not). Also, I would expect the fsyncs to be in a less-controlled pattern, so it might perform more poorly. Regards,Jeff Davis
On mån, 2012-06-18 at 20:57 +0200, Andres Freund wrote: > I don't think the difference in initdb cost is relevant when running > the regression tests. Should it prove to be we can re-add -N after a > week or two in the buildfarm machines. Keep in mind that the regression tests are not only run on the buildfarm, and that pg_regress is not only used for the main regression tests. When you're dealing with things in contrib or src/pl or external modules, then adding 3 seconds to a 5 second test is not so nice. > I just remember that there were several OS specific regression when > adding the pre-syncing for createdb. Sure, it would be useful to test this, but then we should also turn on fsync in the backend. Make it a separate slow mode (or a separate fast mode).
Andres Freund <andres@2ndquadrant.com> writes: > On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote: >> Right now I'm inclined to leave the patch as-is. > Fine with that, I wanted to bring it up and see it documented. > I have marked it with ready for committer. That committer needs to decide on - > N in the regression tests or not, but that shouldn't be much of a problem ;) I'm picking up this patch now. What I'm inclined to do about the -N business is to commit without that, so that we get a round of testing in the buildfarm and find out about any portability issues, but then change to use -N after a week or so. I agree that in the long run we don't want regression tests to run with fsyncs by default. regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: >> Ok. Sensible reasons. I dislike that we know have two files using different >> logic (copydir.c only using fadvise, initdb using sync_file_range if >> available). Maybe we should just move the fadvise and sync_file_range calls >> into its own common function? > I don't see fadvise in copydir.c, it looks like it just uses fsync. It > might speed it up to use a pre-sync call there, too -- database creation > does take a while. No, that's incorrect: the fadvise is there, inside pg_flush_data() which is done during the writing phase. It seems to me that if we think sync_file_range is a win, we ought to be using it in pg_flush_data just as much as in initdb. However, I'm a bit confused because in http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php you said > So, it looks like fadvise is the "right" thing to do, but I expect we'll Was that backwards? If not, why are we bothering with taking any portability risks by adding use of sync_file_range? regards, tom lane
I wrote: > I'm picking up this patch now. What I'm inclined to do about the -N > business is to commit without that, so that we get a round of testing > in the buildfarm and find out about any portability issues, but then > change to use -N after a week or so. I agree that in the long run > we don't want regression tests to run with fsyncs by default. Committed without the -N in pg_regress (for now). I also stuck sync_file_range into the backend's pg_flush_data --- it would be interesting to hear measurements of whether that makes a noticeable difference for CREATE DATABASE. regards, tom lane
On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote: > No, that's incorrect: the fadvise is there, inside pg_flush_data() which > is done during the writing phase. Yeah, Andres pointed that out, also. > It seems to me that if we think > sync_file_range is a win, we ought to be using it in pg_flush_data just > as much as in initdb. It was pretty clearly a win for initdb, but I wasn't convinced it was a good idea for other use cases. The mechanism is outlined in the email you linked below. To paraphrase, fadvise tries to put the data in the io scheduler queue (still in the kernel before going to the device), and gives up if there is no room; sync_file_range waits for there to be room if none is available. For the case of initdb, the data needing to be fsync'd is effectively constant, and it's a lot of small files. If the requests don't make it to the io scheduler queue before fsync, the kernel doesn't have an opportunity to schedule them properly. But for larger amounts of data copying (like ALTER DATABASE SET TABLESPACE), it seemed like there was more risk that sync_file_range would starve out other processes by continuously filling up the io scheduler queue (I'm not sure if there are protections against that or not). Also, if the files are larger, a single fsync represents more data, and the kernel may be able to schedule it well enough anyway. I'm not an authority in this area though, so if you are comfortable extending sync_file_range to copydir() as well, that's fine with me. > However, I'm a bit confused because in > http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php > you said > > > So, it looks like fadvise is the "right" thing to do, but I expect we'll > > Was that backwards? If not, why are we bothering with taking any > portability risks by adding use of sync_file_range? That part of the email was less than conclusive, and I can't remember exactly what point I was trying to make. sync_file_range is a big practical win for the reasons I mentioned above (and also avoids some unpleasant noises coming from the disk drive). Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > For the case of initdb, the data needing to be fsync'd is effectively > constant, and it's a lot of small files. If the requests don't make it > to the io scheduler queue before fsync, the kernel doesn't have an > opportunity to schedule them properly. > But for larger amounts of data copying (like ALTER DATABASE SET > TABLESPACE), it seemed like there was more risk that sync_file_range > would starve out other processes by continuously filling up the io > scheduler queue (I'm not sure if there are protections against that or > not). Also, if the files are larger, a single fsync represents more > data, and the kernel may be able to schedule it well enough anyway. > I'm not an authority in this area though, so if you are comfortable > extending sync_file_range to copydir() as well, that's fine with me. It could use some performance testing, which I don't have the time for (or proper equipment). Anyone? Also, I note that copy_file is set up so that it does sync_file_range or fadvise for each 64K chunk of data, which seems mighty small. I wonder if it'd be better to take that out of the loop and do one whole-file advise at the end of the copy loop. Or at least use some larger granularity for those calls. regards, tom lane
On Fri, 2012-07-13 at 17:35 -0400, Tom Lane wrote: > I wrote: > > I'm picking up this patch now. What I'm inclined to do about the -N > > business is to commit without that, so that we get a round of testing > > in the buildfarm and find out about any portability issues, but then > > change to use -N after a week or so. I agree that in the long run > > we don't want regression tests to run with fsyncs by default. > > Committed without the -N in pg_regress (for now). I also stuck > sync_file_range into the backend's pg_flush_data --- it would be > interesting to hear measurements of whether that makes a noticeable > difference for CREATE DATABASE. Thank you. One point about the commit message: fadvise does not block to go into the request queue, sync_file_range does. The problem with fadvise is that, when the request queue is small, it fills up so fast that most of the requests never make it in, and fadvise is essentially a no-op. sync_file_range waits for room in the queue, which is (based on my tests) enough to improve the scheduling a lot. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > One point about the commit message: fadvise does not block to go into > the request queue, sync_file_range does. The problem with fadvise is > that, when the request queue is small, it fills up so fast that most of > the requests never make it in, and fadvise is essentially a no-op. > sync_file_range waits for room in the queue, which is (based on my > tests) enough to improve the scheduling a lot. I see. I misunderstood your previous message. In that case, it seems quite likely that it might be helpful if copy_file were to aggregate the fadvise/sync_file_range calls over larger pieces of the file. (I'm assuming that the request queue isn't bright enough to aggregate by itself, though that might be wrong.) regards, tom lane