Thread: silent data loss with ext4 / all current versions
Hi,

I've been doing some power failure tests (i.e. unexpectedly interrupting power) over the past few days, and I've discovered a fairly serious case of silent data loss on ext3/ext4. Initially I thought it was a filesystem bug, but after further investigation I'm pretty sure it's our fault.

What happens is that when we recycle WAL segments, we rename them and then sync them using fdatasync (which is the default on Linux). However, fdatasync does not force an fsync on the parent directory, so in case of a power failure the rename may get lost. Recovery won't realize those segments actually contain changes from the "future" and thus does not replay them. Hence data loss. Recovery completes as if everything went OK, so the data loss is entirely silent.

Reproducing this is rather trivial. I've prepared a simple C program simulating our WAL recycling, which I originally intended to send to the ext4 mailing list to demonstrate the ext4 bug (before I realized it's most likely our bug and not theirs).

The example program is called ext4-data-loss.c and is available here (along with the other stuff mentioned in this message):

    https://github.com/2ndQuadrant/ext4-data-loss

Compile it, run it (over ssh from another host), interrupt the power, and after restart you should see some of the segments lost (the rename reverted).

The git repo also contains a bunch of Python scripts that I initially used to reproduce this on PostgreSQL - insert.py, update.py and xlog-watch.py. I'm not going to explain the details here; it's a bit more complicated, but the cause is exactly the same as with the C program, just demonstrated in the database. See the README for instructions.

So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19, but I believe all versions since spread checkpoints were introduced are vulnerable).

FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.

I've done the same tests on xfs and that seems to be safe - I've been unable to reproduce the issue, so either the issue is not there or it's more difficult to hit. I haven't tried other file systems, because ext4 and xfs cover the vast majority of deployments (at least on Linux), and thus an issue on ext4 is serious enough, I believe.

It's possible to make ext3/ext4 safe with respect to this issue by using full journaling (data=journal) instead of the default (data=ordered) mode. However, this comes at a significant performance cost and pretty much no one uses it with PostgreSQL because data=ordered is believed to be safe.

It's also possible to mitigate this by setting wal_sync_method=fsync, but I don't think I've ever seen that changed at a customer. This also comes with a significant performance penalty, comparable to data=journal, although it has the advantage that it can be done without restarting the database (SIGHUP is enough). So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it's properly fsynced in the WAL segments, it's just the rename that got lost.
So it's possible to survive this without losing data by manually renaming the segments, but this must happen before starting the cluster, because otherwise the automatic recovery kicks in and discards the data.

I think this issue might also result in various other problems, not just data loss. For example, I wouldn't be surprised by data corruption due to flushing some of the changes in data files to disk (due to contention for shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL segment. Also, while I have only seen 1 to 3 segments getting lost, it's possible that more segments can get lost, possibly making the recovery impossible. And of course, this might cause problems with WAL archiving due to archiving the same segment twice (before and after the crash).

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure this needs to be backpatched to all back branches. I've also attached a patch that adds a pg_current_xlog_flush_location() function, which proved to be quite useful when debugging this issue.

I'd also like to propose adding the "last segment" to pg_controldata, next to the last checkpoint / restartpoint. We don't need to write this on every commit, once per segment (on the first write) is enough. This would make investigating the issue much easier, and it'd also make it possible to terminate the recovery with an error if the last segment found does not match the expectation (instead of just assuming we've found all segments, leading to data loss).

Another useful change would be to allow pg_xlogdump to print segments even if the contents do not match the filename. Currently it's impossible to even look at the contents in that case, so renaming the existing segments is mostly guesswork (find segments where pg_xlogdump fails, try renaming them to the next segments).

And finally, I've done a quick review of all places that might suffer from the same issue - some are not really interesting as the stuff is ephemeral anyway (like pgstat, for example), but there are ~15 places that may need this fix:

* src/backend/access/transam/timeline.c (2 matches)
* src/backend/access/transam/xlog.c (9 matches)
* src/backend/access/transam/xlogarchive.c (3 matches)
* src/backend/postmaster/pgarch.c (1 match)

Some of these places might actually be safe because an fsync happens somewhere immediately after the rename (e.g. in a caller), but I guess better safe than sorry.

I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
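For reference, a minimal sketch of the rename-plus-directory-fsync pattern the proposed fix relies on - plain POSIX calls, not the actual patch, and the helper name durable_rename() as well as the segment paths in main() are purely illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch of a "durable rename": rename() alone is atomic but not
 * necessarily durable, so the renamed file and its parent directory
 * both need an explicit fsync.  Error handling is reduced to exit()
 * for brevity.
 */
static void
durable_rename(const char *oldpath, const char *newpath, const char *dirpath)
{
	int		fd;

	if (rename(oldpath, newpath) < 0)
	{
		perror("rename");
		exit(1);
	}

	/* make sure the file contents are on disk under the new name */
	fd = open(newpath, O_RDWR);
	if (fd < 0 || fsync(fd) < 0)
	{
		perror("fsync file");
		exit(1);
	}
	close(fd);

	/* ... and make the directory entry (the rename itself) durable */
	fd = open(dirpath, O_RDONLY);
	if (fd < 0 || fsync(fd) < 0)
	{
		perror("fsync directory");
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* illustrative paths only - old segment recycled under a new name */
	durable_rename("pg_xlog/0000000100000000000000A1",
				   "pg_xlog/0000000100000000000000B7",
				   "pg_xlog");
	return 0;
}

In the WAL-recycling case, oldpath and newpath correspond to the old and new segment names inside pg_xlog, and dirpath is pg_xlog itself; without the final directory fsync the rename may disappear after a power failure.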
> What happens is that when we recycle WAL segments, we rename them and then sync them using fdatasync (which is the default on Linux). However, fdatasync does not force an fsync on the parent directory, so in case of a power failure the rename may get lost. Recovery won't realize those segments actually contain changes

Agreed. I ran into this myself some time ago, although it wasn't Postgres.

> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).
>
> This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19, but I believe all versions since spread checkpoints were introduced are vulnerable).
>
> FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.

Agreed again.

> I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).

It would be very useful, but I hope you will not find a new bug :)

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).

Yeah, that seems to be the way the POSIX spec clears things.

"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html

If I understand that right, it is guaranteed that the rename() will be atomic, meaning that there will be only one file even if there is a crash, but that we need to fsync() the parent directory as mentioned.

> FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.

The POSIX spec authorizes this behavior, so the FS is not to blame, clearly. At least that's what I get from it.

> I've done the same tests on xfs and that seems to be safe - I've been unable to reproduce the issue, so either the issue is not there or it's more difficult to hit. I haven't tried other file systems, because ext4 and xfs cover the vast majority of deployments (at least on Linux), and thus an issue on ext4 is serious enough, I believe.
>
> So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
>
> It's also worth mentioning that the data is not actually lost - it's properly fsynced in the WAL segments, it's just the rename that got lost. So it's possible to survive this without losing data by manually renaming the segments, but this must happen before starting the cluster, because otherwise the automatic recovery kicks in and discards the data.

Hm. Most users are not going to notice that, particularly where things are embedded.

> I think this issue might also result in various other problems, not just data loss. For example, I wouldn't be surprised by data corruption due to flushing some of the changes in data files to disk (due to contention for shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL segment. Also, while I have only seen 1 to 3 segments getting lost, it's possible that more segments can get lost, possibly making the recovery impossible. And of course, this might cause problems with WAL archiving due to archiving the same segment twice (before and after the crash).

Possible, the switch to .done is done after renaming the segment in xlogarchive.c. So this could happen in theory.

> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure this needs to be backpatched to all back branches. I've also attached a patch that adds a pg_current_xlog_flush_location() function, which proved to be quite useful when debugging this issue.

Agreed. We should be sure as well that the calls to fsync_fname get issued in a critical section with START/END_CRIT_SECTION(). It does not seem to be the case with your patch.
> And finally, I've done a quick review of all places that might suffer from the same issue - some are not really interesting as the stuff is ephemeral anyway (like pgstat, for example), but there are ~15 places that may need this fix:
>
> * src/backend/access/transam/timeline.c (2 matches)
> * src/backend/access/transam/xlog.c (9 matches)
> * src/backend/access/transam/xlogarchive.c (3 matches)
> * src/backend/postmaster/pgarch.c (1 match)
>
> Some of these places might actually be safe because an fsync happens somewhere immediately after the rename (e.g. in a caller), but I guess better safe than sorry.

I had a quick look at those code paths and indeed it would be safer to be sure that once rename() is called we issue those fsync calls.

> I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).

That would be great.

--
Michael
On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think of would be to run a database on an LVM volume, take a large number of LVM snapshots very rapidly, and then see if the database can start up from each snapshot. Bonus points for keeping track of the committed transactions before each snapshot and ensuring they're still there, I guess.

That always seemed unsatisfactory because in the past we were mainly concerned with whether fsync was actually getting propagated to the physical media. But for testing whether we're fsyncing enough for the filesystem, that would be good enough.

--
greg
Hi,

On 11/27/2015 02:28 PM, Greg Stark wrote:
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can think of would be to run a database on an LVM volume and take a large number of LVM snapshots very rapidly and then see if the database can start up from each snapshot. Bonus points for keeping track of the committed transactions before each snapshot and ensuring they're still there I guess.

I do have reliable storage (Intel SSD with power-loss protection), and I've connected the system to a sophisticated power-loss-making device called "the power switch" (image attached). In other words, in the last ~7 days the system got rebooted more times than in the previous ~5 years.

> That always seemed unsatisfactory because in the past we were mainly concerned with whether fsync was actually getting propagated to the physical media. But for testing whether we're fsyncing enough for the filesystem that would be good enough.

Yeah. I considered some form of virtualized setup initially, but my original intent was to verify whether disabling write barriers really is safe (because I've heard numerous complaints that it's stupid). And as write barriers are more tightly coupled to the hardware, I opted for the more brutal approach.

But I agree some form of virtualized setup might be more flexible, although I'm not sure LVM snapshots are a good approach, as snapshots may wait for I/O requests to complete and such. I think something like qemu might work better when combined with "kill -9", and I plan to try reproducing the data loss issue on such a setup.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be atomic, meaning that there will be only one file even if there is a crash, but that we need to fsync() the parent directory as mentioned.
>
>> FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.
>
> The POSIX spec authorizes this behavior, so the FS is not to blame, clearly. At least that's what I get from it.

The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX expert), but we should be prepared for the less favorable interpretation, I think.

>> I think this issue might also result in various other problems, not just data loss. For example, I wouldn't be surprised by data corruption due to flushing some of the changes in data files to disk (due to contention for shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL segment. Also, while I have only seen 1 to 3 segments getting lost, it's possible that more segments can get lost, possibly making the recovery impossible. And of course, this might cause problems with WAL archiving due to archiving the same segment twice (before and after the crash).
>
> Possible, the switch to .done is done after renaming the segment in xlogarchive.c. So this could happen in theory.

Yes. That's one of the suspicious places in my notes (haven't posted all the details, the message was long enough already).

>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure this needs to be backpatched to all back branches. I've also attached a patch that adds a pg_current_xlog_flush_location() function, which proved to be quite useful when debugging this issue.
>
> Agreed. We should be sure as well that the calls to fsync_fname get issued in a critical section with START/END_CRIT_SECTION(). It does not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/ which does fsync_fname() on all the interesting places, without the critical section.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 28, 2015 at 3:01 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 11/27/2015 02:18 PM, Michael Paquier wrote:
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>> If I understand that right, it is guaranteed that the rename() will be atomic, meaning that there will be only one file even if there is a crash, but that we need to fsync() the parent directory as mentioned.
>>
>>> FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. This issue is at the filesystem level, not the storage level.
>>
>> The POSIX spec authorizes this behavior, so the FS is not to blame, clearly. At least that's what I get from it.
>
> The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX expert),

As I am understanding it, FS implementations are free to decide to make the rename persist on disk or not.

> but we should be prepared for the less favorable interpretation, I think.

Yep. I agree. And in case my previous words were not clear, that's the same line of thought here, we had better cover our backs and study carefully each code path that could be impacted.

>>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure this needs to be backpatched to all back branches. I've also attached a patch that adds a pg_current_xlog_flush_location() function, which proved to be quite useful when debugging this issue.
>>
>> Agreed. We should be sure as well that the calls to fsync_fname get issued in a critical section with START/END_CRIT_SECTION(). It does not seem to be the case with your patch.
>
> Don't know. I've based that on code from replication/logical/ which does fsync_fname() on all the interesting places, without the critical section.

For slot information in slot.c, there will be a PANIC when fsyncing pg_replslot at some points. It does not seem that weird to do the same for example after renaming the backup label file..

--
Michael
On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:
On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).
I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snapshot and ensuring they're still there I
guess.
I've had a few tries at implementing a qemu-based crashtester where it hard kills the qemu instance at a random point then starts it back up.
I always got stuck on the validation part - actually ensuring that the DB state is how we expect. I think I could probably get that right now, it's been a while.
The VM can be started back up and killed again over and over quite quickly.
It's not as good as physical plug-pull, but it's a lot more practical.
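For illustration, here is a rough sketch of what the validation step Craig mentions could look like, using libpq; the connection string, the crash_test table and the committed.log file are made-up names. The idea is that the workload records every key it has successfully committed into a local file before the crash, and after the VM is restarted the checker verifies that each recorded key is still visible:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Sketch of the validation step of a crash tester (illustrative only).
 * The workload records every successfully committed key into
 * "committed.log" before the crash; after the VM restarts we verify
 * that each recorded key is still visible in the (made-up) table
 * crash_test(id bigint).
 */
int
main(void)
{
	PGconn *conn = PQconnectdb("host=testvm dbname=postgres");
	FILE   *log = fopen("committed.log", "r");
	char	key[64];
	int		missing = 0;

	if (PQstatus(conn) != CONNECTION_OK || log == NULL)
	{
		fprintf(stderr, "setup failed: %s\n", PQerrorMessage(conn));
		return 1;
	}

	while (fgets(key, sizeof(key), log) != NULL)
	{
		const char *paramValues[1];
		PGresult   *res;

		key[strcspn(key, "\r\n")] = '\0';	/* strip the newline */
		paramValues[0] = key;

		res = PQexecParams(conn,
						   "SELECT 1 FROM crash_test WHERE id = $1::bigint",
						   1, NULL, paramValues, NULL, NULL, 0);

		/* a committed key that is no longer visible means data loss */
		if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) == 0)
		{
			fprintf(stderr, "committed key lost: %s\n", key);
			missing++;
		}
		PQclear(res);
	}

	fclose(log);
	PQfinish(conn);
	return missing ? 1 : 0;
}

The exit code can then drive the crash-test loop: a non-zero status after a restart means a committed key disappeared, i.e. the failure mode discussed in this thread.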
On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
It's also possible to mitigate this by setting wal_sync_method=fsync
Are you sure?
https://lwn.net/Articles/322823/ tends to suggest that fsync() on the file is insufficient to ensure rename() is persistent, though it's somewhat old.
Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> I plan to do more power failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - then the rename won't be replayed and will be lost).
>>
>> I'm curious how you're doing this testing. The easiest way I can think of would be to run a database on an LVM volume and take a large number of LVM snapshots very rapidly and then see if the database can start up from each snapshot. Bonus points for keeping track of the committed transactions before each snapshot and ensuring they're still there I guess.
>
> I've had a few tries at implementing a qemu-based crashtester where it hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've been unsuccessful. On bare HW it was easily reproducible (I'd hit the issue 9 out of 10 attempts), so either I'm doing something wrong or qemu somehow interacts with the I/O.

> I always got stuck on the validation part - actually ensuring that the DB state is how we expect. I think I could probably get that right now, it's been a while.

Well, I guess we can't really check all the details, but the checksums make checking the general consistency somewhat simpler. And then you have to design the workload in a way that makes the check easier - for example remembering the committed values etc.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/29/2015 02:41 PM, Craig Ringer wrote:
> On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> It's also possible to mitigate this by setting wal_sync_method=fsync
>
> Are you sure?
>
> https://lwn.net/Articles/322823/ tends to suggest that fsync() on the file is insufficient to ensure rename() is persistent, though it's somewhat old.

Good point. I don't know, and I'm not any smarter after reading the LWN article.

What I meant by "mitigate" is that I've been unable to reproduce the issue after setting wal_sync_method=fsync, so my conclusion is that it either fixes the issue or at least significantly reduces the probability of hitting it.

It's pretty clear that the right fix is the additional fsync on pg_xlog.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/29/2015 03:33 PM, Tomas Vondra wrote:
> Hi,
>
> On 11/29/2015 02:38 PM, Craig Ringer wrote:
>> I've had a few tries at implementing a qemu-based crashtester where it hard kills the qemu instance at a random point then starts it back up.
>
> I've tried to reproduce the issue by killing a qemu VM, and so far I've been unsuccessful. On bare HW it was easily reproducible (I'd hit the issue 9 out of 10 attempts), so either I'm doing something wrong or qemu somehow interacts with the I/O.

Update: I've managed to reproduce the issue in the qemu setup - I think it needs slightly different timing due to the VM being slightly slower. I also tweaked vm.dirty_bytes and vm.dirty_background_bytes to the values used on the bare hardware (I suspect it widens the window).

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/27/15 8:18 AM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be atomic, meaning that there will be only one file even if there is a crash, but that we need to fsync() the parent directory as mentioned.

I don't see anywhere in the spec that a rename needs an fsync of the directory to be durable. I can see why that would be needed in practice, though. File system developers would probably be able to give a more definite answer.
On 12/01/2015 10:44 PM, Peter Eisentraut wrote:
> On 11/27/15 8:18 AM, Michael Paquier wrote:
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> So, what's going on? The problem is that while the rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4 does not do that, apparently).
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>> If I understand that right, it is guaranteed that the rename() will be atomic, meaning that there will be only one file even if there is a crash, but that we need to fsync() the parent directory as mentioned.
>
> I don't see anywhere in the spec that a rename needs an fsync of the directory to be durable. I can see why that would be needed in practice, though. File system developers would probably be able to give a more definite answer.

Yeah, POSIX is the smallest common denominator. In this case the spec seems not to require this durability guarantee (a rename being durable without an fsync on the directory), which means a filesystem that loses such a rename can still be POSIX-compliant. At least that's my conclusion from reading https://lwn.net/Articles/322823/

However, as I explained in the original post, it's more complicated - this only seems to be a problem with fdatasync. I've been unable to reproduce the issue with wal_sync_method=fsync.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
    calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
    (except for those in timeline.c, as the START/END_CRIT_SECTION is
    not available there)

The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.

regards
Tomas

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
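For illustration, the general shape of the change looks roughly like this (not the actual patch - tmppath and path are placeholders, and the exact call sites and error handling differ):

/*
 * Rough shape of the change (illustrative only): after renaming a WAL
 * segment, fsync both the file and its parent directory so the rename
 * itself is durable.  The new fsync_fname() calls sit inside a critical
 * section, so a failure there escalates to PANIC instead of leaving the
 * on-disk state ambiguous.
 */
if (rename(tmppath, path) < 0)
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not rename file \"%s\" to \"%s\": %m",
					tmppath, path)));

START_CRIT_SECTION();
fsync_fname(path, false);		/* the renamed segment itself */
fsync_fname(XLOGDIR, true);		/* the pg_xlog directory entry */
END_CRIT_SECTION();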
On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is v2 of the patch, that
>
> (a) adds explicit fsync on the parent directory after all the rename() calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>
> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls (except for those in timeline.c, as the START/END_CRIT_SECTION is not available there)
>
> The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the current CF, I am the manager of this one after all... Could you register it to 2016-01 CF for the time being? I don't mind being beaten by someone else if this someone has some room to look at this patch..

--
Michael
On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v2 of the patch, that
>>
>> (a) adds explicit fsync on the parent directory after all the rename() calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>
>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls (except for those in timeline.c, as the START/END_CRIT_SECTION is not available there)
>>
>> The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.
>
> I would like to have an in-depth look at that after finishing the current CF, I am the manager of this one after all... Could you register it to 2016-01 CF for the time being? I don't mind being beaten by someone else if this someone has some room to look at this patch..

And please feel free to add my name as reviewer.

--
Michael
On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> Attached is v2 of the patch, that
>>>
>>> (a) adds explicit fsync on the parent directory after all the rename() calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>
>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls (except for those in timeline.c, as the START/END_CRIT_SECTION is not available there)
>>>
>>> The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.
>>
>> I would like to have an in-depth look at that after finishing the current CF, I am the manager of this one after all... Could you register it to 2016-01 CF for the time being? I don't mind being beaten by someone else if this someone has some room to look at this patch..
>
> And please feel free to add my name as reviewer.

Tomas, I am planning to have a look at that, because it seems to be important. In case it becomes lost on my radar, do you mind if I add it to the 2016-03 CF?

--
Michael
On 01/19/2016 07:44 AM, Michael Paquier wrote:
> On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>> Attached is v2 of the patch, that
>>>>
>>>> (a) adds explicit fsync on the parent directory after all the rename() calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>>
>>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls (except for those in timeline.c, as the START/END_CRIT_SECTION is not available there)
>>>>
>>>> The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.
>>>
>>> I would like to have an in-depth look at that after finishing the current CF, I am the manager of this one after all... Could you register it to 2016-01 CF for the time being? I don't mind being beaten by someone else if this someone has some room to look at this patch..
>>
>> And please feel free to add my name as reviewer.
>
> Tomas, I am planning to have a look at that, because it seems to be important. In case it becomes lost on my radar, do you mind if I add it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is still rotting here, considering it addresses a rather serious data loss / corruption issue on a pretty common setup.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 01/19/2016 07:44 AM, Michael Paquier wrote:
>> On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>>> Attached is v2 of the patch, that
>>>>>
>>>>> (a) adds explicit fsync on the parent directory after all the rename() calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>>>
>>>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls (except for those in timeline.c, as the START/END_CRIT_SECTION is not available there)
>>>>>
>>>>> The patch is fairly trivial and I've done some rudimentary testing, but I'm sure I haven't exercised all the modified paths.
>>>>
>>>> I would like to have an in-depth look at that after finishing the current CF, I am the manager of this one after all... Could you register it to 2016-01 CF for the time being? I don't mind being beaten by someone else if this someone has some room to look at this patch..
>>>
>>> And please feel free to add my name as reviewer.
>>
>> Tomas, I am planning to have a look at that, because it seems to be important. In case it becomes lost on my radar, do you mind if I add it to the 2016-03 CF?
>
> Well, what else can I do? I have to admit I'm quite surprised this is still rotting here, considering it addresses a rather serious data loss / corruption issue on a pretty common setup.

Well, I think you did what you could. And we need to be sure now that it gets in and that this patch gets a serious lookup. So for now my guess is that not losing track of it would be a good first move. I have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/

--
Michael
On 01/19/2016 08:03 AM, Michael Paquier wrote:
> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>>> Tomas, I am planning to have a look at that, because it seems to be important. In case it becomes lost on my radar, do you mind if I add it to the 2016-03 CF?
>>
>> Well, what else can I do? I have to admit I'm quite surprised this is still rotting here, considering it addresses a rather serious data loss / corruption issue on a pretty common setup.
>
> Well, I think you did what you could. And we need to be sure now that it gets in and that this patch gets a serious lookup. So for now my guess is that not losing track of it would be a good first move. I have added it here to attract more attention:
> https://commitfest.postgresql.org/9/484/

Ah, thanks. I hadn't realized it's not added into 2016-01 (I'd swear I added it into the CF app).

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 19, 2016 at 4:20 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 01/19/2016 08:03 AM, Michael Paquier wrote:
>> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>> ...
>>>> Tomas, I am planning to have a look at that, because it seems to be important. In case it becomes lost on my radar, do you mind if I add it to the 2016-03 CF?
>>>
>>> Well, what else can I do? I have to admit I'm quite surprised this is still rotting here, considering it addresses a rather serious data loss / corruption issue on a pretty common setup.
>>
>> Well, I think you did what you could. And we need to be sure now that it gets in and that this patch gets a serious lookup. So for now my guess is that not losing track of it would be a good first move. I have added it here to attract more attention:
>> https://commitfest.postgresql.org/9/484/
>
> Ah, thanks. I hadn't realized it's not added into 2016-01 (I'd swear I added it into the CF app).

So, I have been playing with a Linux VM with VMware Fusion, and on ext4 with data=ordered the renames are getting lost if the root folder is not fsynced. By killing the VM with kill -9 I am able to reproduce that really easily.

Here are some comments about your patch after a look at the code.

Regarding the additions of fsync_fname() in xlog.c:

1) In InstallXLogFileSegment, rename() will be called only if HAVE_WORKING_LINK is not used, which happens only on Windows and cygwin. We could add it for consistency, but it should be within the #else/#endif block. It is not critical as of now.

2) The call in RemoveXlogFile is not necessary, the rename happening only on Windows.

3) In exitArchiveRecovery, if the rename is not made durable I think it does not matter much. Even if recovery.conf is the one present once the node restarts, the node would move back again to recovery, and actually we had better move back to recovery in this case, no?

4) StartupXLOG for the tablespace map. I don't think this one is needed as well. Even if the tablespace map is not removed after a power loss, the user would get an error telling that the file should be removed.

5) For the one where haveBackupLabel || haveTblspcMap. If we do the fsync, we guarantee that there is no need to do the recovery again. But in case of a power loss, isn't it better to do the recovery again?

6) For the one after XLogArchiveNotify() for the last partial segment of the old timeline, it doesn't really matter to not make the change persistent, as this is mainly done because it is useful to identify that it is a partial segment.

7) In CancelBackup, this one is not needed as well I think. We would surely want to get back to recovery if those files remain after a power loss.

For the ones in xlogarchive.c:

1) For KeepFileRestoredFromArchive, it does not matter here, we are renaming a file with a temporary name to a permanent name. Once the node restarts, we would see an extra temporary file if the rename was not effective.

2) XLogArchiveForceDone, the only bad thing that would happen here is to leave behind a .ready file instead of a .done file. I guess that we could have it though as an optimization to not have to archive this file again.

For the one in pgarch.c:

1) In pgarch_archiveDone, we could add it as an optimization to actually let the server know that the segment has already been archived, preventing a retry.

In timeline.c:

1) writeTimeLineHistoryFile, it does not matter much. In the worst case we would have just a dummy temporary file, and the startup process would come back here (in exitArchiveRecovery() we may finish with an unnamed backup file similarly).

2) In writeTimeLineHistoryFile, similarly we don't need to care much, in WalRcvFetchTimeLineHistoryFiles recovery would take again the same path.

Thoughts?

--
Michael
Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:
> So, I have been playing with a Linux VM with VMware Fusion, and on ext4 with data=ordered the renames are getting lost if the root folder is not fsynced. By killing the VM with kill -9 I am able to reproduce that really easily.

Yep. Same experience here (with qemu-kvm VMs).

> Here are some comments about your patch after a look at the code.
>
> Regarding the additions of fsync_fname() in xlog.c:
> 1) In InstallXLogFileSegment, rename() will be called only if HAVE_WORKING_LINK is not used, which happens only on Windows and cygwin. We could add it for consistency, but it should be within the #else/#endif block. It is not critical as of now.
> 2) The call in RemoveXlogFile is not necessary, the rename happening only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could there be some other platforms? And are we sure the file systems on those platforms are safe without the fsync call?

That is, while the report references ext4, there may be other file systems with the same problem - ext4 was used mostly as it's the most widely used Linux file system.

> 3) In exitArchiveRecovery, if the rename is not made durable I think it does not matter much. Even if recovery.conf is the one present once the node restarts, the node would move back again to recovery, and actually we had better move back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to exchange the one fsync for not having to manually fix the database after a crash.

I don't really see why switching back to recovery should be desirable in this case? Imagine you have a warm/hot standby, and that you promote it to master. The clients connect, start issuing commands and then the system crashes and loses the recovery.conf rename. The system reboots, the database performs local recovery but then starts as a standby and starts rejecting writes. That seems really weird to me.

> 4) StartupXLOG for the tablespace map. I don't think this one is needed as well. Even if the tablespace map is not removed after a power loss, the user would get an error telling that the file should be removed.

Please no, for the same reasons as in (3).

> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the fsync, we guarantee that there is no need to do the recovery again. But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't have to? Had this not been reliable, then the whole recovery process is fundamentally broken and we'd better fix it instead of merely putting a band-aid on it.

> 6) For the one after XLogArchiveNotify() for the last partial segment of the old timeline, it doesn't really matter to not make the change persistent, as this is mainly done because it is useful to identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to do the fsync. It's not really going to give us any measurable performance advantage (given how rarely we do those fsyncs), so I'd vote to keep it and make sure the partial segments are named accordingly. Less confusion is always better.

> 7) In CancelBackup, this one is not needed as well I think. We would surely want to get back to recovery if those files remain after a power loss.

I may be missing something, but why would we switch to recovery in this case?

> For the ones in xlogarchive.c:
> 1) For KeepFileRestoredFromArchive, it does not matter here, we are renaming a file with a temporary name to a permanent name. Once the node restarts, we would see an extra temporary file if the rename was not effective.

So we'll lose the segment (won't have it locally under the permanent name), as we've already restored it and won't do that again. Is that really a good thing to do?

> 2) XLogArchiveForceDone, the only bad thing that would happen here is to leave behind a .ready file instead of a .done file. I guess that we could have it though as an optimization to not have to archive this file again.

Yes.

> For the one in pgarch.c:
> 1) In pgarch_archiveDone, we could add it as an optimization to actually let the server know that the segment has already been archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

> In timeline.c:
> 1) writeTimeLineHistoryFile, it does not matter much. In the worst case we would have just a dummy temporary file, and the startup process would come back here (in exitArchiveRecovery() we may finish with an unnamed backup file similarly).

OK

> 2) In writeTimeLineHistoryFile, similarly we don't need to care much, in WalRcvFetchTimeLineHistoryFiles recovery would take again the same path.

OK

> Thoughts?

Thanks for the review and comments. I think the question is whether we want to do the additional fsync() only when its absence may ultimately lead to data loss, or also in cases where it may merely cause operational issues (e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier to operate (less manual intervention) and the performance impact is 0 (as those fsyncs are really rare).

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
>
> Thanks for the review and comments. I think the question is whether we want to do the additional fsync() only when its absence may ultimately lead to data loss, or also in cases where it may merely cause operational issues (e.g. switching back to recovery needlessly).
>
> I'd vote for the latter, as I think it makes the database easier to operate (less manual intervention) and the performance impact is 0 (as those fsyncs are really rare).
Yeah, unless it gives a significant performance penalty, I'd agree that the latter seems like the better option of those. +1 for that way :)
On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions of fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if HAVE_WORKING_LINK is not used, which happens only on Windows and cygwin. We could add it for consistency, but it should be within the #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening only on Windows.
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could there be some other platforms? And are we sure the file systems on those platforms are safe without the fsync call?
> That is, while the report references ext4, there may be other file systems with the same problem - ext4 was used mostly as it's the most widely used Linux file system.

From pg_config_manual.h:

#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif

If we want to be consistent with what Posix proposes, I am not against adding it.

>> 3) In exitArchiveRecovery, if the rename is not made durable I think it does not matter much. Even if recovery.conf is the one present once the node restarts, the node would move back again to recovery, and actually we had better move back to recovery in this case, no?
>
> I'm strongly against this "optimization" - I'm more than happy to exchange the one fsync for not having to manually fix the database after a crash.
>
> I don't really see why switching back to recovery should be desirable in this case? Imagine you have a warm/hot standby, and that you promote it to master. The clients connect, start issuing commands and then the system crashes and loses the recovery.conf rename. The system reboots, the database performs local recovery but then starts as a standby and starts rejecting writes. That seems really weird to me.
>
>> 4) StartupXLOG for the tablespace map. I don't think this one is needed as well. Even if the tablespace map is not removed after a power loss, the user would get an error telling that the file should be removed.
>
> Please no, for the same reasons as in (3).
>
>> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the fsync, we guarantee that there is no need to do the recovery again. But in case of a power loss, isn't it better to do the recovery again?
>
> Why would it be better? Why should we do something twice when we don't have to? Had this not been reliable, then the whole recovery process is fundamentally broken and we'd better fix it instead of merely putting a band-aid on it.

Group shot with 3), 4) and 5). Well, there is no data loss here, putting me in the direction of considering this addition of an fsync as an optimization and not a bug.

>> 6) For the one after XLogArchiveNotify() for the last partial segment of the old timeline, it doesn't really matter to not make the change persistent, as this is mainly done because it is useful to identify that it is a partial segment.
>
> OK, although I still don't quite see why that should be a reason not to do the fsync. It's not really going to give us any measurable performance advantage (given how rarely we do those fsyncs), so I'd vote to keep it and make sure the partial segments are named accordingly. Less confusion is always better.

Check.

>> 7) In CancelBackup, this one is not needed as well I think. We would surely want to get back to recovery if those files remain after a power loss.
>
> I may be missing something, but why would we switch to recovery in this case?
>
>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are renaming a file with a temporary name to a permanent name. Once the node restarts, we would see an extra temporary file if the rename was not effective.
>
> So we'll lose the segment (won't have it locally under the permanent name), as we've already restored it and won't do that again. Is that really a good thing to do?

At this point, if a segment is restored from the archive and there is a power loss, we are going back to recovery. The advantage of having the fsync would be to ensure that the segment is not fetched twice.

>> 2) XLogArchiveForceDone, the only bad thing that would happen here is to leave behind a .ready file instead of a .done file. I guess that we could have it though as an optimization to not have to archive this file again.
>
> Yes.
>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to actually let the server know that the segment has already been archived, preventing a retry.
>
> Not sure what you mean by "could add it as an optimization"?

In case of a loss of the rename, the server would retry archiving it. So adding an fsync here ensures that the segment is marked as .done for good.

>> Thoughts?
>
> Thanks for the review and comments. I think the question is whether we want to do the additional fsync() only when its absence may ultimately lead to data loss, or also in cases where it may merely cause operational issues (e.g. switching back to recovery needlessly).
> I'd vote for the latter, as I think it makes the database easier to operate (less manual intervention) and the performance impact is 0 (as those fsyncs are really rare).

My first line of thought after looking at the patch is that I am not against adding those fsync calls on HEAD, as there is roughly an advantage to not go back to recovery in most cases and to ensure consistent names, but as they do not imply any data loss I would not encourage a back-patch. Adding them seems harmless at first sight, I agree, but those are not actual bugs.

--
Michael
On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 01/22/2016 06:45 AM, Michael Paquier wrote: > >> So, I have been playing with a Linux VM with VMware Fusion and on >> ext4 with data=ordered the renames are getting lost if the root >> folder is not fsync. By killing-9 the VM I am able to reproduce that >> really easily. > > > Yep. Same experience here (with qemu-kvm VMs). I still think a better approach for this is to run the database on an LVM volume and take lots of snapshots. No VM needed, though it doesn't hurt. LVM volumes are below the level of the filesystem and a snapshot captures the state of the raw blocks the filesystem has written to the block layer. The block layer does no caching though the drive may but neither the VM solution nor LVM would capture that. LVM snapshots would have the advantage that you can keep running the database and you can take lots of snapshots with relatively little overhead. Having dozens or hundreds of snapshots would be unacceptable performance drain in production but for testing it should be practical and they take relatively little space -- just the blocks changed since the snapshot was taken. -- greg
On 2016-01-22 21:32:29 +0900, Michael Paquier wrote: > Group shot with 3), 4) and 5). Well, there is no data loss here, > putting me in the direction of considering this addition of an fsync > as an optimization and not a bug. I think this is an extremely weak argument. The reasoning when exactly a loss of file is acceptable is complicated. In many cases adding an additional fsync won't add measurable cost, given the frequency of operations and/or the cost of surrounding operations. Now, if you can make an argument why something is potentially impacting performance *and* definitely not required: OK, then we can discuss that. Andres
On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote: > On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> On 01/22/2016 06:45 AM, Michael Paquier wrote: >> >>> So, I have been playing with a Linux VM with VMware Fusion and on >>> ext4 with data=ordered the renames are getting lost if the root >>> folder is not fsync. By killing-9 the VM I am able to reproduce that >>> really easily. >> >> >> Yep. Same experience here (with qemu-kvm VMs). > > I still think a better approach for this is to run the database on an > LVM volume and take lots of snapshots. No VM needed, though it doesn't > hurt. LVM volumes are below the level of the filesystem and a snapshot > captures the state of the raw blocks the filesystem has written to the > block layer. The block layer does no caching though the drive may but > neither the VM solution nor LVM would capture that. > > LVM snapshots would have the advantage that you can keep running the > database and you can take lots of snapshots with relatively little > overhead. Having dozens or hundreds of snapshots would be unacceptable > performance drain in production but for testing it should be practical > and they take relatively little space -- just the blocks changed since > the snapshot was taken. Another idea: hardcode a PANIC just after rename() with restart_after_crash = off (this needs is IsBootstrapProcess() checks). Once server crashes, kill-9 the VM. Then restart the VM and the Postgres instance with a new binary that does not have the PANIC, and see how things are moving on. There is a window of up to several seconds after the rename() call, so I guess that this would work. -- Michael
On 01/23/2016 02:35 AM, Michael Paquier wrote: > On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote: >> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >>> On 01/22/2016 06:45 AM, Michael Paquier wrote: >>> >>>> So, I have been playing with a Linux VM with VMware Fusion and on >>>> ext4 with data=ordered the renames are getting lost if the root >>>> folder is not fsync. By killing-9 the VM I am able to reproduce that >>>> really easily. >>> >>> >>> Yep. Same experience here (with qemu-kvm VMs). >> >> I still think a better approach for this is to run the database on an >> LVM volume and take lots of snapshots. No VM needed, though it doesn't >> hurt. LVM volumes are below the level of the filesystem and a snapshot >> captures the state of the raw blocks the filesystem has written to the >> block layer. The block layer does no caching though the drive may but >> neither the VM solution nor LVM would capture that. >> >> LVM snapshots would have the advantage that you can keep running the >> database and you can take lots of snapshots with relatively little >> overhead. Having dozens or hundreds of snapshots would be unacceptable >> performance drain in production but for testing it should be practical >> and they take relatively little space -- just the blocks changed since >> the snapshot was taken. > > Another idea: hardcode a PANIC just after rename() with > restart_after_crash = off (this needs is IsBootstrapProcess() checks). > Once server crashes, kill-9 the VM. Then restart the VM and the > Postgres instance with a new binary that does not have the PANIC, and > see how things are moving on. There is a window of up to several > seconds after the rename() call, so I guess that this would work. I don't see how that would improve anything, as the PANIC has no impact on the I/O requests already issued to the system. What you need is some sort of coordination between the database and the script that kills the VM (or takes a LVM snapshot). That can be done by simply emitting a particular log message, and the "kill script" may simply watch the file (for example over SSH). This has the benefit that you can also watch for additional conditions that are difficult to check from that particular part of the code (and only kill the VM when all of them trigger - for example only on the third checkpoint since start, and such). The reason why I was not particularly thrilled about the LVM snapshot idea is that to identify this particular data loss issue, you need to be able to reason about the expected state of the database (what transactions are committed, how many segments are there). And my understanding was that Greg's idea was merely "try to start the DB on a snapshot and see if starts / is not corrupted," which would not work with this particular issue, as the database seemed just fine - the data loss is silent. Adding the "last XLOG segment" into pg_controldata would make it easier to detect without having to track details about which transactions got committed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 23, 2016 at 11:39 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 01/23/2016 02:35 AM, Michael Paquier wrote: >> >> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote: >>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra >>> <tomas.vondra@2ndquadrant.com> wrote: >>> LVM snapshots would have the advantage that you can keep running the >>> database and you can take lots of snapshots with relatively little >>> overhead. Having dozens or hundreds of snapshots would be unacceptable >>> performance drain in production but for testing it should be practical >>> and they take relatively little space -- just the blocks changed since >>> the snapshot was taken. >> >> >> Another idea: hardcode a PANIC just after rename() with >> restart_after_crash = off (this needs is IsBootstrapProcess() checks). >> Once server crashes, kill-9 the VM. Then restart the VM and the >> Postgres instance with a new binary that does not have the PANIC, and >> see how things are moving on. There is a window of up to several >> seconds after the rename() call, so I guess that this would work. > > > I don't see how that would improve anything, as the PANIC has no impact on > the I/O requests already issued to the system. What you need is some sort of > coordination between the database and the script that kills the VM (or takes > a LVM snapshot). Well, to emulate the noise that non-renamed files have on the system we could simply emulate the loss of rename() by just commenting it out and then forcibly crash the instance or just PANIC the instance just before rename(). This would emulate what we are looking for, no? What we want to check is how the system reacts should an unwanted file be in place. For example, take the rename() call in InstallXLogFileSegment(), crashing with an non-effective rename() will cause the presence of an annoying xlogtemp file. Making the rename persistent would make the server complain about an invalid magic number in a segment that has just been created. -- Michael
On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> On 01/22/2016 06:45 AM, Michael Paquier wrote: >>> Here are some comments about your patch after a look at the code. >>> >>> Regarding the additions in fsync_fname() in xlog.c: >>> 1) In InstallXLogFileSegment, rename() will be called only if >>> HAVE_WORKING_LINK is not used, which happens only on Windows and >>> cygwin. We could add it for consistency, but it should be within the >>> #else/#endif block. It is not critical as of now. >>> 2) The call in RemoveXlogFile is not necessary, the rename happening >>> only on Windows. >> >> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could >> there be some other platforms? And are we sure the file systems on those >> platforms are safe without the fsync call? >> That is, while the report references ext4, there may be other file systems >> with the same problem - ext4 was used mostly as it's the most widely used >> Linux file system. > > From pg_config_manual.h: > #if !defined(WIN32) && !defined(__CYGWIN__) > #define HAVE_WORKING_LINK 1 > #endif > If we want to be consistent with what Posix proposes, I am not against > adding it. I did some tests with NTFS using cygwin, and the rename() calls remain even after powering off the VM. But I agree that adding an fsync() in both cases would be fine. >>> Thoughts? >> >> Thanks for the review and comments. I think the question is whether we only >> want to do the additional fsync() only when it ultimately may lead to data >> loss, or even in cases where it may cause operational issues (e.g. switching >> back to recovery needlessly). >> I'd vote for the latter, as I think it makes the database easier to operate >> (less manual interventions) and the performance impact is 0 (as those fsyncs >> are really rare). > > My first line of thoughts after looking at the patch is that I am not > against adding those fsync calls on HEAD as there is roughly an > advantage to not go back to recovery in most cases and ensure > consistent names, but as they do not imply any data loss I would not > encourage a back-patch. Adding them seems harmless at first sight I > agree, but those are not actual bugs. OK. It is true that PGDATA would be fsync'd in 4 code paths with your patch which are not that much taken: - Renaming tablespace map file and backup label file (three times) - Renaming to recovery.done So, what do you think about the patch attached? Moving the calls into the critical sections is not really necessary except when installing a new segment. -- Michael
Attachment
On 01/25/2016 08:30 AM, Michael Paquier wrote: > On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: ,,, >> My first line of thoughts after looking at the patch is that I am >> not against adding those fsync calls on HEAD as there is roughly >> an advantage to not go back to recovery in most cases and ensure >> consistent names, but as they do not imply any data loss I would >> not encourage a back-patch. Adding them seems harmless at first >> sight I agree, but those are not actual bugs. > > OK. It is true that PGDATA would be fsync'd in 4 code paths with your > patch which are not that much taken: > - Renaming tablespace map file and backup label file (three times) > - Renaming to recovery.done > So, what do you think about the patch attached? Moving the calls into > the critical sections is not really necessary except when installing a > new segment. Seems OK to me. Thanks for the time and improvements! Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Seems OK to me. Thanks for the time and improvements! Thanks. Perhaps a committer could have a look then? I have switched the patch as such in the CF app. Seeing the accumulated feedback upthread that's something that should be backpatched. -- Michael
Michael Paquier wrote: > On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Seems OK to me. Thanks for the time and improvements! > > Thanks. Perhaps a committer could have a look then? I have switched > the patch as such in the CF app. Seeing the accumulated feedback > upthread that's something that should be backpatched. Yeah. On 9.4 there are already some conflicts, and I'm sure there will be more in almost each branch. Does anyone want to volunteer for producing per-branch versions? The next minor release is to be tagged next week and it'd be good to put this fix there. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-25 16:30:47 +0900, Michael Paquier wrote: > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index a2846c4..b124f90 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, > tmppath, path))); > return false; > } > + > + /* > + * Make sure the rename is permanent by fsyncing the parent > + * directory. > + */ > + START_CRIT_SECTION(); > + fsync_fname(XLOGDIR, true); > + END_CRIT_SECTION(); > #endif Hm. I'm seriously doubtful that using critical sections for this is a good idea. What's the scenario you're protecting against by declaring this one? We intentionally don't error out in the isdir cases in fsync_fname() cases anyway. Afaik we need to fsync tmppath before and after the rename, and only then the directory, to actually be safe. > @@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog) > errmsg("could not rename file \"%s\" to \"%s\": %m", > RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE))); > > + /* Make sure the rename is permanent by fsyncing the data directory. */ > + fsync_fname(".", true); > + Shouldn't RECOVERY_COMMAND_DONE be fsynced first here? > ereport(LOG, > (errmsg("archive recovery complete"))); > } > @@ -6150,6 +6169,12 @@ StartupXLOG(void) > TABLESPACE_MAP, BACKUP_LABEL_FILE), > errdetail("Could not rename file \"%s\" to \"%s\": %m.", > TABLESPACE_MAP, TABLESPACE_MAP_OLD))); > + > + /* > + * Make sure the rename is permanent by fsyncing the data > + * directory. > + */ > + fsync_fname(".", true); > } Is it just me, or are the repeated four line comments a bit grating? > /* > @@ -6525,6 +6550,13 @@ StartupXLOG(void) > TABLESPACE_MAP, TABLESPACE_MAP_OLD))); > } > > + /* > + * Make sure the rename is permanent by fsyncing the parent > + * directory. > + */ > + if (haveBackupLabel || haveTblspcMap) > + fsync_fname(".", true); > + Isn't that redundant with the haveTblspcMap case above? > /* Check that the GUCs used to generate the WAL allow recovery */ > CheckRequiredParameterValues(); > > @@ -7305,6 +7337,12 @@ StartupXLOG(void) > errmsg("could not rename file \"%s\" to \"%s\": %m", > origpath, partialpath))); > XLogArchiveNotify(partialfname); > + > + /* > + * Make sure the rename is permanent by fsyncing the parent > + * directory. > + */ > + fsync_fname(XLOGDIR, true); .partial should be fsynced first. > } > } > } > @@ -10905,6 +10943,9 @@ CancelBackup(void) > BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, > TABLESPACE_MAP, TABLESPACE_MAP_OLD))); > } > + > + /* fsync the data directory to persist the renames */ > + fsync_fname(".", true); > } Same. > /* > diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c > index 277c14a..8dda80b 100644 > --- a/src/backend/access/transam/xlogarchive.c > +++ b/src/backend/access/transam/xlogarchive.c > @@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname) > path, xlogfpath))); > > /* > + * Make sure the renames are permanent by fsyncing the parent > + * directory. > + */ > + fsync_fname(XLOGDIR, true); Afaics the file under the temporary filename has not been fsynced up to here. > + /* > * Create .done file forcibly to prevent the restored segment from being > * archived again later. 
> */ > @@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog) > errmsg("could not rename file \"%s\" to \"%s\": %m", > archiveReady, archiveDone))); > > + /* > + * Make sure the rename is permanent by fsyncing the parent > + * directory. > + */ > + fsync_fname(XLOGDIR "/archive_status", true); > return; > } Afaics the AllocateFile() call below is not protected at all, no? How about introducing a 'safe_rename()' that does something roughly akin to: fsync(oldname); fsync(fname) || true; rename(oldfname, fname); fsync(fname); fsync(basename(fname)); I'd rather have that kind of logic somewhere once, instead of repeated a dozen times... Greetings, Andres Freund
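Spelled out with plain POSIX calls, one possible reading of that safe_rename() sequence is sketched below. The helper name and error handling are illustrative only; the routine that ultimately took this shape in the patch (rename_safe()) is built on fsync_fname() and fsync_parent_path() instead.

#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open a path and fsync it; optionally treat a missing file as success. */
static int
fsync_path(const char *path, int ignore_missing)
{
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
        return (ignore_missing && errno == ENOENT) ? 0 : -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Sketch of the sequence proposed above: flush the source, flush the
 * target if it already exists ("fsync(fname) || true"), rename, then
 * flush the target under its new name and finally its parent directory.
 */
static int
safe_rename_sketch(const char *oldpath, const char *newpath)
{
    char    dirbuf[1024];

    if (fsync_path(oldpath, 0) < 0)
        return -1;
    if (fsync_path(newpath, 1) < 0)
        return -1;

    if (rename(oldpath, newpath) < 0)
        return -1;

    if (fsync_path(newpath, 0) < 0)
        return -1;

    /* dirname() may modify its argument, so work on a copy. */
    strncpy(dirbuf, newpath, sizeof(dirbuf) - 1);
    dirbuf[sizeof(dirbuf) - 1] = '\0';
    return fsync_path(dirname(dirbuf), 0);
}

As with the patch itself, this assumes the old and new names live in the same directory; a rename across directories (on the same filesystem) would also need the old name's parent flushed, a point that comes up again later in the thread.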
On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote: > Yeah. On 9.4 there are already some conflicts, and I'm sure there will > be more in almost each branch. Does anyone want to volunteer for > producing per-branch versions? > The next minor release is to be tagged next week and it'd be good to put > this fix there. I don't think this is going to be ready for that. The risk of hurrying this through seems higher than the loss risk at this point. Andres
On Tue, Feb 2, 2016 at 1:08 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote: >> Yeah. On 9.4 there are already some conflicts, and I'm sure there will >> be more in almost each branch. Does anyone want to volunteer for >> producing per-branch versions? > >> The next minor release is to be tagged next week and it'd be good to put >> this fix there. > > I don't think this is going to be ready for that. The risk of hurrying > this through seems higher than the loss risk at this point. Agreed. And there is no actual risk of data loss, so let's not hurry and be sure that the right thing is pushed. -- Michael
On Tue, Feb 2, 2016 at 12:49 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Michael Paquier wrote: >> On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >> > Seems OK to me. Thanks for the time and improvements! >> >> Thanks. Perhaps a committer could have a look then? I have switched >> the patch as such in the CF app. Seeing the accumulated feedback >> upthread that's something that should be backpatched. > > Yeah. On 9.4 there are already some conflicts, and I'm sure there will > be more in almost each branch. Does anyone want to volunteer for > producing per-branch versions? I don't mind doing it once we have something fully-bloomed for master, something that I guess is very likely to apply easily to 9.5 as well. -- Michael
On 2016-02-02 09:56:40 +0900, Michael Paquier wrote: > And there is no actual risk of data loss Huh? - Andres
On Tue, Feb 2, 2016 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-01-25 16:30:47 +0900, Michael Paquier wrote: >> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c >> index a2846c4..b124f90 100644 >> --- a/src/backend/access/transam/xlog.c >> +++ b/src/backend/access/transam/xlog.c >> @@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, >> tmppath, path))); >> return false; >> } >> + >> + /* >> + * Make sure the rename is permanent by fsyncing the parent >> + * directory. >> + */ >> + START_CRIT_SECTION(); >> + fsync_fname(XLOGDIR, true); >> + END_CRIT_SECTION(); >> #endif > > Hm. I'm seriously doubtful that using critical sections for this is a > good idea. What's the scenario you're protecting against by declaring > this one? We intentionally don't error out in the isdir cases in > fsync_fname() cases anyway. > > Afaik we need to fsync tmppath before and after the rename, and only > then the directory, to actually be safe. Regarding the fsync call on the new file before the rename, would it be better to extend fsync_fname() with some kind of noerror flag to work around the case of a file that does not exist or do you think it is better just to use pg_fsync in this case after getting an fd? Using directly pg_fsync() looks redundant with what fsync_fname does already. >> @@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog) >> errmsg("could not rename file \"%s\" to \"%s\": %m", >> RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE))); >> >> + /* Make sure the rename is permanent by fsyncing the data directory. */ >> + fsync_fname(".", true); >> + > > Shouldn't RECOVERY_COMMAND_DONE be fsynced first here? Check. >> /* >> @@ -6525,6 +6550,13 @@ StartupXLOG(void) >> TABLESPACE_MAP, TABLESPACE_MAP_OLD))); >> } >> >> + /* >> + * Make sure the rename is permanent by fsyncing the parent >> + * directory. >> + */ >> + if (haveBackupLabel || haveTblspcMap) >> + fsync_fname(".", true); >> + > > Isn't that redundant with the haveTblspcMap case above? I am not sure I get your point here. Are you referring to the fact that fsync should be done after each rename in this case? >> /* Check that the GUCs used to generate the WAL allow recovery */ >> CheckRequiredParameterValues(); >> >> @@ -7305,6 +7337,12 @@ StartupXLOG(void) >> errmsg("could not rename file \"%s\" to \"%s\": %m", >> origpath, partialpath))); >> XLogArchiveNotify(partialfname); >> + >> + /* >> + * Make sure the rename is permanent by fsyncing the parent >> + * directory. >> + */ >> + fsync_fname(XLOGDIR, true); > > .partial should be fsynced first. Check. >> } >> } >> } >> @@ -10905,6 +10943,9 @@ CancelBackup(void) >> BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, >> TABLESPACE_MAP, TABLESPACE_MAP_OLD))); >> } >> + >> + /* fsync the data directory to persist the renames */ >> + fsync_fname(".", true); >> } > > Same. Re-check. >> /* >> diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c >> index 277c14a..8dda80b 100644 >> --- a/src/backend/access/transam/xlogarchive.c >> +++ b/src/backend/access/transam/xlogarchive.c >> @@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname) >> path, xlogfpath))); >> >> /* >> + * Make sure the renames are permanent by fsyncing the parent >> + * directory. >> + */ >> + fsync_fname(XLOGDIR, true); > > Afaics the file under the temporary filename has not been fsynced up to > here. Yes, true, the old file... 
>> + /* >> * Create .done file forcibly to prevent the restored segment from being >> * archived again later. >> */ >> @@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog) >> errmsg("could not rename file \"%s\" to \"%s\": %m", >> archiveReady, archiveDone))); >> >> + /* >> + * Make sure the rename is permanent by fsyncing the parent >> + * directory. >> + */ >> + fsync_fname(XLOGDIR "/archive_status", true); >> return; >> } > > Afaics the AllocateFile() call below is not protected at all, no? Yep. > How about introducing a 'safe_rename()' that does something roughly akin > to: > fsync(oldname); > fsync(fname) || true; > rename(oldfname, fname); > fsync(fname); > fsync(basename(fname)); > > I'd rather have that kind of logic somewhere once, instead of repeated a > dozen times... Not wrong, and this leads to the following: void rename_safe(const char *old, const char *new, bool isdir, int elevel); Controlling elevel is necessary per the multiple code paths that would use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does that look fine? -- Michael
On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-02-02 09:56:40 +0900, Michael Paquier wrote: >> And there is no actual risk of data loss > > Huh? More precise: what I mean here is that should an OS crash or a power failure happen, we would fall back to recovery at next restart, so we would not actually *lose* data. -- Michael
On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote: > Not wrong, and this leads to the following: > void rename_safe(const char *old, const char *new, bool isdir, int elevel); > Controlling elevel is necessary per the multiple code paths that would > use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does > that look fine? After really coding it, I finished with the following thing: +int +rename_safe(const char *old, const char *new) There is no need to extend that for directories, well we could of course but all the renames happen on files so I see no need to make that more complicated. More refactoring of the other rename() calls could be done as well by extending rename_safe() with a flag to fsync the data within a critical section, particularly for the replication slot code. I have let that out to not complicate more the patch. -- Michael
Attachment
On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote: >> Not wrong, and this leads to the following: >> void rename_safe(const char *old, const char *new, bool isdir, int elevel); >> Controlling elevel is necessary per the multiple code paths that would >> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does >> that look fine? > > After really coding it, I finished with the following thing: > +int > +rename_safe(const char *old, const char *new) > > There is no need to extend that for directories, well we could of > course but all the renames happen on files so I see no need to make > that more complicated. More refactoring of the other rename() calls > could be done as well by extending rename_safe() with a flag to fsync > the data within a critical section, particularly for the replication > slot code. I have let that out to not complicate more the patch. Andres just poked me (2m far from each other now) regarding the fact that we should fsync even after the link() calls when HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea culpa. And the patch misses that. -- Michael
On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote: >>> Not wrong, and this leads to the following: >>> void rename_safe(const char *old, const char *new, bool isdir, int elevel); >>> Controlling elevel is necessary per the multiple code paths that would >>> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does >>> that look fine? >> >> After really coding it, I finished with the following thing: >> +int >> +rename_safe(const char *old, const char *new) >> >> There is no need to extend that for directories, well we could of >> course but all the renames happen on files so I see no need to make >> that more complicated. More refactoring of the other rename() calls >> could be done as well by extending rename_safe() with a flag to fsync >> the data within a critical section, particularly for the replication >> slot code. I have let that out to not complicate more the patch. > > Andres just poked me (2m far from each other now) regarding the fact > that we should fsync even after the link() calls when > HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea > culpa. And the patch misses that. So, attached is an updated patch that adds a new routine link_safe() to ensure that a hard link is on-disk persistent. link() is called twice in timeline.c and once in xlog.c, so those three code paths are impacted. I noticed as well that my previous patch was sometimes doing palloc calls in a critical section (oops), I fixed that on the way. Thoughts welcome. -- Michael
Attachment
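The same consideration applies to link() as to rename(): creating the new directory entry is atomic, but it is not durable until the entry under its new name and the containing directory have been flushed. A reduced POSIX illustration of what link_safe() is meant to guarantee (the patch itself does this through fsync_fname() and fsync_parent_path()):

#include <fcntl.h>
#include <unistd.h>

/*
 * Create a hard link and make it durable: flush the file under its new
 * name, then the directory that now contains the new entry.  The parent
 * directory is passed explicitly to keep the sketch short.
 */
static int
durable_link_sketch(const char *oldpath, const char *newpath,
                    const char *newdir)
{
    int     fd;

    if (link(oldpath, newpath) < 0)
        return -1;

    fd = open(newpath, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    fd = open(newdir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}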
On 02/04/2016 09:59 AM, Michael Paquier wrote: > On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote: >>> And there is no actual risk of data loss >> >> Huh? > > More precise: what I mean here is that should an OS crash or a power > failure happen, we would fall back to recovery at next restart, so we > would not actually *lose* data. Except that we actually can't perform the recovery properly because we may not have the last WAL segment (or multiple segments), so we can't replay the last batch of transactions. And we don't even notice that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 02/04/2016 09:59 AM, Michael Paquier wrote: >> >> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote: >>> >>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote: >>>> >>>> And there is no actual risk of data loss >>> >>> >>> Huh? >> >> >> More precise: what I mean here is that should an OS crash or a power >> failure happen, we would fall back to recovery at next restart, so we >> would not actually *lose* data. > > > Except that we actually can't perform the recovery properly because we may > not have the last WAL segment (or multiple segments), so we can't replay the > last batch of transactions. And we don't even notice that. Still the data is here... But well. I won't insist. Tomas, could you have a look at the latest patch I wrote? It would be good to get fresh eyes on it. We could work on a version for ~9.4 once we have a clean approach for master/9.5. -- Michael
Hi, On 02/06/2016 01:16 PM, Michael Paquier wrote: > On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> On 02/04/2016 09:59 AM, Michael Paquier wrote: >>> >>> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote: >>>> >>>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote: >>>>> >>>>> And there is no actual risk of data loss >>>> >>>> >>>> Huh? >>> >>> >>> More precise: what I mean here is that should an OS crash or a >>> power failure happen, we would fall back to recovery at next >>> restart, so we would not actually *lose* data. >> >> >> Except that we actually can't perform the recovery properly >> because we may not have the last WAL segment (or multiple >> segments), so we can't replay the last batch of transactions. And >> we don't even notice that. > > Still the data is here... But well. I won't insist. Huh? This thread started by an example how to cause loss of committed transactions. That fits my definition of "data loss" quite well. > Tomas, could you have a look at the latest patch I wrote? It would be > good to get fresh eyes on it. We could work on a version for ~9.4 > once we have a clean approach for master/9.5. Yep, I'll take a look - I've been out of office for the past 2 weeks, but I've been following the discussion and I agree with the changes discussed there (e.g. adding safe_rename and such). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-02-06 17:43:48 +0100, Tomas Vondra wrote: > >Still the data is here... But well. I won't insist. > > Huh? This thread started by an example how to cause loss of committed > transactions. That fits my definition of "data loss" quite well. Agreed, that view doesn't seem to make much sense. This clearly is a data loss issue. Andres
Hi, On 02/05/2016 10:40 AM, Michael Paquier wrote: > On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote: ... > So, attached is an updated patch that adds a new routine link_safe() > to ensure that a hard link is on-disk persistent. link() is called > twice in timeline.c and once in xlog.c, so those three code paths > are impacted. I noticed as well that my previous patch was sometimes > doing palloc calls in a critical section (oops), I fixed that on the > way. I've finally got around to review the v5 version of the patch. Sorry it took so long (I blame FOSDEM, country-wide flu epidemic and my general laziness). I do like most of the changes to the patch, thanks for improving it. A few comments though: 1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? The only place where we use missing_ok=true is in rename_safe, where right at the beginning we do this: fsync_fname(newfile, false, true); I.e. we're fsyncing the rename target first, in case it exists. But that seems to be conflicting with the comments in xlog.c where we explicitly state that the target file should not exist. Why should it be OK to call rename_safe() when the target already exists? If this really is the right thing to do, it should be explained in the comment above rename_safe(), probably. 2) If rename_safe/link_safe are meant as crash-safe replacements for rename/link, then perhaps we should use the same signatures, including the "const" pointer parameters. So while currently the signatures look like this: int rename_safe(char *oldfile, char *newfile); int link_safe(char *oldfile, char *newfile); it should probably look like this int rename_safe(const char *oldfile, const char *newfile); int link_safe(const char *oldfile, const char *newfile); I've noticed this in CheckPointReplicationOrigin() where the current code has to cast the parameters to (char*) to silence the compiler. 3) Both rename_safe and link_safe do this at the very end: fsync_parent_path(newfile); That however assumes both the oldfile and newfile are placed in the same directory - otherwise we'd fsync only one of them. I don't think we have a place where we're renaming files between directories (or do we), so we're OK with respect to this. But it seems like a good idea to defend against this, or at least mention that in the comments. 4) nitpicking: There are some unnecessary newlines added/removed in RemoveXlogFile, XLogArchiveForceDone. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 24, 2016 at 7:26 AM, Tomas Vondra wrote: > 1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? The > only place where we use missing_ok=true is in rename_safe, where right at > the beginning we do this: > > fsync_fname(newfile, false, true); > > I.e. we're fsyncing the rename target first, in case it exists. But that > seems to be conflicting with the comments in xlog.c where we explicitly > state that the target file should not exist. Why should it be OK to call > rename_safe() when the target already exists? If this really is the right > thing to do, it should be explained in the comment above rename_safe(), > probably. The point is to mimic rename(), which can manage the case where the new entry exists, and to look for a consistent on-disk behavior. It is true that the new argument of fsync_fname is actually not necessary, we could just check for the existence of the new entry with stat(), and perform an fsync if it exists. I have added the following comment in rename_safe(): + /* + * First fsync the old entry and new entry, it this one exists, to ensure + * that they are properly persistent on disk. Calling this routine with + * an existing new target file is fine, rename() will first remove it + * before performing its operation. + */ How does that look? > 2) If rename_safe/link_safe are meant as crash-safe replacements for > rename/link, then perhaps we should use the same signatures, including the > "const" pointer parameters. So while currently the signatures look like > this: > > int rename_safe(char *oldfile, char *newfile); > int link_safe(char *oldfile, char *newfile); > > it should probably look like this > > int rename_safe(const char *oldfile, const char *newfile); > int link_safe(const char *oldfile, const char *newfile); > > I've noticed this in CheckPointReplicationOrigin() where the current code > has to cast the parameters to (char*) to silence the compiler. I recall considering that, and the reason why I did not do so was that fsync_fname() is not doing it either, because OpenTransientFile() is using FileName which is not defined as a constant. At the end we'd finish with one or two casts anyway. So it seems that changing fsync_fname makes sense though by looking at fsync_fname_ext, OpenTransientFile() is using an explicit cast for the file name. > 3) Both rename_safe and link_safe do this at the very end: > > fsync_parent_path(newfile); > > That however assumes both the oldfile and newfile are placed in the same > directory - otherwise we'd fsync only one of them. I don't think we have a > place where we're renaming files between directories (or do we), so we're OK > with respect to this. But it seems like a good idea to defend against this, > or at least mention that in the comments. No, we don't have a code path where a file is renamed between different directories, which is why this code is doing so. The comment is a good addition to have though, so I have added it. I guess that we could complicate more this patch to check if the parent directories of the new and old entries match or not, then fsync both of them, but I'd rather keep things simple. > 4) nitpicking: There are some unnecessary newlines added/removed in > RemoveXlogFile, XLogArchiveForceDone. Fixed. I missed those three ones. -- Michael
Attachment
Hi, > +/* > + * rename_safe -- rename of a file, making it on-disk persistent > + * > + * This routine ensures that a rename file persists in case of a crash by using > + * fsync on the old and new files before and after performing the rename so as > + * this categorizes as an all-or-nothing operation. > + */ > +int > +rename_safe(const char *oldfile, const char *newfile) > +{ > + struct stat filestats; > + > + /* > + * First fsync the old entry and new entry, it this one exists, to ensure > + * that they are properly persistent on disk. Calling this routine with > + * an existing new target file is fine, rename() will first remove it > + * before performing its operation. > + */ > + fsync_fname(oldfile, false); > + if (stat(newfile, &filestats) == 0) > + fsync_fname(newfile, false); I don't think we want any stat()s here. I'd much, much rather check open for ENOENT. > +/* > + * link_safe -- make a file hard link, making it on-disk persistent > + * > + * This routine ensures that a hard link created on a file persists on system > + * in case of a crash by using fsync where on the link generated as well as on > + * its parent directory. > + */ > +int > +link_safe(const char *oldfile, const char *newfile) > +{ If we go for a new abstraction here, I'd rather make it 'replace_file_safe' or something, and move the link/rename code #ifdef into it. > + if (link(oldfile, newfile) < 0) > + return -1; > + > + /* > + * Make the link persistent in case of an OS crash, the new entry > + * generated as well as its parent directory need to be flushed. > + */ > + fsync_fname(newfile, false); > + > + /* > + * Same for parent directory. This routine is never called to rename > + * files across directories, but if this proves to become the case, > + * flushing the parent directory if the old file would be necessary. > + */ > + fsync_parent_path(newfile); > + return 0; I think it's a seriously bad idea to encode that knowledge in such a general sounding routine. We could however argue that this is about safely replacing the *target* file; not about safely removing the old file. Currently I'm inclined to apply this to master soon. But I think we might want to wait a while with backpatching. The recent fsync upgrade disaster kinda makes me a bit careful... Greetings, Andres Freund
On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, Thanks for the review. >> +/* >> + * rename_safe -- rename of a file, making it on-disk persistent >> + * >> + * This routine ensures that a rename file persists in case of a crash by using >> + * fsync on the old and new files before and after performing the rename so as >> + * this categorizes as an all-or-nothing operation. >> + */ >> +int >> +rename_safe(const char *oldfile, const char *newfile) >> +{ >> + struct stat filestats; >> + >> + /* >> + * First fsync the old entry and new entry, it this one exists, to ensure >> + * that they are properly persistent on disk. Calling this routine with >> + * an existing new target file is fine, rename() will first remove it >> + * before performing its operation. >> + */ >> + fsync_fname(oldfile, false); >> + if (stat(newfile, &filestats) == 0) >> + fsync_fname(newfile, false); > > I don't think we want any stat()s here. I'd much, much rather check open > for ENOENT. OK. So you mean more or less that, right? int fd; fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0); if (fd < 0) { if (errno != ENOENT) return -1; } else { pg_fsync(fd); CloseTransientFile(fd); } >> +/* >> + * link_safe -- make a file hard link, making it on-disk persistent >> + * >> + * This routine ensures that a hard link created on a file persists on system >> + * in case of a crash by using fsync where on the link generated as well as on >> + * its parent directory. >> + */ >> +int >> +link_safe(const char *oldfile, const char *newfile) >> +{ > > If we go for a new abstraction here, I'd rather make it > 'replace_file_safe' or something, and move the link/rename code #ifdef > into it. Hm. OK. I don't see any reason why switching to link() even in code paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would be a problem thinking about it. Should HAVE_WORKING_LINK be available on a platform we can combine it with unlink. Is that in line with what you think? >> + if (link(oldfile, newfile) < 0) >> + return -1; >> + >> + /* >> + * Make the link persistent in case of an OS crash, the new entry >> + * generated as well as its parent directory need to be flushed. >> + */ >> + fsync_fname(newfile, false); >> + >> + /* >> + * Same for parent directory. This routine is never called to rename >> + * files across directories, but if this proves to become the case, >> + * flushing the parent directory if the old file would be necessary. >> + */ >> + fsync_parent_path(newfile); >> + return 0; > > I think it's a seriously bad idea to encode that knowledge in such a > general sounding routine. We could however argue that this is about > safely replacing the *target* file; not about safely removing the old > file. Not sure I am following here. Are you referring to the fact that if the new file and old file are on different directories would make this routine unreliable? Because yes that's the case if we want to make both of them persistent, and I think we want to do so. Do you suggest to correct this comment to remove the mention to the old file's parent directory because we just care about having the new file as being persistent? Or do you suggest that we should actually extend this routine so as we fsync both the new and old file's parent directory if they differ? > Currently I'm inclined to apply this to master soon. But I think we > might want to wait a while with backpatching. The recent fsync upgrade > disaster kinda makes me a bit careful... 
There have not been actual field complaints about that yet, so I'm fine with waiting a bit. -- Michael
I would like to have a patch for this finalized today, so that we can apply to master before or during the weekend; with it in the tree for about a week we can be more confident and backpatch close to next weekend, so that we see it in the next set of minor releases. Does that sound good? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I would like to have a patch for this finalized today, so that we can > apply to master before or during the weekend; with it in the tree for > about a week we can be more confident and backpatch close to next > weekend, so that we see it in the next set of minor releases. Does that > sound good? I see no reason to wait before backpatching. If you're concerned about having testing, the more branches it is in, the more buildfarm cycles you will get on it. And we're not going to cut any releases in between, so what's the benefit of not having it there? regards, tom lane
On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> I would like to have a patch for this finalized today, so that we can >> apply to master before or during the weekend; with it in the tree for >> about a week we can be more confident and backpatch close to next >> weekend, so that we see it in the next set of minor releases. Does that >> sound good? > > I see no reason to wait before backpatching. If you're concerned about > having testing, the more branches it is in, the more buildfarm cycles > you will get on it. And we're not going to cut any releases in between, > so what's the benefit of not having it there? Agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 5, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >>> I would like to have a patch for this finalized today, so that we can >>> apply to master before or during the weekend; with it in the tree for >>> about a week we can be more confident and backpatch close to next >>> weekend, so that we see it in the next set of minor releases. Does that >>> sound good? >> >> I see no reason to wait before backpatching. If you're concerned about >> having testing, the more branches it is in, the more buildfarm cycles >> you will get on it. And we're not going to cut any releases in between, >> so what's the benefit of not having it there? > > Agreed. OK. I could produce that by tonight my time, not before unfortunately. And FWIW, per the comments of Andres, it is not clear to me what we gain by having a common routine for link() and rename() knowing that half the code paths performing a rename do not rely on link(). At least it sound dangerous to me to introduce a dependency to link() in code paths that depend just on rename() for back branches. On HEAD, we could be more adventurous for sure. Regarding the replacement of stat() by something relying on OpenTransientFile I agree though. For the flush of the parent directory in link_safe() we'd still want to do it, and we are fine to not flush the parent directory of the old file because the backend does not move files across paths. -- Michael
On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: > On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Thanks for the review. > > >> +/* > >> + * rename_safe -- rename of a file, making it on-disk persistent > >> + * > >> + * This routine ensures that a rename file persists in case of a crash by using > >> + * fsync on the old and new files before and after performing the rename so as > >> + * this categorizes as an all-or-nothing operation. > >> + */ > >> +int > >> +rename_safe(const char *oldfile, const char *newfile) > >> +{ > >> + struct stat filestats; > >> + > >> + /* > >> + * First fsync the old entry and new entry, it this one exists, to ensure > >> + * that they are properly persistent on disk. Calling this routine with > >> + * an existing new target file is fine, rename() will first remove it > >> + * before performing its operation. > >> + */ > >> + fsync_fname(oldfile, false); > >> + if (stat(newfile, &filestats) == 0) > >> + fsync_fname(newfile, false); > > > > I don't think we want any stat()s here. I'd much, much rather check open > > for ENOENT. > > OK. So you mean more or less that, right? > int fd; > fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0); > if (fd < 0) > { > if (errno != ENOENT) > return -1; > } > else > { > pg_fsync(fd); > CloseTransientFile(fd); > } Yes. Otherwise the check is racy: The file could be gone by the time you do the fsync; leading to a spurious ERROR (which often would get promoted to a PANIC). > >> +/* > >> + * link_safe -- make a file hard link, making it on-disk persistent > >> + * > >> + * This routine ensures that a hard link created on a file persists on system > >> + * in case of a crash by using fsync where on the link generated as well as on > >> + * its parent directory. > >> + */ > >> +int > >> +link_safe(const char *oldfile, const char *newfile) > >> +{ > > > > If we go for a new abstraction here, I'd rather make it > > 'replace_file_safe' or something, and move the link/rename code #ifdef > > into it. > > Hm. OK. I don't see any reason why switching to link() even in code > paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would > be a problem thinking about it. Should HAVE_WORKING_LINK be available > on a platform we can combine it with unlink. Is that in line with what > you think? I wasn't trying to suggest we should replace all rename codepaths with the link wrapper, just the ones that already have a HAVE_WORKING_LINK check. The name of the routine I suggested is bad though... > >> + if (link(oldfile, newfile) < 0) > >> + return -1; > >> + > >> + /* > >> + * Make the link persistent in case of an OS crash, the new entry > >> + * generated as well as its parent directory need to be flushed. > >> + */ > >> + fsync_fname(newfile, false); > >> + > >> + /* > >> + * Same for parent directory. This routine is never called to rename > >> + * files across directories, but if this proves to become the case, > >> + * flushing the parent directory if the old file would be necessary. > >> + */ > >> + fsync_parent_path(newfile); > >> + return 0; > > > > I think it's a seriously bad idea to encode that knowledge in such a > > general sounding routine. We could however argue that this is about > > safely replacing the *target* file; not about safely removing the old > > file. > > Not sure I am following here. Are you referring to the fact that if > the new file and old file are on different directories would make this > routine unreliable? Yes. 
> Because yes that's the case if we want to make both of them > persistent, and I think we want to do so. That's one way. > Do you suggest to correct this comment to remove the mention to the > old file's parent directory because we just care about having the new > file as being persistent? That's one approach, yes. Combined with the fact that you can't actually reliably rename across directories, the two could be on different filesystems after all, that'd be a suitable defense. It just needs to be properly documented in the function header, not at the bottom. Regards, Andres
On 2016-03-05 07:29:35 +0900, Michael Paquier wrote: > OK. I could produce that by tonight my time, not before unfortunately. I'm switching to this patch, after pushing the pending logical decoding fixes. Probably not today, but tomorrow PST afternoon should work. > And FWIW, per the comments of Andres, it is not clear to me what we > gain by having a common routine for link() and rename() knowing that > half the code paths performing a rename do not rely on link(). I'm not talking about replacing all renames with this. Just the ones that currently use link(). There's not much point in introducing link_safe(), when all the callers have the same duplicated code, with a fallback to rename(). Andres
On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote: >> > I don't think we want any stat()s here. I'd much, much rather check open >> > for ENOENT. >> >> OK. So you mean more or less that, right? >> int fd; >> fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0); >> if (fd < 0) >> { >> if (errno != ENOENT) >> return -1; >> } >> else >> { >> pg_fsync(fd); >> CloseTransientFile(fd); >> } > > Yes. Otherwise the check is racy: The file could be gone by the time you > do the fsync; leading to a spurious ERROR (which often would get > promoted to a PANIC). Yeah, that makes sense. >> >> +/* >> >> + * link_safe -- make a file hard link, making it on-disk persistent >> >> + * >> >> + * This routine ensures that a hard link created on a file persists on system >> >> + * in case of a crash by using fsync where on the link generated as well as on >> >> + * its parent directory. >> >> + */ >> >> +int >> >> +link_safe(const char *oldfile, const char *newfile) >> >> +{ >> > >> > If we go for a new abstraction here, I'd rather make it >> > 'replace_file_safe' or something, and move the link/rename code #ifdef >> > into it. >> >> Hm. OK. I don't see any reason why switching to link() even in code >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would >> be a problem thinking about it. Should HAVE_WORKING_LINK be available >> on a platform we can combine it with unlink. Is that in line with what >> you think? > > I wasn't trying to suggest we should replace all rename codepaths with > the link wrapper, just the ones that already have a HAVE_WORKING_LINK > check. The name of the routine I suggested is bad though... So we'd introduce a first routine rename_or_link_safe(), say replace_safe(). >> Do you suggest to correct this comment to remove the mention to the >> old file's parent directory because we just care about having the new >> file as being persistent? > > That's one approach, yes. Combined with the fact that you can't actually > reliably rename across directories, the two could be on different > filesystems after all, that'd be a suitable defense. It just needs to be > properly documented in the function header, not at the bottom. OK. Got it. Or the two could be on the same filesystem. Still, link() and rename() do not support doing their stuff on different filesystems (EXDEV). -- Michael
On Sat, Mar 5, 2016 at 7:37 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-05 07:29:35 +0900, Michael Paquier wrote: >> OK. I could produce that by tonight my time, not before unfortunately. > > I'm switching to this patch, after pushing the pending logical decoding > fixes. Probably not today, but tomorrow PST afternoon should work. OK, so if that's the case there is no need to step on your toes, as seen from here. >> And FWIW, per the comments of Andres, it is not clear to me what we >> gain by having a common routine for link() and rename() knowing that >> half the code paths performing a rename do not rely on link(). > > I'm not talking about replacing all renames with this. Just the ones > that currently use link(). There's not much point in introducing > link_safe(), when all the callers have the same duplicated code, with a > fallback to rename(). Indeed, that's the case. I don't have a better name than replace_safe though. replace_paranoid is not a very appealing name either. -- Michael
On 2016-03-05 07:43:00 +0900, Michael Paquier wrote: > On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: > >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote: > >> Hm. OK. I don't see any reason why switching to link() even in code > >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would > >> be a problem thinking about it. Should HAVE_WORKING_LINK be available > >> on a platform we can combine it with unlink. Is that in line with what > >> you think? > > > > I wasn't trying to suggest we should replace all rename codepaths with > > the link wrapper, just the ones that already have a HAVE_WORKING_LINK > > check. The name of the routine I suggested is bad though... > > So we'd introduce a first routine rename_or_link_safe(), say replace_safe(). Or actually maybe just link_safe(), which falls back to access() && rename() if !HAVE_WORKING_LINK. > > That's one approach, yes. Combined with the fact that you can't actually > > reliably rename across directories, the two could be on different > > filesystems after all, that'd be a suitable defense. It just needs to be > > properly documented in the function header, not at the bottom. > > OK. Got it. Or the two could be on the same filesystem. > Still, link() and rename() do not support doing their stuff on > different filesystems (EXDEV). That's my point ...
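Put together, the wrapper being discussed might look roughly like the sketch below: a single routine that installs a file under a new name without overwriting an existing target, using link() plus unlink() where hard links work and falling back to an access() check plus rename() otherwise, with the durability flushes in one place. All names and details here are illustrative; this is not the code that was eventually committed.

#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the pg_config_manual.h definition quoted earlier in the
 * thread; defined unconditionally for the purpose of this sketch. */
#define HAVE_WORKING_LINK 1

/* Open a path (file or directory) and fsync it; same idea as the
 * helper in the earlier sketch, repeated so this stands alone. */
static int
flush_path(const char *path)
{
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Install oldpath under newpath without clobbering an existing target,
 * then make the new entry durable.
 */
static int
replace_file_sketch(const char *oldpath, const char *newpath)
{
    char    dirbuf[1024];

#ifdef HAVE_WORKING_LINK
    if (link(oldpath, newpath) < 0)
        return -1;              /* fails with EEXIST if newpath exists */
    if (unlink(oldpath) < 0)
        return -1;
#else
    /* Racy check, unlike link(): rename() would silently overwrite. */
    if (access(newpath, F_OK) == 0 || errno != ENOENT)
        return -1;
    if (rename(oldpath, newpath) < 0)
        return -1;
#endif

    /* Flush the file under its new name, then its parent directory. */
    if (flush_path(newpath) < 0)
        return -1;
    strncpy(dirbuf, newpath, sizeof(dirbuf) - 1);
    dirbuf[sizeof(dirbuf) - 1] = '\0';
    return flush_path(dirname(dirbuf));
}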
On Sat, Mar 5, 2016 at 7:47 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-05 07:43:00 +0900, Michael Paquier wrote: >> On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote: >> >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote: >> >> Hm. OK. I don't see any reason why switching to link() even in code >> >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would >> >> be a problem thinking about it. Should HAVE_WORKING_LINK be available >> >> on a platform we can combine it with unlink. Is that in line with what >> >> you think? >> > >> > I wasn't trying to suggest we should replace all rename codepaths with >> > the link wrapper, just the ones that already have a HAVE_WORKING_LINK >> > check. The name of the routine I suggested is bad though... >> >> So we'd introduce a first routine rename_or_link_safe(), say replace_safe(). > > Or actually maybe just link_safe(), which falls back to access() && > rename() if !HAVE_WORKING_LINK. > >> > That's one approach, yes. Combined with the fact that you can't actually >> > reliably rename across directories, the two could be on different >> > filesystems after all, that'd be a suitable defense. It just needs to be >> > properly documented in the function header, not at the bottom. >> >> OK. Got it. Or the two could be on the same filesystem. > >> Still, link() and rename() do not support doing their stuff on >> different filesystems (EXDEV). > > That's my point ... OK, I hacked a v7: - Move the link()/rename() group with HAVE_WORKING_LINK into a single routine, making the previous link_safe renamed to replace_safe. This is sharing a lot of things with rename_safe. I am not sure it is worth complicating the code more this way by having a common single routine for whole. Thoughts welcome. Honestly, I kind of liked the separation with link_safe/rename_safe of previous patches because link_safe could have been directly used by extensions and plugins btw. - Remove the call of stat() in rename_safe() and implement a logic depending on OpenTransientFile()/pg_fsync() to flush any existing target file before performing the rename. Andres, feel free to use this patch as a base, perhaps that will help. -- Michael
Attachment
On 2016-03-05 22:25:36 +0900, Michael Paquier wrote: > OK, I hacked a v7: > - Move the link()/rename() group with HAVE_WORKING_LINK into a single > routine, making the previous link_safe renamed to replace_safe. This > is sharing a lot of things with rename_safe. I am not sure it is worth > complicating the code more this way by having a common single routine > for whole. Thoughts welcome. Honestly, I kind of liked the separation > with link_safe/rename_safe of previous patches because link_safe could > have been directly used by extensions and plugins btw. > - Remove the call of stat() in rename_safe() and implement a logic > depending on OpenTransientFile()/pg_fsync() to flush any existing > target file before performing the rename. > Andres, feel free to use this patch as a base, perhaps that will help. I started working on this; delayed by taking longer than planned on the logical decoding stuff (quite a bit complicated by e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the error handling as it is right now. For one, you have rename_safe return error codes, and act on them in the callers, but on the other hand you call fsync_fname which always errors out in case of failure. I also don't like the new messages much. Will continue working on this tomorrow. Andres Freund
Hi, On 2016-03-05 19:54:05 -0800, Andres Freund wrote: > I started working on this; delayed by taking longer than planned on the > logical decoding stuff (quite a bit complicated by > e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the > error handling as it is right now. For one, you have rename_safe return > error codes, and act on them in the callers, but on the other hand you > call fsync_fname which always errors out in case of failure. I also > don't like the new messages much. > > Will continue working on this tomorrow. So, here's my current version of this. I've not performed any testing yet, and it's hot off the press. There's some comment smithing needed. But otherwise I'm starting to like this. Changes: * renamed rename_safe to durable_rename * renamed replace_safe to durable_link_or_rename (there was never any replacing going on) * pass through elevel to the underlying routines, otherwise we could error out with ERROR when we don't want to. That's particularly important in case of things like InstallXLogFileSegment(). * made fsync_fname use fsync_fname_ext, add 'accept permission errors' param * have walkdir call a wrapper, to add ignore_perms param What do you think? I sure wish we had the recovery testing et al. in all the back branches... - Andres
Attachment
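Roughly, the durable_rename() being described performs: fsync the source file under its old name, rename, then fsync the new name and its containing directory (with the flush of any pre-existing target, shown in the earlier sketch, slotting in before the rename). A condensed backend-flavoured sketch of that sequence, where fsync_one() stands in for a hypothetical fsync helper that reports failures at 'elevel' and returns -1 rather than always erroring out — the error-handling point raised above; this is an illustration of the ordering, not the committed code:

    static int
    durable_rename_sketch(const char *oldfile, const char *newfile, int elevel)
    {
        char        parentpath[MAXPGPATH];

        /* 1. Make the source file's contents durable under its old name. */
        if (fsync_one(oldfile, false, elevel) != 0)
            return -1;

        /* 2. The rename itself: atomic, but not yet durable. */
        if (rename(oldfile, newfile) < 0)
        {
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not rename file \"%s\" to \"%s\": %m",
                            oldfile, newfile)));
            return -1;
        }

        /* 3. Make the rename durable: the new name, then its directory. */
        if (fsync_one(newfile, false, elevel) != 0)
            return -1;

        strlcpy(parentpath, newfile, MAXPGPATH);
        get_parent_directory(parentpath);
        if (strlen(parentpath) == 0)
            strlcpy(parentpath, ".", MAXPGPATH);
        if (fsync_one(parentpath, true, elevel) != 0)
            return -1;

        return 0;
    }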
On Mon, Mar 7, 2016 at 3:38 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-05 19:54:05 -0800, Andres Freund wrote: >> I started working on this; delayed by taking longer than planned on the >> logical decoding stuff (quite a bit complicated by >> e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the >> error handling as it is right now. For one, you have rename_safe return >> error codes, and act on them in the callers, but on the other hand you >> call fsync_fname which always errors out in case of failure. I also >> don't like the new messages much. >> >> Will continue working on this tomorrow. > > So, here's my current version of this. I've not performed any testing > yet, and it's hot off the press. There's some comment smithing > needed. But otherwise I'm starting to like this. > > Changes: > * renamed rename_safe to durable_rename > * renamed replace_safe to durable_link_or_rename (there was never any > replacing going on) > * pass through elevel to the underlying routines, otherwise we could > error out with ERROR when we don't want to. That's particularly > important in case of things like InstallXLogFileSegment(). > * made fsync_fname use fsync_fname_ext, add 'accept permission errors' > param > * have walkdir call a wrapper, to add ignore_perms param > > What do you think? I have spent a couple of hours looking at that in details, and the patch is neat. + * This routine ensures that, after returning, the effect of renaming file + * persists in case of a crash. A crash while this routine is running will + * leave you with either the old, or the new file. "you" is not really Postgres-like, "the server" or "the backend" perhaps? + /* XXX: perform close() before? might be outside a transaction. Consider errno! */ ereport(elevel, (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", fname))); + (void) CloseTransientFile(fd); + return -1; close() should be called before. slot.c for example does so and we don't want to leak an fd here in case of elevel >= ERROR. + * It does so by using fsync on the sourcefile before the rename, and the + * target file and directory after. fsync is issued as well on the target file if it exists. I think that's worth mentioning in the header. + /* XXX: Add racy file existence check? */ + if (rename(oldfile, newfile) < 0) I am not sure we should worry about that; what do you think could cause the old file to go missing all of a sudden? Other backend processes are not playing with it in the code paths where this routine is called. Perhaps adding a comment in the header to let users know that would help? Instead of "durable" I think that "persistent" makes more sense. We want to make those renames persistent on disk in case of a crash. So I would suggest the following routine names: - rename_persistent - rename_or_link_persistent Having the verb first also helps identify that this is a system-level, rename()-like, routine. > I sure wish we had the recovery testing et al. in all the back > branches... Sure, what we have now is something that should really be backpatched, I was just waiting to have all the existing stability issues addressed, the last one on my agenda being the failure of hamster for test 005 I mentioned in another thread before sending patches for other branches. I proposed a couple of potential fixes regarding that actually, see here: http://www.postgresql.org/message-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com -- Michael
Hi, On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: > I have spent a couple of hours looking at that in details, and the > patch is neat. Cool. Doing some more polishing right now. Will be back with an updated version soonish. Did you do some testing? > + * This routine ensures that, after returning, the effect of renaming file > + * persists in case of a crash. A crash while this routine is running will > + * leave you with either the old, or the new file. > "you" is not really Postgres-like, "the server" or "the backend" perhaps? Hm. I think your alternative proposals are more awkward. > + /* XXX: perform close() before? might be outside a > transaction. Consider errno! */ > ereport(elevel, > (errcode_for_file_access(), > errmsg("could not fsync file \"%s\": %m", fname))); > + (void) CloseTransientFile(fd); > + return -1; > close() should be called before. slot.c for example does so and we > don't want to link an fd here in case of elevel >= ERROR. Note that the transient file machinery will normally prevent fd leakage - but it can only do so if called in a transaction context. I've added int save_errno; /* close file upon error, might not be in transaction context */ save_errno = errno; CloseTransientFile(fd); errno = save_errno; stanzas. > + * It does so by using fsync on the sourcefile before the rename, and the > + * target file and directory after. > fsync is issued as well on the target file if it exists. I think > that's worth mentioning in the header. Ok. > + /* XXX: Add racy file existence check? */ > + if (rename(oldfile, newfile) < 0) > I am not sure we should worry about that, what do you think could > cause the old file from going missing all of a sudden. Other backend > processes are not playing with it in the code paths where this routine > is called. Perhaps adding a comment in the header to let users know > that would help? What I'm thinking of is adding a check whether the *target* file already exists, and error out in that case. Just like the link() based path normally does. > Instead of "durable" I think that "persistent" makes more sense. I find durable a lot more descriptive. persistent could refer to retrying the rename or something. > We > want to make those renames persistent on disk on case of a crash. So I > would suggest the following routine names: > - rename_persistent > - rename_or_link_persistent > Having the verb first also helps identifying that this is a > system-level, rename()-like, routine. I prefer the current names. > > I sure wish we had the recovery testing et al. in all the back > > branches... > > Sure, what we have now is something that should really be backpatched, > I was just waiting to have all the existing stability issues > addressed, the last one on my agenda being the failure of hamster for > test 005 I mentioned in another thread before sending patches for > other branches. I proposed a couple of potential regarding that > actually, see here: > http://www.postgresql.org/message-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com Yea. Will be an interesting discussion... Anyway, I did run the patch through the existing checks, after enabling fsync in PostgresNode.pm. Greetings, Andres Freund
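Put together, the "close before reporting, and preserve errno across the close" stanza Andres describes ends up inside an fsync helper looking roughly like this. A sketch only, assuming the OpenTransientFile/CloseTransientFile/pg_fsync wrappers and the elevel and 'accept permission errors' conventions discussed above; the exact parameter list and which errno values get ignored are this sketch's assumptions, not the committed API:

    static int
    fsync_fname_sketch(const char *fname, bool isdir, bool ignore_perm, int elevel)
    {
        int         fd;
        int         flags = PG_BINARY | (isdir ? O_RDONLY : O_RDWR);

        fd = OpenTransientFile((char *) fname, flags, 0);
        if (fd < 0)
        {
            /* some platforms refuse to open directories (EISDIR/EACCES) */
            if (isdir && (errno == EISDIR || (ignore_perm && errno == EACCES)))
                return 0;
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m", fname)));
            return -1;
        }

        if (pg_fsync(fd) != 0)
        {
            int         save_errno = errno;

            /* close first: we might not be in a transaction context */
            CloseTransientFile(fd);
            errno = save_errno;
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", fname)));
            return -1;
        }

        CloseTransientFile(fd);
        return 0;
    }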
On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: >> I have spent a couple of hours looking at that in details, and the >> patch is neat. > > Cool. Doing some more polishing right now. Will be back with an updated > version soonish. > > Did you do some testing? Not much in details yet, I just ran a check-world with fsync enabled for the recovery tests, plus quick manual tests with a cluster manually set up. I'll do more with your new version now that I know there will be one. >> + /* XXX: Add racy file existence check? */ >> + if (rename(oldfile, newfile) < 0) > >> I am not sure we should worry about that, what do you think could >> cause the old file from going missing all of a sudden. Other backend >> processes are not playing with it in the code paths where this routine >> is called. Perhaps adding a comment in the header to let users know >> that would help? > > What I'm thinking of is adding a check whether the *target* file already > exists, and error out in that case. Just like the link() based path > normally does. Ah, OK. Well, why not. I'd rather have an assertion instead of an error though. -- Michael
On 2016-03-08 12:26:34 +0900, Michael Paquier wrote: > On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: > >> I have spent a couple of hours looking at that in details, and the > >> patch is neat. > > > > Cool. Doing some more polishing right now. Will be back with an updated > > version soonish. > > > > Did you do some testing? > > Not much in details yet, I just ran a check-world with fsync enabled > for the recovery tests, plus quick manual tests with a cluster > manually set up. I'll do more with your new version now that I know > there will be one. Here's my updated version. Note that I've split the patch into two. One for the infrastructure, and one for the callsites. > >> + /* XXX: Add racy file existence check? */ > >> + if (rename(oldfile, newfile) < 0) > > > >> I am not sure we should worry about that, what do you think could > >> cause the old file from going missing all of a sudden. Other backend > >> processes are not playing with it in the code paths where this routine > >> is called. Perhaps adding a comment in the header to let users know > >> that would help? > > > > What I'm thinking of is adding a check whether the *target* file already > > exists, and error out in that case. Just like the link() based path > > normally does. > > Ah, OK. Well, why not. I'd rather have an assertion instead of an error though. I think it should definitely be an error if anything. But I'd rather only add it in master... Andres
Attachment
On Tue, Mar 8, 2016 at 2:55 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-08 12:26:34 +0900, Michael Paquier wrote: >> On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: >> >> I have spent a couple of hours looking at that in details, and the >> >> patch is neat. >> > >> > Cool. Doing some more polishing right now. Will be back with an updated >> > version soonish. >> > >> > Did you do some testing? >> >> Not much in details yet, I just ran a check-world with fsync enabled >> for the recovery tests, plus quick manual tests with a cluster >> manually set up. I'll do more with your new version now that I know >> there will be one. > > Here's my updated version. > > Note that I've split the patch into two. One for the infrastructure, and > one for the callsites. Thanks for the updated patches and the split, this makes things easier to look at. I have been doing some testing as well mainly manually using with pgbench and nothing looks broken. + durable_link_or_rename(tmppath, path, ERROR); + durable_rename(path, xlogfpath, ERROR); You may want to add a (void) cast in front of those calls for correctness. - ereport(LOG, - (errcode_for_file_access(), - errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m", - tmppath, path))); We lose a portion of the error message here, but with the file name that's easy to guess where that is happening. I am not complaining (that's fine to me as-is), just mentioning for the archive's sake. >> >> + /* XXX: Add racy file existence check? */ >> >> + if (rename(oldfile, newfile) < 0) >> > >> >> I am not sure we should worry about that, what do you think could >> >> cause the old file from going missing all of a sudden. Other backend >> >> processes are not playing with it in the code paths where this routine >> >> is called. Perhaps adding a comment in the header to let users know >> >> that would help? >> > >> > What I'm thinking of is adding a check whether the *target* file already >> > exists, and error out in that case. Just like the link() based path >> > normally does. >> >> Ah, OK. Well, why not. I'd rather have an assertion instead of an error though. > > I think it should definitely be an error if anything. But I'd rather > only add it in master... I guess I know why :) That's also why I was thinking about an assertion. -- Michael
Hi, On 2016-03-08 16:21:45 +0900, Michael Paquier wrote: > + durable_link_or_rename(tmppath, path, ERROR); > + durable_rename(path, xlogfpath, ERROR); > You may want to add a (void) cast in front of those calls for correctness. "correctness"? This is neatnikism, not correctness. I've actually added (void)'s to the sites that return on error (i.e. pass LOG or something), but not the ones where we pass ERROR. > - ereport(LOG, > - (errcode_for_file_access(), > - errmsg("could not link file \"%s\" to \"%s\" > (initialization of log file): %m", > - tmppath, path))); > We lose a portion of the error message here, but with the file name > that's easy to guess where that is happening. I am not complaining > (that's fine to me as-is), just mentioning for the archive's sake. Yea, I think that's fine too. - Andres
On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote: >> Instead of "durable" I think that "persistent" makes more sense. > > I find durable a lot more descriptive. persistent could refer to > retrying the rename or something. Yeah, I like durable, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On Mon, 2016-03-07 at 21:55 -0800, Andres Freund wrote: > On 2016-03-08 12:26:34 +0900, Michael Paquier wrote: > > On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote: > > > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote: > > >> I have spent a couple of hours looking at that in details, and the > > >> patch is neat. > > > > > > Cool. Doing some more polishing right now. Will be back with an updated > > > version soonish. > > > > > > Did you do some testing? > > > > Not much in details yet, I just ran a check-world with fsync enabled > > for the recovery tests, plus quick manual tests with a cluster > > manually set up. I'll do more with your new version now that I know > > there will be one. > > Here's my updated version. > > Note that I've split the patch into two. One for the infrastructure, and > one for the callsites. I've repeated the power-loss testing today. With the patches applied I'm no longer able to reproduce the issue (despite trying about 10x), while without them I've hit it on the first try. This is on kernel 4.4.2. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-03-08 23:47:48 +0100, Tomas Vondra wrote: > I've repeated the power-loss testing today. With the patches applied I'm > not longer able to reproduce the issue (despite trying about 10x), while > without them I've hit it on the first try. This is on kernel 4.4.2. Yay, thanks for testing! Andres
On 03/08/2016 02:16 PM, Robert Haas wrote: > On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote: >>> Instead of "durable" I think that "persistent" makes more sense. >> >> I find durable a lot more descriptive. persistent could refer to >> retrying the rename or something. > > Yeah, I like durable, too. There is also precedent, DURABLE as in aciD JD -- Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564 PostgreSQL Centered full stack support, consulting and development. Everyone appreciates your honesty, until you are honest with them.
On 2016-03-07 21:55:52 -0800, Andres Freund wrote: > Here's my updated version. > > Note that I've split the patch into two. One for the infrastructure, and > one for the callsites. I've finally pushed these, after making a number of mostly cosmetic fixes. The only one of real consequence is that I've removed the durable_* call from the renames to .deleted in xlog[archive].c - these don't need to be durable, and are Windows-only. Oh, and that there was a typo in the !HAVE_WORKING_LINK case. There's a *lot* of version skew here: not-present functionality, moved files, different APIs - we got it all. I've tried to check in each version whether we're missing fsyncs for renames and everything. Michael, *please* double check the diffs for the different branches. Note that we currently have some frontend programs with the equivalent problem. Most importantly receivelog.c (pg_basebackup/pg_receivexlog) is missing pretty much the same directory fsyncs. And at least for pg_receivexlog it's critical, especially now that pg_receivexlog supports syncrep. I've not done anything about that; there's pretty much no chance to share backend code with the frontend in the back-branches. Greetings, Andres Freund
On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote: > I've finally pushed these, after making a number of mostly cosmetic > fixes. The only one of real consequence is that I've removed the durable_* > call from the renames to .deleted in xlog[archive].c - these don't need > to be durable, and are Windows-only. Oh, and that there was a typo in > the !HAVE_WORKING_LINK case. > > There's a *lot* of version skew here: not-present functionality, moved > files, different APIs - we got it all. I've tried to check in each > version whether we're missing fsyncs for renames and everything. > Michael, *please* double check the diffs for the different branches. I have finally been able to spend some time reviewing what you pushed on back-branches, and things are in good shape, I think. One small issue that I have is that for EXEC_BACKEND builds, in write_nondefault_variables we still use one instance of rename(). I cannot really believe that there are production builds of Postgres with EXEC_BACKEND on non-Windows platforms, but I think that we had better cover our backs in this code path. For the other extra 2 calls of rename() in xlog.c and xlogarchive.c, those are fine left untouched; I think there is no need to care about the WIN32 blocks... > Note that we currently have some frontend programs with the equivalent > problem. Most importantly receivelog.c (pg_basebackup/pg_receivexlog) > is missing pretty much the same directory fsyncs. And at least for > pg_receivexlog it's critical, especially now that pg_receivexlog supports > syncrep. I've not done anything about that; there's pretty much no > chance to share backend code with the frontend in the back-branches. Yeah, true. We definitely need to do something for that, even for HEAD it seems like overkill to have something in for example src/common to allow frontends to have something if the fix is localized (pg_rewind may use something else), and it would be nice to finish wrapping that for the next minor release, so I propose the attached patches. At the same time, I think that adminpack had better be fixed as well, so there are actually three patches in this series, things that I shaped with backpatching in mind btw, particularly for 0002. -- Michael
Attachment
On 2016-03-15 15:39:50 +0100, Michael Paquier wrote: > I have finally been able to spend some time reviewing what you pushed > on back-branches, and things are in correct shape I think. One small > issue that I have is that for EXEC_BACKEND builds, in > write_nondefault_variables we still use one instance of rename(). Correctly so afaics, because write_nondefault_variables is by definition non-durable. We write that stuff at every start / SIGHUP. Adding an fsync there would be an unnecessary slowdown. I don't think it's good policy adding fsync for stuff that definitely doesn't need it. > Yeah, true. We definitely need to do something for that, even for HEAD > it seems like an overkill to have something in for example src/common > to allow frontends to have something if the fix is localized > (pg_rewind may use something else), and it would be nice to finish > wrapping that for the next minor release, so I propose the attached > patches. At the same time, I think that adminpack had better be fixed > as well, so there are actually three patches in this series, things > that I shaped thinking about a backpatch btw, particularly for 0002. I'm doubtful about "fixing" adminpack. We don't know how it's used, and this could *seriously* increase its overhead for something potentially used at a high rate. > /* > + * Wrapper of rename() similar to the backend version with the same function > + * name aimed at making the renaming durable on disk. Note that this version > + * does not fsync the old file before the rename as all the code paths leading > + * to this function are already doing this operation. The new file is also > + * normally not present on disk before the renaming so there is no need to > + * bother about it. > + */ > +static int > +durable_rename(const char *oldfile, const char *newfile) > +{ > + int fd; > + char parentpath[MAXPGPATH]; > + > + if (rename(oldfile, newfile) != 0) > + { > + /* durable_rename produced a log entry */ > + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"), > + progname, current_walfile_name, strerror(errno)); > + return -1; > + } > + > + /* > + * To guarantee renaming of the file is persistent, fsync the file with its > + * new name, and its containing directory. > + */ > + fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR); Why is S_IRUSR | S_IWUSR specified here? Are you working on a fix for pg_rewind? Let's go with initdb -S in a first iteration, then we can, if somebody is interest enough, work on making this nicer in master. Greetings, Andres Freund
On 3/15/16 10:39 AM, Michael Paquier wrote: > On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote: > >> Note that we currently have some frontend programs with the equivalent >> problem. Most importantly receivelog.c (pg_basebackup/pg_receivexlog) >> is missing pretty much the same directory fsyncs. And at least for >> pg_receivexlog it's critical, especially now that pg_receivexlog supports >> syncrep. I've not done anything about that; there's pretty much no >> chance to share backend code with the frontend in the back-branches. > > Yeah, true. We definitely need to do something for that, even for HEAD > it seems like overkill to have something in for example src/common > to allow frontends to have something if the fix is localized > (pg_rewind may use something else), and it would be nice to finish > wrapping that for the next minor release, so I propose the attached > patches. I noticed this when reviewing the pg_receivexlog refactor and was going to submit a patch after the CF. It didn't occur to me to piggyback on this work but I think it makes sense. +1 from me for fixing this in pg_receivexlog and back-patching. -- -David david@pgmasters.net
On Wed, Mar 16, 2016 at 2:46 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-15 15:39:50 +0100, Michael Paquier wrote: >> Yeah, true. We definitely need to do something for that, even for HEAD >> it seems like an overkill to have something in for example src/common >> to allow frontends to have something if the fix is localized >> (pg_rewind may use something else), and it would be nice to finish >> wrapping that for the next minor release, so I propose the attached >> patches. At the same time, I think that adminpack had better be fixed >> as well, so there are actually three patches in this series, things >> that I shaped thinking about a backpatch btw, particularly for 0002. > > I'm doubtful about "fixing" adminpack. We don't know how it's used, and > this could *seriously* increase its overhead for something potentially > used at a high rate. I think that Dave or Guillaume added here in CC could bring some light on the matter. Let's see if that's a problem for them. I would tend to think that it is not that critical, still I would imagine that this function is not called at a high frequency. >> /* >> + * Wrapper of rename() similar to the backend version with the same function >> + * name aimed at making the renaming durable on disk. Note that this version >> + * does not fsync the old file before the rename as all the code paths leading >> + * to this function are already doing this operation. The new file is also >> + * normally not present on disk before the renaming so there is no need to >> + * bother about it. >> + */ >> +static int >> +durable_rename(const char *oldfile, const char *newfile) >> +{ >> + int fd; >> + char parentpath[MAXPGPATH]; >> + >> + if (rename(oldfile, newfile) != 0) >> + { >> + /* durable_rename produced a log entry */ >> + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"), >> + progname, current_walfile_name, strerror(errno)); >> + return -1; >> + } >> + >> + /* >> + * To guarantee renaming of the file is persistent, fsync the file with its >> + * new name, and its containing directory. >> + */ >> + fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR); > > Why is S_IRUSR | S_IWUSR specified here? Oops. I have removed it and updated that as attached. > Are you working on a fix for pg_rewind? Let's go with initdb -S in a > first iteration, then we can, if somebody is interest enough, work on > making this nicer in master. I am really -1 for this approach. Wrapping initdb -S with find_other_exec is intrusive in back-branches knowing that all the I/O write operations manipulating file descriptors go through file_ops.c, and we actually just need to fsync the target file in close_target_file(), making the fix being a 7-line patch, and there is no need to depend on another binary at all. The backup label file, as well as the control file are using the same code path in file_ops.c... And I like simple things :) At the same time, I found a legit bug when the modified backup_label file is created in createBackupLabel: the file is opened, written, but not closed with close_target_file(), and it should be. -- Michael
Attachment
On 2016-03-17 23:05:42 +0900, Michael Paquier wrote: > > Are you working on a fix for pg_rewind? Let's go with initdb -S in a > > first iteration, then we can, if somebody is interest enough, work on > > making this nicer in master. > > I am really -1 for this approach. Wrapping initdb -S with > find_other_exec is intrusive in back-branches knowing that all the I/O > write operations manipulating file descriptors go through file_ops.c, > and we actually just need to fsync the target file in > close_target_file(), making the fix being a 7-line patch, and there is > no need to depend on another binary at all. The backup label file, as > well as the control file are using the same code path in file_ops.c... > And I like simple things :) This is a *much* more expensive approach though. Doing the fsync directly after modifying the file. One file by one file. Will usually result in each fsync blocking for a while. In comparison of doing a flush and then an fsync pass over the whole directory will usually only block seldomly. The flushes for all files can be combined into very few barrier operations. Besides that, you're not syncing the directories, despite open_target_file() potentially creating the directory. Greetings, Andres Freund
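The difference Andres is pointing at shows up clearly in a two-pass form: first ask the kernel to start writeback on every file without waiting, then make a second pass that fsyncs each one, by which time most dirty data is already on its way to disk. A Linux-flavoured sketch, assuming the HAVE_SYNC_FILE_RANGE configure-style guard; sync_file_range() is one of the writeback hints initdb -S can use where available, and on other platforms the first pass would use a different hint or nothing at all:

    #include <fcntl.h>
    #include <unistd.h>

    /* Pass 1: initiate writeback for one file, without blocking on completion. */
    static void
    pre_sync_file(const char *path)
    {
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
            return;                 /* best effort only in this sketch */
    #if defined(HAVE_SYNC_FILE_RANGE)
        (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    #endif
        close(fd);
    }

    /* Pass 2: fsync one file; by now most of its dirty data has been written. */
    static void
    sync_file(const char *path)
    {
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
            return;
        (void) fsync(fd);
        close(fd);
    }

A caller walks the tree once calling pre_sync_file() on every path, and only then walks it again calling sync_file(), instead of interleaving the two — which is what keeps the individual fsyncs from each blocking for a long time.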
On Fri, Mar 18, 2016 at 12:03 AM, Andres Freund <andres@anarazel.de> wrote: > This is a *much* more expensive approach though. Doing the fsync > directly after modifying the file. One file by one file. Will usually > result in each fsync blocking for a while. > > In comparison of doing a flush and then an fsync pass over the whole > directory will usually only block seldomly. The flushes for all files > can be combined into very few barrier operations. Hm... OK. I'd really like to keep the run of pg_rewind minimal as well if possible. So here you go. -- Michael
Attachment
On Thu, Mar 17, 2016 at 11:03 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-17 23:05:42 +0900, Michael Paquier wrote: >> > Are you working on a fix for pg_rewind? Let's go with initdb -S in a >> > first iteration, then we can, if somebody is interest enough, work on >> > making this nicer in master. >> >> I am really -1 for this approach. Wrapping initdb -S with >> find_other_exec is intrusive in back-branches knowing that all the I/O >> write operations manipulating file descriptors go through file_ops.c, >> and we actually just need to fsync the target file in >> close_target_file(), making the fix being a 7-line patch, and there is >> no need to depend on another binary at all. The backup label file, as >> well as the control file are using the same code path in file_ops.c... >> And I like simple things :) > > This is a *much* more expensive approach though. Doing the fsync > directly after modifying the file. One file by one file. Will usually > result in each fsync blocking for a while. > > In comparison of doing a flush and then an fsync pass over the whole > directory will usually only block seldomly. The flushes for all files > can be combined into very few barrier operations. Yeah, I'm pretty sure this was discussed when initdb -S went in. I think reusing that is a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2016-03-18 15:08:32 +0900, Michael Paquier wrote: > +/* > + * Sync data directory to ensure that what has been generated up to now is > + * persistent in case of a crash, and this is done once globally for > + * performance reasons as sync requests on individual files would be > + * a negative impact on the running time of pg_rewind. This is invoked at > + * the end of processing once everything has been processed and written. > + */ > +static void > +syncTargetDirectory(const char *argv0) > +{ > + int ret; > + char exec_path[MAXPGPATH]; > + char cmd[MAXPGPATH]; > + > + if (dry_run) > + return; I think it makes more sense to return after detecting the binary, so you'd find out about problems around not finding initdb during the dry run, not later. > + /* Grab and invoke initdb to perform the sync */ > + if ((ret = find_other_exec(argv0, "initdb", > + "initdb (PostgreSQL) " PG_VERSION "\n", > + exec_path)) < 0) > + { > + char full_path[MAXPGPATH]; > + > + if (find_my_exec(argv0, full_path) < 0) > + strlcpy(full_path, progname, sizeof(full_path)); > + > + if (ret == -1) > + pg_fatal("The program \"initdb\" is needed by %s but was \n" > + "not found in the same directory as \"%s\".\n" > + "Check your installation.\n", progname, full_path); > + else > + pg_fatal("The program \"postgres\" was found by \"%s\" but was \n" > + "not the same version as %s.\n" > + "Check your installation.\n", progname, full_path); Wrong binary name. > + } > + > + /* now run initdb */ > + if (debug) > + snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S", > + exec_path, datadir_target); > + else > + snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"", > + exec_path, datadir_target, DEVNULL); > + > + if (system(cmd) != 0) > + pg_fatal("sync of target directory with initdb -S failed\n"); > + pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n"); > +} Don't see need for emitting "done", for now at least. > /* > + * Wrapper of rename() similar to the backend version with the same function > + * name aimed at making the renaming durable on disk. Note that this version > + * does not fsync the old file before the rename as all the code paths leading > + * to this function are already doing this operation. The new file is also > + * normally not present on disk before the renaming so there is no need to > + * bother about it. I don't think it's a good idea to skip fsyncing the old file based on that; it's way too likely that that'll not be done for the next user of durable_rename. > + */ > +static int > +durable_rename(const char *oldfile, const char *newfile) > +{ > + int fd; > + char parentpath[MAXPGPATH]; > + > + if (rename(oldfile, newfile) != 0) > + { > + /* durable_rename produced a log entry */ Uh? > + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"), > + progname, current_walfile_name, strerror(errno)); current_walfile_name doesn't look right, that's a largely independent global variable. > + if (fsync(fd) != 0) > + { > + close(fd); > + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), > + progname, newfile, strerror(errno)); > + return -1; Close should be after the strerror() call (yes, that mistake already existed once in receivelog.c). > + fd = open(parentpath, O_RDONLY | PG_BINARY, 0); > + > + /* > + * Some OSs don't allow us to open directories at all (Windows returns > + * EACCES), just ignore the error in that case. If desired also silently > + * ignoring errors about unreadable files. Log others. > + */ Comment is not applicable as a whole. 
> + if (fd < 0 && (errno == EISDIR || errno == EACCES)) > + return 0; > + else if (fd < 0) > + { > + fprintf(stderr, _("%s: could not open file \"%s\": %s\n"), > + progname, parentpath, strerror(errno)); > + return -1; > + } > + > + if (fsync(fd) != 0) > + { > + close(fd); > + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), > + progname, parentpath, strerror(errno)); > + return -1; close() needs to be moved again. It'd be easier to apply these if the rate of trivial issues were lower :(. Attached is an heavily revised version of the patch. Besides the above, I've: * copied fsync_fname_ext from initdb, I think it's easier to understand code this way, and it'll make unifying the code easier * added fsyncing of the directory in mark_file_as_archived() * The WAL files also need to be fsynced when created in open_walfile(): - otherwise the directory entry itself isn't safely persistent, as we don't fsync the directory in the stream->synchronous fsync() cases. - we refuse to resume in open_walfile(), if a file isn't 16MB when restarting. Without an fsync that's actually not unlikely after a crash. Even with an fsync that's not guaranteed not to happen, but the chance of it is much lower. I'm too tired to push this at the eleventh hour. Besides a heavily revised patch, backpatching will likely include a number of conflicts. If somebody in the US has the energy to take care of this... I've also noticed that a) pg_basebackup doesn't do anything about durability (it probably needs a very similar patch to the one pg_rewind just received). b) nor does pg_dump[all] I think it's pretty unacceptable for backup tools to be so cavalier about durability. So we're going to have another round of fsync stuff in the next set of releases anyway... Greetings, Andres Freund
Attachment
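For reference, the frontend flavour of the rename routine — as it would look with the fixes noted in the review above applied: report the file name actually involved, and call close() only after strerror() has seen errno — comes out roughly as follows. A sketch under those assumptions, not the committed code; _() and progname are the usual frontend conventions, and error handling for an unopenable parent directory is simplified:

    static int
    durable_rename_frontend(const char *oldfile, const char *newfile)
    {
        int         fd;
        char        parentpath[MAXPGPATH];

        if (rename(oldfile, newfile) != 0)
        {
            fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
                    progname, oldfile, newfile, strerror(errno));
            return -1;
        }

        /* fsync the file under its new name */
        fd = open(newfile, O_RDWR | PG_BINARY, 0);
        if (fd < 0)
        {
            fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
                    progname, newfile, strerror(errno));
            return -1;
        }
        if (fsync(fd) != 0)
        {
            fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
                    progname, newfile, strerror(errno));
            close(fd);
            return -1;
        }
        close(fd);

        /* ... and the directory entry that now points at it */
        strlcpy(parentpath, newfile, MAXPGPATH);
        get_parent_directory(parentpath);
        if (strlen(parentpath) == 0)
            strlcpy(parentpath, ".", MAXPGPATH);

        fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
        if (fd >= 0)
        {
            /* some platforms cannot open directories; silently skip that case here */
            (void) fsync(fd);
            close(fd);
        }
        return 0;
    }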
On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-03-18 15:08:32 +0900, Michael Paquier wrote: >> + fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"), >> + progname, current_walfile_name, strerror(errno)); > > current_walfile_name doesn't look right, that's a largely independent > global variable. Stupid mistake. >> + if (fsync(fd) != 0) >> + { >> + close(fd); >> + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), >> + progname, newfile, strerror(errno)); >> + return -1; > > Close should be after the strerror() call (yes, that mistake already > existed once in receivelog.c). Right. > It'd be easier to apply these if the rate of trivial issues were lower > :(. Sorry about that. I'll be more careful. > Attached is an heavily revised version of the patch. Besides the above, > I've: > * copied fsync_fname_ext from initdb, I think it's easier to understand > code this way, and it'll make unifying the code easier OK. > * added fsyncing of the directory in mark_file_as_archived() > * The WAL files also need to be fsynced when created in open_walfile(): > - otherwise the directory entry itself isn't safely persistent, as we > don't fsync the directory in the stream->synchronous fsync() cases. > - we refuse to resume in open_walfile(), if a file isn't 16MB when > restarting. Without an fsync that's actually not unlikely after a > crash. Even with an fsync that's not guaranteed not to happen, but > the chance of it is much lower. > I'm too tired to push this at the eleventh hour. Besides a heavily > revised patch, backpatching will likely include a number of conflicts. > If somebody in the US has the energy to take care of this... Close enough to the US. Attached are backpatchable versions based on the corrected version you sent. 9.3 and 9.4 share the same patch, more work has been necessary for 9.2 but that's not huge either. > So we're going to have another round of fsync stuff in the next set of > releases anyway... Yes, seeing how 9.5.2 is close by, I think that it would be wiser to push this stuff after the upcoming minor release. -- Michael
Attachment
On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote: > I've also noticed that Coming back to this issue because... > a) pg_basebackup doesn't do anything about durability (it probably needs > a very similar patch to the one pg_rewind just received). I think that one of the QE tests running here just got bitten by that. A base backup was taken with pg_basebackup, and more or less right after that the VM was unplugged. The trick is that for pg_basebackup we cannot rely on initdb: pg_basebackup is a client-side utility. In most of the PG packages (Fedora, RHEL), it is shipped in the client-side package, where initdb is not. So it seems to me that the correct fix is not to use initdb -S but to have copies of fsync_parent_path, durable_rename and fsync_fname_ext in streamutil.c, and then we reuse them for both pg_receivexlog and pg_basebackup. At least that's less risky for back-branches this way. > b) nor does pg_dump[all] I have not hacked that up yet, but I would think that we would need one extra copy of some of those fsync_* routines in src/bin/pg_dump/. There is another thread for that already... On master I guess we'd end up with something centralized in src/common/, but let's close the existing holes first. > So we're going to have another round of fsync stuff in the next set of > releases anyway... The sooner the better I think. Anyone caring about this problem is currently limited to running initdb -S after calling pg_basebackup or pg_dump. That's a solution, though the flushes should be contained inside each utility. -- Michael
Attachment
On Thu, May 12, 2016 at 2:58 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote: >> I've also noticed that > > Coming back to this issue because... > >> a) pg_basebackup doesn't do anything about durability (it probably needs >> a very similar patch to the one pg_rewind just received). > > I think that one of the QE tests running here just got bitten by that. > A base backup was taken with pg_basebackup and more or less after a VM > was plugged off. The trick is that for pg_basebackup we cannot rely on > initdb: pg_basebackup is a client-side utility. In most of the PG > packages (Fedora, RHEL), it is put on the client-side package, where > initdb is not. So it seems to me that the correct fix is not to use > initdb -S but to have copies of fsync_parent_path, durable_rename and > fsync_fname_ext in streamutil.c, and then we reuse them for both > pg_receivexlog and pg_basebackup. At least that's less risky for > back-branches this way. > >> b) nor does pg_dump[all] > > I have not hacked up that yet, but I would think that we would need > one extra copy of some of those fsync_* routines in src/bin/pg_dump/. > There is another thread for that already... On master I guess we'd end > with something centralized in src/common/, but let's close the > existing holes first. > >> So we're going to have another round of fsync stuff in the next set of >> releases anyway... > > The sooner the better I think. Any people caring about this problem > are now limited in using initdb -S after calling pg_basebackup or > pg_dump. That's a solution, though the flushes should be contained > inside each utility. And actually this won't fly if there is no equivalent of walkdir() or if the fsync()'s are not applied recursively. On master at least the refactoring had better be done cleanly first... For the back branches, we could just have some recursive call like fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do you think that this should be part of fe_utils or src/common/? I'd tend to think the latter is more appropriate, as there is an equivalent in the backend. On back-branches, we could just have something like fsync_recursively that walks through the paths. An even simpler approach would be to individually fsync() the things that have been written, but that would suck in performance. Thoughts from others? -- Michael
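For the recursive part Michael mentions, a bare-bones walkdir-style pass over a tree could look like the following — a portable-ish sketch using opendir()/readdir()/lstat() that fsyncs regular files first and then the directory itself, and ignores the wrinkles the real patches have to handle (unopenable directories on some platforms, symlinks, read-only files, proper error reporting):

    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* fsync a single path; 'isdir' only affects the open() mode. */
    static void
    fsync_path(const char *path, int isdir)
    {
        int         fd = open(path, isdir ? O_RDONLY : O_RDWR);

        if (fd < 0)
            return;             /* sketch: real code must report or skip selectively */
        (void) fsync(fd);
        close(fd);
    }

    /* Recursively fsync every regular file under 'dir', then 'dir' itself. */
    static void
    fsync_recursively(const char *dir)
    {
        DIR            *d = opendir(dir);
        struct dirent  *de;
        struct stat     st;
        char            path[4096];

        if (d == NULL)
            return;

        while ((de = readdir(d)) != NULL)
        {
            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (lstat(path, &st) < 0)
                continue;
            if (S_ISREG(st.st_mode))
                fsync_path(path, 0);
            else if (S_ISDIR(st.st_mode))
                fsync_recursively(path);    /* recurse; fsyncs the subdir last */
        }
        closedir(d);

        /* finally, make the directory entries themselves durable */
        fsync_path(dir, 1);
    }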