Thread: silent data loss with ext4 / all current versions

silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

27 November 2015, 11:18:03

Hi,

I've been doing some power failure tests (i.e.  unexpectedly
interrupting power) a few days ago, and I've discovered a fairly serious
case of silent data loss on ext3/ext4. Initially i thought it's a
filesystem bug, but after further investigation I'm pretty sure it's our
fault.

What happens is that when we recycle WAL segments, we rename them and
then sync them using fdatasync (which is the default on Linux). However
fdatasync does not force fsync on the parent directory, so in case of
power failure the rename may get lost. The recovery won't realize those
segments actually contain changes from "future" and thus does not replay
them. Hence data loss. The recovery completes as if everything went OK,
so the data loss is entirely silent.

Reproducing this is rather trivial. I've prepared a simple C program
simulating our WAL recycling, that I intended to send to ext4 mailing
list to demonstrate the ext4 bug before (I realized it's most likely our
bug and not theirs).

The example program is called ext4-data-loss.c and is available here
(along with other stuff mentioned in this message):

     https://github.com/2ndQuadrant/ext4-data-loss

Compile it, run it (over ssh from another host), interrupt the power and
after restart you should see some of the segments be lost (the  rename
reverted).

The git repo also contains a bunch of python scripts that I initially
used to reproduce this on PostgreSQL - insert.py, update.py and
xlog-watch.py. I'm not going to explain the details here, it's a bit
more complicated but the cause is exactly the same as with the C
program, just demonstrated in database. See README for instructions.

So, what's going on? The problem is that while the rename() is atomic,
it's not guaranteed to be durable without an explicit fsync on the
parent directory. And by default we only do fdatasync on the recycled
segments, which may not force fsync on the directory (and ext4 does not
do that, apparently).

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and
4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19,
but I believe all versions since spread checkpoints were introduced are
vulnerable).

FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.

I've done the same tests on xfs and that seems to be safe - I've been
unable to reproduce the issue, so either the issue is not there or it's
more difficult to hit it. I haven't tried on other file systems, because
ext4 and xfs cover vast majority of deployments (at least on Linux), and
thus issue on ext4 is serious enough I believe.

It's possible to make ext3/ext4 safe with respect to this issue by using
full journaling (data=journal) instead of the default (data=ordered)
mode. However this comes at a significant performance cost and pretty
much no one is using it with PostgreSQL because data=ordered is believed
to be safe.

It's also possible to mitigate this by setting wal_sync_method=fsync,
but I don't think I've ever seen that change at a customer. This also
comes with a significant performance penalty, comparable to setting
data=journal. This has the advantage that this can be done without
restarting the database (SIGHUP is enough).

So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it's
properly fsynced in the WAL segments, it's just the rename that got
lost. So it's possible to survive this without losing data by manually
renaming the segments, but this must happen before starting the cluster
because the automatic recovery comes and discards all the data etc.

I think this issue might also result in various other issues, not just
data loss. For example, I wouldn't be surprised by data corruption due
to flushing some of the changes in data files to disk (due to contention
for shared buffers and reaching vm.dirty_bytes) and then losing the
matching WAL segment. Also, while I have only seen 1 to 3 segments
getting lost, it might be possible that more segments can get lost,
possibly making the recovery impossible. And of course, this might cause
problems with WAL archiving due to archiving the same
segment twice (before and after crash).

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
sure this needs to be backpatched to all backbranches. I've also
attached a patch that adds pg_current_xlog_flush_location() function,
which proved to be quite useful when debugging this issue.

I'd also like to propose adding "last segment" to pg_controldata, next
to the last checkpoint / restartpoint. We don't need to write this on
every commit, once per segment (on the first write) is enough. This
would make investigating the issue much easier, and it'd also make it
possible to terminate the recovery with an error if the last found
segment does not match the expectation (instead of just assuming we've
found all segments, leading to data loss).

Another useful change would be to allow pg_xlogdump to print segments
even if the contents does not match the filename. Currently it's
impossible to even look at the contents in that case, so renaming the
existing segments is mostly guess work (find segments whrere pg_xlogdump
fails, try renaming to next segments).

And finally, I've done a quick review of all places that might suffer
the same issue - some are not really interesting as the stuff is
ephemeral anyway (like pgstat for example), but there are ~15 places
that may need this fix:

  * src/backend/access/transam/timeline.c (2 matches)
  * src/backend/access/transam/xlog.c (9 matches)
  * src/backend/access/transam/xlogarchive.c (3 matches)
  * src/backend/postmaster/pgarch.c (1 match)

Some of these places might be actually safe because a fsync happens
somewhere immediately after the rename (e.g. in a caller), but I guess
better safe than sorry.

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then
the rename won't be replayed and will be lost).

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: silent data loss with ext4 / all current versions

From

Teodor Sigaev

Date:

27 November 2015, 12:35:14

> What happens is that when we recycle WAL segments, we rename them and then sync
> them using fdatasync (which is the default on Linux). However fdatasync does not
> force fsync on the parent directory, so in case of power failure the rename may
> get lost. The recovery won't realize those segments actually contain changes
Agree. Some time ago I faced with this, although it wasn't a postgres.

> So, what's going on? The problem is that while the rename() is atomic, it's not
> guaranteed to be durable without an explicit fsync on the parent directory. And
> by default we only do fdatasync on the recycled segments, which may not force
> fsync on the directory (and ext4 does not do that, apparently).
>
> This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and
> also all supported PostgreSQL versions (tested on 9.1.19, but I believe all
> versions since spread checkpoints were introduced are vulnerable).
>
> FWIW this has nothing to do with storage reliability - you may have good drives,
> RAID controller with BBU, reliable SSDs or whatever, and you're still not safe.
> This issue is at the filesystem level, not storage.
Agree again.

> I plan to do more power failure testing soon, with more complex test scenarios.
> I suspect there might be other similar issues (e.g. when we rename a file before
> a checkpoint and don't fsync the directory - then the rename won't be replayed
> and will be lost).
It would be very useful, but I hope you will not find a new bug :)

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

27 November 2015, 13:18:30

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, what's going on? The problem is that while the rename() is atomic, it's
> not guaranteed to be durable without an explicit fsync on the parent
> directory. And by default we only do fdatasync on the recycled segments,
> which may not force fsync on the directory (and ext4 does not do that,
> apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

> FWIW this has nothing to do with storage reliability - you may have good
> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
> still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the FS is not to blame,
clearly. At least that's what I get from it.

> I've done the same tests on xfs and that seems to be safe - I've been unable
> to reproduce the issue, so either the issue is not there or it's more
> difficult to hit it. I haven't tried on other file systems, because ext4 and
> xfs cover vast majority of deployments (at least on Linux), and thus issue
> on ext4 is serious enough I believe.
>
> So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
>
> It's also worth mentioning that the data is not actually lost - it's
> properly fsynced in the WAL segments, it's just the rename that got lost. So
> it's possible to survive this without losing data by manually renaming the
> segments, but this must happen before starting the cluster because the
> automatic recovery comes and discards all the data etc.

Hm. Most users are not going to notice that, particularly where things
are embedded.

> I think this issue might also result in various other issues, not just data
> loss. For example, I wouldn't be surprised by data corruption due to
> flushing some of the changes in data files to disk (due to contention for
> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
> be possible that more segments can get lost, possibly making the recovery
> impossible. And of course, this might cause problems with WAL archiving due
> to archiving the same
> segment twice (before and after crash).

Possible, the switch to .done is done after renaming the segment in
xlogarchive.c. So this could happen in theory.

> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
> this needs to be backpatched to all backbranches. I've also attached a patch
> that adds pg_current_xlog_flush_location() function, which proved to be
> quite useful when debugging this issue.

Agreed. We should be sure as well that the calls to fsync_fname get
issued in a critical section with START/END_CRIT_SECTION(). It does
not seem to be the case with your patch.

> And finally, I've done a quick review of all places that might suffer the
> same issue - some are not really interesting as the stuff is ephemeral
> anyway (like pgstat for example), but there are ~15 places that may need
> this fix:
>
>  * src/backend/access/transam/timeline.c (2 matches)
>  * src/backend/access/transam/xlog.c (9 matches)
>  * src/backend/access/transam/xlogarchive.c (3 matches)
>  * src/backend/postmaster/pgarch.c (1 match)
>
> Some of these places might be actually safe because a fsync happens
> somewhere immediately after the rename (e.g. in a caller), but I guess
> better safe than sorry.

I had a quick look at those code paths and indeed it would be safer to
be sure that once rename() is called we issue those fsync calls.

> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).

That would be great.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Greg Stark

Date:

27 November 2015, 13:29:21

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snaphsot and ensuring they're still there I
guess.

That always seemed unsatisfactory because in the past we were mainly
concerned with whether fsync was actually getting propagated to the
physical media. But for testing whether we're fsyncing enough for the
filesystem that would be good enough.

-- 
greg

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

27 November 2015, 14:02:52

Hi,

On 11/27/2015 02:28 PM, Greg Stark wrote:
> On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I plan to do more power failure testing soon, with more complex
>> test scenarios. I suspect there might be other similar issues (e.g.
>> when we rename a file before a checkpoint and don't fsync the
>> directory -then the rename won't be replayed and will be lost).
>
> I'm curious how you're doing this testing. The easiest way I can
> think of would be to run a database on an LVM volume and take a large
> number of LVM snapshots very rapidly and then see if the database can
> start up from each snapshot. Bonus points for keeping track of the
> committed transactions before each snaphsot and ensuring they're
> still there I guess.

I do have reliable storage (Intel SSD with power-loss protection), and
I've connected the system to a sophisticated power-loss-making device
called "the power switch" (image attached).

In other words, in the last ~7 days the system got rebooted more times
than in the previous ~5 years.

> That always seemed unsatisfactory because in the past we were mainly
> concerned with whether fsync was actually getting propagated to the
> physical media. But for testing whether we're fsyncing enough for
> the filesystem that would be good enough.

Yeah. I considered some form of virtualized setup initially, but my
original intent was to verify whether disabling write barriers really is
safe (because I've heard numerous complaints that it's stupid). And as
write barriers are more tightly coupled to the hardware, I opted for the
more brutal approach.

But I agree some form of virtualized setup might be more flexible,
although I'm not sure LVM snapshots are good approach as snapshots may
wait for I/O requests to complete and such. I think something qemu might
work better when combined with "kill -9" and I plan to try reproducing
the data loss issue on such setup.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

power-failure-device.jpg

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

27 November 2015, 18:01:19


On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's
>> not guaranteed to be durable without an explicit fsync on the parent
>> directory. And by default we only do fdatasync on the recycled segments,
>> which may not force fsync on the directory (and ext4 does not do that,
>> apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.
>
>> FWIW this has nothing to do with storage reliability - you may have good
>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>> still not safe. This issue is at the filesystem level, not storage.
>
> The POSIX spec authorizes this behavior, so the FS is not to blame,
> clearly. At least that's what I get from it.

The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX 
expert), but we should be prepared for the less favorable interpretation 
I think.

>
>> I think this issue might also result in various other issues, not just data
>> loss. For example, I wouldn't be surprised by data corruption due to
>> flushing some of the changes in data files to disk (due to contention for
>> shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
>> segment. Also, while I have only seen 1 to 3 segments getting lost, it might
>> be possible that more segments can get lost, possibly making the recovery
>> impossible. And of course, this might cause problems with WAL archiving due
>> to archiving the same
>> segment twice (before and after crash).
>
> Possible, the switch to .done is done after renaming the segment in
> xlogarchive.c. So this could happen in theory.

Yes. That's one of the suspicious places in my notes (haven't posted all 
the details, the message was long enough already).

>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
>> this needs to be backpatched to all backbranches. I've also attached a patch
>> that adds pg_current_xlog_flush_location() function, which proved to be
>> quite useful when debugging this issue.
>
> Agreed. We should be sure as well that the calls to fsync_fname get
> issued in a critical section with START/END_CRIT_SECTION(). It does
> not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/ which does 
fsync_fname() on all the interesting places, without the critical section.

regards

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

29 November 2015, 12:59:32

On Sat, Nov 28, 2015 at 3:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 11/27/2015 02:18 PM, Michael Paquier wrote:
>>
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> So, what's going on? The problem is that while the rename() is atomic,
>>> it's
>>> not guaranteed to be durable without an explicit fsync on the parent
>>> directory. And by default we only do fdatasync on the recycled segments,
>>> which may not force fsync on the directory (and ext4 does not do that,
>>> apparently).
>>
>>
>> Yeah, that seems to be the way the POSIX spec clears things.
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
>> force all currently queued I/O operations associated with the file
>> indicated by file descriptor fildes to the synchronized I/O completion
>> state. All I/O operations shall be completed as defined for
>> synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>> If I understand that right, it is guaranteed that the rename() will be
>> atomic, meaning that there will be only one file even if there is a
>> crash, but that we need to fsync() the parent directory as mentioned.
>>
>>> FWIW this has nothing to do with storage reliability - you may have good
>>> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
>>> still not safe. This issue is at the filesystem level, not storage.
>>
>>
>> The POSIX spec authorizes this behavior, so the FS is not to blame,
>> clearly. At least that's what I get from it.
>
>
> The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
> expert),

As I am understanding it, FS implementations are free to decide to
make the rename persist on disk or not.

> but we should be prepared for the less favorable interpretation I
> think.

Yep. I agree. And in case my previous words were not clear, that's the
same line of thought here, we had better cover our backs and study
carefully each code path that could be impacted.

>>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
>>> sure
>>> this needs to be backpatched to all backbranches. I've also attached a
>>> patch
>>> that adds pg_current_xlog_flush_location() function, which proved to be
>>> quite useful when debugging this issue.
>>
>>
>> Agreed. We should be sure as well that the calls to fsync_fname get
>> issued in a critical section with START/END_CRIT_SECTION(). It does
>> not seem to be the case with your patch.
>
>
> Don't know. I've based that on code from replication/logical/ which does
> fsync_fname() on all the interesting places, without the critical section.

For slot information in slot.c, there will be a PANIC when fsyncing
pg_replslot at some points. It does not seem that weird to do the same
for example after renaming the backup label file..
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Craig Ringer

Date:

29 November 2015, 13:38:06

On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then the
> rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snaphsot and ensuring they're still there I
guess.

I've had a few tries at implementing a qemu-based crashtester where it hard kills the qemu instance at a random point then starts it back up.

I always got stuck on the validation part - actually ensuring that the DB state is how we expect. I think I could probably get that right now, it's been a while.

The VM can be started back up and killed again over and over quite quickly.

It's not as good as physical plug-pull, but it's a lot more practical.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: silent data loss with ext4 / all current versions

From

Craig Ringer

Date:

29 November 2015, 13:41:33

On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

It's also possible to mitigate this by setting wal_sync_method=fsync

Are you sure?

https://lwn.net/Articles/322823/ tends to suggest that fsync() on the file is insufficient to ensure rename() is persistent, though it's somewhat old.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

29 November 2015, 14:33:37

Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:
> On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu
> <mailto:stark@mit.edu>> wrote:
>
>     On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
>     <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
>     wrote:
>     > I plan to do more power failure testing soon, with more complex test
>     > scenarios. I suspect there might be other similar issues (e.g. when we
>     > rename a file before a checkpoint and don't fsync the directory - then the
>     > rename won't be replayed and will be lost).
>
>     I'm curious how you're doing this testing. The easiest way I can think
>     of would be to run a database on an LVM volume and take a large number
>     of LVM snapshots very rapidly and then see if the database can start
>     up from each snapshot. Bonus points for keeping track of the committed
>     transactions before each snaphsot and ensuring they're still there I
>     guess.
>
>
> I've had a few tries at implementing a qemu-based crashtester where it
> hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've 
been unsuccessful. On bare HW it was easily reproducible (I'd hit the 
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu 
somehow interacts with the I/O.

> I always got stuck on the validation part - actually ensuring that the
> DB state is how we expect. I think I could probably get that right now,
> it's been a while.

Weel, I guess we can't really check all the details, but I guess the 
checksums make checking the general consistency somewhat simpler. And 
then you have to design the workload in a way that makes the check 
easier - for example remembering the committed values etc.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

29 November 2015, 14:43:38

On 11/29/2015 02:41 PM, Craig Ringer wrote:
> On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     It's also possible to mitigate this by setting wal_sync_method=fsync
>
>
> Are you sure?
>
> https://lwn.net/Articles/322823/ tends to suggest that fsync() on the
> file is insufficient to ensure rename() is persistent, though it's
> somewhat old.

Good point. I don't know, and I'm not any smarter after reading the LWN 
article. What I meant by "mitigate" is that I've been unable to 
reproduce the issue after setting wal_sync_method=fsync, so my 
conclusion is that it either fixes the issue or at least significantly 
reduces the probability of hitting it.

It's pretty clear that the right fix is the additional fsync on pg_xlog.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

29 November 2015, 18:58:23

On 11/29/2015 03:33 PM, Tomas Vondra wrote:
> Hi,
>
> On 11/29/2015 02:38 PM, Craig Ringer wrote:
>>
>> I've had a few tries at implementing a qemu-based crashtester where it
>> hard kills the qemu instance at a random point then starts it back up.
>
> I've tried to reproduce the issue by killing a qemu VM, and so far I've
> been unsuccessful. On bare HW it was easily reproducible (I'd hit the
> issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
> somehow interacts with the I/O.

Update: I've managed to reproduce the issue in the qemu setup - I think 
it needs slightly different timing due to the VM being slightly slower. 
I also tweaked vm.dirty_bytes and vm.dirty_background_bytes to values 
used on the bare hardware (I suspect it widens the window).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Peter Eisentraut

Date:

01 December 2015, 21:45:04

On 11/27/15 8:18 AM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> > So, what's going on? The problem is that while the rename() is atomic, it's
>> > not guaranteed to be durable without an explicit fsync on the parent
>> > directory. And by default we only do fdatasync on the recycled segments,
>> > which may not force fsync on the directory (and ext4 does not do that,
>> > apparently).
> Yeah, that seems to be the way the POSIX spec clears things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.

I don't see anywhere in the spec that a rename needs an fsync of the
directory to be durable.  I can see why that would be needed in
practice, though.  File system developers would probably be able to give
a more definite answer.

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

01 December 2015, 22:00:20


On 12/01/2015 10:44 PM, Peter Eisentraut wrote:
> On 11/27/15 8:18 AM, Michael Paquier wrote:
>> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> So, what's going on? The problem is that while the rename() is atomic, it's
>>>> not guaranteed to be durable without an explicit fsync on the parent
>>>> directory. And by default we only do fdatasync on the recycled segments,
>>>> which may not force fsync on the directory (and ext4 does not do that,
>>>> apparently).
>> Yeah, that seems to be the way the POSIX spec clears things.
>> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
>> force all currently queued I/O operations associated with the file
>> indicated by file descriptor fildes to the synchronized I/O completion
>> state. All I/O operations shall be completed as defined for
>> synchronized I/O file integrity completion."
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>> If I understand that right, it is guaranteed that the rename() will be
>> atomic, meaning that there will be only one file even if there is a
>> crash, but that we need to fsync() the parent directory as mentioned.
>
> I don't see anywhere in the spec that a rename needs an fsync of the
>  directory to be durable. I can see why that would be needed in
> practice, though. File system developers would probably be able to
> give a more definite answer.

Yeah, POSIX is the smallest common denominator. In this case the spec 
seems not to require this durability guarantee (rename without fsync on 
directory), which allows a POSIX-compliant filesystem.

At least that's my conclusion from reading https://lwn.net/Articles/322823/

However, as I explained in the original post, it's more complicated as 
this only seems to be problem with fdatasync. I've been unable to 
reproduce the issue with wal_sync_method=fsync.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

01 December 2015, 22:06:09

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
     calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
     (except for those in timeline.c, as the START/END_CRIT_SECTION is
     not available there)

The patch is fairly trivial and I've done some rudimentary testing, but
I'm sure I haven't exercised all the modified paths.

regards
Tomas

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

xlog-fsync-v2.patch

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

02 December 2015, 06:23:53

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Attached is v2 of the patch, that
>
> (a) adds explicit fsync on the parent directory after all the rename()
>     calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>
> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls
>     (except for those in timeline.c, as the START/END_CRIT_SECTION is
>     not available there)
>
> The patch is fairly trivial and I've done some rudimentary testing, but I'm
> sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

02 December 2015, 06:24:24

On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is v2 of the patch, that
>>
>> (a) adds explicit fsync on the parent directory after all the rename()
>>     calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>
>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls
>>     (except for those in timeline.c, as the START/END_CRIT_SECTION is
>>     not available there)
>>
>> The patch is fairly trivial and I've done some rudimentary testing, but I'm
>> sure I haven't exercised all the modified paths.
>
> I would like to have an in-depth look at that after finishing the
> current CF, I am the manager of this one after all... Could you
> register it to 2016-01 CF for the time being? I don't mind being
> beaten by someone else if this someone has some room to look at this
> patch..

And please feel free to add my name as reviewer.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

19 January 2016, 06:44:16

On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Attached is v2 of the patch, that
>>>
>>> (a) adds explicit fsync on the parent directory after all the rename()
>>>     calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>
>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls
>>>     (except for those in timeline.c, as the START/END_CRIT_SECTION is
>>>     not available there)
>>>
>>> The patch is fairly trivial and I've done some rudimentary testing, but I'm
>>> sure I haven't exercised all the modified paths.
>>
>> I would like to have an in-depth look at that after finishing the
>> current CF, I am the manager of this one after all... Could you
>> register it to 2016-01 CF for the time being? I don't mind being
>> beaten by someone else if this someone has some room to look at this
>> patch..
>
> And please feel free to add my name as reviewer.

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

19 January 2016, 06:58:10


On 01/19/2016 07:44 AM, Michael Paquier wrote:
> On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> Attached is v2 of the patch, that
>>>>
>>>> (a) adds explicit fsync on the parent directory after all the rename()
>>>>      calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>>
>>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls
>>>>      (except for those in timeline.c, as the START/END_CRIT_SECTION is
>>>>      not available there)
>>>>
>>>> The patch is fairly trivial and I've done some rudimentary testing, but I'm
>>>> sure I haven't exercised all the modified paths.
>>>
>>> I would like to have an in-depth look at that after finishing the
>>> current CF, I am the manager of this one after all... Could you
>>> register it to 2016-01 CF for the time being? I don't mind being
>>> beaten by someone else if this someone has some room to look at this
>>> patch..
>>
>> And please feel free to add my name as reviewer.
>
> Tomas, I am planning to have a look at that, because it seems to be
> important. In case it becomes lost on my radar, do you mind if I add
> it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is 
still rotting here, considering it addresses a rather serious data loss 
/ corruption issue on pretty common setup.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

19 January 2016, 07:03:31

On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 01/19/2016 07:44 AM, Michael Paquier wrote:
>>
>> On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>> On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>>
>>>> On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
>>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>>
>>>>> Attached is v2 of the patch, that
>>>>>
>>>>> (a) adds explicit fsync on the parent directory after all the rename()
>>>>>      calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c
>>>>>
>>>>> (b) adds START/END_CRIT_SECTION around the new fsync_fname calls
>>>>>      (except for those in timeline.c, as the START/END_CRIT_SECTION is
>>>>>      not available there)
>>>>>
>>>>> The patch is fairly trivial and I've done some rudimentary testing, but
>>>>> I'm
>>>>> sure I haven't exercised all the modified paths.
>>>>
>>>>
>>>> I would like to have an in-depth look at that after finishing the
>>>> current CF, I am the manager of this one after all... Could you
>>>> register it to 2016-01 CF for the time being? I don't mind being
>>>> beaten by someone else if this someone has some room to look at this
>>>> patch..
>>>
>>>
>>> And please feel free to add my name as reviewer.
>>
>>
>> Tomas, I am planning to have a look at that, because it seems to be
>> important. In case it becomes lost on my radar, do you mind if I add
>> it to the 2016-03 CF?
>
>
> Well, what else can I do? I have to admit I'm quite surprised this is still
> rotting here, considering it addresses a rather serious data loss /
> corruption issue on pretty common setup.

Well, I think you did what you could. And we need to be sure now that
it gets in and that this patch gets a serious lookup. So for now my
guess is that not loosing track of it would be a good first move. I
have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

19 January 2016, 07:20:54


On 01/19/2016 08:03 AM, Michael Paquier wrote:
> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
...
>>> Tomas, I am planning to have a look at that, because it seems to be
>>> important. In case it becomes lost on my radar, do you mind if I add
>>> it to the 2016-03 CF?
>>
>>
>> Well, what else can I do? I have to admit I'm quite surprised this is still
>> rotting here, considering it addresses a rather serious data loss /
>> corruption issue on pretty common setup.
>
> Well, I think you did what you could. And we need to be sure now that
> it gets in and that this patch gets a serious lookup. So for now my
> guess is that not loosing track of it would be a good first move. I
> have added it here to attract more attention:
> https://commitfest.postgresql.org/9/484/

Ah, thanks. I haven't realized it's not added into 2016-1 (I'd swear I 
added it into the CF app).


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

22 January 2016, 05:46:02

On Tue, Jan 19, 2016 at 4:20 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 01/19/2016 08:03 AM, Michael Paquier wrote:
>>
>> On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>
> ...
>>>>
>>>> Tomas, I am planning to have a look at that, because it seems to be
>>>> important. In case it becomes lost on my radar, do you mind if I add
>>>> it to the 2016-03 CF?
>>>
>>>
>>>
>>> Well, what else can I do? I have to admit I'm quite surprised this is
>>> still
>>> rotting here, considering it addresses a rather serious data loss /
>>> corruption issue on pretty common setup.
>>
>>
>> Well, I think you did what you could. And we need to be sure now that
>> it gets in and that this patch gets a serious lookup. So for now my
>> guess is that not loosing track of it would be a good first move. I
>> have added it here to attract more attention:
>> https://commitfest.postgresql.org/9/484/
>
>
> Ah, thanks. I haven't realized it's not added into 2016-1 (I'd swear I added
> it into the CF app).

So, I have been playing with a Linux VM with VMware Fusion and on ext4
with data=ordered the renames are getting lost if the root folder is
not fsync. By killing-9 the VM I am able to reproduce that really
easily.

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.
3) In exitArchiveRecovery if the rename is not made durable I think it
does not matter much. Even if recovery.conf is the one present once
the node restarts node would move back again to recovery, and actually
we had better move back to recovery in this case, no?
4) StartupXLOG for the tablespace map. I don't think this one is
needed as well. Even if the tablespace map is not removed after a
power loss user would get an error telling that the file should be
removed.
5) For the one where haveBackupLabel || haveTblspcMap. If we do the
fsync, we guarantee that there is no need to do again the recovery.
But in case of a power loss, isn't it better to do the recovery again?
6) For the one after XLogArchiveNotify() for the last partial segment
of the old timeline, it doesn't really matter to not make the change
persistent as this is mainly done because it is useful to identify
that it is a partial segment.
7) In CancelBackup, this one is not needed as well I think. We would
surely want to get back to recovery if those files remain after a
power loss.

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename was
not effective.
2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.

In timeline.c:
1) writeTimeLineHistoryFile, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).
2) In writeTimeLineHistoryFile, similarly we don't need to care much,
in WalRcvFetchTimeLineHistoryFiles recovery would take again the same
path

Thoughts?
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

22 January 2016, 08:26:13

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:

> So, I have been playing with a Linux VM with VMware Fusion and on
> ext4 with data=ordered the renames are getting lost if the root
> folder is not fsync. By killing-9 the VM I am able to reproduce that
> really easily.

Yep. Same experience here (with qemu-kvm VMs).

> Here are some comments about your patch after a look at the code.
>
> Regarding the additions in fsync_fname() in xlog.c:
> 1) In InstallXLogFileSegment, rename() will be called only if
> HAVE_WORKING_LINK is not used, which happens only on Windows and
> cygwin. We could add it for consistency, but it should be within the
> #else/#endif block. It is not critical as of now.
> 2) The call in RemoveXlogFile is not necessary, the rename happening
> only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or 
could there be some other platforms? And are we sure the file systems on 
those platforms are safe without the fsync call?

That is, while the report references ext4, there may be other file 
systems with the same problem - ext4 was used mostly as it's the most 
widely used Linux file system.

> 3) In exitArchiveRecovery if the rename is not made durable I think
> it does not matter much. Even if recovery.conf is the one present
> once the node restarts node would move back again to recovery, and
> actually we had better move back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to 
exchange the one fsync for not having to manually fix the database after 
crash.

I don't really see why switching back to recovery should be desirable in 
this case? Imagine you have a warm/hot standby, and that you promote it 
to master. The clients connect, start issuing commands and then the 
system crashes and loses the recovery.conf rename. The system reboots, 
database performs local recovery but then starts as a standby and starts 
rejecting writes. That seems really weird to me.

> 4) StartupXLOG for the tablespace map. I don't think this one is
> needed as well. Even if the tablespace map is not removed after a
> power loss user would get an error telling that the file should be
> removed.

Please no, for the same reasons as in (3).

> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
> fsync, we guarantee that there is no need to do again the recovery.
> But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't 
have to? Had this not be reliable, then the whole recovery process is 
fundamentally broken and we better fix it instead of merely putting a 
band-aid on it.

> 6) For the one after XLogArchiveNotify() for the last partial
> segment of the old timeline, it doesn't really matter to not make the
> change persistent as this is mainly done because it is useful to
> identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to 
do the fsync. It's not really going to give us any measurable 
performance advantage (how often we do those fsyncs), so I'd vote to 
keep it and make sure the partial segments are named accordingly. Less 
confusion is always better.

> 7) In CancelBackup, this one is not needed as well I think. We would
> surely want to get back to recovery if those files remain after a
> power loss.

I may be missing something, but why would we switch to recovery in this 
case?

>
> For the ones in xlogarchive.c:
> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
> renaming a file with a temporary name to a permanent name. Once the
> node restarts, we would see an extra temporary file if the rename
> was not effective.

So we'll lose the segment (won't have it locally under the permanent 
name), as we've already restored it and won't do that again. Is that 
really a good thing to do?

> 2) XLogArchiveForceDone, the only bad thing that would happen here is
> to leave behind a .ready file instead of a .done file. I guess that we
> could have it though as an optimization to not have to archive again
> this file.

Yes.

>
> For the one in pgarch.c:
> 1) In pgarch_archiveDone, we could add it as an optimization to
> actually let the server know that the segment has been already
> archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

> In timeline.c:
> 1) writeTimeLineHistoryFile, it does not matter much. In the worst
> case we would have just a dummy temporary file, and startup process
> would come back here (in exitArchiveRecovery() we may finish with an
> unnamed backup file similarly).

OK

> 2) In writeTimeLineHistoryFile, similarly we don't need to care
> much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
> the same path

OK

>
> Thoughts?
>

Thanks for the review and comments. I think the question is whether we 
only want to do the additional fsync() only when it ultimately may lead 
to data loss, or even in cases where it may cause operational issues 
(e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier to 
operate (less manual interventions) and the performance impact is 0 (as 
those fsyncs are really rare).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Magnus Hagander

Date:

22 January 2016, 08:41:08

On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsync. By killing-9 the VM I am able to reproduce that
really easily.

Yep. Same experience here (with qemu-kvm VMs).

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could there be some other platforms? And are we sure the file systems on those platforms are safe without the fsync call?

That is, while the report references ext4, there may be other file systems with the same problem - ext4 was used mostly as it's the most widely used Linux file system.

3) In exitArchiveRecovery if the rename is not made durable I think
it does not matter much. Even if recovery.conf is the one present
once the node restarts node would move back again to recovery, and
actually we had better move back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to exchange the one fsync for not having to manually fix the database after crash.

I don't really see why switching back to recovery should be desirable in this case? Imagine you have a warm/hot standby, and that you promote it to master. The clients connect, start issuing commands and then the system crashes and loses the recovery.conf rename. The system reboots, database performs local recovery but then starts as a standby and starts rejecting writes. That seems really weird to me.

4) StartupXLOG for the tablespace map. I don't think this one is
needed as well. Even if the tablespace map is not removed after a
power loss user would get an error telling that the file should be
removed.

Please no, for the same reasons as in (3).

5) For the one where haveBackupLabel || haveTblspcMap. If we do the
fsync, we guarantee that there is no need to do again the recovery.
But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't have to? Had this not be reliable, then the whole recovery process is fundamentally broken and we better fix it instead of merely putting a band-aid on it.

6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter to not make the
change persistent as this is mainly done because it is useful to
identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to do the fsync. It's not really going to give us any measurable performance advantage (how often we do those fsyncs), so I'd vote to keep it and make sure the partial segments are named accordingly. Less confusion is always better.

7) In CancelBackup, this one is not needed as well I think. We would
surely want to get back to recovery if those files remain after a
power loss.

I may be missing something, but why would we switch to recovery in this case?

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.

So we'll lose the segment (won't have it locally under the permanent name), as we've already restored it and won't do that again. Is that really a good thing to do?

2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.

Yes.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

In timeline.c:
1) writeTimeLineHistoryFile, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).

OK

2) In writeTimeLineHistoryFile, similarly we don't need to care
much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
the same path

OK

Thoughts?

Thanks for the review and comments. I think the question is whether we only want to do the additional fsync() only when it ultimately may lead to data loss, or even in cases where it may cause operational issues (e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier to operate (less manual interventions) and the performance impact is 0 (as those fsyncs are really rare).

Yeah, unless it gives a significant performance penalty, I'd agree that the latter seems like the better option of those. +1 for that way :)

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

22 January 2016, 12:32:35

On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions in fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if
>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>> cygwin. We could add it for consistency, but it should be within the
>> #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>> only on Windows.
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
> there be some other platforms? And are we sure the file systems on those
> platforms are safe without the fsync call?
> That is, while the report references ext4, there may be other file systems
> with the same problem - ext4 was used mostly as it's the most widely used
> Linux file system.

From pg_config_manual.h:
#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif
If we want to be consistent with what Posix proposes, I am not against
adding it.

>> 3) In exitArchiveRecovery if the rename is not made durable I think
>> it does not matter much. Even if recovery.conf is the one present
>> once the node restarts node would move back again to recovery, and
>> actually we had better move back to recovery in this case, no?
>
> I'm strongly against this "optimization" - I'm more than happy to exchange
> the one fsync for not having to manually fix the database after crash.
>
> I don't really see why switching back to recovery should be desirable in
> this case? Imagine you have a warm/hot standby, and that you promote it to
> master. The clients connect, start issuing commands and then the system
> crashes and loses the recovery.conf rename. The system reboots, database
> performs local recovery but then starts as a standby and starts rejecting
> writes. That seems really weird to me.
>> 4) StartupXLOG for the tablespace map. I don't think this one is
>> needed as well. Even if the tablespace map is not removed after a
>> power loss user would get an error telling that the file should be
>> removed.
>
> Please no, for the same reasons as in (3).
>
>> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
>> fsync, we guarantee that there is no need to do again the recovery.
>> But in case of a power loss, isn't it better to do the recovery again?
>
> Why would it be better? Why should we do something twice when we don't have
> to? Had this not be reliable, then the whole recovery process is
> fundamentally broken and we better fix it instead of merely putting a
> band-aid on it.

Group shot with 3), 4) and 5). Well, there is no data loss here,
putting me in the direction of considering this addition of an fsync
as an optimization and not a bug.

>> 6) For the one after XLogArchiveNotify() for the last partial
>> segment of the old timeline, it doesn't really matter to not make the
>> change persistent as this is mainly done because it is useful to
>> identify that it is a partial segment.
>
> OK, although I still don't quite see why that should be a reason not to do
> the fsync. It's not really going to give us any measurable performance
> advantage (how often we do those fsyncs), so I'd vote to keep it and make
> sure the partial segments are named accordingly. Less confusion is always
> better.

Check.

>> 7) In CancelBackup, this one is not needed as well I think. We would
>> surely want to get back to recovery if those files remain after a
>> power loss.
>
> I may be missing something, but why would we switch to recovery in this
> case?



>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
>> renaming a file with a temporary name to a permanent name. Once the
>> node restarts, we would see an extra temporary file if the rename
>> was not effective.
>
> So we'll lose the segment (won't have it locally under the permanent name),
> as we've already restored it and won't do that again. Is that really a good
> thing to do?

At this point if a segment is restored from archive and there is a
power loss we are going back to recovery. The advantage of having the
fsync would ensure that the segment is not fetched twice.

>> 2) XLogArchiveForceDone, the only bad thing that would happen here is
>> to leave behind a .ready file instead of a .done file. I guess that we
>> could have it though as an optimization to not have to archive again
>> this file.
>
> Yes.
>
>>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to
>> actually let the server know that the segment has been already
>> archived, preventing a retry.
>
> Not sure what you mean by "could add it as an optimization"?

In case of a loss of the rename, server would retry archiving it. So
adding an fsync here ensures that the segment is marked as .done for
good.

>> Thoughts?
>
> Thanks for the review and comments. I think the question is whether we only
> want to do the additional fsync() only when it ultimately may lead to data
> loss, or even in cases where it may cause operational issues (e.g. switching
> back to recovery needlessly).
> I'd vote for the latter, as I think it makes the database easier to operate
> (less manual interventions) and the performance impact is 0 (as those fsyncs
> are really rare).

My first line of thoughts after looking at the patch is that I am not
against adding those fsync calls on HEAD as there is roughly an
advantage to not go back to recovery in most cases and ensure
consistent names, but as they do not imply any data loss I would not
encourage a back-patch. Adding them seems harmless at first sight I
agree, but those are not actual bugs.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Greg Stark

Date:

22 January 2016, 12:42:44

On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>
>> So, I have been playing with a Linux VM with VMware Fusion and on
>> ext4 with data=ordered the renames are getting lost if the root
>> folder is not fsync. By killing-9 the VM I am able to reproduce that
>> really easily.
>
>
> Yep. Same experience here (with qemu-kvm VMs).

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching though the drive may but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.

-- 
greg

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

22 January 2016, 12:54:52

On 2016-01-22 21:32:29 +0900, Michael Paquier wrote:
> Group shot with 3), 4) and 5). Well, there is no data loss here,
> putting me in the direction of considering this addition of an fsync
> as an optimization and not a bug.

I think this is an extremely weak argument. The reasoning when exactly a
loss of file is acceptable is complicated. In many cases adding an
additional fsync won't add measurable cost, given the frequency of
operations and/or the cost of surrounding operations.

Now, if you can make an argument why something is potentially impacting
performance *and* definitely not required: OK, then we can discuss
that.

Andres

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

23 January 2016, 01:36:05

On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>
>>> So, I have been playing with a Linux VM with VMware Fusion and on
>>> ext4 with data=ordered the renames are getting lost if the root
>>> folder is not fsync. By killing-9 the VM I am able to reproduce that
>>> really easily.
>>
>>
>> Yep. Same experience here (with qemu-kvm VMs).
>
> I still think a better approach for this is to run the database on an
> LVM volume and take lots of snapshots. No VM needed, though it doesn't
> hurt. LVM volumes are below the level of the filesystem and a snapshot
> captures the state of the raw blocks the filesystem has written to the
> block layer. The block layer does no caching though the drive may but
> neither the VM solution nor LVM would capture that.
>
> LVM snapshots would have the advantage that you can keep running the
> database and you can take lots of snapshots with relatively little
> overhead. Having dozens or hundreds of snapshots would be unacceptable
> performance drain in production but for testing it should be practical
> and they take relatively little space -- just the blocks changed since
> the snapshot was taken.

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs is IsBootstrapProcess() checks).
Once server crashes, kill-9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things are moving on. There is a window of up to several
seconds after the rename() call, so I guess that this would work.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

23 January 2016, 02:40:05

On 01/23/2016 02:35 AM, Michael Paquier wrote:
> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>>
>>>> So, I have been playing with a Linux VM with VMware Fusion and on
>>>> ext4 with data=ordered the renames are getting lost if the root
>>>> folder is not fsync. By killing-9 the VM I am able to reproduce that
>>>> really easily.
>>>
>>>
>>> Yep. Same experience here (with qemu-kvm VMs).
>>
>> I still think a better approach for this is to run the database on an
>> LVM volume and take lots of snapshots. No VM needed, though it doesn't
>> hurt. LVM volumes are below the level of the filesystem and a snapshot
>> captures the state of the raw blocks the filesystem has written to the
>> block layer. The block layer does no caching though the drive may but
>> neither the VM solution nor LVM would capture that.
>>
>> LVM snapshots would have the advantage that you can keep running the
>> database and you can take lots of snapshots with relatively little
>> overhead. Having dozens or hundreds of snapshots would be unacceptable
>> performance drain in production but for testing it should be practical
>> and they take relatively little space -- just the blocks changed since
>> the snapshot was taken.
>
> Another idea: hardcode a PANIC just after rename() with
> restart_after_crash = off (this needs is IsBootstrapProcess() checks).
> Once server crashes, kill-9 the VM. Then restart the VM and the
> Postgres instance with a new binary that does not have the PANIC, and
> see how things are moving on. There is a window of up to several
> seconds after the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact 
on the I/O requests already issued to the system. What you need is some 
sort of coordination between the database and the script that kills the 
VM (or takes a LVM snapshot).

That can be done by simply emitting a particular log message, and the 
"kill script" may simply watch the file (for example over SSH). This has 
the benefit that you can also watch for additional conditions that are 
difficult to check from that particular part of the code (and only kill 
the VM when all of them trigger - for example only on the third 
checkpoint since start, and such).

The reason why I was not particularly thrilled about the LVM snapshot 
idea is that to identify this particular data loss issue, you need to be 
able to reason about the expected state of the database (what 
transactions are committed, how many segments are there). And my 
understanding was that Greg's idea was merely "try to start the DB on a 
snapshot and see if starts / is not corrupted," which would not work 
with this particular issue, as the database seemed just fine - the data 
loss is silent. Adding the "last XLOG segment" into pg_controldata would 
make it easier to detect without having to track details about which 
transactions got committed.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

23 January 2016, 15:01:32

On Sat, Jan 23, 2016 at 11:39 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/23/2016 02:35 AM, Michael Paquier wrote:
>>
>> On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:
>>> On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>> LVM snapshots would have the advantage that you can keep running the
>>> database and you can take lots of snapshots with relatively little
>>> overhead. Having dozens or hundreds of snapshots would be unacceptable
>>> performance drain in production but for testing it should be practical
>>> and they take relatively little space -- just the blocks changed since
>>> the snapshot was taken.
>>
>>
>> Another idea: hardcode a PANIC just after rename() with
>> restart_after_crash = off (this needs is IsBootstrapProcess() checks).
>> Once server crashes, kill-9 the VM. Then restart the VM and the
>> Postgres instance with a new binary that does not have the PANIC, and
>> see how things are moving on. There is a window of up to several
>> seconds after the rename() call, so I guess that this would work.
>
>
> I don't see how that would improve anything, as the PANIC has no impact on
> the I/O requests already issued to the system. What you need is some sort of
> coordination between the database and the script that kills the VM (or takes
> a LVM snapshot).

Well, to emulate the noise that non-renamed files have on the system
we could simply emulate the loss of rename() by just commenting it out
and then forcibly crash the instance or just PANIC the instance just
before rename(). This would emulate what we are looking for, no? What
we want to check is how the system reacts should an unwanted file be
in place.
For example, take the rename() call in InstallXLogFileSegment(),
crashing with an non-effective rename() will cause the presence of an
annoying xlogtemp file. Making the rename persistent would make the
server complain about an invalid magic number in a segment that has
just been created.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

25 January 2016, 07:30:53

On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>>> Here are some comments about your patch after a look at the code.
>>>
>>> Regarding the additions in fsync_fname() in xlog.c:
>>> 1) In InstallXLogFileSegment, rename() will be called only if
>>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>>> cygwin. We could add it for consistency, but it should be within the
>>> #else/#endif block. It is not critical as of now.
>>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>>> only on Windows.
>>
>> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
>> there be some other platforms? And are we sure the file systems on those
>> platforms are safe without the fsync call?
>> That is, while the report references ext4, there may be other file systems
>> with the same problem - ext4 was used mostly as it's the most widely used
>> Linux file system.
>
> From pg_config_manual.h:
> #if !defined(WIN32) && !defined(__CYGWIN__)
> #define HAVE_WORKING_LINK 1
> #endif
> If we want to be consistent with what Posix proposes, I am not against
> adding it.

I did some tests with NTFS using cygwin, and the rename() calls remain
even after powering off the VM. But I agree that adding an fsync() in
both cases would be fine.

>>> Thoughts?
>>
>> Thanks for the review and comments. I think the question is whether we only
>> want to do the additional fsync() only when it ultimately may lead to data
>> loss, or even in cases where it may cause operational issues (e.g. switching
>> back to recovery needlessly).
>> I'd vote for the latter, as I think it makes the database easier to operate
>> (less manual interventions) and the performance impact is 0 (as those fsyncs
>> are really rare).
>
> My first line of thoughts after looking at the patch is that I am not
> against adding those fsync calls on HEAD as there is roughly an
> advantage to not go back to recovery in most cases and ensure
> consistent names, but as they do not imply any data loss I would not
> encourage a back-patch. Adding them seems harmless at first sight I
> agree, but those are not actual bugs.

OK. It is true that PGDATA would be fsync'd in 4 code paths with your
patch which are not that much taken:
- Renaming tablespace map file and backup label file (three times)
- Renaming to recovery.done
So, what do you think about the patch attached? Moving the calls into
the critical sections is not really necessary except when installing a
new segment.
--
Michael

Attachment

xlog-fsync-v3.patch

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

25 January 2016, 09:50:17

On 01/25/2016 08:30 AM, Michael Paquier wrote:
> On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:

,,,

>> My first line of thoughts after looking at the patch is that I am
>> not against adding those fsync calls on HEAD as there is roughly
>> an advantage to not go back to recovery in most cases and ensure
>> consistent names, but as they do not imply any data loss I would
>> not encourage a back-patch. Adding them seems harmless at first
>> sight I agree, but those are not actual bugs.
>
> OK. It is true that PGDATA would be fsync'd in 4 code paths with your
> patch which are not that much taken:
> - Renaming tablespace map file and backup label file (three times)
> - Renaming to recovery.done
> So, what do you think about the patch attached? Moving the calls into
> the critical sections is not really necessary except when installing a
> new segment.

Seems OK to me. Thanks for the time and improvements!

Tomas


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

25 January 2016, 12:27:24

On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Seems OK to me. Thanks for the time and improvements!

Thanks. Perhaps a committer could have a look then? I have switched
the patch as such in the CF app. Seeing the accumulated feedback
upthread that's something that should be backpatched.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Alvaro Herrera

Date:

01 February 2016, 15:50:02

Michael Paquier wrote:
> On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Seems OK to me. Thanks for the time and improvements!
> 
> Thanks. Perhaps a committer could have a look then? I have switched
> the patch as such in the CF app. Seeing the accumulated feedback
> upthread that's something that should be backpatched.

Yeah.  On 9.4 there are already some conflicts, and I'm sure there will
be more in almost each branch.  Does anyone want to volunteer for
producing per-branch versions?

The next minor release is to be tagged next week and it'd be good to put
this fix there.

Thanks,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

01 February 2016, 16:07:19

On 2016-01-25 16:30:47 +0900, Michael Paquier wrote:
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index a2846c4..b124f90 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
>                          tmppath, path)));
>          return false;
>      }
> +
> +    /*
> +     * Make sure the rename is permanent by fsyncing the parent
> +     * directory.
> +     */
> +    START_CRIT_SECTION();
> +    fsync_fname(XLOGDIR, true);
> +    END_CRIT_SECTION();
>  #endif

Hm. I'm seriously doubtful that using critical sections for this is a
good idea. What's the scenario you're protecting against by declaring
this one?  We intentionally don't error out in the isdir cases in
fsync_fname() cases anyway.

Afaik we need to fsync tmppath before and after the rename, and only
then the directory, to actually be safe.

> @@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
>                   errmsg("could not rename file \"%s\" to \"%s\": %m",
>                          RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
>  
> +    /* Make sure the rename is permanent by fsyncing the data directory. */
> +    fsync_fname(".", true);
> +

Shouldn't RECOVERY_COMMAND_DONE be fsynced first here?

>      ereport(LOG,
>              (errmsg("archive recovery complete")));
>  }
> @@ -6150,6 +6169,12 @@ StartupXLOG(void)
>                                  TABLESPACE_MAP, BACKUP_LABEL_FILE),
>                           errdetail("Could not rename file \"%s\" to \"%s\": %m.",
>                                     TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
> +
> +            /*
> +             * Make sure the rename is permanent by fsyncing the data
> +             * directory.
> +             */
> +            fsync_fname(".", true);
>          }

Is it just me, or are the repeated four line comments a bit grating?

>          /*
> @@ -6525,6 +6550,13 @@ StartupXLOG(void)
>                                  TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>          }
>  
> +        /*
> +         * Make sure the rename is permanent by fsyncing the parent
> +         * directory.
> +         */
> +        if (haveBackupLabel || haveTblspcMap)
> +            fsync_fname(".", true);
> +

Isn't that redundant with the haveTblspcMap case above?

>          /* Check that the GUCs used to generate the WAL allow recovery */
>          CheckRequiredParameterValues();
>  
> @@ -7305,6 +7337,12 @@ StartupXLOG(void)
>                           errmsg("could not rename file \"%s\" to \"%s\": %m",
>                                  origpath, partialpath)));
>                  XLogArchiveNotify(partialfname);
> +
> +                /*
> +                 * Make sure the rename is permanent by fsyncing the parent
> +                 * directory.
> +                 */
> +                fsync_fname(XLOGDIR, true);

.partial should be fsynced first.

>              }
>          }
>      }
> @@ -10905,6 +10943,9 @@ CancelBackup(void)
>                             BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
>                             TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>      }
> +
> +    /* fsync the data directory to persist the renames */
> +    fsync_fname(".", true);
>  }

Same.

>  /*
> diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
> index 277c14a..8dda80b 100644
> --- a/src/backend/access/transam/xlogarchive.c
> +++ b/src/backend/access/transam/xlogarchive.c
> @@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
>                          path, xlogfpath)));
>  
>      /*
> +     * Make sure the renames are permanent by fsyncing the parent
> +     * directory.
> +     */
> +    fsync_fname(XLOGDIR, true);

Afaics the file under the temporary filename has not been fsynced up to
here.

> +    /*
>       * Create .done file forcibly to prevent the restored segment from being
>       * archived again later.
>       */
> @@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
>                       errmsg("could not rename file \"%s\" to \"%s\": %m",
>                              archiveReady, archiveDone)));
>  
> +        /*
> +         * Make sure the rename is permanent by fsyncing the parent
> +         * directory.
> +         */
> +        fsync_fname(XLOGDIR "/archive_status", true);
>          return;
>      }

Afaics the AllocateFile() call below is not protected at all, no?


How about introducing a 'safe_rename()' that does something roughly akin
to:
fsync(oldname);
fsync(fname) || true;
rename(oldfname, fname);
fsync(fname);
fsync(basename(fname));

I'd rather have that kind of logic somewhere once, instead of repeated a
dozen times...

Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

01 February 2016, 16:08:58

On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote:
> Yeah.  On 9.4 there are already some conflicts, and I'm sure there will
> be more in almost each branch.  Does anyone want to volunteer for
> producing per-branch versions?

> The next minor release is to be tagged next week and it'd be good to put
> this fix there.

I don't think this is going to be ready for that. The risk of hurrying
this through seems higher than the loss risk at this point.

Andres

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

02 February 2016, 00:56:46

On Tue, Feb 2, 2016 at 1:08 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote:
>> Yeah.  On 9.4 there are already some conflicts, and I'm sure there will
>> be more in almost each branch.  Does anyone want to volunteer for
>> producing per-branch versions?
>
>> The next minor release is to be tagged next week and it'd be good to put
>> this fix there.
>
> I don't think this is going to be ready for that. The risk of hurrying
> this through seems higher than the loss risk at this point.

Agreed. And there is no actual risk of data loss, so let's not hurry
and be sure that the right think is pushed.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

02 February 2016, 00:58:15

On Tue, Feb 2, 2016 at 12:49 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Seems OK to me. Thanks for the time and improvements!
>>
>> Thanks. Perhaps a committer could have a look then? I have switched
>> the patch as such in the CF app. Seeing the accumulated feedback
>> upthread that's something that should be backpatched.
>
> Yeah.  On 9.4 there are already some conflicts, and I'm sure there will
> be more in almost each branch.  Does anyone want to volunteer for
> producing per-branch versions?

I don't mind doing it once we have something fully-bloomed for master,
something that I guess is very likely to apply easily to 9.5 as well.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

02 February 2016, 00:59:22

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
> And there is no actual risk of data loss

Huh?

- Andres

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

02 February 2016, 07:20:15

On Tue, Feb 2, 2016 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-25 16:30:47 +0900, Michael Paquier wrote:
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index a2846c4..b124f90 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
>>                                               tmppath, path)));
>>               return false;
>>       }
>> +
>> +     /*
>> +      * Make sure the rename is permanent by fsyncing the parent
>> +      * directory.
>> +      */
>> +     START_CRIT_SECTION();
>> +     fsync_fname(XLOGDIR, true);
>> +     END_CRIT_SECTION();
>>  #endif
>
> Hm. I'm seriously doubtful that using critical sections for this is a
> good idea. What's the scenario you're protecting against by declaring
> this one?  We intentionally don't error out in the isdir cases in
> fsync_fname() cases anyway.
>
> Afaik we need to fsync tmppath before and after the rename, and only
> then the directory, to actually be safe.

Regarding the fsync call on the new file before the rename, would it
be better to extend fsync_fname() with some kind of noerror flag to
work around the case of a file that does not exist or do you think it
is better just to use pg_fsync in this case after getting an fd? Using
directly pg_fsync() looks redundant with what fsync_fname does
already.

>> @@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
>>                                errmsg("could not rename file \"%s\" to \"%s\": %m",
>>                                               RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
>>
>> +     /* Make sure the rename is permanent by fsyncing the data directory. */
>> +     fsync_fname(".", true);
>> +
>
> Shouldn't RECOVERY_COMMAND_DONE be fsynced first here?

Check.

>>               /*
>> @@ -6525,6 +6550,13 @@ StartupXLOG(void)
>>                                                               TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>>               }
>>
>> +             /*
>> +              * Make sure the rename is permanent by fsyncing the parent
>> +              * directory.
>> +              */
>> +             if (haveBackupLabel || haveTblspcMap)
>> +                     fsync_fname(".", true);
>> +
>
> Isn't that redundant with the haveTblspcMap case above?

I am not sure I get your point here. Are you referring to the fact
that fsync should be done after each rename in this case?

>>               /* Check that the GUCs used to generate the WAL allow recovery */
>>               CheckRequiredParameterValues();
>>
>> @@ -7305,6 +7337,12 @@ StartupXLOG(void)
>>                                                errmsg("could not rename file \"%s\" to \"%s\": %m",
>>                                                               origpath, partialpath)));
>>                               XLogArchiveNotify(partialfname);
>> +
>> +                             /*
>> +                              * Make sure the rename is permanent by fsyncing the parent
>> +                              * directory.
>> +                              */
>> +                             fsync_fname(XLOGDIR, true);
>
> .partial should be fsynced first.

Check.

>>                       }
>>               }
>>       }
>> @@ -10905,6 +10943,9 @@ CancelBackup(void)
>>                                                  BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
>>                                                  TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
>>       }
>> +
>> +     /* fsync the data directory to persist the renames */
>> +     fsync_fname(".", true);
>>  }
>
> Same.

Re-check.

>>  /*
>> diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
>> index 277c14a..8dda80b 100644
>> --- a/src/backend/access/transam/xlogarchive.c
>> +++ b/src/backend/access/transam/xlogarchive.c
>> @@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
>>                                               path, xlogfpath)));
>>
>>       /*
>> +      * Make sure the renames are permanent by fsyncing the parent
>> +      * directory.
>> +      */
>> +     fsync_fname(XLOGDIR, true);
>
> Afaics the file under the temporary filename has not been fsynced up to
> here.

Yes, true, the old file...

>> +     /*
>>        * Create .done file forcibly to prevent the restored segment from being
>>        * archived again later.
>>        */
>> @@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
>>                                        errmsg("could not rename file \"%s\" to \"%s\": %m",
>>                                                       archiveReady, archiveDone)));
>>
>> +             /*
>> +              * Make sure the rename is permanent by fsyncing the parent
>> +              * directory.
>> +              */
>> +             fsync_fname(XLOGDIR "/archive_status", true);
>>               return;
>>       }
>
> Afaics the AllocateFile() call below is not protected at all, no?

Yep.

> How about introducing a 'safe_rename()' that does something roughly akin
> to:
> fsync(oldname);
> fsync(fname) || true;
> rename(oldfname, fname);
> fsync(fname);
> fsync(basename(fname));
>
> I'd rather have that kind of logic somewhere once, instead of repeated a
> dozen times...

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary per the multiple code paths that would
use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
that look fine?
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 February 2016, 08:59:47

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
>> And there is no actual risk of data loss
>
> Huh?

More precise: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at next restart, so we
would not actually *lose* data.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 February 2016, 09:02:37

On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
> Not wrong, and this leads to the following:
> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
> Controlling elevel is necessary per the multiple code paths that would
> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
> that look fine?

After really coding it, I finished with the following thing:
+int
+rename_safe(const char *old, const char *new)

There is no need to extend that for directories, well we could of
course but all the renames happen on files so I see no need to make
that more complicated. More refactoring of the other rename() calls
could be done as well by extending rename_safe() with a flag to fsync
the data within a critical section, particularly for the replication
slot code. I have let that out to not complicate more the patch.
--
Michael

Attachment

xlog-fsync-v4.patch

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 February 2016, 11:34:08

On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
>> Not wrong, and this leads to the following:
>> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
>> Controlling elevel is necessary per the multiple code paths that would
>> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
>> that look fine?
>
> After really coding it, I finished with the following thing:
> +int
> +rename_safe(const char *old, const char *new)
>
> There is no need to extend that for directories, well we could of
> course but all the renames happen on files so I see no need to make
> that more complicated. More refactoring of the other rename() calls
> could be done as well by extending rename_safe() with a flag to fsync
> the data within a critical section, particularly for the replication
> slot code. I have let that out to not complicate more the patch.

Andres just poked me (2m far from each other now) regarding the fact
that we should fsync even after the link() calls when
HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea
culpa. And the patch misses that.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

05 February 2016, 09:41:05

On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
>>> Not wrong, and this leads to the following:
>>> void rename_safe(const char *old, const char *new, bool isdir, int elevel);
>>> Controlling elevel is necessary per the multiple code paths that would
>>> use it. Some use ERROR, most of them FATAL, and a bit of WARNING. Does
>>> that look fine?
>>
>> After really coding it, I finished with the following thing:
>> +int
>> +rename_safe(const char *old, const char *new)
>>
>> There is no need to extend that for directories, well we could of
>> course but all the renames happen on files so I see no need to make
>> that more complicated. More refactoring of the other rename() calls
>> could be done as well by extending rename_safe() with a flag to fsync
>> the data within a critical section, particularly for the replication
>> slot code. I have let that out to not complicate more the patch.
>
> Andres just poked me (2m far from each other now) regarding the fact
> that we should fsync even after the link() calls when
> HAVE_WORKING_LINK is used. So we could lose some meta data here. Mea
> culpa. And the patch misses that.

So, attached is an updated patch that adds a new routine link_safe()
to ensure that a hard link is on-disk persistent. link() is called
twice in timeline.c and once in xlog.c, so those three code paths are
impacted. I noticed as well that my previous patch was sometimes doing
palloc calls in a critical section (oops), I fixed that on the way.

Thoughts welcome.
--
Michael

Attachment

xlog-fsync-v5.patch

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

05 February 2016, 17:11:59

On 02/04/2016 09:59 AM, Michael Paquier wrote:
> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
>>> And there is no actual risk of data loss
>>
>> Huh?
>
> More precise: what I mean here is that should an OS crash or a power
> failure happen, we would fall back to recovery at next restart, so we
> would not actually *lose* data.

Except that we actually can't perform the recovery properly because we 
may not have the last WAL segment (or multiple segments), so we can't 
replay the last batch of transactions. And we don't even notice that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

06 February 2016, 12:16:31

On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 02/04/2016 09:59 AM, Michael Paquier wrote:
>>
>> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
>>>>
>>>> And there is no actual risk of data loss
>>>
>>>
>>> Huh?
>>
>>
>> More precise: what I mean here is that should an OS crash or a power
>> failure happen, we would fall back to recovery at next restart, so we
>> would not actually *lose* data.
>
>
> Except that we actually can't perform the recovery properly because we may
> not have the last WAL segment (or multiple segments), so we can't replay the
> last batch of transactions. And we don't even notice that.

Still the data is here... But well. I won't insist. Tomas, could you
have a look at the latest patch I wrote? It would be good to get fresh
eyes on it. We could work on a version for ~9.4 once we have a clean
approach for master/9.5.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

06 February 2016, 16:43:59

Hi,

On 02/06/2016 01:16 PM, Michael Paquier wrote:
> On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 02/04/2016 09:59 AM, Michael Paquier wrote:
>>>
>>> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:
>>>>>
>>>>> And there is no actual risk of data loss
>>>>
>>>>
>>>> Huh?
>>>
>>>
>>> More precise: what I mean here is that should an OS crash or a
>>> power failure happen, we would fall back to recovery at next
>>> restart, so we would not actually *lose* data.
>>
>>
>> Except that we actually can't perform the recovery properly
>> because we may not have the last WAL segment (or multiple
>> segments), so we can't replay the last batch of transactions. And
>> we don't even notice that.
>
> Still the data is here... But well. I won't insist.

Huh? This thread started by an example how to cause loss of committed 
transactions. That fits my definition of "data loss" quite well.

> Tomas, could you have a look at the latest patch I wrote? It would be
> good to get fresh eyes on it. We could work on a version for ~9.4
> once we have a clean approach for master/9.5.

Yep, I'll take a look - I've been out of office for the past 2 weeks, 
but I've been following the discussion and I agree with the changes 
discussed there (e.g. adding safe_rename and such).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

06 February 2016, 17:41:55

On 2016-02-06 17:43:48 +0100, Tomas Vondra wrote:
> >Still the data is here... But well. I won't insist.
> 
> Huh? This thread started by an example how to cause loss of committed
> transactions. That fits my definition of "data loss" quite well.

Agreed, that view doesn't seem to make much sense. This clearly is a
data loss issue.

Andres

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

23 February 2016, 22:26:58

Hi,

On 02/05/2016 10:40 AM, Michael Paquier wrote:
> On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:
...
> So, attached is an updated patch that adds a new routine link_safe()
> to ensure that a hard link is on-disk persistent. link() is called
> twice in timeline.c and once in xlog.c, so those three code paths
> are impacted. I noticed as well that my previous patch was sometimes
> doing palloc calls in a critical section (oops), I fixed that on the
> way.

I've finally got around to review the v5 version of the patch. Sorry it 
took so long (I blame FOSDEM, country-wide flu epidemic and my general 
laziness).

I do like most of the changes to the patch, thanks for improving it. A 
few comments though:

1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? 
The only place where we use missing_ok=true is in rename_safe, where 
right at the beginning we do this:
    fsync_fname(newfile, false, true);

I.e. we're fsyncing the rename target first, in case it exists. But that 
seems to be conflicting with the comments in xlog.c where we explicitly 
state that the target file should not exist. Why should it be OK to call 
rename_safe() when the target already exists? If this really is the 
right thing to do, it should be explained in the comment above 
rename_safe(), probably.

2) If rename_safe/link_safe are meant as crash-safe replacements for 
rename/link, then perhaps we should use the same signatures, including 
the "const" pointer parameters. So while currently the signatures look 
like this:
    int rename_safe(char *oldfile, char *newfile);    int link_safe(char *oldfile, char *newfile);

it should probably look like this
    int rename_safe(const char *oldfile, const char *newfile);    int link_safe(const char *oldfile, const char
*newfile);

I've noticed this in CheckPointReplicationOrigin() where the current 
code has to cast the parameters to (char*) to silence the compiler.

3) Both rename_safe and link_safe do this at the very end:
    fsync_parent_path(newfile);

That however assumes both the oldfile and newfile are placed in the same 
directory - otherwise we'd fsync only one of them. I don't think we have 
a place where we're renaming files between directories (or do we), so 
we're OK with respect to this. But it seems like a good idea to defend 
against this, or at least mention that in the comments.

4) nitpicking: There are some unnecessary newlines added/removed in 
RemoveXlogFile, XLogArchiveForceDone.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

24 February 2016, 07:06:19

On Wed, Feb 24, 2016 at 7:26 AM, Tomas Vondra wrote:
> 1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? The
> only place where we use missing_ok=true is in rename_safe, where right at
> the beginning we do this:
>
>     fsync_fname(newfile, false, true);
>
> I.e. we're fsyncing the rename target first, in case it exists. But that
> seems to be conflicting with the comments in xlog.c where we explicitly
> state that the target file should not exist. Why should it be OK to call
> rename_safe() when the target already exists? If this really is the right
> thing to do, it should be explained in the comment above rename_safe(),
> probably.

The point is to mimic rename(), which can manage the case where the
new entry exists, and to look for a consistent on-disk behavior. It is
true that the new argument of fsync_fname is actually not necessary,
we could just check for the existence of the new entry with stat(),
and perform an fsync if it exists.
I have added the following comment in rename_safe():
+   /*
+    * First fsync the old entry and new entry, it this one exists, to ensure
+    * that they are properly persistent on disk. Calling this routine with
+    * an existing new target file is fine, rename() will first remove it
+    * before performing its operation.
+    */
How does that look?

> 2) If rename_safe/link_safe are meant as crash-safe replacements for
> rename/link, then perhaps we should use the same signatures, including the
> "const" pointer parameters. So while currently the signatures look like
> this:
>
>     int rename_safe(char *oldfile, char *newfile);
>     int link_safe(char *oldfile, char *newfile);
>
> it should probably look like this
>
>     int rename_safe(const char *oldfile, const char *newfile);
>     int link_safe(const char *oldfile, const char *newfile);
>
> I've noticed this in CheckPointReplicationOrigin() where the current code
> has to cast the parameters to (char*) to silence the compiler.

I recall considering that, and the reason why I did not do so was that
fsync_fname() is not doing it either, because OpenTransientFile() is
using FileName which is not defined as a constant. At the end we'd
finish with one or two casts anyway. So it seems that changing
fsync_fname makes sense though by looking at fsync_fname_ext,
OpenTransientFile() is using an explicit cast for the file name.

> 3) Both rename_safe and link_safe do this at the very end:
>
>     fsync_parent_path(newfile);
>
> That however assumes both the oldfile and newfile are placed in the same
> directory - otherwise we'd fsync only one of them. I don't think we have a
> place where we're renaming files between directories (or do we), so we're OK
> with respect to this. But it seems like a good idea to defend against this,
> or at least mention that in the comments.

No, we don't have a code path where a file is renamed between
different directories, which is why this code is doing so. The comment
is a good addition to have though, so I have added it. I guess that we
could complicate more this patch to check if the parent directories of
the new and old entries match or not, then fsync both of them, but I'd
rather keep things simple.

> 4) nitpicking: There are some unnecessary newlines added/removed in
> RemoveXlogFile, XLogArchiveForceDone.

Fixed. I missed those three ones.
--
Michael

Attachment

xlog-fsync-v6.patch

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

03 March 2016, 19:06:20

Hi,

> +/*
> + * rename_safe -- rename of a file, making it on-disk persistent
> + *
> + * This routine ensures that a rename file persists in case of a crash by using
> + * fsync on the old and new files before and after performing the rename so as
> + * this categorizes as an all-or-nothing operation.
> + */
> +int
> +rename_safe(const char *oldfile, const char *newfile)
> +{
> +    struct stat    filestats;
> +
> +    /*
> +     * First fsync the old entry and new entry, it this one exists, to ensure
> +     * that they are properly persistent on disk. Calling this routine with
> +     * an existing new target file is fine, rename() will first remove it
> +     * before performing its operation.
> +     */
> +    fsync_fname(oldfile, false);
> +    if (stat(newfile, &filestats) == 0)
> +        fsync_fname(newfile, false);

I don't think we want any stat()s here. I'd much, much rather check open
for ENOENT.


> +/*
> + * link_safe -- make a file hard link, making it on-disk persistent
> + *
> + * This routine ensures that a hard link created on a file persists on system
> + * in case of a crash by using fsync where on the link generated as well as on
> + * its parent directory.
> + */
> +int
> +link_safe(const char *oldfile, const char *newfile)
> +{

If we go for a new abstraction here, I'd rather make it
'replace_file_safe' or something, and move the link/rename code #ifdef
into it.


> +    if (link(oldfile, newfile) < 0)
> +        return -1;
> +
> +    /*
> +     * Make the link persistent in case of an OS crash, the new entry
> +     * generated as well as its parent directory need to be flushed.
> +     */
> +    fsync_fname(newfile, false);
> +
> +    /*
> +     * Same for parent directory. This routine is never called to rename
> +     * files across directories, but if this proves to become the case,
> +     * flushing the parent directory if the old file would be necessary.
> +     */
> +    fsync_parent_path(newfile);
> +    return 0;

I think it's a seriously bad idea to encode that knowledge in such a
general sounding routine.  We could however argue that this is about
safely replacing the *target* file; not about safely removing the old
file.


Currently I'm inclined to apply this to master soon. But I think we
might want to wait a while with backpatching.  The recent fsync upgrade
disaster kinda makes me a bit careful...


Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 March 2016, 05:51:56

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,

Thanks for the review.

>> +/*
>> + * rename_safe -- rename of a file, making it on-disk persistent
>> + *
>> + * This routine ensures that a rename file persists in case of a crash by using
>> + * fsync on the old and new files before and after performing the rename so as
>> + * this categorizes as an all-or-nothing operation.
>> + */
>> +int
>> +rename_safe(const char *oldfile, const char *newfile)
>> +{
>> +     struct stat     filestats;
>> +
>> +     /*
>> +      * First fsync the old entry and new entry, it this one exists, to ensure
>> +      * that they are properly persistent on disk. Calling this routine with
>> +      * an existing new target file is fine, rename() will first remove it
>> +      * before performing its operation.
>> +      */
>> +     fsync_fname(oldfile, false);
>> +     if (stat(newfile, &filestats) == 0)
>> +             fsync_fname(newfile, false);
>
> I don't think we want any stat()s here. I'd much, much rather check open
> for ENOENT.

OK. So you mean more or less that, right?
int fd;
fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
if (fd < 0)
{   if (errno != ENOENT)       return -1;
}
else
{   pg_fsync(fd);   CloseTransientFile(fd);
}

>> +/*
>> + * link_safe -- make a file hard link, making it on-disk persistent
>> + *
>> + * This routine ensures that a hard link created on a file persists on system
>> + * in case of a crash by using fsync where on the link generated as well as on
>> + * its parent directory.
>> + */
>> +int
>> +link_safe(const char *oldfile, const char *newfile)
>> +{
>
> If we go for a new abstraction here, I'd rather make it
> 'replace_file_safe' or something, and move the link/rename code #ifdef
> into it.

Hm. OK. I don't see any reason why switching to link() even in code
paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
be a problem thinking about it. Should HAVE_WORKING_LINK be available
on a platform we can combine it with unlink. Is that in line with what
you think?

>> +     if (link(oldfile, newfile) < 0)
>> +             return -1;
>> +
>> +     /*
>> +      * Make the link persistent in case of an OS crash, the new entry
>> +      * generated as well as its parent directory need to be flushed.
>> +      */
>> +     fsync_fname(newfile, false);
>> +
>> +     /*
>> +      * Same for parent directory. This routine is never called to rename
>> +      * files across directories, but if this proves to become the case,
>> +      * flushing the parent directory if the old file would be necessary.
>> +      */
>> +     fsync_parent_path(newfile);
>> +     return 0;
>
> I think it's a seriously bad idea to encode that knowledge in such a
> general sounding routine.  We could however argue that this is about
> safely replacing the *target* file; not about safely removing the old
> file.

Not sure I am following here. Are you referring to the fact that if
the new file and old file are on different directories would make this
routine unreliable? Because yes that's the case if we want to make
both of them persistent, and I think we want to do so. Do you suggest
to correct this comment to remove the mention to the old file's parent
directory because we just care about having the new file as being
persistent? Or do you suggest that we should actually extend this
routine so as we fsync both the new and old file's parent directory if
they differ?

> Currently I'm inclined to apply this to master soon. But I think we
> might want to wait a while with backpatching.  The recent fsync upgrade
> disaster kinda makes me a bit careful...

There have not been actual field complaints about that yet. That's
fine for me to wait a bit.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Alvaro Herrera

Date:

04 March 2016, 15:53:01

I would like to have a patch for this finalized today, so that we can
apply to master before or during the weekend; with it in the tree for
about a week we can be more confident and backpatch close to next
weekend, so that we see it in the next set of minor releases.  Does that
sound good?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Tom Lane

Date:

04 March 2016, 16:09:26

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I would like to have a patch for this finalized today, so that we can
> apply to master before or during the weekend; with it in the tree for
> about a week we can be more confident and backpatch close to next
> weekend, so that we see it in the next set of minor releases.  Does that
> sound good?

I see no reason to wait before backpatching.  If you're concerned about
having testing, the more branches it is in, the more buildfarm cycles
you will get on it.  And we're not going to cut any releases in between,
so what's the benefit of not having it there?
        regards, tom lane

Re: silent data loss with ext4 / all current versions

From

Robert Haas

Date:

04 March 2016, 16:23:29

On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I would like to have a patch for this finalized today, so that we can
>> apply to master before or during the weekend; with it in the tree for
>> about a week we can be more confident and backpatch close to next
>> weekend, so that we see it in the next set of minor releases.  Does that
>> sound good?
>
> I see no reason to wait before backpatching.  If you're concerned about
> having testing, the more branches it is in, the more buildfarm cycles
> you will get on it.  And we're not going to cut any releases in between,
> so what's the benefit of not having it there?

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 March 2016, 22:29:39

On Sat, Mar 5, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> I would like to have a patch for this finalized today, so that we can
>>> apply to master before or during the weekend; with it in the tree for
>>> about a week we can be more confident and backpatch close to next
>>> weekend, so that we see it in the next set of minor releases.  Does that
>>> sound good?
>>
>> I see no reason to wait before backpatching.  If you're concerned about
>> having testing, the more branches it is in, the more buildfarm cycles
>> you will get on it.  And we're not going to cut any releases in between,
>> so what's the benefit of not having it there?
>
> Agreed.

OK. I could produce that by tonight my time, not before unfortunately.
And FWIW, per the comments of Andres, it is not clear to me what we
gain by having a common routine for link() and rename() knowing that
half the code paths performing a rename do not rely on link(). At
least it sound dangerous to me to introduce a dependency to link() in
code paths that depend just on rename() for back branches. On HEAD, we
could be more adventurous for sure. Regarding the replacement of
stat() by something relying on OpenTransientFile I agree though. For
the flush of the parent directory in link_safe() we'd still want to do
it, and we are fine to not flush the parent directory of the old file
because the backend does not move files across paths.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

04 March 2016, 22:35:36

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:
> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> 
> Thanks for the review.
> 
> >> +/*
> >> + * rename_safe -- rename of a file, making it on-disk persistent
> >> + *
> >> + * This routine ensures that a rename file persists in case of a crash by using
> >> + * fsync on the old and new files before and after performing the rename so as
> >> + * this categorizes as an all-or-nothing operation.
> >> + */
> >> +int
> >> +rename_safe(const char *oldfile, const char *newfile)
> >> +{
> >> +     struct stat     filestats;
> >> +
> >> +     /*
> >> +      * First fsync the old entry and new entry, it this one exists, to ensure
> >> +      * that they are properly persistent on disk. Calling this routine with
> >> +      * an existing new target file is fine, rename() will first remove it
> >> +      * before performing its operation.
> >> +      */
> >> +     fsync_fname(oldfile, false);
> >> +     if (stat(newfile, &filestats) == 0)
> >> +             fsync_fname(newfile, false);
> >
> > I don't think we want any stat()s here. I'd much, much rather check open
> > for ENOENT.
> 
> OK. So you mean more or less that, right?
> int fd;
> fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
> if (fd < 0)
> {
>     if (errno != ENOENT)
>         return -1;
> }
> else
> {
>     pg_fsync(fd);
>     CloseTransientFile(fd);
> }

Yes. Otherwise the check is racy: The file could be gone by the time you
do the fsync; leading to a spurious ERROR (which often would get
promoted to a PANIC).

> >> +/*
> >> + * link_safe -- make a file hard link, making it on-disk persistent
> >> + *
> >> + * This routine ensures that a hard link created on a file persists on system
> >> + * in case of a crash by using fsync where on the link generated as well as on
> >> + * its parent directory.
> >> + */
> >> +int
> >> +link_safe(const char *oldfile, const char *newfile)
> >> +{
> >
> > If we go for a new abstraction here, I'd rather make it
> > 'replace_file_safe' or something, and move the link/rename code #ifdef
> > into it.
> 
> Hm. OK. I don't see any reason why switching to link() even in code
> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
> be a problem thinking about it. Should HAVE_WORKING_LINK be available
> on a platform we can combine it with unlink. Is that in line with what
> you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

> >> +     if (link(oldfile, newfile) < 0)
> >> +             return -1;
> >> +
> >> +     /*
> >> +      * Make the link persistent in case of an OS crash, the new entry
> >> +      * generated as well as its parent directory need to be flushed.
> >> +      */
> >> +     fsync_fname(newfile, false);
> >> +
> >> +     /*
> >> +      * Same for parent directory. This routine is never called to rename
> >> +      * files across directories, but if this proves to become the case,
> >> +      * flushing the parent directory if the old file would be necessary.
> >> +      */
> >> +     fsync_parent_path(newfile);
> >> +     return 0;
> >
> > I think it's a seriously bad idea to encode that knowledge in such a
> > general sounding routine.  We could however argue that this is about
> > safely replacing the *target* file; not about safely removing the old
> > file.
> 
> Not sure I am following here. Are you referring to the fact that if
> the new file and old file are on different directories would make this
> routine unreliable?

Yes.


> Because yes that's the case if we want to make both of them
> persistent, and I think we want to do so.

That's one way.


> Do you suggest to correct this comment to remove the mention to the
> old file's parent directory because we just care about having the new
> file as being persistent?

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.


Regards,

Andres

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

04 March 2016, 22:37:54

On 2016-03-05 07:29:35 +0900, Michael Paquier wrote:
> OK. I could produce that by tonight my time, not before unfortunately.

I'm switching to this patch, after pushing the pending logical decoding
fixes. Probably not today, but tomorrow PST afternoon should work.

> And FWIW, per the comments of Andres, it is not clear to me what we
> gain by having a common routine for link() and rename() knowing that
> half the code paths performing a rename do not rely on link().

I'm not talking about replacing all renames with this. Just the ones
that currently use link(). There's not much point in introducing
link_safe(), when all the callers have the same duplicated code, with a
fallback to rename().

Andres

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 March 2016, 22:43:05

On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:
>> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
>> > I don't think we want any stat()s here. I'd much, much rather check open
>> > for ENOENT.
>>
>> OK. So you mean more or less that, right?
>> int fd;
>> fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
>> if (fd < 0)
>> {
>>     if (errno != ENOENT)
>>         return -1;
>> }
>> else
>> {
>>     pg_fsync(fd);
>>     CloseTransientFile(fd);
>> }
>
> Yes. Otherwise the check is racy: The file could be gone by the time you
> do the fsync; leading to a spurious ERROR (which often would get
> promoted to a PANIC).

Yeah, that makes sense.

>> >> +/*
>> >> + * link_safe -- make a file hard link, making it on-disk persistent
>> >> + *
>> >> + * This routine ensures that a hard link created on a file persists on system
>> >> + * in case of a crash by using fsync where on the link generated as well as on
>> >> + * its parent directory.
>> >> + */
>> >> +int
>> >> +link_safe(const char *oldfile, const char *newfile)
>> >> +{
>> >
>> > If we go for a new abstraction here, I'd rather make it
>> > 'replace_file_safe' or something, and move the link/rename code #ifdef
>> > into it.
>>
>> Hm. OK. I don't see any reason why switching to link() even in code
>> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
>> be a problem thinking about it. Should HAVE_WORKING_LINK be available
>> on a platform we can combine it with unlink. Is that in line with what
>> you think?
>
> I wasn't trying to suggest we should replace all rename codepaths with
> the link wrapper, just the ones that already have a HAVE_WORKING_LINK
> check. The name of the routine I suggested is bad though...

So we'd introduce a first routine rename_or_link_safe(), say replace_safe().

>> Do you suggest to correct this comment to remove the mention to the
>> old file's parent directory because we just care about having the new
>> file as being persistent?
>
> That's one approach, yes. Combined with the fact that you can't actually
> reliably rename across directories, the two could be on different
> filesystems after all, that'd be a suitable defense. It just needs to be
> properly documented in the function header, not at the bottom.

OK. Got it. Or the two could be on the same filesystem. Still, link()
and rename() do not support doing their stuff on different filesystems
(EXDEV).
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

04 March 2016, 22:46:46

On Sat, Mar 5, 2016 at 7:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-05 07:29:35 +0900, Michael Paquier wrote:
>> OK. I could produce that by tonight my time, not before unfortunately.
>
> I'm switching to this patch, after pushing the pending logical decoding
> fixes. Probably not today, but tomorrow PST afternoon should work.

OK, so if that's the case there is not need to step on your toes seen from here.

>> And FWIW, per the comments of Andres, it is not clear to me what we
>> gain by having a common routine for link() and rename() knowing that
>> half the code paths performing a rename do not rely on link().
>
> I'm not talking about replacing all renames with this. Just the ones
> that currently use link(). There's not much point in introducing
> link_safe(), when all the callers have the same duplicated code, with a
> fallback to rename().

Indeed, that's the case. I don't have a better name than replace_safe
though. replace_paranoid is not a very appealing name either.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

04 March 2016, 22:47:52

On 2016-03-05 07:43:00 +0900, Michael Paquier wrote:
> On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:
> >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
> >> Hm. OK. I don't see any reason why switching to link() even in code
> >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
> >> be a problem thinking about it. Should HAVE_WORKING_LINK be available
> >> on a platform we can combine it with unlink. Is that in line with what
> >> you think?
> >
> > I wasn't trying to suggest we should replace all rename codepaths with
> > the link wrapper, just the ones that already have a HAVE_WORKING_LINK
> > check. The name of the routine I suggested is bad though...
>
> So we'd introduce a first routine rename_or_link_safe(), say replace_safe().

Or actually maybe just link_safe(), which falls back to access() &&
rename() if !HAVE_WORKING_LINK.

> > That's one approach, yes. Combined with the fact that you can't actually
> > reliably rename across directories, the two could be on different
> > filesystems after all, that'd be a suitable defense. It just needs to be
> > properly documented in the function header, not at the bottom.
>
> OK. Got it. Or the two could be on the same filesystem.

> Still, link() and rename() do not support doing their stuff on
> different filesystems (EXDEV).

That's my point ...

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

05 March 2016, 13:25:43

On Sat, Mar 5, 2016 at 7:47 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-05 07:43:00 +0900, Michael Paquier wrote:
>> On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:
>> >> On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
>> >> Hm. OK. I don't see any reason why switching to link() even in code
>> >> paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
>> >> be a problem thinking about it. Should HAVE_WORKING_LINK be available
>> >> on a platform we can combine it with unlink. Is that in line with what
>> >> you think?
>> >
>> > I wasn't trying to suggest we should replace all rename codepaths with
>> > the link wrapper, just the ones that already have a HAVE_WORKING_LINK
>> > check. The name of the routine I suggested is bad though...
>>
>> So we'd introduce a first routine rename_or_link_safe(), say replace_safe().
>
> Or actually maybe just link_safe(), which falls back to access() &&
> rename() if !HAVE_WORKING_LINK.
>
>> > That's one approach, yes. Combined with the fact that you can't actually
>> > reliably rename across directories, the two could be on different
>> > filesystems after all, that'd be a suitable defense. It just needs to be
>> > properly documented in the function header, not at the bottom.
>>
>> OK. Got it. Or the two could be on the same filesystem.
>
>> Still, link() and rename() do not support doing their stuff on
>> different filesystems (EXDEV).
>
> That's my point ...

OK, I hacked a v7:
- Move the link()/rename() group with HAVE_WORKING_LINK into a single
routine, making the previous link_safe renamed to replace_safe. This
is sharing a lot of things with rename_safe. I am not sure it is worth
complicating the code more this way by having a common single routine
for whole. Thoughts welcome. Honestly, I kind of liked the separation
with link_safe/rename_safe of previous patches because link_safe could
have been directly used by extensions and plugins btw.
- Remove the call of stat() in rename_safe() and implement a logic
depending on OpenTransientFile()/pg_fsync() to flush any existing
target file before performing the rename.
Andres, feel free to use this patch as a base, perhaps that will help.
--
Michael

Attachment

xlog-fsync-v7.patch

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

06 March 2016, 03:54:17

On 2016-03-05 22:25:36 +0900, Michael Paquier wrote:
> OK, I hacked a v7:
> - Move the link()/rename() group with HAVE_WORKING_LINK into a single
> routine, making the previous link_safe renamed to replace_safe. This
> is sharing a lot of things with rename_safe. I am not sure it is worth
> complicating the code more this way by having a common single routine
> for whole. Thoughts welcome. Honestly, I kind of liked the separation
> with link_safe/rename_safe of previous patches because link_safe could
> have been directly used by extensions and plugins btw.
> - Remove the call of stat() in rename_safe() and implement a logic
> depending on OpenTransientFile()/pg_fsync() to flush any existing
> target file before performing the rename.
> Andres, feel free to use this patch as a base, perhaps that will help.

I started working on this; delayed by taking longer than planned on the
logical decoding stuff (quite a bit complicated by
e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8).  I'm not very happy with the
error handling as it is right now.  For one, you have rename_safe return
error codes, and act on them in the callers, but on the other hand you
call fsync_fname which always errors out in case of failure.  I also
don't like the new messages much.

Will continue working on this tomorrow.

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

07 March 2016, 06:38:47

Hi,

On 2016-03-05 19:54:05 -0800, Andres Freund wrote:
> I started working on this; delayed by taking longer than planned on the
> logical decoding stuff (quite a bit complicated by
> e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8).  I'm not very happy with the
> error handling as it is right now.  For one, you have rename_safe return
> error codes, and act on them in the callers, but on the other hand you
> call fsync_fname which always errors out in case of failure.  I also
> don't like the new messages much.
>
> Will continue working on this tomorrow.

So, here's my current version of this. I've not performed any testing
yet, and it's hot of the press. There's some comment smithing
needed. But otherwise I'm starting to like this.

Changes:
* renamed rename_safe to durable_rename
* renamed replace_safe to durable_link_or_rename (there was never any
  replacing going on)
* pass through elevel to the underlying routines, otherwise we could
  error out with ERROR when we don't want to. That's particularly
  important in case of things like InstallXLogFileSegment().
* made fsync_fname use fsync_fname_ext, add 'accept permission errors'
  param
* have walkdir call a wrapper, to add ignore_perms param

What do you think?

I sure wish we had the recovery testing et al. in all the back
branches...

- Andres

Attachment

0001-durable-rename-andres-v8.patch

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

08 March 2016, 03:01:26

On Mon, Mar 7, 2016 at 3:38 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-05 19:54:05 -0800, Andres Freund wrote:
>> I started working on this; delayed by taking longer than planned on the
>> logical decoding stuff (quite a bit complicated by
>> e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8).  I'm not very happy with the
>> error handling as it is right now.  For one, you have rename_safe return
>> error codes, and act on them in the callers, but on the other hand you
>> call fsync_fname which always errors out in case of failure.  I also
>> don't like the new messages much.
>>
>> Will continue working on this tomorrow.
>
> So, here's my current version of this. I've not performed any testing
> yet, and it's hot of the press. There's some comment smithing
> needed. But otherwise I'm starting to like this.
>
> Changes:
> * renamed rename_safe to durable_rename
> * renamed replace_safe to durable_link_or_rename (there was never any
>   replacing going on)
> * pass through elevel to the underlying routines, otherwise we could
>   error out with ERROR when we don't want to. That's particularly
>   important in case of things like InstallXLogFileSegment().
> * made fsync_fname use fsync_fname_ext, add 'accept permission errors'
>   param
> * have walkdir call a wrapper, to add ignore_perms param
>
> What do you think?

I have spent a couple of hours looking at that in details, and the
patch is neat.

+ * This routine ensures that, after returning, the effect of renaming file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave you with either the old, or the new file.
"you" is not really Postgres-like, "the server" or "the backend" perhaps?

+       /* XXX: perform close() before? might be outside a
transaction. Consider errno! */       ereport(elevel,               (errcode_for_file_access(),
errmsg("couldnot fsync file \"%s\": %m", fname)));
 
+       (void) CloseTransientFile(fd);
+       return -1;
close() should be called before. slot.c for example does so and we
don't want to link an fd here in case of elevel >= ERROR.

+ * It does so by using fsync on the sourcefile before the rename, and the
+ * target file and directory after.
fsync is issued as well on the target file if it exists. I think
that's worth mentioning in the header.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)
I am not sure we should worry about that, what do you think could
cause the old file from going missing all of a sudden. Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

Instead of "durable" I think that "persistent" makes more sense. We
want to make those renames persistent on disk on case of a crash. So I
would suggest the following routine names:
- rename_persistent
- rename_or_link_persistent
Having the verb first also helps identifying that this is a
system-level, rename()-like, routine.

> I sure wish we had the recovery testing et al. in all the back
> branches...

Sure, what we have now is something that should really be backpatched,
I was just waiting to have all the existing stability issues
addressed, the last one on my agenda being the failure of hamster for
test 005 I mentioned in another thread before sending patches for
other branches. I proposed a couple of potential regarding that
actually, see here:
http://www.postgresql.org/message-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

08 March 2016, 03:18:19

Hi,

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:
> I have spent a couple of hours looking at that in details, and the
> patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

> + * This routine ensures that, after returning, the effect of renaming file
> + * persists in case of a crash. A crash while this routine is running will
> + * leave you with either the old, or the new file.

> "you" is not really Postgres-like, "the server" or "the backend" perhaps?

Hm. I think your alternative proposals are more awkward.

> +       /* XXX: perform close() before? might be outside a
> transaction. Consider errno! */
>         ereport(elevel,
>                 (errcode_for_file_access(),
>                  errmsg("could not fsync file \"%s\": %m", fname)));
> +       (void) CloseTransientFile(fd);
> +       return -1;
> close() should be called before. slot.c for example does so and we
> don't want to link an fd here in case of elevel >= ERROR.

Note that the transient file machinery will normally prevent fd leakage
- but it can only do so if called in a transaction context. I've added    int            save_errno;
    /* close file upon error, might not be in transaction context */    save_errno = errno;    CloseTransientFile(fd);
 errno = save_errno;
 
stanzas.

> + * It does so by using fsync on the sourcefile before the rename, and the
> + * target file and directory after.

> fsync is issued as well on the target file if it exists. I think
> that's worth mentioning in the header.

Ok.


> +   /* XXX: Add racy file existence check? */
> +   if (rename(oldfile, newfile) < 0)

> I am not sure we should worry about that, what do you think could
> cause the old file from going missing all of a sudden. Other backend
> processes are not playing with it in the code paths where this routine
> is called. Perhaps adding a comment in the header to let users know
> that would help?

What I'm thinking of is adding a check whether the *target* file already
exists, and error out in that case. Just like the link() based path
normally does.


> Instead of "durable" I think that "persistent" makes more sense.

I find durable a lot more descriptive. persistent could refer to
retrying the rename or something.

> We
> want to make those renames persistent on disk on case of a crash. So I
> would suggest the following routine names:
> - rename_persistent
> - rename_or_link_persistent
> Having the verb first also helps identifying that this is a
> system-level, rename()-like, routine.

I prefer the current names.

> > I sure wish we had the recovery testing et al. in all the back
> > branches...
> 
> Sure, what we have now is something that should really be backpatched,
> I was just waiting to have all the existing stability issues
> addressed, the last one on my agenda being the failure of hamster for
> test 005 I mentioned in another thread before sending patches for
> other branches. I proposed a couple of potential regarding that
> actually, see here:
> http://www.postgresql.org/message-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com

Yea. Will be an interesting discussion... Anyway, I did run the patch
through the existing checks, after enabling fsync in PostgresNode.pm.

Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

08 March 2016, 03:26:38

On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:
>> I have spent a couple of hours looking at that in details, and the
>> patch is neat.
>
> Cool. Doing some more polishing right now. Will be back with an updated
> version soonish.
>
> Did you do some testing?

Not much in details yet, I just ran a check-world with fsync enabled
for the recovery tests, plus quick manual tests with a cluster
manually set up. I'll do more with your new version now that I know
there will be one.

>> +   /* XXX: Add racy file existence check? */
>> +   if (rename(oldfile, newfile) < 0)
>
>> I am not sure we should worry about that, what do you think could
>> cause the old file from going missing all of a sudden. Other backend
>> processes are not playing with it in the code paths where this routine
>> is called. Perhaps adding a comment in the header to let users know
>> that would help?
>
> What I'm thinking of is adding a check whether the *target* file already
> exists, and error out in that case. Just like the link() based path
> normally does.

Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

08 March 2016, 05:56:01

On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:
> On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:
> >> I have spent a couple of hours looking at that in details, and the
> >> patch is neat.
> >
> > Cool. Doing some more polishing right now. Will be back with an updated
> > version soonish.
> >
> > Did you do some testing?
>
> Not much in details yet, I just ran a check-world with fsync enabled
> for the recovery tests, plus quick manual tests with a cluster
> manually set up. I'll do more with your new version now that I know
> there will be one.

Here's my updated version.

Note that I've split the patch into two. One for the infrastructure, and
one for the callsites.

> >> +   /* XXX: Add racy file existence check? */
> >> +   if (rename(oldfile, newfile) < 0)
> >
> >> I am not sure we should worry about that, what do you think could
> >> cause the old file from going missing all of a sudden. Other backend
> >> processes are not playing with it in the code paths where this routine
> >> is called. Perhaps adding a comment in the header to let users know
> >> that would help?
> >
> > What I'm thinking of is adding a check whether the *target* file already
> > exists, and error out in that case. Just like the link() based path
> > normally does.
>
> Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.

I think it should definitely be an error if anything. But I'd rather
only add it in master...

Andres

Attachment

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

08 March 2016, 07:21:50

On Tue, Mar 8, 2016 at 2:55 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:
>> On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:
>> >> I have spent a couple of hours looking at that in details, and the
>> >> patch is neat.
>> >
>> > Cool. Doing some more polishing right now. Will be back with an updated
>> > version soonish.
>> >
>> > Did you do some testing?
>>
>> Not much in details yet, I just ran a check-world with fsync enabled
>> for the recovery tests, plus quick manual tests with a cluster
>> manually set up. I'll do more with your new version now that I know
>> there will be one.
>
> Here's my updated version.
>
> Note that I've split the patch into two. One for the infrastructure, and
> one for the callsites.

Thanks for the updated patches and the split, this makes things easier
to look at. I have been doing some testing as well mainly manually
using with pgbench and nothing looks broken.

+   durable_link_or_rename(tmppath, path, ERROR);
+   durable_rename(path, xlogfpath, ERROR);
You may want to add a (void) cast in front of those calls for correctness.

-       ereport(LOG,
-               (errcode_for_file_access(),
-                errmsg("could not link file \"%s\" to \"%s\"
(initialization of log file): %m",
-                       tmppath, path)));
We lose a portion of the error message here, but with the file name
that's easy to guess where that is happening. I am not complaining
(that's fine to me as-is), just mentioning for the archive's sake.

>> >> +   /* XXX: Add racy file existence check? */
>> >> +   if (rename(oldfile, newfile) < 0)
>> >
>> >> I am not sure we should worry about that, what do you think could
>> >> cause the old file from going missing all of a sudden. Other backend
>> >> processes are not playing with it in the code paths where this routine
>> >> is called. Perhaps adding a comment in the header to let users know
>> >> that would help?
>> >
>> > What I'm thinking of is adding a check whether the *target* file already
>> > exists, and error out in that case. Just like the link() based path
>> > normally does.
>>
>> Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.
>
> I think it should definitely be an error if anything. But I'd rather
> only add it in master...

I guess I know why :) That's also why I was thinking about an assertion.
-- 
Michael

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

08 March 2016, 07:25:00

Hi,

On 2016-03-08 16:21:45 +0900, Michael Paquier wrote:
> +   durable_link_or_rename(tmppath, path, ERROR);
> +   durable_rename(path, xlogfpath, ERROR);

> You may want to add a (void) cast in front of those calls for correctness.

"correctness"?  This is neatnikism, not correctness. I've actually added
(void)'s to the sites that return on error (i.e. pass LOG or something),
but not the ones where we pass ERROR.

> -       ereport(LOG,
> -               (errcode_for_file_access(),
> -                errmsg("could not link file \"%s\" to \"%s\"
> (initialization of log file): %m",
> -                       tmppath, path)));
> We lose a portion of the error message here, but with the file name
> that's easy to guess where that is happening. I am not complaining
> (that's fine to me as-is), just mentioning for the archive's sake.

Yea, I think that's fine too.


- Andres

Re: silent data loss with ext4 / all current versions

From

Robert Haas

Date:

08 March 2016, 22:16:40

On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote:
>> Instead of "durable" I think that "persistent" makes more sense.
>
> I find durable a lot more descriptive. persistent could refer to
> retrying the rename or something.

Yeah, I like durable, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: silent data loss with ext4 / all current versions

From

Tomas Vondra

Date:

08 March 2016, 22:47:57

Hi,

On Mon, 2016-03-07 at 21:55 -0800, Andres Freund wrote:
> On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:
> > On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:
> > >> I have spent a couple of hours looking at that in details, and the
> > >> patch is neat.
> > >
> > > Cool. Doing some more polishing right now. Will be back with an updated
> > > version soonish.
> > >
> > > Did you do some testing?
> > 
> > Not much in details yet, I just ran a check-world with fsync enabled
> > for the recovery tests, plus quick manual tests with a cluster
> > manually set up. I'll do more with your new version now that I know
> > there will be one.
> 
> Here's my updated version.
> 
> Note that I've split the patch into two. One for the infrastructure, and
> one for the callsites.

I've repeated the power-loss testing today. With the patches applied I'm
not longer able to reproduce the issue (despite trying about 10x), while
without them I've hit it on the first try. This is on kernel 4.4.2.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

08 March 2016, 22:54:24

On 2016-03-08 23:47:48 +0100, Tomas Vondra wrote:
> I've repeated the power-loss testing today. With the patches applied I'm
> not longer able to reproduce the issue (despite trying about 10x), while
> without them I've hit it on the first try. This is on kernel 4.4.2.

Yay, thanks for testing!

Andres

Re: silent data loss with ext4 / all current versions

From

"Joshua D. Drake"

Date:

08 March 2016, 23:34:42

On 03/08/2016 02:16 PM, Robert Haas wrote:
> On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Instead of "durable" I think that "persistent" makes more sense.
>>
>> I find durable a lot more descriptive. persistent could refer to
>> retrying the rename or something.
>
> Yeah, I like durable, too.

There is also precedent, DURABLE as in aciD

JD



-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

10 March 2016, 03:25:49

On 2016-03-07 21:55:52 -0800, Andres Freund wrote:
> Here's my updated version.
> 
> Note that I've split the patch into two. One for the infrastructure, and
> one for the callsites.

I've finally pushed these, after making a number of mostly cosmetic
fixes. The only of real consequence is that I've removed the durable_*
call from the renames to .deleted in xlog[archive].c - these don't need
to be durable, and are windows only. Oh, and that there was a typo in
the !HAVE_WORKING_LINK case.

There's a *lot* of version skew here: not-present functionality, moved
files, different APIs - we got it all.  I've tried to check in each
version whether we're missing fsyncs for renames and everything.
Michael, *please* double check the diffs for the different branches.

Note that we currently have some frontend programs with the equivalent
problem. Most importantly receivelog.c (pg_basebackup/pg_recveivexlog)
are missing pretty much the same directory fsyncs.  And at least for
pg_recvxlog it's critical, especially now that receivexlog support
syncrep.  I've not done anything about that; there's pretty much no
chance to share backend code with the frontend in the back-branches.

Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

15 March 2016, 14:40:01

On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote:
> I've finally pushed these, after making a number of mostly cosmetic
> fixes. The only of real consequence is that I've removed the durable_*
> call from the renames to .deleted in xlog[archive].c - these don't need
> to be durable, and are windows only. Oh, and that there was a typo in
> the !HAVE_WORKING_LINK case.
>
> There's a *lot* of version skew here: not-present functionality, moved
> files, different APIs - we got it all.  I've tried to check in each
> version whether we're missing fsyncs for renames and everything.
> Michael, *please* double check the diffs for the different branches.

I have finally been able to spend some time reviewing what you pushed
on back-branches, and things are in correct shape I think. One small
issue that I have is that for EXEC_BACKEND builds, in
write_nondefault_variables we still use one instance of rename(). I
cannot really believe that there are production builds of Postgres
with EXEC_BACKEND on non-Windows platforms, but I think that we had
better cover our backs in this code path. For the other extra 2 calls
of rename() in xlog.c and xlogarchive.c, those are fine untouched I
think there is no need to care about WIN32 blocks...

> Note that we currently have some frontend programs with the equivalent
> problem. Most importantly receivelog.c (pg_basebackup/pg_recveivexlog)
> are missing pretty much the same directory fsyncs.  And at least for
> pg_recvxlog it's critical, especially now that receivexlog support
> syncrep.  I've not done anything about that; there's pretty much no
> chance to share backend code with the frontend in the back-branches.

Yeah, true. We definitely need to do something for that, even for HEAD
it seems like an overkill to have something in for example src/common
to allow frontends to have something if the fix is localized
(pg_rewind may use something else), and it would be nice to finish
wrapping that for the next minor release, so I propose the attached
patches. At the same time, I think that adminpack had better be fixed
as well, so there are actually three patches in this series, things
that I shaped thinking about a backpatch btw, particularly for 0002.
--
Michael

Attachment

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

15 March 2016, 17:46:27

On 2016-03-15 15:39:50 +0100, Michael Paquier wrote:
> I have finally been able to spend some time reviewing what you pushed
> on back-branches, and things are in correct shape I think. One small
> issue that I have is that for EXEC_BACKEND builds, in
> write_nondefault_variables we still use one instance of rename().

Correctly so afaics, because write_nondefault_variables is by definition
non-durable. We write that stuff at every start / SIGHUP. Adding an
fsync there would be an unnecessary slowdown.  I don't think it's good
policy adding fsync for stuff that definitely doesn't need it.


> Yeah, true. We definitely need to do something for that, even for HEAD
> it seems like an overkill to have something in for example src/common
> to allow frontends to have something if the fix is localized
> (pg_rewind may use something else), and it would be nice to finish
> wrapping that for the next minor release, so I propose the attached
> patches. At the same time, I think that adminpack had better be fixed
> as well, so there are actually three patches in this series, things
> that I shaped thinking about a backpatch btw, particularly for 0002.

I'm doubtful about "fixing" adminpack. We don't know how it's used, and
this could *seriously* increase its overhead for something potentially
used at a high rate.


>  /*
> + * Wrapper of rename() similar to the backend version with the same function
> + * name aimed at making the renaming durable on disk. Note that this version
> + * does not fsync the old file before the rename as all the code paths leading
> + * to this function are already doing this operation. The new file is also
> + * normally not present on disk before the renaming so there is no need to
> + * bother about it.
> + */
> +static int
> +durable_rename(const char *oldfile, const char *newfile)
> +{
> +    int        fd;
> +    char    parentpath[MAXPGPATH];
> +
> +    if (rename(oldfile, newfile) != 0)
> +    {
> +        /* durable_rename produced a log entry */
> +        fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
> +                progname, current_walfile_name, strerror(errno));
> +        return -1;
> +    }
> +
> +    /*
> +     * To guarantee renaming of the file is persistent, fsync the file with its
> +     * new name, and its containing directory.
> +     */
> +    fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);

Why is S_IRUSR | S_IWUSR specified here?


Are you working on a fix for pg_rewind? Let's go with initdb -S in a
first iteration, then we can, if somebody is interest enough, work on
making this nicer in master.

Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

David Steele

Date:

15 March 2016, 17:50:15

On 3/15/16 10:39 AM, Michael Paquier wrote:

> On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote:
> 
>> Note that we currently have some frontend programs with the equivalent
>> problem. Most importantly receivelog.c (pg_basebackup/pg_recveivexlog)
>> are missing pretty much the same directory fsyncs.  And at least for
>> pg_recvxlog it's critical, especially now that receivexlog support
>> syncrep.  I've not done anything about that; there's pretty much no
>> chance to share backend code with the frontend in the back-branches.
> 
> Yeah, true. We definitely need to do something for that, even for HEAD
> it seems like an overkill to have something in for example src/common
> to allow frontends to have something if the fix is localized
> (pg_rewind may use something else), and it would be nice to finish
> wrapping that for the next minor release, so I propose the attached
> patches.

I noticed this when reviewing the pg_receive_xlog refactor and was going
to submit a patch after the CF.  It didn't occur to me to piggyback on
this work but I think it makes sense.

+1 from me for fixing this in pg_receivexlog and back-patching.

-- 
-David
david@pgmasters.net

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

17 March 2016, 14:05:52

On Wed, Mar 16, 2016 at 2:46 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-15 15:39:50 +0100, Michael Paquier wrote:
>> Yeah, true. We definitely need to do something for that, even for HEAD
>> it seems like an overkill to have something in for example src/common
>> to allow frontends to have something if the fix is localized
>> (pg_rewind may use something else), and it would be nice to finish
>> wrapping that for the next minor release, so I propose the attached
>> patches. At the same time, I think that adminpack had better be fixed
>> as well, so there are actually three patches in this series, things
>> that I shaped thinking about a backpatch btw, particularly for 0002.
>
> I'm doubtful about "fixing" adminpack. We don't know how it's used, and
> this could *seriously* increase its overhead for something potentially
> used at a high rate.

I think that Dave or Guillaume added here in CC could bring some light
on the matter. Let's see if that's a problem for them. I would tend to
think that it is not that critical, still I would imagine that this
function is not called at a high frequency.

>>  /*
>> + * Wrapper of rename() similar to the backend version with the same function
>> + * name aimed at making the renaming durable on disk. Note that this version
>> + * does not fsync the old file before the rename as all the code paths leading
>> + * to this function are already doing this operation. The new file is also
>> + * normally not present on disk before the renaming so there is no need to
>> + * bother about it.
>> + */
>> +static int
>> +durable_rename(const char *oldfile, const char *newfile)
>> +{
>> +     int             fd;
>> +     char    parentpath[MAXPGPATH];
>> +
>> +     if (rename(oldfile, newfile) != 0)
>> +     {
>> +             /* durable_rename produced a log entry */
>> +             fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
>> +                             progname, current_walfile_name, strerror(errno));
>> +             return -1;
>> +     }
>> +
>> +     /*
>> +      * To guarantee renaming of the file is persistent, fsync the file with its
>> +      * new name, and its containing directory.
>> +      */
>> +     fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
>
> Why is S_IRUSR | S_IWUSR specified here?

Oops. I have removed it and updated that as attached.

> Are you working on a fix for pg_rewind? Let's go with initdb -S in a
> first iteration, then we can, if somebody is interest enough, work on
> making this nicer in master.

I am really -1 for this approach. Wrapping initdb -S with
find_other_exec is intrusive in back-branches knowing that all the I/O
write operations manipulating file descriptors go through file_ops.c,
and we actually just need to fsync the target file in
close_target_file(), making the fix being a 7-line patch, and there is
no need to depend on another binary at all. The backup label file, as
well as the control file are using the same code path in file_ops.c...
And I like simple things :)

At the same time, I found a legit bug when the modified backup_label
file is created in createBackupLabel: the file is opened, written, but
not closed with close_target_file(), and it should be.
--
Michael

Attachment

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

17 March 2016, 15:04:51

On 2016-03-17 23:05:42 +0900, Michael Paquier wrote:
> > Are you working on a fix for pg_rewind? Let's go with initdb -S in a
> > first iteration, then we can, if somebody is interest enough, work on
> > making this nicer in master.
> 
> I am really -1 for this approach. Wrapping initdb -S with
> find_other_exec is intrusive in back-branches knowing that all the I/O
> write operations manipulating file descriptors go through file_ops.c,
> and we actually just need to fsync the target file in
> close_target_file(), making the fix being a 7-line patch, and there is
> no need to depend on another binary at all. The backup label file, as
> well as the control file are using the same code path in file_ops.c...
> And I like simple things :)

This is a *much* more expensive approach though. Doing the fsync
directly after modifying the file. One file by one file. Will usually
result in each fsync blocking for a while.

In comparison of doing a flush and then an fsync pass over the whole
directory will usually only block seldomly. The flushes for all files
can be combined into very few barrier operations.

Besides that, you're not syncing the directories, despite
open_target_file() potentially creating the directory.

Greetings,

Andres Freund

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

18 March 2016, 06:08:38

On Fri, Mar 18, 2016 at 12:03 AM, Andres Freund <andres@anarazel.de> wrote:
> This is a *much* more expensive approach though. Doing the fsync
> directly after modifying the file. One file by one file. Will usually
> result in each fsync blocking for a while.
>
> In comparison of doing a flush and then an fsync pass over the whole
> directory will usually only block seldomly. The flushes for all files
> can be combined into very few barrier operations.

Hm... OK. I'd really like to keep the run of pg_rewind minimal as well
if possible. So here you go.
--
Michael

Attachment

Re: silent data loss with ext4 / all current versions

From

Robert Haas

Date:

18 March 2016, 15:21:05

On Thu, Mar 17, 2016 at 11:03 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-17 23:05:42 +0900, Michael Paquier wrote:
>> > Are you working on a fix for pg_rewind? Let's go with initdb -S in a
>> > first iteration, then we can, if somebody is interest enough, work on
>> > making this nicer in master.
>>
>> I am really -1 for this approach. Wrapping initdb -S with
>> find_other_exec is intrusive in back-branches knowing that all the I/O
>> write operations manipulating file descriptors go through file_ops.c,
>> and we actually just need to fsync the target file in
>> close_target_file(), making the fix being a 7-line patch, and there is
>> no need to depend on another binary at all. The backup label file, as
>> well as the control file are using the same code path in file_ops.c...
>> And I like simple things :)
>
> This is a *much* more expensive approach though. Doing the fsync
> directly after modifying the file. One file by one file. Will usually
> result in each fsync blocking for a while.
>
> In comparison of doing a flush and then an fsync pass over the whole
> directory will usually only block seldomly. The flushes for all files
> can be combined into very few barrier operations.

Yeah, I'm pretty sure this was discussed when initdb -S went in.  I
think reusing that is a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: silent data loss with ext4 / all current versions

From

Andres Freund

Date:

27 March 2016, 23:25:29

Hi,

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:
> +/*
> + * Sync data directory to ensure that what has been generated up to now is
> + * persistent in case of a crash, and this is done once globally for
> + * performance reasons as sync requests on individual files would be
> + * a negative impact on the running time of pg_rewind. This is invoked at
> + * the end of processing once everything has been processed and written.
> + */
> +static void
> +syncTargetDirectory(const char *argv0)
> +{
> +    int        ret;
> +    char    exec_path[MAXPGPATH];
> +    char    cmd[MAXPGPATH];
> +
> +    if (dry_run)
> +        return;

I think it makes more sense to return after detecting the binary, so
you'd find out about problems around not finding initdb during the dry
run, not later.


> +    /* Grab and invoke initdb to perform the sync */
> +    if ((ret = find_other_exec(argv0, "initdb",
> +                               "initdb (PostgreSQL) " PG_VERSION "\n",
> +                               exec_path)) < 0)
> +    {
> +        char        full_path[MAXPGPATH];
> +
> +        if (find_my_exec(argv0, full_path) < 0)
> +            strlcpy(full_path, progname, sizeof(full_path));
> +
> +        if (ret == -1)
> +            pg_fatal("The program \"initdb\" is needed by %s but was \n"
> +                     "not found in the same directory as \"%s\".\n"
> +                     "Check your installation.\n", progname, full_path);
> +        else
> +            pg_fatal("The program \"postgres\" was found by \"%s\" but was \n"
> +                     "not the same version as %s.\n"
> +                     "Check your installation.\n", progname, full_path);

Wrong binary name.


> +    }
> +
> +    /* now run initdb */
> +    if (debug)
> +        snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
> +                 exec_path, datadir_target);
> +    else
> +        snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
> +                 exec_path, datadir_target, DEVNULL);
> +
> +    if (system(cmd) != 0)
> +        pg_fatal("sync of target directory with initdb -S failed\n");
> +    pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
> +}

Don't see need for emitting "done", for now at least.


>  /*
> + * Wrapper of rename() similar to the backend version with the same function
> + * name aimed at making the renaming durable on disk. Note that this version
> + * does not fsync the old file before the rename as all the code paths leading
> + * to this function are already doing this operation. The new file is also
> + * normally not present on disk before the renaming so there is no need to
> + * bother about it.

I don't think it's a good idea to skip fsyncing the old file based on
that; it's way too likely that that'll not be done for the next user of
durable_rename.


> + */
> +static int
> +durable_rename(const char *oldfile, const char *newfile)
> +{
> +    int        fd;
> +    char    parentpath[MAXPGPATH];
> +
> +    if (rename(oldfile, newfile) != 0)
> +    {
> +        /* durable_rename produced a log entry */

Uh?


> +        fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
> +                progname, current_walfile_name, strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.


> +    if (fsync(fd) != 0)
> +    {
> +        close(fd);
> +        fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +                progname, newfile, strerror(errno));
> +        return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).


> +    fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
> +
> +    /*
> +     * Some OSs don't allow us to open directories at all (Windows returns
> +     * EACCES), just ignore the error in that case.  If desired also silently
> +     * ignoring errors about unreadable files. Log others.
> +     */

Comment is not applicable as a whole.


> +    if (fd < 0 && (errno == EISDIR || errno == EACCES))
> +        return 0;
> +    else if (fd < 0)
> +    {
> +        fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
> +                progname, parentpath, strerror(errno));
> +        return -1;
> +    }
> +
> +    if (fsync(fd) != 0)
> +    {
> +        close(fd);
> +        fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +                progname, parentpath, strerror(errno));
> +        return -1;

close() needs to be moved again.


It'd be easier to apply these if the rate of trivial issues were lower
:(.


Attached is an heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb, I think it's easier to understand
  code this way, and it'll make unifying the code easier
* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
  - otherwise the directory entry itself isn't safely persistent, as we
    don't fsync the directory in the stream->synchronous fsync() cases.
  - we refuse to resume in open_walfile(), if a file isn't 16MB when
    restarting. Without an fsync that's actually not unlikely after a
    crash. Even with an fsync that's not guaranteed not to happen, but
    the chance of it is much lower.

I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...


I've also noticed that

a) pg_basebackup doesn't do anything about durability (it probably needs
   a very similar patch to the one pg_rewind just received).
b) nor does pg_dump[all]

I think it's pretty unacceptable for backup tools to be so cavalier
about durability.


So we're going to have another round of fsync stuff in the next set of
releases anyway...


Greetings,

Andres Freund

Attachment

0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

28 March 2016, 04:49:52

On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:
>> +             fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
>> +                             progname, current_walfile_name, strerror(errno));
>
> current_walfile_name doesn't look right, that's a largely independent
> global variable.

Stupid mistake.

>> +     if (fsync(fd) != 0)
>> +     {
>> +             close(fd);
>> +             fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
>> +                             progname, newfile, strerror(errno));
>> +             return -1;
>
> Close should be after the strerror() call (yes, that mistake already
> existed once in receivelog.c).

Right.

> It'd be easier to apply these if the rate of trivial issues were lower
> :(.

Sorry about that. I'll be more careful.

> Attached is an heavily revised version of the patch. Besides the above,
> I've:
> * copied fsync_fname_ext from initdb, I think it's easier to understand
>   code this way, and it'll make unifying the code easier

OK.

> * added fsyncing of the directory in mark_file_as_archived()
> * The WAL files also need to be fsynced when created in open_walfile():
>   - otherwise the directory entry itself isn't safely persistent, as we
>     don't fsync the directory in the stream->synchronous fsync() cases.
>   - we refuse to resume in open_walfile(), if a file isn't 16MB when
>     restarting. Without an fsync that's actually not unlikely after a
>     crash. Even with an fsync that's not guaranteed not to happen, but
>     the chance of it is much lower.
> I'm too tired to push this at the eleventh hour. Besides a heavily
> revised patch, backpatching will likely include a number of conflicts.
> If somebody in the US has the energy to take care of this...

Close enough to the US. Attached are backpatchable versions based on
the corrected version you sent. 9.3 and 9.4 share the same patch, more
work has been necessary for 9.2 but that's not huge either.

> So we're going to have another round of fsync stuff in the next set of
> releases anyway...

Yes, seeing how 9.5.2 is close by, I think that it would be wiser to
push this stuff after the upcoming minor release.
--
Michael

Attachment

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

12 May 2016, 05:58:36

On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote:
> I've also noticed that

Coming back to this issue because...

> a) pg_basebackup doesn't do anything about durability (it probably needs
>    a very similar patch to the one pg_rewind just received).

I think that one of the QE tests running here just got bitten by that.
A base backup was taken with pg_basebackup and more or less after a VM
was plugged off. The trick is that for pg_basebackup we cannot rely on
initdb: pg_basebackup is a client-side utility. In most of the PG
packages (Fedora, RHEL), it is put on the client-side package, where
initdb is not. So it seems to me that the correct fix is not to use
initdb -S but to have copies of fsync_parent_path, durable_rename and
fsync_fname_ext in streamutil.c, and then we reuse them for both
pg_receivexlog and pg_basebackup. At least that's less risky for
back-branches this way.

> b) nor does pg_dump[all]

I have not hacked up that yet, but I would think that we would need
one extra copy of some of those fsync_* routines in src/bin/pg_dump/.
There is another thread for that already... On master I guess we'd end
with something centralized in src/common/, but let's close the
existing holes first.

> So we're going to have another round of fsync stuff in the next set of
> releases anyway...

The sooner the better I think. Any people caring about this problem
are now limited in using initdb -S after calling pg_basebackup or
pg_dump. That's a solution, though the flushes should be contained
inside each utility.
--
Michael

Attachment

0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch

Re: silent data loss with ext4 / all current versions

From

Michael Paquier

Date:

12 May 2016, 11:09:10

On Thu, May 12, 2016 at 2:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote:
>> I've also noticed that
>
> Coming back to this issue because...
>
>> a) pg_basebackup doesn't do anything about durability (it probably needs
>>    a very similar patch to the one pg_rewind just received).
>
> I think that one of the QE tests running here just got bitten by that.
> A base backup was taken with pg_basebackup and more or less after a VM
> was plugged off. The trick is that for pg_basebackup we cannot rely on
> initdb: pg_basebackup is a client-side utility. In most of the PG
> packages (Fedora, RHEL), it is put on the client-side package, where
> initdb is not. So it seems to me that the correct fix is not to use
> initdb -S but to have copies of fsync_parent_path, durable_rename and
> fsync_fname_ext in streamutil.c, and then we reuse them for both
> pg_receivexlog and pg_basebackup. At least that's less risky for
> back-branches this way.
>
>> b) nor does pg_dump[all]
>
> I have not hacked up that yet, but I would think that we would need
> one extra copy of some of those fsync_* routines in src/bin/pg_dump/.
> There is another thread for that already... On master I guess we'd end
> with something centralized in src/common/, but let's close the
> existing holes first.
>
>> So we're going to have another round of fsync stuff in the next set of
>> releases anyway...
>
> The sooner the better I think. Any people caring about this problem
> are now limited in using initdb -S after calling pg_basebackup or
> pg_dump. That's a solution, though the flushes should be contained
> inside each utility.

And actually this won't fly high if there is no equivalent of
walkdir() or if the fsync()'s are not applied recursively. On master
at least the refactoring had better be done cleanly first... For the
back branches, we could just have some recursive call like
fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
you think that this should be part of fe_utils or src/common/? I'd
tend to think the latter is more adapted as there is an equivalent in
the backend. On back-branches, we could just have something like
fsync_recursively that walks though the paths. An even more simple
approach would be to fsync() individually things that have been
written, but that would suck in performance.

Thoughts from others?
-- 
Michael