Re: [HACKERS] Potential data loss of 2PC files - Mailing list pgsql-hackers

From David Steele
Subject Re: [HACKERS] Potential data loss of 2PC files
Date
Msg-id 22ca5ddf-71ed-da12-ddef-a7624af92a39@pgmasters.net
Whole thread Raw
In response to Re: [HACKERS] Potential data loss of 2PC files  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] Potential data loss of 2PC files  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
On 2/13/17 12:10 AM, Michael Paquier wrote:
> On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> If that can happen, don't we have the same problem in many other places?
>>> Like, all the SLRUs? They don't fsync the directory either.
>>
>> Right, pg_commit_ts and pg_clog enter in this category.
> 
> Implemented as attached.
> 
>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>> not, then we need to fsync() the directory even if there are no files in it
>>> at the moment, because some might've been removed earlier in the checkpoint
>>> cycle.
>>
>> Hm... I am not an expert in file systems. At least on ext4 I can see
>> that unlink() is atomic, but not durable. So if an unlink() is
>> followed by a power failure, the previously unlinked file could be
>> here if the parent directory is not fsync'd.
> 
> So I have been doing more work on this patch, with the following things done:
> - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
> ensure their durability.
> - Create a durable_unlink() routine to give a way to perform a durable
> file removal.
> I am now counting 111 calls to unlink() in the backend code, and
> looking at all of them most look fine with plain unlink() if they are
> not made durable as they work on temporary files (see timeline.c for
> example), with some exceptions:
> - In pg_stop_backup, the old backup_label and tablespace_map removal
> should be durable to avoid putting the system in a wrong state after
> power loss. Other calls of unlink() are followed by durable_rename so
> they are fine if let as such.
> - Removal of old WAL segments should be durable as well. There is
> already an effort to rename them durably in case of a segment
> recycled. In case of a power loss, a file that should have been
> removed could remain in pg_xlog.
> 
> Looking around, I have bumped as well on the following bug report for
> SQlite which is in the same category of things:
>
http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
> Scary to see that in this case durability can be a problem at
> transaction commit...

This patch applies cleanly and compiles at cccbdde.

Ashutosh, do you know when you'll have a chance to review?

-- 
-David
david@pgmasters.net



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: [HACKERS] Time to up bgwriter_lru_maxpages?
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Push down more full joins in postgres_fdw