Thread: [HACKERS] Potential data loss of 2PC files

[HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
Hi all,

2PC files are created using RecreateTwoPhaseFile() in two places currently:
- at replay on a XLOG_XACT_PREPARE record.
- At checkpoint with CheckPointTwoPhase().

Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
that the 2PC files find their way into disk. But one piece is missing:
the parent directory pg_twophase is not fsync'd. At replay this is
more sensitive if there is a PREPARE record followed by a checkpoint
record. If there is a power failure after the checkpoint completes
there is a risk to lose 2PC status files here.

It seems to me that we really should have CheckPointTwoPhase() call
fsync() on pg_twophase to be sure that no files are lost here. There
is no point to do this operation in RecreateTwoPhaseFile() as if there
are many 2PC transactions to replay performance would be impacted, and
we don't care about the durability of those files until a checkpoint
moves the redo pointer. I have drafted the patch attached to address
this issue.
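
For reference, here is a minimal, self-contained POSIX illustration of
why the parent directory needs its own fsync() when durably creating a
file (illustration only, not the patch; the file name is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *path = "pg_twophase/000001A3";  /* made-up name */
        int         fd;

        /* 1. Create the file and make its *contents* durable. */
        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0 || write(fd, "2PC state\n", 10) != 10 || fsync(fd) != 0)
        {
            perror(path);
            exit(1);
        }
        close(fd);

        /*
         * 2. Make the file's *existence* durable: fsync the parent
         * directory, otherwise the directory entry may be lost after a
         * power failure even though the contents were fsync'd.
         */
        fd = open("pg_twophase", O_RDONLY);
        if (fd < 0 || fsync(fd) != 0)
        {
            perror("pg_twophase");
            exit(1);
        }
        close(fd);
        return 0;
    }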

I am adding that as well to the next CF for consideration.

Thoughts?
--
Michael


Attachment

Re: [HACKERS] Potential data loss of 2PC files

From
Robert Haas
Date:
On Wed, Dec 21, 2016 at 8:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Hi all,
>
> 2PC files are created using RecreateTwoPhaseFile() in two places currently:
> - at replay on a XLOG_XACT_PREPARE record.
> - At checkpoint with CheckPointTwoPhase().
>
> Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
> that the 2PC files find their way into disk. But one piece is missing:
> the parent directory pg_twophase is not fsync'd. At replay this is
> more sensitive if there is a PREPARE record followed by a checkpoint
> record. If there is a power failure after the checkpoint completes
> there is a risk to lose 2PC status files here.
>
> It seems to me that we really should have CheckPointTwoPhase() call
> fsync() on pg_twophase to be sure that no files are lost here. There
> is no point to do this operation in RecreateTwoPhaseFile() as if there
> are many 2PC transactions to replay performance would be impacted, and
> we don't care about the durability of those files until a checkpoint
> moves the redo pointer. I have drafted the patch attached to address
> this issue.
>
> I am adding that as well to the next CF for consideration.

It's pretty stupid that operating systems don't guarantee that, after
calling pg_fsync() on a file, the file will also be visible in the parent
directory.  There's not much of a use case for wanting to make sure that
the file will have the correct contents but not caring whether it's
there at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Potential data loss of 2PC files

From
Andres Freund
Date:

On December 22, 2016 5:50:38 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Wed, Dec 21, 2016 at 8:30 PM, Michael Paquier
><michael.paquier@gmail.com> wrote:
>> Hi all,
>>
>> 2PC files are created using RecreateTwoPhaseFile() in two places
>currently:
>> - at replay on a XLOG_XACT_PREPARE record.
>> - At checkpoint with CheckPointTwoPhase().
>>
>> Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
>> that the 2PC files find their way into disk. But one piece is
>missing:
>> the parent directory pg_twophase is not fsync'd. At replay this is
>> more sensitive if there is a PREPARE record followed by a checkpoint
>> record. If there is a power failure after the checkpoint completes
>> there is a risk to lose 2PC status files here.
>>
>> It seems to me that we really should have CheckPointTwoPhase() call
>> fsync() on pg_twophase to be sure that no files are lost here. There
>> is no point to do this operation in RecreateTwoPhaseFile() as if
>there
>> are many 2PC transactions to replay performance would be impacted,
>and
>> we don't care about the durability of those files until a checkpoint
>> moves the redo pointer. I have drafted the patch attached to address
>> this issue.
>>
>> I am adding that as well to the next CF for consideration.
>
>It's pretty stupid that operating systems don't guarantee that, after
>calling pg_fsync() on a file, the file will also be visible in the parent
>directory.  There's not much of a use case for wanting to make sure that
>the file will have the correct contents but not caring whether it's
>there at all.

It makes more sense if you mentally separate between filename(s) and file contents.  Having to do filesystem metadata
transactions for an fsync intended to sync contents would be annoying...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] Potential data loss of 2PC files

From
Robert Haas
Date:
On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> It makes more sense if you mentally separate between filename(s) and file contents.  Having to do filesystem metadata
> transactions for an fsync intended to sync contents would be annoying...

I thought that's why there's fdatasync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Potential data loss of 2PC files

From
Andres Freund
Date:

On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de>
>wrote:
>> It makes more sense if you mentally separate between filename(s) and
>> file contents.  Having to do filesystem metadata transactions for an
>> fsync intended to sync contents would be annoying...
>
>I thought that's why there's fdatasync.

Not quite IIRC: that doesn't deal with file size increase.  All this would be easier if hardlinks wouldn't exist IIUC.
It's basically a question whether dentry, inode or contents need to be synced.   Yes, it sucks.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] Potential data loss of 2PC files

From
Jim Nasby
Date:
On 12/22/16 12:02 PM, Andres Freund wrote:
>
> On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de>
>> wrote:
>>> It makes more sense if you mentally separate between filename(s) and
>>> file contents.  Having to do filesystem metadata transactions for an
>>> fsync intended to sync contents would be annoying...
>>
>> I thought that's why there's fdatasync.
> Not quite IIRC: that doesn't deal with file size increase.  All this would be easier if hardlinks wouldn't exist
> IIUC. It's basically a question whether dentry, inode or contents need to be synced.   Yes, it sucks.

IIRC this isn't the first time we've run into this problem... should 
pg_fsync() automatically fsync the directory as well? I guess we'd need 
a flag to disable that for performance critical areas where we know we 
don't need it (presumably just certain WAL fsyncs).
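
(For illustration, a rough sketch of what such an interface could look
like.  This is hypothetical: neither the function nor the flag exists,
and the real pg_fsync() takes only a file descriptor; fsync_fname() and
get_parent_directory() are existing backend/port routines.)

    /* Hypothetical sketch only; the name and the flag are invented. */
    #define PG_FSYNC_SKIP_PARENT    0x01    /* e.g. for WAL segment fsyncs */

    static int
    pg_fsync_with_parent(int fd, const char *path, int flags)
    {
        if (pg_fsync(fd) != 0)
            return -1;

        if ((flags & PG_FSYNC_SKIP_PARENT) == 0)
        {
            char        parent[MAXPGPATH];

            /* fsync the containing directory so the entry itself is durable */
            strlcpy(parent, path, MAXPGPATH);
            get_parent_directory(parent);
            fsync_fname(parent[0] != '\0' ? parent : ".", true);
        }
        return 0;
    }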
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Dec 23, 2016 at 6:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 12/22/16 12:02 PM, Andres Freund wrote:
>>
>>
>> On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas
>> <robertmhaas@gmail.com> wrote:
>>>
>>> On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de>
>>> wrote:
>>>>
>>>> It makes more sense if you mentally separate between filename(s) and
>>>> file contents.  Having to do filesystem metadata transactions for an
>>>> fsync intended to sync contents would be annoying...
>>>
>>> I thought that's why there's fdatasync.
>>
>> Not quite IIRC: that doesn't deal with file size increase.  All this would
>> be easier if hardlinks wouldn't exist IIUC. It's basically a question
>> whether dentry, inode or contents need to be synced.   Yes, it sucks.
>
>
> IIRC this isn't the first time we've run into this problem... should
> pg_fsync() automatically fsync the directory as well? I guess we'd need a
> flag to disable that for performance critical areas where we know we don't
> need it (presumably just certain WAL fsyncs).

I am not sure that would be a good idea performance-wise. The case of the
2PC files is quite special anyway, as just doing the sync at checkpoint
phase for everything would be enough.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Dec 23, 2016 at 3:02 AM, Andres Freund <andres@anarazel.de> wrote:
> Not quite IIRC: that doesn't deal with file size increase.  All this would be easier if hardlinks wouldn't exist
> IIUC. It's basically a question whether dentry, inode or contents need to be synced.   Yes, it sucks.

I did more digging in the code... Looking at unlogged tables and the
empty() routines of access methods, isn't there a risk as well for
unlogged tables? mdimmedsync() does not fsync() the parent directory
either! It seems to me that we want to fsync() the parent folder in
this code path especially, and not just at checkpoint, as this path
assumes that it does not care about shared buffers. We could do that at
checkpoint as well, actually, by looping through all the tablespaces
and fsyncing the database directories.

Robert, as the former author of unlogged tables and Andres, as you
have done a lot of work on durability, could you share your opinion on
the matter? Of course opinions of others are welcome!
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Andres Freund
Date:
On 2016-12-27 14:09:05 +0900, Michael Paquier wrote:
> On Fri, Dec 23, 2016 at 3:02 AM, Andres Freund <andres@anarazel.de> wrote:
> > Not quite IIRC: that doesn't deal with file size increase.  All this would be easier if hardlinks wouldn't exist
> > IIUC. It's basically a question whether dentry, inode or contents need to be synced.   Yes, it sucks.
>
> I did more monitoring of the code... Looking at unlogged tables and
> empty() routines of access methods, isn't there a risk as well for
> unlogged tables? mdimmedsync() does not fsync() the parent directory
> either!

But the files aren't created there, so I don't generally see the
problem.  And the creation of the metapages etc. should be WAL logged
anyway.  So I don't think we should / need to do anything special
there.  You can argue however that we wouldn't necessarily fsync the
parent directory for the file creation, ever.  But that'd be more
smgrcreate's responsibility than anything.


> We could do that at checkpoint as well, actually, by looping through
> all the tablespaces and fsync the database directories.

That seems like a bad idea, as it'd force fsyncs on unlogged tables and
such.

Regards,

Andres



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Thu, Dec 22, 2016 at 7:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Hi all,
>
> 2PC files are created using RecreateTwoPhaseFile() in two places currently:
> - at replay on a XLOG_XACT_PREPARE record.
> - At checkpoint with CheckPointTwoPhase().
>
> Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
> that the 2PC files find their way into disk. But one piece is missing:
> the parent directory pg_twophase is not fsync'd. At replay this is
> more sensitive if there is a PREPARE record followed by a checkpoint
> record. If there is a power failure after the checkpoint completes
> there is a risk to lose 2PC status files here.
>
> It seems to me that we really should have CheckPointTwoPhase() call
> fsync() on pg_twophase to be sure that no files are lost here. There
> is no point to do this operation in RecreateTwoPhaseFile() as if there
> are many 2PC transactions to replay performance would be impacted, and
> we don't care about the durability of those files until a checkpoint
> moves the redo pointer. I have drafted the patch attached to address
> this issue.

I agree with this.
If no prepared transactions were required to be fsynced in
CheckPointTwoPhase(), do we want to still fsync the directory?
Probably not.

Maybe you want to call fsync_fname(TWOPHASE_DIR, true); if
serialized_xacts > 0.

The comment just states what has been done earlier in the function,
which is documented in the prologue as well.
    /*
     * Make sure that the content created is persistent on disk to prevent
     * data loss in case of an OS crash or a power failure. Each 2PC file
     * has been already created and flushed to disk individually by
     * RecreateTwoPhaseFile() using the list of GXACTs available for
     * normal processing as well as at recovery when replaying individually
     * each XLOG_XACT_PREPARE record.
     */
Instead, you may want to just say "If we have flushed any 2PC files,
flush the metadata in the pg_twophase directory."

Although, I thought that a simple case of creating a persistent table
which requires creating a file also would need to flush the directory.
I tried to locate the corresponding code to see if it also uses
fsync_fname(). I couldn't locate the code. I have looked at
mdcreate(), mdpreckpt(), mdpostckpt(). But may be that's not relevant
here.


>
> I am adding that as well to the next CF for consideration.
>
> Thoughts?
> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Thu, Dec 29, 2016 at 6:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I agree with this.
> If no prepared transactions were required to be fsynced
> CheckPointTwoPhase(), do we want to still fsync the directory?
> Probably not.
>
> May be you want to call fsync_fname(TWOPHASE_DIR, true); if
> serialized_xacts > 0.

Definitely true for the non-recovery code path. But not for restart
points, as there is no GXACT entry created by the redo routine of 2PC
prepare. We could have a static counter tracking how many 2PC files
have been flushed since the last restart point, but I am not
convinced that's worth the facility.

> The comment just states what has been done earlier in the function,
> which is documented in the prologue as well.
>     /*
>      * Make sure that the content created is persistent on disk to prevent
>      * data loss in case of an OS crash or a power failure. Each 2PC file
>      * has been already created and flushed to disk individually by
>      * RecreateTwoPhaseFile() using the list of GXACTs available for
>      * normal processing as well as at recovery when replaying individually
>      * each XLOG_XACT_PREPARE record.
>      */
> Instead, you may want to just say "If we have flushed any 2PC files,
> flush the metadata in the pg_twophase directory."

That's not true for recovery. So I could go for something like this:
"If any 2PC files have been flushed, do the same for the parent
directory to make this information durable on disk. On recovery, issue
the fsync() anyway; individual 2PC files have already been flushed when
replaying their respective XLOG_XACT_PREPARE records."

> Although, I thought that a simple case of creating a persistent table
> which requires creating a file also would need to flush the directory.
> I tried to locate the corresponding code to see if it also uses
> fsync_fname(). I couldn't locate the code. I have looked at
> mdcreate(), mdpreckpt(), mdpostckpt(). But may be that's not relevant
> here.

I have been playing with unlogged tables being created, followed by
checkpoints, and a VM plugged off, but I could not see those files
going away. That was on ext4; on something like ext3 things can
disappear if the fsync() on the parent directory is forgotten.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Fri, Dec 30, 2016 at 11:22 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 29, 2016 at 6:41 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I agree with this.
>> If no prepared transactions were required to be fsynced
>> CheckPointTwoPhase(), do we want to still fsync the directory?
>> Probably not.
>>
>> May be you want to call fsync_fname(TWOPHASE_DIR, true); if
>> serialized_xacts > 0.
>
> Definitely true for the non-recovery code path. But not for restart
> points as there is no GXACT entry created by the redo routine of 2PC
> prepare. We could have a static counter tracking how many 2PC files
> have been flushed since last restart point or not but I am not
> convinced if that's worth the facility.

As per the prologue of the function, it doesn't expect any 2PC files
to be written out in the function i.e. between two checkpoints. Most
of those are created and deleted between two checkpoints. Same would
be true for recovery as well. Thus in most of the cases we shouldn't
need to flush the two phase directory in this function whether during
normal operation or during the recovery. So, we should avoid flushing
repeatedly when it's not needed. I agree that serialized_xacts > 0 is
not the right condition during recovery on standby to flush the two
phase directory.

During crash recovery, 2PC files are present on the disk, which means
the two phase directory has correct record of it. This record can not
change. So, we shouldn't need to flush it again. If that's true
serialized_xacts will be 0 during recovery thus serialized_xacts > 0
condition will still hold.

On a standby however we will have to flush the two phase directory as
part of checkpoint if there were any files left behind in that
directory. We need a different condition there.

>
> That's not true for recovery. So I could go for something like that:
> "If any 2PC files have been flushed, do the same for the parent
> directory to make this information durable on disk. On recovery, issue
> the fsync() anyway, individual 2PC files have already been flushed whe
> replaying their respective XLOG_XACT_PREPARE record.
>

We need to specify recovery (log replay) on standby specifically, I guess.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Dec 30, 2016 at 5:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> As per the prologue of the function, it doesn't expect any 2PC files
> to be written out in the function i.e. between two checkpoints. Most
> of those are created and deleted between two checkpoints. Same would
> be true for recovery as well. Thus in most of the cases we shouldn't
> need to flush the two phase directory in this function whether during
> normal operation or during the recovery. So, we should avoid flushing
> repeatedly when it's not needed. I agree that serialized_xacts > 0 is
> not the right condition during recovery on standby to flush the two
> phase directory.

This is assuming that 2PC transactions are not long-lived, which is
likely true for anything doing sharding, like XC, XL or Citus (?). So
yes, that is a fair thing to expect.

> During crash recovery, 2PC files are present on the disk, which means
> the two phase directory has correct record of it. This record can not
> change. So, we shouldn't need to flush it again. If that's true
> serialized_xacts will be 0 during recovery thus serialized_xacts > 0
> condition will still hold.
>
> On a standby however we will have to flush the two phase directory as
> part of checkpoint if there were any files left behind in that
> directory. We need a different condition there.

Well, flushing the meta-data of pg_twophase is really going to be far
cheaper than the many page writes done before CheckPointTwoPhase() is
reached. There should really be a check on serialized_xacts for the
non-recovery code path, but considering how cheap that's going to be
compared to the rest of the restart point stuff it is not worth the
complexity of adding a counter, for example in shared memory with
XLogCtl (the counter gets reinitialized at each checkpoint,
incremented when replaying a 2PC prepare, decremented with a 2PC
commit). So to reduce the backpatch impact I would just do the fsync
if (serialized_xacts > 0 || RecoveryInProgress()) and call it a day.
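
In code form, the proposal boils down to something like this at the end
of CheckPointTwoPhase() (a sketch of the idea under discussion, not
necessarily the attached patch's exact code; serialized_xacts is the
existing counter of 2PC files flushed earlier in that function):

    /* Sketch only, at the tail of CheckPointTwoPhase() in twophase.c. */
    if (serialized_xacts > 0 || RecoveryInProgress())
    {
        /*
         * Each 2PC file was already fsync'd by RecreateTwoPhaseFile(),
         * but the pg_twophase directory itself was not, so its entries
         * could still be lost on a power failure.  During recovery
         * serialized_xacts stays at 0, hence the unconditional fsync;
         * it is cheap compared to the rest of the checkpoint anyway.
         */
        fsync_fname(TWOPHASE_DIR, true);
    }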
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
>
> Well, flushing the meta-data of pg_twophase is really going to be far
> cheaper than the many pages done until CheckpointTwoPhase is reached.
> There should really be a check on serialized_xacts for the
> non-recovery code path, but considering how cheap that's going to be
> compared to the rest of the restart point stuff it is not worth the
> complexity of adding a counter, for example in shared memory with
> XLogCtl (the counter gets reinitialized at each checkpoint,
> incremented when replaying a 2PC prepare, decremented with a 2PC
> commit). So to reduce the backpatch impact I would just do the fsync
> if (serialized_xact > 0 || RecoveryInProgress()) and call it a day.

Sounds ok to me. I will leave it to the committer to decide further.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Dec 30, 2016 at 10:59 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>>
>> Well, flushing the meta-data of pg_twophase is really going to be far
>> cheaper than the many pages done until CheckpointTwoPhase is reached.
>> There should really be a check on serialized_xacts for the
>> non-recovery code path, but considering how cheap that's going to be
>> compared to the rest of the restart point stuff it is not worth the
>> complexity of adding a counter, for example in shared memory with
>> XLogCtl (the counter gets reinitialized at each checkpoint,
>> incremented when replaying a 2PC prepare, decremented with a 2PC
>> commit). So to reduce the backpatch impact I would just do the fsync
>> if (serialized_xact > 0 || RecoveryInProgress()) and call it a day.
>
> Sounds ok to me. I will leave it to the committer to decide further.

Attached is an updated patch. I also noticed that it is better to do
the fsync call before the dtrace call for checkpoint end as well as
logging.
-- 
Michael


Attachment

Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Sat, Dec 31, 2016 at 5:53 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Dec 30, 2016 at 10:59 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>>
>>> Well, flushing the meta-data of pg_twophase is really going to be far
>>> cheaper than the many pages done until CheckpointTwoPhase is reached.
>>> There should really be a check on serialized_xacts for the
>>> non-recovery code path, but considering how cheap that's going to be
>>> compared to the rest of the restart point stuff it is not worth the
>>> complexity of adding a counter, for example in shared memory with
>>> XLogCtl (the counter gets reinitialized at each checkpoint,
>>> incremented when replaying a 2PC prepare, decremented with a 2PC
>>> commit). So to reduce the backpatch impact I would just do the fsync
>>> if (serialized_xact > 0 || RecoveryInProgress()) and call it a day.
>>
>> Sounds ok to me. I will leave it to the committer to decide further.
>
> Attached is an updated patch. I also noticed that it is better to do
> the fsync call before the dtrace call for checkpoint end as well as
> logging.

The patch looks good to me.

I am wondering what happens if a 2PC file gets created, at checkpoint
time we flush the pg_twophase directory, and then the file gets
removed. Do we need to flush the directory to ensure that the removal
persists? Whatever material I find about fsync() on a directory gives
examples of file creation, not of deleting a file. If we want to
persist the removal, we probably have to flush pg_twophase always or
add code to track whether any activity happened in pg_twophase between
two checkpoints. The latter seems like complication not worth the
benefit.

I guess it's hard to construct a case to reproduce the issue
described in your first mail, but I am still checking if you have any
reproduction. Maybe we could use a similar reproduction to test the
deletion of a two-phase file.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Jan 3, 2017 at 3:32 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I am wondering what happens if a 2PC file gets created, at the time of
> checkpoint we flush the pg_twophase directory, then the file gets
> removed. Do we need to flush the directory to ensure that the removal
> persists? Whatever material I look for fsync() on directory, it gives
> examples of file creation, not that of deleting a file. If we want to
> persist the removal, probably we have to flush pg_twophase always or
> add code to track whether any activity happened in pg_twophase between
> two checkpoints. The later seems complication not worth the benefit.

There is already the delay checkpoint machinery to cover timing
problems here. Have a look for example at EndPrepare()@twophase.c.
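
(For readers unfamiliar with that machinery, here is a simplified
paraphrase of what EndPrepare() does; the real function has more steps,
so treat this as a sketch only:)

    START_CRIT_SECTION();

    /*
     * Keep any concurrently starting checkpoint from deciding it has
     * nothing to do for this transaction while we are between writing
     * the PREPARE WAL record and making the gxact visible.
     */
    MyPgXact->delayChkpt = true;

    XLogBeginInsert();
    /* ... register the serialized 2PC state data ... */
    gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
    XLogFlush(gxact->prepare_end_lsn);

    MarkAsPrepared(gxact);

    /*
     * Now a checkpoint starting after this point will certainly see the
     * gxact as a candidate for fsyncing, so stop delaying checkpoints.
     */
    MyPgXact->delayChkpt = false;

    END_CRIT_SECTION();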

> I guess, it's hard to construct a case to reproduce the issue
> described in your first mail. But still checking if you have any
> reproduction. May be we could use similar reproduction to test the
> deletion of two phase file.

Not really. You can just do the test on a VM (one transaction
generating a 2PC file, followed by a checkpoint), then kill -9 its
parent instance. That's a radical way to emulate the power loss. I do
that on macOS with VMware Fusion.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Tue, Jan 3, 2017 at 2:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jan 3, 2017 at 3:32 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I am wondering what happens if a 2PC file gets created, at the time of
>> checkpoint we flush the pg_twophase directory, then the file gets
>> removed. Do we need to flush the directory to ensure that the removal
>> persists? Whatever material I look for fsync() on directory, it gives
>> examples of file creation, not that of deleting a file. If we want to
>> persist the removal, probably we have to flush pg_twophase always or
>> add code to track whether any activity happened in pg_twophase between
>> two checkpoints. The later seems complication not worth the benefit.
>
> There is already the delay checkpoint machinery to cover timing
> problems here. Have a look for example at EndPrepare()@twophase.c.

Are you talking about
    /*
     * Now we can mark ourselves as out of the commit critical section: a
     * checkpoint starting after this will certainly see the gxact as a
     * candidate for fsyncing.
     */
    MyPgXact->delayChkpt = false;

That's while creating the file. I do not see similar code in
FinishPreparedTransaction() where the 2PC file is removed.

>
>> I guess, it's hard to construct a case to reproduce the issue
>> described in your first mail. But still checking if you have any
>> reproduction. May be we could use similar reproduction to test the
>> deletion of two phase file.
>
> Not really. You can just do the test on a VM (one transaction
> generating a 2PC file, followed by a checkpoint), then kill-9 its
> parent instance. That's radical to emulate the power loss. I do that
> on macos with VMware Fusion.
> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Jan 3, 2017 at 8:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Are you talking about
>     /*
>      * Now we can mark ourselves as out of the commit critical section: a
>      * checkpoint starting after this will certainly see the gxact as a
>      * candidate for fsyncing.
>      */
>     MyPgXact->delayChkpt = false;
>
> That's while creating the file. I do not see similar code in
> FinishPreparedTransaction() where the 2PC file is removed.

RecordTransactionCommitPrepared() does the same work for COMMIT PREPARED.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Tue, Jan 3, 2017 at 5:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jan 3, 2017 at 8:41 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Are you talking about
>>     /*
>>      * Now we can mark ourselves as out of the commit critical section: a
>>      * checkpoint starting after this will certainly see the gxact as a
>>      * candidate for fsyncing.
>>      */
>>     MyPgXact->delayChkpt = false;
>>
>> That's while creating the file. I do not see similar code in
>> FinishPreparedTransaction() where the 2PC file is removed.
>
> RecordTransactionCommitPrepared() does the same work for COMMIT PREPARED.

Ok.

I don't have anything more to review in this patch. I will leave that
commitfest entry in "needs review" status for few days in case anyone
else wants to review it. If none is going to review it, we can mark it
as "ready for committer".

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Wed, Jan 4, 2017 at 1:23 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I don't have anything more to review in this patch. I will leave that
> commitfest entry in "needs review" status for few days in case anyone
> else wants to review it. If none is going to review it, we can mark it
> as "ready for committer".

Thanks for the input!
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
Marking this as ready for committer.

On Wed, Jan 4, 2017 at 12:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jan 4, 2017 at 1:23 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I don't have anything more to review in this patch. I will leave that
>> commitfest entry in "needs review" status for few days in case anyone
>> else wants to review it. If none is going to review it, we can mark it
>> as "ready for committer".
>
> Thanks for the input!
> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
Heikki Linnakangas
Date:
On 12/27/2016 01:31 PM, Andres Freund wrote:
> On 2016-12-27 14:09:05 +0900, Michael Paquier wrote:
>> On Fri, Dec 23, 2016 at 3:02 AM, Andres Freund <andres@anarazel.de> wrote:
>>> Not quite IIRC: that doesn't deal with file size increase.  All this would be easier if hardlinks wouldn't exist
>>> IIUC. It's basically a question whether dentry, inode or contents need to be synced.   Yes, it sucks.
>>
>> I did more monitoring of the code... Looking at unlogged tables and
>> empty() routines of access methods, isn't there a risk as well for
>> unlogged tables? mdimmedsync() does not fsync() the parent directory
>> either!
>
> But the files aren't created there, so I don't generally see the
> problem.  And the creation of the metapages etc. should be WAL logged
> anyway.  So I don't think we should / need to do anything special
> there.  You can argue however that we wouldn't necessarily fsync the
> parent directory for the file creation, ever.  But that'd be more
> smgrcreate's responsibility than anything.

So, if I understood correctly, the problem scenario is:

1. Create and write to a file.
2. fsync() the file.
3. Crash.
4. After restart, the file is gone.

If that can happen, don't we have the same problem in many other places? 
Like, all the SLRUs? They don't fsync the directory either.


Is unlink() guaranteed to be durable, without fsyncing the directory? If 
not, then we need to fsync() the directory even if there are no files in 
it at the moment, because some might've been removed earlier in the 
checkpoint cycle.

- Heikki




Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> So, if I understood correctly, the problem scenario is:
>
> 1. Create and write to a file.
> 2. fsync() the file.
> 3. Crash.
> 4. After restart, the file is gone.

Yes, that's a problem with fsync's durability, and we need to achieve
that at checkpoint. I find [1] a good read on the matter. That's
easier to decipher than [2] or [3] in the POSIX spec.

[1]: http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/
[2]: http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
[3]: http://pubs.opengroup.org/onlinepubs/009695399/functions/rename.html

> If that can happen, don't we have the same problem in many other places?
> Like, all the SLRUs? They don't fsync the directory either.

Right, pg_commit_ts and pg_clog enter in this category.

> Is unlink() guaranteed to be durable, without fsyncing the directory? If
> not, then we need to fsync() the directory even if there are no files in it
> at the moment, because some might've been removed earlier in the checkpoint
> cycle.

Hm... I am not an expert in file systems. At least on ext4 I can see
that unlink() is atomic, but not durable. So if an unlink() is
followed by a power failure, the previously unlinked file could still
be there if the parent directory is not fsync'd.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Jan 6, 2017 at 9:26 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Wed, Jan 4, 2017 at 12:16 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Jan 4, 2017 at 1:23 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> I don't have anything more to review in this patch. I will leave that
>>> commitfest entry in "needs review" status for few days in case anyone
>>> else wants to review it. If none is going to review it, we can mark it
>>> as "ready for committer".
>>
>> Thanks for the input!
>
> Marking this as ready for committer.

Patch is moved to CF 2017-03.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> If that can happen, don't we have the same problem in many other places?
>> Like, all the SLRUs? They don't fsync the directory either.
>
> Right, pg_commit_ts and pg_clog enter in this category.

Implemented as attached.

>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>> not, then we need to fsync() the directory even if there are no files in it
>> at the moment, because some might've been removed earlier in the checkpoint
>> cycle.
>
> Hm... I am not an expert in file systems. At least on ext4 I can see
> that unlink() is atomic, but not durable. So if an unlink() is
> followed by a power failure, the previously unlinked file could be
> here if the parent directory is not fsync'd.

So I have been doing more work on this patch, with the following things done:
- Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
ensure their durability.
- Create a durable_unlink() routine to give a way to perform a durable
file removal (see the sketch below).
I am now counting 111 calls to unlink() in the backend code, and
looking at all of them, most look fine with plain unlink() even if they
are not made durable, as they work on temporary files (see timeline.c
for example), with some exceptions:
- In pg_stop_backup, the old backup_label and tablespace_map removal
should be durable to avoid putting the system in a wrong state after a
power loss. Other calls of unlink() are followed by durable_rename() so
they are fine if left as such.
- Removal of old WAL segments should be durable as well. There is
already an effort to rename them durably in case of a segment
recycled. In case of a power loss, a file that should have been
removed could remain in pg_xlog.
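
A rough sketch of the durable_unlink() idea mentioned above
(illustration only; the routine in the attached patch may differ in
name, error reporting and how it reuses the fd.c machinery):

    static int
    durable_unlink_sketch(const char *fname)
    {
        char        parentpath[MAXPGPATH];

        if (unlink(fname) < 0)
            return -1;

        /*
         * The removal of the directory entry is not durable until the
         * parent directory itself has been fsync'd.
         */
        strlcpy(parentpath, fname, MAXPGPATH);
        get_parent_directory(parentpath);
        if (parentpath[0] == '\0')
            strlcpy(parentpath, ".", MAXPGPATH);
        fsync_fname(parentpath, true);

        return 0;
    }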

Looking around, I have also bumped into the following bug report for
SQLite, which is in the same category of things:

http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
Scary to see that in this case durability can be a problem at
transaction commit...
-- 
Michael


Attachment

Re: [HACKERS] Potential data loss of 2PC files

From
David Steele
Date:
On 2/13/17 12:10 AM, Michael Paquier wrote:
> On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> If that can happen, don't we have the same problem in many other places?
>>> Like, all the SLRUs? They don't fsync the directory either.
>>
>> Right, pg_commit_ts and pg_clog enter in this category.
> 
> Implemented as attached.
> 
>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>> not, then we need to fsync() the directory even if there are no files in it
>>> at the moment, because some might've been removed earlier in the checkpoint
>>> cycle.
>>
>> Hm... I am not an expert in file systems. At least on ext4 I can see
>> that unlink() is atomic, but not durable. So if an unlink() is
>> followed by a power failure, the previously unlinked file could be
>> here if the parent directory is not fsync'd.
> 
> So I have been doing more work on this patch, with the following things done:
> - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
> ensure their durability.
> - Create a durable_unlink() routine to give a way to perform a durable
> file removal.
> I am now counting 111 calls to unlink() in the backend code, and
> looking at all of them most look fine with plain unlink() if they are
> not made durable as they work on temporary files (see timeline.c for
> example), with some exceptions:
> - In pg_stop_backup, the old backup_label and tablespace_map removal
> should be durable to avoid putting the system in a wrong state after
> power loss. Other calls of unlink() are followed by durable_rename so
> they are fine if let as such.
> - Removal of old WAL segments should be durable as well. There is
> already an effort to rename them durably in case of a segment
> recycled. In case of a power loss, a file that should have been
> removed could remain in pg_xlog.
> 
> Looking around, I have bumped as well on the following bug report for
> SQlite which is in the same category of things:
>
http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
> Scary to see that in this case durability can be a problem at
> transaction commit...

This patch applies cleanly and compiles at cccbdde.

Ashutosh, do you know when you'll have a chance to review?

-- 
-David
david@pgmasters.net



Re: [HACKERS] Potential data loss of 2PC files

From
Ashutosh Bapat
Date:
On Thu, Mar 16, 2017 at 10:17 PM, David Steele <david@pgmasters.net> wrote:
> On 2/13/17 12:10 AM, Michael Paquier wrote:
>> On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> If that can happen, don't we have the same problem in many other places?
>>>> Like, all the SLRUs? They don't fsync the directory either.
>>>
>>> Right, pg_commit_ts and pg_clog enter in this category.
>>
>> Implemented as attached.
>>
>>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>>> not, then we need to fsync() the directory even if there are no files in it
>>>> at the moment, because some might've been removed earlier in the checkpoint
>>>> cycle.
>>>
>>> Hm... I am not an expert in file systems. At least on ext4 I can see
>>> that unlink() is atomic, but not durable. So if an unlink() is
>>> followed by a power failure, the previously unlinked file could be
>>> here if the parent directory is not fsync'd.
>>
>> So I have been doing more work on this patch, with the following things done:
>> - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to
>> ensure their durability.
>> - Create a durable_unlink() routine to give a way to perform a durable
>> file removal.
>> I am now counting 111 calls to unlink() in the backend code, and
>> looking at all of them most look fine with plain unlink() if they are
>> not made durable as they work on temporary files (see timeline.c for
>> example), with some exceptions:
>> - In pg_stop_backup, the old backup_label and tablespace_map removal
>> should be durable to avoid putting the system in a wrong state after
>> power loss. Other calls of unlink() are followed by durable_rename so
>> they are fine if let as such.
>> - Removal of old WAL segments should be durable as well. There is
>> already an effort to rename them durably in case of a segment
>> recycled. In case of a power loss, a file that should have been
>> removed could remain in pg_xlog.
>>
>> Looking around, I have bumped as well on the following bug report for
>> SQlite which is in the same category of things:
>>
http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
>> Scary to see that in this case durability can be a problem at
>> transaction commit...
>
> This patch applies cleanly and compiles at cccbdde.
>
> Ashutosh, do you know when you'll have a chance to review?

The scope of this work has expanded, since last time I reviewed and
marked it as RFC. Right now I am busy with partition-wise joins and do
not have sufficient time to take a look at the expanded scope.
However, I can come back to this after partition-wise join, but that
may stretch to the end of the commitfest. Sorry.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Potential data loss of 2PC files

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
> The scope of this work has expanded, since last time I reviewed and marked
> it as RFC. Right now I am busy with partition-wise joins and do not have
> sufficient time to take a look at the expanded scope.
> However, I can come back to this after partition-wise join, but that may
> stretch to the end of the commitfest. Sorry.

I marked this as ready for committer.  The code looks good, make check passed, and I ran read/write pgbench for some
period to cause checkpoints with WAL file removal, followed by pg_ctl stop.  The checkpoint emitted the following
message, and there was no message like "could not ..." indicating a file unlink or directory sync failure.

LOG:  checkpoint complete: wrote 1835 buffers (11.2%); 0 transaction log file(s) added, 1 removed, 0 recycled;
write=0.725 s, sync=0.002 s, total=0.746 s; sync files=9, longest=0.002 s, average=0.000 s; distance=16381 kB,
estimate=16381 kB

Regards
Takayuki Tsunakawa


Re: [HACKERS] Potential data loss of 2PC files

From
Teodor Sigaev
Date:
>>> If that can happen, don't we have the same problem in many other places?
>>> Like, all the SLRUs? They don't fsync the directory either.
>> Right, pg_commit_ts and pg_clog enter in this category.
>
> Implemented as attached.
>
>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>> not, then we need to fsync() the directory even if there are no files in it
>>> at the moment, because some might've been removed earlier in the checkpoint
>>> cycle.

What is the protection if pg crashes after unlink() but before fsync()? Right,
it's a rather small window for such a scenario, but isn't it better to have
another protection? Like WAL-logging of WAL segment removal...

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Wed, Mar 22, 2017 at 12:46 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>>> If that can happen, don't we have the same problem in many other places?
>>>> Like, all the SLRUs? They don't fsync the directory either.
>>>
>>> Right, pg_commit_ts and pg_clog enter in this category.
>>
>>
>> Implemented as attached.
>>
>>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If
>>>> not, then we need to fsync() the directory even if there are no files in
>>>> it
>>>> at the moment, because some might've been removed earlier in the
>>>> checkpoint
>>>> cycle.
>
> What is the protection if pg crashes after unlink() but before fsync()? Right,
> it's a rather small window for such a scenario, but isn't it better to have
> another protection?

This may apply in some cases, but my conclusion on the matter regarding
this patch is that we don't actually need something like that yet,
because there are no cases where we could apply it:
- Removal of past WAL segments is on a per-node basis.
- Installation of new segments is guessable.
- Removal of backup_label and tablespace map is based on the timings
of pg_start/stop_backup, once again on a per-node basis.
This patch reduces the window to the minimum we can do: once
durable_unlink() returns, we can guarantee that the system is in a
durable state.

> Like WAL-logging of WAL segment removing...

The pace of removal of the WAL segments is different on each node, and
should be different on each node (for example there could be
replication slots with different retention policies on cascading
downstream standbys). So that's not a concept you can apply here.
-- 
Michael



Re: [HACKERS] Potential data loss of 2PC files

From
Teodor Sigaev
Date:
Hmm, it doesn't work (but applies) on current HEAD:

% uname -a
FreeBSD *** 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0 r315651: Tue Mar 21 
02:44:23 MSK 2017     teodor@***:/usr/obj/usr/src/sys/XOR  amd64
% pg_config --configure
'--enable-depend' '--enable-cassert' '--enable-debug' '--enable-tap-tests' 
'--with-perl' 'CC=clang' 'CFLAGS=-O0'
% rm -rf /spool/pg_data/*
% initdb -n -E UTF8 --locale ru_RU.UTF-8 --lc-messages C --lc-monetary C 
--lc-numeric C --lc-time C -D /spool/pg_data
Running in no-clean mode.  Mistakes will not be cleaned up.
The files belonging to this database system will be owned by user "teodor".
This user must also own the server process.

The database cluster will be initialized with locales
  COLLATE:  ru_RU.UTF-8
  CTYPE:    ru_RU.UTF-8
  MESSAGES: C
  MONETARY: C
  NUMERIC:  C
  TIME:     C
The default text search configuration will be set to "russian".

Data page checksums are disabled.

fixing permissions on existing directory /spool/pg_data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... 2017-03-23 23:07:14.705 MSK [40286] FATAL:  could 
not open file "pg_clog": No such file or directory
child process exited with exit code 1
initdb: data directory "/spool/pg_data" not removed at user's request


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: [HACKERS] Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Mar 24, 2017 at 5:08 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Hmm, it doesn't work (but applies) on current HEAD:
> [...]
> Data page checksums are disabled.
>
> fixing permissions on existing directory /spool/pg_data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... posix
> creating configuration files ... ok
> running bootstrap script ... 2017-03-23 23:07:14.705 MSK [40286] FATAL:
> could not open file "pg_clog": No such file or directory
> child process exited with exit code 1
> initdb: data directory "/spool/pg_data" not removed at user's request

And the renaming of pg_clog to pg_xact is also my fault. Attached is
an updated patch.
-- 
Michael


Attachment

Re: Potential data loss of 2PC files

From
Teodor Sigaev
Date:
> And the renaming of pg_clog to pg_xact is also my fault. Attached is
> an updated patch.

Thank you. One more question: what about symlinks? If a DBA moves, for example,
pg_xact to another disk and leaves a symlink in the data directory, I suppose
fsync on the symlink will do nothing actually.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Potential data loss of 2PC files

From
Michael Paquier
Date:
On Fri, Mar 24, 2017 at 11:36 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> And the renaming of pg_clog to pg_xact is also my fault. Attached is
>> an updated patch.
>
>
> Thank you. One more question: what about symlinks? If a DBA moves, for
> example, pg_xact to another disk and leaves a symlink in the data directory,
> I suppose fsync on the symlink will do nothing actually.

I did not think of this case, but is that really common? There is not
even an option to configure that at the command level. And surely we
cannot support any fancy scenario that people figure out using PGDATA.
Existing callers of fsync_fname don't bother about this case either,
by the way; take the calls related to pg_logical and pg_replslot.

If something should be done in this area, that would be surely in
fsync_fname directly to centralize all the calls, and I would think of
that as a separate patch, and a separate discussion.
-- 
Michael



Re: Potential data loss of 2PC files

From
Teodor Sigaev
Date:
>> Thank you. One more question: what about symlinks? If a DBA moves, for
>> example, pg_xact to another disk and leaves a symlink in the data directory,
>> I suppose fsync on the symlink will do nothing actually.
>
> I did not think of this case, but is that really common? There is even
I have seen a lot of such cases.

> If something should be done in this area, that would be surely in
> fsync_fname directly to centralize all the calls, and I would think of
> that as a separate patch, and a separate discussion.
Agree

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Potential data loss of 2PC files

From
Teodor Sigaev
Date:
Thank you, pushed

Michael Paquier wrote:
> On Fri, Mar 24, 2017 at 11:36 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> And the renaming of pg_clog to pg_xact is also my fault. Attached is
>>> an updated patch.
>>
>>
>> Thank you. One more question: what about symlinks? If a DBA moves, for
>> example, pg_xact to another disk and leaves a symlink in the data directory,
>> I suppose fsync on the symlink will do nothing actually.
>
> I did not think of this case, but is that really common? There is even
> no option to configure that at command level. And surely we cannot
> support any fancy scenario that people figure out using PGDATA.
> Existing callers of fsync_fname don't bother about this case as well
> by the way, take the calls related to pg_logical and pg_repslot.
>
> If something should be done in this area, that would be surely in
> fsync_fname directly to centralize all the calls, and I would think of
> that as a separate patch, and a separate discussion.
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Mar 28, 2017 at 1:34 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, pushed.

Thanks!

Do you think that this qualifies as a bug fix for a backpatch? I would
think so, but I would not mind waiting for some dust to settle on it
before considering applying that to the back-branches. Thoughts from
others?
-- 
Michael



Re: Potential data loss of 2PC files

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
> Do you think that this qualifies as a bug fix for a backpatch? I would think
> so, but I would not mind waiting for some dust to be on it before considering
> applying that on back-branches. Thoughts from others?

I think so, too.  As the change from rename() to durable_rename() was applied to all releases, maybe we can follow
that.  In addition, keeping code differences as few as possible would make it easier for committers to apply other bug
fixes to all versions.

Regards
Takayuki Tsunakawa



Re: Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Mar 28, 2017 at 8:38 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
>> Do you think that this qualifies as a bug fix for a backpatch? I would think
>> so, but I would not mind waiting for some dust to be on it before considering
>> applying that on back-branches. Thoughts from others?
>
> I think so, too.  As the change from rename() to durable_rename() was applied to all releases, maybe we can follow
> that.  In addition, keeping code differences as few as possible would make it easier for committers to apply other
> bug fixes to all versions.

The patch for HEAD applies cleanly on REL9_6_STABLE; further down it
needs some work, but not much either. I am fine with producing those
patches should there be any need.
-- 
Michael



Re: Potential data loss of 2PC files

From
Michael Paquier
Date:
On Tue, Mar 28, 2017 at 9:37 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 8:38 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> From: pgsql-hackers-owner@postgresql.org
>>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
>>> Do you think that this qualifies as a bug fix for a backpatch? I would think
>>> so, but I would not mind waiting for some dust to be on it before considering
>>> applying that on back-branches. Thoughts from others?
>>
>> I think so, too.  As the change from rename() to durable_rename() was applied to all releases, maybe we can follow
>> that.  In addition, keeping code differences as few as possible would make it easier for committers to apply other
>> bug fixes to all versions.
>
> The patch of HEAD applies cleanly on REL9_6_STABLE,

Actually that's not correct either, as pg_clog has been renamed to
pg_xact. Let's be careful here.
-- 
Michael