Thread: reorder pg_rewind control file sync

reorder pg_rewind control file sync

From
Fabien COELHO
Date:
Bonjour Michaël,

On Sat, 23 Mar 2019, Michael Paquier wrote:
> On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote:
>> Attached is a quick patch about "pg_rewind", so that the control file 
>> is updated after everything else is committed to disk.
>
> Could you start a new thread about that please?  This one has already 
> been used for too many things.

Here it is.

The attached patch reorders the cluster fsyncing and control file changes 
in "pg_rewind" so that the later is done after all data are committed to 
disk, so as to reflect the actual cluster status, similarly to what is 
done by "pg_checksums", per discussion in the thread about offline 
enabling of checksums:

https://www.postgresql.org/message-id/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

-- 
Fabien.
Attachment

Re: reorder pg_rewind control file sync

From
Michael Paquier
Date:
On Sat, Mar 23, 2019 at 06:18:27AM +0100, Fabien COELHO wrote:
> Here it is.

Thanks.

> The attached patch reorders the cluster fsyncing and control file changes in
> "pg_rewind" so that the later is done after all data are committed to disk,
> so as to reflect the actual cluster status, similarly to what is done by
> "pg_checksums", per discussion in the thread about offline enabling of
> checksums:

It would be an interesting property to see that it is possible to
retry a rewind of a node which has been partially rewound already,
but the operation failed in the middle.  Because that's the real deal
here: as long as we know that its control file is in its previous
state, we can rely on it for retrying the operation.  Logically, I
think that it should work, because we would still try to fetch the
same blocks from the source server since WAL has forked by looking at
the records of the target up from the last checkpoint before WAL has
forked up to the last shutdown checkpoint, and the operation is lossy
by design when it comes to deal with file differences.

Have you tried to see if pg_rewind is able to repeat its operation for
specific scenarios?  One is for example a database created on the
promoted standby, used as a source, and a second, different database
created on the primary after the standby has been promoted.  You could
make the tool exit() before the rewind finishes, just before updating
the control file, and see if the operation is repeatable.
Interrupting the tool would be fine as well, still less controllable.

It would be good to mention in the patch why the order matters.
--
Michael

Attachment

Re: reorder pg_rewind control file sync

From
Fabien COELHO
Date:
Bonjour Michaël,

>> The attached patch reorders the cluster fsyncing and control file changes in
>> "pg_rewind" so that the later is done after all data are committed to disk,
>> so as to reflect the actual cluster status, similarly to what is done by
>> "pg_checksums", per discussion in the thread about offline enabling of
>> checksums:
>
> It would be an interesting property to see that it is possible to
> retry a rewind of a node which has been partially rewound already,
> but the operation failed in the middle.

Yes. I understand that the question is whether the Warning in pg_rewind 
documentation can be partially lifted. The short answer is that it is not 
obvious.

> Because that's the real deal here: as long as we know that its control 
> file is in its previous state, we can rely on it for retrying the 
> operation.  Logically, I think that it should work, because we would 
> still try to fetch the same blocks from the source server since WAL has 
> forked by looking at the records of the target up from the last 
> checkpoint before WAL has forked up to the last shutdown checkpoint, and 
> the operation is lossy by design when it comes to deal with file 
> differences.
>
> Have you tried to see if pg_rewind is able to repeat its operation for
> specific scenarios?

I have run the non regression tests. I'm not sure of what scenarii are 
covered there, but probably not an interruption in the middle of a fsync, 
specially if fsync is usually disabled to ease the tests:-)

> One is for example a database created on the promoted standby, used as a 
> source, and a second, different database created on the primary after 
> the standby has been promoted.  You could make the tool exit() before 
> the rewind finishes, just before updating the control file, and see if 
> the operation is repeatable. Interrupting the tool would be fine as 
> well, still less controllable.
>
> It would be good to mention in the patch why the order matters.

Yep. This requires a careful analysis of pg_rewind inner working, that I 
do not have to do in the short terme.

-- 
Fabien.

Re: reorder pg_rewind control file sync

From
Michael Paquier
Date:
On Mon, Mar 25, 2019 at 10:29:46AM +0100, Fabien COELHO wrote:
> I have run the non regression tests. I'm not sure of what scenarii are
> covered there, but probably not an interruption in the middle of a fsync,
> specially if fsync is usually disabled to ease the tests:-)

Force the tool to stop at a specific point requires a booby-trap.  And
even if fsync is not killed, you could just enforce the tool to stop
once before updating the control file, and attempt a re-run without
the trap, checking if it works at the second attempt, so the problem
is quite independent from the timing of fsync().
--
Michael

Attachment