Thread: reorder pg_rewind control file sync
Hello Michaël,

On Sat, 23 Mar 2019, Michael Paquier wrote:
> On Fri, Mar 22, 2019 at 03:18:26PM +0100, Fabien COELHO wrote:
>> Attached is a quick patch about "pg_rewind", so that the control file
>> is updated after everything else is committed to disk.
>
> Could you start a new thread about that please? This one has already
> been used for too many things.

Here it is.

The attached patch reorders the cluster fsyncing and control file changes in
"pg_rewind" so that the latter is done after all data are committed to disk,
so as to reflect the actual cluster status, similarly to what is done by
"pg_checksums", per discussion in the thread about offline enabling of
checksums:

https://www.postgresql.org/message-id/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

--
Fabien.
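The ordering described in the message above (flush all data first, update the control file last) can be illustrated with a minimal sketch. This is not the patch itself and not pg_rewind's code: the function names, the `toy_control` file name, and the directory layout are all hypothetical, and directory fsync as written assumes a POSIX system.

```python
import os
import tempfile


def fsync_path(path):
    """Flush one file or directory to disk (directory fsync assumes POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def sync_then_update_control(datadir, new_control, control_name="toy_control"):
    """Flush all data files first, then write and flush the control file last.

    Returns the paths in the order they were made durable. All names here
    are hypothetical; this only models the ordering discussed in the thread.
    """
    synced = []
    for root, _dirs, files in os.walk(datadir):
        for name in sorted(files):
            if name == control_name:
                continue  # the control file is handled last, on purpose
            path = os.path.join(root, name)
            fsync_path(path)
            synced.append(path)
        fsync_path(root)  # make directory entries durable as well
    control_path = os.path.join(datadir, control_name)
    with open(control_path, "wb") as f:
        f.write(new_control)
        f.flush()
        os.fsync(f.fileno())
    fsync_path(datadir)
    synced.append(control_path)
    return synced


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as datadir:
    for name in ("rel_1", "rel_2"):
        with open(os.path.join(datadir, name), "wb") as f:
            f.write(b"data")
    with open(os.path.join(datadir, "toy_control"), "wb") as f:
        f.write(b"old state")
    order = sync_then_update_control(datadir, b"new state")
    with open(os.path.join(datadir, "toy_control"), "rb") as f:
        final_control = f.read()
    last_synced = os.path.basename(order[-1])
```

With this ordering, a crash before the last step leaves the old control file in place, so the control file never claims a state that the data files have not yet reached.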
On Sat, Mar 23, 2019 at 06:18:27AM +0100, Fabien COELHO wrote:
> Here it is.

Thanks.

> The attached patch reorders the cluster fsyncing and control file changes in
> "pg_rewind" so that the latter is done after all data are committed to disk,
> so as to reflect the actual cluster status, similarly to what is done by
> "pg_checksums", per discussion in the thread about offline enabling of
> checksums:

It would be an interesting property to see that it is possible to
retry a rewind of a node which has been partially rewound already,
but where the operation failed in the middle. Because that's the real
deal here: as long as we know that its control file is in its previous
state, we can rely on it for retrying the operation. Logically, I
think that it should work, because we would still try to fetch the
same blocks from the source server since WAL has forked by looking at
the records of the target up from the last checkpoint before WAL has
forked up to the last shutdown checkpoint, and the operation is lossy
by design when it comes to dealing with file differences.

Have you tried to see if pg_rewind is able to repeat its operation for
specific scenarios? One is for example a database created on the
promoted standby, used as a source, and a second, different database
created on the primary after the standby has been promoted. You could
make the tool exit() before the rewind finishes, just before updating
the control file, and see if the operation is repeatable.
Interrupting the tool would be fine as well, though less controllable.

It would be good to mention in the patch why the order matters.
--
Michael
Hello Michaël,

>> The attached patch reorders the cluster fsyncing and control file changes in
>> "pg_rewind" so that the latter is done after all data are committed to disk,
>> so as to reflect the actual cluster status, similarly to what is done by
>> "pg_checksums", per discussion in the thread about offline enabling of
>> checksums:
>
> It would be an interesting property to see that it is possible to
> retry a rewind of a node which has been partially rewound already,
> but the operation failed in the middle.

Yes. I understand that the question is whether the Warning in the pg_rewind
documentation can be partially lifted. The short answer is that it is not
obvious.

> Because that's the real deal here: as long as we know that its control
> file is in its previous state, we can rely on it for retrying the
> operation. Logically, I think that it should work, because we would
> still try to fetch the same blocks from the source server since WAL has
> forked by looking at the records of the target up from the last
> checkpoint before WAL has forked up to the last shutdown checkpoint, and
> the operation is lossy by design when it comes to deal with file
> differences.
>
> Have you tried to see if pg_rewind is able to repeat its operation for
> specific scenarios?

I have run the regression tests. I'm not sure which scenarios are
covered there, but probably not an interruption in the middle of an fsync,
especially if fsync is usually disabled to ease the tests :-)

> One is for example a database created on the promoted standby, used as a
> source, and a second, different database created on the primary after
> the standby has been promoted. You could make the tool exit() before
> the rewind finishes, just before updating the control file, and see if
> the operation is repeatable. Interrupting the tool would be fine as
> well, still less controllable.
>
> It would be good to mention in the patch why the order matters.

Yep.

This requires a careful analysis of pg_rewind's inner workings, which I do not
have time to do in the short term.

--
Fabien.
On Mon, Mar 25, 2019 at 10:29:46AM +0100, Fabien COELHO wrote:
> I have run the non regression tests. I'm not sure of what scenarii are
> covered there, but probably not an interruption in the middle of a fsync,
> specially if fsync is usually disabled to ease the tests:-)

Forcing the tool to stop at a specific point requires a booby-trap.
And even if fsync is not killed, you could just force the tool to
stop once before updating the control file, and attempt a re-run
without the trap, checking if it works at the second attempt, so the
problem is quite independent of the timing of fsync().
--
Michael
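The booby-trap experiment suggested above can be sketched on a toy model. Everything here is an assumption made for illustration: `toy_rewind`, `TrapHit`, and the dictionary layout are hypothetical, and the list of changed blocks is passed in directly, whereas the real tool derives it from the target's WAL. The sketch only shows the shape of the test (stop once before the control file update, then re-run without the trap), not pg_rewind's actual behaviour.

```python
class TrapHit(Exception):
    """Raised by the hypothetical booby-trap to simulate the tool stopping."""


def toy_rewind(target, source, changed_blocks, trap_before_control=False):
    """Toy model of a rewind: copy changed blocks, then update the control state."""
    # Copy every block that changed since the fork point from source to target.
    for blk in changed_blocks:
        target["blocks"][blk] = source["blocks"][blk]
    # Booby-trap: stop after the data copy but before the control update.
    if trap_before_control:
        raise TrapHit("stopped just before the control file update")
    # The control state is only touched once all data is in place.
    target["control"] = source["control"]


source = {"control": "timeline 2", "blocks": {0: "a2", 1: "b2", 2: "c"}}
target = {"control": "timeline 1", "blocks": {0: "a1", 1: "b1", 2: "c"}}
changed = [0, 1]

# First attempt: the trap fires, so the control state must be untouched.
try:
    toy_rewind(target, source, changed, trap_before_control=True)
except TrapHit:
    pass
first_attempt_control = target["control"]

# Second attempt without the trap: the operation should simply complete.
toy_rewind(target, source, changed)
```

In this model the second run succeeds because copying the same blocks twice is harmless and the control state still describes the pre-rewind cluster after the first, interrupted run. Whether the real tool has the same property is exactly the open question in the thread.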