Thread: Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Thu, Jan 30, 2025 at 03:36:20PM +0900, Michael Paquier wrote:
> And I am beginning a new thread about going through an issue that Noah
> has mentioned at [1], which is that the 2PC code may attempt to do
> CLOG lookups at very early stage of recovery, where the cluster is not
> in a consistent state.

It's broader than CLOG lookups.  I wrote in [1], "We must not read the old
pg_twophase file, which may contain an unfinished write."  Until recovery
reaches consistency, none of the checks in ProcessTwoPhaseBuffer() or its
callee ReadTwoPhaseFile() are safe.

> [1]: https://www.postgresql.org/message-id/20250117005221.05.nmisch@google.com
> [2]: https://www.postgresql.org/message-id/20250116205254.65.nmisch@google.com

On Fri, Jan 31, 2025 at 09:21:53AM +0900, Michael Paquier wrote:
> --- a/src/test/recovery/t/009_twophase.pl
> +++ b/src/test/recovery/t/009_twophase.pl

> +    log_like => [
> +        qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
> +        qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
> +    ]);

> +    log_like => [
> +        qr/removing past two-phase state file for transaction 4095 of epoch 238/,
> +        qr/removing future two-phase state file for transaction 4095 of epoch 511/
> +    ]);

As I wrote in [1], "By the time we reach consistency, every file in
pg_twophase will be applicable (not committed or aborted)."  If we find
otherwise, the user didn't follow the backup protocol (or there's another
bug).  Hence, long-term, we should stop these removals and just fail recovery.
We can't fix all data loss consequences of not following the backup protocol,
so the biggest favor we can do the user is draw their attention to the
problem.  How do you see it?
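Concretely, the removal branches in ProcessTwoPhaseBuffer() would become
hard failures, roughly like this (the message wording, error code, and
xid-only formatting are just illustrative, not a concrete proposal):

    if (fromdisk)
        ereport(FATAL,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("stale two-phase state file for transaction %u",
                        xid),
                 errhint("This can happen when a base backup does not "
                         "follow the backup protocol.")));
    else
        ereport(FATAL,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("stale two-phase state in memory for transaction %u",
                        xid)));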

For back branches, the ideal is less clear.  If we can convince ourselves that
enough of these events will indicate damaging problems (user error, hardware
failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
right for back branches.  However, it could be too hard to convince ourselves
of that.  If so, that could justify keeping these removals in back branches.



On Tue, Feb 18, 2025 at 04:57:47PM -0800, Noah Misch wrote:
> As I wrote in [1], "By the time we reach consistency, every file in
> pg_twophase will be applicable (not committed or aborted)."  If we find
> otherwise, the user didn't follow the backup protocol (or there's another
> bug). Hence, long-term, we should stop these removals and just fail recovery.
> We can't fix all data loss consequences of not following the backup protocol,
> so the biggest favor we can do the user is draw their attention to the
> problem.  How do you see it?

Deciding not to trust any of the contents of pg_twophase/ at all until
consistency is reached is not something we should aim for, IMO.  Going
in this direction would mean delaying restoreTwoPhaseData() until
consistency is reached, but there are cases where we can read the
files safely, and where we should do so.  For example, this flow is
perfectly OK in the wasShutdown case, where
PrescanPreparedTransactions() can do its initialization job before
performing WAL recovery, producing a clean list of running XIDs.
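
For reference, the part of the startup sequence I am thinking about
looks roughly like that (simplified from StartupXLOG() and written
from memory, so the details may differ a bit across branches):

    restoreTwoPhaseData();

    if (wasShutdown)
    {
        /*
         * Clean shutdown: pg_twophase is consistent, so the prescan can
         * run right away and seed the set of running XIDs before any
         * WAL is replayed.
         */
        oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
    }
    else
    {
        /*
         * Crash or archive recovery: take the value from the checkpoint
         * record instead, as the on-disk files cannot be fully trusted
         * yet.
         */
        oldestActiveXID = checkPoint.oldestActiveXid;
    }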

I agree that we should aim for a solution that gets rid of the CLOG
lookups in ProcessTwoPhaseBuffer() entirely, and actually, is there a
reason to not just nuke them and replace them with something based on
the checkpoint record itself?  I have to admit that I don't quite see
the issue with ReadTwoPhaseFile() when it comes to crash recovery.
For example, in the case of a partial write, doesn't the CRC32 check
offer some protection for the contents of the file?  Wouldn't it be
OK in that case to assume that the contents of this file will be in
WAL anyway?  We already have one case based on reachedConsistency in
PrepareRedoAdd() for this situation: crashes happening during
checkpoints, where we know that the files can be found in WAL even
though they could have been loaded at the beginning of recovery.
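
To be more precise about the protection I have in mind: a torn or
partially-written state file cannot pass the CRC check done at the end
of the file, along these lines (a simplified sketch of what
ReadTwoPhaseFile() does; TwoPhaseFileCRCOk() is not an actual function
name):

static bool
TwoPhaseFileCRCOk(const char *buf, uint32 total_len)
{
    pg_crc32c   calc_crc;
    pg_crc32c   file_crc;
    uint32      crc_offset = total_len - sizeof(pg_crc32c);

    /* The CRC is stored in the last four bytes of the state file. */
    INIT_CRC32C(calc_crc);
    COMP_CRC32C(calc_crc, buf, crc_offset);
    FIN_CRC32C(calc_crc);
    memcpy(&file_crc, buf + crc_offset, sizeof(pg_crc32c));

    /* A partial write will fail this comparison. */
    return EQ_CRC32C(calc_crc, file_crc);
}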

The base backup issue is a different one, of course, and I think that
we are going to require more data in the 2PC file to provide a better
cross-check barrier: adding to the 2PC file the end LSN of the record
where the 2PC state has been inserted.  Then we could cross-check that
against the redo location and see that it is actually safe to discard
the file, because we know it will show up in WAL.  This seems like a
hefty cost to pay, though, meaning 8 bytes in each 2PC file because
base backups were done wrong.  Bleh.
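
The cross-check itself would be cheap, something like this
(prepare_end_lsn is the hypothetical new field in TwoPhaseFileHeader,
and the function name is only for illustration):

static bool
TwoPhaseFileRedundant(TwoPhaseFileHeader *hdr, XLogRecPtr redo_lsn)
{
    /*
     * If the PREPARE record finished after the checkpoint's redo point,
     * WAL replay is guaranteed to see it again and to recreate the
     * in-memory state, so an on-disk copy (for example one coming from
     * a sloppy base backup) is redundant and can be discarded.
     * Otherwise the file predates the redo point and is the only source
     * of truth we have.
     */
    return hdr->prepare_end_lsn > redo_lsn;
}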

One extra thing that I have mentioned is that we could replace the
CLOG safeguards with checks based on what we know from the checkpoint
record, namely its oldest XID horizon and its next XID (rough sketch
below):
- If we have a 2PC file older than the oldest XID horizon, we know
that it should not exist.
- If we have a 2PC file newer than the next XID, same, or we'll know
about it while replaying.
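
The sketch, with field names written from memory and ignoring the
epoch subtleties that full transaction IDs would bring:

static bool
TwoPhaseFileInRange(FullTransactionId fxid, CheckPoint *checkPoint)
{
    TransactionId xid = XidFromFullTransactionId(fxid);

    /* Older than the checkpoint's oldest XID horizon: should not exist */
    if (TransactionIdPrecedes(xid, checkPoint->oldestXid))
        return false;

    /* At or past the checkpoint's next XID: not assigned yet */
    if (FullTransactionIdFollowsOrEquals(fxid, checkPoint->nextXid))
        return false;

    return true;
}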

WDYT?

> For back branches, the ideal is less clear.  If we can convince ourselves that
> enough of these events will indicate damaging problems (user error, hardware
> failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
> right for back branches.  However, it could be too hard to convince ourselves
> of that.  If so, that could justify keeping these removals in back branches.

While I would like to see something in the back branches, I am not
seeing occurrences of the current behavior being hit by users in the
wild, so I'd rather consider this as material only for v19 at this
stage.  If we reach a clear picture on HEAD, perhaps it will be
possible to salvage some of it for the back branches, but I am not
sure yet whether that will be doable.  That would be nice, but we'll
see.
--
Michael