Thread: Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

On Thu, Jan 30, 2025 at 03:36:20PM +0900, Michael Paquier wrote:
> And I am beginning a new thread about going through an issue that Noah
> has mentioned at [1], which is that the 2PC code may attempt to do
> CLOG lookups at a very early stage of recovery, where the cluster is
> not in a consistent state.

It's broader than CLOG lookups.  I wrote in [1], "We must not read the old
pg_twophase file, which may contain an unfinished write."  Until recovery
reaches consistency, none of the checks in ProcessTwoPhaseBuffer() or its
callee ReadTwoPhaseFile() are safe.

> [1]: https://www.postgresql.org/message-id/20250117005221.05.nmisch@google.com
> [2]: https://www.postgresql.org/message-id/20250116205254.65.nmisch@google.com

On Fri, Jan 31, 2025 at 09:21:53AM +0900, Michael Paquier wrote:
> --- a/src/test/recovery/t/009_twophase.pl
> +++ b/src/test/recovery/t/009_twophase.pl

> +    log_like => [
> +        qr/removing stale two-phase state from memory for transaction $commit_prepared_xid of epoch 0/,
> +        qr/removing stale two-phase state from memory for transaction $abort_prepared_xid of epoch 0/
> +    ]);

> +    log_like => [
> +        qr/removing past two-phase state file for transaction 4095 of epoch 238/,
> +        qr/removing future two-phase state file for transaction 4095 of epoch 511/
> +    ]);

As I wrote in [1], "By the time we reach consistency, every file in
pg_twophase will be applicable (not committed or aborted)."  If we find
otherwise, the user didn't follow the backup protocol (or there's another
bug).  Hence, long-term, we should stop these removals and just fail recovery.
We can't fix all data loss consequences of not following the backup protocol,
so the biggest favor we can do the user is draw their attention to the
problem.  How do you see it?

For back branches, the ideal is less clear.  If we can convince ourselves that
enough of these events will indicate damaging problems (user error, hardware
failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
right for back branches.  However, it could be too hard to convince ourselves
of that.  If so, that could justify keeping these removals in back branches.



On Tue, Feb 18, 2025 at 04:57:47PM -0800, Noah Misch wrote:
> As I wrote in [1], "By the time we reach consistency, every file in
> pg_twophase will be applicable (not committed or aborted)."  If we find
> otherwise, the user didn't follow the backup protocol (or there's another
> bug). Hence, long-term, we should stop these removals and just fail recovery.
> We can't fix all data loss consequences of not following the backup protocol,
> so the biggest favor we can do the user is draw their attention to the
> problem.  How do you see it?

Deciding not to trust any of the contents of pg_twophase/ at all until
consistency is reached is not something we should aim for, IMO.  Going
in this direction would mean delaying restoreTwoPhaseData() until
consistency is reached, but there are cases where we can read that
data safely, and where we should do so.  For example, this flow is
perfectly OK in the wasShutdown case, where
PrescanPreparedTransactions() is able to do its initialization job
before WAL recovery begins and produce a clean list of running XIDs.

I agree that moving towards a solution where we get rid entirely of
the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for;
actually, is there a reason not to just nuke them and replace them
with something based on the checkpoint record itself?  I have to admit
that I don't quite see the issue with ReadTwoPhaseFile() when it comes
to crash recovery.  For example, in the case of a partial write,
doesn't the CRC32 check offer some protection for the contents of the
file?  Wouldn't it be OK in this case to assume that the contents of
this file will be in WAL anyway?  We have one case based on
reachedConsistency in PrepareRedoAdd() for crashes happening during
checkpoints, where we know that the files can be found in WAL but may
also have been loaded at the beginning of recovery.
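
To be precise, the protection I am thinking of is the CRC validation
we already do when reading the file, roughly this (simplified from
ReadTwoPhaseFile()):

    pg_crc32c   calc_crc,
                file_crc;

    /* Compute the CRC of everything except the trailing CRC itself. */
    INIT_CRC32C(calc_crc);
    COMP_CRC32C(calc_crc, buf, crc_offset);
    FIN_CRC32C(calc_crc);
    file_crc = *((pg_crc32c *) (buf + crc_offset));

    if (!EQ_CRC32C(calc_crc, file_crc))
        return NULL;            /* torn or corrupted file, reject it */

A partial write should fail that check, so we would not replay garbage
from such a file.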

The base backup issue is a different one, of course, and I think that
we are going to require more data in the 2PC file to provide a better
cross-check barrier: the addition to the 2PC file of the end LSN at
which its record was inserted.  Then we could cross-check that with
the redo location, and see that it's actually safe to discard the file
because we know it will be in WAL.  This seems like a hefty cost to
pay, though, meaning 8 bytes in each 2PC file because base backups
were done wrong.  Bleh.
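
In code, the cross-check could look like this (end_lsn being the
hypothetical new field, so these names are made up):

    /*
     * Hypothetical: hdr->end_lsn would be the end LSN of the record
     * that inserted this 2PC state.
     */
    if (hdr->end_lsn >= checkPoint.redo)
    {
        /* The record will be replayed from WAL; the file is redundant
         * and can be discarded safely. */
    }
    else
    {
        /* The record predates the redo point; the on-disk file is the
         * only copy we have, so it has to be trusted. */
    }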

One extra thing that I have mentioned is that we could replace the
CLOG safeguards with checks based on what we know from the checkpoint
record: its oldest XID horizon and its next XID.
- If we have a 2PC file older than the oldest XID horizon, we know
that it should not exist.
- If we have a 2PC file newer than the next XID, the same applies, or
we'll know about it while replaying.
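
In terms of code, that would boil down to something like the following,
with the two boundaries taken from the checkpoint record (helper names
invented):

    /* fxid is parsed from the 2PC file name; the helpers are invented. */
    if (FullTransactionIdPrecedes(fxid, oldest_fxid))
        remove_two_phase_file(fxid);        /* too old, should not exist */
    else if (!FullTransactionIdPrecedes(fxid, next_fxid))
        remove_two_phase_file(fxid);        /* too new, WAL replay covers it */
    else
        keep_for_later_evaluation(fxid);    /* cannot judge it yet */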

WDYT?

> For back branches, the ideal is less clear.  If we can convince ourselves that
> enough of these events will indicate damaging problems (user error, hardware
> failure, or PostgreSQL bugs), the long-term ideal of failing recovery is also
> right for back branches.  However, it could be too hard to convince ourselves
> of that.  If so, that could justify keeping these removals in back branches.

While I would like to see something in the back branches, I am not
seeing occurrences of the current behavior being hit by users in the
wild, so I'd rather consider this as material only for v19 at this
stage.  If we figure out a clear picture on HEAD, perhaps it will be
possible to salvage some of it for the back branches, but I'm not sure
yet whether this will be possible.  That would be nice, but we'll see.
--
Michael

On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
> The wasShutdown case reaches consistency from the beginning, so I don't see
> that as an example of a time we benefit from reading pg_twophase before
> reaching consistency.  Can you elaborate on that?
>
> What's the benefit you're trying to get by reading pg_twophase before reaching
> consistency?

My point is mostly about code simplicity and consistency: use the
same logic (aka reading the contents of pg_twophase) at the same point
in the recovery process, and perform some filtering of the contents we
know we will not be able to trust.  So I mean to do that once at the
beginning of recovery, where we only compare the names of the 2PC
files with the XID boundaries in the checkpoint record:
- Discard any files with an XID newer than the next XID.  We know that
these would be in WAL anyway, if we replay from a timeline where they
matter.
- Discard any files that are older than the oldest XID.  We know that
these files don't matter as 2PC transactions hold the XID horizon.
- Keep the others for later evaluation.
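
As far as the names go, the scan would just do something like this for
each directory entry (sketch):

    /*
     * Sketch: map a 16-hex-digit file name to a FullTransactionId,
     * skipping anything that does not look like a 2PC state file.
     */
    char       *endp;
    uint64      fxid_u64 = strtoull(de->d_name, &endp, 16);

    if (endp != de->d_name + 16 || *endp != '\0')
        continue;
    fxid = FullTransactionIdFromU64(fxid_u64);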

So there's no actual need to check the contents of the files, though
that implies trusting the names of the 2PC files in pg_twophase/.

> I can think of one benefit of attempting to read pg_twophase before reaching
> consistency.  Suppose we can prove that a pg_twophase file will cause an error
> by end of recovery, regardless of what WAL contains.  It's nice to fail
> recovery immediately instead of failing recovery when we reach consistency.
> However, I doubt that benefit is important enough to depart from our usual
> principle and incur additional storage seeks in order to achieve that benefit.
> If recovery will certainly fail, you are going to have a bad day anyway.
> Accelerating recovery failure is a small benefit, particularly when we'd
> accelerate failure for only a small slice of recovery failure causes.

Well, I kind of disagree here.  Failing recovery faster can be
beneficial.  It perhaps has less merit since 7ff23c6d277d, meaning
that there should be less to replay after a failure during crash
recovery, but that still seems useful to me if we can do it.  It
depends on the amount of trust put in the data read, of course, and if
only WAL is trusted, there's not much that can be done at the
beginning of recovery.

>> I agree that moving towards a solution where we get rid entirely of
>> the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for;
>> actually, is there a reason not to just nuke them and replace them
>> with something based on the checkpoint record itself?
>
> I don't know what this means.

As for the contents of the last patch posted on this thread, I am
referring to the calls of TransactionIdDidCommit() and
TransactionIdDidAbort() moved from ProcessTwoPhaseBuffer(), which is
called pretty much always when WAL is replayed (consistency may or may
not be reached), to RecoverPreparedTransactions(), which runs just
before consistency is marked as such in the system.  When
RecoverPreparedTransactions() is called, we're ready to mark the
cluster as OK for writes, and WAL has already been fully replayed with
all sanity checks done.  CLOG accesses at this stage would not be an
issue.
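
In other words, the checks would run only once replay is done, like so
(rough sketch of what the patch moves into
RecoverPreparedTransactions()):

    /* Rough sketch: WAL replay is complete, so CLOG is trustworthy. */
    if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
    {
        /* Transaction already finished; its 2PC state is stale. */
        RemoveGXact(gxact);
        continue;
    }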

>> Wouldn't it be OK in this case to assume that the contents of this
>> file will be in WAL anyway?
>
> Sure.  Meanwhile, if a twophase file is going to be in later WAL, what's the
> value in opening the file before we get to that WAL?

True.  We could avoid loading some 2PC files if we know that these
could be in WAL when replaying.

There is still one point that I'm really unclear about.  Some 2PC
transactions are flushed at checkpoint time, so we will have to trust
the contents of pg_twophase/ at some point.  Do you mean to delay that
to happen always when consistency is reached or if we know that we're
starting from a clean state?  What I'd really prefer to avoid is
having two code paths in charge of reading the contents of
pg_twophase.

The point I am trying to make is that there has to be a certain level
of trust in the contents of pg_twophase, at some point during replay.
Or, we invent a new mechanism where all the twophase files go through
WAL and remove the need for pg_twophase/ when recovering.  For
example, we could have an extra record generated at each checkpoint
with the contents of the 2PC files still pending for commit
(potentially costly if the same transaction persists across multiple
checkpoints as this gets repeated), or something entirely different.
Something like that would put WAL as the sole source of trust by
design.
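
For the record format, I am thinking of something like this
(hypothetical, only to show the shape):

    /*
     * Hypothetical WAL record, emitted at each checkpoint, carrying
     * all the 2PC state still pending at that point.
     */
    typedef struct xl_twophase_pending
    {
        int         nxacts;     /* number of prepared transactions */
        /* Followed by, for each transaction: a length word, then the
         * same payload that would otherwise live in its pg_twophase
         * file. */
    } xl_twophase_pending;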

Or are you seeing things differently?  In which case, I am not sure
what's the line you think would be "correct" here (well, you did say
only WAL until consistency is reached), nor do I completely understand
how much 2PC transaction state we should have in shared memory until
consistency is reached.  Or perhaps you mean to somehow eliminate more
than that?  I'm unsure how much this would imply for the existing recovery
mechanisms (replication origin advancing at replay for prepared
transactions may be one area to look at, for example).
--
Michael

On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:
> On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
> > The wasShutdown case reaches consistency from the beginning, so I don't see
> > that as an example of a time we benefit from reading pg_twophase before
> > reaching consistency.  Can you elaborate on that?
> > 
> > What's the benefit you're trying to get by reading pg_twophase before reaching
> > consistency?
> 
> My point is mostly about code simplicity and consistency: use the
> same logic (aka reading the contents of pg_twophase) at the same point
> in the recovery process

That's a nice goal.

> and perform some filtering of the contents we
> know we will not be able to trust.  So I mean to do that once at the
> beginning of recovery, where we only compare the names of the 2PC
> files with the XID boundaries in the checkpoint record:

> - Discard any files with an XID newer than the next XID.  We know that
> these would be in WAL anyway, if we replay from a timeline where they
> matter.

v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
to emit a WARNING.  However, this can happen without any exceptional events in
the backup's history.  The message should be LOG or DEBUG if done this early.
This is not new in the patch, and it would be okay to not change it as part of
$SUBJECT.  I mention it because, if we scanned pg_twophase after reaching
consistency, an ERROR would be appropriate.  (A too-new pg_twophase file then
signifies the backup creator having copied files after the backup stop, a
violation of the backup protocol.)

> - Discard any files that are older than the oldest XID.  We know that
> these files don't matter as 2PC transactions hold the XID horizon.

In contrast, a WARNING is fine here.  A too-old file implies one of these
three, each of which counts as an exceptional event:

- an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
- unlink() failed in RemoveTwoPhaseFile()
- backup protocol violation

I agree start-of-recovery can correctly do your list of two discard actions;
they do not require a consistent state.  If the patch brings us to the point
that recovery does only those two things with pg_twophase before reaching
consistency, that sounds fine.  Doing them that early doesn't sound optimal to
me, but I've not edited that area as much as you have.  If it gets the
user-visible behaviors right and doesn't look fragile, I'll be fine with it.

> - Keep the others for later evaluation.
> 
> So there's no actual need to check the contents of the files, though
> that implies trusting the names of the 2PC files in pg_twophase/.

Trusting file names is fine.

> > I can think of one benefit of attempting to read pg_twophase before reaching
> > consistency.  Suppose we can prove that a pg_twophase file will cause an error
> > by end of recovery, regardless of what WAL contains.

Thinking about that more, few errors are detectable at that time.  Here too,
if your patch achieves some of this while getting the user-visible behaviors
right and not looking fragile, I'll be fine with it.

> >> Wouldn't it be OK in this case to assume that the contents of this
> >> file will be in WAL anyway?
> > 
> > Sure.  Meanwhile, if a twophase file is going to be in later WAL, what's the
> > value in opening the file before we get to that WAL?
> 
> True.  We could avoid loading some 2PC files if we know that these
> could be in WAL when replaying.

If the 2PC data is in future WAL, the file on disk may be a partially-written
file that fails the CRC check.  That's a key benefit of not reading the file.

> There is still one point that I'm really unclear about.  Some 2PC
> transactions are flushed at checkpoint time, so we will have to trust
> the contents of pg_twophase/ at some point.  Do you mean to delay that
> to happen always when consistency is reached

Yep.

> or if we know that we're
> starting from a clean state?

I don't know what "clean state" means.  If "clean state"=wasShutdown, that's
just a special case of "consistency is reached".

> What I'd really prefer to avoid is
> having two code paths in charge of reading the contents of
> pg_twophase.

That's a good goal.  I suspect you'll achieve that.

> The point I am trying to make is that there has to be a certain level
> of trust in the contents of pg_twophase, at some point during replay.
> Or, we invent a new mechanism where all the twophase files go through
> WAL and remove the need for pg_twophase/ when recovering.  For
> example, we could have an extra record generated at each checkpoint
> with the contents of the 2PC files still pending for commit
> (potentially costly if the same transaction persists across multiple
> checkpoints as this gets repeated), or something entirely different.
> Something like that would put WAL as the sole source of trust by
> design.

Adding such a record would be nonstandard and unnecessary.  Before reaching
consistency, the standard is that recovery reads and writes what WAL directs
it to read and write.  Upon reaching consistency, the server can trust the
entire data directory, including parts that WAL never referenced.

> Or are you seeing things differently?  In which case, I am not sure
> what's the line you think would be "correct" here (well, you did say
> only WAL until consistency is reached), nor do I completely understand
> how much 2PC transaction state we should have in shared memory until
> consistency is reached.

Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
beyond the point recovery has reached.  In other words, recovery should keep
anachronistic information out of shared memory.  For example, one could
imagine a straw man implementation in which, before reaching consistency, we
load each pg_twophase file that does pass its CRC check.  I'd dislike that,
because it would facilitate anachronisms involving the
GlobalTransactionData.gid field.  We could end up with two GXACT having the
same gid, one being the present (according to recovery progress) instance of
that gid and another being a future instance.  The system might not
malfunction today, but I'd consider that fragile.  Anachronistic entries might
cause recovery to need more shared memory than max_prepared_transactions has
allocated.
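
Stated as code, the principle is roughly this (replay_lsn being
wherever recovery has replayed up to; name illustrative):

    /* Keep anachronistic 2PC state out of shared memory: don't
     * surface a GXACT whose PREPARE lies beyond the current replay
     * position. */
    if (gxact->prepare_start_lsn > replay_lsn)
        continue;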

You might find some more-important goal preempts that no-anachronisms goal.



On Fri, Jun 06, 2025 at 03:34:13PM -0700, Noah Misch wrote:
> On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:
>> On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
>> and perform some filtering of the contents we
>> know we will not be able to trust.  So I mean to do that once at the
>> beginning of recovery, where we only compare the names of the 2PC
>> files with the XID boundaries in the checkpoint record:
>>
>> - Discard any files with an XID newer than the next XID.  We know that
>> these would be in WAL anyway, if we replay from a timeline where they
>> matter.
>
> v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
> to emit a WARNING.  However, this can happen without any exceptional events in
> the backup's history.  The message should be LOG or DEBUG if done this early.
> This is not new in the patch, and it would be okay to not change it as part of
> $SUBJECT.  I mention it because, if we scanned pg_twophase after reaching
> consistency, an ERROR would be appropriate.  (A too-new pg_twophase file then
> signifies the backup creator having copied files after the backup stop, a
> violation of the backup protocol.)

I need to ponder more about this point, I think.

>> - Discard any files that are older than the oldest XID.  We know that
>> these files don't matter as 2PC transactions hold the XID horizon.
>
> In contrast, a WARNING is fine here.  A too-old file implies one of these
> three, each of which counts as an exceptional event:
>
> - an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
> - unlink() failed in RemoveTwoPhaseFile()
> - backup protocol violation
>
> I agree start-of-recovery can correctly do your list of two discard actions;
> they do not require a consistent state.  If the patch brings us to the point
> that recovery does only those two things with pg_twophase before reaching
> consistency, that sounds fine.  Doing them that early doesn't sound optimal to
> me, but I've not edited that area as much as you have.  If it gets the
> user-visible behaviors right and doesn't look fragile, I'll be fine with it.

Noted.

>> So there's no actual need to check the contents of the files, though
>> that implies trusting the names of the 2PC files in pg_twophase/.
>
> Trusting file names is fine.

Noted, thanks.

>> or if we know that we're
>> starting from a clean state?
>
> I don't know what "clean state" means.  If "clean state"=wasShutdown, that's
> just a special case of "consistency is reached".

By "clean state", I meant wasShutdown = true, yes.  Sorry for the
confusion.

>> Or are you seeing things differently?  In which case, I am not sure
>> what's the line you think would be "correct" here (well, you did say
>> only WAL until consistency is reached), nor do I completely understand
>> how much 2PC transaction state we should have in shared memory until
>> consistency is reached.
>
> Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
> beyond the point recovery has reached.  In other words, recovery should keep
> anachronistic information out of shared memory.  For example, one could
> imagine a straw man implementation in which, before reaching consistency, we
> load each pg_twophase file that does pass its CRC check.  I'd dislike that,
> because it would facilitate anachronisms involving the
> GlobalTransactionData.gid field.  We could end up with two GXACT having the
> same gid, one being the present (according to recovery progress) instance of
> that gid and another being a future instance.  The system might not
> malfunction today, but I'd consider that fragile.  Anachronistic entries might
> cause recovery to need more shared memory than max_prepared_transactions has
> allocated.

So this comes down to attempting to translate the information from the
2PC files to shared memory at a point earlier than we safely can.
Thinking more about this point, there is something that I can see us
potentially doing here: how about tracking the file names in shared
memory when we go through them in restoreTwoPhaseData()?  That
addresses my point about doing only one scan of pg_twophase/ at the
beginning of recovery.  I have not studied in detail how doable it
is, but we should be able to delay any calls to
ProcessTwoPhaseBuffer() until we are sure that it is safe to do so;
that would be in charge of reading the files based on the names we've
picked up at the early stages.
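
Something like the following is what I have in mind, maybe (rough, and
the sizing question is part of what needs studying):

    /* Rough idea of what restoreTwoPhaseData() could track per file. */
    typedef struct TwoPhaseFileEntry
    {
        FullTransactionId fxid;     /* taken from the file name */
        bool        on_disk_only;   /* not yet loaded into a GXACT */
    } TwoPhaseFileEntry;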

RecoverPreparedTransactions() is an easy one: we only call it at the
end of recovery.  The sticky point to study is the call of
StandbyRecoverPreparedTransactions() when replaying a shutdown
checkpoint...  Ugh.

> You might find some more-important goal preempts that no-anachronisms goal.

Right, thanks.

By the way, what do you think we should do with 0001 at this stage
(your refactoring work with some changes I've sprinkled)?  I was
looking at it this morning with fresh eyes and my opinion about it is
that it improves the situation in all these 2PC callers on HEAD now
that we need to deal with fxids for the file names, so that's an
independent piece.  Perhaps this should be applied once v19 opens for
business while we sort out the rest that 0002 was trying to cover?
It's not something that we can backpatch safely due to the ABI change
in the 2PC callbacks, unfortunately; it's just a cleanup piece.

Note: this introduced two calls to FullTransactionIdFromAllowableAt()
in StandbyTransactionIdIsPrepared() and PrepareRedoAdd(), which gave
me pause on second look:
- The call in StandbyTransactionIdIsPrepared() should use
AdjustToFullTransactionId(): ReadTwoPhaseFile() calls
TwoPhaseFilePath(), which uses AdjustToFullTransactionId() on HEAD.
- The call in PrepareRedoAdd() slightly worried me, but I think that
we're OK with using FullTransactionIdFromAllowableAt(), as we are in
the case of a 2PC transaction recovered from WAL, with the XID coming
from the file's header.  I think that we'd better put an
Assert(InRecovery) in this path, just for safety.

What do you think?  0002 needs more thought, so I'm attaching a
slightly-reviewed version of 0001 for now.
--
Michael

On Fri, Jun 20, 2025 at 09:02:00AM +0900, Michael Paquier wrote:
> On Fri, Jun 06, 2025 at 03:34:13PM -0700, Noah Misch wrote:
> > Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
> > beyond the point recovery has reached.  In other words, recovery should keep
> > anachronistic information out of shared memory.  For example, one could
> > imagine a straw man implementation in which, before reaching consistency, we
> > load each pg_twophase file that does pass its CRC check.  I'd dislike that,
> > because it would facilitate anachronisms involving the
> > GlobalTransactionData.gid field.  We could end up with two GXACT having the
> > same gid, one being the present (according to recovery progress) instance of
> > that gid and another being a future instance.  The system might not
> > malfunction today, but I'd consider that fragile.  Anachronistic entries might
> > cause recovery to need more shared memory than max_prepared_transactions has
> > allocated.
> 
> So this comes down to attempting to translate the information from the
> 2PC files to shared memory at a point earlier than we safely can.
> Thinking more about this point, there is something that I can see us
> potentially doing here: how about tracking the file names in shared
> memory when we go through them in restoreTwoPhaseData()?  That
> addresses my point about doing only one scan of pg_twophase/ at the
> beginning of recovery.  I have not studied in detail how doable it
> is,

A challenge is that the number of files in pg_twophase can exceed
max_prepared_transactions if we've not yet reached consistency.  That happens
with a sequence like this:

- exactly max_prepared_transactions files are in pg_twophase
- backup process copies pg_twophase/$file1, then pauses before the next readdir
- some backend deletes pg_twophase/$file1
- checkpointer creates pg_twophase/$fileN
- backup process wakes up and copies remaining pg_twophase files

That's the only problem coming to mind.  If that problem doesn't cancel out
the benefits of scanning pg_twophase early, you may as well do it.

> By the way, what do you think we should do with 0001 at this stage
> (your refactoring work with some changes I've sprinkled)?  I was
> looking at it this morning with fresh eyes and my opinion about it is
> that it improves the situation in all these 2PC callers on HEAD now
> that we need to deal with fxids for the file names, so that's an
> independent piece.  Perhaps this should be applied once v19 opens for
> business while we sort out the rest that 0002 was trying to cover?

That looks harmless.

> Note: this introduced two calls to FullTransactionIdFromAllowableAt()
> in StandbyTransactionIdIsPrepared() and PrepareRedoAdd(), which gave
> me pause on second look:
> - The call in StandbyTransactionIdIsPrepared() should use
> AdjustToFullTransactionId(): ReadTwoPhaseFile() calls
> TwoPhaseFilePath(), which uses AdjustToFullTransactionId() on HEAD.
> - The call in PrepareRedoAdd() slightly worried me, but I think that
> we're OK with using FullTransactionIdFromAllowableAt(), as we are in
> the case of a 2PC transaction recovered from WAL, with the XID coming
> from the file's header.  I think that we'd better put an
> Assert(InRecovery) in this path, just for safety.
> 
> What do you think?

It's an appropriate level of risk-taking.



On Sun, Jul 06, 2025 at 11:07:11AM -0700, Noah Misch wrote:
> A challenge is that the number of files in pg_twophase can exceed
> max_prepared_transactions if we've not yet reached consistency.  That happens
> with a sequence like this:
>
> - exactly max_prepared_transactions files are in pg_twophase
> - backup process copies pg_twophase/$file1, then pauses before the next readdir
> - some backend deletes pg_twophase/$file1
> - checkpointer creates pg_twophase/$fileN
> - backup process wakes up and copies remaining pg_twophase files
>
> That's the only problem coming to mind.  If that problem doesn't cancel out
> the benefits of scanning pg_twophase early, you may as well do it.

Yeah...  That's where a linked list could become useful, and where
we'll need something different for PreparedXactProcs and the fake proc
entries.  Not sure about that yet.

>> By the way, what do you think we should do with 0001 at this stage
>> (your refactoring work with some changes I've sprinkled)?  I was
>> looking at it this morning with fresh eyes and my opinion about it is
>> that it improves the situation in all these 2PC callers on HEAD now
>> that we need to deal with fxids for the file names, so that's an
>> independent piece.  Perhaps this should be applied once v19 opens for
>> business while we sort out the rest that 0002 was trying to cover?
>
> That looks harmless.

Thanks.  I've applied the refactoring piece for now, and marked the CF
entry as returned with feedback.  I'm hoping to get back to that for
the September commit fest.
--
Michael
