Thread: [BUG] recovery of prepared transactions during promotion can fail

[BUG] recovery of prepared transactions during promotion can fail

From
Julian Markwort
Date:
Hey everyone,

I've discovered a serious bug that leads to a server crash upon promoting an instance that previously crashed and then
performed recovery in standby mode.

The bug is present in PostgreSQL versions 13 and 14 (and in earlier versions, though it doesn't manifest itself so
catastrophically).
The circumstances to trigger the bug are as follows:
- postgresql is configured for hot_standby, archiving, and prepared transactions
- prepare a transaction
- crash postgresql
- create standby.signal file
- start postgresql, wait for recovery to finish
- promote

The promotion will fail with a FATAL error, stating that "requested WAL segment .* has already been removed".
The FATAL error causes the startup process to exit, so postmaster shuts down again.

Here's example log output; maybe this will help people find this issue when they search for it online:

LOG:  consistent recovery state reached at 0/15D8AB0
LOG:  database system is ready to accept read only connections
LOG:  received promote request
LOG:  redo done at 0/15D89B8
LOG:  last completed transaction was at log time 2023-06-16 13:09:53.71118+02
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
FATAL:  requested WAL segment pg_wal/000000010000000000000001 has already been removed
LOG:  startup process (PID 1650358) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down


The cause of this failure is an oversight (rather obvious in hindsight):
The renaming of the WAL file (the one last written to before the crash happened) to .partial is done *before*
PostgreSQL might have to read this very file to recover prepared transactions from it.
The relevant function calls here are durable_rename() and RecoverPreparedTransactions() in xlog.c.

Note that it is important that the PREPARE record is in the WAL file that PostgreSQL was writing to prior to the
initial crash.
This has already happened repeatedly in production with a customer that uses prepared transactions quite frequently.
I assume that this has happened to others too, but the circumstances and cause of the crash are quite obscure, and
troubleshooting it is pretty difficult.


This behaviour has - apparently unintentionally - been fixed in PG 15 and upwards (see commit 811051c ), as part of a
general restructure and reorganization of this portion of xlog.c (see commit 6df1543 ).

Furthermore, it seems this behaviour does not appear in PG 12 and older, due to another possible bug:
In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead() before reading WAL in XlogReadTwoPhaseData() in
twophase.c.
In the older releases (PG <= 12), this reset is not done, so the requested LSN containing the prepared transaction can
(by happy coincidence) be read from in-memory buffers, and PostgreSQL consequently manages to come up just fine (as the
WAL has already been read into buffers prior to the .partial rename).
If the older releases were also to properly reset the XLogReaderState, they would likewise fail to find the LSN on
disk, and hence PostgreSQL would crash again.
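The "happy coincidence" can be modeled with an open file handle standing in for the warm WAL reader buffers (again, not PostgreSQL code, just a POSIX analogy with made-up names): a reader that still holds the segment open is unaffected by the rename, while one that resets its state and re-opens the segment by name, as PG 13+ effectively does, cannot find it anymore.

```python
# Sketch (not PostgreSQL code) of why PG <= 12 survives by accident.
# On POSIX, an already-open handle survives a rename of the file;
# this stands in for the in-memory buffers of the old WAL reader.
# (Windows may refuse to rename an open file, so this is POSIX-only.)
import os
import tempfile

SEG = "000000010000000000000001"
wal_dir = tempfile.mkdtemp()
seg = os.path.join(wal_dir, SEG)
with open(seg, "wb") as f:
    f.write(b"PREPARE")

reader = open(seg, "rb")          # "warm" reader, like PG <= 12's buffers
os.rename(seg, seg + ".partial")  # the .partial rename happens

data = reader.read()              # still works: the handle survives
reader.close()

try:
    open(seg, "rb")               # "reset" reader re-opens by name
    found_after_reset = True
except FileNotFoundError:         # the LSN can no longer be found on disk
    found_after_reset = False
print(data, found_after_reset)
```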

I've attached patches for PG 14 and PG 13 that mimic the change in PG 15 (commit 811051c) and reorder the crucial
events, placing the recovery of prepared transactions *before* renaming the file.
I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be used to verify that promotion after
recovery with prepared transactions works.

A note for my future self and whoever else may find it useful:
The test can be copied to src/test/recovery/t/ and run selectively (after you've ./configure'd for TAP testing and
compiled everything) from within the src/test/recovery directory using something like:
    make check PROVE_TESTS='t/PG_geq_12_promote_prepare_xact.pl'


My humble opinion is that this fix should be backpatched to PG 14 and PG 13.
It's debatable whether the fix needs to be brought back to PG 12 and older as well, as those do not exhibit this issue,
but the order of renaming is still wrong there.
I'm not sure whether there could be cases where the in-memory buffers of the WAL reader are too small to cover a whole
WAL file.
There could also be other issues from operations that require reading WAL after the .partial rename; I haven't checked
in depth what else happens in the affected code path.
Please let me know if you think this should also be fixed in PG 12 and earlier, so I can produce patches for those
versions as well.


Kind regards
Julian


Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Michael Paquier
Date:
On Fri, Jun 16, 2023 at 04:27:40PM +0200, Julian Markwort wrote:
> I've discovered a serious bug that leads to a server crash upon
> promoting an instance that crashed previously and did recovery in
> standby mode.

Reproduced here, for the versions mentioned.

> The bug is present in PostgreSQL versions 13 and 14 (and in earlier
> versions, though it doesn't manifest itself so catastrophically).
> The circumstances to trigger the bug are as follows:
> - postgresql is configured for hot_standby, archiving, and prepared transactions
> - prepare a transaction
> - crash postgresql
> - create standby.signal file
> - start postgresql, wait for recovery to finish
> - promote

hot_standby allows one to run queries on a standby running recovery,
so it seems to me that it does not really matter.  Enabling archiving
is the critical piece.  The nodes set up in the TAP test 009_twophase.pl
don't use any kind of archiving.  But once it is enabled on london, the
first promotion command of the test fails the same way as you report.
 # Setup london node
 my $node_london = get_new_node("london");
-$node_london->init(allows_streaming => 1);
+# Archiving is used to force tests with .partial segment creations
+# done at the end of recovery.
+$node_london->init(allows_streaming => 1, has_archiving => 1);

Enabling the archiving does not impact any of the tests, as we don't
use restore_command during recovery and only rely on streaming.

> The cause of this failure is an oversight (rather obvious in
> hindsight): The renaming of the WAL file (that was last written to
> before the crash happened) to .partial is done *before* PostgreSQL
> might have to read this very file to recover prepared transactions
> from it.  The relevant function calls here are durable_rename() and
> RecoverPreparedTransactions() in xlog.c.
>
> Note that it is important that the PREPARE entry is in the WAL file
> that PostgreSQL is writing to prior to the inital crash.
> This has happened repeatedly in production already with a customer
> that uses prepared transactions quite frequently.  I assume that
> this has happened for others too, but the circumstances of the crash
> and the cause are very dubious, and troubleshooting it is pretty
> difficult.

I guess that this is a possibility yes.  I have not heard directly
about such a report, but perhaps that's just because few people use
2PC.

> This behaviour has - apparently unintentionally - been fixed in PG
> 15 and upwards (see commit 811051c ), as part of a general
> restructure and reorganization of this portion of xlog.c (see commit
> 6df1543 ).
>
> Furthermore, it seems this behaviour does not appear in PG 12 and
> older, due to another possible bug: In PG 13 and newer, the
> XLogReaderState is reset in XLogBeginRead() before reading WAL in
> XlogReadTwoPhaseData() in twophase.c .
> In the older releases (PG <= 12), this reset is not done, so the
> requested LSN containing the prepared transaction can (by happy
> coincidence) be read from in-memory buffers, and PostgreSQL
> consequently manages to come up just fine (as the WAL has already
> been read into buffers prior to the .partial rename).  If the older
> releases also where to properly reset the XLogReaderState, they
> would also fail to find the LSN on disk, and hence PostgreSQL would
> crash again.

That's debatable, but I think that I would let v12 and v11 be as they
are.  v11 is going to be end-of-life soon and we did not have any
complaints on this matter as far as I know, so there is a risk of
breaking something upon its last release.  (Got some, Err..
experiences with that in the past).  On REL_11_STABLE, note for
example the slight difference with the handling of
recovery_end_command, where we rely on InRecovery rather than
ArchiveRecoveryRequested.  REL_12_STABLE is in a more consistent shape
than v11 regarding that.

> I've attached patches for PG 14 and PG 13 that mimic the change in
> PG15 (commit 811051c ) and reorder the crucial events, placing the
> recovery of prepared transactions *before* renaming the file.

Yes, I think that's OK.  I would like to add two things to your
proposal for all the existing branches.
- Addition of a comment where RecoverPreparedTransactions() is called
at the end of recovery to tell that we'd better do that before working
on the last partial segment of the old timeline.
- Enforce the use of archiving in 009_twophase.pl.

> My humble opinion is that this fix should be backpatched to PG 14
> and PG 13.  It's debatable whether the fix needs to be brought back
> to 12 and older also, as those do not exhibit this issue, but the
> order of renaming is still wrong.

Yeah, I'd rather wait for somebody to complain about that.  And v11 is
not worth taking risks with at this time of the year, IMHO.

With your fix included, the patch for REL_14_STABLE would be like the
attached.  Is that OK for you?
--
Michael

Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Kyotaro Horiguchi
Date:
Thanks for the report, reproducer and the patches.

At Fri, 16 Jun 2023 16:27:40 +0200, Julian Markwort <julian.markwort@cybertec.at> wrote in 
> - prepare a transaction
> - crash postgresql
> - create standby.signal file
> - start postgresql, wait for recovery to finish
> - promote
..
> The promotion will fail with a FATAL error, stating that "requested WAL segment .* has already been removed".
> The FATAL error causes the startup process to exit, so postmaster shuts down again.
> 
> Here's an exemplary log output, maybe this helps people to find this issue when they search for it online:

> LOG:  redo done at 0/15D89B8
> LOG:  last completed transaction was at log time 2023-06-16 13:09:53.71118+02
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> FATAL:  requested WAL segment pg_wal/000000010000000000000001 has already been removed
> LOG:  startup process (PID 1650358) exited with exit code 1

Reproduced here.

> The cause of this failure is an oversight (rather obvious in hindsight):
> The renaming of the WAL file (that was last written to before the crash happened) to .partial is done *before*
> PostgreSQL might have to read this very file to recover prepared transactions from it.
> The relevant function calls here are durable_rename() and RecoverPreparedTransactions() in xlog.c .
> This behaviour has - apparently unintentionally - been fixed in PG 15 and upwards (see commit 811051c ), as part of a
> general restructure and reorganization of this portion of xlog.c (see commit 6df1543 ).

I think so; the reordering might have been done for some other reason, though.

> Furthermore, it seems this behaviour does not appear in PG 12 and older, due to another possible bug:

<snip>...

> In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead()
> before reading WAL in XlogReadTwoPhaseData() in twophase.c .

I arrived at the same conclusion.

> In the older releases (PG <= 12), this reset is not done, so the requested LSN containing the prepared transaction can
> (by happy coincidence) be read from in-memory buffers, and PostgreSQL consequently manages to come up just fine (as the
> WAL has already been read into buffers prior to the .partial rename).
> If the older releases were also to properly reset the XLogReaderState, they would also fail to find the LSN on disk,
> and hence PostgreSQL would crash again.

From the perspective of loading WAL for prepared transactions, the
current code in those versions seems fine. Although I suspect Windows
may not like renaming currently-open segments, it's likely acceptable,
as the current test set operates without issue. (I didn't test this.)

> I've attached patches for PG 14 and PG 13 that mimic the change in PG15 (commit 811051c ) and reorder the crucial
> events, placing the recovery of prepared transactions *before* renaming the file.

It appears to move the correct part of the code to the proper
location, modifying the steps to align with later versions.

> I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be used to verify that promote after
> recovery with prepared transactions works.

It effectively detects the bug, though it can't be directly used in
the tree as-is. I'm unsure whether we need this in the tree, though.

> My humble opinion is that this fix should be backpatched to PG 14 and PG 13.

I agree with you.

> It's debatable whether the fix needs to be brought back to 12 and older also, as those do not exhibit this issue, but
> the order of renaming is still wrong.
> I'm not sure if there could be cases where the in-memory buffers of the walreader are too small to cover a whole WAL
> file.
> There could also be other issues from operations that require reading WAL that happen after the .partial rename, I
> haven't checked in depth what else happens in the affected codepath.
> Please let me know if you think this should also be fixed in PG 12 and earlier, so I can produce the patches for
> those versions as well.

There's no immediate need to change those versions. However, I would
prefer to backpatch to the older versions as well, for the following
reasons.

1. Applying this eases future backpatching in this area, if any.

2. I have reservations about renaming possibly-open WAL segments.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [BUG] recovery of prepared transactions during promotion can fail

From
Kyotaro Horiguchi
Date:
At Mon, 19 Jun 2023 14:24:44 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Jun 16, 2023 at 04:27:40PM +0200, Julian Markwort wrote:
> > Note that it is important that the PREPARE entry is in the WAL file
> > that PostgreSQL is writing to prior to the inital crash.
> > This has happened repeatedly in production already with a customer
> > that uses prepared transactions quite frequently.  I assume that
> > this has happened for others too, but the circumstances of the crash
> > and the cause are very dubious, and troubleshooting it is pretty
> > difficult.
> 
> I guess that this is a possibility yes.  I have not heard directly
> about such a report, but perhaps that's just because few people use
> 2PC.

+1

> > This behaviour has - apparently unintentionally - been fixed in PG
> > 15 and upwards (see commit 811051c ), as part of a general
> > restructure and reorganization of this portion of xlog.c (see commit
> > 6df1543 ).
> > 
> > Furthermore, it seems this behaviour does not appear in PG 12 and
> > older, due to another possible bug: In PG 13 and newer, the
> > XLogReaderState is reset in XLogBeginRead() before reading WAL in
> > XlogReadTwoPhaseData() in twophase.c .
> > In the older releases (PG <= 12), this reset is not done, so the
> > requested LSN containing the prepared transaction can (by happy
> > coincidence) be read from in-memory buffers, and PostgreSQL
> > consequently manages to come up just fine (as the WAL has already
> > been read into buffers prior to the .partial rename).  If the older
> > releases also where to properly reset the XLogReaderState, they
> > would also fail to find the LSN on disk, and hence PostgreSQL would
> > crash again.
> 
> That's debatable, but I think that I would let v12 and v11 be as they
> are.  v11 is going to be end-of-life soon and we did not have any
> complains on this matter as far as I know, so there is a risk of
> breaking something upon its last release.  (Got some, Err..
> experiences with that in the past).  On REL_11_STABLE, note for
> example the slight difference with the handling of
> recovery_end_command, where we rely on InRecovery rather than
> ArchiveRecoveryRequested.  REL_12_STABLE is in a more consistent shape
> than v11 regarding that.

Agreed about 11; it's no use patching it. About 12, I slightly prefer
applying this, but I'm fine without it since no actual problem has
been seen.


> > I've attached patches for PG 14 and PG 13 that mimic the change in
> > PG15 (commit 811051c ) and reorder the crucial events, placing the
> > recovery of prepared transactions *before* renaming the file. 
> 
> Yes, I think that's OK.  I would like to add two things to your
> proposal for all the existing branches.
> - Addition of a comment where RecoverPreparedTransactions() is called
> at the end of recovery to tell that we'd better do that before working
> on the last partial segment of the old timeline.
> - Enforce the use of archiving in 009_twophase.pl.

Both look good to me.

> > My humble opinion is that this fix should be backpatched to PG 14
> > and PG 13.  It's debatable whether the fix needs to be brought back
> > to 12 and older also, as those do not exhibit this issue, but the 
> > order of renaming is still wrong.
> 
> Yeah, I'd rather wait for somebody to complain about that.  And v11 is
> not worth taking risks with at this time of the year, IMHO.

I don't have any complaints on the whole.

> With your fix included, the patch for REL_14_STABLE would be like the
> attached.  Is that OK for you?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [BUG] recovery of prepared transactions during promotion can fail

From
Michael Paquier
Date:
On Mon, Jun 19, 2023 at 02:41:54PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Jun 2023 14:24:44 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Fri, Jun 16, 2023 at 04:27:40PM +0200, Julian Markwort wrote:
>>> I've attached patches for PG 14 and PG 13 that mimic the change in
>>> PG15 (commit 811051c ) and reorder the crucial events, placing the
>>> recovery of prepared transactions *before* renaming the file.
>>
>> Yes, I think that's OK.  I would like to add two things to your
>> proposal for all the existing branches.
>> - Addition of a comment where RecoverPreparedTransactions() is called
>> at the end of recovery to tell that we'd better do that before working
>> on the last partial segment of the old timeline.
>> - Enforce the use of archiving in 009_twophase.pl.
>
> Both look good to me.

Okay, cool.  Thanks for double-checking, so let's do something down to
13, then..
--
Michael

Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Michael Paquier
Date:
On Mon, Jun 19, 2023 at 04:27:27PM +0900, Michael Paquier wrote:
> Okay, cool.  Thanks for double-checking, so let's do something down to
> 13, then..

And done for v13 and v14.  I have split the test and comment changes
into their own commit, doing that for v13~HEAD.
--
Michael

Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Nathan Bossart
Date:
On Tue, Jun 20, 2023 at 10:49:03AM +0900, Michael Paquier wrote:
> And done for v13 and v14.  I have split the test and comment changes
> into their own commit, doing that for v13~HEAD.

I've started seeing sporadic timeouts for 009_twophase.pl in cfbot, and I'm
wondering if it's related to this change.

    https://api.cirrus-ci.com/v1/task/4978271838797824/logs/test_world.log
    https://api.cirrus-ci.com/v1/task/5477247717474304/logs/test_world.log
    https://api.cirrus-ci.com/v1/task/5931749746671616/logs/test_world.log
    https://api.cirrus-ci.com/v1/task/6353051175354368/logs/test_world.log
    https://api.cirrus-ci.com/v1/task/5687888986243072/logs/test_world.log

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [BUG] recovery of prepared transactions during promotion can fail

From
Michael Paquier
Date:
On Tue, Jun 20, 2023 at 09:33:45PM -0700, Nathan Bossart wrote:
> I've started seeing sporadic timeouts for 009_twophase.pl in cfbot, and I'm
> wondering if it's related to this change.
>
>     https://api.cirrus-ci.com/v1/task/4978271838797824/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5477247717474304/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5931749746671616/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/6353051175354368/logs/test_world.log
>     https://api.cirrus-ci.com/v1/task/5687888986243072/logs/test_world.log

Thanks for the poke, missed that.

The logs are enough to know what's happening here.  All the failing
runs stop after this step:
[02:29:33.169] # Now paris is primary and london is standby
[02:29:33.169] ok 13 - Restore prepared transactions from records with primary down

Here are some log files:

https://api.cirrus-ci.com/v1/artifact/task/5477247717474304/testrun/build/testrun/recovery/009_twophase/log/009_twophase_london.log

https://api.cirrus-ci.com/v1/artifact/task/5477247717474304/testrun/build/testrun/recovery/009_twophase/log/009_twophase_paris.log

Just after that, we start a previous primary as standby:
# restart old primary as new standby
$cur_standby->enable_streaming($cur_primary);
$cur_standby->start;

And the startup of the node gets stuck as the last partial segment is
now getting renamed, but the other node expects it to be available via
streaming.  From london, which is the new standby starting up:
2023-06-21 02:13:03.421 UTC [24652][walreceiver] LOG:  primary server contains no more WAL on requested timeline 3
2023-06-21 02:13:03.421 UTC [24652][walreceiver] FATAL:  terminating walreceiver process due to administrator command
2023-06-21 02:13:03.421 UTC [24647][startup] LOG:  new timeline 4 forked off current database system timeline 3 before current recovery point 0/60000A0

And paris complains about that:
2023-06-21 02:13:03.515 UTC [24661][walsender] [london][4/0:0] LOG:  received replication command: START_REPLICATION 0/6000000 TIMELINE 3
2023-06-21 02:13:03.515 UTC [24661][walsender] [london][4/0:0] STATEMENT:  START_REPLICATION 0/6000000 TIMELINE 3

But that won't work, as the segment requested is now a partial one in
the primary's pg_wal, yet the standby still wants it.  Just restoring
the segments won't help much, as we don't have anything for partial
segments in the TAP routines yet, so I think that it is better for now
to just undo has_archiving in 009_twophase.pl, and tackle the coverage
with a separate test, perhaps only for HEAD.
--
Michael

Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Julian Markwort
Date:
First off, thanks for the quick reaction and reviews, I appreciate it.

On Wed, 2023-06-21 at 14:14 +0900, Michael Paquier wrote:
> But that won't work, as the segment requested is now a partial
> one in the primary's pg_wal, yet the standby still wants it.

I think since 009_twophase.pl doesn't use archiving so far, it's not a good idea to enable it generally, for all those
tests. It changes too much of the behaviour.

> I think that it is better
> for now to just undo has_archiving in has_archiving, and tackle the
> coverage with a separate test, perhaps only for HEAD.

I see you've already undone it.
Attached is a patch for 009_twophase.pl that just tries this corner case at the very end, so as not to influence the
other existing tests in the suite.

When I run this on REL_14_8 I get the error again, sort of as a sanity check...

Kind regards
Julian

Attachment

Re: [BUG] recovery of prepared transactions during promotion can fail

From
Michael Paquier
Date:
On Wed, Jun 21, 2023 at 11:11:55AM +0200, Julian Markwort wrote:
> I see you've already undone it.
> Attached is a patch for 009_twophase.pl that just tries this corner case at the very end, so as not to influence the
> other existing tests in the suite.
>
> When I run this on REL_14_8 I get the error again, sort of as a sanity check...

+$cur_primary->enable_archiving;

enable_archiving is a routine aimed at being used internally by
Cluster.pm, so this does not sound like a good idea to me.

Relying on a single node to avoid the previous switchover problem is a
much better idea than what I have tried, but wouldn't it be better to
just move the new test to a separate script and parallelize more?  The
runtime of 009_twophase.pl is already quite long.

It is worth noting that I am not going to take any bets with the
buildfarm before 16beta2.  Doing that after REL_16_STABLE is created
will limit the risks.
--
Michael

Attachment