Re: BUG: Former primary node might stuck when started as a standby - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: BUG: Former primary node might stuck when started as a standby
Msg-id aafDsb5snkfkNfdS@paquier.xyz
In response to RE: BUG: Former primary node might stuck when started as a standby  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
On Tue, Mar 03, 2026 at 09:17:16AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Thanks for the info. So I can provide the patch after the issue for 009_twophase.pl
> is fixed. For better understanding we may be able to fork new
> thread.

Regarding your posted v4, I am actually not convinced that there is a
need for injection points and disabling standby snapshots, for the
three sequences of tests proposed.

While the first wait_for_replay_catchup() can be useful before the
teardown_node() of the primary in the "Check that prepared
transactions can be committed on promoted standby" sequence, it still
has a limited impact.  It looks like we could have other parasite
records as well, depending on how slowly the primary is stopped?  I
think that we should switch to a plain stop() of the primary, as the
test only wants to check that prepared transactions can be committed
on a standby.  Stopping the primary abruptly does not matter for this
sequence.

For the second wait_for_replay_catchup(), after the PREPARE of
xact_009_11.  I may be missing something, but how does it change
things?  A plain stop() of the primary means that the standby would
have received all the WAL records from the primary on disk in its
pg_wal, no?  Upon restart, it should replay everything it finds in
pg_wal/.  I don't see a change required here.

For the third wait_for_replay_catchup(), after the PREPARE of
xact_009_12, same dance.  The primary is cleanly stopped first.  All
the WAL records of the primary should have been flushed to the
standby.

As a whole, it looks like we should just switch the teardown_node()
call to a stop() call in the first test with xact_009_10, backpatch
it, and call it a day.  No need for injection points and no need for
GUC tweaks.  I have not looked at 004_timeline_switch yet.
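
To illustrate the idea, here is a minimal standalone sketch of the
sequence with a clean stop of the primary.  This is not the actual
code of 009_twophase.pl: the node names, the table t_demo and the GID
xact_demo are made up for the example, and it assumes a source tree
configured with TAP tests enabled so that PostgreSQL::Test::Cluster is
available.

```perl
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Hypothetical primary, set up for streaming and prepared transactions.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
$primary->start;
$primary->backup('backup');

# Hypothetical streaming standby; it inherits the configuration of the
# primary through the base backup.
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'backup', has_streaming => 1);
$standby->start;

# Leave a prepared transaction behind on the primary.
$primary->safe_psql(
	'postgres', "
	CREATE TABLE t_demo (id int);
	BEGIN;
	INSERT INTO t_demo VALUES (1);
	PREPARE TRANSACTION 'xact_demo';");

# Clean stop (fast mode is the default of stop()) rather than
# teardown_node(), which stops the node in immediate mode.  With a
# clean stop the walsender ships all the remaining WAL before exiting.
$primary->stop;

# Promote the standby and finish the prepared transaction there.
$standby->promote;
my ($rc, $stdout, $stderr) =
  $standby->psql('postgres', "COMMIT PREPARED 'xact_demo'");
is($rc, 0, 'prepared transaction committed on promoted standby');

done_testing();
```

With stop() there is no need for an explicit wait_for_replay_catchup()
before the promotion in this sketch, as the standby has already
received everything the primary generated before shutting down.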

> I guess so. cluster::stop does the `pg_ctl stop -m fast` command. In this case
> the walsender waits till there are nothing to be sent, see WalSndLoop().
> Do let me know if you have observed the similar failure here.

Exactly.  Doing a clean stop of the primary offers a strong guarantee
here.  We are sure that the standby will have received all the records
from the primary.  Timeline forking cannot happen in
012_subtransactions.pl based on how the switchover from the primary to
the standby happens.  I don't see a need for tweaking this test at
all.  Or perhaps you did see a failure of some kind in this test,
Alexander?
--
Michael
