On Wed, Mar 04, 2026 at 10:00:00AM +0200, Alexander Lakhin wrote:
> Yes, 012_subtransactions doesn't fail with aggressive bgwriter, as I noted
> before. I mentioned it exactly to show that stop does matter here. But if
> we recognize teardown_node in this context as risky, maybe it would make
> sense to review also other tests in recovery/. I already wrote about
> 004_timeline_switch, but probably there are more. E.g., 028_pitr_timelines
> (I haven't tested it intensively yet) does:
> $node_primary->stop('immediate');
>
> # Promote the standby, and switch WAL so that it archives a WAL segment
> # that contains all the INSERTs, on a new timeline.
> $node_standby->promote;
I think that your take about 004 is actually right, looking at it more
closely.  By tearing down the primary, it could be possible that
standby_2 receives more records than standby_1.  Then, when we try to
reconnect standby_2 to the promoted standby_1, the TLI could fork, in
theory.  The fix would be the same: by switching to stop(), we'd make
sure that both standby_1 and standby_2 have received all the records
from the primary.  We can also remove the wait_for_catchup() before
the primary is stopped, as it offers no protection against standby_2
receiving more records from the primary than standby_1.
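As a rough sketch of what the 004 change could look like (the node
variable names follow the existing test; the exact placement in
src/test/recovery/t/004_timeline_switch.pl is illustrative):

```perl
# A clean shutdown waits for the walsenders to stream all remaining
# WAL, so both standbys stop at the same LSN before the promotion.
# The previous teardown/immediate stop did not guarantee that.
$node_primary->stop;

# The wait_for_catchup() that used to precede the stop can go: it
# only checked one standby, so it never guaranteed that standby_2
# had received at least as much WAL as standby_1.
$node_standby_1->promote;
```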
It is not surprising that this failure with a three-node scenario is
much harder to reproduce.  I have run the same loop as for 009, but
things are super stable even after 50-ish iterations.  By reading the
code, I agree that the failure is possible to reach in theory, though.
Some hardcoded sleeps would do the trick (make the bgwriter
aggressive, patch the checkpointer so that we do not send the last
standby snapshot records to standby_2, only to standby_1, etc.).
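For reference, making the bgwriter aggressive in a TAP test could look
like the following (these are standard GUCs; the node variable and the
chosen values are only illustrative):

```perl
# Hypothetical reproduction aid: run the bgwriter at its minimum
# delay and with a high page budget, so extra WAL activity is
# generated between the last commit and the shutdown.
$node_primary->append_conf('postgresql.conf', qq(
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
));
$node_primary->restart;
```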
Did you find any buildfarm failures involving 028?  I cannot get
excited about changing tests where nothing has happened, and this test
looks OK as we don't do a switchover.  For 004, we have at least one
failure recorded based on what you said.  That fact is sufficient for
me to fix things for 004.
--
Michael