Re: Startup PANIC on standby promotion due to zero-filled WAL segment - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Startup PANIC on standby promotion due to zero-filled WAL segment
Msg-id aUplOdaM4kGYd4t3@paquier.xyz
In response to Re: Startup PANIC on standby promotion due to zero-filled WAL segment  (Alena Vinter <dlaaren8@gmail.com>)
Responses Re: Startup PANIC on standby promotion due to zero-filled WAL segment
List pgsql-hackers
On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:
> Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
> replicated to the standby before the timeline switch. The test then
> intentionally slows down replication on TLI 2 (e.g., by delaying WAL
> shipping), reproducing the scenario I illustrated. As far as I’m aware,
> `fsync` is `on` by default, and the test does not modify it — so no WAL
> records are lost due to unsafe flushing.

I don't think so. Based on what is in the tree, the TAP test
framework writes fsync = off into the configuration of each node:
$ git grep "fsync = " -- *.pm
src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";

> The core issue is that the new timeline’s segment is zero-initialized
> instead of copying the same segment from the previous timeline (as done in
> crash-recovery startup).  As a result, startup cannot finish recovery due
> to non-replicated end of WAL causing failures like “invalid magic number”.

The following addition to your proposed test tells an entirely
different story. With it, the test passes, as the records of TLI 1
are around:
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);
+#$node_primary->append_conf('postgresql.conf', 'fsync=on');
 $node_primary->start;
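
The override works only if the appended line is active (not commented out), because append_conf adds to the end of postgresql.conf and PostgreSQL uses the last occurrence of a parameter in the file. The following is a minimal Python sketch of that rule, not code from the tree: effective_setting is a hypothetical helper, and its parsing is deliberately naive (it ignores quoting and include directives).

```python
def effective_setting(conf_text, name):
    """Return the value `name` ends up with in postgresql.conf-style text.

    Hypothetical helper for illustration only. It models one rule:
    when a parameter appears more than once, the last occurrence wins.
    """
    value = None
    for raw in conf_text.splitlines():
        # Drop comments (naive: does not handle '#' inside quoted values).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # The equals sign between name and value is optional.
        key, sep, rest = line.partition("=")
        if not sep:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            key, rest = parts
        # Parameter names are case-insensitive.
        if key.strip().lower() == name.lower():
            value = rest.strip().strip("'")
    return value


# What Cluster.pm writes, per the git grep output above.
base_conf = "fsync = off\n"
# What an active append_conf('postgresql.conf', 'fsync=on') would add.
overridden_conf = base_conf + "fsync=on\n"
# A commented-out line in the configuration file changes nothing.
commented_conf = base_conf + "#fsync=on\n"
```

In this sketch, effective_setting(base_conf, "fsync") gives "off" and effective_setting(overridden_conf, "fsync") gives "on". On a running node, SHOW fsync reports the value actually in effect.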
--
Michael
