Re: Startup PANIC on standby promotion due to zero-filled WAL segment - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Startup PANIC on standby promotion due to zero-filled WAL segment
Date
Msg-id aUpU9ThApQND8zVy@paquier.xyz
In response to Startup PANIC on standby promotion due to zero-filled WAL segment  (Alena Vinter <dlaaren8@gmail.com>)
Responses Re: Startup PANIC on standby promotion due to zero-filled WAL segment
List pgsql-hackers
On Tue, Dec 23, 2025 at 02:02:15PM +0700, Alena Vinter wrote:
> If a standby is promoted before the WAL segment containing the last record
> of the previous timeline has been fully copied to the new timeline, startup
> may fail. We have observed this in production, where recovery aborts with
> "PANIC: invalid magic number 0000 in WAL segment ..."
>
> I’ve attached:
> * a patch and a TAP test that reproduce the issue;
> * a draft patch that, on timeline switch during recovery, copies the
> remainder of the current WAL segment from the old timeline — matching the
> behavior used after crash recovery at startup.
> All existing regression tests pass with the patch applied, but I plan to
> add more targeted test cases.
>
> I’d appreciate your feedback. In particular:
> * Is this behavior (not copying the segment during replication) intentional?
> * Are there edge cases I might be overlooking?
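
To restate the scenario described above, the sequence boils down to
something like the following TAP-style skeleton (a rough sketch with
hypothetical node names, assuming the PostgreSQL::Test::Cluster Perl
API; the attached test is more involved and differs in its details):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Primary that allows streaming replication.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->start;

# Standby created from a base backup of the primary.
$primary->backup('bkp');
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
$standby->start;

# Generate some WAL records on timeline 1.
$primary->safe_psql('postgres',
    'CREATE TABLE t (a int); INSERT INTO t SELECT generate_series(1, 1000);');

# Promote the standby while the segment holding the last records of
# timeline 1 may not be fully present under the new timeline yet.
$standby->promote;

# The reported symptom is a PANIC about an invalid magic number when
# the zero-filled part of that segment is read back.
my $log = slurp_file($standby->logfile);
ok($log !~ /invalid magic number/, 'no PANIC after promotion');

done_testing();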

The failure pattern is different on v18/master than on the older
branches.  v17 and older just wait for the standby node to start at the
end of your test.  Anyway, the problem is the same as far as I can see,
with the test generating the following once the patch is applied:
2025-12-23 17:08:37.494 JST startup[32689] LOG:  unexpected pageaddr
0/0305E000 in WAL segment 000000020000000000000003, LSN 0/03060000,
offset 393216
2025-12-23 17:08:37.494 JST startup[32689] FATAL:  according to
history file, WAL location 0/0305FFD0 belongs to timeline 1, but
previous recovered WAL file came from timeline 2

This would be right, because you are losing the records of the first
INSERT and TLI 1 diverges on the primary.  Now, the reason why you are
losing these records is the way the test is set up: fsync is off on the
primary, so you are creating what looks like a corruption scenario by
forcing a node to be promoted with some of its WAL records missing.  I
am unconvinced by the problem as you are showing it.  This primarily
shows that setting fsync=off is a bad way to force a divergence in
timelines, with the segment missing while the records should be there.

Perhaps it is a matter of proving your point in a cleaner way?  I am
open to your arguments, but I don't see a problem here based on the
test you are sending; I am just seeing something that should not be
done.
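
If the goal is only to create a divergence between the promoted standby
and the old primary, a cleaner shape, continuing the sketch above (and
still only a guess at what you are after), would be to let the standby
replay everything before promoting it, then keep writing on the old
primary:

# Make sure the standby has replayed everything from timeline 1.
$primary->wait_for_catchup($standby);
$standby->promote;
# Diverge timeline 1 by writing on the old primary, with fsync left alone.
$primary->safe_psql('postgres', 'INSERT INTO t VALUES (1);');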

I am not asking how you are able to see these failures in your
Postgres setups, but perhaps there is something in your HA flow that
you should not be doing, especially if it does the same things as this
test.  Just saying.
--
Michael
