Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id CAPpHfdteS1w_ieKO9mChBonQ5xspwtVv69uxht0N6JmG81NsLg@mail.gmail.com
In response to Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
Hi, Vignesh!

On Fri, Jun 20, 2025 at 3:42 PM vignesh C <vignesh21@gmail.com> wrote:
> On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > > > > Regarding the assertion failure, I've found that the assert in
> > > > > PhysicalConfirmReceivedLocation() conflicts with the restart_lsn
> > > > > previously set by ReplicationSlotReserveWal().  As far as I can see,
> > > > > ReplicationSlotReserveWal() just picks a fresh XLogCtl->RedoRecPtr LSN.
> > > > > So, it doesn't seem there is a guarantee that restart_lsn never goes
> > > > > backward.  The comment in ReplicationSlotReserveWal() even states there
> > > > > is a "chance that we have to retry".
> > > > >
> > > >
> > > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > slot's position again if the required WAL is already removed. Such a
> > > > retry can only get the position later than the previous restart_lsn.
> > >
> > > We analyzed the assertion failure that happened in pg_basebackup/020_pg_receivewal,
> > > and confirmed that restart_lsn can go backward. This means that the Assert() added
> > > by commit ca307d5 can cause a crash.
> > >
> > > Background
> > > ===========
> > > When pg_receivewal starts replication using a replication slot, it sets the
> > > startpoint to the beginning of the segment that contains restart_lsn.
> > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL
> > > from 0/B00000.
> > > For more detail on this behavior, see commits f61e1dd2 and d9bae531.
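
To make that rounding concrete for anyone following along, here is a
minimal standalone sketch (not the actual pg_receivewal code) of the
adjustment.  The 1MB wal_segment_size below is only an assumption chosen
so that the numbers match the 0/B000D0 -> 0/B00000 example; pg_receivewal
itself, as far as I remember, rounds with XLogSegmentOffset() using the
server's real segment size.

/* Standalone sketch: round an LSN down to the start of its WAL segment. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define WAL_SEGMENT_SIZE ((XLogRecPtr) (1024 * 1024))	/* assumed 1MB */

static XLogRecPtr
segment_start(XLogRecPtr lsn)
{
	/* Drop the offset within the segment, keeping the segment boundary. */
	return lsn - (lsn % WAL_SEGMENT_SIZE);
}

int
main(void)
{
	XLogRecPtr	restart_lsn = 0xB000D0;	/* 0/B000D0 */
	XLogRecPtr	startpoint = segment_start(restart_lsn);

	/* Prints 0/B00000: streaming is requested from the segment boundary. */
	printf("%X/%X\n", (uint32_t) (startpoint >> 32), (uint32_t) startpoint);
	return 0;
}
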
> > >
> > > What happened here
> > > ==================
> > > Based on the above, the walsender started sending from the beginning of the
> > > segment (0/B00000). When the walreceiver received data, it tried to send a
> > > reply; at that time the flushed WAL location was 0/B00000. The walsender then
> > > set the received LSN as restart_lsn in PhysicalConfirmReceivedLocation().
> > > Here the restart_lsn went backward (0/B000D0 -> 0/B00000).
> > >
> > > The assertion failure could happen if a CHECKPOINT happened at that time.
> > > The slot's last_saved_restart_lsn was 0/B000D0, but its data.restart_lsn
> > > was 0/B00000, which violated the assertion added in InvalidatePossiblyObsoleteSlot().
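
For clarity, the conflicting state can be pictured roughly as below.  This
is only a sketch: the field names mirror the description above, and the
assert is my paraphrase of the invariant rather than the literal assertion
added by ca307d5.

/* Sketch of the conflicting slot state; illustrative, not PostgreSQL code. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct SlotSketch
{
	XLogRecPtr	restart_lsn;			/* in-memory data.restart_lsn */
	XLogRecPtr	last_saved_restart_lsn;	/* value persisted by the last checkpoint */
} SlotSketch;

int
main(void)
{
	SlotSketch	slot;

	slot.last_saved_restart_lsn = 0xB000D0;	/* 0/B000D0, saved by CHECKPOINT */
	slot.restart_lsn = 0xB00000;			/* 0/B00000, set after the standby's reply */

	/*
	 * The expected invariant is roughly that restart_lsn never falls behind
	 * what has already been persisted; with the values above this check
	 * fails, matching the reported assertion failure.
	 */
	assert(slot.restart_lsn >= slot.last_saved_restart_lsn);
	return 0;
}
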
> >
> > Thank you for your detailed explanation!
> >
> > > Note
> > > ====
> > > 1.
> > > In this case, starting from the beginning of the segment is not a problem, because
> > > the checkpoint process only removes WAL files from segments that precede the
> > > restart_lsn's WAL segment. The current segment (0/B00000) will not be removed,
> > > so there is no risk of data loss or inconsistency.
> > >
> > > 2.
> > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > > requested streaming position to the start of the segment, and both report the
> > > received LSN as the flushed location.
> > >
> > > 3.
> > > I considered the theory above, but I could not reproduce the 040_standby_failover_slots_sync
> > > failure because it is a timing issue. Has anyone else reproduced it?
> > >
> > > We are still investigating the failure in 040_standby_failover_slots_sync.
> >
> > I didn't manage to reproduce this.  But as I can see from the logs [1] on
> > mamba, a START_REPLICATION command was issued just before the assert
> > trap.  Could it be something similar to what I described in [2]?
> > Namely:
> > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > 2. Concurrent checkpoint flushes that restart_lsn to the disk.
> > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> > the beginning of the segment.
>
> Here is my analysis of the 040_standby_failover_slots_sync test
> failure, wherein a physical replication slot can point to a backward LSN:
> In certain rare cases the restart LSN can go backwards. This scenario
> can be reproduced by performing checkpoints continuously and slowing
> down WAL apply. I have a patch with changes to handle this.
>
> Here's a summary of the sequence of events:
> 1) Standby confirms a new LSN (0/40369C8) when the primary sends some WAL contents:
> After the standby writes the received WAL contents in XLogWalRcvWrite, it
> sends this LSN 0/40369C8 as the confirmed flush location. The
> primary acknowledges this and updates the replication slot's
> restart_lsn accordingly:
> 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> restart_lsn to 0/40369C8
> 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
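
A side note for anyone reading the log excerpts: the reply sent here is the
standby status update message, which carries three separate positions, and
it is the flush position that PhysicalConfirmReceivedLocation() adopts as
restart_lsn.  Below is a rough sketch of the payload, paraphrased from the
streaming replication protocol documentation rather than taken from the
walreceiver code; StandbyReplySketch is just an illustrative name.

/* Paraphrased layout of the standby status update ('r') reply; illustrative only. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;

typedef struct StandbyReplySketch
{
	XLogRecPtr	write;			/* last WAL position written to disk */
	XLogRecPtr	flush;			/* last WAL position flushed to disk; the
								 * primary moves the slot's restart_lsn to
								 * this value */
	XLogRecPtr	apply;			/* last WAL position replayed */
	TimestampTz	sendTime;		/* when the reply was sent */
	bool		replyRequested;	/* ask the primary to respond immediately */
} StandbyReplySketch;
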
>
> 2) Shortly after receiving the new LSN, a checkpoint occurs which
> persists this restart_lsn:
> 2025-06-20 14:33:21.780 IST [134913] LOG:  checkpoint complete: wrote
> 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
> removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
> files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
> kB; lsn=0/4036A20, redo lsn=0/40369C8
>
> 3) Streaming replication is restarted because of a primary_conninfo
> change and reload.
> The WAL receiver process is restarted:
> 2025-06-20 14:33:21.779 IST [134997] FATAL:  terminating walreceiver
> process due to administrator command
>
> 4) Standby sends an older flush pointer after restart:
> Upon restart, the WAL receiver sends a flush location (0/401D448)
> derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
> than the previously confirmed restart_lsn. It is important to note
> that in this case we are sending lastReplayedEndRecPtr, which is the
> last successfully replayed LSN:
> 2025-06-20 14:33:21.796 IST [135135] LOG:  WalReceiverMain
> LogstreamResult.Flush initialized to 0/401D448
> 2025-06-20 14:33:21.796 IST [135135] LOG:  sending write 0/401D448
> flush 0/401D448 apply 0/401D448
>
> This is done from here:
> ....
> /* Initialize LogstreamResult and buffers for processing messages */
> LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
>
> initStringInfo(&reply_message);
>
> /* Initialize nap wakeup times. */
> now = GetCurrentTimestamp();
> for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
>     WalRcvComputeNextWakeup(i, now);
>
> /* Send initial reply/feedback messages. */
> XLogWalRcvSendReply(true, false);
> ...
>
> In step 1 we send the LSN of the WAL that has been written; since we
> have slowed down WAL apply, the replay location lags behind, and it is
> the replay location that is sent here.
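
To spell that ordering out with the numbers from the logs, here is a toy
model (not walreceiver code; the two constants are simply the values quoted
above):

/* Toy model of the flush position reported right after a walreceiver restart. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

int
main(void)
{
	XLogRecPtr	saved_restart_lsn = 0x40369C8;	/* confirmed and checkpointed (steps 1-2) */
	XLogRecPtr	last_replayed = 0x401D448;		/* lastReplayedEndRecPtr at restart (step 4) */

	/*
	 * During streaming the reply's flush field tracks what was written and
	 * flushed; right after restart LogstreamResult.Flush is re-initialized
	 * from the replay position, so the first reply reports an older LSN.
	 */
	XLogRecPtr	first_reply_flush = last_replayed;

	printf("saved restart_lsn 0/%X, first reply flush 0/%X (%s)\n",
		   (uint32_t) saved_restart_lsn,
		   (uint32_t) first_reply_flush,
		   first_reply_flush < saved_restart_lsn ? "goes backward" : "ok");
	return 0;
}
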
>
> 5) I have added logs to detect this inconsistency:
> This leads to a scenario where the standby tries to confirm a
> restart_lsn older than the one currently held by the primary:
> 2025-06-20 14:33:21.797 IST [135136] standby1 LOG:  crash scenario -
> slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
> than the current one (0/40369C8)
>
> If a checkpoint happens concurrently during this condition, it may
> trigger an assertion failure on the primary because the restart_lsn
> is less than the last_saved_restart_lsn.
> Currently this does not break physical replication, but I'm not sure
> how it will behave if the gap grows to many WAL files and those WAL
> files get deleted.
> I have attached the patch changes with which you can reproduce this. Grep
> for "crash scenario" in the logs; for me it occurs on every run. The
> reproduced logs are attached as well.
>
> This proves that the restart_lsn can go backward in cases where the
> standby is applying WAL slowly. But since this has nothing to do with
> this thread, I feel you can commit the assert removal patch. I will
> continue the analysis further to see whether there is any impact, and
> we can later add comments accordingly.

Thank you and the other thread participants for the analysis!
I'd also like to ask for help with [1].  It seems that the new test
triggers another bug, which I suppose was there before.

Links.
1. https://www.postgresql.org/message-id/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase


