Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers

From vignesh C
Subject Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date
Msg-id CALDaNm3cBTVX7fd+3bfQ5QZmt5=nQ5zE_kDcFEpDaWwhoxOH+w@mail.gmail.com
In response to Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List pgsql-hackers
On Mon, 23 Jun 2025 at 04:36, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Fri, Jun 20, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
> > On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > Dear Kuroda-san,
> > >
> > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> > > <kuroda.hayato@fujitsu.com> wrote:
> > > > > > Regarding the assertion failure, I've found that the assert in
> > > > > > PhysicalConfirmReceivedLocation() conflicts with the restart_lsn
> > > > > > previously set by ReplicationSlotReserveWal().  As I can see,
> > > > > > ReplicationSlotReserveWal() just picks the fresh XLogCtl->RedoRecPtr LSN.
> > > > > > So, it doesn't seem there is a guarantee that restart_lsn never goes
> > > > > > backward.  The comment in ReplicationSlotReserveWal() even states there
> > > > > > is a "chance that we have to retry".
> > > > > >
> > > > >
> > > > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > > slot's position again if the required WAL is already removed. Such a
> > > > > retry can only get a position later than the previous restart_lsn.
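
To make that reserve-and-retry behaviour easier to follow, here is a minimal,
self-contained toy (not the actual ReplicationSlotReserveWal() code; the
positions and the removal point are made up) showing that a retry only ever
picks a newer candidate position:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_insert_pos = 0x3000000;   /* WAL keeps advancing */
static XLogRecPtr last_removed_end = 0;         /* end of removed WAL */

/* stand-in for picking a fresh candidate restart_lsn */
static XLogRecPtr
fresh_candidate(void)
{
    wal_insert_pos += 0x100000;
    return wal_insert_pos;
}

int
main(void)
{
    XLogRecPtr  restart_lsn;

    for (;;)
    {
        restart_lsn = fresh_candidate();        /* reserve a position */
        last_removed_end = 0x3100000;           /* concurrent checkpoint removed WAL */

        if (restart_lsn > last_removed_end)     /* required WAL still present? */
            break;                              /* yes: the reservation stands */
        /* no: retry, which can only yield a newer position */
    }

    printf("reserved restart_lsn %X/%X\n",
           (unsigned) (restart_lsn >> 32), (unsigned) restart_lsn);
    return 0;
}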
> > > >
> > > > We analyzed the assertion failure that happened in pg_basebackup/020_pg_receivewal,
> > > > and confirmed that restart_lsn can go backward. This means that the Assert() added
> > > > by ca307d5 can cause a crash.
> > > >
> > > > Background
> > > > ===========
> > > > When pg_receivewal starts replication using a replication slot, it sets
> > > > the startpoint to the beginning of the segment that contains restart_lsn.
> > > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL
> > > > from 0/B00000.
> > > > For more detail on this behavior, see f61e1dd2 and d9bae531.
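
To illustrate that rounding, here is a minimal, self-contained sketch (not the
actual pg_receivewal code; the 1 MB segment size is an assumption inferred from
the example numbers, the build default being 16 MB):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Round an LSN down to the start of the WAL segment that contains it. */
static XLogRecPtr
segment_start(XLogRecPtr lsn, uint64_t segsize)
{
    return lsn - (lsn % segsize);
}

int
main(void)
{
    uint64_t    segsize = 1024 * 1024;          /* assumed wal_segment_size = 1 MB */
    XLogRecPtr  restart_lsn = 0xB000D0;         /* 0/B000D0 */
    XLogRecPtr  startpoint = segment_start(restart_lsn, segsize);

    /* prints: restart_lsn 0/B000D0 -> startpoint 0/B00000 */
    printf("restart_lsn %X/%X -> startpoint %X/%X\n",
           (unsigned) (restart_lsn >> 32), (unsigned) restart_lsn,
           (unsigned) (startpoint >> 32), (unsigned) startpoint);
    return 0;
}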
> > > >
> > > > What happened here
> > > > ==================
> > > > Based on the above theory, walsender sent WAL from the beginning of the segment (0/B00000).
> > > > When walreceiver received it, it tried to send a reply. At that time the flushed
> > > > location of the WAL would be 0/B00000. walsender set the received LSN as restart_lsn
> > > > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0 -> 0/B00000).
> > > >
> > > > The assertion failure could happen if a CHECKPOINT happened at that time.
> > > > The slot's last_saved_restart_lsn attribute was 0/B000D0, but data.restart_lsn
> > > > was 0/B00000, which did not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().
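
As a self-contained illustration of the invariant involved (a toy, not
PostgreSQL source; the exact assert expression in InvalidatePossiblyObsoleteSlot()
is only paraphrased here), the sequence from the example looks like this:

#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

int
main(void)
{
    XLogRecPtr  restart_lsn = 0xB000D0;         /* in-memory restart_lsn of the slot */
    XLogRecPtr  last_saved_restart_lsn;

    /* checkpoint persists the slot, remembering what was written to disk */
    last_saved_restart_lsn = restart_lsn;

    /* standby replies with the segment start; restart_lsn moves backward */
    restart_lsn = 0xB00000;

    /* paraphrase of the invariant the assert expected to hold */
    assert(restart_lsn >= last_saved_restart_lsn);  /* fails: 0/B00000 < 0/B000D0 */
    return 0;
}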
> > >
> > > Thank you for your detailed explanation!
> > >
> > > > Note
> > > > ====
> > > > 1.
> > > > In this case, starting from the beginning of the segment is not a problem, because
> > > > the checkpoint process only removes WAL files from segments that precede the
> > > > restart_lsn's WAL segment. The current segment (0/B00000) will not be removed,
> > > > so there is no risk of data loss or inconsistency.
> > > >
> > > > 2.
> > > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > > > requested streaming position to the start of the segment, and both reply with the
> > > > received LSN as flushed.
> > > >
> > > > 3.
> > > > I considered the theory above, but I could not reproduce the 040_standby_failover_slots_sync
> > > > failure because it is a timing issue. Has anyone else reproduced it?
> > > >
> > > > We are still investigating the failure in 040_standby_failover_slots_sync.
> > >
> > > I didn't manage to reproduce this.  But as I see from the logs [1] on
> > > mamba, a START_REPLICATION command was issued just before the assert
> > > trap.  Could it be something similar to what I described in [2]?
> > > Namely:
> > > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > > 2. Concurrent checkpoint flushes that restart_lsn to the disk.
> > > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> > > the beginning of the segment.
> >
> > Here is my analysis of the 040_standby_failover_slots_sync test
> > failure, where the physical replication slot can point to a backward LSN:
> > in certain rare cases the restart LSN can go backwards. This scenario
> > can be reproduced by performing checkpoints continuously and slowing
> > down WAL apply. I have a patch with changes to handle this.
> > Here's a summary of the sequence of events:
> > 1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL contents:
> > After the standby writes the received WAL contents in XLogWalRcvWrite,
> > the standby sends this LSN (0/40369C8) as the confirmed flush location.
> > The primary acknowledges this and updates the replication slot's
> > restart_lsn accordingly:
> > 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> > PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> > restart_lsn to 0/40369C8
> > 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> > START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
> >
> > 2) Shortly after receiving the new LSN, a checkpoint occurs which
> > saves this restart_lsn:
> > 2025-06-20 14:33:21.780 IST [134913] LOG:  checkpoint complete: wrote
> > 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
> > removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
> > files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
> > kB; lsn=0/4036A20, redo lsn=0/40369C8
> >
> > 3) Streaming replication is restarted because of a primary_conninfo GUC
> > change and reload.
> > The WAL receiver process is restarted:
> > 2025-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
> > process due to administrator command
> >
> > 4) Standby sends an older flush pointer after restart:
> > Upon restart, the WAL receiver sends a flush location (0/401D448)
> > derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
> > than the previously confirmed restart_lsn. It is important to note
> > that we are sending the lastReplayedEndRecPtr, which is the last
> > successfully replayed LSN in this case:
> > 2025-06-20 14:33:21.796 IST [135135] LOG:  WalReceiverMain
> > LogstreamResult.Flush initialized to 0/401D448
> > 2025-06-20 14:33:21.796 IST [135135] LOG:  sending write 0/401D448
> > flush 0/401D448 apply 0/401D448
> >
> > This is done from here:
> > ....
> > /* Initialize LogstreamResult and buffers for processing messages */
> > LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
> >
> > initStringInfo(&reply_message);
> >
> > /* Initialize nap wakeup times. */
> > now = GetCurrentTimestamp();
> > for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> >     WalRcvComputeNextWakeup(i, now);
> >
> > /* Send initial reply/feedback messages. */
> > XLogWalRcvSendReply(true, false);
> > ...
> >
> > In the case of step 1, we are sending the LSN of the WAL that has been
> > written; since we have slowed down WAL apply, the replay location is
> > behind.
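
To make that mismatch concrete, here is a small self-contained toy (not
walreceiver code; send_reply() is a made-up stand-in) using the LSNs from the
logs above: while streaming, the reply carries the written/flushed position,
but a freshly restarted walreceiver seeds its flush pointer from the replay
position, which is behind when apply is slow:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* stand-in for XLogWalRcvSendReply(): report a flush position to the primary */
static void
send_reply(const char *when, XLogRecPtr flush)
{
    printf("%s: sending flush %X/%X\n",
           when, (unsigned) (flush >> 32), (unsigned) flush);
}

int
main(void)
{
    XLogRecPtr  last_written = 0x40369C8;   /* written and flushed by walreceiver */
    XLogRecPtr  last_replayed = 0x401D448;  /* startup process lags behind */

    /* steady state: the reply reflects what has been written/flushed */
    send_reply("before restart", last_written);     /* primary sets restart_lsn here */

    /* after restart: LogstreamResult.Flush is seeded from the replay position */
    send_reply("after restart", last_replayed);     /* older -> restart_lsn moves back */
    return 0;
}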
> >
> > 5) I have added logs to detect this inconsistency:
> > This leads to a scenario where the standby tries to confirm a
> > restart_lsn older than the one currently held by the primary:
> > 2025-06-20 14:33:21.797 IST [135136] standby1 LOG:  crash scenario -
> > slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
> > than the current one (0/40369C8)
> >
> > If a checkpoint happens concurrently during this condition, it may
> > trigger an assertion failure on the primary due to the restart_lsn
> > being less than the last_saved_restart_lsn.
> > Currently this does not break physical replication, but I'm not sure
> > how it will behave if the gap grows to many WAL files and those WAL
> > files get deleted.
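
For reference, the kind of check behind the "crash scenario" log message above
could look roughly like the fragment below (a sketch only, not the attached
patch), placed where PhysicalConfirmReceivedLocation() is about to move
restart_lsn:

    /* sketch: warn before moving the slot's restart_lsn backward */
    if (lsn < slot->data.restart_lsn)
        elog(LOG,
             "crash scenario - slot \"%s\", cannot confirm a restart LSN (%X/%X) that is older than the current one (%X/%X)",
             NameStr(slot->data.name),
             LSN_FORMAT_ARGS(lsn),
             LSN_FORMAT_ARGS(slot->data.restart_lsn));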
> >
> > I have attached the restart_lsn_backup_repro_v1 patch, with which you
> > can reproduce the 040_standby_failover_slots_sync failure; grep for
> > "crash scenario - slot sb1_slot" in the logs. For me it occurs on
> > every run. The reproduced logs,
> > 040_standby_failover_slots_sync_publisher and
> > 040_standby_failover_slots_sync_standby1.log, are attached.
> >
> > This proves that the restart_lsn can go backward in cases where the
> > standby is applying WAL slowly. But since this has nothing to do with
> > this thread, I felt you can commit the assert removal patch. I will
> > continue the analysis further and see whether there is any impact on
> > physical replication, and we can later add comments accordingly.
>
> I think this thread now has well-documented cases where restart_lsn
> goes backward.  I suppose the first thing we should do is to remove
> the assert to make the buildfarm stop failing on this.  I've prepared a
> patch with a revised commit message.  Then we can focus on documenting
> this correctly in the code comments.

+1 to proceed this way. One minor suggestion: we could also mention
pg_basebackup in the commit message, as I noticed this can happen
with pg_basebackup too.

Regards,
Vignesh


