Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly - Mailing list pgsql-hackers
From: vignesh C
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date:
Msg-id: CALDaNm3cBTVX7fd+3bfQ5QZmt5=nQ5zE_kDcFEpDaWwhoxOH+w@mail.gmail.com
In response to: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List: pgsql-hackers
On Mon, 23 Jun 2025 at 04:36, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Fri, Jun 20, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
> > On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > Dear Kuroda-san,
> > >
> > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> > > <kuroda.hayato@fujitsu.com> wrote:
> > > > > > Regarding the assertion failure, I've found that the assert in
> > > > > > PhysicalConfirmReceivedLocation() conflicts with the restart_lsn
> > > > > > previously set by ReplicationSlotReserveWal(). As I can see,
> > > > > > ReplicationSlotReserveWal() just picks the fresh XLogCtl->RedoRecPtr lsn,
> > > > > > so it doesn't seem there is a guarantee that restart_lsn never goes
> > > > > > backward. The comment in ReplicationSlotReserveWal() even states there
> > > > > > is a "chance that we have to retry".
> > > > >
> > > > > I don't see how this theory can lead to the restart_lsn of a slot going
> > > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > > slot's position again if the required WAL has already been removed. Such a
> > > > > retry can only get a position later than the previous restart_lsn.
> > > >
> > > > We analyzed the assertion failure that happened at pg_basebackup/020_pg_receivewal
> > > > and confirmed that restart_lsn can go backward. This means that the Assert()
> > > > added by ca307d5 can cause a crash.
> > > >
> > > > Background
> > > > ===========
> > > > When pg_receivewal starts replication using a replication slot, it sets the
> > > > startpoint to the beginning of the segment that contains the restart_lsn.
> > > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL
> > > > from 0/B00000. For more detail on this behavior, see f61e1dd2 and d9bae531.
> > > >
> > > > What happened here
> > > > ==================
> > > > Based on the above, the walsender sent WAL from the beginning of the segment
> > > > (0/B00000). When the walreceiver received it, it tried to send a reply. At
> > > > that time the flushed location of WAL was 0/B00000. The walsender set the
> > > > received lsn as restart_lsn in PhysicalConfirmReceivedLocation(). Here the
> > > > restart_lsn went backward (0/B000D0 -> 0/B00000).
> > > >
> > > > The assertion failure could happen if a CHECKPOINT occurred at that time.
> > > > The slot's last_saved_restart_lsn was 0/B000D0, but data.restart_lsn was
> > > > 0/B00000, which does not satisfy the assertion added in
> > > > InvalidatePossiblyObsoleteSlot().
> > >
> > > Thank you for your detailed explanation!
> > >
> > > > Note
> > > > ====
> > > > 1.
> > > > In this case, starting from the beginning of the segment is not a problem,
> > > > because the checkpoint process only removes WAL files from segments that
> > > > precede the restart_lsn's WAL segment. The current segment (0/B00000) will
> > > > not be removed, so there is no risk of data loss or inconsistency.
> > > >
> > > > 2.
> > > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > > > requested streaming position to the start of the segment, and both reply
> > > > with the received LSN as flushed.
> > > >
> > > > 3.
> > > > I considered the theory above, but I could not reproduce the
> > > > 040_standby_failover_slots_sync failure because it is a timing issue.
> > > > Has someone else reproduced it?
> > > >
> > > > We are still investigating the failure at 040_standby_failover_slots_sync.
> > >
> > > I didn't manage to reproduce this. But as I can see from the logs [1] on
> > > mamba, the START_REPLICATION command was issued just before the assert
> > > trap. Could it be something similar to what I described in [2]?
> > > Namely:
> > > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > > 2. A concurrent checkpoint flushes that restart_lsn to disk.
> > > 3. PhysicalConfirmReceivedLocation() sets the restart_lsn of the slot to
> > > the beginning of the segment.
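To make the segment arithmetic above concrete, here is a minimal standalone sketch (not code from the thread; segment_start is a made-up helper, and a 1MB WAL segment size is assumed so that 0/B00000 is a segment boundary, matching the example LSNs) of how a restart_lsn of 0/B000D0 maps back to a streaming start point of 0/B00000:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Round an LSN down to the start of its WAL segment -- the same arithmetic
 * pg_receivewal effectively applies when it restarts streaming. */
static XLogRecPtr
segment_start(XLogRecPtr lsn, uint64_t wal_segment_size)
{
    return lsn - (lsn % wal_segment_size);
}

int
main(void)
{
    uint64_t    seg_size = 1024 * 1024;     /* assumed 1MB WAL segments */
    XLogRecPtr  restart_lsn = 0xB000D0;     /* 0/B000D0 from the example */
    XLogRecPtr  startpoint = segment_start(restart_lsn, seg_size);

    /* Prints "0/B00000": streaming restarts at the segment boundary, which is
     * older than the restart_lsn a checkpoint may already have saved. */
    printf("%" PRIX32 "/%" PRIX32 "\n",
           (uint32_t) (startpoint >> 32), (uint32_t) startpoint);
    return 0;
}
```

When the client then reports this segment-start position as its flush location, PhysicalConfirmReceivedLocation() stores a restart_lsn older than the one the checkpoint may already have written, which is exactly the state the assertion rejected.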
> >
> > Here is my analysis of the 040_standby_failover_slots_sync test
> > failure where the physical replication slot can point to a backward lsn:
> > In certain rare cases the restart LSN can go backwards. This scenario
> > can be reproduced by performing checkpoints continuously and slowing
> > down WAL apply. I have a patch with changes to handle this.
> > Here's a summary of the sequence of events:
> >
> > 1) The standby confirms a new LSN (0/40369C8) when the primary sends some WAL contents:
> > After the standby writes the received WAL contents in XLogWalRcvWrite,
> > it sends this lsn 0/40369C8 as the confirmed flush location.
> > The primary acknowledges this and updates the replication slot's
> > restart_lsn accordingly:
> > 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> > PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> > restart_lsn to 0/40369C8
> > 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> > START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
> >
> > 2) Shortly after receiving the new LSN, a checkpoint occurs, which
> > saves this restart_lsn:
> > 2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote
> > 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
> > removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
> > files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
> > kB; lsn=0/4036A20, redo lsn=0/40369C8
> >
> > 3) Streaming replication is restarted because of a primary_conninfo GUC
> > change and reload. The WAL receiver process is restarted:
> > 2025-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
> > process due to administrator command
> >
> > 4) The standby sends an older flush pointer after the restart:
> > Upon restart, the WAL receiver sends a flush location (0/401D448)
> > derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
> > than the previously confirmed restart_lsn. It is important to note
> > that here we are sending lastReplayedEndRecPtr, which is the last
> > successfully replayed lsn:
> > 2025-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain
> > LogstreamResult.Flush initialized to 0/401D448
> > 2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448
> > flush 0/401D448 apply 0/401D448
> >
> > This is done from here:
> > ....
> >     /* Initialize LogstreamResult and buffers for processing messages */
> >     LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
> >
> >     initStringInfo(&reply_message);
> >
> >     /* Initialize nap wakeup times. */
> >     now = GetCurrentTimestamp();
> >     for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> >         WalRcvComputeNextWakeup(i, now);
> >
> >     /* Send initial reply/feedback messages. */
> >     XLogWalRcvSendReply(true, false);
> > ...
> >
> > In step 1, by contrast, we send the lsn of the WAL that has been written;
> > since we have slowed down WAL apply, the replay location is lower.
> >
> > 5) I added logs to detect this inconsistency:
> > This leads to a scenario where the standby tries to confirm a
> > restart_lsn older than the one currently held by the primary:
> > 2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario -
> > slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
> > than the current one (0/40369C8)
> >
> > If a checkpoint happens concurrently during this condition, it may
> > trigger an assertion failure on the primary due to the restart_lsn
> > being less than the last_saved_restart_lsn.
> > Currently this does not break physical replication, but I'm not sure
> > how it will behave if the gap grows to many WAL files and those WAL
> > files get deleted.
> >
> > Attached is the restart_lsn_backup_repro_v1 patch, with which you
> > can reproduce the 040_standby_failover_slots_sync failure; grep for
> > "crash scenario - slot sb1_slot" in the logs. For me it occurs on
> > every run. The reproduced logs,
> > 040_standby_failover_slots_sync_publisher and
> > 040_standby_failover_slots_sync_standby1.log, are attached.
> >
> > This proves that the restart_lsn can go backward in cases where the
> > standby is applying WAL slowly. But this has nothing to do with this
> > thread, so I feel you can commit the assert removal patch. I will
> > continue the analysis and see whether there is any impact on
> > physical replication, and we can add comments accordingly later.
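The ordering in steps 1-4 can be condensed into a toy standalone program (not PostgreSQL code; FakeSlot, confirm_flush and checkpoint are made-up stand-ins for PhysicalConfirmReceivedLocation() and the checkpoint's slot save) that shows why the invariant checked by the removed Assert, data.restart_lsn >= last_saved_restart_lsn, does not always hold:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Toy stand-in for the two restart_lsn copies a physical slot carries. */
typedef struct
{
    XLogRecPtr  restart_lsn;            /* in-memory data.restart_lsn */
    XLogRecPtr  last_saved_restart_lsn; /* value written by the last checkpoint */
} FakeSlot;

/* Analogue of PhysicalConfirmReceivedLocation(): take whatever the standby reports. */
static void
confirm_flush(FakeSlot *slot, XLogRecPtr flushed)
{
    slot->restart_lsn = flushed;
}

/* Analogue of a checkpoint saving the slot to disk. */
static void
checkpoint(FakeSlot *slot)
{
    slot->last_saved_restart_lsn = slot->restart_lsn;
}

int
main(void)
{
    FakeSlot    slot = {0, 0};

    confirm_flush(&slot, 0x40369C8);    /* step 1: standby confirms written WAL  */
    checkpoint(&slot);                  /* step 2: checkpoint saves 0/40369C8    */
                                        /* step 3: walreceiver restarts          */
    confirm_flush(&slot, 0x401D448);    /* step 4: restarted walreceiver reports
                                         * lastReplayedEndRecPtr, which is older */

    /* The invariant the removed Assert expected; it fails for this ordering. */
    assert(slot.restart_lsn >= slot.last_saved_restart_lsn);

    printf("invariant held\n");
    return 0;
}
```

Run as written, the assert fires after the second confirm_flush(), mirroring the failure hit in InvalidatePossiblyObsoleteSlot() on the buildfarm.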
>
> I think this thread now has well-documented cases where restart_lsn
> goes backward. I suppose the first thing we should do is remove
> the assert so that the buildfarm stops failing on this. I've prepared a
> patch with a revised commit message. Then we can focus on documenting
> this correctly in the code comments.

+1 to proceed this way. One minor suggestion: we could also mention
pg_basebackup in the commit message, as I noticed this can happen with
pg_basebackup too.

Regards,
Vignesh