RE: Newly created replication slot may be invalidated by checkpoint - Mailing list pgsql-hackers
From | Vitaly Davydov |
---|---|
Subject | RE: Newly created replication slot may be invalidated by checkpoint |
Date | |
Msg-id | 36d1f4-68e3e480-1f-4cb02b00@259576124 |
In response to | RE: Newly created replication slot may be invalidated by checkpoint ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-hackers |
Dear Hayato, All

On Friday, October 03, 2025 14:14 MSK, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

>> I'm working on the issue. Give me, please, a couple of days to finalize my work.
> Oh, sorry. I was rude.

It is okay. I really appreciate your help.

> Per my understanding, this happened because there is a lag that restart_lsn of
> the slot is set, and it is protected by the system. Your idea is to ensure the
> restart_lsn is protected by the system before obtaining on-memory LSN, right?

I am not sure what you mean by on-memory LSN, but the issue happens because there is a lag between the assignment of restart_lsn and the update of XLogCtl->replicationSlotMinLSN, which is used to protect the WAL. Yes, I propose to ensure that the protection takes effect when we assign restart_lsn. It seems wrong that we invalidate slots by their restart_lsn but protect the WAL for slots using XLogCtl->replicationSlotMinLSN.

Below is a summary and a proposed patch that fixes the problem.

The issue was originally reported at [1] and it seems to appear in 17 and earlier versions. The issue is not reproducible in 18+ versions. The issue may appear when we create a persistent slot during a checkpoint. The WAL reservation for slots happens in ReplicationSlotReserveWal and is executed in three steps:

1. Assignment of slot->data.restart_lsn
2. Update of XLogCtl->replicationSlotMinLSN
3. Check whether the WAL segments at restart_lsn have been removed; go to step 1 if they have.

When the checkpointer calculates the oldest LSN, which is used as the LSN horizon when removing old WAL segments, it takes XLogCtl->replicationSlotMinLSN. A race condition may occur when the slot's restart_lsn is already assigned but XLogCtl->replicationSlotMinLSN is not updated yet. Consider the following scenario with two processes executing in parallel (the checkpointer and a backend where a new slot is being created):

1. Assignment of slot->data.restart_lsn in the backend from GetRedoRecPtr()
2. Assignment of a new redo LSN in the checkpointer
3. Assignment of slotsMinReqLSN from XLogCtl->replicationSlotMinLSN in the checkpointer
4. Update of XLogCtl->replicationSlotMinLSN in the backend
5. Calculation of the WAL horizon for the cleanup of old segments (KeepLogSeg before the call of InvalidateObsoleteReplicationSlots) in the checkpointer
6. Exit from ReplicationSlotReserveWal in the backend, since the reserved WAL segments have not been removed at this moment (XLogGetLastRemovedSegno() < segno)
7. The call of InvalidateObsoleteReplicationSlots in the checkpointer invalidates the slot being created, because its restart_lsn is less than the calculated WAL horizon (the min of slotsMinReqLSN and RedoRecPtr)

To fix the issue I propose to rely on the following assumptions:

1. Slots do not cross WAL segment borders backward when moving.
2. Old WAL segments are removed in the checkpointer only.
3. The following LSNs are initially assigned during slot reservation:
   - GetRedoRecPtr() for physical slots
   - GetXLogInsertRecPtr() for logical slots
   - GetXLogReplayRecPtr() for logical slots in recovery

Taking these assumptions into account, I would like to propose the fix [2]. The idea is to consider the WAL reserved at the moment restart_lsn is assigned to the slot. ReplicationSlotsComputeRequiredLSN() does not have to be called immediately in the backend where the slot is being created concurrently. In the checkpointer we have to guarantee that the WAL horizon calculations are based on up-to-date values of restart_lsn of the existing slots. If we call ReplicationSlotsComputeRequiredLSN() in the checkpointer after the new redo LSN is assigned and before the WAL horizon is calculated, the value of XLogCtl->replicationSlotMinLSN will correctly define the oldest LSN for existing slots.
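
The seven-step interleaving and the effect of the proposed reordering can be replayed with a small deterministic model. This is only a sketch in Python, not PostgreSQL code and not the patch in [2]: the names ToyCluster, wal_horizon and run_scenario are hypothetical, LSNs are plain integers (100 for the old redo LSN, 300 for the new one), and WAL segment boundaries are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict

INVALID_LSN = 0  # stands in for "no slot holds WAL back" (assumption of this model)


@dataclass
class ToyCluster:
    """Toy shared state; the fields only mirror the names used in the email."""
    redo_lsn: int
    slot_min_lsn: int = INVALID_LSN  # models XLogCtl->replicationSlotMinLSN
    restart_lsn: Dict[str, int] = field(default_factory=dict)

    def compute_required_lsn(self) -> None:
        """Models ReplicationSlotsComputeRequiredLSN(): publish the oldest restart_lsn."""
        self.slot_min_lsn = min(self.restart_lsn.values(), default=INVALID_LSN)


def wal_horizon(slots_min_req: int, redo_lsn: int) -> int:
    """Models the KeepLogSeg idea: keep WAL from the older of the two LSNs."""
    if slots_min_req == INVALID_LSN:
        return redo_lsn
    return min(slots_min_req, redo_lsn)


def run_scenario(recompute_in_checkpointer: bool) -> bool:
    """Replay the interleaving; return True if the new slot gets invalidated."""
    c = ToyCluster(redo_lsn=100)

    # 1. backend: restart_lsn is taken from the current redo pointer
    c.restart_lsn["new_slot"] = c.redo_lsn
    # 2. checkpointer: a new redo LSN is assigned
    c.redo_lsn = 300
    # proposed change: recompute the slots' minimum after the redo assignment
    if recompute_in_checkpointer:
        c.compute_required_lsn()
    # 3. checkpointer: read the published minimum
    slots_min_req = c.slot_min_lsn
    # 4. backend: publish the minimum (too late for the unfixed checkpointer)
    c.compute_required_lsn()
    # 5. checkpointer: compute the WAL horizon
    horizon = wal_horizon(slots_min_req, c.redo_lsn)
    # 6. backend: nothing has been removed yet, so the reservation loop exits
    # 7. checkpointer: a slot whose restart_lsn is behind the horizon is invalidated
    return c.restart_lsn["new_slot"] < horizon
```

In this model, run_scenario(False) returns True (the horizon is 300 and the slot at 100 is invalidated), while run_scenario(True) returns False (the recomputed minimum pulls the horizon back to 100).
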
If the WAL reservation by a new slot happens during a checkpoint before the new redo LSN is assigned, its restart_lsn is guaranteed to be taken into account when we call ReplicationSlotsComputeRequiredLSN() in the checkpointer. If the WAL reservation happens after the new redo LSN is assigned, the slot's restart_lsn will be protected by this new redo LSN, because this LSN will be less than or equal to the initial restart_lsn (see assumption 3).

There is one subtle thing. Since the assignment of restart_lsn is not atomic, the following scenario is theoretically possible:

1. Read GetRedoRecPtr() in the backend (ReplicationSlotReserveWal)
2. Assign a new redo LSN in the checkpointer
3. Call ReplicationSlotsComputeRequiredLSN() in the checkpointer
4. Assign the old redo LSN to restart_lsn

In this scenario, restart_lsn will point to the previous redo LSN and will not be protected by the new redo LSN. This scenario is unlikely, but it can happen theoretically. I have no idea how to deal with it, except by assigning restart_lsn under the XLogCtl->info_lck lock to avoid concurrent modification of XLogCtl->RedoRecPtr until it is assigned to the restart_lsn of the slot being created.

In case of recovery, when GetXLogReplayRecPtr() is used, the protection by the redo LSN seems to work as well, because the new redo LSN is taken from the latest replayed checkpoint. Thus, it is guaranteed that GetXLogReplayRecPtr() will not be less than the new redo LSN if it is called right after the assignment of the redo LSN in CreateRestartPoint().

I also think that the loop in ReplicationSlotReserveWal, which checks that the current restart_lsn is greater than XLogGetLastRemovedSegno(), is not necessary, because it is guaranteed that the assigned restart_lsn will be protected. Let's keep it unchanged until this suggestion is clarified.

The proposed solution doesn't break the fix in ca307d5cec (unexpected removal of old WAL segments after checkpoint).
Since we call ReplicationSlotsComputeRequiredLSN() before CheckPointReplicationSlots(), the restart_lsn values of existing slots that are saved to disk will not be less than the previously computed XLogCtl->replicationSlotMinLSN. They may only be advanced to greater values concurrently. For new slots whose restart_lsn is assigned after ReplicationSlotsComputeRequiredLSN(), the current redo LSN will protect the WAL.

The fix for REL_17_STABLE is in [2]. The regression test is in [3].

I apologize for such a long summary.

[1] https://www.postgresql.org/message-id/flat/15922-68ca9280-4f-37de2c40%40245457797#4b7aa7fe7c57b02105a56ecff06f0b67
[2] v3-0002-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[3] v3-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly