Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Masahiko Sawada |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | CAD21AoBVhYnGBuW_o=wEGgTp01qiHNAx1a14b1X9kFXmuBe=sg@mail.gmail.com Whole thread Raw |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
List | pgsql-bugs |
On Tue, Feb 7, 2023 at 12:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Feb-06, Masahiko Sawada wrote: > > > I've also confirmed that this issue is fixed by the attached patch, > > which clears candidate_restart_lsn and friends during > > ReplicationSlotRelease(). > > Hmm, interesting -- I was studying some other bug recently involving the > xmin of a slot that had been invalidated and I remember wondering if > these "candidate" fields were being properly ignored when the slot is > marked not in use; but I didn't check. Are you sure that resetting them > when the slot is released is the appropriate thing to do? I mean, > shouldn't they be kept set while the slot is in use, and only reset if > we destroy it? Thinking about the root cause more, it seems to me that the root cause is not the fact that candidate_xxx values are not cleared when being released. In the scenario I reproduced, after restarting the logical decoding, the walsender sets the restart_lsn to a candidate_restart_lsn left in the slot upon receiving the ack from the subscriber. Then, when decoding a RUNNING_XACTS record, the walsender updates the restart_lsn, which actually retreats it. The patch, which clears the candidate_xxx values at releasing, fixes this issue by not updating the restart_lsn upon receiving the ack in this case, but I think it's legitimate that the walsender sets the restart_lsn to the candidate_restart_lsn left in the slot in this case. The root cause here seems to me that it allows the restart_lsn to retreat. In LogicalIncreaseRestartDecodingForSlot(), we have the following codes: /* don't overwrite if have a newer restart lsn */ if (restart_lsn <= slot->data.restart_lsn) { } /* * We might have already flushed far enough to directly accept this lsn, * in this case there is no need to check for existing candidate LSNs */ else if (current_lsn <= slot->data.confirmed_flush) { slot->candidate_restart_valid = current_lsn; slot->candidate_restart_lsn = restart_lsn; /* our candidate can directly be used */ updated_lsn = true; } /* * Only increase if the previous values have been applied, otherwise we * might never end up updating if the receiver acks too slowly. A missed * value here will just cause some extra effort after reconnecting. */ if (slot->candidate_restart_valid == InvalidXLogRecPtr) { slot->candidate_restart_valid = current_lsn; slot->candidate_restart_lsn = restart_lsn; SpinLockRelease(&slot->mutex); } If slot->candidate_restart_valid is InvalidXLogRecPtr, he candidate_restart_lsn could be set to a LSN lower than the slot's restart_lsn in spite of the first comment saying "don't overwrite if have a newer restart lsn". So I think the right fix here is to not allow the restart_lsn to retreat by changing "if" to "else if" of the third block. > > (But, actually, in that case, maybe we don't need to reset them: instead > we need to make sure to ignore the values for slots that are not in_use. > However, I don't know what the semantics are for these "candidate" > fields, so maybe this is wrong.) IIUC in that case, the WAL segments are removed while the slot is in_use. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-bugs by date: