Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Kyotaro Horiguchi |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | 20230213.154549.151232349843665846.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Andres Freund <andres@anarazel.de>) |
Responses |
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
|
List | pgsql-bugs |
This seems to have a thin connection to the issue, but. At Wed, 16 Nov 2022 23:25:12 -0800, Andres Freund <andres@anarazel.de> wrote in > I'm a bit confused by the changes the changes c6550776394e made to > KeepLogSeg() etc. > > Before: > /* then check whether slots limit removal further */ > if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) > { > XLogSegNo slotSegNo; > > XLByteToSeg(keep, slotSegNo, wal_segment_size); > > if (slotSegNo <= 0) > segno = 1; > else if (slotSegNo < segno) > segno = slotSegNo; > } > > Current: > keep = XLogGetReplicationSlotMinimumLSN(); > if (keep != InvalidXLogRecPtr) > { > XLByteToSeg(keep, segno, wal_segment_size); > > /* Cap by max_slot_wal_keep_size ... */ > if (max_slot_wal_keep_size_mb >= 0) > { > uint64 slot_keep_segs; > > slot_keep_segs = > ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); > > if (currSegNo - segno > slot_keep_segs) > segno = currSegNo - slot_keep_segs; > } > } > > Which looks to me to have lost the protection against segno being 0. Which > afaict would break the callers, because they do _logSegNo--. Which would then > allow all WAL to be reused, no? I guess the argument for changing this was > that we should never have such a dangerous value returned by > XLogGetReplicationSlotMinimumLSN()? Mmm. Right. If XLogGetReplicationSlotMinimumLSN() returned an LSN within the zeroth segment, segno ends up with being 0, which is dangerous if happens. However, if it happens, slots are evidently already broken, but there's no cost to have that protection. > It seems decidedly not great to not log at least a debug1 (but probably it > should be LOG) message when KeepLogSeg() decides to limit based on > max_slot_wal_keep_size. It's easy to do that, but that log is highly accompanied by a LOG line "terminating process %d to release replication slot \"%s\"". I don't mind adding it if it is a DEBUGx. > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside, > only actually remove WAL if older than the segment that RedoRecPtr (the > logical start of the checkpoint) is in. If the checkpoint is large we'll end > up removing replication slots even though they potentially would only have > retained one additional WAL segment. I think that it is a controversial part, but that variable is defined the similar way to wal_keep_size. And I think that all max_wal_size, wal_keep_size and max_slot_wal_keep_size being defined with the same base LSN makes things simpler for users (also for developers). Regardless of checkpoint length, if slots get frequently invalidated, the setting should be considered to be too small for the system requirements. > Isn't it problematic to use ConvertToXSegs() to implement > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large > wal_segment_size that'd afaict lead to being much more aggressive invalidating > slots. I think max_slot_wal_keep_size is, like max_wal_size for checkpoints, a safeguard for slots not to fill-up WAL directory. Thus they both are rounded down. If you have 1GB WAL directory and set wal_segment_size to 4192MB, I don't see it a sane setup. But if segment size is smaller than one hundredth of max_wal_size, that difference won't matter for anyone. But anyway, it's a pain in the a.. that the both of them (and wal_keep_size) don't work in a immediate manner, though.. > Also, why do we do something as expensive as > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no > effect? Indeed. Maybe we didn't regard that process as complex at start? I think we can compare the cutoff segno against XLogGetReplicationSlotMinimumLSN() before entering the loop over slots. Thus I think there's room for the following improvements. - Prevent KeepLogSeg from returning 0. - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects. - Check against minimum slot LSN before actually examining through the slots in Invalidateobsoletereplicationslots. I'm not sure about the second item but the others seem back-patchable. If we need to continue further discussion, will need another thread. Anyway I'll come up with the patch for the above three items. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-bugs by date: