Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Andres Freund |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | 20230213194131.hgzs6ropcvhda5w3@awork3.anarazel.de Whole thread Raw |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
List | pgsql-bugs |
Hi, On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote: > This seems to have a thin connection to the issue, but. I was worried that the changes could lead us to removing WAL without max_slot_wal_keep_size set. > > It seems decidedly not great to not log at least a debug1 (but probably it > > should be LOG) message when KeepLogSeg() decides to limit based on > > max_slot_wal_keep_size. > > It's easy to do that, but that log is highly accompanied by a LOG line > "terminating process %d to release replication slot \"%s\"". I don't > mind adding it if it is a DEBUGx. My problem with that is that we might *NOT* see those log messages for some reason, but that that's impossible to debug as-is. And even if we see them, it's not that easy to figure out by how much we were over max_slot_wal_keep_size, because we always report it in the context of a specific slot. Removing WAL that's still needed is a *very* severe operation. Emitting an additional line in case it happens isn't a problem. > > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end > > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside, > > only actually remove WAL if older than the segment that RedoRecPtr (the > > logical start of the checkpoint) is in. If the checkpoint is large we'll end > > up removing replication slots even though they potentially would only have > > retained one additional WAL segment. > > I think that it is a controversial part, but that variable is defined > the similar way to wal_keep_size. And I think that all max_wal_size, > wal_keep_size and max_slot_wal_keep_size being defined with the same > base LSN makes things simpler for users (also for developers). > Regardless of checkpoint length, if slots get frequently invalidated, > the setting should be considered to be too small for the system > requirements. I think it's bad that we define wal_keep_size, max_slot_wal_keep_size that way. I don't think bringing max_wal_size into this is useful, as it influences different things. > > Isn't it problematic to use ConvertToXSegs() to implement > > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large > > wal_segment_size that'd afaict lead to being much more aggressive invalidating > > slots. > > I think max_slot_wal_keep_size is, like max_wal_size for checkpoints, > a safeguard for slots not to fill-up WAL directory. Thus they both are > rounded down. If you have 1GB WAL directory and set wal_segment_size > to 4192MB, I don't see it a sane setup. But if segment size is smaller > than one hundredth of max_wal_size, that difference won't matter for > anyone. But anyway, it's a pain in the a.. that the both of them (and > wal_keep_size) don't work in a immediate manner, though.. It doesn't matter a lot for 16MB segments, but with 1GB segments it's a different story. To me the way it's done now is a bug, one that can in extreme circumstances lead to data loss. > > Also, why do we do something as expensive as > > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no > > effect? > > Indeed. Maybe we didn't regard that process as complex at start? I > think we can compare the cutoff segno against > XLogGetReplicationSlotMinimumLSN() before entering the loop over > slots. That'd be better, but I'd probably go further, and really gate it on max_slot_wal_keep_size having had an effect. > Thus I think there's room for the following improvements. > > - Prevent KeepLogSeg from returning 0. > > - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects. > > - Check against minimum slot LSN before actually examining through the > slots in Invalidateobsoletereplicationslots. > > I'm not sure about the second item but the others seem back-patchable. > > If we need to continue further discussion, will need another > thread. Anyway I'll come up with the patch for the above three items. Yep, probably a good idea to start another thread. There's also https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de that unfortunately nobody replied to. Greetings, Andres Freund
pgsql-bugs by date: