Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

From Andres Freund
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date
Msg-id 20221117072512.q3fayfyd5ugfwvmf@awork3.anarazel.de
Whole thread Raw
In response to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (hubert depesz lubaczewski <depesz@depesz.com>)
Responses Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
List pgsql-bugs
Hi,

On 2022-11-11 15:50:40 +0100, hubert depesz lubaczewski wrote:
> Out of 8 servers that we prepared upgrade for, one failed with the
> same/similar problem.

Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
your wal_segment_size? And wal_keep_size?



I'm a bit confused by the changes the changes c6550776394e made to
KeepLogSeg() etc.

Before:
    /* then check whether slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo    slotSegNo;

        XLByteToSeg(keep, slotSegNo, wal_segment_size);

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

Current:
    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* Cap by max_slot_wal_keep_size ... */
        if (max_slot_wal_keep_size_mb >= 0)
        {
            uint64        slot_keep_segs;

            slot_keep_segs =
                ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

            if (currSegNo - segno > slot_keep_segs)
                segno = currSegNo - slot_keep_segs;
        }
    }

Which looks to me to have lost the protection against segno being 0. Which
afaict would break the callers, because they do _logSegNo--. Which would then
allow all WAL to be reused, no?  I guess the argument for changing this was
that we should never have such a dangerous value returned by
XLogGetReplicationSlotMinimumLSN()?


It seems decidedly not great to not log at least a debug1 (but probably it
should be LOG) message when KeepLogSeg() decides to limit based on
max_slot_wal_keep_size.


It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
only actually remove WAL if older than the segment that RedoRecPtr (the
logical start of the checkpoint) is in. If the checkpoint is large we'll end
up removing replication slots even though they potentially would only have
retained one additional WAL segment.

Isn't it problematic to use ConvertToXSegs() to implement
max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
wal_segment_size that'd afaict lead to being much more aggressive invalidating
slots.

Also, why do we do something as expensive as
InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
effect?

Greetings,

Andres Freund



pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Next
From: Andres Freund
Date:
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.