At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> I found that KeepLogSeg() has a piece of code that is not correctly.
>
> segno may be larger than currSegNo, since the slot_keep_segs variable is of
> type "uint64", in this case the code "if (currSegNo - segno >
> slot_keep_segs)" is incorrect.
>
> "if (currSegNo - segno < keep_segs)" is also the same.
>
> Checkpoint calls the KeepLogSeg function, and there are many operations
> between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
> pg_control file, so segno may be larger than currSegNo.

Correct. Thanks for the report.

If the checkpointer somehow takes a long time between inserting a
checkpoint record and removing WAL files, and replication advances a
certain distance in the meantime, this can actually happen. Although
that behavior doesn't directly affect max_slot_wal_keep_size, it does
disrupt the effect of wal_keep_size.
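
To make the wal_keep_size part concrete, here is a minimal standalone
sketch of that step. It is not the actual KeepLogSeg() code: the
function name wal_keep_step_unguarded is made up for illustration, and
XLogSegNo is simply assumed to be a 64-bit unsigned integer. Only the
comparison "currSegNo - segno < keep_segs" is taken from the report.

```c
#include <stdint.h>

typedef uint64_t XLogSegNo;		/* assumption: unsigned 64-bit, as in the report */

/*
 * Illustrative, simplified wal_keep_size step (hypothetical name).
 * Returns the oldest segment number that should be kept.
 */
static XLogSegNo
wal_keep_step_unguarded(XLogSegNo currSegNo, XLogSegNo segno,
						uint64_t keep_segs)
{
	/*
	 * If segno > currSegNo, the unsigned subtraction wraps around to a
	 * huge value, so this test is false and segno is left untouched.
	 */
	if (currSegNo - segno < keep_segs)
	{
		/* avoid underflow, don't go below 1 */
		if (currSegNo <= keep_segs)
			segno = 1;
		else
			segno = currSegNo - keep_segs;
	}

	return segno;
}
```

With currSegNo = 100 and keep_segs = 10, a slot segment of 95 gives 90
as expected, but a slot segment of 103 comes back unchanged as 103, so
the wal_keep_size reservation is silently skipped.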

The thinko was that we incorrectly assumed the slot minimum LSN can't
be larger than the checkpoint record LSN. We don't need to consider
max_slot_wal_keep_size at all if the slot minimum LSN is already past
the segment currSegNo.
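
The idea can be sketched as follows. This is only an illustration
under assumptions, not the attached patch: the function name
keep_log_seg_sketch and its parameters are hypothetical, the limits
are passed in already converted to segment counts, and the guard
compares segment numbers rather than LSNs to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;		/* assumption: unsigned 64-bit */

/*
 * Hypothetical sketch: compute the oldest segment to keep.
 *   have_slot       whether any slot reports a minimum LSN
 *   slotSegNo       segment containing the slot minimum LSN
 *   slot_keep_segs  max_slot_wal_keep_size in segments, negative = no limit
 *   keep_segs       wal_keep_size in segments, 0 = not set
 */
static XLogSegNo
keep_log_seg_sketch(XLogSegNo currSegNo, bool have_slot, XLogSegNo slotSegNo,
					int64_t slot_keep_segs, uint64_t keep_segs)
{
	XLogSegNo	segno = currSegNo;

	/* Consider the slot only if it is actually behind currSegNo. */
	if (have_slot && slotSegNo < currSegNo)
	{
		segno = slotSegNo;

		/* Cap by max_slot_wal_keep_size; the subtraction cannot wrap here. */
		if (slot_keep_segs >= 0 &&
			currSegNo - segno > (uint64_t) slot_keep_segs)
			segno = currSegNo - (uint64_t) slot_keep_segs;
	}

	/* But keep at least wal_keep_size, without going below segment 1. */
	if (keep_segs > 0 && currSegNo - segno < keep_segs)
		segno = (currSegNo <= keep_segs) ? 1 : currSegNo - keep_segs;

	return segno;
}
```

Since segno never exceeds currSegNo after the guard, both subtractions
stay in range. For example, with currSegNo = 100, a slot at segment
103, no slot limit and keep_segs = 10, the sketch returns 90.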

The attached fix works. However, I couldn't come up with a reasonable
test script.

This dates back to 13, where max_slot_wal_keep_size was introduced.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center