Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
 
> 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 

It happened twice that the walsender process was active for at least
one minute while not advancing restart_lsn and exited. Could you share
the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
can find out what the wal sender processes were doing?

Also, you may have logs for "failed to increase restart lsn: proposed
%X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
please share it as well.

I've tried to reproduce this issue with a similar setting in a loop
but could not.

Given that you have the log "attempting to remove WAL segments older
than log file 000000000001039D0000008A", the last WAL file we need to
retain calculated by KeepLogSeg() was wrong.

KeepLogSeg() retreated the last segment that we need to retain while
considering replication slots' minimum LSN (and wal_keep_segments) but
it seemed not to work fine. The possibilities are (1) we got the wrong
slots' minimum LSN that is newer than a slot's LSN (ie, retreated
logSegNo but it's still newer than required WAL files) or (2) we got
an invalid slots's minimum LSN (ie, didn't retreated at all). If we
know what the RedoRecPtr was, we can calculate _logSegNo that has not
been retreated by KeepLogSeg().

BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:

        if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
            MyReplicationSlot->candidate_restart_valid <= lsn)
        {
            Assert(MyReplicationSlot->candidate_restart_lsn !=
InvalidXLogRecPtr);

            MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;
            MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
            MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
            updated_restart = true;
        }

Is it worth changing the assertion to elog(ERROR) so that we can
detect that an invalid value is set to restart_lsn?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-bugs by date:

Previous
From: hubert depesz lubaczewski
Date:
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Next
From: Amit Kapila
Date:
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.