Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Masahiko Sawada |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | CAD21AoDHj2RTEjVdqNTVPKyXqjjyUXpyA+=o8UMFEN-vTU848g@mail.gmail.com Whole thread Raw |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (hubert depesz lubaczewski <depesz@depesz.com>) |
Responses |
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it. Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
List | pgsql-bugs |
On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 It happened twice that the walsender process was active for at least one minute while not advancing restart_lsn and exited. Could you share the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we can find out what the wal sender processes were doing? Also, you may have logs for "failed to increase restart lsn: proposed %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, please share it as well. I've tried to reproduce this issue with a similar setting in a loop but could not. Given that you have the log "attempting to remove WAL segments older than log file 000000000001039D0000008A", the last WAL file we need to retain calculated by KeepLogSeg() was wrong. KeepLogSeg() retreated the last segment that we need to retain while considering replication slots' minimum LSN (and wal_keep_segments) but it seemed not to work fine. The possibilities are (1) we got the wrong slots' minimum LSN that is newer than a slot's LSN (ie, retreated logSegNo but it's still newer than required WAL files) or (2) we got an invalid slots's minimum LSN (ie, didn't retreated at all). If we know what the RedoRecPtr was, we can calculate _logSegNo that has not been retreated by KeepLogSeg(). BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows: if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr && MyReplicationSlot->candidate_restart_valid <= lsn) { Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr); MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn; MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr; MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr; updated_restart = true; } Is it worth changing the assertion to elog(ERROR) so that we can detect that an invalid value is set to restart_lsn? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-bugs by date: