Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | CAA4eK1JmV90sqy9si9jWEovHruvA5EP8FbnjNTQFOmHmy0Z3vw@mail.gmail.com Whole thread Raw |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-bugs |
On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > It happened twice that the walsender process was active for at least > one minute while not advancing restart_lsn and exited. Could you share > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > can find out what the wal sender processes were doing? > > Also, you may have logs for "failed to increase restart lsn: proposed > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > please share it as well. > > I've tried to reproduce this issue with a similar setting in a loop > but could not. > > Given that you have the log "attempting to remove WAL segments older > than log file 000000000001039D0000008A", the last WAL file we need to > retain calculated by KeepLogSeg() was wrong. > > KeepLogSeg() retreated the last segment that we need to retain while > considering replication slots' minimum LSN (and wal_keep_segments) but > it seemed not to work fine. The possibilities are (1) we got the wrong > slots' minimum LSN that is newer than a slot's LSN (ie, retreated > logSegNo but it's still newer than required WAL files) or (2) we got > an invalid slots's minimum LSN (ie, didn't retreated at all). If we > know what the RedoRecPtr was, we can calculate _logSegNo that has not > been retreated by KeepLogSeg(). > The other possibility is that the slot is not considered for minimum LSN computation (say in_use flag is temporarily set to false). > BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows: > > if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr && > MyReplicationSlot->candidate_restart_valid <= lsn) > { > Assert(MyReplicationSlot->candidate_restart_lsn != > InvalidXLogRecPtr); > > MyReplicationSlot->data.restart_lsn = > MyReplicationSlot->candidate_restart_lsn; > MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr; > MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr; > updated_restart = true; > } > > Is it worth changing the assertion to elog(ERROR) so that we can > detect that an invalid value is set to restart_lsn? > This is worth considering but note that the caller of this has an assertion for invalid values being assigned to candidate_restart_* variables. Also, wouldn't be shown as invalid for restart_lsn in pg_replication_slots? -- With Regards, Amit Kapila.
pgsql-bugs by date: