Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Masahiko Sawada |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | CAD21AoCMv9mixfO=g5XsjLo_MiiVPAyp8=N0i_+oLNAuX2+JBw@mail.gmail.com |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (hubert depesz lubaczewski <depesz@depesz.com>) |
Responses | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
 | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
List | pgsql-bugs |
On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> >
> > It happened twice that the walsender process was active for at least
> > one minute while not advancing restart_lsn and exited. Could you share
> > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > can find out what the wal sender processes were doing?
>
> That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> lot of logs to check if I don't leak any sensitive info.
>
> But, if I'll remove lines that are generated by walsender to pg12
> replicas (streaming replication to bionic/pg12 replicas), the logfile is
> only ~ 3MB. Still a bit too much, as there can be some queries with
> sensitive info.
>
> Perhaps I can grep something out?
> > Also, you may have logs for "failed to increase restart lsn: proposed
> > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > please share it as well.
>
> Found one in the time frame you mentioned:
> 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
>

Thanks! LSN 1039D/8B5773D8 seems to be related to this issue. If we advance the slot's restart_lsn to this LSN, we remove WAL files older than 000000000001039D0000008A.
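For reference, the LSN-to-segment mapping can be double-checked with pg_walfile_name(); a minimal sketch, run on the primary (the function reports names for the server's current timeline and configured WAL segment size):

    -- Map the proposed candidate restart_lsn and the slot's current
    -- restart_lsn from the DEBUG message to their WAL segment files.
    SELECT pg_walfile_name('1039D/8B5773D8'::pg_lsn) AS candidate_segment,
           pg_walfile_name('1039D/83825958'::pg_lsn) AS current_restart_segment;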
In LogicalIncreaseRestartDecodingForSlot(), since "current_lsn (1039D/9170B010) < slot->data.confirmed_flush (1039D/91F41B50)", we executed the following part and called LogicalConfirmReceivedLocation():

    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }

Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

        MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;

        updated_restart = true;
    }

So you should have a log line for "updated xmin: %u restart: %u"; could you check?

After the above code, we do:

    SpinLockRelease(&MyReplicationSlot->mutex);

    /* first write new xmin to disk, so we know what's up after a crash */
    if (updated_xmin || updated_restart)
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
        elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin, updated_restart);
    }

Is something like the following scenario possible?

1. The wal sender updates the slot's restart_lsn in memory and releases the spinlock (the slot is not saved to disk yet).
2. Someone updates the slots' minimum restart_lsn (note that the slot's restart_lsn in memory is already updated).
3. The checkpointer removes WAL files older than the minimum restart_lsn calculated at step 2.
4. The wal sender restarts for some reason (or the server crashes).
5. After the restart, the wal sender acquires the slot, but the slot's on-disk restart_lsn is still the old value.
6. The WAL files required to restart decoding have already been removed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
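To see whether a slot has already hit the situation described in step 6, i.e. its restart_lsn points into a WAL segment that no longer exists in pg_wal, a query along these lines can help; a minimal sketch, assuming a release with pg_ls_waldir() (PostgreSQL 10 or later) and a role allowed to call it:

    -- For each logical slot, compute the segment that holds its restart_lsn
    -- and check whether that segment is still present in pg_wal.
    -- Slots with a NULL restart_lsn will show segment_present = false here.
    SELECT s.slot_name,
           s.restart_lsn,
           pg_walfile_name(s.restart_lsn) AS needed_segment,
           EXISTS (SELECT 1
                   FROM pg_ls_waldir() AS w
                   WHERE w.name = pg_walfile_name(s.restart_lsn)) AS segment_present
    FROM pg_replication_slots AS s
    WHERE s.slot_type = 'logical';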