Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date
Msg-id CAD21AoCMv9mixfO=g5XsjLo_MiiVPAyp8=N0i_+oLNAuX2+JBw@mail.gmail.com
In response to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (hubert depesz lubaczewski <depesz@depesz.com>)
List pgsql-bugs
On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t          10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
> > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t          10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
> > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241434855    1039D/83825958  1039D/911A8DB0
> > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241434855    1039D/83825958  1039D/911A8DB0
> > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t          21748      \N    3241443528    1039D/83825958  1039D/955633D0
> > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t          21748      \N    3241443528    1039D/83825958  1039D/96453F38
> > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f          \N         \N    3241443528    1039D/83825958  1039D/96453F38
> >
> > It happened twice that the walsender process was active for at least
> > one minute while not advancing restart_lsn and exited. Could you share
> > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > can find out what the wal sender processes were doing?
>
> That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> lot of logs to check if I don't leak any sensitive info.
>
> But, if I'll remove lines that are generated by walsender to pg12
> replicas (streaming replication to bionic/pg12 replicas), the logfile is
> only ~ 3MB. Still a bit too much, as there can be some queries with
> sensitive info.
>
> Perhaps I can grep something out?
>
> > Also, you may have logs for "failed to increase restart lsn: proposed
> > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > please share it as well.
>
> Found one in the time frame you mentioned:
> 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
>

Thanks!

LSN 1039D/8B5773D8 seems to be related to this issue. If we advance the
slot's restart_lsn to this LSN, we remove WAL files older than
000000000001039D0000008A.
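
Just to double-check that boundary, here is a small standalone sketch (not
PostgreSQL code; it only mimics the XLByteToSeg()/XLogFileName() arithmetic
and assumes the default 16MB wal_segment_size and timeline 1):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        const uint64_t seg_size = 16u * 1024 * 1024;    /* default wal_segment_size */
        const uint64_t segs_per_xlogid = UINT64_C(0x100000000) / seg_size;
        unsigned int hi, lo;

        /* the proposed restart_lsn discussed above */
        sscanf("1039D/8B5773D8", "%X/%X", &hi, &lo);

        uint64_t lsn = ((uint64_t) hi << 32) | lo;
        uint64_t segno = lsn / seg_size;

        /* name of the segment that still contains this LSN */
        printf("%08X%08X%08X\n",
               1u,                                      /* assumed timeline */
               (unsigned int) (segno / segs_per_xlogid),
               (unsigned int) (segno % segs_per_xlogid));
        /* prints 000000010001039D0000008B, so segments up to ...0000008A are
         * older than the proposed restart_lsn and become removable */
        return 0;
    }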

In LogicalIncreaseRestartDecodingForSlot(), since
current_lsn (1039D/9170B010) < slot->data.confirmed_flush (1039D/91F41B50),
we executed the following branch and called LogicalConfirmReceivedLocation():

    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }
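
For what it's worth, that comparison can be checked with the raw 64-bit
values using a throwaway standalone snippet (not PostgreSQL code; it just
parses the %X/%X notation):

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t
    parse_lsn(const char *s)
    {
        unsigned int hi, lo;

        sscanf(s, "%X/%X", &hi, &lo);
        return ((uint64_t) hi << 32) | lo;
    }

    int
    main(void)
    {
        uint64_t current_lsn = parse_lsn("1039D/9170B010");
        uint64_t confirmed_flush = parse_lsn("1039D/91F41B50");

        /* prints "yes": the candidate restart_lsn can be used directly */
        printf("%s\n", current_lsn <= confirmed_flush ? "yes" : "no");
        return 0;
    }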

Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:

       if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
           MyReplicationSlot->candidate_restart_valid <= lsn)
       {
           Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

           MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
           MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
           MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
           updated_restart = true;
       }

So you should also have "updated xmin: %u restart: %u" log messages;
could you check for them?

After the above code, we do:

        SpinLockRelease(&MyReplicationSlot->mutex);

        /* first write new xmin to disk, so we know what's up after a crash */
        if (updated_xmin || updated_restart)
        {
            ReplicationSlotMarkDirty();
            ReplicationSlotSave();
            elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
                 updated_restart);
        }

Is something like the following scenario possible? (A rough sketch in code
follows the list.)

1. The wal sender updates the slot's restart_lsn and releases the spin lock
   (the slot has not been saved to disk yet).
2. Someone updates the slots' minimum restart_lsn (note that the slot's
   restart_lsn in memory is already updated).
3. The checkpointer removes WAL files older than the minimum restart_lsn
   calculated at step 2.
4. The wal sender restarts for some reason (or the server crashes).
5. After the restart, the wal sender acquires the slot, but the slot's
   restart_lsn is old.
6. The WAL files required to restart decoding have already been removed.
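
To make that ordering concrete, here is a toy single-process sketch of the
crash variant (not PostgreSQL code; the two LSNs are the ones from this
thread, everything else is heavily simplified):

    #include <stdio.h>
    #include <stdint.h>

    #define SEG_SIZE    (16u * 1024 * 1024)

    typedef struct
    {
        uint64_t    restart_lsn_mem;    /* shared-memory copy of the slot */
        uint64_t    restart_lsn_disk;   /* what ReplicationSlotSave() last wrote */
    } Slot;

    int
    main(void)
    {
        Slot        slot = {
            .restart_lsn_mem  = 0x1039D83825958,    /* old restart_lsn 1039D/83825958 */
            .restart_lsn_disk = 0x1039D83825958,
        };
        uint64_t    oldest_kept_seg;

        /* 1. wal sender advances restart_lsn in memory (to 1039D/8B5773D8) and
         *    releases the spin lock; ReplicationSlotSave() has not run yet */
        slot.restart_lsn_mem = 0x1039D8B5773D8;

        /* 2.+3. the minimum restart_lsn is recomputed from the in-memory value
         *       and the checkpointer removes all older WAL segments */
        oldest_kept_seg = slot.restart_lsn_mem / SEG_SIZE;

        /* 4. the server crashes before the slot is saved to disk */
        slot.restart_lsn_mem = slot.restart_lsn_disk;

        /* 5.+6. after restart the slot still has the old restart_lsn, but the
         *       segment it points into has already been removed */
        uint64_t    needed_seg = slot.restart_lsn_mem / SEG_SIZE;

        printf("needed segment %lX, oldest kept segment %lX -> WAL %s\n",
               (unsigned long) needed_seg,
               (unsigned long) oldest_kept_seg,
               needed_seg < oldest_kept_seg ? "already removed" : "still there");
        return 0;
    }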

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


