Re: walsender timeout on logical replication set - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: walsender timeout on logical replication set
Date
Msg-id CAA4eK1KK4bmTx6A6TGkcaESwzQfxgOvafdLtFfSoCsY6jwWfzg@mail.gmail.com
In response to Re: walsender timeout on logical replication set  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Fri, Sep 17, 2021 at 12:48 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Thank you very much for chiming in!
>
> At Fri, 17 Sep 2021 10:18:11 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Mon, Sep 13, 2021 at 7:01 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > Hello.
> > >
> > > As reported in [1], it seems that the walsender can suffer a
> > > timeout in certain cases.  It is not clearly confirmed, but I
> > > suspect there is a case where LogicalRepApplyLoop keeps running the
> > > innermost loop without receiving a keepalive packet for longer than
> > > wal_sender_timeout (not wal_receiver_timeout).
> >
> > Why is that happening? In the previous investigation in this area [1],
> > your tests revealed that after reading a WAL page we always send a
> > keepalive, so even if the transaction is large, we should send some
> > keepalives in between.
>
> We fixed the too-many-keepalives issue (aka keepalive flood) in that
> thread, but this is an issue of a long absence of subscriber
> responses. What I'm suspecting, or assuming, here is:
>
> - The publisher is working fine. It doesn't send too many extra
>   keepalives and does send regular keepalives every
>   wal_sender_timeout/2 seconds by the sender's clock.
>

I think the publisher should also send a keepalive once a certain
amount of WAL has been consumed, via the below code:
WalSndWaitForWal()
{
...
    /*
     * Send a keepalive if the receiver hasn't yet confirmed everything
     * we have sent and we aren't already waiting for a ping response.
     */
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
        WalSndKeepalive(false);
...
}

> - The network conveys all the data in time.
>
> - The subscriber consumes the received data at less than half the
>   speed at which the publisher sends it.  In this case, while the
>   burst traffic is coming, the publisher keeps sending for
>   wal_sender_timeout/2 seconds and may not send a keepalive for the
>   same duration.  This is the correct behavior.  On the other hand,
>   the subscriber is kept busy without receiving a keepalive for
>   wal_sender_timeout seconds.  AFAICS LogicalRepApplyLoop doesn't send
>   a response unless a keepalive comes in while it is in the innermost
>   loop.
>

One way this could happen is that the apply is taking a long time
because of contention on the subscriber, say a lot of other operations
are running on the subscriber, or it is stuck for some reason.
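
To make the pattern concrete, here is a simplified, self-contained toy
sketch of the loop shape being discussed (this is not the actual
LogicalRepApplyLoop code; the message stream, stub functions, and names
are invented purely for illustration). The point is that nothing is
sent back on the data ('w') path, so a slow apply side can stay silent
for longer than wal_sender_timeout during a burst:

#include <stdbool.h>
#include <stdio.h>

typedef struct { char type; int lsn; } Msg;

/* Toy stream: a burst of change messages with one trailing keepalive. */
static Msg stream[] = {
    {'w', 1}, {'w', 2}, {'w', 3}, {'w', 4}, {'w', 5}, {'k', 5},
};
static int pos = 0;

static bool receive_message(Msg *out)   /* stands in for walrcv_receive() */
{
    if (pos >= (int) (sizeof(stream) / sizeof(stream[0])))
        return false;                   /* no more data available */
    *out = stream[pos++];
    return true;
}

static void apply_change(const Msg *m)  /* stands in for applying a change */
{
    printf("applying change up to lsn %d (this can be slow)\n", m->lsn);
}

static void send_feedback(int lsn)      /* stands in for the status reply */
{
    printf("reply: received/flushed up to lsn %d\n", lsn);
}

int main(void)
{
    Msg m;
    int last_received = 0;

    /* Inner loop: process all available data without blocking. */
    while (receive_message(&m))
    {
        last_received = m.lsn;
        if (m.type == 'w')
            apply_change(&m);               /* no reply on this path */
        else if (m.type == 'k')
            send_feedback(last_received);   /* reply only on keepalive */
    }

    /* A reply goes out unconditionally only once the burst is drained. */
    send_feedback(last_received);
    return 0;
}

With a short wal_sender_timeout (5s in the report), a burst whose apply
takes longer than that with no interleaved keepalive would be enough to
trip the timeout.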

> If wal_sender_timeout is relatively short (5 seconds, in the report),
> a burst of (or gapless) logical replication traffic can easily
> continue for more than 2.5 seconds.  If wal_sender_timeout is longer
> (1 min, ditto), bursts of replication traffic lasting more than
> wal_sender_timeout/2 become relatively infrequent.
>
> However, I'm not sure how increasing it further to 5 min makes things
> worse again.
>

There might be a possibility that the subscriber is stuck or is
extremely slow due to other operations.

> Is my diagnosis correct that while the innermost loop in
> LogicalRepApplyLoop [A] is busy, it doesn't have a chance to send a
> reply until a keepalive comes in?  If so, a walsender timeout due to
> slowness of the subscriber can happen, and we might need to break out
> of the innermost loop to give the subscriber a chance to send a
> response at appropriate intervals. This is what I wanted to propose.
>

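If I'm reading the proposal correctly, in terms of the toy sketch above
it would amount to roughly the following (again only a hypothetical
illustration that reuses the toy stubs from that sketch; the names and
the timing check are made up, not worker.c symbols or a patch): send
feedback from inside the inner loop once a chosen interval has elapsed,
instead of only when a keepalive happens to arrive.

#include <time.h>

static long now_ms(void)                 /* coarse wall-clock stub */
{
    return (long) time(NULL) * 1000;
}

/* Same toy inner loop as above, but it never stays silent for too long. */
static void apply_loop_with_periodic_feedback(long reply_interval_ms)
{
    Msg  m;
    int  last_received = 0;
    long last_reply_ms = now_ms();

    while (receive_message(&m))
    {
        last_received = m.lsn;
        if (m.type == 'w')
            apply_change(&m);
        else if (m.type == 'k')
            send_feedback(last_received);

        /* New part: reply on our own schedule, not only on keepalives. */
        if (now_ms() - last_reply_ms >= reply_interval_ms)
        {
            send_feedback(last_received);
            last_reply_ms = now_ms();
        }
    }

    send_feedback(last_received);
}
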
I was thinking increasing wal_sender/receiver_timeout should solve
this problem. I am not sure why it leads to loss of WAL in the OP's
case.

> [A]
> backend/replication/logical/worker.c:2565@today's master
> >       /* Loop to process all available data (without blocking). */
> >       for (;;)
>
>
>
> > The other thing that I am not able to understand from Abhishek's reply
> > [2] is why increasing wal_sender_timeout/wal_receiver_timeout leads to
> > the removal of required WAL segments. As per my understanding, we
> > shouldn't remove WAL unless we get confirmation that the subscriber
> > has processed it.
> >
> > [1] - https://www.postgresql.org/message-id/20210610.150016.1709823354377067679.horikyota.ntt%40gmail.com
> > [2] - https://www.postgresql.org/message-id/CAEDsCzjEHLxgqa4d563CKFwSbgBvvnM91Cqfq_qoZDXCkyOsiw%40mail.gmail.com
> >
> > Note - I have added Abhishek to see if he has answers to any of these questions.
>
> Ouch! max_slot_wal_keep_size was introduced in 13. So, for now, I have
> no idea how the required segments could have been removed on the
> publisher.
>
> == From the first report [1]
> sourcedb=# select * from pg_replication_slots;
> ...
> restart_lsn           116D0/C36886F8
> confirmed_flush_lsn   116D0/C3E5D370
>
> targetdb=# show wal_receiver_status_interval;
> wal_receiver_status_interval  2s
>
> targetdb=# select * from pg_stat_subscription;
> ..
> received_lsn          116D1/2BA8F170
> last_msg_send_time    2021-08-20 09:05:15.398423+09
> last_msg_receipt_time 2021-08-20 09:05:15.398471+09
> latest_end_lsn        116D1/2BA8F170
> latest_end_time       2021-08-20 09:05:15.398423+09
> ==
>
> There is a gap of about 105 segments (1.7GB) between how far the
> subscriber has advanced and the publisher's idea of how far the
> subscriber has advanced. But that alone cannot cause WAL removal..
>

Yeah, that is quite strange.

-- 
With Regards,
Amit Kapila.


