RE: Time delayed LR (WAS Re: logical replication restrictions) - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Time delayed LR (WAS Re: logical replication restrictions)
Msg-id TYAPR01MB5866360932F60714625192F9F5E09@TYAPR01MB5866.jpnprd01.prod.outlook.com
In response to Re: Time delayed LR (WAS Re: logical replication restrictions)  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Dear Horiguchi-san, Amit,

> > On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > Yeah, I think ideally it will timeout but if we have a solution like
> > during delay, we keep sending ping messages time-to-time, it should
> > work fine. However, that needs to be verified. Do you see any reasons
> > why that won't work?

I have implemented and tested a PoC in which workers wake up every wal_receiver_timeout/2
and send a keepalive. It basically works well, but I found two problems.
Do you have any good suggestions for them?

1)

In the current PoC, each worker calculates its keepalive interval from its own
wal_receiver_timeout, and keepalives are suppressed when that parameter is set to zero.

This means the walsender may time out when wal_sender_timeout on the publisher
and wal_receiver_timeout on the subscriber differ.
Suppose that wal_sender_timeout is 2min, wal_receiver_timeout is 5min,
and min_apply_delay is 10min. The worker on the subscriber wakes up every 2.5min and
sends a keepalive, but the walsender exits before the first message arrives at the publisher.
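
The arithmetic of that scenario can be checked with a small sketch. This is only
a simplified model, not the PoC code: the function names are hypothetical, all
values are in seconds, and it assumes that the walsender gives up once it hears
nothing for wal_sender_timeout, ignoring network latency and any other traffic.

```python
from typing import Optional


def keepalive_interval(wal_receiver_timeout_s: float) -> Optional[float]:
    """Wake-up interval of the delayed worker (hypothetical helper).

    Returns None when wal_receiver_timeout is 0, i.e. keepalives are suppressed.
    """
    if wal_receiver_timeout_s == 0:
        return None
    return wal_receiver_timeout_s / 2


def walsender_times_out(wal_sender_timeout_s: float,
                        wal_receiver_timeout_s: float,
                        min_apply_delay_s: float) -> bool:
    """True if, in this simplified model, the walsender exits during the delay."""
    # wal_sender_timeout = 0 disables the timeout on the publisher side.
    if wal_sender_timeout_s == 0:
        return False
    interval = keepalive_interval(wal_receiver_timeout_s)
    # Longest stretch during the delay in which the worker sends nothing.
    if interval is None:
        silence = min_apply_delay_s
    else:
        silence = min(interval, min_apply_delay_s)
    return silence >= wal_sender_timeout_s


# The case from this mail: 2min / 5min / 10min.
print(walsender_times_out(120, 300, 600))
```

With matching timeouts (for example 120 and 120) the interval is 60s and the
model reports no timeout, so the problem appears only when the subscriber-side
value is large enough relative to the publisher-side one.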

One idea to avoid that is to send the min_apply_delay subscriber option to the publisher
and compare them, but that may not be sufficient, because the XXX_timeout GUC parameters
could be modified later.

2)

The issue reported by Vignesh-san [1] still remains. As I have already analyzed in [2],
the root cause is that the flushed WAL position is not updated and sent to the publisher. Even
if workers send keepalive messages to the publisher during the delay, the flushed position
cannot be advanced.
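
To illustrate why keepalives alone do not help here, below is a rough sketch.
It is a toy model, not the worker's actual feedback code: the Feedback class,
the function name and the LSN value are made up for illustration, and it only
assumes that every keepalive sent during the delay carries the last flushed
position the worker already had.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    """Toy stand-in for a status message sent to the publisher (hypothetical)."""
    at_s: float      # seconds since the delay started
    flush_lsn: int   # flushed position reported to the publisher


def feedback_during_delay(last_flush_lsn: int,
                          delay_s: float,
                          interval_s: float) -> List[Feedback]:
    """Keepalives sent while the worker waits out min_apply_delay."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    messages = []
    t = interval_s
    while t < delay_s:
        # The delayed changes are not applied yet, so there is no new
        # flushed position to report; the old value is sent again.
        messages.append(Feedback(at_s=t, flush_lsn=last_flush_lsn))
        t += interval_s
    return messages


msgs = feedback_during_delay(last_flush_lsn=1000, delay_s=600, interval_s=150)
# Every message repeats the same flushed position.
print({m.flush_lsn for m in msgs})
```

In this model the publisher keeps hearing from the subscriber, so the timeout
is avoided, but the reported flushed position stays where it was for the whole
delay.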

[1]: https://www.postgresql.org/message-id/CALDaNm1vT8qNBqHivtAgYur-5-YkwF026VHtw9srd4fsdeaufA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/TYAPR01MB5866F6BE7399E6343A96E016F51C9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



