Hi,
On Thursday, December 15, 2022 12:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Dec 15, 2022 at 7:16 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> >
> > At Wed, 14 Dec 2022 10:46:17 +0000, "Hayato Kuroda (Fujitsu)"
> > <kuroda.hayato@fujitsu.com> wrote in
> > > I have implemented and tested that workers wake up per
> > > wal_receiver_timeout/2 and send keepalive. Basically it works well, but I
> > > found two problems.
> > > Do you have any good suggestions about them?
> > >
> > > 1)
> > >
> > > With this PoC at present, workers calculate sending intervals based
> > > on its wal_receiver_timeout, and it is suppressed when the parameter is set
> > > to zero.
> > >
> > > This means that there is a possibility that walsender is timeout
> > > when wal_sender_timeout in publisher and wal_receiver_timeout in
> > > subscriber is different.
> > > Supposing that wal_sender_timeout is 2min, wal_receiver_timeout is
> > > 5min,
> >
> > It seems to me wal_receiver_status_interval is better for this use.
> > It's enough for us to document that "wal_r_s_interval should be
> > shorter than wal_sender_timeout/2 especially when logical replication
> > connection is using min_apply_delay. Otherwise you will suffer
> > repeated termination of walsender".
> >
>
> This sounds reasonable to me.
Okay, I changed the wake-up interval to wal_receiver_status_interval
and added the suggested description about the timeout to the documentation.
FYI, wal_receiver_status_interval by definition specifies
the minimum frequency for the WAL receiver process to send information
to the upstream, so I used its value directly as the WaitLatch timeout.
The wording of my documentation change follows that definition.
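
To make the documented rule concrete, here is a minimal sketch (not part of
the patch) that checks whether a feedback interval satisfies "shorter than
wal_sender_timeout/2". The function name and the handling of zero values are
my own assumptions for illustration: I assume a wal_sender_timeout of 0
disables the timeout, and a feedback interval of 0 means no periodic
feedback is sent at all.

```python
def feedback_interval_is_risky(wal_sender_timeout_s, feedback_interval_s):
    """Return True if the interval violates the documented rule.

    Hypothetical helper for illustration only; it is not PostgreSQL code.
    Rule: the feedback interval should be shorter than wal_sender_timeout / 2.
    """
    if wal_sender_timeout_s == 0:
        # Assumption: 0 disables the walsender timeout, so nothing can expire.
        return False
    if feedback_interval_s == 0:
        # Assumption: 0 disables periodic feedback, so the walsender may
        # hear nothing for the whole delay.
        return True
    return feedback_interval_s >= wal_sender_timeout_s / 2


# Numbers from the quoted example: wal_sender_timeout = 2min, worker wakes
# up every 2.5min (wal_receiver_timeout / 2 with a 5min timeout).
print(feedback_interval_is_risky(120, 150))
# A 10s interval against a 60s wal_sender_timeout satisfies the rule.
print(feedback_interval_is_risky(60, 10))
```

With the numbers from the quoted example the check reports a risky setup,
which matches the repeated walsender termination described there.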
> > > and min_apply_delay is 10min. The worker on subscriber will wake up
> > > per 2.5min and send keepalives, but walsender exits before the message
> > > arrives to publisher.
> > >
> > > One idea to avoid that is to send the min_apply_delay subscriber
> > > option to publisher and compare them, but it may be not sufficient.
> > > Because XXX_timeout GUC parameters could be modified later.
> >
> > # Anyway, I don't think such asymmetric setup is preferable.
> >
> > > 2)
> > >
> > > The issue reported by Vignesh-san[1] has still remained. I have
> > > already analyzed that [2], the root cause is that flushed WAL is not
> > > updated and sent to the publisher. Even if workers send keepalive
> > > messages to pub during the delay, the flushed position cannot be modified.
> >
> > I didn't look closer but the cause I guess is walsender doesn't die
> > until all WAL has been sent, while logical delay chokes replication
> > stream.
For issue (2), a separate thread has been started in [1].
I'll leave any further changes on that point to that thread.
Attached is the updated patch.
As before, for the TAP tests I relied on a base patch, shared in another
thread [2], that wakes up the logical replication worker.
[1] - https://www.postgresql.org/message-id/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/flat/20221122004119.GA132961%40nathanxps13
Best Regards,
Takamichi Osumi