Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Exit walsender before confirming remote flush in logical replication
Msg-id CAA4eK1L1TKS2tpTpC+CRH_NFDjyMLkaxsEdgTSMSOcqYcZUY2A@mail.gmail.com
In response to Re: Exit walsender before confirming remote flush in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Let me try to summarize the discussion till now. The problem we are
> > > trying to solve here is to allow a shutdown to complete when walsender
> > > is not able to send the entire WAL. Currently, in such cases, the
> > > shutdown fails. As per our current understanding, this can happen when
> > > (a) walreceiver/walapply process is stuck (not able to receive more
> > > WAL) due to locks or some other reason; (b) a long time delay has been
> > > configured to apply the WAL (we don't yet have such a feature for
> > > logical replication, but the discussion of it is in progress).
> > >
> > > Both reasons mostly apply to logical replication because there is no
> > > separate walreceiver process whose job is to just flush the WAL. In
> > > logical replication, the process that receives the WAL also applies
> > > it. So, while applying, it can get stuck for a long time waiting for a
> > > heavy-weight lock to be released by some other long-running
> > > transaction in another backend.
> > >
...
...
>
> +1 to eliminate condition (b) for logical replication.
>
> Regarding (a), as Amit mentioned before[1], I think we should check if
> pq_is_send_pending() is false.
>

Sorry, but your suggestion is not completely clear to me. Do you mean
that for logical replication we shouldn't wait for all the WAL to be
successfully replicated, but we should still make sure to inform the
subscriber that XLOG streaming is done (by ensuring that
pq_is_send_pending() is false and by calling EndCommand() and pq_flush())?
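
To make the question concrete, here is a minimal sketch of the two exit
conditions being compared. It is a toy model, not walsender code: the
struct, its fields, and both function names are hypothetical, and
send_pending only stands in for what pq_is_send_pending() reports. The
"proposed" variant is one reading of the suggestion above, under the
assumption that an empty output buffer is the only thing required before
the done message is sent for logical replication.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position (not the real XLogRecPtr). */
typedef uint64_t wal_pos;

/* Toy snapshot of the sender's state at shutdown; all names are illustrative. */
typedef struct SenderState
{
    bool    is_logical;   /* true for logical, false for physical replication */
    bool    caught_up;    /* all available WAL has been handed to the send path */
    bool    send_pending; /* models pq_is_send_pending(): output buffer not empty */
    wal_pos sent_pos;     /* last WAL position sent to the receiver */
    wal_pos flushed_pos;  /* last position the receiver confirmed as flushed */
} SenderState;

/*
 * Behaviour as summarized in this thread: exit only once everything sent
 * has been confirmed by the receiver and nothing is left in the buffer.
 */
static bool
can_exit_current(const SenderState *s)
{
    return s->caught_up &&
           s->sent_pos == s->flushed_pos &&
           !s->send_pending;
}

/*
 * Variant being asked about (an assumption, not a committed design):
 * logical replication skips the remote-flush confirmation but still
 * requires an empty output buffer so the done message can go out.
 */
static bool
can_exit_proposed(const SenderState *s)
{
    if (!s->is_logical)
        return can_exit_current(s);
    return !s->send_pending;
}
```

In this model, a subscriber whose apply worker is blocked on a lock but
whose socket buffer has drained lets the sender exit under the proposed
rule, while a full socket buffer (send_pending stays true) still blocks
the shutdown, which is the case discussed further down.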

> Otherwise, we will end up terminating
> the WAL stream without the done message, which will lead to an error
> message "ERROR:  could not receive data from WAL stream: server closed
> the connection unexpectedly" on the subscriber even at a clean
> shutdown.
>

But will that be a problem? The shutdown docs [1] give no such
guarantee. They only say: “Smart” mode disallows new connections, then
waits for all existing clients to disconnect. If the server is in hot
standby, recovery and streaming replication will be terminated once all
clients have disconnected. I see that the switchover in physical
replication requires all the WAL to be sent and replicated, but we
don't need that for logical replication.

> In a case where pq_is_send_pending() doesn't become false
> for a long time, (e.g., the network socket buffer got full due to the
> apply worker waiting on a lock), I think users should unblock it by
> themselves. Or it might be practically better to shut down the
> subscriber first in the logical replication case, unlike the physical
> replication case.
>

Yeah, will users like such a dependency? And what will they gain by doing so?


[1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

--
With Regards,
Amit Kapila.


