Re: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Exit walsender before confirming remote flush in logical replication |
Date | |
Msg-id | CAA4eK1L1TKS2tpTpC+CRH_NFDjyMLkaxsEdgTSMSOcqYcZUY2A@mail.gmail.com |
In response to | Re: Exit walsender before confirming remote flush in logical replication (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Exit walsender before confirming remote flush in logical replication
Re: Exit walsender before confirming remote flush in logical replication |
List | pgsql-hackers |
On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Let me try to summarize the discussion till now. The problem we are
> > > trying to solve here is to allow a shutdown to complete when walsender
> > > is not able to send the entire WAL. Currently, in such cases, the
> > > shutdown fails. As per our current understanding, this can happen when
> > > (a) walreceiver/walapply process is stuck (not able to receive more
> > > WAL) due to locks or some other reason; (b) a long time delay has been
> > > configured to apply the WAL (we don't yet have such a feature for
> > > logical replication but the discussion for same is in progress).
> > >
> > > Both reasons mostly apply to logical replication because there is no
> > > separate walreceiver process whose job is to just flush the WAL. In
> > > logical replication, the process that receives the WAL also applies
> > > it. So, while applying it can stuck for a long time waiting for some
> > > heavy-weight lock to be released by some other long-running
> > > transaction by the backend.
> > > ... ...
>
> +1 to eliminate condition (b) for logical replication.
>
> Regarding (a), as Amit mentioned before[1], I think we should check if
> pq_is_send_pending() is false.
>

Sorry, but your suggestion is not completely clear to me. Do you mean
to say that for logical replication, we shouldn't wait for all the WAL
to be successfully replicated but we should ensure to inform the
subscriber that XLOG streaming is done (by ensuring
pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

> Otherwise, we will end up terminating
> the WAL stream without the done message. Which will lead to an error
> message "ERROR: could not receive data from WAL stream: server closed
> the connection unexpectedly" on the subscriber even at a clean
> shutdown.
>

But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
disallows new connections, then waits for all existing clients to
disconnect. If the server is in hot standby, recovery and streaming
replication will be terminated once all clients have disconnected.),
there is no such guarantee. I see that it is required for the
switchover in physical replication to ensure that all the WAL is sent
and replicated but we don't need that for logical replication.

> In a case where pq_is_send_pending() doesn't become false
> for a long time, (e.g., the network socket buffer got full due to the
> apply worker waiting on a lock), I think users should unblock it by
> themselves. Or it might be practically better to shutdown the
> subscriber first in the logical replication case, unlike the physical
> replication case.
>

Yeah, will users like such a dependency? And what will they gain by doing so?

[1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

--
With Regards,
Amit Kapila.
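
For readers following the thread, the two shutdown policies being compared can be sketched roughly as below. This is only an illustration and not PostgreSQL source: the SenderState struct, its fields, and both decide_* functions are hypothetical names invented here. The send_pending flag stands in for the result of pq_is_send_pending(), and SEND_DONE_AND_EXIT stands in for the EndCommand/pq_flush step mentioned above. Whether the proposed variant should also require that all WAL has been handed to the output buffer is not settled in the messages quoted here, so the sketch assumes it checks pending output only.

```c
#include <stdbool.h>

/* Hypothetical snapshot of a walsender at shutdown time (illustration only). */
typedef struct SenderState
{
    bool          logical;      /* true for a logical replication walsender */
    bool          caught_up;    /* all WAL has been handed to the output buffer */
    bool          send_pending; /* stand-in for pq_is_send_pending() */
    unsigned long sent_lsn;     /* last WAL position sent */
    unsigned long flushed_lsn;  /* last position the remote confirmed as flushed */
} SenderState;

typedef enum ShutdownAction
{
    KEEP_WAITING,       /* stay in the shutdown loop */
    SEND_DONE_AND_EXIT  /* send the done message, flush it, then exit */
} ShutdownAction;

/* Policy 1: exit only after the remote side confirms flush of all sent WAL. */
static ShutdownAction
decide_wait_for_remote_flush(const SenderState *s)
{
    if (s->caught_up && s->sent_lsn == s->flushed_lsn && !s->send_pending)
        return SEND_DONE_AND_EXIT;
    return KEEP_WAITING;
}

/*
 * Policy 2 (assumed reading of the suggestion in this thread): for logical
 * replication, skip the remote-flush check but still wait for the output
 * buffer to drain so the done message can be delivered. Physical
 * replication keeps policy 1.
 */
static ShutdownAction
decide_drain_output_only(const SenderState *s)
{
    if (!s->logical)
        return decide_wait_for_remote_flush(s);
    return s->send_pending ? KEEP_WAITING : SEND_DONE_AND_EXIT;
}
```

Under policy 2, a subscriber whose apply worker is stuck on a lock no longer blocks shutdown as long as the socket buffer still accepts data. If the buffer is full, send_pending stays true and the sender keeps waiting, which is the case the reply above asks about.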