RE: Time delayed LR (WAS Re: logical replication restrictions) - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Time delayed LR (WAS Re: logical replication restrictions)
Date
Msg-id TYAPR01MB5866F6BE7399E6343A96E016F51C9@TYAPR01MB5866.jpnprd01.prod.outlook.com
In response to Re: Time delayed LR (WAS Re: logical replication restrictions)  (vignesh C <vignesh21@gmail.com>)
Responses Re: Time delayed LR (WAS Re: logical replication restrictions)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi Vignesh,

> In the case of physical replication by setting
> recovery_min_apply_delay, I noticed that both primary and standby
> nodes were getting stopped successfully immediately after the stop
> server command. In case of logical replication, stop server fails:
> pg_ctl -D publisher -l publisher.log stop -c
> waiting for server to shut
> down...............................................................
> failed
> pg_ctl: server does not shut down
> 
> In case of logical replication, the server does not get stopped
> because the walsender process is not able to exit:
> ps ux | grep walsender
> vignesh  1950789 75.3  0.0 8695216 22284 ?       Rs   11:51   1:08
> postgres: walsender vignesh [local] START_REPLICATION

Thanks for reporting the issue. I analyzed it.


This issue has occurred because the apply worker cannot reply during the delay.
I think we may have to modify the mechanism that delays applying transactions.

When a walsender process is requested to shut down, it can exit only after
all the WAL it has sent has been replicated on the subscriber. This check is
done in WalSndDone(), and the replicated position is updated when the walsender
handles a reply message from the subscriber, in ProcessStandbyReplyMessage().
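
To illustrate the condition, here is a simplified stand-alone model. It is not
the actual PostgreSQL source: lsn_t, sender_state, handle_reply() and can_exit()
are hypothetical names, and the real WalSndDone() also checks other things
(for example, pending output) that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; 0 means "invalid / not reported". */
typedef uint64_t lsn_t;

/* Simplified model of the walsender state that matters for shutdown. */
typedef struct
{
    lsn_t sent;   /* last position sent to the subscriber */
    lsn_t write;  /* last write position reported by the subscriber */
    lsn_t flush;  /* last flush position reported by the subscriber */
} sender_state;

/* Models handling a reply message: record what the subscriber reported. */
static void
handle_reply(sender_state *s, lsn_t write, lsn_t flush)
{
    s->write = write;
    s->flush = flush;
}

/*
 * Models the shutdown check: the sender may exit only once the replicated
 * position (flush if reported, otherwise write) has reached the sent position.
 */
static bool
can_exit(const sender_state *s)
{
    lsn_t replicated = (s->flush != 0) ? s->flush : s->write;

    return s->sent == replicated;
}
```

In this model, if no reply ever arrives, can_exit() stays false no matter how
long the sender waits.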

In the case of physical replication, the walreceiver can receive WAL and reply
even while applying is delayed. This means that the replicated position is
sent upstream immediately, so the walsender can exit.

With logical replication, however, the worker cannot reply to the walsender
while it is delaying a transaction under the current patch. As a result, the
replicated position is never sent upstream and the walsender cannot exit.


Based on the above analysis, the worker must update the flushpos and reply to
the walsender while delaying the transaction if we want to solve the issue.
This cannot be done with the current approach; a newer proposed one[1] may be
able to solve this, although it is still under discussion.
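
As a rough sketch of what "reply while delaying" could look like, the model
below splits the delay into short slices and sends feedback after each one.
This is only an illustration under my own assumptions, not the patch or the
approach in [1]: worker_state, send_feedback() and delay_with_feedback() are
hypothetical names, time is simulated instead of waiting on a latch, and
reporting the received position is a placeholder, since which position can
safely be reported is part of the open discussion.

```c
#include <stdint.h>

/* Hypothetical stand-in for an LSN. */
typedef uint64_t lsn_t;

/* Simplified model of the apply worker state. */
typedef struct
{
    lsn_t received;      /* last position received from the walsender */
    lsn_t reported;      /* last position reported back to the walsender */
    int   replies_sent;  /* number of feedback messages sent so far */
} worker_state;

/* Models sending a reply message (placeholder: reports the received position). */
static void
send_feedback(worker_state *w)
{
    w->reported = w->received;
    w->replies_sent++;
}

/*
 * Delay for delay_ms in total, waking up at least every interval_ms to send
 * feedback. Time is simulated: a real worker would wait for "slice"
 * milliseconds where the comment below indicates.
 */
static void
delay_with_feedback(worker_state *w, int delay_ms, int interval_ms)
{
    int waited = 0;

    if (interval_ms <= 0)
        interval_ms = 1;    /* guard against a loop that never advances */

    while (waited < delay_ms)
    {
        int slice = delay_ms - waited;

        if (slice > interval_ms)
            slice = interval_ms;

        /* (wait for slice milliseconds here) */
        waited += slice;
        send_feedback(w);
    }
}
```

For example, a 1000 ms delay with a 300 ms interval produces four replies in
this model (after 300, 600, 900 and 1000 ms), so the upstream side keeps
receiving position updates during the delay.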


Note that a similar issue can be reproduced with physical replication.
When wal_sender_timeout is set to 0 and the network between the primary and
the standby breaks after the primary has sent WAL to the standby, we cannot
stop the primary node.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

