
RE: Time delayed LR (WAS Re: logical replication restrictions) - Mailing list pgsql-hackers

From	Hayato Kuroda (Fujitsu)
Subject	RE: Time delayed LR (WAS Re: logical replication restrictions)
Date	December 9, 2022 08:19:37
Msg-id	TYAPR01MB5866F6BE7399E6343A96E016F51C9@TYAPR01MB5866.jpnprd01.prod.outlook.com
In response to	Re: Time delayed LR (WAS Re: logical replication restrictions) (vignesh C <vignesh21@gmail.com>)
Responses	Re: Time delayed LR (WAS Re: logical replication restrictions) (Amit Kapila <amit.kapila16@gmail.com>)
List	pgsql-hackers


Hi Vignesh,

> In the case of physical replication by setting
> recovery_min_apply_delay, I noticed that both primary and standby
> nodes were getting stopped successfully immediately after the stop
> server command. In case of logical replication, stop server fails:
> pg_ctl -D publisher -l publisher.log stop -c
> waiting for server to shut
> down...............................................................
> failed
> pg_ctl: server does not shut down
> 
> In case of logical replication, the server does not get stopped
> because the walsender process is not able to exit:
> ps ux | grep walsender
> vignesh  1950789 75.3  0.0 8695216 22284 ?       Rs   11:51   1:08
> postgres: walsender vignesh [local] START_REPLICATION

Thanks for reporting the issue. I have analyzed it.


The issue occurs because the apply worker cannot reply to the walsender during
the delay. I think we may have to modify the mechanism that delays applying
transactions.

When a walsender process is requested to shut down, it can exit only after all
the WAL it has sent has been replicated on the subscriber. This check is done in
WalSndDone(), and the replicated position is updated when the walsender handles
a reply message from the subscriber, in ProcessStandbyReplyMessage().
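
The exit condition described above can be illustrated with a toy model. This is
only a sketch, not PostgreSQL code: the class ToyWalSender and its fields are
hypothetical names, and the real check in WalSndDone() considers more state
than a single comparison.

```python
from dataclasses import dataclass


@dataclass
class ToyWalSender:
    """Hypothetical stand-in for a walsender; not PostgreSQL code."""

    sent_lsn: int = 0            # last WAL position sent downstream
    replicated_lsn: int = 0      # last position the receiver has reported back
    shutdown_requested: bool = False

    def send_wal(self, upto_lsn: int) -> None:
        self.sent_lsn = max(self.sent_lsn, upto_lsn)

    def process_reply(self, reported_lsn: int) -> None:
        # Similar in spirit to ProcessStandbyReplyMessage(): the replicated
        # position advances only when a reply message arrives.
        self.replicated_lsn = max(self.replicated_lsn, reported_lsn)

    def can_exit(self) -> bool:
        # Similar in spirit to WalSndDone(): exit only once everything that
        # was sent has been reported as replicated.
        return self.shutdown_requested and self.replicated_lsn >= self.sent_lsn


ws = ToyWalSender()
ws.send_wal(100)
ws.shutdown_requested = True
stuck = not ws.can_exit()    # no reply yet, so the sender has to keep waiting
ws.process_reply(100)
done = ws.can_exit()         # the reply has arrived, so the sender may exit
```

In this model, a receiver that never replies leaves can_exit() false
indefinitely, which corresponds to the walsender that cannot exit in the report.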

In the case of physical replication, the walreceiver can receive WAL and reply
even while the apply is delayed. This means that the replicated position is
sent upstream immediately, so the walsender can exit.

In the case of logical replication, however, with the current patch the worker
cannot reply to the walsender while it is delaying a transaction. As a result,
the replicated position is never sent upstream and the walsender cannot exit.


Based on the above analysis, we can conclude that the worker must update the
flushpos and reply to the walsender while delaying the transaction if we want
to solve the issue. This cannot be done with the current approach. A newer
proposal [1] may be able to solve this, although it is still under discussion.
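
As a rough illustration of "reply while delaying", the delay could be split
into slices with a reply sent after each one. This is a hypothetical sketch
only: it is not the approach proposed in [1], the names delay_with_feedback and
send_feedback are made up for the example, and which position the worker
should report in each reply is left out.

```python
from typing import Callable


def delay_with_feedback(
    remaining_ms: int,
    feedback_interval_ms: int,
    send_feedback: Callable[[], None],
) -> int:
    """Wait out a delay in slices, replying upstream after each slice.

    Returns the number of replies sent. A real worker would sleep or wait on
    a latch for each slice; this toy only does the bookkeeping.
    """
    if feedback_interval_ms <= 0:
        raise ValueError("feedback_interval_ms must be positive")

    replies = 0
    while remaining_ms > 0:
        step = min(remaining_ms, feedback_interval_ms)
        remaining_ms -= step     # stands in for waiting `step` milliseconds
        send_feedback()          # keeps the upstream side informed
        replies += 1
    return replies


sent = []
count = delay_with_feedback(2500, 1000, lambda: sent.append("reply"))
```

With a 2500 ms delay and a 1000 ms interval, the loop runs three slices
(1000, 1000, 500), so the upstream side hears from the worker three times
instead of not at all.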


Note that a similar issue can be reproduced with physical replication.
When wal_sender_timeout is set to 0 and the network between the primary and the
secondary breaks after the primary has sent WAL to the secondary, the primary
node cannot be stopped.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
