Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject Exit walsender before confirming remote flush in logical replication
Msg-id TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
Responses Re: Exit walsender before confirming remote flush in logical replication  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Re: Exit walsender before confirming remote flush in logical replication  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers

Dear hackers,
(I have CC'd Amit because we discussed this in another thread.)

This thread is forked from the time-delayed logical replication thread [1].
During that discussion, we thought that the condition for walsender shutdown
could be relaxed [2][3].

Currently, a walsender delays acting on a shutdown request until it has
confirmed that all sent data has been flushed on the remote side. This condition
was added in 985bd7 [4] to support clean switchover. Suppose there is a
primary-secondary physical replication system and we perform the following
steps. If any changes arrive during step 2 and the walsender has not confirmed
the remote flush, the restart in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Restart the old primary as the new secondary.

In the case of logical replication, however, we cannot support the use case of
switching the publisher and subscriber roles. Suppose the same scenario as
above, with additional transactions committed during step 2. To catch up with
such changes, the subscriber must receive the WAL related to those transactions,
but this cannot be done because a subscriber cannot request WAL from a specific
position. In that case, we must first truncate all data on the new subscriber
and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore this condition when shutting down the
walsender in logical replication.

This change may be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes have been applied on the subscriber, even
when the apply is intentionally delayed. As a result, the publisher cannot be
stopped if a large delay is specified.

Please see the attached minimal patch for that. I'm not sure whether the
WalSndCaughtUp check should also be omitted. The change may affect other parts,
such as WalSndWaitForWal(), but we can investigate that further.

[1]: https://commitfest.postgresql.org/41/3581/
[2]: https://www.postgresql.org/message-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg%40mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


