Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uAa6HnPKCO8pM7xFfEmzXtXDmgRvX5ErJX8e8-mWyvXYQ@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/28/23 4:13 AM, shveta malik wrote: > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > >> <houzj.fnst@fujitsu.com> wrote: > >>> > >>> Here is the updated version(v39_2) which include all the changes made in 0002. > >>> Please use for review, and sorry for the confusion. > >>> > >> > >> --- a/src/backend/replication/logical/launcher.c > >> +++ b/src/backend/replication/logical/launcher.c > >> @@ -8,20 +8,27 @@ > >> * src/backend/replication/logical/launcher.c > >> * > >> * NOTES > >> - * This module contains the logical replication worker launcher which > >> - * uses the background worker infrastructure to start the logical > >> - * replication workers for every enabled subscription. > >> + * This module contains the replication worker launcher which > >> + * uses the background worker infrastructure to: > >> + * a) start the logical replication workers for every enabled subscription > >> + * when not in standby_mode. > >> + * b) start the slot sync worker for logical failover slots synchronization > >> + * from the primary server when in standby_mode. > >> > >> I was wondering do we really need a launcher on standby to invoke > >> sync-slot worker. If so, why? I guess it may be required for previous > >> versions where we were managing work for multiple slot-sync workers > >> which is also questionable in the sense of whether launcher is the > >> right candidate for the same but now with the single slot-sync worker, > >> it doesn't seem worth having it. What do you think? > >> > >> -- > > > > Yes, earlier a manager process was needed to manage multiple slot-sync > > workers and distribute load among them, but now that does not seem > > necessary. I gave it a try (PoC) and it seems to work well. If there > > are no objections to this approach, I can share the patch soon. > > > > +1 on this new approach, thanks! PFA v40. This patch has removed Logical Replication Launcher support to launch slotsync worker. The slot-sync worker is now registered as bgworker with postmaster, with bgw_start_time=BgWorkerStart_ConsistentState and bgw_restart_time=60sec. On removal of launcher, now all the validity checks have been shifted to slot-sync worker itself. This brings us to some point of concerns: a) We still need to maintain RecoveryInProgress() check in slotsync worker. Since worker has the start time of BgWorkerStart_ConsistentState, it will be started on non-standby as well. So to ensure that it exists on non-standby, "RecoveryInProgress" has been introduced at the beginning of the worker. But once it exits, postmaster will not restart it since it will be clean-exist i.e. proc_exit(0) (the restart logic of postmaster comes into play only when there is an abnormal exit). But to exit for the first time on non-standby, we need that Recovery related check in worker. b) "enable_syncslot" check is moved to slotsync worker now. Since enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to exit the worker if 'enable_syncslot' is found to be disabled. 'proc_exit(1)' has been used in order to ensure that the worker is restarted and GUCs are checked again after restart_time. Downside of this approach is, if someone has kept "enable_syncslot" as disabled permanently even on standby, slotsync worker will keep on restarting and exiting. So to overcome the above pain-points, I think a potential approach will be to start slotsync worker only if 'enable_syncslot' is on and the system is non-standby. Potential ways (each with some issues) are: 1) Use the current way i.e. register slot-sync worker as bgworker with postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But this seems more like a hack. This will need extra changes as currently once 'maybe_start_bgworkers' is attempted by postmaster, it will attempt again to start any worker only if the worker had abnormal exit and restart_time !=0. The current postmatser will not attempt to start worker on any GUC change. 2) Another way maybe to treat slotsync worker as special case and separate out the start/restart of slotsync worker from bgworker, and follow what we do for autovacuum launcher(StartAutoVacLauncher) to keep starting it in the postmaster loop(ServerLoop). In this way, we may be able to add more checks before starting worker. But by opting this approach, we will have to manage slotsync worker completely by ourself as it will be no longer be part of existing bgworker-registration infra. If this seems okay and there are no other better options, it can be analyzed further in detail. 3) Another approach could be, in order to solve issue (a), introduce a new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means start a bgworker only if consistent state is reached and the system is standby. And for issue (b), lets retain check of enable_syncslot in the worker itself but make it 'PGC_POSTMASTER'. This will ensure we can safely exit the worker(proc_exit(0) if enable_syncslot is disabled and postmaster will not restart it. But I'm not sure if making it "PGC_POSTMASTER" is acceptable from the user's perspective. thanks Shveta
Attachment
pgsql-hackers by date: