RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | OS0PR01MB57163425CEE6CEC16C7D6FA59483A@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
RE: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Tuesday, November 28, 2023 11:58 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/28/23 10:40 AM, shveta malik wrote: > > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 11/28/23 4:13 AM, shveta malik wrote: > >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > >>>> > >>>> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > >>>> <houzj.fnst@fujitsu.com> wrote: > >>>>> > >>>>> Here is the updated version(v39_2) which include all the changes made > in 0002. > >>>>> Please use for review, and sorry for the confusion. > >>>>> > >>>> > >>>> --- a/src/backend/replication/logical/launcher.c > >>>> +++ b/src/backend/replication/logical/launcher.c > >>>> @@ -8,20 +8,27 @@ > >>>> * src/backend/replication/logical/launcher.c > >>>> * > >>>> * NOTES > >>>> - * This module contains the logical replication worker launcher which > >>>> - * uses the background worker infrastructure to start the logical > >>>> - * replication workers for every enabled subscription. > >>>> + * This module contains the replication worker launcher which > >>>> + * uses the background worker infrastructure to: > >>>> + * a) start the logical replication workers for every enabled > subscription > >>>> + * when not in standby_mode. > >>>> + * b) start the slot sync worker for logical failover slots > synchronization > >>>> + * from the primary server when in standby_mode. > >>>> > >>>> I was wondering do we really need a launcher on standby to invoke > >>>> sync-slot worker. If so, why? I guess it may be required for > >>>> previous versions where we were managing work for multiple > >>>> slot-sync workers which is also questionable in the sense of > >>>> whether launcher is the right candidate for the same but now with > >>>> the single slot-sync worker, it doesn't seem worth having it. What do you > think? > >>>> > >>>> -- > >>> > >>> Yes, earlier a manager process was needed to manage multiple > >>> slot-sync workers and distribute load among them, but now that does > >>> not seem necessary. I gave it a try (PoC) and it seems to work well. > >>> If there are no objections to this approach, I can share the patch soon. > >>> > >> > >> +1 on this new approach, thanks! > > > > PFA v40. This patch has removed Logical Replication Launcher support > > to launch slotsync worker. > > Thanks! > > > The slot-sync worker is now registered as bgworker with postmaster, > > with bgw_start_time=BgWorkerStart_ConsistentState and > > bgw_restart_time=60sec. > > > > On removal of launcher, now all the validity checks have been shifted > > to slot-sync worker itself. This brings us to some point of concerns: > > > > a) We still need to maintain RecoveryInProgress() check in slotsync > > worker. Since worker has the start time of > > BgWorkerStart_ConsistentState, it will be started on non-standby as > > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > > has been introduced at the beginning of the worker. But once it exits, > > postmaster will not restart it since it will be clean-exist i.e. > > proc_exit(0) (the restart logic of postmaster comes into play only > > when there is an abnormal exit). But to exit for the first time on > > non-standby, we need that Recovery related check in worker. > > > > b) "enable_syncslot" check is moved to slotsync worker now. Since > > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > > exit the worker if 'enable_syncslot' is found to be disabled. > > 'proc_exit(1)' has been used in order to ensure that the worker is > > restarted and GUCs are checked again after restart_time. Downside of > > this approach is, if someone has kept "enable_syncslot" as disabled > > permanently even on standby, slotsync worker will keep on restarting > > and exiting. > > > > So to overcome the above pain-points, I think a potential approach > > will be to start slotsync worker only if 'enable_syncslot' is on and > > the system is non-standby. > > That makes sense to me. > > > Potential ways (each with some issues) are: > > > > 1) Use the current way i.e. register slot-sync worker as bgworker with > > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > > this seems more like a hack. This will need extra changes as currently > > once 'maybe_start_bgworkers' is attempted by postmaster, it will > > attempt again to start any worker only if the worker had abnormal exit > > and restart_time !=0. The current postmatser will not attempt to start > > worker on any GUC change. > > > > 2) Another way maybe to treat slotsync worker as special case and > > separate out the start/restart of slotsync worker from bgworker, and > > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > > keep starting it in the postmaster loop(ServerLoop). In this way, we > > may be able to add more checks before starting worker. But by opting > > this approach, we will have to manage slotsync worker completely by > > ourself as it will be no longer be part of existing > > bgworker-registration infra. If this seems okay and there are no other > > better options, it can be analyzed further in detail. > > > > 3) Another approach could be, in order to solve issue (a), introduce a > > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > > start a bgworker only if consistent state is reached and the system is > > standby. And for issue (b), lets retain check of enable_syncslot in > > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > > and postmaster will not restart it. But I'm not sure if making it > > "PGC_POSTMASTER" is acceptable from the user's perspective. > > I had the same idea (means make enable_syncslot as 'PGC_POSTMASTER') > when reading b). I'm +1 on it (at least for V1) as I don't think that this parameter > value would change frequently. Curious to know what others think too. > > Then as far a) is concerned, I'd vote for introducing a new > BgWorkerStart_ConsistentState_HotStandby. Here is the V41 patch set which includes the following changes. V41-0001: 1) Based on the discussion[1], I update the document to remind user to change the slot's failover option when ALTER SUBSCRIPTION ... SET (slot_name = xx). 2) Address comments in [2][3][4]. V41-0002: 1) 'enable_syncslot' is changed from PGC_SIGHUP to PGC_POSTMASTER, slot-sync worker will now clean exit (proc_exit(0)) if enable_syncslot is found disabled. 2) BgWorkerStart_ConsistentState_HotStandby is introduced as new start-time for bgworker. This will start worker only if it is standby_mode and consistent state is reached. 3) 'SYNCSLOT_STATE_INITIATED' is now set in 'ReplicationSlotCreate' itself in slot-sync worker case. Earlier it was set at later point of time giving a window wherein even a synced slot was in 'n' state for quite some time, which was not correct. Thanks Shveta for working on the V41-0002. [1] https://www.postgresql.org/message-id/CAA4eK1Jd9dk%3D5POTKM9p4EyYqYzLXe-AnLzHrUELjzZScLz7mw%40mail.gmail.com [2] https://www.postgresql.org/message-id/eb09f682-db82-41cd-93bc-5d44e10e1d6d%40gmail.com [3] https://www.postgresql.org/message-id/CAHut%2BPsuSWjm7U_sVnL8FXZ7ZQcfCcT44kAK7i6qMG35Cwjy3A%40mail.gmail.com [4] https://www.postgresql.org/message-id/CAFPTHDbFqLgXS6Et%2BshNGPDjCKK66C%2BZSarqFHmQvfnAah3Qsw%40mail.gmail.com Best Regards, Hou zj
Attachment
pgsql-hackers by date: