Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | Ajin Cherian |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAFPTHDY1dRDfKX5mHfcm1KCkCTAiFti1VjYp8_6=Sxctq0+Etg@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
Responses |
Re: Improve pg_sync_replication_slots() to wait for primary to advance
Re: Improve pg_sync_replication_slots() to wait for primary to advance |
List | pgsql-hackers |
On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote: > > Thank You for the patches. Please find a few comments. > > 1) > Change not needed: > > * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC. > + * > */ > Removed. > 2) > Regarding the naptime in API, I was thinking why not to use > wait_for_slot_activity() directly? If there are no slots activity, it > will keep on doubling the naptime starting from 2sec till max of > 30sec. Thoughts? > > We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and > MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and > MAX_SLOTSYNC_NAPTIME_MS in such a case. > Not changing this since further discussion clarified this. > 3) > + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to > + * fetch all failover slots > > Can we please change it to: > > List of failover logical slots to fetch from primary, or NIL to fetch > all failover logical slots > Changed the variable itself. > 4) > In the worker, before each call to synchronize_slots(), we are > starting a new transaction. It aligns with the previous implementation > where StartTransaction was inside synchronize_slots(). But in API, we > are doing StartTransaction once outside of the loop instead of doing > before each synchronize_slots(), is it intentional? It may keep the > transaction open for a long duration for the case where slots are not > getting persisted soon. > I've added a new memory context to handle slot_names > 5) > With ProcessSlotSyncInterrupts() being removed from API, can you > please check the behaviour of API on smart-shutdown and rest of the > shutdown modes? It should behave like other APIs. And what happens if > I change primary_conninfo to some non-existing server when the API is > running. Does it error out or keep retrying? > I've tested with different types of shutdown and it seems to be handled corerctly. However, yes, if configuration changed, the API does not handle. I've written a new function slotsync_api_reread_config() to specifically handle configuration changes in API context as it is different from slotsync worker logic. On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > 4) > > > In the worker, before each call to synchronize_slots(), we are > > > starting a new transaction. It aligns with the previous implementation > > > where StartTransaction was inside synchronize_slots(). But in API, we > > > are doing StartTransaction once outside of the loop instead of doing > > > before each synchronize_slots(), is it intentional? It may keep the > > > transaction open for a long duration for the case where slots are not > > > getting persisted soon. > > > > > > > I’ll address your other comments separately, but I wanted to respond > > to this one first. I did try the approach you suggested, but the issue > > is that we use the remote_slots list across loop iterations. If we end > > the transaction at the end of each iteration, the list gets freed and > > is no longer available for the next pass. Each iteration relies on the > > remote_slots list from the previous one to build the new list, which > > is why we can’t free it inside the loop. > > Isn't that just a matter of allocating the list in appropriate long > lived memory context? > Yes, changed this. > Here are some more comments > <para> > - The logical replication slots on the primary can be synchronized to > - the hot standby by using the <literal>failover</literal> parameter of > + The logical replication slots on the primary can be enabled for > + synchronization to the hot standby by using the > + <literal>failover</literal> parameter of > > This change corresponds to an existing feature. Should be a separate > patch, which we may want to backport. > Removed. > - on the standby, the failover slots can be synchronized periodically in > + <command>CREATE SUBSCRIPTION</command> during slot creation. After that, > + synchronization can be be performed either manually by calling > + <link linkend="pg-sync-replication-slots"> > ... snip ... > - <note> > - <para> > - While enabling <link linkend="guc-sync-replication-slots"> > - <varname>sync_replication_slots</varname></link> allows for automatic > - periodic synchronization of failover slots, they can also be manually > ... snip ... > > I like the current documentation which separates the discussion of two > methods. I think we should just improve the second paragraph instead > of deleting it and changing the first one. > > * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC. > + * > > unnecessary blank line > Changed. > +/* > + * Flag used by pg_sync_replication_slots() > + * to do retries if the slot did not persist while syncing. > + */ > +static bool slot_persistence_pending = false; > > I don't think we need to keep a global variable for this. The variable > is used only inside SyncReplicationSlots() and the call depth is not > more than a few calls. From synchronize_slots(), before which the > variable is reset and after which the variable is checked, to > update_and_persist_local_synced_slot() which sets the variable, all > the functions return bool. All of them can be made to return an > integer status instead indicating the result of the operation. If we > do so we could check the return value of synchronize_slots() to decide > whether to retry or not, isntead of maintaining a global variable > which has a much wider scope than required. It's difficult to keep it > updated over the time. > The problem is that all those calls synchronize_slots() and update_and_persist_local_synced_slot() are shared with the slotsync worker logic and API. Hence, changing this will affect slotsync_worker logic as well. While the API needs to spefically retry only if the initial sync fails, the slotsync worker will always be retrying. I feel using a global variable is a more convenient way of doing this. > + * Parameters: > + * wrconn - Connection to the primary server > + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to > + * fetch all failover slots > * > - * Returns TRUE if any of the slots gets updated in this sync-cycle. > > Need to describe the return value. > Added. > */ > -static bool > -synchronize_slots(WalReceiverConn *wrconn) > +static List * > +fetch_remote_slots(WalReceiverConn *wrconn, List *target_slot_list) > > I like the way this function is broken down into two. That breaking down is > useful even without this feature. > > - /* Construct the remote_slot tuple and synchronize each slot locally */ > + /* Process the slot information */ > > Probably these comments don't make much sense or they repeat what's > already there in the function prologue. > > else > - /* Create list of remote slots */ > + /* Add to updated list */ > > Probably these comments don't make much sense or they repeat what's > already there in the function prologue. > Removed these comments. > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated) > > The function is too cute to be useful. The code should be part of > ReplSlotSyncWorkerMain() just like other worker's main functions. > But this wouldn't be part of this feature. > void > SyncReplicationSlots(WalReceiverConn *wrconn) > { > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn)); > { > > Shouldn't this function call CheckForInterrupts() somewhere in the > loop since it could be potentially an infinite loop? I've tested this and I see that interrupts are being handled by sending SIGQUIT and SIGINT to the backend process. Attaching v10 with the above changes. regards, Ajin Cerian Fujitsu Australia
Attachment
pgsql-hackers by date: