Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uAwcGyo7ABebhb3LnmnmQhJiyFasQ11UADERo=VPY=J6g@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses Re: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Fri, Aug 29, 2025 at 2:20 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > 4)
> > > In the worker, before each call to synchronize_slots(), we are
> > > starting a new transaction. It aligns with the previous implementation
> > > where StartTransaction was inside synchronize_slots(). But in API, we
> > > are doing StartTransaction once outside of the loop instead of doing
> > > before each synchronize_slots(), is it intentional? It may keep the
> > > transaction open for a long duration for the case where slots are not
> > > getting persisted soon.
> > >
> >
> > I’ll address your other comments separately, but I wanted to respond
> > to this one first. I did try the approach you suggested, but the issue
> > is that we use the remote_slots list across loop iterations. If we end
> > the transaction at the end of each iteration, the list gets freed and
> > is no longer available for the next pass. Each iteration relies on the
> > remote_slots list from the previous one to build the new list, which
> > is why we can’t free it inside the loop.
>
> Isn't that just a matter of allocating the list in appropriate long
> lived memory context?
>

+1. Since we're reallocating the list each time we fetch it from the
remote server, it's not suitable for long-lived memory storage.
Instead, should we extract the slot names during the initial fetch of
failover slots and store them in the appropriate memory context? This
extraction would only need to happen only when
slot_persistence_pending is true during the first sync cycle.

> Here are some more comments
> <para>
> - The logical replication slots on the primary can be synchronized to
> - the hot standby by using the <literal>failover</literal> parameter of
> + The logical replication slots on the primary can be enabled for
> + synchronization to the hot standby by using the
> + <literal>failover</literal> parameter of
>
> This change corresponds to an existing feature. Should be a separate
> patch, which we may want to backport.
>
> - on the standby, the failover slots can be synchronized periodically in
> + <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
> + synchronization can be be performed either manually by calling
> + <link linkend="pg-sync-replication-slots">
> ... snip ...
> - <note>
> - <para>
> - While enabling <link linkend="guc-sync-replication-slots">
> - <varname>sync_replication_slots</varname></link> allows for automatic
> - periodic synchronization of failover slots, they can also be manually
> ... snip ...
>
> I like the current documentation which separates the discussion of two
> methods. I think we should just improve the second paragraph instead
> of deleting it and changing the first one.
>
> * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
> + *
>
> unnecessary blank line
>
> +/*
> + * Flag used by pg_sync_replication_slots()
> + * to do retries if the slot did not persist while syncing.
> + */
> +static bool slot_persistence_pending = false;
>
> I don't think we need to keep a global variable for this. The variable
> is used only inside SyncReplicationSlots() and the call depth is not
> more than a few calls. From synchronize_slots(), before which the
> variable is reset and after which the variable is checked, to
> update_and_persist_local_synced_slot() which sets the variable, all
> the functions return bool. All of them can be made to return an
> integer status instead indicating the result of the operation. If we
> do so we could check the return value of synchronize_slots() to decide
> whether to retry or not, isntead of maintaining a global variable
> which has a much wider scope than required. It's difficult to keep it
> updated over the time.
>
> + * Parameters:
> + * wrconn - Connection to the primary server
> + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
> + * fetch all failover slots
> *
> - * Returns TRUE if any of the slots gets updated in this sync-cycle.
>
> Need to describe the return value.
>
> */
> -static bool
> -synchronize_slots(WalReceiverConn *wrconn)
> +static List *
> +fetch_remote_slots(WalReceiverConn *wrconn, List *target_slot_list)
>
> I like the way this function is broken down into two. That breaking down is
> useful even without this feature.
>
> - /* Construct the remote_slot tuple and synchronize each slot locally */
> + /* Process the slot information */
>
> Probably these comments don't make much sense or they repeat what's
> already there in the function prologue.
>
> else
> - /* Create list of remote slots */
> + /* Add to updated list */
>
> Probably these comments don't make much sense or they repeat what's
> already there in the function prologue.
>
> @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
>
> The function is too cute to be useful. The code should be part of
> ReplSlotSyncWorkerMain() just like other worker's main functions.
>

I was thinking we can retain wait_for_slot_activity() as this can even
be invoked from API flow. See my comment# 2 in [1]

[1]: https://www.postgresql.org/message-id/CAJpy0uASzojKbzinpNu29xuYGsSRnSo%3D22CLhXaSt_43TVoBhQ%40mail.gmail.com

thanks
Shveta



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance
Next
From: Chao Li
Date:
Subject: Re: [PATCH] Generate random dates/times in a specified range