Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uASzojKbzinpNu29xuYGsSRnSo=22CLhXaSt_43TVoBhQ@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (Ajin Cherian <itsajin@gmail.com>)
Responses Re: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Fri, Aug 22, 2025 at 3:44 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > > I've removed them.
> > > Attaching patch v8 addressing the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > When the API is in progress, and meanwhile in another session we turn
> > off hot_standby_feedback, the API session terminates abnormally.
> >
> > postgres=# SELECT pg_sync_replication_slots();
> > server closed the connection unexpectedly
> >
> > It seems slotsync_reread_config() is not adjusted for API. It does
> > proc_exit assuming it is a worker process.
> >
>
> I've removed the API calling ProcessSlotSyncInterrupts() as I don't
> think the API needs to specifically handle shutdown requests, it works
> without calling this. And the other config checks, I don't think the
> API needs to handle, I think we should leave it to the user.
>
> > 2)
> > slotsync_worker_onexit():
> >
> >         SlotSyncCtx->slot_persistence_pending = false;
> >
> >         /*
> >          * If syncing_slots is true, it indicates that the process errored out
> >          * without resetting the flag. So, we need to clean up shared memory and
> >          * reset the flag here.
> >          */
> >         if (syncing_slots)
> >         {
> >                 SlotSyncCtx->syncing = false;
> >                 syncing_slots = false;
> >         }
> >
> > Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
> > slot_persistence_pending can not be true without syncing_slots being
> > true.
> >
> > 3)
> > reset_syncing_flag():
> >
> >   SpinLockAcquire(&SlotSyncCtx->mutex);
> >   SlotSyncCtx->syncing = false;
> > + SlotSyncCtx->slot_persistence_pending = false;
> >   SpinLockRelease(&SlotSyncCtx->mutex);
> >
> > Here we are changing slot_persistence_pending under mutex, while at
> > other places, it is not protected by mutex. Is it intentional here?
> >
> > 4)
> > On rethinking, we maintain anything in shared memory if it has to be
> > shared between a few processes. 'slot_persistence_pending' OTOH is
> > required to be set and accessed by only one process at a time. Shall
> > we move it out of SlotSyncCtxStruct and keep it static similar to
> > 'syncing_slots'? Rest of the setting, resetting flow remains the same.
> > What do you think?
> >
>
> Yes, I agree. I have modified it accordingly.
>
> >
> > 5)
> > /* Build the query based on whether we're fetching all or refreshing
> > specific slots */
> >
> > Perhaps we can shift this comment to where we actually append
> > target_slot_list. Better to have it something like:
> > 'If target_slot_list is provided, construct the query only to fetch given slots'
> >
>
> Changed.
>
> Attaching patch v9 addressing the above comments.
>

Thank You for the patches. Please find a few comments.

1)
Change not needed:

  * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
+ *
  */

2)
Regarding the naptime in API, I was thinking why not to use
wait_for_slot_activity() directly? If there are no slots activity, it
will keep on doubling the naptime starting from 2sec till max of
30sec. Thoughts?

We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and
MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and
MAX_SLOTSYNC_NAPTIME_MS in such a case.

3)
+ *   target_slot_list - List of RemoteSlot structures to refresh, or NIL to
+ *                      fetch all failover slots

Can we please change it to:

List of failover logical slots to fetch from primary, or NIL to fetch
all failover logical slots

4)
In the worker, before each call to synchronize_slots(), we are
starting a new transaction. It aligns with the previous implementation
where StartTransaction was inside synchronize_slots(). But in API, we
are doing StartTransaction once outside of the loop instead of doing
before each synchronize_slots(), is it intentional? It may keep the
transaction open for a long duration for the case where slots are not
getting persisted soon.

5)
With ProcessSlotSyncInterrupts() being removed from API, can you
please check the behaviour of API on smart-shutdown and rest of the
shutdown modes? It should behave like other APIs. And what happens if
I change primary_conninfo to some non-existing server when the API is
running. Does it error out or keep retrying?

thanks
Shveta



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Next
From: Peter Eisentraut
Date:
Subject: Re: Remove unneeded cast in heap_xlog_lock.