Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uAxf0PS1atgTYuk=HU91adS17uUC_VJC6DXDmEFt9ziAg@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Fri, Jul 18, 2025 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 18, 2025 at 10:45 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, Jul 18, 2025 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > > > >
> > > > > > I am not able to apply the patch to the latest head or even to a week
> > > > > > back version. Can you please check and rebase?
> > > > > >
> > > > > > thanks
> > > > > > Shveta
> > > > >
> > > > > Rebased.
> > > > >
> > > >
> > > > Thanks. Please find a few comments:
> > > >
> > > >
> > > > 1)
> > > > /* Any slot with NULL in these fields should not have made it this far */
> > > >
> > > > It is good to get rid of the case where we had checks for NULL
> > > > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
> > > > state), as that has already been checked by synchronize_slots() and
> > > > such a slot will not even reach wait_for_primary_slot_catchup(). But a
> > > > slot can still be invalidated on primary anytime, and thus during this
> > > > wait, we should check for primary's invalidation as we were doing in
> > > > v1.
> > > >
> > > > 2)
> > > > + * If in SQL API synchronization, and we've been  promoted, then no point
> > > >
> > > >  extra space before promoted.
> > > >
> > > > 3)
> > > >
> > > > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
> > > >
> > > > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
> > > > checked at the beginning of this function.
> > > >
> > > > 4)
> > > > + ereport(WARNING,
> > > > + errmsg("aborting sync for slot \"%s\"",
> > > > + remote_slot->name),
> > > > + errdetail("Promotion occurred before this slot was fully"
> > > > +   " synchronized."));
> > > > + pfree(cmd.data);
> > > > +
> > > > + return false;
> > > >
> > > > a) Please add an error-code.
> > > >
> > > > b) Shall we change msg to
> > > >
> > > > errmsg("aborting sync for slot \"%s\"",
> > > > remote_slot->name),
> > > >                     errhint("%s cannot be executed once promotion is
> > > > triggered.",
> > > >
> > > > "pg_sync_replication_slots()")));
> > > >
> > > >
> > > >
> > > > 5)
> > > > Instead of using PromoteIsTriggered, shall we rely on
> > > > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
> > > >
> > > > 6)
> > > > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
> > > > sync_replication_slots on the standby is required" to make it same as
> > > > what we had prior to the patch I pointed earlier.
> > > >
> > > > Or better we can refine it to below. Thoughts?
> > > >
> > > > The logical replication slots on the primary can be enabled for
> > > > synchronization to the hot standby by using the failover parameter of
> > > > pg_create_logical_replication_slot, or by using the failover option of
> > > > CREATE SUBSCRIPTION during slot creation. After that, synchronization
> > > > can be performed either manually by calling pg_sync_replication_slots
> > > > on the standby, or automatically by enabling sync_replication_slots on
> > > > the standby. When sync_replication_slots is enabled, the failover
> > > > slots are periodically synchronized by the slot sync worker. For the
> > > > synchronization to work, .....
> > >
> > > I am wondering if we should provide an optional parameter to
> > > pg_sync_replication_slots(), to control whether to wait for the slot
> > > to be synced or just return with ERROR as it is doing now, default can
> > > be wait.
> > >
> >
> > Do you mean specifically in case of promotion or in general, as in do
> > not wait for primary to catch-up (anytime) and exit and drop the
> > temporary slot while exiting?
>
> I am specifically pointing to the exposed function
> pg_sync_replication_slots() which was earlier non-blocking and was
> giving an error if the primary slot for not catch-up, so we have
> improved the functionality in this thread by making it wait for
> primary to catch-up instead of just throwing an error.  But my
> question was since this is a user facing function so shall we keep the
> old behavior intact with some optional parameters so that if the user
> chooses not to wait they have options to do that?
>

Okay.  I see your point. Yes, it was non-blocking earlier but it was
not giving ERROR, it was just dumping in logilfe that primary is
behind and thus slot-sync could not be done.

If we continue using the non-blocking mode, there’s a risk that the
API may never successfully sync the slots. This is because it
eventually drops the temporary slot on exit, and when it tries to
create a new one later on subsequent call, it’s likely that the new
slot will again be ahead of the primary. This may happen if we have
continuous ongoing writes on the primary and the logical slot is not
being consumed at the same pace.

My preference would be to avoid including such an option as it is
confusing. With such an option in place, users may think that
slot-sync is completed while that may not be the case. But if it's
necessary for backward compatibility, it should be okay to provide it
as a non-default option as you suggested. Would like to know what
others think of this.

thanks
Shveta



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance
Next
From: Dilip Kumar
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance