Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Synchronizing slots from primary to standby
Msg-id CAA4eK1KYAjkRRSV-NAfDq=4GyHpd2igs_8se0uW-LNEF2RbaRA@mail.gmail.com
In response to Re: Synchronizing slots from primary to standby  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Tue, Feb 6, 2024 at 3:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > ---
> > > > > > Since two processes (e.g. the slotsync worker and
> > > > > > pg_sync_replication_slots()) can concurrently fetch and update the
> > > > > > slot information, there is a race condition where the slot's
> > > > > > confirmed_flush_lsn goes backward.
> > > > > >
> > > > >
> > > > > Right, this is possible, though it shouldn't be a problem because
> > > > > slotsync is an async process anyway. As long as we hold restart_lsn,
> > > > > the required WAL won't be removed. Having said that, I can think of
> > > > > two ways to avoid it: (a) We can have a flag in shared memory with
> > > > > which we can detect whether any other process is doing slot
> > > > > synchronization, and then either error out at that time, simply
> > > > > wait, or perhaps take a nowait kind of parameter from the user to
> > > > > decide what to do. If this is feasible, we can simply error out in
> > > > > the first version and extend it later if we see any use cases for
> > > > > it. (b) Similar to restart_lsn, raise an error if confirmed_flush_lsn
> > > > > is about to move backward. This is good for now, but in the future
> > > > > we may still hit another similar issue. So I would prefer (a) of the
> > > > > two, but I am fine if you prefer (b) or have some other idea, such as
> > > > > just noting in comments that this is a harmless case that can happen
> > > > > only very rarely.
> > > >
> > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same
> > > > issue still happens for other fields.
> > >
> > > I agree that (a) looks better. On a separate note, while looking at
> > > this API, pg_sync_replication_slots(PG_FUNCTION_ARGS), shouldn't there
> > > be an optional parameter to specify one slot, multiple slots, or all
> > > slots (the default)? That would give better control to the user, no?
> > >
> >
> > As of now, we want to provide functionality similar to the slotsync
> > worker, with the difference that users can use this new function for
> > planned switchovers. So, it syncs all failover slots by default. I
> > think if there is a use case to selectively sync some of the failover
> > slots, then we can probably extend this function and the slotsync
> > worker as well. Normally, if the primary goes down for whatever
> > reason, users would want to restart the replication for all the
> > defined publications via the existing failover slots. Why would anyone
> > want to do it partially?
>
> If we consider the usability of such a function (I mean as it is
> implemented now, without any argument), one use case could be when
> the slot sync worker is not keeping up, or when at some point the user
> doesn't want to wait for the worker and would rather do the sync
> themselves.
>

Possibly, but I was imagining that it would be used for planned
switchover cases and also for testing the core slot sync functionality
in our TAP tests.
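
The race Masahiko describes earlier in the thread, and what option (b) would catch, can be replayed deterministically. This is a standalone sketch, not PostgreSQL code: LocalSlot, apply_unchecked, apply_checked and replay_race are hypothetical names, and XLogRecPtr is redefined locally as a plain 64-bit integer. It also shows Masahiko's caveat only implicitly: the check guards confirmed_flush alone, so any other field copied in the same way could still regress.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for PostgreSQL's XLogRecPtr, a 64-bit WAL position. */
typedef uint64_t XLogRecPtr;

/* Hypothetical, stripped-down view of a synced slot on the standby. */
typedef struct LocalSlot
{
    XLogRecPtr confirmed_flush;
} LocalSlot;

/* Behaviour exposed by the race: copy whatever value was fetched. */
static void apply_unchecked(LocalSlot *slot, XLogRecPtr fetched)
{
    slot->confirmed_flush = fetched;
}

/*
 * Option (b): refuse to move confirmed_flush backward.
 * Returns 0 if applied, -1 if rejected (real code would raise an ERROR).
 */
static int apply_checked(LocalSlot *slot, XLogRecPtr fetched)
{
    if (fetched < slot->confirmed_flush)
        return -1;
    slot->confirmed_flush = fetched;
    return 0;
}

/*
 * Replays one bad interleaving of two syncing processes:
 *   1. the worker fetches confirmed_flush = 100 from the primary;
 *   2. the primary advances, and the SQL function fetches 200;
 *   3. the SQL function applies 200 locally;
 *   4. the worker applies its stale 100.
 * Returns the local confirmed_flush left behind; *rejected counts the
 * updates refused by the option (b) check.
 */
static XLogRecPtr replay_race(int use_check, int *rejected)
{
    LocalSlot  slot = { .confirmed_flush = 50 };
    XLogRecPtr worker_fetched = 100;   /* step 1 */
    XLogRecPtr func_fetched = 200;     /* step 2 */

    *rejected = 0;
    if (use_check)
    {
        if (apply_checked(&slot, func_fetched) != 0)    /* step 3 */
            (*rejected)++;
        if (apply_checked(&slot, worker_fetched) != 0)  /* step 4 */
            (*rejected)++;
    }
    else
    {
        apply_unchecked(&slot, func_fetched);    /* step 3 */
        apply_unchecked(&slot, worker_fetched);  /* step 4: goes backward */
    }
    return slot.confirmed_flush;
}
```

Without the check the slot ends at 100 after having reached 200; with it, the stale update is refused and the slot stays at 200.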

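Option (a), the approach both Masahiko and Dilip preferred, amounts to a single "sync in progress" flag that a caller must claim before fetching and updating slots, erroring out if it is already taken. The sketch below assumes a C11 atomic in one process purely for illustration; sync_in_progress, try_begin_slot_sync, end_slot_sync and sync_slots_once are hypothetical names, and a real implementation would keep the flag in shared memory and also have to clear it when the holder errors out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a "sync in progress" flag in shared memory. */
static atomic_bool sync_in_progress;   /* static storage: starts false */

/* Claim the flag; returns false if another caller already holds it. */
static bool try_begin_slot_sync(void)
{
    bool expected = false;

    /* Atomically flip false -> true; fails if the flag is already true. */
    return atomic_compare_exchange_strong(&sync_in_progress, &expected, true);
}

/* Release the flag once the sync cycle is finished. */
static void end_slot_sync(void)
{
    atomic_store(&sync_in_progress, false);
}

/*
 * Hypothetical entry point shared by the worker and the SQL function.
 * Returns 0 after one sync cycle, or -1 if someone else is already
 * syncing (the "error out" choice for a first version; waiting or a
 * nowait-style parameter could be layered on later).
 */
static int sync_slots_once(int *cycles_run)
{
    if (!try_begin_slot_sync())
        return -1;
    (*cycles_run)++;   /* fetch remote slot info and update local slots here */
    end_slot_sync();
    return 0;
}
```

Because only one caller can hold the flag at a time, the fetch-then-update sequence can no longer interleave, which covers every slot field rather than just confirmed_flush_lsn.
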
> So, given such functionality, it would be even better to extend it to
> selectively sync slots. For example, if syncing all slots runs into
> some issue (maybe a bug, or it takes a long time because there are a
> lot of slots) but the user needs to fail over quickly and is
> interested in only a couple of slots, then such an option could be
> helpful, no?
>

I see your point, but I am not sure how useful it would be in the
field. I am fine with it if others also think such a parameter would be
useful, and anyway, I think we can extend it even after v1 is done.

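The selective-sync idea debated above (all failover slots by default, or only a named subset) comes down to a filter applied to the slot list fetched from the primary. The function in the thread takes no such argument; the slot-name filter below is a hypothetical sketch, and RemoteSlot, should_sync_slot and count_slots_to_sync are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified description of a slot reported by the primary. */
typedef struct RemoteSlot
{
    const char *name;
    bool        failover;   /* only failover slots are sync candidates */
} RemoteSlot;

/*
 * With no names given (names == NULL or nnames == 0), every failover slot
 * is synced, matching the v1 behaviour discussed here; otherwise only the
 * listed failover slots are synced.
 */
static bool should_sync_slot(const RemoteSlot *slot,
                             const char *const *names, size_t nnames)
{
    if (!slot->failover)
        return false;
    if (names == NULL || nnames == 0)
        return true;
    for (size_t i = 0; i < nnames; i++)
        if (strcmp(slot->name, names[i]) == 0)
            return true;
    return false;
}

/* Count how many of the fetched slots would be synced. */
static size_t count_slots_to_sync(const RemoteSlot *slots, size_t nslots,
                                  const char *const *names, size_t nnames)
{
    size_t n = 0;

    for (size_t i = 0; i < nslots; i++)
        if (should_sync_slot(&slots[i], names, nnames))
            n++;
    return n;
}
```

Keeping the "no argument means all failover slots" default lets such a parameter be added after v1 without changing existing callers, and naming a non-failover slot still syncs nothing for it.
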
--
With Regards,
Amit Kapila.