Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From shveta malik
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAJpy0uD8J-Ro7hJXhQ_i8z9RVaK_qRjvOXG5wE-XOupyD-HKDg@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Mon, Sep 11, 2023 at 9:49 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Sep 8, 2023 at 4:40 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Shveta,
> >
> > I resumed to check the thread. Here are my high-level comments.
> > Sorry if you have been already discussed.
>
> Thanks Kuroda-san for the feedback.
> >
> > 01. General
> >
> > I think the documentation can be added, not only GUCs. How about adding examples
> > for combinations of physical and logical replications?  You can say that both of
> > physical primary can be publisher and slots on primary/standby are synchronized.
> >
>
> I did not fully understand this. Can you please state a clear example.
> We are only synchronizing logical replication slots in this draft and
> that too on physical standby from primary. So the last statement is
> not completely true.
>
> > 02. General
> >
> > standby_slot_names ensures that physical standby is always ahead subscriber, but I
> > think it may be not sufficient. There is a possibility that primary server does
> > not have any physical slots.So it expects a slot to be present.
> > In this case the physical standby may be behind the
> > subscriber and the system may be confused when the failover is occured.
>
> Currently there is a check in slot-sync worker which mandates that
> there is a physical slot present between primary and standby for this
> feature to proceed.So that confusion state will not arise.
> + /* WalRcvData is not set or primary_slot_name is not set yet */
> + if (!WalRcv || WalRcv->slotname[0] == '\0')
> + return naptime;
>
> >Can't we specify the name of standby via application_name or something?
>
> So do you mean that in absence of a physical slot (if we plan to
> support that), we let primary know about standby(slots-synchronization
> client) through application_name? I am not sure about this. Will think
> more on this. I would like to know others' opinion on this as well.
>
> >
> > 03. General
> >
> > In this architecture, the syncslot worker is launched per db and they
> > independently connects to primary, right?
>
> Not completely true. Each slotsync worker is responsible for managing
> N dbs. Here 'N' =  'Number of distinct dbs for slots in
> synchronize_slot_names'/ 'number of max_slotsync_workers configured'
> for cases where dbcount exceeds workers configured.
> And if dbcount < max_slotsync_workers, then we launch only that many
> workers equal to dbcount and each worker manages a single db. Each
> worker independently connects to primary. Currently it makes a
> connection multiple times, I am optimizing it to make connection only
> once and then after each SIGHUP assuming 'primary_conninfo' may
> change. This change will be in the next version.
>
>
> >I'm not sure it is efficient, but I
> > come up with another architecture - only a worker (syncslot receiver)connects
> > to the primary and other workers (syncslot worker) receives infos from it and
> > updates. This can reduce the number of connections so that it may slightly
> > improve the latency of network. How do you think?
> >
>
> I feel it may help in reducing network latency, but not sure if it
> could be more efficient in keeping the lsns in sync. I feel it may
> introduce lag due to the fact that only one worker is getting all the
> info from primary and the actual synchronizing workers are waiting on
> that worker. This lag may be more when the number of slots are huge.
> We have run some performance tests on the design implemented
> currently, please have a look at emails around [1] and [2].
>
> > 04. General
> >
> > test file recovery/t/051_slot_sync.pl is missing.
> >
>
> yes, it was removed. Please see point3 at [3]
>
>
> > 04. ReplSlotSyncMain
> >
> > Does the worker have to connect to the specific database?
> >
> >
> > ```
> >         /* Connect to our database. */
> >         BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid,
> >                                                                                           MySlotSyncWorker->userid,
> >                                                                                           0);
> > ```
>
> Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec'
> to connect to primary, this needs database connection. It errors out
> in the absence of 'MyDatabaseId'. Do you think db-connection can have
> some downsides?
>
> >
> > 05. SlotSyncInitSlotNamesLst()
> >
> > "Lst" should be "List".
> >
>
> Okay, I will change this in the next version.
>
> ==========
>
> [1]: https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com
> [2]: https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com
> [3]: https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com
>
> thanks
> Shveta


PFA  v17. It has below changes:

1) There was a common portion between SlotSync worker and LogicalRep
worker structures. The common portion is now moved to WorkerHeader.
The common functions are merged.
2) Connection to primary is made once in the beginning in both
slotSync worker as well as launcher. Earlier it was before each sync
cycle.
3) SpinLock Removed. Earlier LWlock was used for shared-memory access
by workers and then there was extra Spinlock for dbids access in DSM,
which is removed now. LWLock alone seems enough to maintain the
consistency.
4) In 'alter system standby_slot_names', we can not give non-existing
slot-names or logical slots now. Earlier it was accepting everything.
This specific change is in patch1, rest in patch2.

Thanks Ajin for working on 1.

Next, I plan to review patch01 and the existing feedback about it.
Until now focus was patch02.

thanks
Shveta

Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Have better wording for snapshot file reading failure
Next
From: shveta malik
Date:
Subject: Re: Synchronizing slots from primary to standby