Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date
Msg-id CAA4eK1KyHfVOVeio28p8CHDnuyKuej78cj_7U9mHAB4ictVQwQ@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Thu, Aug 10, 2023 at 10:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m.melihmutlu@gmail.com>  wrote:
> >
> > > Right. I attached the v26 as you asked.
> >
> > Thanks for posting the patches.
> >
> > While reviewing the patch, I noticed one rare case that it's possible that there
> > are two table sync worker for the same table in the same time.
> >
> > The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
> > apply worker will start a new worker after releasing the lock. So, at the point[1]
> > where the lock is released and the new table sync worker has not been started,
> > it seems possible that another old table sync worker will be reused for the
> > same table.
> >
> >                                 /* Now safe to release the LWLock */
> >                                 LWLockRelease(LogicalRepWorkerLock);
> > *[1]
> >                                 /*
> >                                  * If there are free sync worker slot(s), start a new sync
> >                                  * worker for the table.
> >                                  */
> >                                 if (nsyncworkers < max_sync_workers_per_subscription)
> >                                 ...
> >                                                 logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> >
>
> Yeah, this is a problem. I think one idea to solve this is by
> extending the lock duration till we launch the tablesync worker but we
> should also consider changing this locking scheme such that there is a
> better way to indicate that for a particular rel, tablesync is in
> progress. Currently, the code in TablesyncWorkerMain() also acquires
> the lock in exclusive mode even though the tablesync for a rel is in
> progress which I guess could easily heart us for larger values of
> max_logical_replication_workers. So, that could be another motivation
> to think for a different locking scheme.
>

Yet another problem is that currently apply worker maintains a hash
table for 'last_start_times' to avoid restarting the tablesync worker
immediately upon error. The same functionality is missing while
reusing the table sync worker. One possibility is to use a shared hash
table to remember start times but I think it may depend on what we
decide to solve the previous problem reported by Hou-San.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: Fabien COELHO
Date:
Subject: Re: pgbench with libevent?