RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date
Msg-id OS0PR01MB5716C00FD3B8A0EFA51800909412A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Melih Mutlu <m.melihmutlu@gmail.com>)
Responses Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
List pgsql-hackers
On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m.melihmutlu@gmail.com>  wrote:

> Right. I attached the v26 as you asked. 

Thanks for posting the patches.
 
While reviewing the patch, I noticed one rare case that it's possible that there
are two table sync worker for the same table in the same time.

The patch relies on LogicalRepWorkerLock to prevent concurrent access, but the
apply worker will start a new worker after releasing the lock. So, at the point[1]
where the lock is released and the new table sync worker has not been started,
it seems possible that another old table sync worker will be reused for the
same table.

                /* Now safe to release the LWLock */
                LWLockRelease(LogicalRepWorkerLock);
*[1]
                /*
                 * If there are free sync worker slot(s), start a new sync
                 * worker for the table.
                 */
                if (nsyncworkers < max_sync_workers_per_subscription)
                ...
                        logicalrep_worker_launch(MyLogicalRepWorker->dbid,

I can reproduce it by using gdb.

Steps:
1. set max_sync_workers_per_subscription to 1 and setup pub/sub which publishes
   two tables(table A and B).
2. when the table sync worker for the table A started, use gdb to block it
   before being reused for another table.
3. set max_sync_workers_per_subscription to 2 and use gdb to block the apply
   worker at the point after releasing the LogicalRepWorkerLock and before
   starting another table sync worker for table B.
4. release the blocked table sync worker, then we can see the table sync worker
   is also reused for table B.
5. release the apply worker, then we can see the apply worker will start
   another table sync worker for the same table(B).

I think it would be better to prevent this case from happening as this case
will give some unexpected ERROR or LOG. Note that I haven't checked if it would
cause worse problems like duplicate copy or others.

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Andres Freund
Date:
Subject: Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?