RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs

From houzj.fnst@fujitsu.com
Subject RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date
Msg-id OS0PR01MB57161F9E9C15E73012FC192C94C49@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> On Wednesday, January 18, 2023 12:32 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote:
> > >
> > > > There is an analysis of the test
> > > > failure in the email [2] which explains the race condition that
> > > > leads to test failure. Thinking again about the failure, I feel we
> > > > can instead change the failed test (t/004_sync.pl) to either
> > > > ensure that both the walsenders (corresponding to sync worker and
> > > > apply
> > > > worker) exits after dropping the subscription and before checking
> > > > the remaining slots on publisher or wait for slots to become zero
> > > > in the test.
> > >
> > > How about waiting for the table to start to be synced (and thus the
> > > slot to be
> > > created) before issuing the drop subscription?
> > >
> >
> > In this test [1], the initial sync fails due to a unique constraint
> > violation, so checking that the sync has started is a bit tricky. We
> > can probably check sync_error_count in pg_stat_subscription_stats to
> > ensure that sync has started to fail which will ideally ensure that
> > the sync has started. I am not sure this would be completely safe. The
> > other possible ways are (a) after creating a subscription, wait for
> > two slots to get created in the publisher, and then after dropping
> > subscription wait for slots to become zero on the publisher; (b) after
> > dropping the subscription, wait for slots to become zero.
> >
> > I think one of (a) or (b) will work.
> 
> I think in the mentioned test case, the tablesync worker will keep restarting, which
> means the table sync slot is also being dropped and re-created ... So, (a) waiting
> for two slots to get created might not work, as the slot will get dropped soon. I
> think (b) waiting for the slots to become zero would be a simpler way to make the test
> stable. Here are the patches that try to do it for all affected branches.
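
To illustrate option (b), here is a minimal sketch of the "wait for slots to become zero" idea. It is not taken from the patches. The names poll_until and slot_count are hypothetical, and slot_count stands in for a query that counts rows in pg_replication_slots on the publisher:

```python
import time


def poll_until(query, expected, timeout=5.0, interval=0.1):
    """Call query() repeatedly until it returns `expected` or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if query() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-in for counting the publisher's replication slots:
# the slots disappear one by one as the walsenders exit.
remaining = [2, 2, 1, 0]


def slot_count():
    # Consume the scripted values, then keep returning the last one.
    return remaining.pop(0) if len(remaining) > 1 else remaining[0]
```

Polling for the end state avoids depending on when each walsender exits, which is the race that makes the test unstable.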

When testing the patches on the back branches, I found that the reported
deadlock does not happen on PG14. It starts with PG15, after commit 4eb2176,
which introduced the WaitForProcSignalBarrier logic in dropdb(). After this
commit, when executing the bug-reproduction SQL script, DROP DATABASE waits
for the table sync worker to accept the ProcSignalBarrier, which can cause the
reported deadlock.

But I think it's still worth fixing this on PG14 as well, as it would be better
to allow termination when executing a command over the network.
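
As a toy model of why the barrier wait can get stuck (this is not PostgreSQL's implementation; the class Proc and both functions are hypothetical): the waiter can proceed only once every process has acknowledged the barrier, and a process blocked in a network wait that does not check for interrupts never acknowledges it.

```python
class Proc:
    """A process that either does or does not check interrupts while waiting."""

    def __init__(self, name, interruptible):
        self.name = name
        self.interruptible = interruptible
        self.acked_generation = 0

    def wait_on_network(self, current_generation):
        # Only an interruptible wait notices the pending barrier and
        # acknowledges it; a blocking wait leaves the old generation in place.
        if self.interruptible:
            self.acked_generation = current_generation


def barrier_absorbed(procs, generation):
    """True once every process has acknowledged this barrier generation."""
    return all(p.acked_generation >= generation for p in procs)


def drop_database_can_proceed(procs):
    generation = 1  # assumption: the drop emits a single barrier
    for p in procs:
        p.wait_on_network(generation)
    return barrier_absorbed(procs, generation)
```

In this model, a table sync worker created with interruptible=False keeps the drop waiting indefinitely, while the same worker with interruptible=True lets it proceed.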

Best regards,
Hou zj
