Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs

From Amit Kapila
Subject Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date
Msg-id CAA4eK1+y2LdoG0dm_yAEb_p8Uik5eDZGzagMS9O8gdkrS8g7MA@mail.gmail.com
Whole thread Raw
In response to RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Thu, Jan 19, 2023 at 3:46 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > On Wednesday, January 18, 2023 12:32 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote:
> > > >
> > > > > There is an analysis of the test
> > > > > failure in the email [2] which explains the race condition that
> > > > > leads to test failure. Thinking again about the failure, I feel we
> > > > > can instead change the failed test (t/004_sync.pl) to either
> > > > > ensure that both the walsenders (corresponding to sync worker and
> > > > > apply
> > > > > worker) exits after dropping the subscription and before checking
> > > > > the remaining slots on publisher or wait for slots to become zero
> > > > > in the test.
> > > >
> > > > How about waiting for the table to start to be synced (and thus the
> > > > slot to be
> > > > created) before issuing the drop subscription?
> > > >
> > >
> > > In this test [1], the initial sync fails due to a unique constraint
> > > violation, so checking that the sync has started is a bit tricky. We
> > > can probably check sync_error_count in pg_stat_subscription_stats to
> > > ensure that sync has started to fail which will ideally ensure that
> > > the sync has started. I am not sure this would be completely safe. The
> > > other possible ways are (a) after creating a subscription, wait for
> > > two slots to get created in the publisher, and then after dropping
> > > subscription wait for slots to become zero on the publisher; (b) after dropping
> > the subscription, wait for slots to become zero.
> > >
> > > I think one of (a) or (b) will work.
> >
> > I think in the mentioned testcase, the tablesync worker will keep restarting which
> > means the table sync slot is also being dropped and re-created ... . So, (a) waiting
> > for two slots to get created might not work as the slot will get dropped soon. I
> > think (b) waiting for slot to become zero would be a simpler way to make the test
> > stable. And here are the patches that tries to do it for all affected branches.
>
> When testing the patches on back-branches, I find that the reported deadlock
> problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176
> which introduces the WaitForProcSignalBarrier logic in dropdb(). After this
> commit, when executing the bug-reproduction sql script, the DROP DATABASE will
> wait for the table sync worker to accept ProcSignalBarrier which can cause the
> reported deadlock problem.
>
> But I think it's still worth fixing this on PG14 as well, as it would be better
> to allow termination when executing command over network.
>

I also think that this should be backpatched in PG14 as well.

-- 
With Regards,
Amit Kapila.



pgsql-bugs by date:

Previous
From: "houzj.fnst@fujitsu.com"
Date:
Subject: RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Next
From: Tom Lane
Date:
Subject: Re: BUG #17753: pg_dump --if-exists bug