Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs

From Amit Kapila
Subject Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date
Msg-id CAA4eK1KwyMUn7DRRLJsSTws7HPs0DA45hH8dHL0gD817SEtjAA@mail.gmail.com
Whole thread Raw
In response to Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
List pgsql-bugs
On Thu, Jan 19, 2023 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 3:46 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
> > > On Wednesday, January 18, 2023 12:32 PM Amit Kapila
> > > <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote:
> > > > >
> > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote:
> > > > >
> > > > > > There is an analysis of the test
> > > > > > failure in the email [2] which explains the race condition that
> > > > > > leads to test failure. Thinking again about the failure, I feel we
> > > > > > can instead change the failed test (t/004_sync.pl) to either
> > > > > > ensure that both the walsenders (corresponding to sync worker and
> > > > > > apply
> > > > > > worker) exits after dropping the subscription and before checking
> > > > > > the remaining slots on publisher or wait for slots to become zero
> > > > > > in the test.
> > > > >
> > > > > How about waiting for the table to start to be synced (and thus the
> > > > > slot to be
> > > > > created) before issuing the drop subscription?
> > > > >
> > > >
> > > > In this test [1], the initial sync fails due to a unique constraint
> > > > violation, so checking that the sync has started is a bit tricky. We
> > > > can probably check sync_error_count in pg_stat_subscription_stats to
> > > > ensure that sync has started to fail which will ideally ensure that
> > > > the sync has started. I am not sure this would be completely safe. The
> > > > other possible ways are (a) after creating a subscription, wait for
> > > > two slots to get created in the publisher, and then after dropping
> > > > subscription wait for slots to become zero on the publisher; (b) after dropping
> > > the subscription, wait for slots to become zero.
> > > >
> > > > I think one of (a) or (b) will work.
> > >
> > > I think in the mentioned testcase, the tablesync worker will keep restarting which
> > > means the table sync slot is also being dropped and re-created ... . So, (a) waiting
> > > for two slots to get created might not work as the slot will get dropped soon. I
> > > think (b) waiting for slot to become zero would be a simpler way to make the test
> > > stable. And here are the patches that tries to do it for all affected branches.
> >
> > When testing the patches on back-branches, I find that the reported deadlock
> > problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176
> > which introduces the WaitForProcSignalBarrier logic in dropdb(). After this
> > commit, when executing the bug-reproduction sql script, the DROP DATABASE will
> > wait for the table sync worker to accept ProcSignalBarrier which can cause the
> > reported deadlock problem.
> >
> > But I think it's still worth fixing this on PG14 as well, as it would be better
> > to allow termination when executing command over network.
> >
>
> I also think that this should be backpatched in PG14 as well.
>

I have verified it and it looks good to me. I'll push and backpatch
this till PG14 on Tuesday unless there are more comments/suggestions.

-- 
With Regards,
Amit Kapila.



pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes
Next
From: Noah Misch
Date:
Subject: Re: BUG #17753: pg_dump --if-exists bug