Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 |
Date | |
Msg-id | CAA4eK1KwyMUn7DRRLJsSTws7HPs0DA45hH8dHL0gD817SEtjAA@mail.gmail.com Whole thread Raw |
In response to | Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
|
List | pgsql-bugs |
On Thu, Jan 19, 2023 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 19, 2023 at 3:46 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Thursday, January 19, 2023 3:14 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > On Wednesday, January 18, 2023 12:32 PM Amit Kapila > > > <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Jan 18, 2023 at 1:34 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > > > On 2023-01-17 06:23:45 +0530, Amit Kapila wrote: > > > > > > > > > > > There is an analysis of the test > > > > > > failure in the email [2] which explains the race condition that > > > > > > leads to test failure. Thinking again about the failure, I feel we > > > > > > can instead change the failed test (t/004_sync.pl) to either > > > > > > ensure that both the walsenders (corresponding to sync worker and > > > > > > apply > > > > > > worker) exits after dropping the subscription and before checking > > > > > > the remaining slots on publisher or wait for slots to become zero > > > > > > in the test. > > > > > > > > > > How about waiting for the table to start to be synced (and thus the > > > > > slot to be > > > > > created) before issuing the drop subscription? > > > > > > > > > > > > > In this test [1], the initial sync fails due to a unique constraint > > > > violation, so checking that the sync has started is a bit tricky. We > > > > can probably check sync_error_count in pg_stat_subscription_stats to > > > > ensure that sync has started to fail which will ideally ensure that > > > > the sync has started. I am not sure this would be completely safe. The > > > > other possible ways are (a) after creating a subscription, wait for > > > > two slots to get created in the publisher, and then after dropping > > > > subscription wait for slots to become zero on the publisher; (b) after dropping > > > the subscription, wait for slots to become zero. > > > > > > > > I think one of (a) or (b) will work. > > > > > > I think in the mentioned testcase, the tablesync worker will keep restarting which > > > means the table sync slot is also being dropped and re-created ... . So, (a) waiting > > > for two slots to get created might not work as the slot will get dropped soon. I > > > think (b) waiting for slot to become zero would be a simpler way to make the test > > > stable. And here are the patches that tries to do it for all affected branches. > > > > When testing the patches on back-branches, I find that the reported deadlock > > problem doesn't happen on PG14 and rather start from PG15 after commit 4eb2176 > > which introduces the WaitForProcSignalBarrier logic in dropdb(). After this > > commit, when executing the bug-reproduction sql script, the DROP DATABASE will > > wait for the table sync worker to accept ProcSignalBarrier which can cause the > > reported deadlock problem. > > > > But I think it's still worth fixing this on PG14 as well, as it would be better > > to allow termination when executing command over network. > > > > I also think that this should be backpatched in PG14 as well. > I have verified it and it looks good to me. I'll push and backpatch this till PG14 on Tuesday unless there are more comments/suggestions. -- With Regards, Amit Kapila.
pgsql-bugs by date: