Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION - Mailing list pgsql-hackers
From | Petr Jelinek |
---|---|
Subject | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION |
Date | |
Msg-id | b95a2aa4-11a0-d211-1e9b-6545261b265a@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION (Fujii Masao <masao.fujii@gmail.com>) |
Responses |
Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION
|
List | pgsql-hackers |
On 06/02/17 17:33, Fujii Masao wrote: > On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek > <petr.jelinek@2ndquadrant.com> wrote: >> On 03/02/17 19:38, Fujii Masao wrote: >>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com> >>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier >>>>>> <michael.paquier@gmail.com> wrote: >>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: >>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that >>>>>>>>> some functions called from there can throw exceptions. >>>>>>>> >>>>>>>> Yes, but all LWLocks should be released by normal error recovery. >>>>>>>> It should not be necessary for this code to clean that up by hand. >>>>>>>> If it were necessary, there would be TRY-CATCH around every single >>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and >>>>>>>> unmaintainable system. Please don't add a TRY-CATCH unless it's >>>>>>>> *necessary* -- and you haven't explained why this one is. >>>>>> >>>>>> Yes. >>>>> >>>>> Thank you for the suggestion. I minunderstood that. >>>>> >>>>>>> Putting hands into the code and at the problem, I can see that >>>>>>> dropping a subscription on a node makes it unresponsive in case of a >>>>>>> stop. And that's just because calls to LWLockRelease are missing as in >>>>>>> the patch attached. A try/catch problem should not be necessary. >>>>>> >>>>>> Thanks for the patch! >>>>>> >>>>>> With the patch, LogicalRepLauncherLock is released at the end of >>>>>> DropSubscription(). But ISTM that the lock should be released just after >>>>>> logicalrep_worker_stop() and there is no need to protect the removal of >>>>>> replication slot with the lock. >>>>> >>>>> That's true. logicalrep_worker_stop returns after confirmig that >>>>> worker->proc is cleard, so no false relaunch cannot be caused. >>>>> After all, logicalrep_worker_stop is surrounded by >>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton >>>>> and make the lock secrion to be more narrower. >>> >>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be >>> removed and the comment for logicalrep_worker_stop() should be updated. >>> >>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock >>> while holding LogicalRepLauncherLock. OTOH, with your approach, >>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding >>> LogicalRepWorkerLock. >>> >>> Therefore I pushed the simple patch which adds LWLockRelease() just after >>> logicalrep_worker_stop(). >>> >>> Another problem that I found while reading the code is that the launcher can >>> start up the worker with the subscription that DROP SUBSCRIPTION just removed. >>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription, >>> but the launcher can see it and start new worker until the transaction for >>> DROP has been committed. >>> >> >> That was the reason why DropSubscription didn't release the lock in the >> first place. It was supposed to be released at the end of the >> transaction though. > > OK, I understood why you used the lock in that way. But using LWLock > for that purpose is not valid. > Yeah, I just tried to avoid what we are doing now really hard :) >>> To fix this issue, I think that DROP SUBSCRIPTION should take >>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock, >>> so that the launcher cannot see the entry to be being removed. >>> >> >> The whole point of having LogicalRepLauncherLock is to avoid having to >> do this, so if we do this we could probably get rid of it. > > Yes, let's remove LogicalRepLauncherLock and lock pg_subscription > with AccessExclusive mode at the beginning of DROP SUBSCRIPTION. > Attached patch does this. > Okay, looks reasonable to me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: