Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION |
Date | |
Msg-id | CAHGQGwEzrbmmR-hNv4PrdTadyR=3xs1t6+VtbT9VLiLkBPePkA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Wed, Mar 8, 2017 at 9:05 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek >> <petr.jelinek@2ndquadrant.com> wrote: >>> On 06/02/17 17:33, Fujii Masao wrote: >>>> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek >>>> <petr.jelinek@2ndquadrant.com> wrote: >>>>> On 03/02/17 19:38, Fujii Masao wrote: >>>>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI >>>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com> >>>>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier >>>>>>>>> <michael.paquier@gmail.com> wrote: >>>>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: >>>>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that >>>>>>>>>>>> some functions called from there can throw exceptions. >>>>>>>>>>> >>>>>>>>>>> Yes, but all LWLocks should be released by normal error recovery. >>>>>>>>>>> It should not be necessary for this code to clean that up by hand. >>>>>>>>>>> If it were necessary, there would be TRY-CATCH around every single >>>>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and >>>>>>>>>>> unmaintainable system. Please don't add a TRY-CATCH unless it's >>>>>>>>>>> *necessary* -- and you haven't explained why this one is. >>>>>>>>> >>>>>>>>> Yes. >>>>>>>> >>>>>>>> Thank you for the suggestion. I minunderstood that. >>>>>>>> >>>>>>>>>> Putting hands into the code and at the problem, I can see that >>>>>>>>>> dropping a subscription on a node makes it unresponsive in case of a >>>>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in >>>>>>>>>> the patch attached. A try/catch problem should not be necessary. >>>>>>>>> >>>>>>>>> Thanks for the patch! >>>>>>>>> >>>>>>>>> With the patch, LogicalRepLauncherLock is released at the end of >>>>>>>>> DropSubscription(). But ISTM that the lock should be released just after >>>>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of >>>>>>>>> replication slot with the lock. >>>>>>>> >>>>>>>> That's true. logicalrep_worker_stop returns after confirmig that >>>>>>>> worker->proc is cleard, so no false relaunch cannot be caused. >>>>>>>> After all, logicalrep_worker_stop is surrounded by >>>>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton >>>>>>>> and make the lock secrion to be more narrower. >>>>>> >>>>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be >>>>>> removed and the comment for logicalrep_worker_stop() should be updated. >>>>>> >>>>>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock >>>>>> while holding LogicalRepLauncherLock. OTOH, with your approach, >>>>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding >>>>>> LogicalRepWorkerLock. >>>>>> >>>>>> Therefore I pushed the simple patch which adds LWLockRelease() just after >>>>>> logicalrep_worker_stop(). >>>>>> >>>>>> Another problem that I found while reading the code is that the launcher can >>>>>> start up the worker with the subscription that DROP SUBSCRIPTION just removed. >>>>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription, >>>>>> but the launcher can see it and start new worker until the transaction for >>>>>> DROP has been committed. >>>>>> >>>>> >>>>> That was the reason why DropSubscription didn't release the lock in the >>>>> first place. It was supposed to be released at the end of the >>>>> transaction though. >>>> >>>> OK, I understood why you used the lock in that way. But using LWLock >>>> for that purpose is not valid. >>>> >>> >>> Yeah, I just tried to avoid what we are doing now really hard :) >>> >>>>>> To fix this issue, I think that DROP SUBSCRIPTION should take >>>>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock, >>>>>> so that the launcher cannot see the entry to be being removed. >>>>>> >>>>> >>>>> The whole point of having LogicalRepLauncherLock is to avoid having to >>>>> do this, so if we do this we could probably get rid of it. >>>> >>>> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription >>>> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION. >>>> Attached patch does this. >>>> >>> >>> Okay, looks reasonable to me. >> >> Thanks for the review! >> But ISMT that I should suspend committing the patch until we fix the issue >> that Sawada reported in other thread. That bugfix may change the related >> code and design very much. >> https://www.postgresql.org/message-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com >> > > That patch has been committed. And this issue still happens. Should we > add this to the open item list so it doesn't get missed? Thanks for ping. Pushed the patch. Regards, -- Fujii Masao
pgsql-hackers by date: