Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION
Date
Msg-id CAD21AoAHMoTL_9CEtKKne_AqoOF304k5mR7PpyxghrnGeGTH_A@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 06/02/17 17:33, Fujii Masao wrote:
>>> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> On 03/02/17 19:38, Fujii Masao wrote:
>>>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com>
>>>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>>>>> <michael.paquier@gmail.com> wrote:
>>>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>>>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>>>>> some functions called from there can throw exceptions.
>>>>>>>>>>
>>>>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>>>>> unmaintainable system.  Please don't add a TRY-CATCH unless it's
>>>>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>
>>>>>>> Thank you for the suggestion. I minunderstood that.
>>>>>>>
>>>>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in
>>>>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>>>>>
>>>>>>>> Thanks for the patch!
>>>>>>>>
>>>>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>>>>> DropSubscription(). But ISTM that the lock should be released just after
>>>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of
>>>>>>>> replication slot with the lock.
>>>>>>>
>>>>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>>>>> After all, logicalrep_worker_stop is surrounded by
>>>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>>>>> and make the lock secrion to be more narrower.
>>>>>
>>>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>>>>> removed and the comment for logicalrep_worker_stop() should be updated.
>>>>>
>>>>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
>>>>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>>>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>>>>> LogicalRepWorkerLock.
>>>>>
>>>>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>>>>> logicalrep_worker_stop().
>>>>>
>>>>> Another problem that I found while reading the code is that the launcher can
>>>>> start up the worker with the subscription that DROP SUBSCRIPTION just removed.
>>>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>>>>> but the launcher can see it and start new worker until the transaction for
>>>>> DROP has been committed.
>>>>>
>>>>
>>>> That was the reason why DropSubscription didn't release the lock in the
>>>> first place. It was supposed to be released at the end of the
>>>> transaction though.
>>>
>>> OK, I understood why you used the lock in that way. But using LWLock
>>> for that purpose is not valid.
>>>
>>
>> Yeah, I just tried to avoid what we are doing now really hard :)
>>
>>>>> To fix this issue, I think that DROP SUBSCRIPTION should take
>>>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>>>>> so that the launcher cannot see the entry to be being removed.
>>>>>
>>>>
>>>> The whole point of having LogicalRepLauncherLock is to avoid having to
>>>> do this, so if we do this we could probably get rid of it.
>>>
>>> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
>>> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
>>> Attached patch does this.
>>>
>>
>> Okay, looks reasonable to me.
>
> Thanks for the review!
> But ISMT that I should suspend committing the patch until we fix the issue
> that Sawada reported in other thread. That bugfix may change the related
> code and design very much.
> https://www.postgresql.org/message-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com
>

That patch has been committed. And this issue still happens. Should we
add this to the open item list so it doesn't get missed?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Parallel seq. plan is not coming against inheritance orpartition table
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Write Ahead Logging for Hash Indexes