Thread: [PATCH] LWLock self-deadlock detection

[PATCH] LWLock self-deadlock detection

From: Craig Ringer
Hi all

Here's a patch I wrote a while ago to detect and report when an LWLockAcquire() results in a simple self-deadlock due to the caller already holding the LWLock.

To avoid affecting hot-path performance, it only fires the check on the first iteration through the retry loops in LWLockAcquire() and LWLockWaitForVar(), and just before we sleep, once the fast-path has been missed.

I wrote an earlier version of this when I was chasing down some hairy issues with background workers deadlocking on some exit paths, because ereport(ERROR) or elog(ERROR) calls fired while an LWLock was held would cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock when it tried to acquire the same lock.

But it's an easy enough mistake to make and a seriously annoying one to track down, so I figured I'd post it for consideration. Maybe someone else will get some use out of it even if nobody likes the idea of merging it.

As written, the check runs only for --enable-cassert builds or when LOCK_DEBUG is defined.
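
In case it helps review, the shape of the check is roughly this (a simplified sketch, not the exact code in the attached patch; it reuses lwlock.c's existing held_lwlocks[] bookkeeping):

/*
 * Simplified sketch of the self-deadlock check. lwlock.c already
 * records every LWLock this backend holds in held_lwlocks[], so
 * detecting a re-acquire is a linear scan of that small array.
 */
static void
LWLockCheckSelfDeadlock(LWLock *lock)
{
    for (int i = 0; i < num_held_lwlocks; i++)
    {
        if (held_lwlocks[i].lock == lock)
            ereport(PANIC,
                    (errmsg("deadlock detected: this backend already holds LWLock \"%s\"",
                            T_NAME(lock))));
    }
}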

Re: [PATCH] LWLock self-deadlock detection

From: Ashutosh Bapat
This looks useful. LWLockCheckSelfDeadlock() could use an LWLockHeldByMe()
variant instead of copying that code, possibly with a change to that
function to return the required information.
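
For example, something like this hypothetical variant (a sketch only; the name is made up):

/*
 * Hypothetical variant of LWLockHeldByMe() that also reports the
 * mode the lock is held in, so LWLockCheckSelfDeadlock() could
 * include it in its message instead of duplicating this loop.
 */
static bool
LWLockHeldByMeExtended(LWLock *lock, LWLockMode *mode)
{
    for (int i = 0; i < num_held_lwlocks; i++)
    {
        if (held_lwlocks[i].lock == lock)
        {
            if (mode != NULL)
                *mode = held_lwlocks[i].mode;
            return true;
        }
    }
    return false;
}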

I am also seeing a pattern
Assert(!LWLockHeldByMe());
LWLockAcquire()

at some places. Should we change LWLockAcquire to do
Assert(!LWLockHeldByMe()) always to detect such occurrences? Enhance
that pattern to print the information that your patch prints?

It looks weird that we can detect a self-deadlock but not handle it.
But handling it would require remembering how many times each LWLock
is held and releasing it that many times. That may not leave LWLocks
light-weight anymore.

On Thu, Nov 19, 2020 at 4:02 PM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> Hi all
>
> Here's a patch I wrote a while ago to detect and report when an LWLockAcquire() results in a simple self-deadlock due to the caller already holding the LWLock.
>
> To avoid affecting hot-path performance, it only fires the check on the first iteration through the retry loops in LWLockAcquire() and LWLockWaitForVar(), and just before we sleep, once the fast-path has been missed.
>
> I wrote an earlier version of this when I was chasing down some hairy issues with background workers deadlocking on some exit paths, because ereport(ERROR) or elog(ERROR) calls fired while an LWLock was held would cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock when it tried to acquire the same lock.
>
> But it's an easy enough mistake to make and a seriously annoying one to track down, so I figured I'd post it for consideration. Maybe someone else will get some use out of it even if nobody likes the idea of merging it.
>
> As written, the check runs only for --enable-cassert builds or when LOCK_DEBUG is defined.



--
Best Wishes,
Ashutosh Bapat



Re: [PATCH] LWLock self-deadlock detection

From: Craig Ringer
On Tue, Nov 24, 2020 at 10:11 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> This looks useful. LWLockCheckSelfDeadlock() could use an LWLockHeldByMe()
> variant instead of copying that code, possibly with a change to that
> function to return the required information.

Yes, possibly so. I was reluctant to mess with the rest of the code too much.

> I am also seeing a pattern
> Assert(!LWLockHeldByMe());
> LWLockAcquire()
>
> at some places. Should we change LWLockAcquire to do
> Assert(!LWLockHeldByMe()) always to detect such occurrences?

I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day-to-day development, and I'd rather not risk a slowdown.

I'd prefer to make the lock self-deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and a core, and on a non-assert build we'd carry on and self-deadlock anyway.

That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard-to-test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover, though.
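
Something along these lines just before we sleep, say (a sketch only, not what the posted patch does; it assumes errbacktrace() for the backtrace part):

/*
 * Sketch of the non-assert-build behaviour: emit a LOG with a
 * backtrace and a recovery HINT, then fall through and block.
 * On assert builds the Assert() turns this into a crash with a
 * core file instead.
 */
if (LWLockHeldByMe(lock))
{
    ereport(LOG,
            (errmsg("backend is about to wait on an LWLock it already holds: \"%s\"",
                    T_NAME(lock)),
             errhint("An immediate-mode shutdown (pg_ctl -m immediate) may be needed to recover."),
             errbacktrace()));
    Assert(false);
}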

I don't think it makes sense to introduce much complexity for a feature that's mainly there to help developers and to catch corner-case bugs.

(FWIW a variant of this patch has been in 2ndQPostgres for some time. It helped catch an extension bug just the other day.)

Re: [PATCH] LWLock self-deadlock detection

From: Ashutosh Bapat
On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>> I am also seeing a pattern
>> Assert(!LWLockHeldByMe());
>> LWLockAcquire()
>>
>> at some places. Should we change LWLockAcquire to do
>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>
>
> I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day-to-day development, and I'd rather not risk a slowdown.
>
> I'd prefer to make the lock self-deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and a core, and on a non-assert build we'd carry on and self-deadlock anyway.
>
> That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard-to-test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover, though.

I agree that it will be helpful to print something in the logs
indicating the reason for the hang in case it happens in a
production build. In your patch you have used ereport(PANIC, ...),
which in an assert-enabled build may simply be replaced by an
Assert(). We already have Assert(!LWLockHeldByMe()) in places, so
that should be safe. It will be good to have the -m immediate hint
in the LOG message. But it might just be better to kill -9 that
process to get rid of it; that will cause the server to restart
rather than just shut down.

--
Best Wishes,
Ashutosh Bapat



Re: [PATCH] LWLock self-deadlock detection

From: Craig Ringer
On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer
> <craig.ringer@enterprisedb.com> wrote:
>>> I am also seeing a pattern
>>> Assert(!LWLockHeldByMe());
>>> LWLockAcquire()
>>>
>>> at some places. Should we change LWLockAcquire to do
>>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>>
>> I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day-to-day development, and I'd rather not risk a slowdown.
>>
>> I'd prefer to make the lock self-deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and a core, and on a non-assert build we'd carry on and self-deadlock anyway.
>>
>> That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard-to-test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover, though.
>
> I agree that it will be helpful to print something in the logs
> indicating the reason for the hang in case it happens in a
> production build. In your patch you have used ereport(PANIC, ...),
> which in an assert-enabled build may simply be replaced by an
> Assert().

I'd either select between PANIC (assert build) and LOG (non-assert build), or always LOG and then put an Assert() afterwards. It's strongly desirable to get the log output before any crash, because it's a pain to get the LWLock info from a core.
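
For the former, e.g. (a sketch; the macro name is invented):

/*
 * Sketch: pick the report level at build time, so assert builds
 * PANIC (crash with a core) while production builds LOG and then
 * proceed into the self-deadlock with a diagnosable log entry.
 */
#ifdef USE_ASSERT_CHECKING
#define SELF_DEADLOCK_ELEVEL PANIC
#else
#define SELF_DEADLOCK_ELEVEL LOG
#endif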
 
> We already have Assert(!LWLockHeldByMe()) in places, so that should
> be safe. It will be good to have the -m immediate hint in the LOG
> message. But it might just be better to kill -9 that process to get
> rid of it; that will cause the server to restart rather than just
> shut down.

Sure. Won't work on Windows though. And I am hesitant about telling users to "kill -9" anything.

pg_ctl -m immediate restart then.

Re: [PATCH] LWLock self-deadlock detection

From: Tom Lane
Craig Ringer <craig.ringer@enterprisedb.com> writes:
> I'd prefer to make the lock self-deadlock check run for production
> builds, not just cassert builds.

I'd like to register a strong objection to spending any cycles whatsoever
on this.  If you're using LWLocks in a way that could deadlock, you're
doing it wrong.  The entire point of that mechanism is to be Light Weight
and not have all the overhead that the heavyweight lock mechanism has.
Building in deadlock checks and such is exactly the wrong direction.
If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running
it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general
deadlock, seems pretty pointless.  Careless coding is far more likely
to create opportunities for the latter.  (Thus, I have little use for
the existing assertions of this sort, either.)

            regards, tom lane



Re: [PATCH] LWLock self-deadlock detection

From: Heikki Linnakangas
On 26/11/2020 04:50, Tom Lane wrote:
> Craig Ringer <craig.ringer@enterprisedb.com> writes:
>> I'd prefer to make the lock self-deadlock check run for production
>> builds, not just cassert builds.
> 
> I'd like to register a strong objection to spending any cycles whatsoever
> on this.  If you're using LWLocks in a way that could deadlock, you're
> doing it wrong.  The entire point of that mechanism is to be Light Weight
> and not have all the overhead that the heavyweight lock mechanism has.
> Building in deadlock checks and such is exactly the wrong direction.
> If you think you need that, you should be using a heavyweight lock.
> 
> Maybe there's some case for a cassert-only check of this sort, but running
> it in production is right out.
> 
> I'd also opine that checking for self-deadlock, but not any more general
> deadlock, seems pretty pointless.  Careless coding is far more likely
> to create opportunities for the latter.  (Thus, I have little use for
> the existing assertions of this sort, either.)

I've made the mistake of forgetting to release an LWLock many times, 
leading to self-deadlock. And I just reviewed a patch that did that this 
week [1]. You usually find that mistake very quickly when you start 
testing, though; I don't think I've seen it happen in production.

So yeah, I agree it's not worth spending cycles on this. Maybe it would 
be worth it if it's really simple to check, and you only do it after 
waiting X seconds. (I haven't looked at this patch at all)
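
Something like this in the wait loop, perhaps (entirely hypothetical; the variable names are made up):

/*
 * Hypothetical: run the (cheap) self-deadlock scan only once
 * we've already slept a few times, so a normally-contended lock
 * never pays anything for it.
 */
if (!checked_self_deadlock && ++num_sleeps >= 10)
{
    checked_self_deadlock = true;
    if (LWLockHeldByMe(lock))
        elog(LOG, "apparent LWLock self-deadlock on \"%s\"", T_NAME(lock));
}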

[1] 
https://www.postgresql.org/message-id/8e1cda9b-1cb9-a634-d383-f757bf67b820@iki.fi

- Heikki



Re: [PATCH] LWLock self-deadlock detection

From: Peter Geoghegan
On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing, though; I don't think I've seen it happen in production.

+1. The fact that you don't get deadlock detection with LWLocks is a
cornerstone of the whole design. This assumption is common to all
database systems (LWLocks are generally called latches in the database
research community, but the concept is exactly the same).

> So yeah, I agree it's not worth spending cycles on this. Maybe it would
> be worth it if it's really simple to check, and you only do it after
> waiting X seconds. (I haven't looked at this patch at all)

-1 for that, unless it's only for debug builds. Even if it is
practically free, it still means committing to the wrong priorities.
Plus the performance validation would be very arduous as a practical
matter.

We've made LWLocks much more scalable in the last 10 years. Why
shouldn't we expect to do the same again in the next 10 years? I
wouldn't bet against it. I might even do the opposite (predict further
improvements to LWLocks).

-- 
Peter Geoghegan



Re: [PATCH] LWLock self-deadlock detection

From: Michael Paquier
On Fri, Nov 27, 2020 at 11:08:49AM -0800, Peter Geoghegan wrote:
> On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I've made the mistake of forgetting to release an LWLock many times,
>> leading to self-deadlock. And I just reviewed a patch that did that this
>> week [1]. You usually find that mistake very quickly when you start
>> testing, though; I don't think I've seen it happen in production.
>
> +1. The fact that you don't get deadlock detection with LWLocks is a
> cornerstone of the whole design. This assumption is common to all
> database systems (LWLocks are generally called latches in the database
> research community, but the concept is exactly the same).

+1.

>> So yeah, I agree it's not worth spending cycles on this. Maybe it would
>> be worth it if it's really simple to check, and you only do it after
>> waiting X seconds. (I haven't looked at this patch at all)
>
> -1 for that, unless it's only for debug builds. Even if it is
> practically free, it still means committing to the wrong priorities.
> Plus the performance validation would be very arduous as a practical
> matter.

Looking at the git history, we have a much larger history of bugs that
relate to race conditions when it comes to LWLocks.  I am not really
convinced that it is worth spending time on this even for debug
builds, FWIW.
--
Michael


Re: [PATCH] LWLock self-deadlock detection

From: Andres Freund
Hi,

On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> On 26/11/2020 04:50, Tom Lane wrote:
> > Craig Ringer <craig.ringer@enterprisedb.com> writes:
> > > I'd prefer to make the lock self-deadlock check run for production
> > > builds, not just cassert builds.
> > 
> > I'd like to register a strong objection to spending any cycles whatsoever
> > on this.  If you're using LWLocks in a way that could deadlock, you're
> > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > and not have all the overhead that the heavyweight lock mechanism has.
> > Building in deadlock checks and such is exactly the wrong direction.
> > If you think you need that, you should be using a heavyweight lock.
> > 
> > Maybe there's some case for a cassert-only check of this sort, but running
> > it in production is right out.
> > 
> > I'd also opine that checking for self-deadlock, but not any more general
> > deadlock, seems pretty pointless.  Careless coding is far more likely
> > to create opportunities for the latter.  (Thus, I have little use for
> > the existing assertions of this sort, either.)
> 
> I've made the mistake of forgetting to release an LWLock many times, leading
> to self-deadlock. And I just reviewed a patch that did that this week [1].
> You usually find that mistake very quickly when you start testing, though; I
> don't think I've seen it happen in production.

I think something roughly along the lines of Craig's original patch,
basically adding assert checks against already holding a lock, makes
sense. Compared to the other costs of running an assert build this
isn't a huge cost, and it's helpful.

I entirely concur that doing this outside of assertion builds is a
seriously bad idea.

Greetings,

Andres Freund



Re: [PATCH] LWLock self-deadlock detection

From: Craig Ringer
On Wed, 30 Dec 2020 at 10:11, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> > On 26/11/2020 04:50, Tom Lane wrote:
> > > Craig Ringer <craig.ringer@enterprisedb.com> writes:
> > > > I'd prefer to make the lock self-deadlock check run for production
> > > > builds, not just cassert builds.
> > >
> > > I'd like to register a strong objection to spending any cycles whatsoever
> > > on this.  If you're using LWLocks in a way that could deadlock, you're
> > > doing it wrong.  The entire point of that mechanism is to be Light Weight
> > > and not have all the overhead that the heavyweight lock mechanism has.
> > > Building in deadlock checks and such is exactly the wrong direction.
> > > If you think you need that, you should be using a heavyweight lock.
> > >
> > > Maybe there's some case for a cassert-only check of this sort, but running
> > > it in production is right out.
> > >
> > > I'd also opine that checking for self-deadlock, but not any more general
> > > deadlock, seems pretty pointless.  Careless coding is far more likely
> > > to create opportunities for the latter.  (Thus, I have little use for
> > > the existing assertions of this sort, either.)
> >
> > I've made the mistake of forgetting to release an LWLock many times, leading
> > to self-deadlock. And I just reviewed a patch that did that this week [1].
> > You usually find that mistake very quickly when you start testing, though; I
> > don't think I've seen it happen in production.
>
> I think something roughly along the lines of Craig's original patch,
> basically adding assert checks against already holding a lock, makes
> sense. Compared to the other costs of running an assert build this
> isn't a huge cost, and it's helpful.
>
> I entirely concur that doing this outside of assertion builds is a
> seriously bad idea.

Yeah, given it only targets developer error, that's sensible.