Thread: [PATCH] LWLock self-deadlock detection
Hi all
Here's a patch I wrote a while ago to detect and report when a LWLockAcquire() results in a simple self-deadlock due to the caller already holding the LWLock.
To avoid affecting hot-path performance, it only fires the check on the first iteration through the retry loops in LWLockAcquire() and LWLockWaitForVar(), and just before we sleep, once the fast-path has been missed.
I wrote an earlier version of this when I was chasing down some hairy issues with background workers deadlocking on some exit paths: an ereport(ERROR) or elog(ERROR) call fired while an LWLock was held would cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock when it tried to acquire the same lock.
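To make that concrete, here's a minimal, hypothetical background-worker sketch of that failure mode (MyWorkerLock, my_worker_cleanup and my_worker_main are made-up names for illustration, not anything from the patch):

#include "postgres.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"

/* Assume this points at an LWLock set up in shared memory elsewhere. */
static LWLock *MyWorkerLock;

static void
my_worker_cleanup(int code, Datum arg)
{
	/* Blocks forever if this backend still holds MyWorkerLock. */
	LWLockAcquire(MyWorkerLock, LW_EXCLUSIVE);
	/* ... tidy up shared state ... */
	LWLockRelease(MyWorkerLock);
}

void
my_worker_main(Datum main_arg)
{
	before_shmem_exit(my_worker_cleanup, (Datum) 0);

	LWLockAcquire(MyWorkerLock, LW_EXCLUSIVE);

	/*
	 * With no sigsetjmp() handler established, ERROR is promoted to FATAL,
	 * proc_exit() runs the before_shmem_exit callback with the lock still
	 * held, and the worker self-deadlocks in my_worker_cleanup().
	 */
	elog(ERROR, "something went wrong");
}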
But it's an easy enough mistake to make and a seriously annoying one to track down, so I figured I'd post it for consideration. Maybe someone else will get some use out of it even if nobody likes the idea of merging it.
As written the check runs only for --enable-cassert builds or when LOCK_DEBUG is defined.
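For anyone who doesn't want to open the attachment, the shape of the check is roughly the following sketch (the body, message wording and exact placement here are illustrative, not copied from the patch): scan this backend's own table of held locks and complain loudly before going to sleep on one of them.

/*
 * Sketch only: lwlock.c already tracks this backend's held locks in
 * held_lwlocks[] / num_held_lwlocks, so a self-deadlock check is just a
 * linear scan of that small array, done in the slow path once the
 * fast-path acquisition has already failed.
 */
static void
LWLockCheckSelfDeadlock(LWLock *lock)
{
	for (int i = 0; i < num_held_lwlocks; i++)
	{
		if (held_lwlocks[i].lock == lock)
			ereport(PANIC,
					(errmsg("deadlock detected: LWLock \"%s\" is already held by this backend",
							T_NAME(lock))));
	}
}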
Attachment
This looks useful. LWLockCheckSelfDeadlock() could use a LWLockHeldByMe variant instead of copying that code, with possibly a change in that function to return the required information.

I am also seeing a pattern

Assert(!LWLockHeldByMe());
LWLockAcquire()

at some places. Should we change LWLockAcquire() to do Assert(!LWLockHeldByMe()) always to detect such occurrences? Should we enhance that pattern to print the information that your patch prints?

It looks weird that we can detect a self-deadlock but not handle it. But handling it would require remembering how many times each LWLock has been acquired and then releasing it that many times. That may not leave LWLocks "LW" anymore.

On Thu, Nov 19, 2020 at 4:02 PM Craig Ringer <craig.ringer@enterprisedb.com> wrote:
>
> Hi all
>
> Here's a patch I wrote a while ago to detect and report when a LWLockAcquire() results in a simple self-deadlock due to the caller already holding the LWLock.
>
> To avoid affecting hot-path performance, it only fires the check on the first iteration through the retry loops in LWLockAcquire() and LWLockWaitForVar(), and just before we sleep, once the fast-path has been missed.
>
> I wrote an earlier version of this when I was chasing down some hairy issues with background workers deadlocking on some exit paths: an ereport(ERROR) or elog(ERROR) call fired while an LWLock was held would cause a before_shmem_exit or on_shmem_exit cleanup function to deadlock when it tried to acquire the same lock.
>
> But it's an easy enough mistake to make and a seriously annoying one to track down, so I figured I'd post it for consideration. Maybe someone else will get some use out of it even if nobody likes the idea of merging it.
>
> As written the check runs only for --enable-cassert builds or when LOCK_DEBUG is defined.

--
Best Wishes,
Ashutosh Bapat
On Tue, Nov 24, 2020 at 10:11 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> This looks useful. LWLockCheckSelfDeadlock() could use LWLockHeldByMe
> variant instead of copying that code with possibly a change in that
> function to return the required information.
Yes, possibly so. I was reluctant to mess with the rest of the code too much.
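For illustration, a variant along those lines might look roughly like this (the name and signature are hypothetical, just a sketch of the idea):

/*
 * Hypothetical: like LWLockHeldByMe(), but also reports the mode in which
 * the lock is held, so the self-deadlock report could reuse it instead of
 * duplicating the held_lwlocks[] scan.
 */
bool
LWLockHeldByMeExtended(LWLock *lock, LWLockMode *mode)
{
	for (int i = 0; i < num_held_lwlocks; i++)
	{
		if (held_lwlocks[i].lock == lock)
		{
			if (mode != NULL)
				*mode = held_lwlocks[i].mode;
			return true;
		}
	}
	return false;
}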
> I am also seeing a pattern
>
> Assert(!LWLockHeldByMe());
> LWLockAcquire()
>
> at some places. Should we change LWLockAcquire to do
> Assert(!LWLockHeldByMe()) always to detect such occurrences?
I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day to day development and I'd rather not risk a slowdown.
I'd prefer to make the lock self deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and core, and on a non-assert build we'd carry on and self-deadlock anyway.
That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard to test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic-loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover though.
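Concretely, the report at the point where the check fires could be something like the sketch below (the message and hint wording are placeholders, not lifted from the patch):

/* Sketch: emit the report before any crash, then let assert builds abort. */
ereport(LOG,
		(errmsg("deadlock detected: LWLock \"%s\" is already held by this backend",
				T_NAME(lock)),
		 errhint("The server may need to be shut down with \"pg_ctl stop -m immediate\" to recover."),
		 errbacktrace()));
Assert(false);
/* Non-assert builds fall through and self-deadlock, as before. */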
I don't think it makes sense to introduce much complexity for a feature that's mainly there to help developers and to catch corner-case bugs.
(FWIW a variant of this patch has been in 2ndQPostgres for some time. It helped catch an extension bug just the other day.)
On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer <craig.ringer@enterprisedb.com> wrote:
>> I am also seeing a pattern
>> Assert(!LWLockHeldByMe());
>> LWLockAcquire()
>>
>> at some places. Should we change LWLockAcquire to do
>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>
> I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day to day development and I'd rather not risk a slowdown.
>
> I'd prefer to make the lock self deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and core, and on a non-assert build we'd carry on and self-deadlock anyway.
>
> That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard to test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic-loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover though.

I agree that it will be helpful to print something in the logs indicating the reason for this hang in case the hang happens in a production build. In your patch you have used ereport(PANIC, ) which may simply be replaced by an Assert() in an assert enabled build. We already have Assert(!LWLockHeldByMe()) so that should be safe. It will be good to have -m immediate hint in LOG message. But it might just be better to kill -9 that process to get rid of it. That will cause the server to restart and not just shut down.

--
Best Wishes,
Ashutosh Bapat
On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer
> <craig.ringer@enterprisedb.com> wrote:
>>> I am also seeing a pattern
>>> Assert(!LWLockHeldByMe());
>>> LWLockAcquire()
>>>
>>> at some places. Should we change LWLockAcquire to do
>>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>>
>> I'm inclined not to, at least not without benchmarking it, because that'd do the check before we attempt the fast-path. cassert builds are still supposed to perform decently and be suitable for day to day development and I'd rather not risk a slowdown.
>>
>> I'd prefer to make the lock self deadlock check run for production builds, not just cassert builds. It'd print a normal LOG (with backtrace if supported) then Assert(). So on an assert build we'd get a crash and core, and on a non-assert build we'd carry on and self-deadlock anyway.
>>
>> That's probably the safest thing to do. We can't expect to safely ERROR out of the middle of an LWLockAcquire(), not without introducing a new and really hard to test code path that'll also be surprising to callers. We probably don't want to PANIC the whole server for non-assert builds since it might enter a panic-loop. So it's probably better to self-deadlock. We could HINT that a -m immediate shutdown will be necessary to recover though.
>
> I agree that it will be helpful to print something in the logs
> indicating the reason for this hang in case the hang happens in a
> production build. In your patch you have used ereport(PANIC, ) which
> may simply be replaced by an Assert() in an assert enabled build.
I'd either select between PANIC (assert build) and LOG (non assert build), or always LOG then put an Assert() afterwards. It's strongly desirable to get the log output before any crash because it's a pain to get the LWLock info from a core.
> We already have Assert(!LWLockHeldByMe()) so that should be safe. It will
> be good to have -m immediate hint in LOG message. But it might just be
> better to kill -9 that process to get rid of it. That will cause the
> server to restart and not just shutdown.
Sure. Won't work on Windows though. And I am hesitant about telling users to "kill -9" anything.
pg_ctl -m immediate restart then.
Craig Ringer <craig.ringer@enterprisedb.com> writes:
> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> wrote:
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds.

I'd like to register a strong objection to spending any cycles whatsoever on this. If you're using LWLocks in a way that could deadlock, you're doing it wrong. The entire point of that mechanism is to be Light Weight and not have all the overhead that the heavyweight lock mechanism has. Building in deadlock checks and such is exactly the wrong direction. If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general deadlock, seems pretty pointless. Careless coding is far more likely to create opportunities for the latter. (Thus, I have little use for the existing assertions of this sort, either.)

			regards, tom lane
On 26/11/2020 04:50, Tom Lane wrote:
> Craig Ringer <craig.ringer@enterprisedb.com> writes:
>> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
>> wrote:
>>> I'd prefer to make the lock self deadlock check run for production
>>> builds, not just cassert builds.
>
> I'd like to register a strong objection to spending any cycles whatsoever
> on this. If you're using LWLocks in a way that could deadlock, you're
> doing it wrong. The entire point of that mechanism is to be Light Weight
> and not have all the overhead that the heavyweight lock mechanism has.
> Building in deadlock checks and such is exactly the wrong direction.
> If you think you need that, you should be using a heavyweight lock.
>
> Maybe there's some case for a cassert-only check of this sort, but running
> it in production is right out.
>
> I'd also opine that checking for self-deadlock, but not any more general
> deadlock, seems pretty pointless. Careless coding is far more likely
> to create opportunities for the latter. (Thus, I have little use for
> the existing assertions of this sort, either.)

I've made the mistake of forgetting to release an LWLock many times, leading to self-deadlock. And I just reviewed a patch that did that this week [1]. You usually find that mistake very quickly when you start testing though, I don't think I've seen it happen in production.

So yeah, I agree it's not worth spending cycles on this. Maybe it would be worth it if it's really simple to check, and you only do it after waiting X seconds. (I haven't looked at this patch at all)

[1] https://www.postgresql.org/message-id/8e1cda9b-1cb9-a634-d383-f757bf67b820@iki.fi

- Heikki
On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

+1. The fact that you don't get deadlock detection with LWLocks is a cornerstone of the whole design. This assumption is common to all database systems (LWLocks are generally called latches in the database research community, but the concept is exactly the same).

> So yeah, I agree it's not worth spending cycles on this. Maybe it would
> be worth it if it's really simple to check, and you only do it after
> waiting X seconds. (I haven't looked at this patch at all)

-1 for that, unless it's only for debug builds. Even if it is practically free, it still means committing to the wrong priorities. Plus the performance validation would be very arduous as a practical matter.

We've made LWLocks much more scalable in the last 10 years. Why shouldn't we expect to do the same again in the next 10 years? I wouldn't bet against it. I might even do the opposite (predict further improvements to LWLocks).

--
Peter Geoghegan
On Fri, Nov 27, 2020 at 11:08:49AM -0800, Peter Geoghegan wrote:
> On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I've made the mistake of forgetting to release an LWLock many times,
>> leading to self-deadlock. And I just reviewed a patch that did that this
>> week [1]. You usually find that mistake very quickly when you start
>> testing though, I don't think I've seen it happen in production.
>
> +1. The fact that you don't get deadlock detection with LWLocks is a
> cornerstone of the whole design. This assumption is common to all
> database systems (LWLocks are generally called latches in the database
> research community, but the concept is exactly the same).

+1.

>> So yeah, I agree it's not worth spending cycles on this. Maybe it would
>> be worth it if it's really simple to check, and you only do it after
>> waiting X seconds. (I haven't looked at this patch at all)
>
> -1 for that, unless it's only for debug builds. Even if it is
> practically free, it still means committing to the wrong priorities.
> Plus the performance validation would be very arduous as a practical
> matter.

Looking at the git history, we have a much larger history of bugs that relate to race conditions when it comes to LWLocks. I am not really convinced that it is worth spending time on this even for debug builds, FWIW.

--
Michael
Hi,

On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> On 26/11/2020 04:50, Tom Lane wrote:
> > Craig Ringer <craig.ringer@enterprisedb.com> writes:
> > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > > wrote:
> > > > I'd prefer to make the lock self deadlock check run for production
> > > > builds, not just cassert builds.
> >
> > I'd like to register a strong objection to spending any cycles whatsoever
> > on this. If you're using LWLocks in a way that could deadlock, you're
> > doing it wrong. The entire point of that mechanism is to be Light Weight
> > and not have all the overhead that the heavyweight lock mechanism has.
> > Building in deadlock checks and such is exactly the wrong direction.
> > If you think you need that, you should be using a heavyweight lock.
> >
> > Maybe there's some case for a cassert-only check of this sort, but running
> > it in production is right out.
> >
> > I'd also opine that checking for self-deadlock, but not any more general
> > deadlock, seems pretty pointless. Careless coding is far more likely
> > to create opportunities for the latter. (Thus, I have little use for
> > the existing assertions of this sort, either.)
>
> I've made the mistake of forgetting to release an LWLock many times, leading
> to self-deadlock. And I just reviewed a patch that did that this week [1].
> You usually find that mistake very quickly when you start testing though, I
> don't think I've seen it happen in production.

I think something roughly along Craig's original patch, basically adding assert checks against holding a lock already, makes sense. Compared to the other costs of running an assert build this isn't a huge cost, and it's helpful.

I entirely concur that doing this outside of assertion builds is a seriously bad idea.

Greetings,

Andres Freund
On Wed, 30 Dec 2020 at 10:11, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote:
> > On 26/11/2020 04:50, Tom Lane wrote:
> > > Craig Ringer <craig.ringer@enterprisedb.com> writes:
> > > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > > > wrote:
> > > > > I'd prefer to make the lock self deadlock check run for production
> > > > > builds, not just cassert builds.
> > >
> > > I'd like to register a strong objection to spending any cycles whatsoever
> > > on this. If you're using LWLocks in a way that could deadlock, you're
> > > doing it wrong. The entire point of that mechanism is to be Light Weight
> > > and not have all the overhead that the heavyweight lock mechanism has.
> > > Building in deadlock checks and such is exactly the wrong direction.
> > > If you think you need that, you should be using a heavyweight lock.
> > >
> > > Maybe there's some case for a cassert-only check of this sort, but running
> > > it in production is right out.
> > >
> > > I'd also opine that checking for self-deadlock, but not any more general
> > > deadlock, seems pretty pointless. Careless coding is far more likely
> > > to create opportunities for the latter. (Thus, I have little use for
> > > the existing assertions of this sort, either.)
> >
> > I've made the mistake of forgetting to release an LWLock many times, leading
> > to self-deadlock. And I just reviewed a patch that did that this week [1].
> > You usually find that mistake very quickly when you start testing though, I
> > don't think I've seen it happen in production.
>
> I think something roughly along Craig's original patch, basically adding
> assert checks against holding a lock already, makes sense. Compared to
> the other costs of running an assert build this isn't a huge cost, and
> it's helpful.
>
> I entirely concur that doing this outside of assertion builds is a
> seriously bad idea.
Yeah, given it only targets developer error, that's sensible.