DeadLockCheck is buggy - Mailing list pgsql-hackers

From Tom Lane
Subject DeadLockCheck is buggy
Date
Msg-id 14276.979684024@sss.pgh.pa.us
Whole thread Raw
Responses Re: DeadLockCheck is buggy  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
Create three tables and start four transactions, then do:

XACT 1:  LOCK TABLE A;

XACT 2:  LOCK TABLE B IN ROW SHARE MODE;

XACT 3:  LOCK TABLE B IN ROW EXCLUSIVE MODE;

XACT 4:  LOCK TABLE C;

XACT 2:  LOCK TABLE C;

XACT 3:  LOCK TABLE C;

XACT 1:  LOCK TABLE B IN SHARE MODE;

<< wait at least 1 second here >>

XACT 4:  LOCK TABLE A;

The system is now deadlocked: 4 waits for 1, 1 waits for 3, 3 waits for
4 ... but DeadLockCheck doesn't notice.  Deadlock *is* detected if LOCK
TABLE C is issued in xact 3 before xact 2, however.  The problem is that
after a recursive invocation of DeadLockCheck, the code checks to see
if the waitProc is blocked by thisProc, and assumes no deadlock if not.
But each waitProc will only be visited once, so if we happen to reach it
first by way of a proc that is not blocking it, we fail to notice that
there really is a deadlock.  In this example, we reach xact 1 by way of
xact 2 (which is hit first in the scan of waiters for lock C); xact 2 is
waiting on the same lock as xact 3, but only xact 3 blocks xact 1.

I have been studying DeadLockCheck for most of a day now, and I doubt
that this is the only bug lurking in it.  I think that we really ought
to throw it away and start over, because it doesn't look to me at all
like a standard deadlock-detection algorithm.  The standard way of doing
this is that you trace outward along waits-for relationships and see if
you come back to your starting point or not.  This is not doing that.
One reason is that the existing locktable structure makes it extremely
painful to discover just which processes are holding a given lock.
You can only find which ones are *waiting* for a given lock.  That's not
a killer problem, we can simply trace the waits-for relationships in
reverse, but it's not really doing that either.  In particular, it's
only doing the tracing in a literal sense for conflicts between locks
requested and locks already held.  It's not doing things that way for
the case where A is waiting for B because they are requesting (not
holding) conflicting locks and A is behind B in the queue for the lock.
I think there are bugs associated with that case too, though I don't
have an example yet.


Anyway, the bottom line is that I want to start over using a standard
deadlock-check algorithm, along the following lines:

1. Make it relatively cheap to find all the procs already holding a
given lock, by extending the lock datastructures so that there's a list
of already-granted holder objects for each lock, not only pending holders.
(In other words, each HOLDER object will now be in two linked lists, one
for its proc and one for its lock, not only one list.  This will also
make it a lot easier to write a useful lock-state-dumping routine,
whenever someone gets around to that.)

2. Rewrite DeadLockCheck to recurse outwards to procs that the current
proc is waiting for, rather than vice versa, and consider both hard
blocking (he's already got a conflicting lock) and priority blocking
(he's got a conflicting request and is ahead of me in the wait queue).

3. As now, if a cycle is found that includes a priority block, try to
break the cycle by awakening the lower-priority waiter out of order,
rather than aborting a transaction.

4. Clean up DeadLockCheck so that it can be started from any proc,
not only MyProc.  This is needed for point #5.

5. Change ProcLockWakeup to address Hiroshi's concern about possible
starvation if we awaken waiters out-of-order.  If we find a wakable
waiter behind one or more nonwakable waiters, awaken it only if
DeadLockCheck says it is deadlocked.  Under normal circumstances that
won't happen and we won't break priority-of-arrival ordering, but if
it's necessary to avoid deadlock then we will.  (We could avoid excess
computation here by only running DeadLockCheck if the waiting process
has already done so on its own behalf; otherwise don't, just let it
happen if the waiter is still blocked when its timeout occurs.  That
would be easy if we add a bool I've-run-DeadLockCheck flag to the PROC
structures.)

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Mikheev, Vadim"
Date:
Subject: RE: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea
Next
From: Tom Lane
Date:
Subject: Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea