On Fri, 2010-09-03 at 18:24 -0400, Tom Lane wrote:
> Now the HS case likewise appears to be set up so that the signal can
> only directly interrupt ProcWaitForSignal, so I think the core issue is
> whether any deadlock situations are possible. Given that this gets
> called from a low-level place like LockBufferForCleanup, I don't feel
> too comfortable about that.
LockBufferForCleanup is only ever called during recovery by
heap_xlog_clean() or btree_xlog_vacuum().
The actions taken to replay a WAL record are independent of all other
WAL records from a locking perspective, so replay of every WAL record
starts with no LWLocks held by the startup process. LockBufferForCleanup
is called early in the replay of a heap or btree cleanup record, so we
can easily check that no other LWLocks are held when it is called.
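To make that argument concrete, here is a toy model of the invariant. It
is only a sketch, not PostgreSQL code: every name below (toy_held_lwlocks,
toy_replay_cleanup_record, and so on) is made up for illustration, and the
single counter stands in for whatever bookkeeping the real LWLock code does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names here are hypothetical, not PostgreSQL's API. */

/* Number of LWLocks currently held by the modelled startup process. */
static int toy_held_lwlocks = 0;

static void toy_lwlock_acquire(void)
{
    toy_held_lwlocks++;
}

static void toy_lwlock_release(void)
{
    assert(toy_held_lwlocks > 0);
    toy_held_lwlocks--;
}

/* The cleanup lock is "safe" to request only when no LWLock is held. */
static bool toy_cleanup_lock_is_safe(void)
{
    return toy_held_lwlocks == 0;
}

/*
 * Model of replaying one cleanup record. Replay starts with no LWLocks
 * held, the cleanup lock is requested first, and only afterwards is a
 * single LWLock taken and released (standing in for the signalling step).
 * Returns whether the invariant held when the cleanup lock was requested.
 */
static bool toy_replay_cleanup_record(void)
{
    bool safe = toy_cleanup_lock_is_safe();

    toy_lwlock_acquire();
    toy_lwlock_release();

    return safe;
}
```

Calling toy_replay_cleanup_record() repeatedly returns true each time in
this model, because every record ends with the counter back at zero. If
some path took an LWLock before requesting the cleanup lock, the check
would return false, which is the case we want to rule out.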
> I certainly haven't seen any analysis or
> documentation of what locks can safely be held at that point.
> The deadlock checker only tries to take the LockMgr LWLocks, so
> extrapolating from whether it is safe to whether touching the
> ProcArrayLock is safe seems entirely unfounded.
So the startup process acquires one LWLock, ProcArrayLock, and is not
holding any other LWLock when it does. The deadlock checker attempts to
get and hold all of the lock partition locks. So the deadlock checker
already does the thing you're saying might be dangerous, and the startup
process doesn't.
The ProcArrayLock is only taken as a way of signaling other backends. If
that is particularly unsafe we could redesign that aspect.
> It might be worth pointing out here that LockBufferForCleanup is already
> known to be a risk factor for undetected deadlocks, even without HS in
> the picture, because of the possibility of deadlocks involving a chain
> of both heavyweight locks and LWLocks. Whether HS makes it materially
> worse may be something that we need field experience to determine.
You may be right that it will be a problem.
The deadlock risk we're protecting against is a deadlock involving both
normal locks and buffer pins. We're safer having this code than not
having it, IMHO.
-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services