Thread: Improving deadlock error messages

Improving deadlock error messages

From

Neil Conway

Date:

20 April 2007, 03:34:10

The error message we currently produce when a deadlock occurs is pretty
unfriendly:
ERROR: deadlock detected DETAIL: Process 32068 waits for AccessExclusiveLock on relation 16413 of database
16384;blocked by process 32064. Process 32064 waits for AccessExclusiveLock on relation 16415 of database 16384;
blockedby process 32068.

Users encounter this message relatively frequently -- and they often
depend on the information in the errdetail to track down the source of
the deadlock. Presenting numeric OIDs effectively forces the DBA to
resolve the OIDs to the appropriate relation and database names
manually.

I whipped up a quick patch to use names as well as OIDs for the
identifiers in the message, but on reflection the simple approach to
doing this is problematic: when we do syscache lookups to lookup the
identifier names, we might need to acquire an AccessShareLock on various
system catalogs (pg_class, pg_namespace, pg_database). This could fail
(e.g. because of a deadlock involving a system catalog), causing the
deadlock detector to infinitely recurse (albeit slowly).

We could fix this by first conditionally acquiring AccessShareLocks on
the necessary system catalogs. If any of those lock acquisitions fails,
we can just print the numeric OID instead. Since failing to acquire an
AccessShareLock on the system catalogs should not occur frequently in
practise, most users should see a more friendly error message.

Comments? Anyone see a better approach?

-Neil

Re: Improving deadlock error messages

From

Tom Lane

Date:

20 April 2007, 03:55:47

Neil Conway <neilc@samurai.com> writes:
> I whipped up a quick patch to use names as well as OIDs for the
> identifiers in the message, but on reflection the simple approach to
> doing this is problematic: when we do syscache lookups to lookup the
> identifier names, we might need to acquire an AccessShareLock on various
> system catalogs (pg_class, pg_namespace, pg_database). This could fail
> (e.g. because of a deadlock involving a system catalog), causing the
> deadlock detector to infinitely recurse (albeit slowly).

Yup, that's exactly why it doesn't do that already.

> We could fix this by first conditionally acquiring AccessShareLocks on
> the necessary system catalogs.

I don't think you've thought of quite all of the failure cases.  One
that's a bit pressing is that a deadlock isn't necessarily confined to
objects in your own database.

My take on this is that I'd much rather have unfriendly information
than none at all.
        regards, tom lane

Re: Improving deadlock error messages

From

Gregory Stark

Date:

20 April 2007, 05:56:02

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> I don't think you've thought of quite all of the failure cases.  One
> that's a bit pressing is that a deadlock isn't necessarily confined to
> objects in your own database.

We could do the syscache lookups for only the object we're waiting on and
store that information in the lock table for others to refer to. We would have
to do the syscache lookup either always or at the point where we first decide
we have to block.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Improving deadlock error messages

From

Neil Conway

Date:

20 April 2007, 14:44:17

On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote:
> I don't think you've thought of quite all of the failure cases.  One
> that's a bit pressing is that a deadlock isn't necessarily confined to
> objects in your own database.

I'm not sure I follow. If we conditionally acquire the locks we need and
always fallback to just printing the numeric OIDs, ISTM we should be
able to avoid the infinite recursion problem. (If necessary, we can
always special-case objects outside our own database, although I'm not
sure that's necessary.)

-Neil

Re: Improving deadlock error messages

From

Tom Lane

Date:

21 April 2007, 03:38:56

Neil Conway <neilc@samurai.com> writes:
> On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote:
>> I don't think you've thought of quite all of the failure cases.  One
>> that's a bit pressing is that a deadlock isn't necessarily confined to
>> objects in your own database.

> I'm not sure I follow. If we conditionally acquire the locks we need and
> always fallback to just printing the numeric OIDs, ISTM we should be
> able to avoid the infinite recursion problem. (If necessary, we can
> always special-case objects outside our own database, although I'm not
> sure that's necessary.)

Maybe so, but you're going to be writing quite a lot of duplicative
code, because the existing routines you might have been thinking of
using (lsyscache.c etc) don't behave that way.

The basic objection I've got to this is that it'll introduce a lot of
complexity and fragility into a seldom-taken error-recovery path,
which is almost by definition not well enough tested already.
        regards, tom lane

Re: Improving deadlock error messages

From

Neil Conway

Date:

21 April 2007, 18:55:02

On Sat, 2007-04-21 at 02:38 -0400, Tom Lane wrote:
> Maybe so, but you're going to be writing quite a lot of duplicative
> code, because the existing routines you might have been thinking of
> using (lsyscache.c etc) don't behave that way.

Right, I'm envisioning doing a conditional LockAcquire and then
heap_open() / heap_getnext() by hand. That will be relatively slow, but
code that emits a deadlock error message is almost by definition not
performance critical.

BTW, another alternative would be to set a global variable instructing
LockAcquire() to not block waiting for a lock; instead, it would
longjmp(), a la elog(ERROR). You could even construct something similar
to PG_TRY():

PG_COND_LOCK();
{   /* do various things that might acquire lmgr locks */
}
PG_ACQUIRE_FAILED()
{   /* failed to acquire an lmgr lock */
}
PG_END_COND_LOCK();

The risk would be leaving the LockAcquire() call site in an inconsistent
state when we longjmp(), but since DeadLockReport() is going to
ereport(ERROR) anyway, it might be sufficiently safe. This scheme does
seem a bit fragile, though...

-Neil

Re: Improving deadlock error messages

From

Neil Conway

Date:

21 April 2007, 20:42:13

On Sat, 2007-04-21 at 17:56 -0400, Neil Conway wrote:
> Right, I'm envisioning doing a conditional LockAcquire and then
> heap_open() / heap_getnext() by hand. That will be relatively slow, but
> code that emits a deadlock error message is almost by definition not
> performance critical.

... although it turns out you'd need to conditionally lock a *lot* of
system catalogs to guarantee that you're not going to block on a lock at
some point. Needless to say, that approach would be pretty ugly and
fragile.

> BTW, another alternative would be to set a global variable instructing
> LockAcquire() to not block waiting for a lock; instead, it would
> longjmp(), a la elog(ERROR). You could even construct something similar
> to PG_TRY()

Attached is a very quick hack of a patch to do this.

-Neil

Attachment

deadlock_error_msg-4.patch

Re: Improving deadlock error messages

From

Neil Conway

Date:

23 April 2007, 17:47:30

On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote:
> Attached is a very quick hack of a patch to do this.

Does anyone have any feedback on this approach? If people are satisfied
with this solution, I can get a cleaned up patch ready to apply shortly.

-Neil

Re: Improving deadlock error messages

From

Tom Lane

Date:

23 April 2007, 18:38:24

Neil Conway <neilc@samurai.com> writes:
> On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote:
>> Attached is a very quick hack of a patch to do this.

> Does anyone have any feedback on this approach? If people are satisfied
> with this solution, I can get a cleaned up patch ready to apply shortly.

I'm really still opposed to the entire concept.  You're proposing to put
a lot of fragile-looking code into a seldom-exercised error path.
I fear bugs will survive a long time in there, and the net effect will be
that we get no information when we need it most.  The numeric printouts
may be ugly, but they are reliable.
        regards, tom lane

Re: Improving deadlock error messages

From

Neil Conway

Date:

23 April 2007, 19:04:04

On Mon, 2007-04-23 at 17:38 -0400, Tom Lane wrote:
> I'm really still opposed to the entire concept.  You're proposing to put
> a lot of fragile-looking code into a seldom-exercised error path.

There's certainly not a "lot" of code: the patch just adds a few
syscache lookups, wrapped in a PG_LOCK_NOWAIT() block.

As for fragility, I think the important point is whether it's safe to
siglongjmp() out of LockAcquire(); the rest is just window dressing.

-Neil

Re: Improving deadlock error messages

From

Alvaro Herrera

Date:

23 April 2007, 19:26:57

Neil Conway wrote:
> On Mon, 2007-04-23 at 17:38 -0400, Tom Lane wrote:
> > I'm really still opposed to the entire concept.  You're proposing to put
> > a lot of fragile-looking code into a seldom-exercised error path.
> 
> There's certainly not a "lot" of code: the patch just adds a few
> syscache lookups, wrapped in a PG_LOCK_NOWAIT() block.
> 
> As for fragility, I think the important point is whether it's safe to
> siglongjmp() out of LockAcquire(); the rest is just window dressing.

We have some elog(ERROR) calls in LockAcquire, so yes, there are some
siglongjmp calls already in there.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Improving deadlock error messages

From

Jim Nasby

Date:

26 April 2007, 04:47:14

On Apr 23, 2007, at 11:38 PM, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>> On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote:
>>> Attached is a very quick hack of a patch to do this.
>
>> Does anyone have any feedback on this approach? If people are  
>> satisfied
>> with this solution, I can get a cleaned up patch ready to apply  
>> shortly.
>
> I'm really still opposed to the entire concept.  You're proposing  
> to put
> a lot of fragile-looking code into a seldom-exercised error path.
> I fear bugs will survive a long time in there, and the net effect  
> will be
> that we get no information when we need it most.  The numeric  
> printouts
> may be ugly, but they are reliable.

If we're that worried about test coverage for deadlocks, what about  
adding a test to the regression tests? IIRC the framework can  
coordinate between multiple connections now...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)