Thread: Improving deadlock error messages
The error message we currently produce when a deadlock occurs is pretty unfriendly: ERROR: deadlock detected DETAIL: Process 32068 waits for AccessExclusiveLock on relation 16413 of database 16384;blocked by process 32064. Process 32064 waits for AccessExclusiveLock on relation 16415 of database 16384; blockedby process 32068. Users encounter this message relatively frequently -- and they often depend on the information in the errdetail to track down the source of the deadlock. Presenting numeric OIDs effectively forces the DBA to resolve the OIDs to the appropriate relation and database names manually. I whipped up a quick patch to use names as well as OIDs for the identifiers in the message, but on reflection the simple approach to doing this is problematic: when we do syscache lookups to lookup the identifier names, we might need to acquire an AccessShareLock on various system catalogs (pg_class, pg_namespace, pg_database). This could fail (e.g. because of a deadlock involving a system catalog), causing the deadlock detector to infinitely recurse (albeit slowly). We could fix this by first conditionally acquiring AccessShareLocks on the necessary system catalogs. If any of those lock acquisitions fails, we can just print the numeric OID instead. Since failing to acquire an AccessShareLock on the system catalogs should not occur frequently in practise, most users should see a more friendly error message. Comments? Anyone see a better approach? -Neil
Neil Conway <neilc@samurai.com> writes: > I whipped up a quick patch to use names as well as OIDs for the > identifiers in the message, but on reflection the simple approach to > doing this is problematic: when we do syscache lookups to lookup the > identifier names, we might need to acquire an AccessShareLock on various > system catalogs (pg_class, pg_namespace, pg_database). This could fail > (e.g. because of a deadlock involving a system catalog), causing the > deadlock detector to infinitely recurse (albeit slowly). Yup, that's exactly why it doesn't do that already. > We could fix this by first conditionally acquiring AccessShareLocks on > the necessary system catalogs. I don't think you've thought of quite all of the failure cases. One that's a bit pressing is that a deadlock isn't necessarily confined to objects in your own database. My take on this is that I'd much rather have unfriendly information than none at all. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > I don't think you've thought of quite all of the failure cases. One > that's a bit pressing is that a deadlock isn't necessarily confined to > objects in your own database. We could do the syscache lookups for only the object we're waiting on and store that information in the lock table for others to refer to. We would have to do the syscache lookup either always or at the point where we first decide we have to block. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote: > I don't think you've thought of quite all of the failure cases. One > that's a bit pressing is that a deadlock isn't necessarily confined to > objects in your own database. I'm not sure I follow. If we conditionally acquire the locks we need and always fallback to just printing the numeric OIDs, ISTM we should be able to avoid the infinite recursion problem. (If necessary, we can always special-case objects outside our own database, although I'm not sure that's necessary.) -Neil
Neil Conway <neilc@samurai.com> writes: > On Fri, 2007-04-20 at 02:55 -0400, Tom Lane wrote: >> I don't think you've thought of quite all of the failure cases. One >> that's a bit pressing is that a deadlock isn't necessarily confined to >> objects in your own database. > I'm not sure I follow. If we conditionally acquire the locks we need and > always fallback to just printing the numeric OIDs, ISTM we should be > able to avoid the infinite recursion problem. (If necessary, we can > always special-case objects outside our own database, although I'm not > sure that's necessary.) Maybe so, but you're going to be writing quite a lot of duplicative code, because the existing routines you might have been thinking of using (lsyscache.c etc) don't behave that way. The basic objection I've got to this is that it'll introduce a lot of complexity and fragility into a seldom-taken error-recovery path, which is almost by definition not well enough tested already. regards, tom lane
On Sat, 2007-04-21 at 02:38 -0400, Tom Lane wrote: > Maybe so, but you're going to be writing quite a lot of duplicative > code, because the existing routines you might have been thinking of > using (lsyscache.c etc) don't behave that way. Right, I'm envisioning doing a conditional LockAcquire and then heap_open() / heap_getnext() by hand. That will be relatively slow, but code that emits a deadlock error message is almost by definition not performance critical. BTW, another alternative would be to set a global variable instructing LockAcquire() to not block waiting for a lock; instead, it would longjmp(), a la elog(ERROR). You could even construct something similar to PG_TRY(): PG_COND_LOCK(); { /* do various things that might acquire lmgr locks */ } PG_ACQUIRE_FAILED() { /* failed to acquire an lmgr lock */ } PG_END_COND_LOCK(); The risk would be leaving the LockAcquire() call site in an inconsistent state when we longjmp(), but since DeadLockReport() is going to ereport(ERROR) anyway, it might be sufficiently safe. This scheme does seem a bit fragile, though... -Neil
On Sat, 2007-04-21 at 17:56 -0400, Neil Conway wrote: > Right, I'm envisioning doing a conditional LockAcquire and then > heap_open() / heap_getnext() by hand. That will be relatively slow, but > code that emits a deadlock error message is almost by definition not > performance critical. ... although it turns out you'd need to conditionally lock a *lot* of system catalogs to guarantee that you're not going to block on a lock at some point. Needless to say, that approach would be pretty ugly and fragile. > BTW, another alternative would be to set a global variable instructing > LockAcquire() to not block waiting for a lock; instead, it would > longjmp(), a la elog(ERROR). You could even construct something similar > to PG_TRY() Attached is a very quick hack of a patch to do this. -Neil
Attachment
On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote: > Attached is a very quick hack of a patch to do this. Does anyone have any feedback on this approach? If people are satisfied with this solution, I can get a cleaned up patch ready to apply shortly. -Neil
Neil Conway <neilc@samurai.com> writes: > On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote: >> Attached is a very quick hack of a patch to do this. > Does anyone have any feedback on this approach? If people are satisfied > with this solution, I can get a cleaned up patch ready to apply shortly. I'm really still opposed to the entire concept. You're proposing to put a lot of fragile-looking code into a seldom-exercised error path. I fear bugs will survive a long time in there, and the net effect will be that we get no information when we need it most. The numeric printouts may be ugly, but they are reliable. regards, tom lane
On Mon, 2007-04-23 at 17:38 -0400, Tom Lane wrote: > I'm really still opposed to the entire concept. You're proposing to put > a lot of fragile-looking code into a seldom-exercised error path. There's certainly not a "lot" of code: the patch just adds a few syscache lookups, wrapped in a PG_LOCK_NOWAIT() block. As for fragility, I think the important point is whether it's safe to siglongjmp() out of LockAcquire(); the rest is just window dressing. -Neil
Neil Conway wrote: > On Mon, 2007-04-23 at 17:38 -0400, Tom Lane wrote: > > I'm really still opposed to the entire concept. You're proposing to put > > a lot of fragile-looking code into a seldom-exercised error path. > > There's certainly not a "lot" of code: the patch just adds a few > syscache lookups, wrapped in a PG_LOCK_NOWAIT() block. > > As for fragility, I think the important point is whether it's safe to > siglongjmp() out of LockAcquire(); the rest is just window dressing. We have some elog(ERROR) calls in LockAcquire, so yes, there are some siglongjmp calls already in there. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Apr 23, 2007, at 11:38 PM, Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: >> On Sat, 2007-04-21 at 19:43 -0400, Neil Conway wrote: >>> Attached is a very quick hack of a patch to do this. > >> Does anyone have any feedback on this approach? If people are >> satisfied >> with this solution, I can get a cleaned up patch ready to apply >> shortly. > > I'm really still opposed to the entire concept. You're proposing > to put > a lot of fragile-looking code into a seldom-exercised error path. > I fear bugs will survive a long time in there, and the net effect > will be > that we get no information when we need it most. The numeric > printouts > may be ugly, but they are reliable. If we're that worried about test coverage for deadlocks, what about adding a test to the regression tests? IIRC the framework can coordinate between multiple connections now... -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)