Adding some error context for lock wait failures - Mailing list pgsql-hackers

From Tom Lane
Subject Adding some error context for lock wait failures
Date
Msg-id 1602369.1752167154@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
I noted a complaint [1] about how hard it is to debug unforeseen
lock-timeout failures: we give no details about what we were
waiting for.  It's not hard to improve that situation, at least
to the extent of printing numeric locktag details similar to what
you get in deadlock reports.  (It'd be nice to give object names,
but just as with deadlocks, incurring any additional lock
acquisitions here seems too scary.)  The attached patch will
produce reports like

regression=# begin;
BEGIN
regression=*# lock table tenk1;
^CCancel request sent
ERROR:  canceling statement due to user request
CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348
regression=!# abort;
ROLLBACK
regression=#  set lock_timeout TO '1s';
SET
regression=# begin;
BEGIN
regression=*# lock table tenk1;
ERROR:  canceling statement due to lock timeout
CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348

and then the user can manually look up the object's identity.

Thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/CAKE1AiY17RgcKCFba1N6Sz6SjHqSqvq%2BcfKWBfyKFEjT-L%2Bqkg%40mail.gmail.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..01b752830fb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -415,6 +415,7 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static ProcWaitStatus WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void waitonlock_error_callback(void *arg);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
@@ -1931,6 +1932,7 @@ static ProcWaitStatus
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 {
     ProcWaitStatus result;
+    ErrorContextCallback waiterrcontext;

     TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
                                      locallock->tag.lock.locktag_field2,
@@ -1939,6 +1941,12 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
                                      locallock->tag.lock.locktag_type,
                                      locallock->tag.mode);

+    /* Setup error traceback support for ereport() */
+    waiterrcontext.callback = waitonlock_error_callback;
+    waiterrcontext.arg = (void *) locallock;
+    waiterrcontext.previous = error_context_stack;
+    error_context_stack = &waiterrcontext;
+
     /* adjust the process title to indicate that it's waiting */
     set_ps_display_suffix("waiting");

@@ -1990,6 +1998,8 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
     /* reset ps display to remove the suffix */
     set_ps_display_remove_suffix();

+    error_context_stack = waiterrcontext.previous;
+
     TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
                                     locallock->tag.lock.locktag_field2,
                                     locallock->tag.lock.locktag_field3,
@@ -2000,6 +2010,28 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
     return result;
 }

+/*
+ * error context callback for failures in WaitOnLock
+ *
+ * We report which lock was being waited on, in the same style used in
+ * deadlock reports.  This helps with lock timeout errors in particular.
+ */
+static void
+waitonlock_error_callback(void *arg)
+{
+    LOCALLOCK  *locallock = (LOCALLOCK *) arg;
+    const LOCKTAG *tag = &locallock->tag.lock;
+    LOCKMODE    mode = locallock->tag.mode;
+    StringInfoData locktagbuf;
+
+    initStringInfo(&locktagbuf);
+    DescribeLockTag(&locktagbuf, tag);
+
+    errcontext("waiting for %s on %s",
+               GetLockmodeName(tag->locktag_lockmethodid, mode),
+               locktagbuf.data);
+}
+
 /*
  * Remove a proc from the wait-queue it is on (caller must know it is on one).
  * This is only used when the proc has failed to get the lock, so we set its

pgsql-hackers by date:

Previous
From: Srirama Kucherlapati
Date:
Subject: RE: Buildfarm setup for AIX
Next
From: Jacob Champion
Date:
Subject: Re: Explicitly enable meson features in CI