Thread: Adding some error context for lock wait failures

Adding some error context for lock wait failures

From

Tom Lane

Date:

10 July, 20:05:54

I noted a complaint [1] about how hard it is to debug unforeseen
lock-timeout failures: we give no details about what we were
waiting for.  It's not hard to improve that situation, at least
to the extent of printing numeric locktag details similar to what
you get in deadlock reports.  (It'd be nice to give object names,
but just as with deadlocks, incurring any additional lock
acquisitions here seems too scary.)  The attached patch will
produce reports like

regression=# begin;
BEGIN
regression=*# lock table tenk1;
^CCancel request sent
ERROR:  canceling statement due to user request
CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348
regression=!# abort;
ROLLBACK
regression=#  set lock_timeout TO '1s';
SET
regression=# begin;
BEGIN
regression=*# lock table tenk1;
ERROR:  canceling statement due to lock timeout
CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348

and then the user can manually look up the object's identity.

Thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/CAKE1AiY17RgcKCFba1N6Sz6SjHqSqvq%2BcfKWBfyKFEjT-L%2Bqkg%40mail.gmail.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 62f3471448e..01b752830fb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -415,6 +415,7 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static ProcWaitStatus WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
+static void waitonlock_error_callback(void *arg);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
 static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
@@ -1931,6 +1932,7 @@ static ProcWaitStatus
 WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
 {
     ProcWaitStatus result;
+    ErrorContextCallback waiterrcontext;

     TRACE_POSTGRESQL_LOCK_WAIT_START(locallock->tag.lock.locktag_field1,
                                      locallock->tag.lock.locktag_field2,
@@ -1939,6 +1941,12 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
                                      locallock->tag.lock.locktag_type,
                                      locallock->tag.mode);

+    /* Setup error traceback support for ereport() */
+    waiterrcontext.callback = waitonlock_error_callback;
+    waiterrcontext.arg = (void *) locallock;
+    waiterrcontext.previous = error_context_stack;
+    error_context_stack = &waiterrcontext;
+
     /* adjust the process title to indicate that it's waiting */
     set_ps_display_suffix("waiting");

@@ -1990,6 +1998,8 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
     /* reset ps display to remove the suffix */
     set_ps_display_remove_suffix();

+    error_context_stack = waiterrcontext.previous;
+
     TRACE_POSTGRESQL_LOCK_WAIT_DONE(locallock->tag.lock.locktag_field1,
                                     locallock->tag.lock.locktag_field2,
                                     locallock->tag.lock.locktag_field3,
@@ -2000,6 +2010,28 @@ WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner)
     return result;
 }

+/*
+ * error context callback for failures in WaitOnLock
+ *
+ * We report which lock was being waited on, in the same style used in
+ * deadlock reports.  This helps with lock timeout errors in particular.
+ */
+static void
+waitonlock_error_callback(void *arg)
+{
+    LOCALLOCK  *locallock = (LOCALLOCK *) arg;
+    const LOCKTAG *tag = &locallock->tag.lock;
+    LOCKMODE    mode = locallock->tag.mode;
+    StringInfoData locktagbuf;
+
+    initStringInfo(&locktagbuf);
+    DescribeLockTag(&locktagbuf, tag);
+
+    errcontext("waiting for %s on %s",
+               GetLockmodeName(tag->locktag_lockmethodid, mode),
+               locktagbuf.data);
+}
+
 /*
  * Remove a proc from the wait-queue it is on (caller must know it is on one).
  * This is only used when the proc has failed to get the lock, so we set its

Re: Adding some error context for lock wait failures

From

Zhang Mingli

Date:

11 July, 05:44:11

Hi,

On Jul 11, 2025 at 01:06 +0800, Tom Lane <tgl@sss.pgh.pa.us>, wrote:

I noted a complaint [1] about how hard it is to debug unforeseen
lock-timeout failures: we give no details about what we were
waiting for. It's not hard to improve that situation, at least
to the extent of printing numeric locktag details similar to what
you get in deadlock reports. (It'd be nice to give object names,
but just as with deadlocks, incurring any additional lock
acquisitions here seems too scary.) The attached patch will
produce reports like

regression=# begin;
BEGIN
regression=*# lock table tenk1;
^CCancel request sent
ERROR: canceling statement due to user request
CONTEXT: waiting for AccessExclusiveLock on relation 77382 of database 77348
regression=!# abort;
ROLLBACK
regression=# set lock_timeout TO '1s';
SET
regression=# begin;
BEGIN
regression=*# lock table tenk1;
ERROR: canceling statement due to lock timeout
CONTEXT: waiting for AccessExclusiveLock on relation 77382 of database 77348

and then the user can manually look up the object's identity.

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/CAKE1AiY17RgcKCFba1N6Sz6SjHqSqvq%2BcfKWBfyKFEjT-L%2Bqkg%40mail.gmail.com

````
}
PG_CATCH();
{
/* In this path, awaitedLock remains set until LockErrorCleanup */

/* reset ps display to remove the suffix */
set_ps_display_remove_suffix();

/* and propagate the error */
PG_RE_THROW();
}
PG_END_TRY();

/*
* We no longer want LockErrorCleanup to do anything.
*/
awaitedLock = NULL;

/* reset ps display to remove the suffix */
set_ps_display_remove_suffix();

error_context_stack = waiterrcontext.previous;

```

Do we need to rollback error_context_stack to the previous state if we enter the branch for PG_CATCH()?

--
Zhang Mingli
HashData

Re: Adding some error context for lock wait failures

From

Tom Lane

Date:

11 July, 05:53:28

Zhang Mingli <zmlpostgres@gmail.com> writes:
> Do we need to rollback error_context_stack to the previous state if we enter the branch for PG_CATCH()?

No.  The PG_TRY mechanism itself deals with that: the next outer
level of PG_TRY will restore error_context_stack to what it had
been.  If this were not so, most other places that add an
error_context_stack entry would also be broken.

            regards, tom lane

Re: Adding some error context for lock wait failures

From

Zhang Mingli

Date:

11 July, 06:32:50

Hi,

On Jul 11, 2025 at 10:53 +0800, Tom Lane <tgl@sss.pgh.pa.us>, wrote:

Zhang Mingli <zmlpostgres@gmail.com> writes:
Do we need to rollback error_context_stack to the previous state if we enter the branch for PG_CATCH()?

No. The PG_TRY mechanism itself deals with that: the next outer
level of PG_TRY will restore error_context_stack to what it had
been. If this were not so, most other places that add an
error_context_stack entry would also be broken.

Got it. LGTM.

--
Zhang Mingli
HashData

Re: Adding some error context for lock wait failures

From

Dilip Kumar

Date:

11 July, 06:35:48

On Thu, Jul 10, 2025 at 10:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I noted a complaint [1] about how hard it is to debug unforeseen
> lock-timeout failures: we give no details about what we were
> waiting for.  It's not hard to improve that situation, at least
> to the extent of printing numeric locktag details similar to what
> you get in deadlock reports.  (It'd be nice to give object names,
> but just as with deadlocks, incurring any additional lock
> acquisitions here seems too scary.)  The attached patch will
> produce reports like
>
> regression=# begin;
> BEGIN
> regression=*# lock table tenk1;
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348
> regression=!# abort;
> ROLLBACK
> regression=#  set lock_timeout TO '1s';
> SET
> regression=# begin;
> BEGIN
> regression=*# lock table tenk1;
> ERROR:  canceling statement due to lock timeout
> CONTEXT:  waiting for AccessExclusiveLock on relation 77382 of database 77348
>
> and then the user can manually look up the object's identity.
>

This seems to be quite useful to me, initially I thought if we can
print the relation and database name then this could be even better
but it might be a bad idea to fetch the object name while we are in
error callback.  And anyway deadlock error is also reported in the
same format so this makes sense.

--
Regards,
Dilip Kumar
Google

Re: Adding some error context for lock wait failures

From

Zhang Mingli

Date:

11 July, 07:03:46

Hi,

On Jul 11, 2025 at 11:36 +0800, Dilip Kumar <dilipbalaut@gmail.com>, wrote:

This seems to be quite useful to me, initially I thought if we can
print the relation and database name then this could be even better
but it might be a bad idea to fetch the object name while we are in
error callback.

May be confused if there were tables with same names under different schemas.

--
Zhang Mingli
HashData

Re: Adding some error context for lock wait failures

From

Dilip Kumar

Date:

11 July, 07:19:49

On Fri, Jul 11, 2025 at 9:34 AM Zhang Mingli <zmlpostgres@gmail.com> wrote:
>
> On Jul 11, 2025 at 11:36 +0800, Dilip Kumar <dilipbalaut@gmail.com>, wrote:
>
>
> This seems to be quite useful to me, initially I thought if we can
> print the relation and database name then this could be even better
> but it might be a bad idea to fetch the object name while we are in
> error callback.
>
> May be confused if there were tables with same names under different schemas.

If that's the only issue we can print schema qualified name, but I
think the problem is in error callback we just have lock tag
information which only have OIDs and we don't look up the
relcaches/sys table from the error path.

--
Regards,
Dilip Kumar
Google