Re: lock on object is already held - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: lock on object is already held
Msg-id: 3163.1385515801@sss.pgh.pa.us
In response to: lock on object is already held (Daniel Wood <dwood@salesforce.com>)
Responses: Re: lock on object is already held
List: pgsql-hackers
Daniel Wood <dwood@salesforce.com> writes:
> Sorry but I don't know how to respond to an old thread I found on
> postgresql.org:
> http://www.postgresql.org/message-id/20766.1357318482@sss.pgh.pa.us

> I presume this is still an open issue.  While working on a new feature I
> wrote a stress test for it.  After fixing my bugs, I couldn't get rid of:
> ERROR:  lock RowExclusiveLock on object 16384/16393/0 is already held
> In addition, with asserts enabled in lock.c I would see Asserts in
> UnGrantLock, LockRelease, LockReleaseAll, etc.  In one run everything hung
> waiting on SQL locks.

> The asserts or hang occurs within seconds of running the stress test.
> Before I get into the details I want to see if this is something already
> being worked on.  I can repro it easily in vanilla 9.3.

Dan sent me this test case off-list.  Here's a somewhat cleaned-up
version:

1. Create a table:

   create table lock_bug(f1 int, f2 int);

2. Compile the attached lockbug.c.

3. Run the attached lockbug.sh, with adjustment of parameters to taste.

This spits up in a remarkable variety of ways in 9.3 and HEAD,
especially if you have assertions on.

After some investigation I found the cause: LockRelease() expects that
if a LOCALLOCK object represents a valid lock (nLocks > 0), then either
its lock and proclock fields point to associated shared-memory entries,
or they're NULL.  However, it's possible for that to not be true,
because if we fail to acquire a lock, the code leaves around a LOCALLOCK
object pointing at the shared objects we couldn't get lock on.  *These
objects are subject to reclamation, because no lock is actually held*.
So if we make another attempt to get the same lock later in the same
transaction, LockAcquire finds and re-uses that LOCALLOCK object.  Pre
fast-path locks, it would always recompute the lock and proclock
pointers, so the fact that they might have been stale wasn't a problem.
But the fast-path patch added a code path in which we could succeed in
acquiring a fast-path lock, and then exit without having done anything
with the pointer fields.  Now we have something that looks like a valid
locallock but could be pointing at entirely unrelated shared objects.
This leads to all sorts of fun at release time, with the symptoms
depending on exactly what those shared hashtable entries are being used
for at the instant we stomp on them.

So the fix is pretty simple:

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f8dc951..37c605f 100644
*************** LockAcquireExtended(const LOCKTAG *lockt
*** 837,842 ****
--- 844,851 ----
          LWLockRelease(MyProc->backendLock);
          if (acquired)
          {
+             locallock->lock = NULL;
+             locallock->proclock = NULL;
              GrantLockLocal(locallock, owner);
              return LOCKACQUIRE_OK;
          }

although this needs to be backed up with a lot more comments of course.

When I showed this to Dan, he objected that it'd be better if the
failing lock operation had cleaned up the LOCALLOCK instead.  That would
be a significantly bigger change though, and I think it'd net out being
less robust.  The data structure was designed to not need such cleanup,
because the more stuff you have to do to clean up after a failure, the
more likely it is that you'll have additional problems during the
cleanup, leaving you hosed.  In particular, we'd have to be certain that
we could remove the useless shared objects during the cleanup step,
since once the LOCALLOCK is gone there is nothing else that will remind
us to garbage-collect them at end of transaction.
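[Editor's note] To make the failure sequence above concrete, here is a small
self-contained C model of it.  This is not PostgreSQL code: the types and
function names below (SharedLock, LocalLock, failed_acquire, and so on) are
invented purely for illustration, and the "shared table" is a single slot.
It only mimics the shape of the problem: a failed acquisition leaves a cached
pointer into a shared slot behind, the slot is recycled for an unrelated
lock, a later fast-path acquisition reuses the local entry without refreshing
the pointer, and release then touches the unrelated shared entry.

#include <stdio.h>

typedef struct SharedLock		/* stand-in for the shared LOCK/PROCLOCK entry */
{
	int			tag;			/* which object this slot currently describes */
	int			nHolders;
} SharedLock;

typedef struct LocalLock		/* stand-in for a backend-local LOCALLOCK */
{
	int			tag;
	int			nLocks;
	SharedLock *lock;			/* cached pointer into the shared table */
} LocalLock;

static SharedLock shared_table[1];	/* one reclaimable shared slot */

/* A lock attempt that fails (e.g. lock_timeout): it touched the shared slot
 * and cached a pointer to it, but holds nothing (nLocks stays 0). */
static void
failed_acquire(LocalLock *ll, int tag)
{
	shared_table[0].tag = tag;
	ll->tag = tag;
	ll->lock = &shared_table[0];
}

/* With no lock actually held, the shared slot is reclaimed and reused for a
 * completely different lock. */
static void
reclaim_shared_slot(int other_tag)
{
	shared_table[0].tag = other_tag;
	shared_table[0].nHolders = 1;
}

/* A later fast-path acquisition of the original lock: it succeeds without
 * ever touching the shared table, so the stale cached pointer survives. */
static void
fastpath_acquire(LocalLock *ll)
{
	ll->nLocks++;				/* now looks like a valid locallock */
}

/* Release trusts the cached pointer; here we merely report the mismatch that
 * the real code would turn into corruption of an unrelated hashtable entry. */
static void
release(LocalLock *ll)
{
	if (ll->lock && ll->lock->tag != ll->tag)
		printf("releasing lock on %d would clobber shared entry for %d\n",
			   ll->tag, ll->lock->tag);
	ll->nLocks--;
}

int
main(void)
{
	LocalLock	ll = {0, 0, NULL};

	failed_acquire(&ll, 16393);		/* the insert that hit lock_timeout */
	reclaim_shared_slot(99999);		/* someone else's lock now lives there */
	fastpath_acquire(&ll);			/* retry succeeds via the fast path */
	release(&ll);					/* stomps on the unrelated shared entry */
	return 0;
}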
BTW, although I'm certain that 9.2 has this bug as well, this test case
fails to show a problem in that branch.  I've not looked into why not ---
it's probably a timing issue or something.

Comments?

			regards, tom lane


lockbug.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
pf(const char *str)
{
	printf("%s;\n", str);
	fflush(stdout);
}

int
main(int argc, char **argv)
{
	int			scenario = atoi(argv[1]);
	int			count = atoi(argv[2]);
	int			i;

	switch (scenario)
	{
		case 0:
			/* repeatedly insert and roll back inside subtransactions,
			 * with a very short lock_timeout so lock acquisitions can fail */
			pf("set lock_timeout = 1");
			while (count-- > 0)
			{
				pf("begin");
				for (i = 0; i < 10; i++)
				{
					pf("savepoint mysp");
					pf("insert into lock_bug values (1, 42)");
					pf("rollback to savepoint mysp");
				}
				pf("rollback");
			}
			break;
		case 1:
			/* churn index creation and removal to generate conflicting locks */
			while (count-- > 0)
			{
				pf("create index lock_bug_idx_f1 on lock_bug(f1)");
				pf("drop index lock_bug_idx_f1");
			}
			break;
	}
	return 0;
}

lockbug.sh:

#! /bin/sh

PSQL=psql
PARGS="regression"
COUNT=1000

./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 0 $COUNT | $PSQL $PARGS &
./lockbug 1 $COUNT | $PSQL $PARGS &