Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Ok, I think I know what's happening. In btbulkdelete we have a
>> PG_TRY-CATCH block. In the try-block, we call _bt_start_vacuum which
>> acquires and releases the BtreeVacuumLock. Under certain error
>> conditions, _bt_start_vacuum calls elog(ERROR) while holding the
>> BtreeVacuumLock. The PG_CATCH block calls _bt_end_vacuum which also
>> tries to acquire BtreeVacuumLock.
>
> This is definitely a bug (I unfortunately didn't see your message until
> after I'd replicated your reasoning...) but the word from Shuttleworth
> is that he doesn't see either of those messages in his postmaster log.
> So it seems we need another theory. I haven't a clue at the moment though.
The error message never makes it to the log. The deadlock occurs in the
PG_CATCH-block, before rethrowing and printing the error. I added an
unconditional elog(ERROR) in _bt_start_vacuum to test it, and I'm
getting the same hang with no message in the log.
The unsafe pattern of calling elog(ERROR) while holding an LWLock in
_bt_start_vacuum needs to be fixed; patch attached. We still need to
figure out what's causing
the error in the first place. With the patch, we should at least get a
proper error message and not hang when the error occurs.
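To illustrate the hang described above, here is a minimal standalone
sketch. It is not PostgreSQL code: setjmp/longjmp stands in for
PG_TRY/PG_CATCH and elog(ERROR), a plain boolean stands in for the
non-reentrant BtreeVacuumLock, and every toy_* name is hypothetical.
Where a real backend would block forever on its own lock, the sketch
returns false instead.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are PostgreSQL APIs. */
static bool lock_held = false;  /* models a non-reentrant lock */
static jmp_buf error_jump;      /* models the PG_TRY/PG_CATCH jump target */

/* Returns false where a real backend would wait forever on itself. */
static bool toy_lock_acquire(void)
{
    if (lock_held)
        return false;
    lock_held = true;
    return true;
}

static void toy_lock_release(void)
{
    assert(lock_held);
    lock_held = false;
}

/* Models an error path in _bt_start_vacuum; it always raises an error. */
static void toy_start_vacuum(bool release_before_error)
{
    bool ok = toy_lock_acquire();

    assert(ok);
    (void) ok;
    if (release_before_error)
        toy_lock_release();     /* the behavior the patch adds */
    longjmp(error_jump, 1);     /* models elog(ERROR) */
}

/* Models _bt_end_vacuum as called from the PG_CATCH block. */
static bool toy_end_vacuum(void)
{
    if (!toy_lock_acquire())
        return false;           /* self-deadlock in the real system */
    toy_lock_release();
    return true;
}

/* Returns true if cleanup completed, false if it would have hung. */
static bool toy_bulkdelete(bool release_before_error)
{
    lock_held = false;
    if (setjmp(error_jump) == 0)
    {
        toy_start_vacuum(release_before_error);
        return true;            /* not reached: the call above always jumps */
    }
    /* "PG_CATCH": clean up before the error would be rethrown */
    return toy_end_vacuum();
}
```

Calling toy_bulkdelete(false) models the unpatched behavior, where the
cleanup step finds the lock still held; toy_bulkdelete(true) models the
patched behavior, where cleanup succeeds and the error could then be
rethrown and logged.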
Martin: Would it be possible for you to reproduce the problem with a
patched version?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Index: src/backend/access/nbtree/nbtutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtutils.c,v
retrieving revision 1.79
diff -c -r1.79 nbtutils.c
*** src/backend/access/nbtree/nbtutils.c 4 Oct 2006 00:29:49 -0000 1.79
--- src/backend/access/nbtree/nbtutils.c 30 Mar 2007 07:55:36 -0000
***************
*** 998,1016 ****
--- 998,1023 ----
          vac = &btvacinfo->vacuums[i];
          if (vac->relid.relId == rel->rd_lockInfo.lockRelId.relId &&
              vac->relid.dbId == rel->rd_lockInfo.lockRelId.dbId)
+         {
+             LWLockRelease(BtreeVacuumLock);
              elog(ERROR, "multiple active vacuums for index \"%s\"",
                   RelationGetRelationName(rel));
+         }
      }
  
      /* OK, add an entry */
      if (btvacinfo->num_vacuums >= btvacinfo->max_vacuums)
+     {
+         LWLockRelease(BtreeVacuumLock);
          elog(ERROR, "out of btvacinfo slots");
+     }
      vac = &btvacinfo->vacuums[btvacinfo->num_vacuums];
      vac->relid = rel->rd_lockInfo.lockRelId;
      vac->cycleid = result;
      btvacinfo->num_vacuums++;
  
      LWLockRelease(BtreeVacuumLock);
+ 
      return result;
  }