Weird Assert failure in GetLockStatusData()

From: Tom Lane
This is a bit disturbing:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bushpig&dt=2013-01-07%2019%3A15%3A02

The key bit is

[50eb2156.651e:6] LOG:  execute isolationtester_waiting: SELECT 1 FROM pg_locks holder, pg_locks waiter WHERE NOT
waiter.granted AND waiter.pid = $1 AND holder.granted AND holder.pid <> $1 AND holder.pid IN (25887, 25888, 25889) AND
holder.mode = ANY (CASE waiter.mode WHEN 'AccessShareLock' THEN ARRAY['AccessExclusiveLock'] WHEN 'RowShareLock' THEN
ARRAY['ExclusiveLock','AccessExclusiveLock'] WHEN 'RowExclusiveLock' THEN
ARRAY['ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN 'ShareUpdateExclusiveLock' THEN
ARRAY['ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN
'ShareLock' THEN
ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN
'ShareRowExclusiveLock' THEN
ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock']
WHEN 'ExclusiveLock' THEN
ARRAY['RowShareLock','RowExclusiveLock','ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock']
WHEN 'AccessExclusiveLock' THEN
ARRAY['AccessShareLock','RowShareLock','RowExclusiveLock','ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock']
END) AND holder.locktype IS NOT DISTINCT FROM waiter.locktype AND holder.database IS NOT DISTINCT FROM waiter.database
AND holder.relation IS NOT DISTINCT FROM waiter.relation AND holder.page IS NOT DISTINCT FROM waiter.page AND
holder.tuple IS NOT DISTINCT FROM waiter.tuple AND holder.virtualxid IS NOT DISTINCT FROM waiter.virtualxid AND
holder.transactionid IS NOT DISTINCT FROM waiter.transactionid AND holder.classid IS NOT DISTINCT FROM waiter.classid
AND holder.objid IS NOT DISTINCT FROM waiter.objid AND holder.objsubid IS NOT DISTINCT FROM waiter.objsubid
 
[50eb2156.651e:7] DETAIL:  parameters: $1 = '25889'
TRAP: FailedAssertion("!(el == data->nelements)", File: "lock.c", Line: 3398)
[50eb2103.62ee:2] LOG:  server process (PID 25886) was terminated by signal 6: Aborted
[50eb2103.62ee:3] DETAIL:  Failed process was running: SELECT 1 FROM pg_locks holder, pg_locks waiter WHERE NOT
waiter.granted AND waiter.pid = $1 AND holder.granted AND holder.pid <> $1 AND holder.pid IN (25887, 25888, 25889) AND
holder.mode = ANY (CASE waiter.mode WHEN 'AccessShareLock' THEN ARRAY['AccessExclusiveLock'] WHEN 'RowShareLock' THEN
ARRAY['ExclusiveLock','AccessExclusiveLock'] WHEN 'RowExclusiveLock' THEN
ARRAY['ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN 'ShareUpdateExclusiveLock' THEN
ARRAY['ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN
'ShareLock' THEN
ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock'] WHEN
'ShareRowExclusiveLock' THEN
ARRAY['RowExclusiveLock','ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','ExclusiveLock','AccessExclusiveLock']
WHEN 'ExclusiveLock' THEN
ARRAY['RowShareLock','RowExclusiveLock','ShareUpdateExclusiveLock','ShareLock','ShareRowExclusiveLock','E

The assertion failure seems to indicate that the number of
LockMethodProcLockHash entries found by hash_seq_search didn't match the
number that had been counted by hash_get_num_entries immediately before
that.  I don't see any bug in GetLockStatusData itself, so this suggests
that there's something wrong with dynahash's entry counting, or that
somebody somewhere is modifying the shared hash table without holding
the appropriate lock.  The latter seems a bit more likely, given that
this must be a very low-probability bug or we'd have seen it before.
An overlooked locking requirement in a seldom-taken code path would fit
the symptoms.
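
For context, the counting logic in GetLockStatusData() goes roughly as
follows (paraphrased for illustration, not the verbatim lock.c source;
"el" has already counted the fast-path entries copied earlier):

    /*
     * All lock-partition LWLocks are held shared at this point, so the
     * shared hash table is supposed to be frozen against concurrent
     * insertions and deletions.
     */
    data->nelements = el + hash_get_num_entries(LockMethodProcLockHash);

    hash_seq_init(&seqstat, LockMethodProcLockHash);
    while ((proclock = (PROCLOCK *) hash_seq_search(&seqstat)))
    {
        /* copy this PROCLOCK's details into data->locks[el] */
        el++;
    }

    /* This fires if anyone modified the table behind our back: */
    Assert(el == data->nelements);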

Or maybe bushpig just had some weird cosmic-ray hardware failure,
but I don't put a lot of faith in such explanations.

Thoughts?
        regards, tom lane



Re: Weird Assert failure in GetLockStatusData()

From: Tom Lane
I wrote:
> This is a bit disturbing:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bushpig&dt=2013-01-07%2019%3A15%3A02
> ...
> The assertion failure seems to indicate that the number of
> LockMethodProcLockHash entries found by hash_seq_search didn't match the
> number that had been counted by hash_get_num_entries immediately before
> that.  I don't see any bug in GetLockStatusData itself, so this suggests
> that there's something wrong with dynahash's entry counting, or that
> somebody somewhere is modifying the shared hash table without holding
> the appropriate lock.  The latter seems a bit more likely, given that
> this must be a very low-probability bug or we'd have seen it before.
> An overlooked locking requirement in a seldom-taken code path would fit
> the symptoms.

After digging around a bit, I can find only one place where it looks
like somebody might be messing with the LockMethodProcLockHash table
while not holding the appropriate lock-partition LWLock(s):

1. VirtualXactLock finds target xact holds its VXID lock fast-path.
2. VirtualXactLock calls SetupLockInTable to convert the fast-path lock
   to regular.
3. SetupLockInTable makes entries in LockMethodLockHash and
   LockMethodProcLockHash.

I see no partition lock acquisition anywhere in the above code path.
Is there one that I'm missing?  Why isn't SetupLockInTable documented
as expecting the caller to hold the partition lock, as is generally
done for lock.c subroutines that require that?
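
For concreteness, the suspect path looks roughly like this (a paraphrased
sketch, not the verbatim source; the target backend's own fast-path lock
is held here, but no lock-manager partition lock):

    if (proc->fpVXIDLock &&
        proc->fpLocalTransactionId == vxid.localTransactionId)
    {
        uint32      hashcode = LockTagHashCode(&tag);

        /*
         * SetupLockInTable inserts into both LockMethodLockHash and
         * LockMethodProcLockHash, yet nothing here has acquired the
         * partition LWLock covering those entries.
         */
        proclock = SetupLockInTable(LockMethods[DEFAULT_LOCKMETHOD], proc,
                                    &tag, hashcode, ExclusiveLock);
        GrantLock(proclock->tag.myLock, proclock, ExclusiveLock);
        proc->fpVXIDLock = false;
    }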

If this is a bug, it's rather disturbing that it took us this long to
recognize it.  That code path isn't all that seldom-taken, AFAIK.
        regards, tom lane



Re: Weird Assert failure in GetLockStatusData()

From: Tom Lane
I wrote:
> After digging around a bit, I can find only one place where it looks
> like somebody might be messing with the LockMethodProcLockHash table
> while not holding the appropriate lock-partition LWLock(s):

> 1. VirtualXactLock finds target xact holds its VXID lock fast-path.
> 2. VirtualXactLock calls SetupLockInTable to convert the fast-path lock
>    to regular.
> 3. SetupLockInTable makes entries in LockMethodLockHash and
>    LockMethodProcLockHash.

> I see no partition lock acquisition anywhere in the above code path.

I've been able to reproduce the assertion crash by forcing the
appropriate timing with some carefully chosen debugger breakpoints.
So this theory is evidently correct.
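
The obvious shape of a fix is to take the partition lock exclusively
around the hash-table manipulation, roughly like so -- a sketch, not a
tested patch:

    hashcode = LockTagHashCode(&tag);
    partitionLock = LockHashPartitionLock(hashcode);

    LWLockAcquire(partitionLock, LW_EXCLUSIVE);
    proclock = SetupLockInTable(LockMethods[DEFAULT_LOCKMETHOD], proc,
                                &tag, hashcode, ExclusiveLock);
    if (!proclock)
    {
        LWLockRelease(partitionLock);
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of shared memory")));
    }
    GrantLock(proclock->tag.myLock, proclock, ExclusiveLock);
    LWLockRelease(partitionLock);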

> If this is a bug, it's rather disturbing that it took us this long to
> recognize it.  That code path isn't all that seldom-taken, AFAIK.

On closer inspection, VirtualXactLock() is only called in CREATE INDEX
CONCURRENTLY and DROP INDEX CONCURRENTLY (and yes, the failed test case
on bushpig was exercising DROP INDEX CONCURRENTLY).  So maybe the path
isn't that frequently taken after all.
        regards, tom lane