On Tue, Jul 22, 2014 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Anyway, to cut to the chase, the crash seems to be from this:
>>> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>>> So there is still something rotten in the fastpath lock logic.
>
>> Gosh, that sucks.
>
>> The intermittency of this problem would seem to suggest some kind of
>> memory-ordering bug rather than a flat-out logic error in the locking
>> code, but it looks to me like everything relevant is marked volatile.
>
> I don't think that you need any big assumptions about machine-specific
> coding issues to spot the problem.
I don't think that I'm making what could be described as big
assumptions; I think we should fix and back-patch the PPC64 spinlock
change.
But...
> The assert in question is here:
>
> /*
>  * Decrement strong lock count.  This logic is needed only for 2PC.
>  */
> if (decrement_strong_lock_count
>     && ConflictsWithRelationFastPath(&lock->tag, lockmode))
> {
>     uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);
>
>     SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
>     Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
>     FastPathStrongRelationLocks->count[fasthashcode]--;
>     SpinLockRelease(&FastPathStrongRelationLocks->mutex);
> }
>
> and it sure looks to me like that
> ConflictsWithRelationFastPath(&lock->tag, lockmode) call is looking at
> the tag of the shared-memory lock object you just released. If someone
> else had managed to recycle that locktable entry for some other purpose,
> the ConflictsWithRelationFastPath call might incorrectly return true.
>
> I think s/&lock->tag/locktag/ would fix it, but maybe I'm missing
> something.
...this is probably the real cause of the failures we've actually been
seeing. I'll go back-patch that change.
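
For the archives, the corrected block would look about like this -- a
sketch assuming, per lock.c, that the enclosing function receives the
caller's own copy of the tag as a "locktag" parameter, so we never have
to touch the just-released shared-memory entry:

    /*
     * Decrement strong lock count.  This logic is needed only for 2PC.
     *
     * Use the caller-provided locktag rather than &lock->tag: by this
     * point the shared LOCK object has been released and may already
     * have been recycled for an unrelated lock, so dereferencing it
     * here would be a use-after-release.
     */
    if (decrement_strong_lock_count
        && ConflictsWithRelationFastPath(locktag, lockmode))
    {
        uint32      fasthashcode = FastPathStrongLockHashPartition(hashcode);

        SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
        Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
        FastPathStrongRelationLocks->count[fasthashcode]--;
        SpinLockRelease(&FastPathStrongRelationLocks->mutex);
    }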
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company