Re: Deadlock in multiple CIC. - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Deadlock in multiple CIC.
Msg-id 6744.1523833660@sss.pgh.pa.us
In response to Re: Deadlock in multiple CIC.  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Deadlock in multiple CIC.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers

Awhile back, Alvaro Herrera wrote:
>> Pushed to all affected branches, along with a somewhat lame
>> isolationtester test for the condition (since we've already broken this
>> twice and not noticed for long).

> Buildfarm member okapi just failed this test in 9.4:

okapi has continued to fail that test, not 100% of the time but much
more often than not ... but only in 9.4.  And no other animals have
shown it at all.  So what to make of that?

Noting that okapi uses a pretty old icc version running at a high -O
level, we could dismiss it as probably-a-compiler-bug.  But that theory
doesn't really account for the fact that it sometimes succeeds.

Another theory, noting that 9.5 and later have memory barriers in S_UNLOCK
which 9.4 lacks, is that the reason 9.4 has a problem is lack of a memory
barrier between SnapshotResetXmin and GetCurrentVirtualXIDs, thus allowing
both processes to observe the other's xmin as still nonzero given the
right timing.  This seems like a stretch, because really the latter
function's LWLockAcquire on ProcArrayLock ought to be enough to serialize
things.  But there has to be *something* different between 9.4 and all the
later branches, and the barrier stuff sure looks like it's in the right
neighborhood.
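To make the suspected race concrete, here is a minimal sketch of the classic "store buffering" pattern in portable C11. It is not PostgreSQL code: the names xmin_slot, backend, and count_mutual_blocking are hypothetical, and the sketch deliberately leaves out the LWLockAcquire that, as noted above, ought to serialize things in the real code. Each thread clears its own slot and then reads the other's; without a full fence between the store and the load, both threads are permitted to see the other's slot as still nonzero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for two backends' advertised xmin values. */
static atomic_uint xmin_slot[2];
static atomic_int  arrived;      /* crude start line so both threads race */
static unsigned    observed[2];  /* what each thread saw in the other's slot */
static bool        use_fence;

static void *backend(void *arg)
{
    int me = (int) (intptr_t) arg;
    int other = 1 - me;

    /* Wait until both threads are running. */
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < 2)
        ;

    /* Clear our own xmin (the role SnapshotResetXmin plays). */
    atomic_store_explicit(&xmin_slot[me], 0, memory_order_relaxed);

    /* Full barrier: orders the store above before the load below. */
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);

    /* Look at the other backend's xmin (the role GetCurrentVirtualXIDs plays). */
    observed[me] = atomic_load_explicit(&xmin_slot[other], memory_order_relaxed);
    return NULL;
}

/* Returns how many trials ended with BOTH threads seeing a nonzero xmin. */
int count_mutual_blocking(int trials, bool with_fence)
{
    int both = 0;

    use_fence = with_fence;
    for (int i = 0; i < trials; i++)
    {
        pthread_t t[2];

        atomic_store(&xmin_slot[0], 100);
        atomic_store(&xmin_slot[1], 100);
        atomic_store(&arrived, 0);

        for (intptr_t j = 0; j < 2; j++)
        {
            int rc = pthread_create(&t[j], NULL, backend, (void *) j);
            assert(rc == 0);
            (void) rc;
        }
        for (int j = 0; j < 2; j++)
            pthread_join(t[j], NULL);

        if (observed[0] != 0 && observed[1] != 0)
            both++;
    }
    return both;
}
```

With with_fence set to true, the C11 memory model guarantees that at least one thread sees the other's zero, so the count is always 0. With it set to false, a nonzero count is allowed but not guaranteed; whether it ever shows up depends on the hardware, the compiler, and timing, which would be consistent with a failure that is frequent but not 100% reproducible.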

As an investigative measure, I propose that we insert

    Assert(MyPgXact->xmin == InvalidTransactionId);

into 9.4's DefineIndex, just after its InvalidateCatalogSnapshot call.
I don't want to leave that there permanently, because it's not clear to me
that there are no legitimate cases where a backend would have extra
snapshots active during CREATE INDEX CONCURRENTLY --- but we seem to get
through 9.4's regression tests with it, and it would quickly confirm or
deny whether okapi is failing because it somehow has an extra snapshot.
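As a rough illustration of what that Assert would catch, here is a toy model in plain C. It assumes, as a simplification, that a backend's advertised xmin is cleared only once its last snapshot is released. ToyBackend and the toy_* functions are hypothetical and are not PostgreSQL's actual structures or API; only TransactionId and InvalidTransactionId mirror the real definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;               /* mirrors PostgreSQL's typedef */
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical, simplified per-backend state. */
typedef struct ToyBackend
{
    TransactionId xmin;             /* advertised xmin, like MyPgXact->xmin */
    int         snapshots_held;     /* snapshots not yet released */
} ToyBackend;

/* Taking the first snapshot advertises an xmin. */
void toy_take_snapshot(ToyBackend *b, TransactionId xmin)
{
    if (b->snapshots_held == 0)
        b->xmin = xmin;
    b->snapshots_held++;
}

/* Releasing the LAST snapshot is what clears the advertised xmin. */
void toy_release_snapshot(ToyBackend *b)
{
    assert(b->snapshots_held > 0);
    b->snapshots_held--;
    if (b->snapshots_held == 0)
        b->xmin = InvalidTransactionId;
}

/* The condition the proposed Assert checks. */
int toy_xmin_is_clear(const ToyBackend *b)
{
    return b->xmin == InvalidTransactionId;
}
```

In this model, a backend that took two snapshots and released only one still advertises a nonzero xmin, so the check fails; that is the "extra snapshot" situation the Assert is meant to confirm or rule out.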

Assuming that that doesn't show anything, I'm inclined to think that
the next step should be to add a pg_memory_barrier() call to
SnapshotResetXmin (again only in the 9.4 branch), and see if that helps.

            regards, tom lane
