Thread: Race condition in backend process exit

Race condition in backend process exit

From

Tom Lane

Date:

07 August 2005, 16:47:22

I can fairly consistently crash CVS tip with the following:

Session 1:

regression=# begin;
BEGIN
regression=# select * from int4_tbl where f1 = 123456 for update;  f1   
--------123456
(1 row)

Session 2:

regression=# begin;
BEGIN
regression=# select * from int4_tbl where f1 = 123456 for update;
<< blocks >>

Session 1:

regression=# \q

Session 2 now crashes:

TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "lmgr.c", Line: 464)
LOG:  server process (PID 2337) was terminated by signal 6

with this backtrace:

#4  0x3000c4 in ExceptionalCondition (   conditionName=0xc4b3c "!(((xid) != ((TransactionId) 0)))",
errorType=0xc4a14"FailedAssertion", fileName=0xc49c8 "lmgr.c",    lineNumber=464) at assert.c:51
 
#5  0x265058 in XactLockTableWait (xid=0) at lmgr.c:464
#6  0x105730 in heap_lock_tuple (relation=0x7b03c451, tuple=0x7b03bdd0,    buffer=0x7b03bdec, cid=2063845554,
mode=LockTupleExclusive,nowait=0)   at heapam.c:2076
 
#7  0x1c5880 in ExecutePlan (estate=0x4010ab50, planstate=0xc0064b68,    operation=CMD_SELECT,
numberTuples=-1073329308,   direction=ForwardScanDirection, dest=0x40106708) at execMain.c:1192
 

Apparently, session 1's locks are being released while it still shows as
an active transaction in the PGPROC array, causing XactLockTableWait to
suppose it was a subtransaction and look for the parent.  This indicates
something is being done incompletely or in the wrong order during
backend exit, because AbortTransaction is perfectly clear that you mark
yourself not running before you release your locks.  Haven't found it
yet.

I could not provoke the same crash in 8.0, but I suspect this may just
be a chance timing difference, and that the bug may be of long standing.
        regards, tom lane

Re: Race condition in backend process exit

From

Tom Lane

Date:

07 August 2005, 19:45:16

I wrote:
> I can fairly consistently crash CVS tip with the following:
> ...
> Apparently, session 1's locks are being released while it still shows as
> an active transaction in the PGPROC array, causing XactLockTableWait to
> suppose it was a subtransaction and look for the parent.  This indicates
> something is being done incompletely or in the wrong order during
> backend exit, because AbortTransaction is perfectly clear that you mark
> yourself not running before you release your locks.  Haven't found it
> yet.

It looks to me like the problem is that ShutdownPostgres tries to get
away with doing just a subset of AbortTransaction; in particular it
does nothing to mark the PGPROC as not running a transaction anymore.
So when ProcKill releases locks, the xact is still InProgress.

I'm thinking that the correct fix is to forget the notion that it's
safer to do a subset of AbortTransaction than to do the whole thing.
We should make ShutdownPostgres do this:
AbortOutOfAnyTransaction();
/* Drop user-level locks, which are not dropped by xact abort */
#ifdef USER_LOCKSLockReleaseAll(USER_LOCKMETHOD, true);
#endif

and then remove the lock manager cleanup operations from ProcKill.

> I could not provoke the same crash in 8.0, but I suspect this may just
> be a chance timing difference, and that the bug may be of long standing.

I haven't done the experiment, but I'm pretty certain that it's possible
to provoke this same crash in 8.0 if the timing is right, which could be
forced by using gdb to delay execution at the right place in ProcKill.
In pre-8.0 releases XactLockTableWait doesn't try to chain up to parent
transactions, so the particular crash doesn't exist, but we still have
the problem that the exiting backend releases locks while its xact still
appears to be running.  That's been incorrect according to the comments
in xact.c since forever, so I would imagine that there are other race
conditions in which this is a Bad Thing.

This bug may well explain the known reports of failures from SIGTERM'ing
an individual backend, since (IIRC) that code path could also try to
exit the backend with a transaction still in progress.

I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
change, but it sure looks badly broken to me.  Any thoughts about the
risks involved?
        regards, tom lane

Re: Race condition in backend process exit

From

Alvaro Herrera

Date:

07 August 2005, 21:04:31

On Sun, Aug 07, 2005 at 03:45:10PM -0400, Tom Lane wrote:

> I'm thinking that the correct fix is to forget the notion that it's
> safer to do a subset of AbortTransaction than to do the whole thing.
> We should make ShutdownPostgres do this:
> 
>     AbortOutOfAnyTransaction();
> 
>     /* Drop user-level locks, which are not dropped by xact abort */
> #ifdef USER_LOCKS
>     LockReleaseAll(USER_LOCKMETHOD, true);
> #endif
> 
> and then remove the lock manager cleanup operations from ProcKill.

I agree it's cleaner.  It'd be comforting however if any cleanup
procedure would severely report when it finds inconsistent state (Most
of xact.c throws at least a WARNING, IIRC).  That way we'd know about
bogus conditions quickly.  OTOH, that code is in much better shape than
it was when ShutdownPostgres was last heavily modified, which AFAICS was
around revision 1.82.

> I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
> change, but it sure looks badly broken to me.  Any thoughts about the
> risks involved?

How far back?  I'd do it in 8.0 but not in earlier releases.  The
transaction management code changed a lot in between, and I think a lot
of bugs were corrected.

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
Jason Tesser: You might not have understood me or I am not understanding you.
Paul Thomas: It feels like we're 2 people divided by a common language...

Re: Race condition in backend process exit

From

Tom Lane

Date:

07 August 2005, 23:37:56

I wrote:
>> I could not provoke the same crash in 8.0, but I suspect this may just
>> be a chance timing difference, and that the bug may be of long standing.

> I haven't done the experiment, but I'm pretty certain that it's possible
> to provoke this same crash in 8.0 if the timing is right, which could be
> forced by using gdb to delay execution at the right place in ProcKill.

Having done the experiment, I can now say that 8.0 and prior versions
are *not* vulnerable, but the reason is, um, subtle.  The actual
execution order of on_shmem_exit callbacks in an exiting backend is
ShutdownPostgresCleanupInvalidationStateProcKill

CleanupInvalidationState removes the backend from the SI invalidation
message ring.  Until I recently refactored the code to separate the
PGPROC array from the SI mechanism, that had the side effect of making
the backend's PGPROC disappear from the set visible to
TransactionIdIsInProgress.  Which means that in fact the released
versions do honor the rule "stop being in-progress before you release
locks".

This behavior is obviously mighty fragile, not to say undocumented,
so I'm still strongly inclined to make ShutdownPostgres do a normal
transaction abort sequence.  But I'm no longer very excited about
back-patching it.

> This bug may well explain the known reports of failures from SIGTERM'ing
> an individual backend, since (IIRC) that code path could also try to
> exit the backend with a transaction still in progress.

The particular issue exhibited here evidently isn't the explanation
for SIGTERM problems in existing releases ... but I still suspect that
those reports might have something to do with ShutdownPostgres taking
shortcuts with transaction abort.
        regards, tom lane

Re: Race condition in backend process exit

From

Andrew Sullivan

Date:

11 August 2005, 21:09:25

On Sun, Aug 07, 2005 at 03:45:10PM -0400, Tom Lane wrote:
> 
> I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
> change, but it sure looks badly broken to me.  Any thoughts about the
> risks involved?

If there were some way to test it reliably, it'd make me feel a lot
better.  I guess I'd as soon hear about the risks involved in not
back-patching, because this seems like a pretty dangerous place to be
back-patching (and race condition fixes, in my experience, are often
well-loaded foot-guns, even if I do trust your coding abilities more
than approximately anyone else's).

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.                --Brad Holland