Re: Race condition in backend process exit - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race condition in backend process exit
Date
Msg-id 3568.1123443910@sss.pgh.pa.us
Whole thread Raw
In response to Race condition in backend process exit  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race condition in backend process exit  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Re: Race condition in backend process exit  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Race condition in backend process exit  (Andrew Sullivan <ajs@crankycanuck.ca>)
List pgsql-hackers
I wrote:
> I can fairly consistently crash CVS tip with the following:
> ...
> Apparently, session 1's locks are being released while it still shows as
> an active transaction in the PGPROC array, causing XactLockTableWait to
> suppose it was a subtransaction and look for the parent.  This indicates
> something is being done incompletely or in the wrong order during
> backend exit, because AbortTransaction is perfectly clear that you mark
> yourself not running before you release your locks.  Haven't found it
> yet.

It looks to me like the problem is that ShutdownPostgres tries to get
away with doing just a subset of AbortTransaction; in particular it
does nothing to mark the PGPROC as not running a transaction anymore.
So when ProcKill releases locks, the xact is still InProgress.

I'm thinking that the correct fix is to forget the notion that it's
safer to do a subset of AbortTransaction than to do the whole thing.
We should make ShutdownPostgres do this:
AbortOutOfAnyTransaction();
/* Drop user-level locks, which are not dropped by xact abort */
#ifdef USER_LOCKSLockReleaseAll(USER_LOCKMETHOD, true);
#endif

and then remove the lock manager cleanup operations from ProcKill.

> I could not provoke the same crash in 8.0, but I suspect this may just
> be a chance timing difference, and that the bug may be of long standing.

I haven't done the experiment, but I'm pretty certain that it's possible
to provoke this same crash in 8.0 if the timing is right, which could be
forced by using gdb to delay execution at the right place in ProcKill.
In pre-8.0 releases XactLockTableWait doesn't try to chain up to parent
transactions, so the particular crash doesn't exist, but we still have
the problem that the exiting backend releases locks while its xact still
appears to be running.  That's been incorrect according to the comments
in xact.c since forever, so I would imagine that there are other race
conditions in which this is a Bad Thing.

This bug may well explain the known reports of failures from SIGTERM'ing
an individual backend, since (IIRC) that code path could also try to
exit the backend with a transaction still in progress.

I'm a bit hesitant to back-patch such a nontrivial and hard-to-test
change, but it sure looks badly broken to me.  Any thoughts about the
risks involved?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: psql and ROLES
Next
From: Alvaro Herrera
Date:
Subject: Re: Race condition in backend process exit