Race conditions, race conditions! - Mailing list pgsql-hackers

From Tom Lane
Subject Race conditions, race conditions!
Date
Msg-id 3584.1115487400@sss.pgh.pa.us
Responses Re: Race conditions, race conditions!  (Greg Stark <gsstark@mit.edu>)
Re: Race conditions, race conditions!  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List pgsql-hackers
I promised an analysis of the problems Jan Wieck uncovered yesterday,
so here it is.

Jan had developed a testbed (which I hope he will post) that essentially
has a bunch of client threads all doing instances of the same
transaction in READ COMMITTED mode.  The intended database invariant is
that the number of rows in table t2 with a particular ID value is equal
to the "cnt" field of the single row in t1 with that ID value:
    begin;
    select cnt from t1 where id = K for update;
    delete from t2 where id = K;
      -- check returned rowcount to see that exactly CNT rows were deleted
    insert into t2 values (K);  -- repeat this N times
    update t1 set cnt = N where id = K;
    commit;

K is randomly chosen for each execution from the known set of t1 keys,
and N is randomly chosen each time as a small positive integer.  This
should maintain the invariant, since at any time only one transaction
can be holding the SELECT FOR UPDATE lock on a particular t1 row.

The trick is to run this under astronomical load; 100 or so client
threads on a garden-variety PC is about right.
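
To make the shape of the test concrete, here is a minimal sketch in C of
one iteration of such a client, using libpq (hypothetical code, not Jan's
testbed; the connection string, key range, and error handling are all
assumed).  The real harness runs many of these loops concurrently:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=test");  /* assumed conninfo */
        PGresult   *res;
        char        sql[128];
        int         k = rand() % 100 + 1;   /* assumed range of t1 keys */
        int         n = rand() % 9 + 1;     /* small positive N */
        int         cnt,
                    i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        PQclear(PQexec(conn, "begin"));

        snprintf(sql, sizeof(sql),
                 "select cnt from t1 where id = %d for update", k);
        res = PQexec(conn, sql);
        cnt = atoi(PQgetvalue(res, 0, 0));
        PQclear(res);

        snprintf(sql, sizeof(sql), "delete from t2 where id = %d", k);
        res = PQexec(conn, sql);
        if (atoi(PQcmdTuples(res)) != cnt)  /* the failure Jan was seeing */
            fprintf(stderr, "id %d: expected to delete %d rows, deleted %s\n",
                    k, cnt, PQcmdTuples(res));
        PQclear(res);

        snprintf(sql, sizeof(sql), "insert into t2 values (%d)", k);
        for (i = 0; i < n; i++)
            PQclear(PQexec(conn, sql));

        snprintf(sql, sizeof(sql), "update t1 set cnt = %d where id = %d", n, k);
        PQclear(PQexec(conn, sql));
        PQclear(PQexec(conn, "commit"));

        PQfinish(conn);
        return 0;
    }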

What Jan was seeing was that every so often, a client thread would
error out, reporting that its DELETE had deleted zero rows rather
than the expected number.  But subsequent examination showed the t2
rows as being there.

Investigation showed that the connected backend had properly acquired
FOR UPDATE lock on the t1 row, but the snapshot it was using for the
subsequent DELETE showed the inserter of the t1 row as still running.
This should be impossible, since the SELECT FOR UPDATE cannot lock an
uncommitted row, and in READ COMMITTED mode we certainly will take a
new snapshot for the DELETE.

However, there is a race condition here.  During transaction commit,
xact.c first marks the transaction committed in pg_clog, and then
clears its XID from PGPROC.  This means there is a narrow window
in which both TransactionIdDidCommit and TransactionIdIsInProgress
will return true.  (We cannot do it the other way around, because
if neither one is returning true, onlookers are entitled to assume
that the transaction has crashed.)  However, the tqual.c routines
will allow a row to be seen as committed as soon as
TransactionIdDidCommit(xmin) returns true.  So the scenario is:

1. Backend A does RecordTransactionCommit to mark itself committed
in pg_clog, but then loses the CPU in the narrow window between
doing that and clearing its PGPROC entry.  Because of the ridiculous
load, it doesn't get control back for a while.

2. Backend B comes along to run the test transaction for the same K
value.  It inspects the t1 row, concludes it's committed, marks the
row as locked FOR UPDATE, and returns the results to the client.

3. The client now issues the DELETE command.  B takes a new snapshot,
but because A is still not cleared out of PGPROC, A's transaction
is shown as still running in the snapshot.

4. Now the DELETE will delete no rows, because it doesn't consider the
t2 rows it should delete to be committed.
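
For reference, the commit-time ordering that opens the window looks roughly
like this (a simplified sketch, not the actual xact.c code; the locking
around the PGPROC update is omitted):

    /* inside transaction commit, simplified */
    RecordTransactionCommit();          /* our XID is now COMMITTED in pg_clog */

    /*
     * Window: pg_clog already says "committed", but our XID has not yet
     * been removed from PGPROC, so TransactionIdIsInProgress() still
     * returns true for it.  Backend A was stalled right here.
     */

    MyProc->xid = InvalidTransactionId; /* XID now gone from PGPROC */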


AFAICS this race condition has always been there; certainly at
least since Vadim put in MVCC, and it looks likely that the
original Berkeley code had a form of the problem.

The correct fix is that the tqual.c routines should check commit status
in the following way:
    if (TransactionIdIsInProgress(xid))
        // still in progress, don't touch tuple
    else if (TransactionIdDidCommit(xid))
        // committed, mark accordingly
    else
        // must be aborted or crashed, mark accordingly

rather than what they have traditionally done:
    if (TransactionIdDidCommit(xid))
        // committed, mark accordingly
    else if (TransactionIdDidAbort(xid))
        // aborted, mark accordingly
    else
        // assume still in progress, don't touch tuple

Vadim recognized that the former logic was necessary for VACUUM to use
in deciding if it could clean up dead tuples, but apparently he never
made the extrapolation that it should be used *everywhere* transaction
status is tested.
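
Concretely, applied to a tuple's xmin in a tqual.c-style visibility routine,
the corrected ordering would look about like this (an illustrative fragment
only, not the actual patch; hint-bit maintenance is simplified):

    TransactionId xmin = HeapTupleHeaderGetXmin(tuple);

    if (TransactionIdIsInProgress(xmin))
        return false;                               /* inserter still running */
    else if (TransactionIdDidCommit(xmin))
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;   /* committed: set hint bit */
    else
        tuple->t_infomask |= HEAP_XMIN_INVALID;     /* aborted or crashed */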


The other interesting thing we saw was an Assertion failure.  The
postmaster log showed

WARNING:  relation "t1" page 196 is uninitialized --- fixing
TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
LOG:  server process (PID 11296) was terminated by signal 6

The WARNING could only have come from VACUUM.  (Jan's testbed does
launch a VACUUM every so often.)  The Assert failure is in
RelationGetBufferForTuple where it is adding a new page to a table.

I interpret this as follows: the guy doing RelationGetBufferForTuple added a
page, but before he could initialize it, he lost control for long enough
for a VACUUM to scan through the entire table, see the zeroed page, and
"fix" it.  Then when the first guy got control again, his Assert saying
the page was zeroes failed.  The window for this exists because bufmgr/smgr
do physically extend the file, but when the new buffer is returned
to the caller, it is only pinned, not locked.  So it is possible for
VACUUM to acquire lock on the new page before RelationGetBufferForTuple
does.  (This is exceedingly improbable, because VACUUM would have to
scan through the entire relation first --- it wouldn't examine the
new page at all unless it was included in RelationGetNumberOfBlocks
at the start of VACUUM's scan.  We have not been able to reproduce the
crash, but we did see it once.)

At first I thought this didn't have any consequences worse than an
Assert, since after all both VACUUM and the original extender are just
trying to set up the page as empty.  But consider this scenario:

1. Process A adds a zero page and gets blocked before initializing it.

2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it in FSM.

3. Process C takes the page from FSM and puts some new tuples in it.

4. Process A initializes the page, thus discarding C's tuples.

This could not happen if Asserts are enabled, because A would Assert
before re-initializing the page (and that's inside a lock so it is
sure to happen that way).  But with Asserts off, it's within the realm
of possibility under sufficiently heavy load.

This risk has been there since lazy VACUUM was created (it cannot
happen with VACUUM FULL because that takes an exclusive lock on the
whole table).  I believe the best fix is:

1. Change RelationGetBufferForTuple to hold the "relation extension"
lock until after it's exclusive-locked the new buffer.

2. Adjust VACUUM to not try to "fix" a page without getting the relation
extension lock.  If the page is still uninitialized after obtaining
that lock, then it must be a really lost page, not one that someone
is still in process of setting up.

There is a similar risk in btree indexes, with a similar fix.
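
In outline, the heap-side fix would look about like this (a sketch of the
approach only, not the committed patch; the extension-lock calls are named
as in later PostgreSQL sources and stand in for whatever locking the patch
actually uses):

    /* RelationGetBufferForTuple: keep the extension lock until the new
     * page is exclusive-locked, so nobody can see it uninitialized */
    LockRelationForExtension(relation, ExclusiveLock);
    buffer = ReadBuffer(relation, P_NEW);       /* physically extends the file */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    UnlockRelationForExtension(relation, ExclusiveLock);

    /* VACUUM: only "fix" an uninitialized page while holding that same lock;
     * if the page is still zeroed at that point, it really is a lost page */
    LockRelationForExtension(relation, ExclusiveLock);
    if (PageIsNew(page))
        PageInit(page, BufferGetPageSize(buffer), 0);
    UnlockRelationForExtension(relation, ExclusiveLock);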


We are currently testing candidate patches for these problems,
and so far haven't seen any further failures.
        regards, tom lane

