Race conditions, race conditions! - Mailing list pgsql-hackers
From: Tom Lane
Subject: Race conditions, race conditions!
Date:
Msg-id: 3584.1115487400@sss.pgh.pa.us
List: pgsql-hackers
I promised an analysis of the problems Jan Wieck uncovered yesterday, so here it is.

Jan had developed a testbed (which I hope he will post) that essentially has a bunch of client threads all doing instances of the same transaction in READ COMMITTED mode. The intended database invariant is that the number of rows in table t2 with a particular ID value is equal to the "cnt" field of the single row in t1 with that ID value:

    begin;
    select cnt from t1 where id = K for update;
    delete from t2 where id = K;
    -- check returned rowcount to see that exactly CNT rows were deleted
    insert into t2 values (K);    -- repeat this N times
    update t1 set cnt = N where id = K;
    commit;

K is randomly chosen for each execution from the known set of t1 keys, and N is randomly chosen each time as a small positive integer. This should maintain the invariant, since at any time only one transaction can be holding the SELECT FOR UPDATE lock on a particular t1 row. The trick is to run this under astronomical load; 100 or so client threads on a garden-variety PC is about right.

What Jan was seeing was that every so often, a client thread would error out, reporting that its DELETE had deleted zero rows rather than the expected number. But subsequent examination showed the t2 rows as being there. Investigation showed that the connected backend had properly acquired FOR UPDATE lock on the t1 row, but the snapshot it was using for the subsequent DELETE showed the inserter of the t1 row as still running. This should be impossible, since the SELECT FOR UPDATE cannot lock an uncommitted row, and in READ COMMITTED mode we certainly will take a new snapshot for the DELETE.

However, there is a race condition here. During transaction commit, xact.c first marks the transaction committed in pg_clog, and then clears its XID from PGPROC. This means there is a narrow window in which both TransactionIdDidCommit and TransactionIdIsInProgress will return true. (We cannot do it the other way around, because if neither one is returning true, onlookers are entitled to assume that the transaction has crashed.) However, the tqual.c routines will allow a row to be seen as committed as soon as TransactionIdDidCommit(xmin) returns true. So the scenario is:

1. Backend A does RecordTransactionCommit to mark itself committed in pg_clog, but then loses the CPU in the narrow window between doing that and clearing its PGPROC entry. Because of the ridiculous load, it doesn't get control back for awhile.

2. Backend B comes along to run the test transaction for the same K value. It inspects the t1 row, concludes it's committed, marks the row as locked FOR UPDATE, and returns the results to the client.

3. The client now issues the DELETE command. B takes a new snapshot, but because A is still not cleared out of PGPROC, A's transaction is shown as still running in the snapshot.

4. Now the DELETE will delete no rows, because it doesn't consider the t2 rows it should delete to be committed.

AFAICS this race condition has always been there; certainly at least since Vadim put in MVCC, and it looks likely that the original Berkeley code had a form of the problem.
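To make the window concrete, here is a minimal, self-contained C sketch. It is not the actual xact.c / tqual.c code; the array and variable below are invented stand-ins for pg_clog and for the committing backend's PGPROC entry. It only shows how both status checks can answer true between the two commit steps.

/*
 * Illustrative sketch only -- simplified stand-ins, not PostgreSQL source.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

static bool clog_committed[100];        /* stand-in for pg_clog */
static TransactionId proc_xid = 0;      /* stand-in for our PGPROC xid; 0 = cleared */

static bool TransactionIdDidCommit(TransactionId xid)
{
    return clog_committed[xid];
}

static bool TransactionIdIsInProgress(TransactionId xid)
{
    return proc_xid == xid;
}

int main(void)
{
    TransactionId xid = 42;

    proc_xid = xid;                     /* transaction is running */

    /* Commit step 1: RecordTransactionCommit marks the xact committed in pg_clog */
    clog_committed[xid] = true;

    /*
     * The race window: if the committing backend loses the CPU here,
     * onlookers see the transaction as committed (so FOR UPDATE can lock
     * its row) while any snapshot taken now still lists it as running.
     */
    printf("in window:  DidCommit=%d  IsInProgress=%d\n",
           TransactionIdDidCommit(xid), TransactionIdIsInProgress(xid));

    /* Commit step 2: clear our XID from PGPROC */
    proc_xid = 0;

    printf("after:      DidCommit=%d  IsInProgress=%d\n",
           TransactionIdDidCommit(xid), TransactionIdIsInProgress(xid));
    return 0;
}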
The correct fix is that the tqual.c routines should check commit status in the following way:

    if (TransactionIdIsInProgress(xid))
        // still in progress, don't touch tuple
    else if (TransactionIdDidCommit(xid))
        // committed, mark accordingly
    else
        // must be aborted or crashed, mark accordingly

rather than what they have traditionally done:

    if (TransactionIdDidCommit(xid))
        // committed, mark accordingly
    else if (TransactionIdDidAbort(xid))
        // aborted, mark accordingly
    else
        // assume still in progress, don't touch tuple

Vadim recognized that the former logic was necessary for VACUUM to use in deciding if it could clean up dead tuples, but apparently he never made the extrapolation that it should be used *everywhere* transaction status is tested.

The other interesting thing we saw was an Assertion failure. The postmaster log showed

    WARNING:  relation "t1" page 196 is uninitialized --- fixing
    TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
    LOG:  server process (PID 11296) was terminated by signal 6

The WARNING could only have come from VACUUM. (Jan's testbed does launch a VACUUM every so often.) The Assert failure is in RelationGetBufferForTuple where it is adding a new page to a table. I interpret this as: the guy doing RelationGetBufferForTuple added a page, but before he could initialize it, he lost control for long enough for a VACUUM to scan through the entire table, see the zeroed page, and "fix" it. Then when the first guy got control again, his Assert saying the page was all zeroes failed.

The window for this exists because bufmgr/smgr do physically extend the file, but when the new buffer is returned to the caller, it is only pinned, not locked. So it is possible for VACUUM to acquire lock on the new page before RelationGetBufferForTuple does. (This is exceedingly improbable, because VACUUM would have to scan through the entire relation first --- it wouldn't examine the new page at all unless it was included in RelationGetNumberOfBlocks at the start of VACUUM's scan. We have not been able to reproduce the crash, but we did see it once.)

At first I thought this didn't have any consequences worse than an Assert, since after all both VACUUM and the original extender are just trying to set up the page as empty. But consider this scenario:

1. Process A adds a zero page and gets blocked ahead of initializing it.

2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it in FSM.

3. Process C takes the page from FSM and puts some new tuples in it.

4. Process A initializes the page, thus discarding C's tuples.

This could not happen if Asserts are enabled, because A would Assert before re-initializing the page (and that's inside a lock so it is sure to happen that way). But with Asserts off, it's within the realm of possibility under sufficiently heavy load. This risk has been there since lazy VACUUM was created (it cannot happen with VACUUM FULL because that takes an exclusive lock on the whole table).

I believe the best fix is:

1. Change RelationGetBufferForTuple to hold the "relation extension" lock until after it's exclusive-locked the new buffer.

2. Adjust VACUUM to not try to "fix" a page without getting the relation extension lock. If the page is still uninitialized after obtaining that lock, then it must be a really lost page, not one that someone is still in process of setting up.

There is a similar risk in btree indexes, with a similar fix.
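For concreteness, here is a self-contained, single-threaded C sketch of the locking order described in points 1 and 2 above. The types and helper functions are invented stand-ins, not the real hio.c or VACUUM code: the extender gives up the extension lock only after the new buffer is exclusive-locked, and VACUUM re-checks an uninitialized page only while holding that same lock.

/*
 * Illustrative sketch only -- invented stand-ins, not PostgreSQL source.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool initialized; } Page;
typedef struct { bool extension_locked; Page newest; } Relation;

static void AcquireExtensionLock(Relation *rel) { rel->extension_locked = true; }
static void ReleaseExtensionLock(Relation *rel) { rel->extension_locked = false; }
static void LockBufferExclusive(Page *pg)       { (void) pg; /* buffer content lock */ }
static void InitPage(Page *pg)                  { pg->initialized = true; }

/* Extender side: analogue of the RelationGetBufferForTuple change */
static Page *
GetNewPageForTuple(Relation *rel)
{
    Page *pg;

    AcquireExtensionLock(rel);
    pg = &rel->newest;              /* "physically extend" the relation */
    LockBufferExclusive(pg);        /* exclusive-lock the new buffer first ... */
    ReleaseExtensionLock(rel);      /* ... only then give up the extension lock */
    InitPage(pg);
    return pg;
}

/* VACUUM side: don't "fix" a zero page without the relation extension lock */
static void
MaybeFixUninitializedPage(Relation *rel, Page *pg)
{
    if (pg->initialized)
        return;
    AcquireExtensionLock(rel);      /* anyone mid-extension still holds this */
    LockBufferExclusive(pg);
    if (!pg->initialized)           /* still zeroed: a genuinely lost page */
        InitPage(pg);
    ReleaseExtensionLock(rel);
}

int main(void)
{
    Relation rel = { false, { false } };
    Page *pg = GetNewPageForTuple(&rel);
    MaybeFixUninitializedPage(&rel, pg);
    printf("page initialized: %d\n", pg->initialized);
    return 0;
}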
We are currently testing candidate patches for these problems, and so far haven't seen any further failures.

            regards, tom lane