Thread: Race conditions, race conditions!

Race conditions, race conditions!

From
Tom Lane
Date:
I promised an analysis of the problems Jan Wieck uncovered yesterday,
so here it is.

Jan had developed a testbed (which I hope he will post) that essentially
has a bunch of client threads all doing instances of the same
transaction in READ COMMITTED mode.  The intended database invariant is
that the number of rows in table t2 with a particular ID value is equal
to the "cnt" field of the single row in t1 with that ID value:
    begin;
    select cnt from t1 where id = K for update;
    delete from t2 where id = K;
    -- check returned rowcount to see that exactly CNT rows were deleted
    insert into t2 values (K);  -- repeat this N times
    update t1 set cnt = N where id = K;
    commit;

K is randomly chosen for each execution from the known set of t1 keys,
and N is randomly chosen each time as a small positive integer.  This
should maintain the invariant, since at any time only one transaction
can be holding the SELECT FOR UPDATE lock on a particular t1 row.

The trick is to run this under astronomical load; 100 or so client
threads on a garden-variety PC is about right.
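
[Jan's testbed itself hasn't been posted, but its shape can be sketched in a few lines of Python against an in-memory stand-in for the two tables. The per-key mutex plays the role of the SELECT ... FOR UPDATE row lock; every name here is illustrative, not PostgreSQL code.]

```python
import random
import threading

t1 = {k: 0 for k in range(10)}                 # id -> cnt
t2 = []                                        # one list entry per t2 row (the id value)
row_locks = {k: threading.Lock() for k in t1}  # stands in for FOR UPDATE row locks
t2_guard = threading.Lock()                    # protects the shared t2 list itself
failures = []

def run_transaction():
    k = random.choice(list(t1))
    n = random.randint(1, 5)
    with row_locks[k]:                         # "select cnt ... for update"
        cnt = t1[k]
        with t2_guard:
            deleted = t2.count(k)              # rowcount of the DELETE
            t2[:] = [v for v in t2 if v != k]  # "delete from t2 where id = K"
            t2.extend([k] * n)                 # N inserts
        if deleted != cnt:                     # the rowcount check from the test
            failures.append((k, cnt, deleted))
        t1[k] = n                              # "update t1 set cnt = N where id = K"

def worker():
    for _ in range(200):
        run_transaction()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

assert not failures                            # invariant holds with correct locking
assert all(t2.count(k) == cnt for k, cnt in t1.items())
```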

What Jan was seeing was that every so often, a client thread would
error out, reporting that its DELETE had deleted zero rows rather
than the expected number.  But subsequent examination showed the t2
rows as being there.

Investigation showed that the connected backend had properly acquired
FOR UPDATE lock on the t1 row, but the snapshot it was using for the
subsequent DELETE showed the inserter of the t1 row as still running.
This should be impossible, since the SELECT FOR UPDATE cannot lock an
uncommitted row, and in READ COMMITTED mode we certainly will take a
new snapshot for the DELETE.

However, there is a race condition here.  During transaction commit,
xact.c first marks the transaction committed in pg_clog, and then
clears its XID from PGPROC.  This means there is a narrow window
in which both TransactionIdDidCommit and TransactionIdIsInProgress
will return true.  (We cannot do it the other way around, because
if neither one is returning true, onlookers are entitled to assume
that the transaction has crashed.)  However, the tqual.c routines
will allow a row to be seen as committed as soon as
TransactionIdDidCommit(xmin) returns true.  So the scenario is:
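
[That window can be replayed with a toy model of pg_clog and the PGPROC array; these are plain Python containers, not the real data structures.]

```python
pg_clog = {}        # xid -> 'committed' / 'aborted'
proc_array = set()  # xids currently listed as running in PGPROC

def transaction_id_did_commit(xid):
    return pg_clog.get(xid) == 'committed'

def transaction_id_is_in_progress(xid):
    return xid in proc_array

# Backend A starts a transaction.
xid = 100
proc_array.add(xid)

# Commit step 1: RecordTransactionCommit marks pg_clog ...
pg_clog[xid] = 'committed'

# ... and here is the narrow window: A has not yet cleared its PGPROC
# entry, so BOTH predicates are true at once.
assert transaction_id_did_commit(xid)
assert transaction_id_is_in_progress(xid)

# Commit step 2: clear the XID from PGPROC; the window closes.
proc_array.discard(xid)
assert not transaction_id_is_in_progress(xid)
```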

1. Backend A does RecordTransactionCommit to mark itself committed
in pg_clog, but then loses the CPU in the narrow window between
doing that and clearing its PGPROC entry.  Because of the ridiculous
load, it doesn't get control back for awhile.

2. Backend B comes along to run the test transaction for the same K
value.  It inspects the t1 row, concludes it's committed, marks the
row as locked FOR UPDATE, and returns the results to the client.

3. The client now issues the DELETE command.  B takes a new snapshot,
but because A is still not cleared out of PGPROC, A's transaction
is shown as still running in the snapshot.

4. Now the DELETE will delete no rows, because it doesn't consider the
t2 rows it should delete to be committed.


AFAICS this race condition has always been there; certainly at
least since Vadim put in MVCC, and it looks likely that the
original Berkeley code had a form of the problem.

The correct fix is that the tqual.c routines should check commit status
in the following way:
    if (TransactionIdIsInProgress(xid))
       // still in progress, don't touch tuple
    else if (TransactionIdDidCommit(xid))
       // committed, mark accordingly
    else
       // must be aborted or crashed, mark accordingly

rather than what they have traditionally done:
    if (TransactionIdDidCommit(xid))
       // committed, mark accordingly
    else if (TransactionIdDidAbort(xid))
       // aborted, mark accordingly
    else
       // assume still in progress, don't touch tuple

Vadim recognized that the former logic was necessary for VACUUM to use
in deciding if it could clean up dead tuples, but apparently he never
made the extrapolation that it should be used *everywhere* transaction
status is tested.
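
[Side by side, the two orderings behave differently for exactly the mid-commit XID from the scenario. Again this is a toy model, not the tqual.c source.]

```python
# Mid-commit state: backend A's commit is already in pg_clog,
# but its XID has not yet been cleared from PGPROC.
pg_clog = {100: 'committed'}
proc_array = {100}

def did_commit(xid):  return pg_clog.get(xid) == 'committed'
def did_abort(xid):   return pg_clog.get(xid) == 'aborted'
def in_progress(xid): return xid in proc_array

def tuple_status_old(xid):
    """Traditional tqual.c order: believe pg_clog first."""
    if did_commit(xid):  return 'committed'
    if did_abort(xid):   return 'aborted'
    return 'in progress'

def tuple_status_new(xid):
    """Fixed order: ask TransactionIdIsInProgress before pg_clog."""
    if in_progress(xid): return 'in progress'
    if did_commit(xid):  return 'committed'
    return 'aborted'

# The old order lets backend B treat the row as committed (and lock it
# FOR UPDATE) even though B's next snapshot, built from the proc array,
# will still show xid 100 as running -- the inconsistency behind the
# zero-row DELETE.  The new order keeps the two views consistent.
assert tuple_status_old(100) == 'committed'
assert tuple_status_new(100) == 'in progress'
```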


The other interesting thing we saw was an Assertion failure.  The
postmaster log showed

WARNING:  relation "t1" page 196 is uninitialized --- fixing
TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
LOG:  server process (PID 11296) was terminated by signal 6

The WARNING could only have come from VACUUM.  (Jan's testbed does
launch a VACUUM every so often.)  The Assert failure is in
RelationGetBufferForTuple where it is adding a new page to a table.

I interpret this as the guy doing RelationGetBufferForTuple added a
page, but before he could initialize it, he lost control for long enough
for a VACUUM to scan through the entire table, see the zeroed page, and
"fix" it.  Then when the first guy got control again, his Assert saying
the page was zeroes failed.  The window for this exists because bufmgr/smgr
do physically extend the file, but when the new buffer is returned
to the caller, it is only pinned, not locked.  So it is possible for
VACUUM to acquire lock on the new page before RelationGetBufferForTuple
does.  (This is exceedingly improbable, because VACUUM would have to
scan through the entire relation first --- it wouldn't examine the
new page at all unless it was included in RelationGetNumberOfBlocks
at the start of VACUUM's scan.  We have not been able to reproduce the
crash, but we did see it once.)

At first I thought this didn't have any consequences worse than an
Assert, since after all both VACUUM and the original extender are just
trying to set up the page as empty.  But consider this scenario:

1. Process A adds a zero page and gets blocked ahead of initializing it.

2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it in FSM.

3. Process C takes the page from FSM and puts some new tuples in it.

4. Process A initializes the page, thus discarding C's tuples.

This could not happen if Asserts are enabled, because A would Assert
before re-initializing the page (and that's inside a lock so it is
sure to happen that way).  But with Asserts off, it's within the realm
of possibility under sufficiently heavy load.

This risk has been there since lazy VACUUM was created (it cannot
happen with VACUUM FULL because that takes an exclusive lock on the
whole table).  I believe the best fix is:

1. Change RelationGetBufferForTuple to hold the "relation extension"
lock until after it's exclusive-locked the new buffer.

2. Adjust VACUUM to not try to "fix" a page without getting the relation
extension lock.  If the page is still uninitialized after obtaining
that lock, then it must be a really lost page, not one that someone
is still in process of setting up.
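
[The four-step schedule can be replayed sequentially in a toy model; each statement below is one "turn", and page_init wipes the page the way PageInit does on a real heap page. Names are illustrative only.]

```python
page = {'initialized': False, 'tuples': []}
fsm = []  # stand-in for the free space map

def page_init():
    page['initialized'] = True
    page['tuples'] = []

# 1. Process A adds a zero page (not yet initialized) and loses the CPU.
# 2. VACUUM sees the zero page, "fixes" it, and records its free space.
page_init(); fsm.append('page')
# 3. Process C pulls the page from the FSM and inserts a tuple.
fsm.pop(); page['tuples'].append('c-tuple')
# 4. Process A finally initializes "its" new page: C's tuple is gone.
page_init()
assert page['tuples'] == []   # silent data loss with Asserts off

# With the proposed fix, steps 1 and 2 are serialized on the relation
# extension lock: A initializes the page before releasing it, and VACUUM
# re-checks the page only after taking that same lock.
```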

There is a similar risk in btree indexes, with a similar fix.


We are currently testing candidate patches for these problems,
and so far haven't seen any further failures.
        regards, tom lane


Re: Race conditions, race conditions!

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.

I wonder if there's an argument for building assertion-enabled builds with
code that randomly yields the processor some percentage of time before and
after taking a lock. It wouldn't catch every case but it might help.
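
[As a sketch of that idea (illustrative Python, not PostgreSQL's C): a lock wrapper that randomly yields the processor before and after acquisition, so correct locking still passes while marginal lock placement gets stressed harder.]

```python
import random
import threading
import time

class JitterLock:
    """Lock that randomly yields before and after acquiring (test builds only)."""
    def __init__(self, p=0.25):
        self._lock = threading.Lock()
        self._p = p

    def __enter__(self):
        if random.random() < self._p:
            time.sleep(0)            # give up the rest of the timeslice
        self._lock.acquire()
        if random.random() < self._p:
            time.sleep(0)
        return self

    def __exit__(self, *exc):
        self._lock.release()

# Correctly locked code is unaffected by the jitter ...
counter = 0
lock = JitterLock()

def bump():
    global counter
    for _ in range(1000):
        with lock:
            counter += 1

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 4000  # ... while unlocked updates would now lose counts far more often
```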

-- 
greg



Re: Race conditions, race conditions!

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I wonder if there's an argument for building assertion-enabled builds with
> code that randomly yields the processor some percentage of time before and
> after taking a lock. It wouldn't catch every case but it might help.

Seems like that would mainly help you find cases where you'd put a lock
acquire or release a bit too late or too soon in a sequence of events;
not cases where you'd failed to acquire a needed lock at all.  It'd be
more useful I think to have a facility that randomly stops backends for
awhile regardless of exactly where they are in the code.

A high-load test case actually does this to some extent, but the problem
is you have little reproducibility and no assurance that execution
stopped for long enough to let critical events happen elsewhere.  The
ideal facility I think would slow one backend much more than others,
whereas high load still leaves them all making progress at about the
same rate ...
        regards, tom lane


Re: Race conditions, race conditions!

From
Tatsuo Ishii
Date:
Are we going to put the fixes into 8.0.3 and so on? Or will it be
included in 8.0.4?
--
Tatsuo Ishii



Re: Race conditions, race conditions!

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> Are we going to put the fixes into 8.0.3 and so on? Or will it be
> included in 8.0.4?

We have removed 8.0.3 from the FTP servers and plan to re-release 8.0.3
and previous 7.X releases.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Race conditions, race conditions!

From
"Jim C. Nasby"
Date:
On Sat, May 07, 2005 at 07:20:48PM -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > I wonder if there's an argument for building assertion-enabled builds with
> > code that randomly yields the processor some percentage of time before and
> > after taking a lock. It wouldn't catch every case but it might help.
> 
> Seems like that would mainly help you find cases where you'd put a lock
> acquire or release a bit too late or too soon in a sequence of events;
> not cases where you'd failed to acquire a needed lock at all.  It'd be
> more useful I think to have a facility that randomly stops backends for
> awhile regardless of exactly where they are in the code.
> 
> A high-load test case actually does this to some extent, but the problem
> is you have little reproducibility and no assurance that execution
> stopped for long enough to let critical events happen elsewhere.  The
> ideal facility I think would slow one backend much more than others,
> whereas high load still leaves them all making progress at about the
> same rate ...

Would setting different priorities/niceness on different backends during
the stress test help? It might not be perfect but it should be trivial
to accomplish...
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Race conditions, race conditions!

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes
>
> WARNING:  relation "t1" page 196 is uninitialized --- fixing
> TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
> LOG:  server process (PID 11296) was terminated by signal 6
>

Inspired by this, can we put an assertion here:

---
  /*
   * lookup the buffer.  IO_IN_PROGRESS is set if the requested
   * block is not currently in memory.
   */
  bufHdr = BufferAlloc(reln, blockNum, &found);

+ /* we are guaranteed that nobody else has touched this will-be-new block */
+ Assert(!(found && isExtend));

  if (found)
   BufferHitCount++;
---

BufferAlloc() consists of two parts, one is looking for the blockNum, the
other is allocating a free buffer. If the assertion is good, then for
"isExtend", we just need the second part. This could bring marginal
performance benefits.


Regards,
Qingqing




Re: Race conditions, race conditions!

From
Bruce Momjian
Date:
Patch applied.  Thanks.

---------------------------------------------------------------------------


Qingqing Zhou wrote:
>
> "Tom Lane" <tgl@sss.pgh.pa.us> writes
> >
> > WARNING:  relation "t1" page 196 is uninitialized --- fixing
> > TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
> > LOG:  server process (PID 11296) was terminated by signal 6
> >
>
> Inspired by this, can we put an assertion here:
>
> ---
>   /*
>    * lookup the buffer.  IO_IN_PROGRESS is set if the requested
>    * block is not currently in memory.
>    */
>   bufHdr = BufferAlloc(reln, blockNum, &found);
>
> + /* we are guaranteed that nobody else has touched this will-be-new block */
> + Assert(!(found && isExtend));
>
>   if (found)
>    BufferHitCount++;
> ---
>
> BufferAlloc() consists of two parts, one is looking for the blockNum, the
> other is allocating a free buffer. If the assertion is good, then for
> "isExtend", we just need the second part. This could bring marginal
> performance benefits.
>
>
> Regards,
> Qingqing
>
>
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.193
diff -c -c -r1.193 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    12 Aug 2005 05:05:50 -0000    1.193
--- src/backend/storage/buffer/bufmgr.c    12 Aug 2005 21:40:41 -0000
***************
*** 153,158 ****
--- 153,160 ----
           * block is not currently in memory.
           */
          bufHdr = BufferAlloc(reln, blockNum, &found);
+         /* we are guaranteed that nobody else has touched this will-be-new block */
+         Assert(!(found && isExtend));
          if (found)
              BufferHitCount++;
      }

Re: Race conditions, race conditions!

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Patch applied.  Thanks.
>            * block is not currently in memory.
>            */
>           bufHdr = BufferAlloc(reln, blockNum, &found);
> +         /* we are guaranteed that nobody else has touched this will-be-new block */
> +         Assert(!(found && isExtend));
>           if (found)
>               BufferHitCount++;
>       }

This patch is utterly wrong.  Please revert it.

The case it is Asserting can't happen is explained in the comment a
couple dozen lines further down:

* try to extend a relation
* read smgrnblocks to find the current relation length
* allocate an empty buffer for the N+1'st page of the rel
* call smgrextend
* smgrextend fails for some reason (eg, no space left on disk)
* buffer remains present, but is not BM_VALID
* awhile later, try to extend relation again
* read smgrnblocks to find the current relation length
* allocate a buffer for the N+1'st page of the rel

This is entirely likely to find the same non-BM_VALID buffer that was
assigned in the first iteration.
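
[The counterexample can be replayed with a toy buffer table: after a failed smgrextend, the buffer for block N+1 stays in the pool but is not BM_VALID, so the retry legitimately hits found == true. Names here are illustrative, not the bufmgr.c code.]

```python
buffer_pool = {}   # (rel, block) -> {'valid': bool}

def buffer_alloc(rel, block):
    """Return (buffer, found): allocate an invalid buffer if none exists."""
    found = (rel, block) in buffer_pool
    if not found:
        buffer_pool[(rel, block)] = {'valid': False}
    return buffer_pool[(rel, block)], found

# First extension attempt: allocate a buffer for block N+1 ...
buf, found = buffer_alloc('t1', 197)
assert not found
# ... then smgrextend fails (say, disk full); the buffer stays in the
# pool but never becomes BM_VALID.

# Later extension attempt for the same block N+1:
buf, found = buffer_alloc('t1', 197)
assert found and not buf['valid']   # Assert(!(found && isExtend)) would fire here
```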
        regards, tom lane


Re: Race conditions, race conditions!

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Patch applied.  Thanks.
> >            * block is not currently in memory.
> >            */
> >           bufHdr = BufferAlloc(reln, blockNum, &found);
> > +         /* we are guaranteed that nobody else has touched this will-be-new block */
> > +         Assert(!(found && isExtend));
> >           if (found)
> >               BufferHitCount++;
> >       }
> 
> This patch is utterly wrong.  Please revert it.
> 
> The case it is Asserting can't happen is explained in the comment a
> couple dozen lines further down:
> 
> * try to extend a relation
> * read smgrnblocks to find the current relation length
> * allocate an empty buffer for the N+1'st page of the rel
> * call smgrextend
> * smgrextend fails for some reason (eg, no space left on disk)
> * buffer remains present, but is not BM_VALID
> * awhile later, try to extend relation again
> * read smgrnblocks to find the current relation length
> * allocate a buffer for the N+1'st page of the rel
> 
> This is entirely likely to find the same non-BM_VALID buffer that was
> assigned in the first iteration.

Reverted.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Race conditions, race conditions!

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> This patch is utterly wrong.  Please revert it.
>
> This is entirely likely to find the same non-BM_VALID buffer that was
> assigned in the first iteration.
>

Yes, Tom is right. A relation extension might find a non-BM_VALID buffer
left by previous failed relation extension.

Regards,
Qingqing




Re: Race conditions, race conditions!

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> This is entirely likely to find the same non-BM_VALID buffer that was
> assigned in the first iteration.
>

So after we found it, we still need to extend the file. In ReadBuffer():

---
/* if it was already in the buffer pool, we're done */
if (found)
{
    ...
    return BufferDescriptorGetBuffer(bufHdr);
}

/* ... so future attempts ... come right back here to
 * try smgrextend again. */
---

Shall we write

   /* If it was already in the buffer pool and not for extension, we're done */
   if (found && !isExtend)

instead?

Regards,
Qingqing





Re: Race conditions, race conditions!

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> Shall we write

>     /* If it was already in the buffer pool and not for extension, we're
> done */
>     if (found && !isExtend)

> instead?

If you can demonstrate a problem in this code, please do.  I'm very much
less than excited about making random changes though.
        regards, tom lane