Thread: WAL dirty-buffer management bug

WAL dirty-buffer management bug

From
Tom Lane
Date:
I added some notes to src/backend/access/transam/README explaining the
protocol for executing a WAL-logged action:

---

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION()  (Any error during the next two steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk.  Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location.  For instance,
    recptr = XLogInsert(rmgr_id, info, rdata);
    PageSetLSN(dp, recptr);    PageSetTLI(dp, ThisTimeLineID);

5. END_CRIT_SECTION()

6. Unlock and write the buffer(s):
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);    WriteBuffer(buffer);

(Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it
dirty and unpins it.  The write will not happen until a checkpoint occurs or
the shared buffer is needed for another page.)

---

This is pretty much what heapam and btree currently do, but on looking
at it I think it's got a problem: we really ought to mark the buffer
dirty before releasing the critical section.  Otherwise, if there's an
elog(ERROR) before the WriteBuffer call is reached, the backend would go
on about its business, and we'd have changes in a disk buffer that isn't
marked dirty.  The changes would be uncommitted, presumably, because of
the error --- but nonetheless this could result in inconsistency down
the road.  One example scenario is:1. We insert a tuple at, say, index 3 on a page.2. elog after making the XLOG entry,
butbefore WriteBuffer.3. page is later discarded from shared buffers; since it's not   marked dirty, it'll just be
droppedwithout writing it.4. Later we need to insert another tuple in same table, and   we again choose index 3 on this
pageas the place to put it.5. system crash leads to replay from WAL.
 
Now we'll have two different WAL records trying to insert tuple 3.
Not good.

While there's not much chance of an elog in just LockBuffer/WriteBuffer,
some of the code paths in btree do quite a bit of stuff before they
write the buffer (this was probably done to avoid multiple entries to
bufmgr, back when the BufMgrLock was a severe source of contention).
So I think there's a nontrivial risk of index corruption from this
coding practice.

I'm thinking we should change the code and the README to specify that
you must mark the buffer dirty before you can END_CRIT_SECTION().
Comments?
        regards, tom lane


Re: WAL dirty-buffer management bug

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> This is pretty much what heapam and btree currently do, but on looking
> at it I think it's got a problem: we really ought to mark the buffer
> dirty before releasing the critical section.  Otherwise, if there's an
> elog(ERROR) before the WriteBuffer call is reached, the backend would go
> on about its business, and we'd have changes in a disk buffer that isn't
> marked dirty.  The changes would be uncommitted, presumably, because of
> the error --- but nonetheless this could result in inconsistency down
> the road.  One example scenario is:
> 1. We insert a tuple at, say, index 3 on a page.
> 2. elog after making the XLOG entry, but before WriteBuffer.
> 3. page is later discarded from shared buffers; since it's not
>    marked dirty, it'll just be dropped without writing it.
> 4. Later we need to insert another tuple in same table, and
>    we again choose index 3 on this page as the place to put it.
> 5. system crash leads to replay from WAL.
> Now we'll have two different WAL records trying to insert tuple 3.
> Not good.
>

It may be not good but not harmful either. On step2, the transaction will
abort and leave a page that has been changed but not marked dirty. There are
two situtations could happen after that. One is step 3, the other is the
page is still in the buffer pool and another transaction will write on it
(no problem, the tuple slot is already marked used). For step 3, yes, we
will see two WAL records trying to insert to the same tuple slot, but the
2nd one will cover the 1st one -- no problem. If the 2nd one will not cover
the 1st one (say that WAL record is broken), also no prolbem since the tuple
header will gaurantee that tuple is invisible. Can you give an example that
this will lead data corruption?

Regards,
Qingqing




Re: WAL dirty-buffer management bug

From
Simon Riggs
Date:
On Thu, 2006-03-30 at 13:51 -0500, Tom Lane wrote:

> This is pretty much what heapam and btree currently do, but on looking
> at it I think it's got a problem: we really ought to mark the buffer
> dirty before releasing the critical section.  Otherwise, if there's an
> elog(ERROR) before the WriteBuffer call is reached, the backend would go
> on about its business, and we'd have changes in a disk buffer that isn't
> marked dirty.  The changes would be uncommitted, presumably, because of
> the error --- but nonetheless this could result in inconsistency down
> the road.  One example scenario is:
>     1. We insert a tuple at, say, index 3 on a page.
>     2. elog after making the XLOG entry, but before WriteBuffer.
>     3. page is later discarded from shared buffers; since it's not
>        marked dirty, it'll just be dropped without writing it.
>     4. Later we need to insert another tuple in same table, and
>        we again choose index 3 on this page as the place to put it.
>     5. system crash leads to replay from WAL.
> Now we'll have two different WAL records trying to insert tuple 3.
> Not good.

Agreed.

Problem for indexes only. heap xlrecs don't specify exact insert points
so they'd replay just fine even if they were not originally inserted
like that.

> I'm thinking we should change the code and the README to specify that
> you must mark the buffer dirty before you can END_CRIT_SECTION().
> Comments?

Couple thoughts...

Should we just do this for indexes only? (Or any structure that requires
an exact physical position to be recorded in WAL).

Accesses to local buffers don't need to be critical sections either.

Best Regards, Simon Riggs



Re: WAL dirty-buffer management bug

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Problem for indexes only. heap xlrecs don't specify exact insert points

Sure they do.  They had better, else (for example) the associated index
insertions will be wrong.

> Accesses to local buffers don't need to be critical sections either.

True, but in most places it would uglify the code quite a bit to make it
like that, because START/END_CRIT_SECTION are bracketing code that is
shared between both cases.  And I'm not seeing where we'd get any
particular reliability gain from it: if the code has any significant
risk of elog'ing during the critical section, it's broken anyway ...
        regards, tom lane


Re: WAL dirty-buffer management bug

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> It may be not good but not harmful either. On step2, the transaction will
> abort and leave a page that has been changed but not marked dirty. There are
> two situtations could happen after that. One is step 3, the other is the
> page is still in the buffer pool and another transaction will write on it
> (no problem, the tuple slot is already marked used). For step 3, yes, we
> will see two WAL records trying to insert to the same tuple slot, but the
> 2nd one will cover the 1st one -- no problem.

Well, no, see the code in PageAddItem:
               if (ItemIdIsUsed(itemId) || ItemIdGetLength(itemId) != 0)               {
elog(WARNING,"will not overwrite a used ItemId");                   return InvalidOffsetNumber;               }
 

So during WAL replay the second insert will fail, leading to
       elog(PANIC, "heap_insert_redo: failed to add tuple");

Removing that error check in PageAddItem doesn't strike me as a good
idea, either ;-)
        regards, tom lane


Re: WAL dirty-buffer management bug

From
Simon Riggs
Date:
On Fri, 2006-03-31 at 09:36 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Problem for indexes only. heap xlrecs don't specify exact insert points
> 
> Sure they do.  They had better, else (for example) the associated index
> insertions will be wrong.

Yep, you're right.

Best Regards, Simon Riggs



Re: WAL dirty-buffer management bug

From
Tom Lane
Date:
I wrote:
> I'm thinking we should change the code and the README to specify that
> you must mark the buffer dirty before you can END_CRIT_SECTION().

While looking at this I realized that in fact we need to, and do,
mark the buffer dirty even earlier than that: look at bufmgr.c
LockBuffer and SyncOneBuffer comments.  LockBuffer is marking the
buffer dirty before we even start the critical section.  Because of
that, there is no actual bug here at present.  But what we've got is
confusing, klugy, inefficient code (it's inefficient because it
sometimes marks buffers dirty without changing them).

I think it's time to clean this up.  The correct sequence of operations
is really
pin and lock buffer(s)
START_CRIT_SECTION()
apply change to buffer(s), and mark them dirty
emit XLOG record
END_CRIT_SECTION()
unlock and unpin buffer(s)

I think we ought to rename WriteNoReleaseBuffer to MarkBufferDirty
to convey its true function, and use it in the third step of this
sequence.  We could get rid of the klugy force-dirty in LockBuffer,
and get rid of the poorly named WriteBuffer altogether --- the unpin
operation would now always just be ReleaseBuffer.

Any objections?
        regards, tom lane