Thread: WAL dirty-buffer management bug
I added some notes to src/backend/access/transam/README explaining the protocol for executing a WAL-logged action: --- 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) to be modified. 2. START_CRIT_SECTION() (Any error during the next two steps must cause a PANIC because the shared buffers will contain unlogged changes, which we have to ensure don't get to disk. Obviously, you should check conditions such as whether there's enough free space on the page before you start the critical section.) 3. Apply the required changes to the shared buffer(s). 4. Build a WAL log record and pass it to XLogInsert(); then update the page's LSN and TLI using the returned XLOG location. For instance, recptr = XLogInsert(rmgr_id, info, rdata); PageSetLSN(dp, recptr); PageSetTLI(dp, ThisTimeLineID); 5. END_CRIT_SECTION() 6. Unlock and write the buffer(s): LockBuffer(buffer, BUFFER_LOCK_UNLOCK); WriteBuffer(buffer); (Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it dirty and unpins it. The write will not happen until a checkpoint occurs or the shared buffer is needed for another page.) --- This is pretty much what heapam and btree currently do, but on looking at it I think it's got a problem: we really ought to mark the buffer dirty before releasing the critical section. Otherwise, if there's an elog(ERROR) before the WriteBuffer call is reached, the backend would go on about its business, and we'd have changes in a disk buffer that isn't marked dirty. The changes would be uncommitted, presumably, because of the error --- but nonetheless this could result in inconsistency down the road. One example scenario is:1. We insert a tuple at, say, index 3 on a page.2. elog after making the XLOG entry, butbefore WriteBuffer.3. page is later discarded from shared buffers; since it's not marked dirty, it'll just be droppedwithout writing it.4. Later we need to insert another tuple in same table, and we again choose index 3 on this pageas the place to put it.5. system crash leads to replay from WAL. Now we'll have two different WAL records trying to insert tuple 3. Not good. While there's not much chance of an elog in just LockBuffer/WriteBuffer, some of the code paths in btree do quite a bit of stuff before they write the buffer (this was probably done to avoid multiple entries to bufmgr, back when the BufMgrLock was a severe source of contention). So I think there's a nontrivial risk of index corruption from this coding practice. I'm thinking we should change the code and the README to specify that you must mark the buffer dirty before you can END_CRIT_SECTION(). Comments? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > This is pretty much what heapam and btree currently do, but on looking > at it I think it's got a problem: we really ought to mark the buffer > dirty before releasing the critical section. Otherwise, if there's an > elog(ERROR) before the WriteBuffer call is reached, the backend would go > on about its business, and we'd have changes in a disk buffer that isn't > marked dirty. The changes would be uncommitted, presumably, because of > the error --- but nonetheless this could result in inconsistency down > the road. One example scenario is: > 1. We insert a tuple at, say, index 3 on a page. > 2. elog after making the XLOG entry, but before WriteBuffer. > 3. page is later discarded from shared buffers; since it's not > marked dirty, it'll just be dropped without writing it. > 4. Later we need to insert another tuple in same table, and > we again choose index 3 on this page as the place to put it. > 5. system crash leads to replay from WAL. > Now we'll have two different WAL records trying to insert tuple 3. > Not good. > It may be not good but not harmful either. On step2, the transaction will abort and leave a page that has been changed but not marked dirty. There are two situtations could happen after that. One is step 3, the other is the page is still in the buffer pool and another transaction will write on it (no problem, the tuple slot is already marked used). For step 3, yes, we will see two WAL records trying to insert to the same tuple slot, but the 2nd one will cover the 1st one -- no problem. If the 2nd one will not cover the 1st one (say that WAL record is broken), also no prolbem since the tuple header will gaurantee that tuple is invisible. Can you give an example that this will lead data corruption? Regards, Qingqing
On Thu, 2006-03-30 at 13:51 -0500, Tom Lane wrote: > This is pretty much what heapam and btree currently do, but on looking > at it I think it's got a problem: we really ought to mark the buffer > dirty before releasing the critical section. Otherwise, if there's an > elog(ERROR) before the WriteBuffer call is reached, the backend would go > on about its business, and we'd have changes in a disk buffer that isn't > marked dirty. The changes would be uncommitted, presumably, because of > the error --- but nonetheless this could result in inconsistency down > the road. One example scenario is: > 1. We insert a tuple at, say, index 3 on a page. > 2. elog after making the XLOG entry, but before WriteBuffer. > 3. page is later discarded from shared buffers; since it's not > marked dirty, it'll just be dropped without writing it. > 4. Later we need to insert another tuple in same table, and > we again choose index 3 on this page as the place to put it. > 5. system crash leads to replay from WAL. > Now we'll have two different WAL records trying to insert tuple 3. > Not good. Agreed. Problem for indexes only. heap xlrecs don't specify exact insert points so they'd replay just fine even if they were not originally inserted like that. > I'm thinking we should change the code and the README to specify that > you must mark the buffer dirty before you can END_CRIT_SECTION(). > Comments? Couple thoughts... Should we just do this for indexes only? (Or any structure that requires an exact physical position to be recorded in WAL). Accesses to local buffers don't need to be critical sections either. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > Problem for indexes only. heap xlrecs don't specify exact insert points Sure they do. They had better, else (for example) the associated index insertions will be wrong. > Accesses to local buffers don't need to be critical sections either. True, but in most places it would uglify the code quite a bit to make it like that, because START/END_CRIT_SECTION are bracketing code that is shared between both cases. And I'm not seeing where we'd get any particular reliability gain from it: if the code has any significant risk of elog'ing during the critical section, it's broken anyway ... regards, tom lane
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > It may be not good but not harmful either. On step2, the transaction will > abort and leave a page that has been changed but not marked dirty. There are > two situtations could happen after that. One is step 3, the other is the > page is still in the buffer pool and another transaction will write on it > (no problem, the tuple slot is already marked used). For step 3, yes, we > will see two WAL records trying to insert to the same tuple slot, but the > 2nd one will cover the 1st one -- no problem. Well, no, see the code in PageAddItem: if (ItemIdIsUsed(itemId) || ItemIdGetLength(itemId) != 0) { elog(WARNING,"will not overwrite a used ItemId"); return InvalidOffsetNumber; } So during WAL replay the second insert will fail, leading to elog(PANIC, "heap_insert_redo: failed to add tuple"); Removing that error check in PageAddItem doesn't strike me as a good idea, either ;-) regards, tom lane
On Fri, 2006-03-31 at 09:36 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Problem for indexes only. heap xlrecs don't specify exact insert points > > Sure they do. They had better, else (for example) the associated index > insertions will be wrong. Yep, you're right. Best Regards, Simon Riggs
I wrote: > I'm thinking we should change the code and the README to specify that > you must mark the buffer dirty before you can END_CRIT_SECTION(). While looking at this I realized that in fact we need to, and do, mark the buffer dirty even earlier than that: look at bufmgr.c LockBuffer and SyncOneBuffer comments. LockBuffer is marking the buffer dirty before we even start the critical section. Because of that, there is no actual bug here at present. But what we've got is confusing, klugy, inefficient code (it's inefficient because it sometimes marks buffers dirty without changing them). I think it's time to clean this up. The correct sequence of operations is really pin and lock buffer(s) START_CRIT_SECTION() apply change to buffer(s), and mark them dirty emit XLOG record END_CRIT_SECTION() unlock and unpin buffer(s) I think we ought to rename WriteNoReleaseBuffer to MarkBufferDirty to convey its true function, and use it in the third step of this sequence. We could get rid of the klugy force-dirty in LockBuffer, and get rid of the poorly named WriteBuffer altogether --- the unpin operation would now always just be ReleaseBuffer. Any objections? regards, tom lane