Theory about XLogFlush startup failures - Mailing list pgsql-hackers

From Tom Lane
Subject Theory about XLogFlush startup failures
Date
Msg-id 21344.1010868390@sss.pgh.pa.us
List pgsql-hackers
I just spent some time trying to understand the mechanism behind the
"XLogFlush: request is not satisfied" startup errors we've seen reported
occasionally with 7.1.  The only apparent way for this to happen is for
XLogFlush to be given a garbage WAL record pointer (ie, one pointing
beyond the current end of WAL), which presumably must be coming from
a corrupted LSN field in a data page.  Well, that's not too hard to
believe during normal operation: say the disk drive drops some bits in
the LSN field, and we read the page in, and don't have any immediate
need to change it (which would cause the LSN to be overwritten); but we
do find some transaction status hint bits to set, so the page gets
marked dirty.  Then when the page is written out, bufmgr will try to
flush xlog using the corrupted LSN pointer.

However, what about the cases where this error occurs during WAL
recovery?  As far as I can see, all pages that WAL recovery touches will
be marked with a valid LSN pointing at the WAL record that caused them
to be touched.  So the bufmgr flush operations done during recovery
can't see a bad LSN even if the disk dropped some bits.

With one exception.  In 7.1, pages of pg_log are held by the buffer
manager but we never touch their LSN fields.  Ordinarily the LSNs are
always zero and don't cause any flush problems.  But what if pg_log
becomes corrupt?  If there's a sufficiently large garbage value in the
LSN of a pg_log page that WAL recovery tries to set commit status in,
then bingo: we'll get "request is not satisfied" every time, because
the recovery process will always try to flush that page without fixing
its LSN.

So the failure-to-start-up problem can be blamed entirely on 7.1's
failure to do anything with LSN fields in pg_log pages.  I was able to
get some experimental confirmation of this train of thought when I
looked at Jeff Lu's files (he's the most recent reporter of failure
to start up with this message).  Sure enough, his pg_log contains
garbage.

If that's true, then the startup problem should be gone in 7.2, since
7.2 doesn't use the buffer manager to access CLOG pages and doesn't
expect CLOG pages to have an LSN.

However we could still see failures during normal operation due to
dropped bits on disk.  So I am still dissatisfied with doing elog(STOP)
for this condition, as I regard it as an overly strong reaction to
corrupted data; moreover, it does nothing to fix the problem and indeed
gets in the way of fixing the problem.  I propose the attached patch.
What do you think?
        regards, tom lane


*** src/backend/access/transam/xlog.c.orig    Fri Dec 28 13:16:41 2001
--- src/backend/access/transam/xlog.c    Sat Jan 12 13:24:20 2002
***************
*** 1262,1276 ****
              WriteRqst.Write = WriteRqstPtr;
              WriteRqst.Flush = record;
              XLogWrite(WriteRqst);
-             if (XLByteLT(LogwrtResult.Flush, record))
-                 elog(STOP, "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
-                      record.xlogid, record.xrecoff,
-                      LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
          }
          LWLockRelease(WALWriteLock);
      }
  
      END_CRIT_SECTION();
  }
  
  /*
--- 1262,1301 ----
              WriteRqst.Write = WriteRqstPtr;
              WriteRqst.Flush = record;
              XLogWrite(WriteRqst);
          }
          LWLockRelease(WALWriteLock);
      }
  
      END_CRIT_SECTION();
 
+ 
+     /*
+      * If we still haven't flushed to the request point then we have a
+      * problem; most likely, the requested flush point is past end of XLOG.
+      * This has been seen to occur when a disk page has a corrupted LSN.
+      *
+      * Formerly we treated this as a STOP condition, but that hurts the
+      * system's robustness rather than helping it: we do not want to take
+      * down the whole system due to corruption on one data page.  In
+      * particular, if the bad page is encountered again during recovery then
+      * we would be unable to restart the database at all!  (This scenario
+      * has actually happened in the field several times with 7.1 releases.
+      * Note that we cannot get here while InRedo is true, but if the bad
+      * page is brought in and marked dirty during recovery then
+      * CreateCheckpoint will try to flush it at the end of recovery.)
+      *
+      * The current approach is to ERROR under normal conditions, but only
+      * NOTICE during recovery, so that the system can be brought up even if
+      * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR
+      * will be promoted to STOP since xact.c calls this routine inside a
+      * critical section.  However, calls from bufmgr.c are not within
+      * critical sections and so we will not force a restart for a bad LSN
+      * on a data page.
+      */
+     if (XLByteLT(LogwrtResult.Flush, record))
+         elog(InRecovery ? NOTICE : ERROR,
+              "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
+              record.xlogid, record.xrecoff,
+              LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  }
  
  /*

