Theory about XLogFlush startup failures - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Theory about XLogFlush startup failures |
Date | |
Msg-id | 21344.1010868390@sss.pgh.pa.us Whole thread Raw |
List | pgsql-hackers |
I just spent some time trying to understand the mechanism behind the "XLogFlush: request is not satisfied" startup errors we've seen reported occasionally with 7.1. The only apparent way for this to happen is for XLogFlush to be given a garbage WAL record pointer (ie, one pointing beyond the current end of WAL), which presumably must be coming from a corrupted LSN field in a data page. Well, that's not too hard to believe during normal operation: say the disk drive drops some bits in the LSN field, and we read the page in, and don't have any immediate need to change it (which would cause the LSN to be overwritten); but we do find some transaction status hint bits to set, so the page gets marked dirty. Then when the page is written out, bufmgr will try to flush xlog using the corrupted LSN pointer. However, what about the cases where this error occurs during WAL recovery? As far as I can see, all pages that WAL recovery touches will be marked with a valid LSN pointing at the WAL record that caused them to be touched. So the bufmgr flush operations done during recovery can't see a bad LSN even if the disk dropped some bits. With one exception. In 7.1, pages of pg_log are held by the buffer manager but we never touch their LSN fields. Ordinarily the LSNs are always zero and don't cause any flush problems. But what if pg_log becomes corrupt? If there's a sufficiently large garbage value in the LSN of a pg_log page that WAL recovery tries to set commit status in, then bingo: we'll get "request is not satisfied" every time, because the recovery process will always try to flush that page without fixing its LSN. So the failure-to-start-up problem can be blamed entirely on 7.1's failure to do anything with LSN fields in pg_log pages. I was able to get some experimental confirmation of this train of thought when I looked at Jeff Lu's files (he's the most recent reporter of failure to start up with this message). Sure enough, his pg_log contains garbage. If that's true, then the startup problem should be gone in 7.2, since 7.2 doesn't use the buffer manager to access CLOG pages and doesn't expect CLOG pages to have an LSN. However we could still see failures during normal operation due to dropped bits on disk. So I am still dissatisfied with doing elog(STOP) for this condition, as I regard it as an overly strong reaction to corrupted data; moreover, it does nothing to fix the problem and indeed gets in the way of fixing the problem. I propose the attached patch. What do you think? regards, tom lane *** src/backend/access/transam/xlog.c.orig Fri Dec 28 13:16:41 2001 --- src/backend/access/transam/xlog.c Sat Jan 12 13:24:20 2002 *************** *** 1262,1276 **** WriteRqst.Write = WriteRqstPtr; WriteRqst.Flush = record; XLogWrite(WriteRqst); - if (XLByteLT(LogwrtResult.Flush, record)) - elog(STOP, "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X", - record.xlogid, record.xrecoff, - LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff); } LWLockRelease(WALWriteLock); } END_CRIT_SECTION(); } /* --- 1262,1301 ---- WriteRqst.Write = WriteRqstPtr; WriteRqst.Flush = record; XLogWrite(WriteRqst); } LWLockRelease(WALWriteLock); } END_CRIT_SECTION(); + + /* + * If we still haven't flushed to the request point then we have a + * problem; most likely, the requested flush point is past end of XLOG. + * This has been seen to occur when a disk page has a corrupted LSN. + * + * Formerly we treated this as a STOP condition, but that hurts the + * system's robustness rather than helping it: we do not want to take + * down the whole system due to corruption on one data page. In + * particular, if the bad page is encountered again during recovery then + * we would be unable to restart the database at all! (This scenario + * has actually happened in the field several times with 7.1 releases. + * Note that we cannot get here while InRedo is true, but if the bad + * page is brought in and marked dirty during recovery then + * CreateCheckpoint will try to flush it at the end of recovery.) + * + * The current approach is to ERROR under normal conditions, but only + * NOTICE during recovery, so that the system can be brought up even if + * there's a corrupt LSN. Note that for calls from xact.c, the ERROR + * will be promoted to STOP since xact.c calls this routine inside a + * critical section. However, calls from bufmgr.c are not within + * critical sections and so we will not force a restart for a bad LSN + * on a data page. + */ + if (XLByteLT(LogwrtResult.Flush, record)) + elog(InRecovery ? NOTICE : ERROR, + "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X", + record.xlogid, record.xrecoff, + LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff); } /*
pgsql-hackers by date: