Thread: Theory about XLogFlush startup failures
I just spent some time trying to understand the mechanism behind the
"XLogFlush: request is not satisfied" startup errors we've seen reported
occasionally with 7.1.  The only apparent way for this to happen is for
XLogFlush to be given a garbage WAL record pointer (ie, one pointing
beyond the current end of WAL), which presumably must be coming from
a corrupted LSN field in a data page.

Well, that's not too hard to believe during normal operation: say the
disk drive drops some bits in the LSN field, and we read the page in,
and don't have any immediate need to change it (which would cause the
LSN to be overwritten); but we do find some transaction status hint
bits to set, so the page gets marked dirty.  Then when the page is
written out, bufmgr will try to flush xlog using the corrupted LSN
pointer.

However, what about the cases where this error occurs during WAL
recovery?  As far as I can see, all pages that WAL recovery touches
will be marked with a valid LSN pointing at the WAL record that caused
them to be touched.  So the bufmgr flush operations done during
recovery can't see a bad LSN even if the disk dropped some bits.

With one exception.  In 7.1, pages of pg_log are held by the buffer
manager but we never touch their LSN fields.  Ordinarily the LSNs are
always zero and don't cause any flush problems.  But what if pg_log
becomes corrupt?  If there's a sufficiently large garbage value in the
LSN of a pg_log page that WAL recovery tries to set commit status in,
then bingo: we'll get "request is not satisfied" every time, because
the recovery process will always try to flush that page without fixing
its LSN.

So the failure-to-start-up problem can be blamed entirely on 7.1's
failure to do anything with LSN fields in pg_log pages.  I was able to
get some experimental confirmation of this train of thought when I
looked at Jeff Lu's files (he's the most recent reporter of failure to
start up with this message).  Sure enough, his pg_log contains garbage.

If that's true, then the startup problem should be gone in 7.2, since
7.2 doesn't use the buffer manager to access CLOG pages and doesn't
expect CLOG pages to have an LSN.  However we could still see failures
during normal operation due to dropped bits on disk.  So I am still
dissatisfied with doing elog(STOP) for this condition, as I regard it
as an overly strong reaction to corrupted data; moreover, it does
nothing to fix the problem and indeed gets in the way of fixing the
problem.

I propose the attached patch.
What do you think?

			regards, tom lane

*** src/backend/access/transam/xlog.c.orig	Fri Dec 28 13:16:41 2001
--- src/backend/access/transam/xlog.c	Sat Jan 12 13:24:20 2002
***************
*** 1262,1276 ****
  			WriteRqst.Write = WriteRqstPtr;
  			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
- 			if (XLByteLT(LogwrtResult.Flush, record))
- 				elog(STOP, "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
- 					 record.xlogid, record.xrecoff,
- 					 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  		}
  		LWLockRelease(WALWriteLock);
  	}

  	END_CRIT_SECTION();
  }

  /*
--- 1262,1301 ----
  			WriteRqst.Write = WriteRqstPtr;
  			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);
  	}

  	END_CRIT_SECTION();

+ 	/*
+ 	 * If we still haven't flushed to the request point then we have a
+ 	 * problem; most likely, the requested flush point is past end of XLOG.
+ 	 * This has been seen to occur when a disk page has a corrupted LSN.
+ 	 *
+ 	 * Formerly we treated this as a STOP condition, but that hurts the
+ 	 * system's robustness rather than helping it: we do not want to take
+ 	 * down the whole system due to corruption on one data page.  In
+ 	 * particular, if the bad page is encountered again during recovery then
+ 	 * we would be unable to restart the database at all!  (This scenario
+ 	 * has actually happened in the field several times with 7.1 releases.
+ 	 * Note that we cannot get here while InRedo is true, but if the bad
+ 	 * page is brought in and marked dirty during recovery then
+ 	 * CreateCheckpoint will try to flush it at the end of recovery.)
+ 	 *
+ 	 * The current approach is to ERROR under normal conditions, but only
+ 	 * NOTICE during recovery, so that the system can be brought up even if
+ 	 * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR
+ 	 * will be promoted to STOP since xact.c calls this routine inside a
+ 	 * critical section.  However, calls from bufmgr.c are not within
+ 	 * critical sections and so we will not force a restart for a bad LSN
+ 	 * on a data page.
+ 	 */
+ 	if (XLByteLT(LogwrtResult.Flush, record))
+ 		elog(InRecovery ? NOTICE : ERROR,
+ 			 "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
+ 			 record.xlogid, record.xrecoff,
+ 			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  }

  /*
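For readers who haven't been into bufmgr.c: the write-out path described
above boils down to the sketch below.  This is a condensed illustration,
not the actual 7.1/7.2 source; the helper name write_dirty_page is
invented (the real routine is FlushBuffer), and the PageGetLSN accessor
is assumed, while XLogFlush is the real call in question.  It shows why a
garbage pd_lsn turns directly into a flush request beyond the end of WAL.

#include "postgres.h"
#include "access/xlog.h"		/* XLogRecPtr, XLogFlush() */
#include "storage/bufpage.h"	/* Page, page header accessors */

/* Condensed illustration of the buffer write-out path; cf. FlushBuffer(). */
static void
write_dirty_page(Page page)
{
	XLogRecPtr	recptr = PageGetLSN(page);	/* pd_lsn straight off the page */

	/*
	 * WAL rule: the log must be flushed at least up to the page's LSN
	 * before the page itself may go to disk.  If pd_lsn is corrupted and
	 * points past the current end of WAL, this call is where
	 * "XLogFlush: request is not satisfied" used to elog(STOP).
	 */
	XLogFlush(recptr);

	/* ... then hand the page to the storage manager for the actual write ... */
}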
Tom Lane wrote:
>
> I just spent some time trying to understand the mechanism behind the
> "XLogFlush: request is not satisfied" startup errors we've seen reported
> occasionally with 7.1.  The only apparent way for this to happen is for
> XLogFlush to be given a garbage WAL record pointer (ie, one pointing
> beyond the current end of WAL), which presumably must be coming from
> a corrupted LSN field in a data page.  Well, that's not too hard to
> believe during normal operation: say the disk drive drops some bits in
> the LSN field, and we read the page in, and don't have any immediate
> need to change it (which would cause the LSN to be overwritten); but we
> do find some transaction status hint bits to set, so the page gets
> marked dirty.  Then when the page is written out, bufmgr will try to
> flush xlog using the corrupted LSN pointer.

I agree with you, at least on the point that we had better continue
FlushBufferPool() even if a STOP error occurs.

BTW, doesn't the LSN corruption imply the possibility of corruption in
other parts (of e.g. pg_log)?

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> BTW doesn't the LSN corruption imply the possibility
> of the corruption of other parts (of e.g. pg_log) ?

Indeed.  Not sure what we can do about that.

In the case I examined (Jeff Lu's recent problem), pg_log appeared
perfectly valid up through the end of the page containing the last
transaction ID recorded in pg_control.  However, this ID was close to
the end of the page, and WAL entries contained XIDs reaching into the
next page of pg_log.  That page contained complete garbage.  Even more
interesting, there was about 400K of complete garbage beyond that page,
in pages that Postgres should never have touched at all.  (This seemed
like a lot, considering the valid part of pg_log was less than 200K.)

My bet is that the garbaged pages were there before Postgres got to
them.  Both normal operation and WAL recovery would've died at the
first attempt to write out the first garbage page, because of its bad
LSN.  Also, AFAICT 7.1 and before contained no explicit code to zero a
newly used pg_log page (it relied on the smgr to fill in zeroes when
reading beyond EOF); nor did the pg_log updating code stop to notice
whether the transaction status bits it was about to overwrite looked
sane.  So there would've been no notice before trying to write the
garbage page back out.  (These last two holes, at least, are plugged in
7.2.  But if the OS gives us back a page of garbage instead of the data
we wrote out, how well can we be expected to survive that?)

Since Jeff was running on a Cygwin/Win2K setup, I'm quite happy to
write this off as an OS hiccup, unless someone can think of a mechanism
inside Postgres that could have provoked it.

			regards, tom lane
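To see why the damage shows up exactly at a pg_log page boundary, recall
that pg_log keeps two status bits per transaction, so where a given XID's
status lives is simple arithmetic.  The sketch below is for illustration
only: locate_xid_status is a made-up name, the constants assume the stock
8K BLCKSZ, and the exact bit layout in 7.1's transam.c may differ in
detail.

#include "postgres.h"

#define XIDS_PER_BYTE	4							/* 2 status bits per XID */
#define XIDS_PER_PAGE	(BLCKSZ * XIDS_PER_BYTE)	/* 32768 with BLCKSZ = 8192 */

/* Illustration only: which pg_log page/byte/bit pair covers a given XID. */
static void
locate_xid_status(TransactionId xid)
{
	uint32		pageno = xid / XIDS_PER_PAGE;
	int			byteno = (xid % XIDS_PER_PAGE) / XIDS_PER_BYTE;
	int			bshift = (xid % XIDS_PER_BYTE) * 2;

	elog(NOTICE, "xid %u: pg_log page %u, byte %d, bits %d-%d",
		 xid, pageno, byteno, bshift, bshift + 1);
}

So a last-checkpointed XID sitting within a few thousand transactions of
a 32768-XID boundary is enough to make recovery touch the following
pg_log page, which in Jeff's case was pure garbage.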
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> Indeed. Not sure what we can do about that.

> One thing I can think of is to prevent a corrupted page
> from spoiling other pages by jumping the page boundary
> in the buffer pool.

We do that already, no?

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > BTW doesn't the LSN corruption imply the possibility
> > of the corruption of other parts (of e.g. pg_log) ?
>
> Indeed. Not sure what we can do about that.

One thing I can think of is to prevent a corrupted page from spoiling
other pages by jumping the page boundary in the buffer pool.

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> One thing I can think of is to prevent a corrupted page
> from spoiling other pages by jumping the page boundary
> in the buffer pool.
>>
>> We do that already, no?

> Oh I may be missing something.
> Where is it checked ?

I know PageRepairFragmentation is real paranoid about this, because I
made it so recently.  I suppose it might be worth adding some more
sanity checks to PageAddItem, maybe PageZero (is that ever called on a
pre-existing page?), and PageIndexTupleDelete.  Seems like that should
about cover it --- noplace else inserts items on disk pages or
reshuffles disk page contents, AFAIK.

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Tom Lane wrote:
> >> Indeed. Not sure what we can do about that.
>
> > One thing I can think of is to prevent a corrupted page
> > from spoiling other pages by jumping the page boundary
> > in the buffer pool.
>
> We do that already, no?

Oh I may be missing something.
Where is it checked ?

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> I know PageRepairFragmentation is real paranoid about this, because I
>> made it so recently.  I suppose it might be worth adding some more
>> sanity checks to PageAddItem, maybe PageZero (is that ever called on a
>> pre-existing page?), and PageIndexTupleDelete.  Seems like that should
>> about cover it --- noplace else inserts items on disk pages or
>> reshuffles disk page contents, AFAIK.

> What about PageGetItem ?  It seems to be able to touch the item
> via HeapTupleSatisfies etc.

Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
bunch of new tests to PageGetItem --- that is much more of a hot spot
than PageAddItem, and it'll cost us something in speed I fear.

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > One thing I can think of is to prevent a corrupted page
> > from spoiling other pages by jumping the page boundary
> > in the buffer pool.
> >>
> >> We do that already, no?
>
> > Oh I may be missing something.
> > Where is it checked ?
>
> I know PageRepairFragmentation is real paranoid about this, because I
> made it so recently.  I suppose it might be worth adding some more
> sanity checks to PageAddItem, maybe PageZero (is that ever called on a
> pre-existing page?), and PageIndexTupleDelete.  Seems like that should
> about cover it --- noplace else inserts items on disk pages or
> reshuffles disk page contents, AFAIK.

What about PageGetItem ?  It seems to be able to touch the item
via HeapTupleSatisfies etc.

regards, Hiroshi Inoue
I said:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> What about PageGetItem ?  It seems to be able to touch the item
>> via HeapTupleSatisfies etc.

> Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
> bunch of new tests to PageGetItem --- that is much more of a hot spot
> than PageAddItem, and it'll cost us something in speed I fear.

I wasn't totally comfortable with this (and I'm sure you weren't
either), but after more thought I still feel it's the right tradeoff.
Here are a couple of heuristic arguments why we don't need more error
checking in PageGetItem:

1. tqual.c won't ever try to set tuple status bits until it's checked
t_xmin or t_xmax against TransactionIdDidCommit/DidAbort.  If messed-up
page headers have caused us to compute a bogus item pointer, one would
expect that a more-or-less-random transaction ID will be passed to
TransactionIdDidCommit/DidAbort.  Now in 7.2, it's unlikely that more
than about 2 segments (2 million transactions' worth) of CLOG will
exist at any instant, so the odds that asking about a random XID will
produce an answer and not elog(STOP) are less than 1 in 1000.

2. If this happened in the field, the signature would be one or two
bits set apparently at random in an otherwise-okay page.  In the data
corruption cases I've been able to examine personally, I can't recall
ever having seen such a case.  The usual form of corruption is dozens
of consecutive bytes worth of garbage overlaying part of an
otherwise-valid page.  While I tend to blame such stuff on hardware
glitches (especially when the damage is aligned on power-of-2 byte
boundaries), it's certainly possible that it comes from a misdirected
memcpy, which is why I think it's a good idea to introduce more bounds
checking in PageAddItem and so forth.

If we start to see failure reports that look like they might have been
caused by tqual.c let loose on the wrong bits, we can certainly revisit
this decision.  But right now I think that adding more checks in
PageGetItem would waste a lot of cycles to little purpose.

BTW, to close the loop back to the original topic: I think it's quite
likely that some of the elog(STOP)s in clog.c will need to be reduced
to lesser error levels once we see what sorts of problems arise in the
field, just as we found that this particular elog(STOP) in xlog.c was
overkill.  But I want to wait and see which ones cause problems before
backing off the error severity.

I will go and add a few more sanity checks to bufpage.c.

			regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> I said:
> > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >> What about PageGetItem ?  It seems to be able to touch the item
> >> via HeapTupleSatisfies etc.
>
> > Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
> > bunch of new tests to PageGetItem --- that is much more of a hot spot
> > than PageAddItem, and it'll cost us something in speed I fear.
>
> I wasn't totally comfortable with this (and I'm sure you weren't
> either), but after more thought I still feel it's the right tradeoff.
> Here are a couple of heuristic arguments why we don't need more error
> checking in PageGetItem:

What I have in mind is, e.g., the following case.  Undoubtedly the page
is corrupted (too big an offset number).  I'm suspicious whether other
pages are safe under such a situation.

regards, Hiroshi Inoue

A part of the report "[GENERAL] Database corruption?" by Alvaro Herrera
[alvherre@atentus.com]:

> DEBUG: --Relation delay_171--
> NOTICE: Rel delay_171: TID 15502/4279: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4291: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4315: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4375: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4723: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4771: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4783: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4831: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4843: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4867: InsertTransactionInProgress 0 - can't shrink relation
> NOTICE: Rel delay_171: TID 15502/4867: OID IS INVALID. TUPGONE 0.
> [a lot of similarly looking lines]
> NOTICE: Rel delay_171: TID 15502/6067: OID IS INVALID. TUPGONE 0.
> Server process (pid 22773) exited with status 139 at Sun Oct 21 02:30:27 2001
> Terminating any active server processes...
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > What I have minded is e.g. the following case. > Undoubtedly the page is corrupted(too big offset number). > I'm suspicious if other pages are safe under such a situation. You have a point, but I still don't like slowing down PageGetItem. How about this instead: whenever we read in a page, check to see if its page header data is sane. We could do this right after the smgrread call in ReadBufferInternal, and follow the "status = SM_FAIL" exit path if we see trouble. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I have in mind is, e.g., the following case.
> > Undoubtedly the page is corrupted (too big an offset number).
> > I'm suspicious whether other pages are safe under such a situation.
>
> You have a point, but I still don't like slowing down PageGetItem.
>
> How about this instead: whenever we read in a page, check to see
> if its page header data is sane.  We could do this right after the
> smgrread call in ReadBufferInternal, and follow the "status = SM_FAIL"
> exit path if we see trouble.

Agreed.  What we really want is not to see such troubles in the first
place.

regards, Hiroshi Inoue
> So the failure-to-start-up problem can be blamed entirely on 7.1's
> failure to do anything with LSN fields in pg_log pages.  I was able to

No, the first reported problem can be blamed on RAM failures.

> So I am still dissatisfied with doing elog(STOP) for this condition,
> as I regard it as an overly strong reaction to corrupted data;
> moreover, it does nothing to fix the problem and indeed gets in
> the way of fixing the problem.

Totally agreed, but...

> I propose the attached patch.
> What do you think?
> ...
> +	if (XLByteLT(LogwrtResult.Flush, record))
> +		elog(InRecovery ? NOTICE : ERROR,

I suggest also setting some flag here if InRecovery, and then, after all
data buffers are flushed, doing elog(STOP, "DATA FILE(S) CORRUPTED!
RESTORE DATA FROM BACKUP OR RESET WAL TO DUMP/MANUALLY FIX ERRORS") -
or something like that -:)

What's wrong with this?  It's not OK to automatically restart knowing
about errors in the data.

Vadim
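A rough sketch of the suggestion above, just to pin the idea down.  These
are patch-style fragments, not a worked-out change: the flag name and the
place it would be tested at the end of recovery are assumptions.

/* Somewhere in xlog.c: remember that recovery hit an unsatisfiable flush. */
static bool recovery_saw_bad_lsn = false;

/* In XLogFlush(), where the proposed patch issues NOTICE vs. ERROR: */
	if (XLByteLT(LogwrtResult.Flush, record))
	{
		if (InRecovery)
			recovery_saw_bad_lsn = true;	/* note it, but keep going */
		elog(InRecovery ? NOTICE : ERROR,
			 "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
			 record.xlogid, record.xrecoff,
			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
	}

/* And at the end of recovery, after all data buffers have been flushed: */
	if (recovery_saw_bad_lsn)
		elog(STOP, "data file(s) corrupted: restore from backup, or reset WAL "
			 "and dump / manually fix the damaged pages");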
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> So I am still dissatisfied with doing elog(STOP) for this condition, >> as I regard it as an overly strong reaction to corrupted data; >> moreover, it does nothing to fix the problem and indeed gets in >> the way of fixing the problem. > ... It's not Ok automatically restart > knowing about errors in data. Actually, I disagree. If we come across clearly corrupt data values (eg, bad length word for a varlena item, or even tuple-header errors such as a bad XID), we do not try to force the admin to restore the database from backup, do we? A bogus LSN is bad, certainly, but it is not the end of the world and does not deserve a panic reaction. At worst it tells us that one data page is corrupt. A robust system should report that and keep plugging. What would be actually useful here is to report which page contains the bad LSN, so that the admin could look at it and decide what to do. xlog.c doesn't know that, unfortunately. I'd be more interested in expending work to make that happen than in expending work to make a dbadmin's life more difficult --- and I rank forced stops in the latter category. regards, tom lane
...
> > ... It's not OK to automatically restart
> > knowing about errors in the data.
...
> At worst it tells us that one data page is corrupt.  A robust system
> should report that and keep plugging.

Hmm.  I'm not sure that this needs an "either-or" resolution on the
general topic of error recovery.

Back when I used Ingres, it had the feature that corruption would mark
the database as "readonly" (a mode I'd like us to have -- even without
errors -- to help support upgrades, updates, and error handling).  So an
administrator could evaluate the damage without causing further damage,
but could still allow users to rummage through the database at the same
time.

I have a hard time believing that we should *always* allow the database
to keep writing in the face of *any* detected error.  I'm sure that is
not what Tom is saying, but in this case could further damage be caused
by subsequent writing when we *already* know that there is some
corruption?  If so, we should consider supporting some sort of error
state that prevents further damage.

Vadim's solution uses the only current mechanism available, which is to
force the database to shut down until it can be evaluated.  But if we
had some stronger mechanisms to support limited operation, that might
help in this case and would certainly help in other situations too.

                     - Thomas
Thomas Lockhart <lockhart@fourpalms.org> writes:
> If so, we should consider supporting some sort of error
> state that prevents further damage.

This seems reasonable (though I'd still question whether a bad LSN is
sufficient reason to force the whole database into read-only mode).

> Vadim's solution uses the only
> current mechanism available, which is to force the database to shut down
> until it can be evaluated.

But one of the big problems with his solution is that it gets in the
way of evaluating the problem.  A read-only mode seems like a better
way.

			regards, tom lane