Thread: Theory about XLogFlush startup failures
I just spent some time trying to understand the mechanism behind the
"XLogFlush: request is not satisfied" startup errors we've seen reported
occasionally with 7.1.  The only apparent way for this to happen is for
XLogFlush to be given a garbage WAL record pointer (ie, one pointing
beyond the current end of WAL), which presumably must be coming from
a corrupted LSN field in a data page.

Well, that's not too hard to believe during normal operation: say the
disk drive drops some bits in the LSN field, and we read the page in,
and don't have any immediate need to change it (which would cause the
LSN to be overwritten); but we do find some transaction status hint
bits to set, so the page gets marked dirty.  Then when the page is
written out, bufmgr will try to flush xlog using the corrupted LSN
pointer.

However, what about the cases where this error occurs during WAL
recovery?  As far as I can see, all pages that WAL recovery touches
will be marked with a valid LSN pointing at the WAL record that caused
them to be touched.  So the bufmgr flush operations done during
recovery can't see a bad LSN even if the disk dropped some bits.

With one exception.  In 7.1, pages of pg_log are held by the buffer
manager but we never touch their LSN fields.  Ordinarily the LSNs are
always zero and don't cause any flush problems.  But what if pg_log
becomes corrupt?  If there's a sufficiently large garbage value in the
LSN of a pg_log page that WAL recovery tries to set commit status in,
then bingo: we'll get "request is not satisfied" every time, because
the recovery process will always try to flush that page without fixing
its LSN.

So the failure-to-start-up problem can be blamed entirely on 7.1's
failure to do anything with LSN fields in pg_log pages.  I was able to
get some experimental confirmation of this train of thought when I
looked at Jeff Lu's files (he's the most recent reporter of failure to
start up with this message).  Sure enough, his pg_log contains garbage.

If that's true, then the startup problem should be gone in 7.2, since
7.2 doesn't use the buffer manager to access CLOG pages and doesn't
expect CLOG pages to have an LSN.  However we could still see failures
during normal operation due to dropped bits on disk.  So I am still
dissatisfied with doing elog(STOP) for this condition, as I regard it
as an overly strong reaction to corrupted data; moreover, it does
nothing to fix the problem and indeed gets in the way of fixing the
problem.

I propose the attached patch.
What do you think?

			regards, tom lane

*** src/backend/access/transam/xlog.c.orig	Fri Dec 28 13:16:41 2001
--- src/backend/access/transam/xlog.c	Sat Jan 12 13:24:20 2002
***************
*** 1262,1276 ****
  			WriteRqst.Write = WriteRqstPtr;
  			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
- 			if (XLByteLT(LogwrtResult.Flush, record))
- 				elog(STOP, "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
- 					 record.xlogid, record.xrecoff,
- 					 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  		}
  		LWLockRelease(WALWriteLock);
  	}

  	END_CRIT_SECTION();
  }

  /*
--- 1262,1301 ----
  			WriteRqst.Write = WriteRqstPtr;
  			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);
  	}

  	END_CRIT_SECTION();

+ 	/*
+ 	 * If we still haven't flushed to the request point then we have a
+ 	 * problem; most likely, the requested flush point is past end of XLOG.
+ 	 * This has been seen to occur when a disk page has a corrupted LSN.
+ 	 *
+ 	 * Formerly we treated this as a STOP condition, but that hurts the
+ 	 * system's robustness rather than helping it: we do not want to take
+ 	 * down the whole system due to corruption on one data page.  In
+ 	 * particular, if the bad page is encountered again during recovery then
+ 	 * we would be unable to restart the database at all!  (This scenario
+ 	 * has actually happened in the field several times with 7.1 releases.
+ 	 * Note that we cannot get here while InRedo is true, but if the bad
+ 	 * page is brought in and marked dirty during recovery then
+ 	 * CreateCheckpoint will try to flush it at the end of recovery.)
+ 	 *
+ 	 * The current approach is to ERROR under normal conditions, but only
+ 	 * NOTICE during recovery, so that the system can be brought up even if
+ 	 * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR
+ 	 * will be promoted to STOP since xact.c calls this routine inside a
+ 	 * critical section.  However, calls from bufmgr.c are not within
+ 	 * critical sections and so we will not force a restart for a bad LSN
+ 	 * on a data page.
+ 	 */
+ 	if (XLByteLT(LogwrtResult.Flush, record))
+ 		elog(InRecovery ? NOTICE : ERROR,
+ 			 "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
+ 			 record.xlogid, record.xrecoff,
+ 			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
  }

  /*
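For readers who haven't been into bufmgr.c: the write-out path described
above boils down to the sketch below.  This is a condensed illustration,
not the actual 7.1/7.2 source; the helper name write_dirty_page is
invented (the real routine is FlushBuffer), and the PageGetLSN accessor
is assumed, while XLogFlush is the real call in question.  It shows why a
garbage pd_lsn turns directly into a flush request beyond the end of WAL.

#include "postgres.h"
#include "access/xlog.h"		/* XLogRecPtr, XLogFlush() */
#include "storage/bufpage.h"	/* Page, page header accessors */

/* Condensed illustration of the buffer write-out path; cf. FlushBuffer(). */
static void
write_dirty_page(Page page)
{
	XLogRecPtr	recptr = PageGetLSN(page);	/* pd_lsn straight off the page */

	/*
	 * WAL rule: the log must be flushed at least up to the page's LSN
	 * before the page itself may go to disk.  If pd_lsn is corrupted and
	 * points past the current end of WAL, this call is where
	 * "XLogFlush: request is not satisfied" used to elog(STOP).
	 */
	XLogFlush(recptr);

	/* ... then hand the page to the storage manager for the actual write ... */
}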
Tom Lane wrote:
>
> I just spent some time trying to understand the mechanism behind the
> "XLogFlush: request is not satisfied" startup errors we've seen reported
> occasionally with 7.1.  The only apparent way for this to happen is for
> XLogFlush to be given a garbage WAL record pointer (ie, one pointing
> beyond the current end of WAL), which presumably must be coming from
> a corrupted LSN field in a data page.  Well, that's not too hard to
> believe during normal operation: say the disk drive drops some bits in
> the LSN field, and we read the page in, and don't have any immediate
> need to change it (which would cause the LSN to be overwritten); but we
> do find some transaction status hint bits to set, so the page gets
> marked dirty.  Then when the page is written out, bufmgr will try to
> flush xlog using the corrupted LSN pointer.

I agree with you, at least on the point that we had better continue
FlushBufferPool() even if a STOP error occurs.

BTW, doesn't the LSN corruption imply the possibility of corruption in
other parts (of e.g. pg_log)?

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> BTW doesn't the LSN corruption imply the possibility
> of the corruption of other parts (of e.g. pg_log) ?

Indeed.  Not sure what we can do about that.

In the case I examined (Jeff Lu's recent problem), pg_log appeared
perfectly valid up through the end of the page containing the last
transaction ID recorded in pg_control.  However, this ID was close to
the end of the page, and WAL entries contained XIDs reaching into the
next page of pg_log.  That page contained complete garbage.  Even more
interesting, there was about 400K of complete garbage beyond that page,
in pages that Postgres should never have touched at all.  (This seemed
like a lot, considering the valid part of pg_log was less than 200K.)

My bet is that the garbaged pages were there before Postgres got to
them.  Both normal operation and WAL recovery would've died at the
first attempt to write out the first garbage page, because of its bad
LSN.  Also, AFAICT 7.1 and before contained no explicit code to zero a
newly used pg_log page (it relied on the smgr to fill in zeroes when
reading beyond EOF); nor did the pg_log updating code stop to notice
whether the transaction status bits it was about to overwrite looked
sane.  So there would've been no notice before trying to write the
garbage page back out.  (These last two holes, at least, are plugged in
7.2.  But if the OS gives us back a page of garbage instead of the data
we wrote out, how well can we be expected to survive that?)

Since Jeff was running on a Cygwin/Win2K setup, I'm quite happy to
write this off as an OS hiccup, unless someone can think of a mechanism
inside Postgres that could have provoked it.

			regards, tom lane
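To see why the damage shows up exactly at a pg_log page boundary, recall
that pg_log keeps two status bits per transaction, so where a given XID's
status lives is simple arithmetic.  The sketch below is for illustration
only: locate_xid_status is a made-up name, the constants assume the stock
8K BLCKSZ, and the exact bit layout in 7.1's transam.c may differ in
detail.

#include "postgres.h"

#define XIDS_PER_BYTE	4							/* 2 status bits per XID */
#define XIDS_PER_PAGE	(BLCKSZ * XIDS_PER_BYTE)	/* 32768 with BLCKSZ = 8192 */

/* Illustration only: which pg_log page/byte/bit pair covers a given XID. */
static void
locate_xid_status(TransactionId xid)
{
	uint32		pageno = xid / XIDS_PER_PAGE;
	int			byteno = (xid % XIDS_PER_PAGE) / XIDS_PER_BYTE;
	int			bshift = (xid % XIDS_PER_BYTE) * 2;

	elog(NOTICE, "xid %u: pg_log page %u, byte %d, bits %d-%d",
		 xid, pageno, byteno, bshift, bshift + 1);
}

So a last-checkpointed XID sitting within a few thousand transactions of
a 32768-XID boundary is enough to make recovery touch the following
pg_log page, which in Jeff's case was pure garbage.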
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> Indeed. Not sure what we can do about that.

> One thing I can think of is to prevent a corrupted page
> from spoiling other pages by jumping the page boundary
> in the buffer pool.

We do that already, no?

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > BTW doesn't the LSN corruption imply the possibility
> > of the corruption of other parts (of e.g. pg_log) ?
>
> Indeed. Not sure what we can do about that.

One thing I can think of is to prevent a corrupted page from spoiling
other pages by jumping the page boundary in the buffer pool.

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> One thing I can think of is to prevent a corrupted page
> from spoiling other pages by jumping the page boundary
> in the buffer pool.
>>
>> We do that already, no?

> Oh I may be missing something.
> Where is it checked ?

I know PageRepairFragmentation is real paranoid about this, because I
made it so recently.  I suppose it might be worth adding some more
sanity checks to PageAddItem, maybe PageZero (is that ever called on a
pre-existing page?), and PageIndexTupleDelete.  Seems like that should
about cover it --- noplace else inserts items on disk pages or
reshuffles disk page contents, AFAIK.

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Tom Lane wrote:
> >> Indeed. Not sure what we can do about that.
>
> > One thing I can think of is to prevent a corrupted page
> > from spoiling other pages by jumping the page boundary
> > in the buffer pool.
>
> We do that already, no?

Oh I may be missing something.
Where is it checked ?

regards, Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Tom Lane wrote:
>> I know PageRepairFragmentation is real paranoid about this, because I
>> made it so recently.  I suppose it might be worth adding some more
>> sanity checks to PageAddItem, maybe PageZero (is that ever called on a
>> pre-existing page?), and PageIndexTupleDelete.  Seems like that should
>> about cover it --- noplace else inserts items on disk pages or
>> reshuffles disk page contents, AFAIK.

> What about PageGetItem ?  It seems to be able to touch the item
> via HeapTupleSatisfies etc.

Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
bunch of new tests to PageGetItem --- that is much more of a hot spot
than PageAddItem, and it'll cost us something in speed I fear.

			regards, tom lane
Tom Lane wrote:
>
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > One thing I can think of is to prevent a corrupted page
> > from spoiling other pages by jumping the page boundary
> > in the buffer pool.
> >>
> >> We do that already, no?
>
> > Oh I may be missing something.
> > Where is it checked ?
>
> I know PageRepairFragmentation is real paranoid about this, because I
> made it so recently.  I suppose it might be worth adding some more
> sanity checks to PageAddItem, maybe PageZero (is that ever called on a
> pre-existing page?), and PageIndexTupleDelete.  Seems like that should
> about cover it --- noplace else inserts items on disk pages or
> reshuffles disk page contents, AFAIK.

What about PageGetItem ?  It seems to be able to touch the item
via HeapTupleSatisfies etc.

regards, Hiroshi Inoue
I said:
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> What about PageGetItem ?  It seems to be able to touch the item
>> via HeapTupleSatisfies etc.

> Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
> bunch of new tests to PageGetItem --- that is much more of a hot spot
> than PageAddItem, and it'll cost us something in speed I fear.

I wasn't totally comfortable with this (and I'm sure you weren't
either), but after more thought I still feel it's the right tradeoff.
Here are a couple of heuristic arguments why we don't need more error
checking in PageGetItem:

1. tqual.c won't ever try to set tuple status bits until it's checked
t_xmin or t_xmax against TransactionIdDidCommit/DidAbort.  If messed-up
page headers have caused us to compute a bogus item pointer, one would
expect that a more-or-less-random transaction ID will be passed to
TransactionIdDidCommit/DidAbort.  Now in 7.2, it's unlikely that more
than about 2 segments (2 million transactions' worth) of CLOG will
exist at any instant, so the odds that asking about a random XID will
produce an answer and not elog(STOP) are less than 1 in 1000.

2. If this happened in the field, the signature would be one or two
bits set apparently at random in an otherwise-okay page.  In the data
corruption cases I've been able to examine personally, I can't recall
ever having seen such a case.  The usual form of corruption is dozens
of consecutive bytes worth of garbage overlaying part of an
otherwise-valid page.  While I tend to blame such stuff on hardware
glitches (especially when the damage is aligned on power-of-2 byte
boundaries), it's certainly possible that it comes from a misdirected
memcpy, which is why I think it's a good idea to introduce more bounds
checking in PageAddItem and so forth.

If we start to see failure reports that look like they might have been
caused by tqual.c let loose on the wrong bits, we can certainly revisit
this decision.  But right now I think that adding more checks in
PageGetItem would waste a lot of cycles to little purpose.

BTW, to close the loop back to the original topic: I think it's quite
likely that some of the elog(STOP)s in clog.c will need to be reduced
to lesser error levels once we see what sorts of problems arise in the
field, just as we found that this particular elog(STOP) in xlog.c was
overkill.  But I want to wait and see which ones cause problems before
backing off the error severity.

I will go and add a few more sanity checks to bufpage.c.

			regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> I said:
> > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >> What about PageGetItem ?  It seems to be able to touch the item
> >> via HeapTupleSatisfies etc.
>
> > Hmm.  Strictly speaking I think you are right, but I'm hesitant to add a
> > bunch of new tests to PageGetItem --- that is much more of a hot spot
> > than PageAddItem, and it'll cost us something in speed I fear.
>
> I wasn't totally comfortable with this (and I'm sure you weren't
> either), but after more thought I still feel it's the right tradeoff.
> Here are a couple of heuristic arguments why we don't need more error
> checking in PageGetItem:

What I have in mind is, e.g., the following case.  Undoubtedly the page
is corrupted (too big an offset number).  I'm suspicious whether other
pages are safe under such a situation.

regards, Hiroshi Inoue

A part of the report "[GENERAL] Database corruption?" by Alvaro Herrera
[alvherre@atentus.com]:

> DEBUG: --Relation delay_171--
> NOTICE: Rel delay_171: TID 15502/4279: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4291: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4315: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4375: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4723: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4771: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4783: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4831: OID IS INVALID. TUPGONE 1.
> NOTICE: Rel delay_171: TID 15502/4843: OID IS INVALID. TUPGONE 0.
> NOTICE: Rel delay_171: TID 15502/4867: InsertTransactionInProgress 0 - can't shrink relation
> NOTICE: Rel delay_171: TID 15502/4867: OID IS INVALID. TUPGONE 0.
> [a lot of similarly looking lines]
> NOTICE: Rel delay_171: TID 15502/6067: OID IS INVALID. TUPGONE 0.
> Server process (pid 22773) exited with status 139 at Sun Oct 21 02:30:27 2001
> Terminating any active server processes...
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > What I have minded is e.g. the following case. > Undoubtedly the page is corrupted(too big offset number). > I'm suspicious if other pages are safe under such a situation. You have a point, but I still don't like slowing down PageGetItem. How about this instead: whenever we read in a page, check to see if its page header data is sane. We could do this right after the smgrread call in ReadBufferInternal, and follow the "status = SM_FAIL" exit path if we see trouble. regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > What I have in mind is, e.g., the following case.
> > Undoubtedly the page is corrupted (too big an offset number).
> > I'm suspicious whether other pages are safe under such a situation.
>
> You have a point, but I still don't like slowing down PageGetItem.
>
> How about this instead: whenever we read in a page, check to see
> if its page header data is sane.  We could do this right after the
> smgrread call in ReadBufferInternal, and follow the "status = SM_FAIL"
> exit path if we see trouble.

Agreed.  What we really want is not to see such troubles in the first
place.

regards, Hiroshi Inoue
> So the failure-to-start-up problem can be blamed entirely on 7.1's
> failure to do anything with LSN fields in pg_log pages.  I was able to

No, the first reported problem can be blamed on RAM failures.

> So I am still dissatisfied with doing elog(STOP) for this condition,
> as I regard it as an overly strong reaction to corrupted data;
> moreover, it does nothing to fix the problem and indeed gets in
> the way of fixing the problem.

Totally agreed, but...

> I propose the attached patch.
> What do you think?
> ...
> +	if (XLByteLT(LogwrtResult.Flush, record))
> +		elog(InRecovery ? NOTICE : ERROR,

I suggest also setting some flag here if InRecovery, and then, after all
data buffers are flushed, doing elog(STOP, "DATA FILE(S) CORRUPTED!
RESTORE DATA FROM BACKUP OR RESET WAL TO DUMP/MANUALLY FIX ERRORS") -
or something like that -:)

What's wrong with this?  It's not OK to automatically restart knowing
about errors in the data.

Vadim
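A rough sketch of the suggestion above, just to pin the idea down.  These
are patch-style fragments, not a worked-out change: the flag name and the
place it would be tested at the end of recovery are assumptions.

/* Somewhere in xlog.c: remember that recovery hit an unsatisfiable flush. */
static bool recovery_saw_bad_lsn = false;

/* In XLogFlush(), where the proposed patch issues NOTICE vs. ERROR: */
	if (XLByteLT(LogwrtResult.Flush, record))
	{
		if (InRecovery)
			recovery_saw_bad_lsn = true;	/* note it, but keep going */
		elog(InRecovery ? NOTICE : ERROR,
			 "XLogFlush: request %X/%X is not satisfied --- flushed only to %X/%X",
			 record.xlogid, record.xrecoff,
			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
	}

/* And at the end of recovery, after all data buffers have been flushed: */
	if (recovery_saw_bad_lsn)
		elog(STOP, "data file(s) corrupted: restore from backup, or reset WAL "
			 "and dump / manually fix the damaged pages");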
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> So I am still dissatisfied with doing elog(STOP) for this condition, >> as I regard it as an overly strong reaction to corrupted data; >> moreover, it does nothing to fix the problem and indeed gets in >> the way of fixing the problem. > ... It's not Ok automatically restart > knowing about errors in data. Actually, I disagree. If we come across clearly corrupt data values (eg, bad length word for a varlena item, or even tuple-header errors such as a bad XID), we do not try to force the admin to restore the database from backup, do we? A bogus LSN is bad, certainly, but it is not the end of the world and does not deserve a panic reaction. At worst it tells us that one data page is corrupt. A robust system should report that and keep plugging. What would be actually useful here is to report which page contains the bad LSN, so that the admin could look at it and decide what to do. xlog.c doesn't know that, unfortunately. I'd be more interested in expending work to make that happen than in expending work to make a dbadmin's life more difficult --- and I rank forced stops in the latter category. regards, tom lane
...
> > ... It's not OK to automatically restart
> > knowing about errors in the data.
...
> At worst it tells us that one data page is corrupt.  A robust system
> should report that and keep plugging.

Hmm.  I'm not sure that this needs an "either-or" resolution on the
general topic of error recovery.

Back when I used Ingres, it had the feature that corruption would mark
the database as "readonly" (a mode I'd like us to have -- even without
errors -- to help support upgrades, updates, and error handling).  So an
administrator could evaluate the damage without causing further damage,
but could still allow users to rummage through the database at the same
time.

I have a hard time believing that we should *always* allow the database
to keep writing in the face of *any* detected error.  I'm sure that is
not what Tom is saying, but in this case could further damage be caused
by subsequent writing when we *already* know that there is some
corruption?  If so, we should consider supporting some sort of error
state that prevents further damage.

Vadim's solution uses the only current mechanism available, which is to
force the database to shut down until it can be evaluated.  But if we
had some stronger mechanisms to support limited operation, that might
help in this case and would certainly help in other situations too.

                     - Thomas
Thomas Lockhart <lockhart@fourpalms.org> writes:
> If so, we should consider supporting some sort of error
> state that prevents further damage.

This seems reasonable (though I'd still question whether a bad LSN is
sufficient reason to force the whole database into read-only mode).

> Vadim's solution uses the only
> current mechanism available, which is to force the database to shut down
> until it can be evaluated.

But one of the big problems with his solution is that it gets in the
way of evaluating the problem.  A read-only mode seems like a better
way.

			regards, tom lane