Thread: Crash while recovering database index relation

Crash while recovering database index relation

From: Guy Thornley

Hi,

On one of our test boxen here, we've experienced a corrupted file during
database recovery after a power outage. The specific error message is:

    PANIC: invalid page header in block 6 of relation "17792"

At this point I fired up a hex dumper to inspect the file, and the last
block in the file (which is the one that error refers to) was clearly garbage.
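
For anyone wanting to repeat that inspection without a hex editor, here is a
minimal sketch. It assumes the default 8192-byte block size (the thread does
not state the build's BLCKSZ), and the helper names read_block, hexdump and
is_all_zero are made up for illustration. It only dumps bytes; it makes no
attempt to decode the page header.

```python
BLCKSZ = 8192  # assumed: PostgreSQL's default block size, not stated in the thread


def read_block(path, blkno, blcksz=BLCKSZ):
    """Return the raw bytes of block `blkno` (short or empty if past end of file)."""
    with open(path, "rb") as f:
        f.seek(blkno * blcksz)
        return f.read(blcksz)


def hexdump(data, width=16):
    """Format bytes as offset / hex / printable-ASCII lines, like a hex dumper."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)


def is_all_zero(data):
    """True if every byte is zero (also true for an empty read)."""
    return not any(data)
```

Something like `print(hexdump(read_block(path_to_relation_file, 6)[:64]))`
would show the first 64 bytes of the block named in the PANIC message.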

This was on postgres 7.4. The system in question is using ReiserFS, and some
journal transactions were replayed on the same boot as the failed postgres
recovery. I believe this is significant (see below).

By using the postgres single-user database server and the zero_damaged_pages
option I managed to get the database up again. There were a LOT of relations
with this problem!

It may be significant that this is an index (primary key) for a relation.
ALL of the files with problems were either indexes or primary keys!

I do NOT believe this was a hardware error. What I think happened is:
- postgres extended some indexes
- reiserfs journalled the metadata
- new file contents got buffered by the kernel in memory
- XLog stuff gets fsync()'d
- Power cycle
- reiserfs replayed metadata journal, extended the files
  Probably makes the last blocks in each file invalid!
- postgres attempts to recover from its log, and bumps into the (now
  garbage) blocks
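
The sequence above can be sketched as a toy model. This is only an
illustration of the hypothesis, not of how ReiserFS or the kernel actually
work: the ToyFile class, its methods and the GARBAGE filler are all invented
here, and the 8192-byte block size is an assumption.

```python
BLCKSZ = 8192                        # assumed default PostgreSQL block size
GARBAGE = b"\xde\xad" * (BLCKSZ // 2)  # stand-in for stale sector contents


class ToyFile:
    """Toy model: file length is journalled, block contents may exist only in cache."""

    def __init__(self):
        self.disk_blocks = {}       # block number -> bytes that reached the disk
        self.cached_blocks = {}     # block number -> bytes only in kernel memory
        self.journalled_length = 0  # length in blocks, per the metadata journal

    def extend(self, data):
        """Append one block: metadata is journalled at once, data is only cached."""
        blkno = self.journalled_length
        self.cached_blocks[blkno] = data
        self.journalled_length += 1
        return blkno

    def fsync(self):
        """Flush cached data blocks to disk."""
        self.disk_blocks.update(self.cached_blocks)
        self.cached_blocks.clear()

    def power_cycle(self):
        """Lose the cache; journal replay keeps the length but not the contents."""
        self.cached_blocks.clear()

    def read(self, blkno):
        """Return cached data if present, else what is on disk, else garbage."""
        if blkno >= self.journalled_length:
            raise IndexError("read past end of file")
        if blkno in self.cached_blocks:
            return self.cached_blocks[blkno]
        return self.disk_blocks.get(blkno, GARBAGE)
```

In this model, extending a file and then calling power_cycle() without an
intervening fsync() leaves a block that exists according to the file length
but reads back as garbage, which matches what the hex dump showed.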

I'll see if I can get some time to reproduce this reliably.

Guy Thornley

Re: Crash while recovering database index relation

From: Tom Lane

Guy Thornley <guy@esphion.com> writes:
> On one of our test boxen here, we've experienced a corrupted file during
> database recovery after a power outage. The specific error message is:
>     PANIC: invalid page header in block 6 of relation "17792"
> This was on postgres 7.4.

I believe this is fixed in 7.4.1:

2003-12-01 11:53  tgl

    * src/backend/storage/buffer/: bufmgr.c (REL7_3_STABLE), bufmgr.c
    (REL7_4_STABLE), bufmgr.c: Force zero_damaged_pages to be
    effectively ON during recovery from WAL, since there is no need to
    worry about damaged pages when we are going to overwrite them
    anyway from the WAL.  Per recent discussion.

> By using the postgres single-user database server and the zero_damaged_pages
> option I managed to get the database up again. There were a LOT of relations
> with this problem!

And no sign of corruption after you'd run through the recovery with
zero_damaged_pages?  That's what I'd expect if this scenario applies:
the pages will be fixed by WAL recovery, it's just that the recently
added check for broken page headers was interfering :-(
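
A rough sketch of the interaction described here, under loud assumptions:
this is not the bufmgr.c code. The page "header" is reduced to a made-up
MAGIC prefix, WAL records are modelled as whole-page overwrites, and the
names header_looks_valid, read_page, replay and force_during_recovery are
all hypothetical.

```python
BLCKSZ = 8192    # assumed default block size
MAGIC = b"PAGE"  # toy marker; PostgreSQL's real header check is different


class InvalidPageHeader(Exception):
    pass


def header_looks_valid(page):
    """Toy check: accept an all-zero page or one starting with MAGIC."""
    return not any(page) or page.startswith(MAGIC)


def read_page(storage, blkno, zero_damaged_pages):
    """Fetch a page, zeroing it instead of failing if damage is tolerated."""
    page = storage[blkno]
    if not header_looks_valid(page):
        if not zero_damaged_pages:
            raise InvalidPageHeader(f"invalid page header in block {blkno}")
        page = bytes(BLCKSZ)
        storage[blkno] = page
    return page


def replay(storage, wal_records, zero_damaged_pages=False,
           force_during_recovery=True):
    """Apply (blkno, new_page) records; the old page content is discarded anyway."""
    tolerate = zero_damaged_pages or force_during_recovery
    for blkno, new_page in wal_records:
        read_page(storage, blkno, tolerate)  # the header check happens here
        storage[blkno] = new_page            # the WAL overwrites the page
```

With force_during_recovery=False the toy replay stops on a garbage page, as
the 7.4 recovery did; with it left on, the page is zeroed and then
overwritten from the log.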

            regards, tom lane

Re: Crash while recovering database index relation

From: Guy Thornley

> >     PANIC: invalid page header in block 6 of relation "17792"
> > This was on postgres 7.4.
>
> I believe this is fixed in 7.4.1:
 ...

> And no sign of corruption after you'd run through the recovery with
> zero_damaged_pages?
I checked them this morning; there isn't.
Sorry for bugging you about something already fixed.

> That's what I'd expect if this scenario applies:
> the pages will be fixed by WAL recovery, it's just that the recently
> added check for broken page headers was interfering :-(

What I don't grok is why all the affected files were indexes, and none
of the heap files appeared to have junk pages.

Guy Thornley

Re: Crash while recovering database index relation

From: Tom Lane

Guy Thornley <guy@esphion.com> writes:
>> That's what I'd expect if this scenario applies:
>> the pages will be fixed by WAL recovery, it's just that the recently
>> added check for broken page headers was interfering :-(

> What I don't grok is why all the affected files were indexes, and none
> of the heap files appeared to have junk pages

Hmmm ... that is mildly interesting, but it doesn't rise to the level of
warning bells in my head.  At least not yet.  Were the indexes involved
all on the same table, or different tables?  If the former, it could
just be that that was the last set of changes to be flushed out after an
update of that table.  If they were on different tables then it's a more
surprising coincidence.  Could happen anyway I suppose --- index pages
are likely to be more heavily accessed than heap pages, and thus less
likely to get flushed out of the buffer cache.

            regards, tom lane

Re: Crash while recovering database index relation

From: Guy Thornley

> > What I don't grok is why all the affected files were indexes, and none
> > of the heap files appeared to have junk pages
>
> Hmmm ... that is mildly interesting, but it doesn't rise to the level of
> warning bells in my head.

I played around a bit yesterday with an INSERT'ing shell script and a reset
button... I can now say, with reasonable confidence, that it was pure
coincidence they were all index files. I had junk pages in normal heap files
as well as index files on several occasions while testing.
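
The script itself was not posted. A minimal sketch of that kind of insert
load, written in Python rather than shell, might look like the following.
The table name crashtest, its id and payload columns, and the --run flag are
all assumptions made for illustration.

```python
import itertools
import sys


def insert_statements(table="crashtest", start=1):
    """Yield an endless stream of single-row INSERT statements."""
    for i in itertools.count(start):
        yield f"INSERT INTO {table} (id, payload) VALUES ({i}, 'row {i}');"


def main(out=sys.stdout):
    """Write statements one per line, flushing so each reaches the pipe promptly."""
    for stmt in insert_statements():
        out.write(stmt + "\n")
        out.flush()


# Only start the endless loop when asked to explicitly.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

Piping the output into psql against a table with a primary key keeps both
the heap and the index growing until the reset button is pressed.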

> Were the indexes involved all on the same table, or different tables?
Different tables, which is what aroused my own curiosity :)

Guy