Thread: Crash while recovering database index relation
Hi,

On one of our test boxen here, we've experienced a corrupted file during
database recovery after box power outage. The specific error message is

    PANIC: invalid page header in block 6 of relation "17792"

At this point I fired up a hex dumper to inspect the file, and the last
block in the file (which that error refers to) was clearly garbage.

This was on postgres 7.4. The system in question is using ReiserFS, and some
journal transactions were replayed on the same boot as the failed postgres
recovery. I believe this is significant (see below).

By using postgres single-user database server and zero_damaged_pages option
I managed to get the database up again. There were a LOT of relations with
this problem!

It may be significant that this is an index (primary key) for a relation.
ALL of the files with problems were either indexes or primary keys!

I do NOT believe this was a hardware error. What I think happened is:

- postgres extended some indexes
- reiserfs journalled the metadata
- new file contents got buffered by the kernel in memory
- XLog stuff gets fsync()'d
- Power cycle
- reiserfs replayed metadata journal, extended the files
  Probably makes the last blocks in each file invalid!
- postgres attempts to recover from its log, and bumps into the (now
  garbage) blocks

I'll see if I can get some time to reproduce this reliably.

Guy Thornley
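The sequence of events suggested above can be sketched as a toy model. This is only an illustration under stated assumptions, not ReiserFS or PostgreSQL code: the ToyFile class, its method names and the GARBAGE marker are all hypothetical. It assumes that the file length is journalled as soon as the file is extended, while the block contents stay in a volatile cache until they are flushed.

```python
BLOCK_SIZE = 8192  # assumed block size for the illustration

# Hypothetical stand-in for whatever stale bytes sit in a freshly
# allocated disk block that was never written.
GARBAGE = b"\xde\xad" * (BLOCK_SIZE // 2)


class ToyFile:
    """Toy model of one file on a metadata-journalling filesystem."""

    def __init__(self):
        self.journalled_blocks = 0  # file length, as recorded in the journal
        self.disk = {}              # block number -> bytes really on disk
        self.cache = {}             # block number -> bytes only in RAM

    def extend(self, data):
        """Append one block: the new length is durable, the data is not."""
        block_no = self.journalled_blocks
        self.journalled_blocks += 1
        self.cache[block_no] = data
        return block_no

    def flush(self):
        """Write cached blocks to disk, as a flush of the file would."""
        self.disk.update(self.cache)
        self.cache.clear()

    def power_cycle(self):
        """Lose everything that lived only in RAM."""
        self.cache.clear()

    def read_after_replay(self, block_no):
        """Read a block once the journal has restored the file length."""
        if block_no >= self.journalled_blocks:
            raise IndexError("block beyond end of file")
        # Blocks that were never flushed come back as garbage.
        return self.disk.get(block_no, GARBAGE)


f = ToyFile()
f.extend(b"\x01" * BLOCK_SIZE)
f.flush()                        # block 0 reaches the disk
f.extend(b"\x02" * BLOCK_SIZE)   # block 1 is only buffered
f.power_cycle()
last_block = f.read_after_replay(1)  # length says it exists, content is junk
```

In this model the file keeps its extended length after the power cycle, but the last block holds whatever was on disk before, which matches the garbage seen in the hex dump.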
Guy Thornley <guy@esphion.com> writes:
> On one of our test boxen here, we've experienced a corrupted file during
> database recovery after box power outage. The specific error message is
> PANIC: invalid page header in block 6 of relation "17792"
> This was on postgres 7.4.

I believe this is fixed in 7.4.1:

2003-12-01 11:53  tgl

	* src/backend/storage/buffer/: bufmgr.c (REL7_3_STABLE), bufmgr.c
	(REL7_4_STABLE), bufmgr.c: Force zero_damaged_pages to be
	effectively ON during recovery from WAL, since there is no need to
	worry about damaged pages when we are going to overwrite them
	anyway from the WAL. Per recent discussion.

> By using postgres single-user database server and zero_damaged_pages option
> I managed to get the database up again. There were a LOT of relations with
> this problem !

And no sign of corruption after you'd run through the recovery with
zero_damaged_pages? That's what I'd expect if this scenario applies:
the pages will be fixed by WAL recovery, it's just that the recently
added check for broken page headers was interfering :-(

			regards, tom lane
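The behaviour described in that commit message can be illustrated roughly as follows. This is a simplified sketch, not the bufmgr.c code: read_page, header_looks_valid, InvalidPageHeader and the MAGIC marker are hypothetical names, and the validity check is a made-up placeholder for the real page header check. The only point is the control flow: a damaged page is an error in normal operation, but is replaced by a zeroed page during recovery, because WAL replay is expected to overwrite it anyway.

```python
BLOCK_SIZE = 8192  # assumed block size for the illustration

# Made-up marker for the toy validity check; real page headers are
# structured differently.
MAGIC = b"PAGE"


class InvalidPageHeader(Exception):
    """Raised when a page fails the (toy) header check."""


def header_looks_valid(page):
    """Toy check: accept an all-zero page or one starting with MAGIC."""
    return page == bytes(BLOCK_SIZE) or page.startswith(MAGIC)


def read_page(raw, in_recovery, zero_damaged_pages=False):
    """Return a usable page image, or raise if the page is damaged.

    During recovery a damaged page is treated as if zero_damaged_pages
    were on: it is replaced by zeroes instead of raising.
    """
    if len(raw) != BLOCK_SIZE:
        raise ValueError("expected exactly one block")
    if header_looks_valid(raw):
        return raw
    if in_recovery or zero_damaged_pages:
        return bytes(BLOCK_SIZE)  # zeroed page, to be rewritten from the log
    raise InvalidPageHeader("invalid page header")


garbage = b"\xde\xad" * (BLOCK_SIZE // 2)
recovered = read_page(garbage, in_recovery=True)  # zeroed, no error
```

With in_recovery=False and zero_damaged_pages=False the same garbage block raises InvalidPageHeader, which corresponds to the PANIC reported at the top of the thread.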
> > PANIC: invalid page header in block 6 of relation "17792"
> > This was on postgres 7.4.
>
> I believe this is fixed in 7.4.1:

...

> And no sign of corruption after you'd run through the recovery with
> zero_damaged_pages?

I checked them this morning; there isn't. Sorry for bugging you about
something already fixed.

> That's what I'd expect if this scenario applies:
> the pages will be fixed by WAL recovery, it's just that the recently
> added check for broken page headers was interfering :-(

What I don't grok is why all the affected files were indexes, and none
of the heap files appeared to have junk pages.

Guy Thornley
Guy Thornley <guy@esphion.com> writes:
>> That's what I'd expect if this scenario applies:
>> the pages will be fixed by WAL recovery, it's just that the recently
>> added check for broken page headers was interfering :-(

> What I don't grok is why all the affected files were indexes, and none
> of the heap files appeared to have junk pages

Hmmm ... that is mildly interesting, but it doesn't rise to the level of
warning bells in my head. At least not yet.

Were the indexes involved all on the same table, or different tables?
If the former, it could just be that that was the last set of changes to
be flushed out after an update of that table. If they were on different
tables then it's a more surprising coincidence. Could happen anyway
I suppose --- index pages are likely to be more heavily accessed than
heap pages, and thus less likely to get flushed out of the buffer cache.

			regards, tom lane
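The last point, that heavily accessed pages are less likely to be pushed out of a cache, can be illustrated with a small least-recently-used model. This is a generic sketch and an assumption for illustration only: it does not reproduce the actual PostgreSQL buffer manager, and the function name and page labels are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity):
    """Replay page accesses through a plain LRU cache.

    Returns (evicted, resident): the pages pushed out of the cache, in
    order, and the pages still cached at the end.
    """
    cache = OrderedDict()
    evicted = []
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
            continue
        if len(cache) >= capacity:
            victim, _ = cache.popitem(last=False)  # least recently used
            evicted.append(victim)
        cache[page] = True
    return evicted, list(cache)


# One index page touched between every heap page access.
accesses = ["idx", "heap0", "idx", "heap1", "idx", "heap2", "idx", "heap3"]
evicted, resident = simulate_lru(accesses, capacity=2)
```

In this access pattern only heap pages are evicted, while the index page stays resident throughout, so its latest contents would be the ones most at risk of never having reached disk before a power loss.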
> > What I don't grok is why all the affected files were indexes, and none
> > of the heap files appeared to have junk pages
>
> Hmmm ... that is mildly interesting, but it doesn't rise to the level of
> warning bells in my head.

I played around a bit yesterday with an INSERT'ing shell script and a reset
button... I can now, with reasonable confidence, say it was pure coincidence
they were all index files. I had the junk pages in normal heap files as well
as index files on several occasions while testing.

> Were the indexes involved all on the same table, or different tables?

Different tables, which is what aroused my own curiosity :)

Guy