Re: Corrupt Table - Mailing list pgsql-general

From Tom Lane
Subject Re: Corrupt Table
Date
Msg-id 14386.968949390@sss.pgh.pa.us
Whole thread Raw
In response to Corrupt Table  ("Bryan White" <bryan@arcamax.com>)
List pgsql-general
"Bryan White" <bryan@arcamax.com> writes:
> Here is what a failure looks like in the log file:
> --------------------
> Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000

That should produce a coredump --- can you get a backtrace?

> I have written a small utility program (pganal).  It looks for
> inconsistancies in page layout and tuple layout.  My original intent was to
> parse the tuple internal structure as well but that proved to be more
> complex that I was ready to handle at the time.

Cool; want to submit it as a contrib item?  This sounds like something
that could be gradually improved into a "fsck" kind of thing...

> Anyway I stopped the database, copied the customer file to another directory
> and restarted the database.  Here is the pganal output from this copy:
> --------------------------
> Analyzing customer
> Page 25878 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Page 31395 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Page 32950 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Tuple 71453.0 Ofs=8029 Len=164 Flags=1 Error: tuple overwrites pd_special
> Tuple 71453.4 Ofs=7346 Len=208 Flags=1 Error: tuple overlaps another
> Tuple 71453.40 Ofs=1365 Len=160 Flags=1 Error: tuple overlaps another
> Page 71958 ERROR: pd_lower has odd value
>     pd_lower=11886 pd_upper=24239 pd_special=109 pd_opaque.od_pagesize=0
> Page 73622 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0

Hmm.  The all-zero pages (I expect you'll find that 25878 etc are *all*
zeroes, not just their header fields) look like an old bug wherein
a newly-added page might not get initialized if the transaction that's
adding the page aborts just after allocating the page.  I thought I'd
fixed that in 7.0 though.  You are running 7.0.2 I hope?

A VACUUM should patch up zero pages.  I'm guessing that you haven't
vacuumed this table in a long time...

Page 71958 looks pretty badly corrupted --- you'll have to look at that
and see if you can clean it up by hand.  Something fishy about 71453
as well.  Worst case, you could set these pages to all-zero by hand,
and just lose the tuples thereon rather than the whole table.

> My question is how do I proceed from here.  Going back to the previous day's
> backup would be very painful in terms of lost data.  I suspect the answer is
> to perform surgery on the bad pages and then rebuild indexes but this is a
> scary idea.

How fast does your app add/update tuples in this table?  If you are
lucky, the tuples in pages 71453 and 71958 might be available from your
last successful backup, in which case trying to patch up the page
contents by hand is probably a waste of effort.  Zero those pages,
dump out the current contents of the file with COPY, and start comparing
that to your last backup.  The fact that you haven't vacuumed will make
this pretty easy, because the tuple ordering should be the same.

If you do choose to recover by zeroing pages, it'd be a good idea to
drop and recreate the indexes on the table.  Sooner or later you should
do a vacuum to fix the zero pages, but not just yet --- you want to
leave the tuples in their current ordering for comparison to your
backup ...

            regards, tom lane

pgsql-general by date:

Previous
From: Stephan Szabo
Date:
Subject: Re: recovery of db after system crash- part II
Next
From: James Hall
Date:
Subject: Unable to install DBD module