Thread: Corrupt Table

Corrupt Table

From
"Bryan White"
Date:
I have apparently picked up a corrupt record in a table.

What happened:
Yesterday at one point the database seemed to hang.  There were three backend
processes consuming large amounts of CPU time.  I stopped the server and
rebooted (3 months since last reboot).  The database restarted and seemed to
be fine.

Then last night the nightly backup failed, apparently while reading the
'customer' table.  The database restarted itself.  There have been a couple
of database restarts since then.  As far as I can tell it is the customer
table that is the problem.

Here is what a failure looks like in the log file:
--------------------
Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000
Terminating any active server processes...
000914.10:13:11.425  [5879] NOTICE:  Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend died
abnormally....
-------------------
The last entry is repeated multiple times.

I have written a small utility program (pganal).  It looks for
inconsistencies in page layout and tuple layout.  My original intent was to
parse the tuple internal structure as well but that proved to be more
complex than I was ready to handle at the time.

Anyway I stopped the database, copied the customer file to another directory
and restarted the database.  Here is the pganal output from this copy:
--------------------------
Analyzing customer
Page 25878 ERROR: pd_lower is too small
    pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Page 31395 ERROR: pd_lower is too small
    pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Page 32950 ERROR: pd_lower is too small
    pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
Tuple 71453.0 Ofs=8029 Len=164 Flags=1 Error: tuple overwrites pd_special
Tuple 71453.4 Ofs=7346 Len=208 Flags=1 Error: tuple overlaps another
Tuple 71453.40 Ofs=1365 Len=160 Flags=1 Error: tuple overlaps another
Page 71958 ERROR: pd_lower has odd value
    pd_lower=11886 pd_upper=24239 pd_special=109 pd_opaque.od_pagesize=0
Page 73622 ERROR: pd_lower is too small
    pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0

Page Summary
Data Pages   = 76555
Unused Pages = 0
New Pages    = 0
Empty Pages  = 0
Bad Pages    = 5
Total Pages  = 76560

Tuple Summary
O/L Error Tuples = 1
Overlaped Tuples = 2
Unused Tuples    = 47994
Used Tuples      = 3698495
Total Tuples     = 3746492
--------------------------

I suspect the 'pd_lower is too small' errors may be just my misunderstanding
of the page layout.
The three tuple errors (all on the same page) and the 'pd_lower has odd
value' error seem to be real.
'pd_lower has odd value' comes from:
        /* number of line pointers implied by pd_lower */
        int nitems = (phd->pd_lower - sizeof(*phd)) / sizeof(ItemIdData);
        /* flag pages where that division did not come out even */
        if(nitems * sizeof(ItemIdData) != phd->pd_lower - sizeof(*phd))
            pderr = "pd_lower has odd value";
Basically it means the pd_lower did not leave room for an integral number of
ItemIdData structures.

I seem to have two separate corrupt pages.  I can post the full source to
pganal if anyone is interested.  It's about 300 lines.
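
To give a flavor of the checks, here is a stripped-down, untested sketch of
the page-level part.  The struct definitions are simplified, hand-copied
approximations of the 7.0-era bufpage.h and itemid.h (and BLCKSZ is assumed
to be the default 8192), so check them against the real headers before
trusting any offsets:

/*
 * Sketch of the page-level checks (not the full tool, untested as shown).
 */
#include <stdio.h>

#define BLCKSZ 8192

typedef struct {                /* approximates ItemIdData */
    unsigned lp_off:15,         /* byte offset of the tuple within the page */
             lp_flags:2,        /* LP_USED etc. */
             lp_len:15;         /* byte length of the tuple */
} ItemIdData;

typedef struct {                /* approximates PageHeaderData (pd_linp omitted) */
    unsigned short pd_lower;    /* end of the line-pointer array */
    unsigned short pd_upper;    /* start of tuple storage */
    unsigned short pd_special;  /* start of the "special" space */
    unsigned short pd_opaque;   /* od_pagesize lives here */
} PageHeaderData;

static void check_page(long pageno, const unsigned char *buf)
{
    const PageHeaderData *phd = (const PageHeaderData *) buf;
    const ItemIdData *lp = (const ItemIdData *) (buf + sizeof(*phd));
    int nitems, i;

    if (phd->pd_lower < sizeof(*phd)) {
        printf("Page %ld ERROR: pd_lower is too small\n", pageno);
        return;
    }
    if ((phd->pd_lower - sizeof(*phd)) % sizeof(ItemIdData) != 0) {
        printf("Page %ld ERROR: pd_lower has odd value\n", pageno);
        return;
    }
    nitems = (phd->pd_lower - sizeof(*phd)) / sizeof(ItemIdData);
    for (i = 0; i < nitems; i++) {
        if (!(lp[i].lp_flags & 1))      /* skip unused line pointers */
            continue;
        if (lp[i].lp_off < phd->pd_upper ||
            lp[i].lp_off + lp[i].lp_len > phd->pd_special)
            printf("Tuple %ld.%d ERROR: outside pd_upper..pd_special\n",
                   pageno, i);
        /* a fuller check would also sort the used items by lp_off and
           report any pair whose byte ranges overlap ("tuple overlaps
           another") */
    }
}

int main(int argc, char **argv)
{
    FILE *f;
    unsigned char buf[BLCKSZ];
    long pageno = 0;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    while (fread(buf, BLCKSZ, 1, f) == 1)
        check_page(pageno++, buf);
    fclose(f);
    return 0;
}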

My question is: how do I proceed from here?  Going back to the previous day's
backup would be very painful in terms of lost data.  I suspect the answer is
to perform surgery on the bad pages and then rebuild indexes but this is a
scary idea.  Has anyone else created tools to deal with this kind of
problem?

Bryan White, ArcaMax.com, VP of Technology
You can't deny that it is not impossible, can you.


Re: Corrupt Table

From
"Bryan White"
Date:
Here is a follow-up.  I did a hex/ASCII dump of the 3 bad tuples.  In the
dump I could pick out an email address.  This is an indexed field.  I did a
select on each of them in the live database.  The first and third were not
found.  The second worked OK if I only selected the customer id (an int4 and
the first field in the record), but the custid it reported seems to be
nonsense.  The backend crashed if I selected the whole record.


Re: Corrupt Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> Here is what a failure looks like in the log file:
> --------------------
> Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000

That should produce a coredump --- can you get a backtrace?

> I have written a small utility program (pganal).  It looks for
> inconsistencies in page layout and tuple layout.  My original intent was to
> parse the tuple internal structure as well but that proved to be more
> complex than I was ready to handle at the time.

Cool; want to submit it as a contrib item?  This sounds like something
that could be gradually improved into a "fsck" kind of thing...

> Anyway I stopped the database, copied the customer file to another directory
> and restarted the database.  Here is the pganal output from this copy:
> --------------------------
> Analyzing customer
> Page 25878 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Page 31395 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Page 32950 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0
> Tuple 71453.0 Ofs=8029 Len=164 Flags=1 Error: tuple overwrites pd_special
> Tuple 71453.4 Ofs=7346 Len=208 Flags=1 Error: tuple overlaps another
> Tuple 71453.40 Ofs=1365 Len=160 Flags=1 Error: tuple overlaps another
> Page 71958 ERROR: pd_lower has odd value
>     pd_lower=11886 pd_upper=24239 pd_special=109 pd_opaque.od_pagesize=0
> Page 73622 ERROR: pd_lower is too small
>     pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0

Hmm.  The all-zero pages (I expect you'll find that 25878 etc are *all*
zeroes, not just their header fields) look like an old bug wherein
a newly-added page might not get initialized if the transaction that's
adding the page aborts just after allocating the page.  I thought I'd
fixed that in 7.0 though.  You are running 7.0.2 I hope?

A VACUUM should patch up zero pages.  I'm guessing that you haven't
vacuumed this table in a long time...

Page 71958 looks pretty badly corrupted --- you'll have to look at that
and see if you can clean it up by hand.  Something fishy about 71453
as well.  Worst case, you could set these pages to all-zero by hand,
and just lose the tuples thereon rather than the whole table.

> My question is: how do I proceed from here?  Going back to the previous day's
> backup would be very painful in terms of lost data.  I suspect the answer is
> to perform surgery on the bad pages and then rebuild indexes but this is a
> scary idea.

How fast does your app add/update tuples in this table?  If you are
lucky, the tuples in pages 71453 and 71958 might be available from your
last successful backup, in which case trying to patch up the page
contents by hand is probably a waste of effort.  Zero those pages,
dump out the current contents of the file with COPY, and start comparing
that to your last backup.  The fact that you haven't vacuumed will make
this pretty easy, because the tuple ordering should be the same.

If you do choose to recover by zeroing pages, it'd be a good idea to
drop and recreate the indexes on the table.  Sooner or later you should
do a vacuum to fix the zero pages, but not just yet --- you want to
leave the tuples in their current ordering for comparison to your
backup ...

            regards, tom lane

Re: Corrupt Table

From
"Bryan White"
Date:
> > Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000
>
> That should produce a coredump --- can you get a backtrace?

I found a core file.  I am not all that familiar with gdb but the backtrace
looks useless:
#0  0x8064fb4 in ?? ()
#1  0x809da10 in ?? ()
#2  0x809e538 in ?? ()
#3  0x809e0c6 in ?? ()
#4  0x809e176 in ?? ()
#5  0x809e2dc in ?? ()
#6  0x809e5ab in ?? ()
#7  0x809e666 in ?? ()
#8  0x809eb00 in ?? ()
#9  0x80a1f90 in ?? ()
#10 0x809d3ae in ?? ()
#11 0x80a31a4 in ?? ()
#12 0x809d3b7 in ?? ()
#13 0x809c6b9 in ?? ()
#14 0x809bd0e in ?? ()
#15 0x80ec132 in ?? ()
#16 0x80ec19c in ?? ()
#17 0x80eadc7 in ?? ()
#18 0x80eaca7 in ?? ()
#19 0x80ebba2 in ?? ()
#20 0x80d61f2 in ?? ()
#21 0x80d5dd1 in ?? ()
#22 0x80d518a in ?? ()
#23 0x80d4c14 in ?? ()
#24 0x80ab736 in ?? ()
#25 0x401029cb in ?? ()

> Cool; want to submit it as a contrib item?  This sounds like something
> that could be gradually improved into a "fsck" kind of thing...

Right now it is sort of 'hack it up as needed'.  I will try to polish it up
and add command-line options to control it.

>Hmm.  The all-zero pages (I expect you'll find that 25878 etc are *all*
>zeroes, not just their header fields) look like an old bug wherein
>a newly-added page might not get initialized if the transaction that's
>adding the page aborts just after allocating the page.  I thought I'd
>fixed that in 7.0 though.  You are running 7.0.2 I hope?

Yes I am running 7.0.2.  The 4 pages in question have 0's in the first 16
bytes but other data after that.  I see some text that looks like real data.

> A VACUUM should patch up zero pages.  I'm guessing that you haven't
> vacuumed this table in a long time...

Vacuum occurs nightly just after the backup.  I checked and it ran fine the
night before.  Last night's vacuum reported: psql: The Data Base System is in
recovery mode

>
> Page 71958 looks pretty badly corrupted --- you'll have to look at that
> and see if you can clean it up by hand.  Something fishy about 71453
> as well.  Worst case, you could set these pages to all-zero by hand,
> and just lose the tuples thereon rather than the whole table.
>
> How fast does your app add/update tuples in this table?  If you are
> lucky, the tuples in pages 71453 and 71958 might be available from your
> last successful backup, in which case trying to patch up the page
> contents by hand is probably a waste of effort.  Zero those pages,
> dump out the current contents of the file with COPY, and start comparing
> that to your last backup.  The fact that you haven't vacuumed will make
> this pretty easy, because the tuple ordering should be the same.
>
> If you do choose to recover by zeroing pages, it'd be a good idea to
> drop and recreate the indexes on the table.  Sooner or later you should
> do a vacuum to fix the zero pages, but not just yet --- you want to
> leave the tuples in their current ordering for comparison to your
> backup ...

I suspect diff will produce more output than I want to deal with.  Customer
records are never deleted from this table, so I think the thing to do is copy
all customers from the previous good backup that are not in the cleaned-up
customer table.  I will lose some edits but it should not be too bad.

Ok here is my plan:
1) Stop the server
2) Backup the physical customer file
3) Zero out all the corrupt pages (see the sketch after this list).
4) Restart the database for localhost access only.
5) Dump the customer table
6) Reload the customer table from the dump (I know it is now clean)
7) Recreate the customer indexes.
8) Vacuum the customer table
9) Restart the database for normal access
10) Load the last good backup into a test database
11) Rename the test database customer table to custbackup.
12) Load the customer dump from above into the test database
13) Run a program on the test database to produce INSERT statements for
customer records that are in custbackup but not in the customer table.
14) Apply the above insert statements to the live database.
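
For step 3, I am thinking of something along these lines (an untested
sketch: it assumes the default 8K block size and that the customer file is
still a single segment, and it should only be run with the postmaster
stopped and after the copy in step 2):

/*
 * zeropage.c -- overwrite one page of a heap file with zeros.
 * Untested sketch for step 3 above.  Run only with the postmaster stopped
 * and only after backing up the file (step 2).  Assumes BLCKSZ = 8192 and
 * that the table is still a single file (no .1, .2 segment files).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    char zeros[BLCKSZ];
    long blkno;
    int  fd;

    if (argc != 3) {
        fprintf(stderr, "usage: zeropage <heapfile> <blockno>\n");
        return 1;
    }
    blkno = atol(argv[2]);
    memset(zeros, 0, sizeof(zeros));

    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* seek to the start of the damaged block and blast it with zeros;
       a later VACUUM should see an uninitialized page and fix the header */
    if (lseek(fd, (off_t) blkno * BLCKSZ, SEEK_SET) < 0 ||
        write(fd, zeros, BLCKSZ) != BLCKSZ) {
        perror("zeropage");
        return 1;
    }
    close(fd);
    return 0;
}

So for example './zeropage customer 71958', once for each bad block number
pganal reported.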




Re: Corrupt Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> I found a core file.  I am not all that familiar with gdb but the backtrace
> looks useless:
> #0  0x8064fb4 in ?? ()
> #1  0x809da10 in ?? ()

Looks like you are running a stripped executable :-(.  You might want to
consider recompiling with debug symbols so we can get more info if this
happens again.

> Yes I am running 7.0.2.  The 4 pages in question have 0's in the first 16
> bytes but other data after that.  I see some text that looks like real data.

Oh, that's interesting.  This isn't a previously known kind of failure.

These dropouts must have occurred since your last vacuum, since vacuum
would have thought that the pages are uninitialized and "fixed" them.
Perhaps they have the same cause as the problems in the other two pages.

I recall once having seen a similar kind of failure (aligned segments
of pages suddenly becoming zeroes) that turned out to be from a hardware
problem --- disk controller wasn't quite compatible with the
motherboard, or something like that, and would occasionally transfer
bad data to/from memory.  I'm not ready to blame the hardware yet, but
it's a possibility to keep in mind, particularly if you've changed
the hardware setup recently.

> Ok here is my plan:

Seems reasonable.  Good luck!

            regards, tom lane

Re: Corrupt Table - Getting Desperate

From
"Bryan White"
Date:
Ok I nulled out the bad pages.  A pg_dump still fails.  I just noticed there
are 21000 files in my database directory.  Most are of the form INDEXNAME.NUMBER
where INDEXNAME is the name of one of my indexes and NUMBER is a sequential
number.  There are 4 or 5 different indexes involved.  All of these files
are 0 bytes in size.  All dated in the last day or two.

When I did the pg_dump I got this in the log file:
000914.18:00:07.600 [10406] FATAL 1:  Memory exhausted in AllocSetAlloc()
Smart Shutdown request at Thu Sep 14 18:07:15 2000

The dump died after putting 100MB in the output file.

My guess is the internal structure of one of the tuples is corrupt.  I have
no idea what all the other files are for, or whether one problem is the
cause of the other.

At this moment I am doing a tar of the database directory before I screw
anything else up.

Please help me.




Re: Corrupt Table - Getting Desperate

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> Ok I nulled out the bad pages.  A pg_dump still fails.  I just noticed there
> are 21000 files in my database directory.  Most are of the form INDEXNAME.NUMBER
> where INDEXNAME is the name of one of my indexes and NUMBER is a sequential
> number.  There are 4 or 5 different indexes involved.  All of these files
> are 0 bytes in size.  All dated in the last day or two.

This suggests corrupted pointers inside the indexes.  I wouldn't worry
too much about it, you have bigger problems :-(.  The indexes are not
what's keeping you from dumping the database, anyway.

> When I did the pg_dump I got this in the log file:
> 000914.18:00:07.600 [10406] FATAL 1:  Memory exhausted in AllocSetAlloc()
> Smart Shutdown request at Thu Sep 14 18:07:15 2000

> The dump died after putting 100MB in the output file.

> My guess is the internal structure of one of the tuples is corrupt.

So it would seem.  Evidently there's at least one more corrupted page
besides the ones you were able to identify before.

What I did the last time I had to identify a corrupted tuple was to try
    SELECT tid,* FROM table LIMIT 1 OFFSET n
and experiment with different values of n to home in on the corrupted
tuple.  The last tuple you can print this way without a crash is the
one before the damaged tuple.  The TID of that tuple gives you the
block number it's in.
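
A cursor-based variant of the same idea can do the walking for you: stream
the table through a cursor in batches and note the last ctid that comes back
before the backend dies.  A rough, untested libpq sketch (the table name and
batch size are just placeholders, and connection info is taken from the
usual PG* environment variables):

/*
 * findbad.c -- rough, untested sketch of the cursor-based variant.
 * Streams "SELECT ctid,* FROM customer" through a cursor; when the backend
 * crashes on a damaged tuple, the last ctid printed marks the last readable
 * tuple, and its block number tells you (roughly) which page to look at.
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("");   /* PGHOST, PGDATABASE, etc. */
    PGresult *res;
    char      lastctid[64] = "(none)";

    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE c CURSOR FOR SELECT ctid,* FROM customer"));

    for (;;) {
        res = PQexec(conn, "FETCH 100 FROM c");
        if (res == NULL || PQresultStatus(res) != PGRES_TUPLES_OK ||
            PQstatus(conn) != CONNECTION_OK) {
            /* backend went away -- the damaged tuple is within the last
               batch; rerun with a smaller FETCH size to pin it down */
            printf("last good ctid: %s\n", lastctid);
            if (res)
                PQclear(res);
            break;
        }
        if (PQntuples(res) == 0) {      /* end of table, no crash */
            printf("reached end of table; last ctid: %s\n", lastctid);
            PQclear(res);
            break;
        }
        snprintf(lastctid, sizeof(lastctid), "%s",
                 PQgetvalue(res, PQntuples(res) - 1, 0));
        PQclear(res);
    }
    PQfinish(conn);
    return 0;
}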

            regards, tom lane

RE: Corrupt Table - Getting Desperate

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane
>
> "Bryan White" <bryan@arcamax.com> writes:
> > Ok I nulled out the bad pages.  A pg_dump still fails.  I just noticed there
> > are 21000 files in my database directory.  Most are of the form INDEXNAME.NUMBER
> > where INDEXNAME is the name of one of my indexes and NUMBER is a sequential
> > number.  There are 4 or 5 different indexes involved.  All of these files
> > are 0 bytes in size.  All dated in the last day or two.
>
> This suggests corrupted pointers inside the indexes.  I wouldn't worry
> too much about it, you have bigger problems :-(.  The indexes are not
> what's keeping you from dumping the database, anyway.
>
> > When I did the pg_dump I got this in the log file:
> > 000914.18:00:07.600 [10406] FATAL 1:  Memory exhausted in AllocSetAlloc()
> > Smart Shutdown request at Thu Sep 14 18:07:15 2000
>
> > The dump died after putting 100MB in the output file.
>
> > My guess is the internal structure of one of the tuples is corrupt.
>
> So it would seem.  Evidently there's at least one more corrupted page
> besides the ones you were able to identify before.
>
> What I did the last time I had to identify a corrupted tuple was to try
>     SELECT tid,* FROM table LIMIT 1 OFFSET n
                        ^^^^
          ctid instead of tid ?

Hiroshi Inoue

Re: Corrupt Table - Getting Desperate

From
Tom Lane
Date:
I said:
> What I did the last time I had to identify a corrupted tuple was to try
>     SELECT tid,* FROM table LIMIT 1 OFFSET n

Er, make that "ctid" ... sorry for the error ...

            regards, tom lane

Re: Corrupt Table

From
"Bryan White"
Date:
> > My guess is the internal structure of one of the tuples is corrupt.
>
> So it would seem.  Evidently there's at least one more corrupted page
> besides the ones you were able to identify before.

I punted last night and reloaded the customer table from a backup.  Turns
out I could recreate the most critical records from other sources.  Life
goes on.

I would still like to refine my pganal tool to look inside of tuples.  Where
should I look to find information about the internal structure?  Is it
parsable at some level on its own or do I have to consult the system tables
to determine structure?  I suspect this might get more complicated once TOAST
is available.

Bryan White


Re: Corrupt Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> I would still like to refine my pganal tool to look inside of tuples.  Where
> should I look to find information about the internal structure?  Is it
> parsable at some level on its own or do I have to consult the system tables
> to determine structure?  I suspect this might get more complicated once TOAST
> is available.

The tuple layout is basically

    header
    bitmap of which fields are null
    values for non-null fields

The header is type HeapTupleHeaderData defined in
src/include/access/htup.h.  The bitmap is omitted if the header's
infomask shows the tuple contains no nulls; otherwise its length
in bits is the same as the t_natts field of the header.

The data values are uninterpretable without looking up the set of
column datatypes for the table...
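
In code, finding where the data starts looks roughly like this (an untested
sketch meant to be compiled against the 7.0 source tree, so the real struct
definition from access/htup.h is used rather than guessing at offsets):

/*
 * Sketch of a tuple-header walk for pganal (untested).  Returns the offset,
 * from the start of the tuple, at which the non-null data values begin, or
 * -1 if the header itself looks corrupt.  "tuplen" is the lp_len from the
 * line pointer.
 */
#include <stddef.h>
#include "postgres.h"
#include "access/htup.h"

static int
tuple_data_start(HeapTupleHeader th, int tuplen)
{
    int hoff = th->t_hoff;      /* header + null bitmap, MAXALIGNed */

    if (th->t_natts <= 0 || th->t_natts > 1600)  /* 1600 ~ MaxHeapAttributeNumber */
        return -1;              /* attribute count is nonsense */

    if (th->t_infomask & HEAP_HASNULL) {
        /* null bitmap present: one bit per attribute, rounded up to bytes */
        int bitmaplen = (th->t_natts + 7) / 8;

        if (hoff < (int) offsetof(HeapTupleHeaderData, t_bits) + bitmaplen)
            return -1;          /* t_hoff leaves no room for the bitmap */
    }
    if (hoff > tuplen)
        return -1;              /* header claims to be longer than the tuple */

    return hoff;                /* non-null values start here */
}

Past that point you need the table's pg_attribute entries (attlen, attalign,
atttypid and so on) to step through the individual values, which is where
the system-table lookup comes in.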

            regards, tom lane