Thread: Recovering database after disk crash

Recovering database after disk crash

From
"Vic Simkus"
Date:
Hello

A disk hosting an instance of 8.2 crashed on me (hardware failure).  I
was able to pull most of the data off the drive, but the one database
that I need the most is corrupt.  I'm not really sure where to
start... so here are some error messages:

postgres@vsimkus-laptop:/var/log/postgresql$ pg_dump  EPC
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  index
"pg_opclass_am_name_nsp_index" contains unexpected zero page at block
0
HINT:  Please REINDEX it.
pg_dump: The command was: SELECT tableoid, oid, *, (SELECT rolname
FROM pg_catalog.pg_roles WHERE oid = '10') as lanowner FROM
pg_language WHERE lanispl ORDER BY oid

postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
NOTICE:  table "pg_class" was reindexed
reindexdb: reindexing of database "EPC" failed: ERROR:  catalog is
missing 4 attribute(s) for relid 10762

Almost every operation I attempt ends in a similar error message:

EPC=# \dt
ERROR:  index "pg_opclass_am_name_nsp_index" contains unexpected zero
page at block 0
HINT:  Please REINDEX it.

EPC=# select * from siemr_sys.zip_codes;
ERROR:  catalog is missing 10 attribute(s) for relid 1337158

EPC=# select * from siemr_sys.sex_list;
ERROR:  catalog is missing 2 attribute(s) for relid 1337150

EPC=# select * from siemr_data.visits;
ERROR:  catalog is missing 7 attribute(s) for relid 1337095

Is there some way I can extract the data from the tables even if the
metadata is corrupt?

Thanks
Vic

--
Men never do evil so completely and cheerfully as when they do it from
religious conviction.

-Blaise Pascal

Re: Recovering database after disk crash

From
"Vic Simkus"
Date:
After doing some more reading I've come to the conclusion that I'm in
completely over my head.  I got a fresh copy of the corrupt data and
am starting from the very beginning.  Here's the error I get on
startup:

postgres@vsimkus-laptop:/usr/lib/postgresql/8.2/bin$ ./postgres  -D
/var/lib/postgresql/8.2/main
2008-05-12 16:16:02 CDT LOG:  could not load root certificate file
"root.crt": no SSL error reported
2008-05-12 16:16:02 CDT DETAIL:  Will not verify client certificates.
2008-05-12 16:16:02 CDT LOG:  database system was interrupted at
2008-04-16 09:28:36 CDT
2008-05-12 16:16:02 CDT LOG:  unexpected pageaddr 0/1114A000 in log
file 0, segment 17, offset 14295040
2008-05-12 16:16:02 CDT LOG:  invalid primary checkpoint record
2008-05-12 16:16:02 CDT LOG:  unexpected pageaddr 0/1114A000 in log
file 0, segment 17, offset 14295040
2008-05-12 16:16:02 CDT LOG:  invalid secondary checkpoint record
2008-05-12 16:16:02 CDT PANIC:  could not locate a valid checkpoint record
2008-05-12 16:16:02 CDT LOG:  startup process (PID 8359) was
terminated by signal 6
2008-05-12 16:16:02 CDT LOG:  aborting startup due to startup process failure

pg_control output:

 ./pg_controldata /var/lib/postgresql/8.2/main
pg_control version number:            822
Catalog version number:               200611241
Database system identifier:           5126483664462806975
Database cluster state:               in production
pg_control last modified:             Wed 16 Apr 2008 09:28:36 AM CDT
Current log file ID:                  0
Next log file segment:                18
Latest checkpoint location:           0/11DA3C38
Prior checkpoint location:            0/11DA3BF0
Latest checkpoint's REDO location:    0/11DA3C38
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/908105
Latest checkpoint's NextOID:          2001259
Latest checkpoint's NextMultiXactId:  7
Latest checkpoint's NextMultiOffset:  13
Time of latest checkpoint:            Wed 16 Apr 2008 09:13:56 AM CDT
Minimum recovery ending location:     0/0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8

I ran pg_resetxlog in my previous mucking around, but that's when all
hell broke loose, so I'm holding off until someone more qualified
advises me to do so.

TIA

Vic

Re: Recovering database after disk crash

From
Tom Lane
Date:
"Vic Simkus" <vic.simkus@gmail.com> writes:
> postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
> NOTICE:  table "pg_class" was reindexed
> reindexdb: reindexing of database "EPC" failed: ERROR:  catalog is
> missing 4 attribute(s) for relid 10762

If you're really lucky, doing the reindex with ignore_system_indexes
enabled will go through.  Otherwise, it's time to go back to your
last backups :-(

            regards, tom lane

Re: Recovering database after disk crash

From
"Vic Simkus"
Date:
If I'm understanding the errors correctly it seems that the corruption
is in the system catalogs (metadata). The database does not use any
fancy datatypes. Is there any way for me to rebuild the metadata
manually? If I can see the leftover metadata and the data minus the
missing metadata I can probably piece everything together enough to
get the data out.

 Does that make sense or am I off my rocker?



On 5/12/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vic Simkus" <vic.simkus@gmail.com> writes:
> > postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
> > NOTICE:  table "pg_class" was reindexed
> > reindexdb: reindexing of database "EPC" failed: ERROR:  catalog is
> > missing 4 attribute(s) for relid 10762
>
> If you're really lucky, doing the reindex with ignore_system_indexes
> enabled will go through.  Otherwise, it's time to go back to your
> last backups :-(
>
>             regards, tom lane
>


--
Men never do evil so completely and cheerfully as when they do it from
religious conviction.

-Blaise Pascal

Re: Recovering database after disk crash

From
Tom Lane
Date:
"Vic Simkus" <vic.simkus@gmail.com> writes:
> If I'm understanding the errors correctly it seems that the corruption
> is in the system catalogs (metadata). The database does not use any
> fancy datatypes. Is there any way for me to rebuild the metadata
> manually? If I can see the leftover metadata and the data minus the
> missing metadata I can probably piece everything together enough to
> get the data out.

Based on the evidence so far, the disk failure has zeroed out multiple,
randomly-chosen pages of your system catalogs.  I'd think it very likely
indeed that random pages of your table files got the same favor.
You won't have any way to know what is missing ...

            regards, tom lane

Re: Recovering database after disk crash

From
"Vic Simkus"
Date:
During the initial ill-educated messing around I had set the
zero_damaged_pages to yes, but I'm guessing that the end result is the
same... Ill try it with the fresh copy of [the corrupt] data

What kind of a database can't deal with a bit of random values
injected into its sytem files anyways? :)



On 5/12/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vic Simkus" <vic.simkus@gmail.com> writes:
> > If I'm understanding the errors correctly it seems that the corruption
> > is in the system catalogs (metadata). The database does not use any
> > fancy datatypes. Is there any way for me to rebuild the metadata
> > manually? If I can see the leftover metadata and the data minus the
> > missing metadata I can probably piece everything together enough to
> > get the data out.
>
> Based on the evidence so far, the disk failure has zeroed out multiple,
> randomly-chosen pages of your system catalogs.  I'd think it very likely
> indeed that random pages of your table files got the same favor.
> You won't have any way to know what is missing ...
>
>             regards, tom lane
>


--
Men never do evil so completely and cheerfully as when they do it from
religious conviction.

-Blaise Pascal