Thread: db recovery after hd crash (could not open relation 1663/16385/16400: No such file or directory)

Hello all,

Background:
DB resides on a mirrored array, HD1 goes bad and before it can be
replaced, HD2 starts throwing hardware errors. We were able to get a
DD from HD2 and have restored the box quite successfully with the
exception of the pg db. There WAS corruption to the filesystem.  Since
the last backup, a significant number of records have been added to
the db. If there is a way to recover some data, it would likely be
easier to reconcile and fix manually than reenter. Box is FreeBSD 8,
postgresql version 8.1.

Progress so far:
On new hardware, installed pg8.1 and restored data directory and
attempted to start pg. After fixing some config differences, it
complained of a missing pg_clog/xxx file. After searching the archives
for hints, added a file of zero's and pg starts successfully. The db
in question exists with what appears to be intact structure. When
attempting to dump the db ran into the error: "could not open relation
1663/16385/16400: No such file or directory" and run into similar
errors (with different relation numbers) trying to select * from
certain tables.

I'm stuck here. Assuming the damage to filesystem, what is the best a
way forward? Does the relation error indicate damage to the data? or
to an index??? Is there a process I can do to try and salvage partial
data?

Thanks for any help/tips/suggestions.

Roland Wells

Roland Wells <roland.wells@gmail.com> writes:
> On new hardware, installed pg8.1 and restored data directory and
> attempted to start pg. After fixing some config differences, it
> complained of a missing pg_clog/xxx file. After searching the archives
> for hints, added a file of zero's and pg starts successfully. The db
> in question exists with what appears to be intact structure. When
> attempting to dump the db ran into the error: "could not open relation
> 1663/16385/16400: No such file or directory" and run into similar
> errors (with different relation numbers) trying to select * from
> certain tables.

> I'm stuck here. Assuming the damage to filesystem, what is the best a
> way forward? Does the relation error indicate damage to the data? or
> to an index??? Is there a process I can do to try and salvage partial
> data?

You can look for the third number in pg_class.relfilenode to see exactly
which table or index it's complaining about.  If it's happening in
pg_dump then it's probably a table not an index, unfortunately.
Not sure there's much you can do except go back to your last backup
--- PG can't do anything to recover data that the filesystem has lost
track of, which is what this is sounding like.

But having said that, did you try fsck'ing the DD image to see if fsck
could recover any files?

            regards, tom lane

On Fri, Feb 26, 2010 at 7:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Roland Wells <roland.wells@gmail.com> writes:
>> On new hardware, installed pg8.1 and restored data directory and
>> attempted to start pg. After fixing some config differences, it
>> complained of a missing pg_clog/xxx file. After searching the archives
>> for hints, added a file of zero's and pg starts successfully. The db
>> in question exists with what appears to be intact structure. When
>> attempting to dump the db ran into the error: "could not open relation
>> 1663/16385/16400: No such file or directory" and run into similar
>> errors (with different relation numbers) trying to select * from
>> certain tables.
>
>> I'm stuck here. Assuming the damage to filesystem, what is the best a
>> way forward? Does the relation error indicate damage to the data? or
>> to an index??? Is there a process I can do to try and salvage partial
>> data?
>
> You can look for the third number in pg_class.relfilenode to see exactly
> which table or index it's complaining about.  If it's happening in
> pg_dump then it's probably a table not an index, unfortunately.
> Not sure there's much you can do except go back to your last backup
> --- PG can't do anything to recover data that the filesystem has lost
> track of, which is what this is sounding like.
>
> But having said that, did you try fsck'ing the DD image to see if fsck
> could recover any files?
>
Thanks for the replies:

A look at pg_class.relfilenode seems to confirm that tables are
missing (7 out of 52 tables in the db are missing). I did do an fsck
on the restored disk image and there is quite a few files in the
lost+found. I am still wading through them at the moment, but not
hopeful of recovery this way as many I've looked at so far appear to
only be partial files. Additionally, several of the tables in the db
show far less rows than should exist, so there may be damage to those
files as well...

Probably will head to the backup. Thanks again. Roland