Thread: Recovering database from crashed HD (bad sectors)
On Sat, Jul 18, 2015 at 6:59 PM, Amitabh Kant <amitabhkant@gmail.com> wrote: > A development box hard disk failed which was running a PG instance with > multiple databases on it. I got the data recovered with some bad sector > errors. Ran another instance of PG (same version), and was to able to take > dump (using pg_dump) of all but one database. For one database I am getting > the following error: > > pg_dump -Fc alpha_45 > alpha_45.dump > pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL: > could not open file "base/525035/11678": No such file or directory > > These are the only two files in the directory similar to the one above: > /usr/local/pgsql/data/base/525035/11678_fsm > /usr/local/pgsql/data/base/525035/11678_vm > > Is there any hope of recovering this DB, or should I start looking into > restoring from old backups? Data loss is not a concern, I just would like to > know if I should even try working on it. I looked into this page > http://www.postgresql.org/docs/9.1/static/runtime-config-developer.html, and > tried using "zero_damaged_pages = on" in postgresql.conf, but it was of no > help. Visibly your data is already lost in alpha_45... Hence if you can recover this database from an older backup you had better do it. Now perhaps other folks here have other recommendations and strategies though :) -- Michael
Michael Paquier wrote: > On Sat, Jul 18, 2015 at 6:59 PM, Amitabh Kant <amitabhkant@gmail.com> wrote: > > A development box hard disk failed which was running a PG instance with > > multiple databases on it. I got the data recovered with some bad sector > > errors. Ran another instance of PG (same version), and was to able to take > > dump (using pg_dump) of all but one database. For one database I am getting > > the following error: > > > > pg_dump -Fc alpha_45 > alpha_45.dump > > pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL: > > could not open file "base/525035/11678": No such file or directory > > > > These are the only two files in the directory similar to the one above: > > /usr/local/pgsql/data/base/525035/11678_fsm > > /usr/local/pgsql/data/base/525035/11678_vm > > > > Is there any hope of recovering this DB, or should I start looking into > > restoring from old backups? Data loss is not a concern, I just would like to > > know if I should even try working on it. I looked into this page > > http://www.postgresql.org/docs/9.1/static/runtime-config-developer.html, and > > tried using "zero_damaged_pages = on" in postgresql.conf, but it was of no > > help. > > Visibly your data is already lost in alpha_45... Hence if you can > recover this database from an older backup you had better do it. Now > perhaps other folks here have other recommendations and strategies > though :) If the data since the last backup is very valuable, there are certainly other avenues to explore, but it's an ad-hoc job that takes time and someone knowledgeable enough directly in charge of the job. It's not something I would attempt over a mailing list, for sure. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amitabh Kant <amitabhkant@gmail.com> writes: > A development box hard disk failed which was running a PG instance with > multiple databases on it. I got the data recovered with some bad sector > errors. Ran another instance of PG (same version), and was to able to take > dump (using pg_dump) of all but one database. For one database I am getting > the following error: > pg_dump -Fc alpha_45 > alpha_45.dump > pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL: > could not open file "base/525035/11678": No such file or directory > These are the only two files in the directory similar to the one above: > /usr/local/pgsql/data/base/525035/11678_fsm > /usr/local/pgsql/data/base/525035/11678_vm > Is there any hope of recovering this DB, or should I start looking into > restoring from old backups? Well, as others have said, if the data is worth money to you then it would be worthwhile hiring a pro. If you just want to experiment, the first thing to do is to find out which system catalog or index that is, which you should be able to do by connecting to any of the non-broken databases in the same installation and issuing select * from pg_class where pg_relation_filenode(oid) = 11678; If you're lucky and it's only an index, you could try connecting to the busted database with PGOPTIONS=-P (ignore_system_indexes) and reindexing the broken index. If it's a catalog, whether recovery is possible would depend on which one. In any case, it would be prudent to make a tar-style copy of the whole $PGDATA tree (with the postmaster shut down) before experimenting, so that you can undo any catastrophic mistakes. regards, tom lane
template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasoids | relhaspkey | relhasrules | relhastriggers | relhassubclass | relfrozenxid | relacl | reloptions
----------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+------------+------------+-------------+----------------+----------------+--------------+------------+------------
pg_class | 11 | 83 | 0 | 10 | 0 | 0 | 0 | 8 | 281 | 0 | 0 | t | f | p | r | 26 | 0 | t | f | f | f | f | 662 | {=r/pgsql} |
(1 row)
From what I can understand, this is not a index as reltype is non-zero . Is there anything else that I can try?
With regards
Amitabh
Amitabh Kant <amitabhkant@gmail.com> writes:
> A development box hard disk failed which was running a PG instance with
> multiple databases on it. I got the data recovered with some bad sector
> errors. Ran another instance of PG (same version), and was to able to take
> dump (using pg_dump) of all but one database. For one database I am getting
> the following error:
> pg_dump -Fc alpha_45 > alpha_45.dump
> pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL:
> could not open file "base/525035/11678": No such file or directory
> These are the only two files in the directory similar to the one above:
> /usr/local/pgsql/data/base/525035/11678_fsm
> /usr/local/pgsql/data/base/525035/11678_vm
> Is there any hope of recovering this DB, or should I start looking into
> restoring from old backups?
Well, as others have said, if the data is worth money to you then it
would be worthwhile hiring a pro. If you just want to experiment,
the first thing to do is to find out which system catalog or index
that is, which you should be able to do by connecting to any of the
non-broken databases in the same installation and issuing
select * from pg_class where pg_relation_filenode(oid) = 11678;
If you're lucky and it's only an index, you could try connecting to the
busted database with PGOPTIONS=-P (ignore_system_indexes) and reindexing
the broken index. If it's a catalog, whether recovery is possible would
depend on which one.
In any case, it would be prudent to make a tar-style copy of the whole
$PGDATA tree (with the postmaster shut down) before experimenting,
so that you can undo any catastrophic mistakes.
regards, tom lane
Amitabh Kant <amitabhkant@gmail.com> writes: > As for running the sql command as suggested by Tom, here is the result: > template1=# select * from pg_class where pg_relation_filenode(oid) = 11678; > pg_class | 11 | 83 | 0 | 10 | 0 | > 0 | 0 | 8 | 281 | 0 | 0 > | t | f | p | r | 26 | > 0 | t | f | f | f | f > | 662 | {=r/pgsql} | That's about the worst possible answer :-(. Without pg_class, you have little hope of telling which is which among the other files; and there would be no real commonality with the contents of pg_class from other databases in the installation, so no way to jury-rig something. Moreover, because pg_class is consulted *very* early in backend startup, it seems entirely likely that the failure you're seeing is only the tip of the iceberg; there very possibly are other files that are also missing or badly damaged. It's possible that a professional data recovery team could extract something from the wreckage, but it would take a lot of time and money. Personally I'd say it's time to go to your backups. regards, tom lane
Tom Lane wrote: > Amitabh Kant <amitabhkant@gmail.com> writes: > > As for running the sql command as suggested by Tom, here is the result: > > template1=# select * from pg_class where pg_relation_filenode(oid) = 11678; > > > pg_class | 11 | 83 | 0 | 10 | 0 | > > 0 | 0 | 8 | 281 | 0 | 0 > > | t | f | p | r | 26 | > > 0 | t | f | f | f | f > > | 662 | {=r/pgsql} | > > That's about the worst possible answer :-(. Without pg_class, you have > little hope of telling which is which among the other files; and there > would be no real commonality with the contents of pg_class from other > databases in the installation, so no way to jury-rig something. Moreover, > because pg_class is consulted *very* early in backend startup, it seems > entirely likely that the failure you're seeing is only the tip of the > iceberg; there very possibly are other files that are also missing or > badly damaged. I would look for the file in the partition's lost+found directory; with luck, the file is there. pg_filedump can identify the number of attributes each tuple has in each file, if there are many such files; you need to look for a file whose tuples have 26 attributes (relnatts in the above row). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tom Lane wrote:
> Amitabh Kant <amitabhkant@gmail.com> writes:
> > As for running the sql command as suggested by Tom, here is the result:
> > template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;
>
> > pg_class | 11 | 83 | 0 | 10 | 0 |
> > 0 | 0 | 8 | 281 | 0 | 0
> > | t | f | p | r | 26 |
> > 0 | t | f | f | f | f
> > | 662 | {=r/pgsql} |
>
> That's about the worst possible answer :-(. Without pg_class, you have
> little hope of telling which is which among the other files; and there
> would be no real commonality with the contents of pg_class from other
> databases in the installation, so no way to jury-rig something. Moreover,
> because pg_class is consulted *very* early in backend startup, it seems
> entirely likely that the failure you're seeing is only the tip of the
> iceberg; there very possibly are other files that are also missing or
> badly damaged.
I would look for the file in the partition's lost+found directory; with
luck, the file is there. pg_filedump can identify the number of
attributes each tuple has in each file, if there are many such files;
you need to look for a file whose tuples have 26 attributes (relnatts in
the above row).
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services