Thread: Recovering database from crashed HD (bad sectors)

Recovering database from crashed HD (bad sectors)

From

Amitabh Kant

Date:

18 July 2015, 09:59:47

A development box hard disk failed which was running a PG instance with multiple databases on it. I got the data recovered with some bad sector errors. Ran another instance of PG (same version), and was to able to take dump (using pg_dump) of all but one database. For one database I am getting the following error:

pg_dump -Fc alpha_45 > alpha_45.dump

pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL: could not open file "base/525035/11678": No such file or directory

These are the only two files in the directory similar to the one above:

/usr/local/pgsql/data/base/525035/11678_fsm

/usr/local/pgsql/data/base/525035/11678_vm

Is there any hope of recovering this DB, or should I start looking into restoring from old backups? Data loss is not a concern, I just would like to know if I should even try working on it. I looked into this page http://www.postgresql.org/docs/9.1/static/runtime-config-developer.html, and tried using "zero_damaged_pages = on" in postgresql.conf, but it was of no help.

OS: FreeBSD 9.3

PG Version: 9.1

With regards

Amitabh

Re: Recovering database from crashed HD (bad sectors)

From

Michael Paquier

Date:

18 July 2015, 13:28:59

On Sat, Jul 18, 2015 at 6:59 PM, Amitabh Kant <amitabhkant@gmail.com> wrote:
> A development box hard disk failed which was running a PG instance with
> multiple databases on it.  I got the data recovered with some bad sector
> errors. Ran another instance of PG (same version), and was to able to take
> dump (using pg_dump) of all but one database. For one database I am getting
> the following error:
>
> pg_dump -Fc alpha_45 > alpha_45.dump
> pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL:
> could not open file "base/525035/11678": No such file or directory
>
> These are the only two files in the directory similar to the one above:
> /usr/local/pgsql/data/base/525035/11678_fsm
> /usr/local/pgsql/data/base/525035/11678_vm
>
> Is there any hope of recovering this DB, or should I start looking into
> restoring from old backups? Data loss is not a concern, I just would like to
> know if I should even try working on it. I looked into this page
> http://www.postgresql.org/docs/9.1/static/runtime-config-developer.html, and
> tried using "zero_damaged_pages = on" in postgresql.conf, but it was of no
> help.

Visibly your data is already lost in alpha_45... Hence if you can
recover this database from an older backup you had better do it. Now
perhaps other folks here have other recommendations and strategies
though :)
--
Michael

Re: Recovering database from crashed HD (bad sectors)

From

Alvaro Herrera

Date:

18 July 2015, 13:36:04

Michael Paquier wrote:
> On Sat, Jul 18, 2015 at 6:59 PM, Amitabh Kant <amitabhkant@gmail.com> wrote:
> > A development box hard disk failed which was running a PG instance with
> > multiple databases on it.  I got the data recovered with some bad sector
> > errors. Ran another instance of PG (same version), and was to able to take
> > dump (using pg_dump) of all but one database. For one database I am getting
> > the following error:
> >
> > pg_dump -Fc alpha_45 > alpha_45.dump
> > pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL:
> > could not open file "base/525035/11678": No such file or directory
> >
> > These are the only two files in the directory similar to the one above:
> > /usr/local/pgsql/data/base/525035/11678_fsm
> > /usr/local/pgsql/data/base/525035/11678_vm
> >
> > Is there any hope of recovering this DB, or should I start looking into
> > restoring from old backups? Data loss is not a concern, I just would like to
> > know if I should even try working on it. I looked into this page
> > http://www.postgresql.org/docs/9.1/static/runtime-config-developer.html, and
> > tried using "zero_damaged_pages = on" in postgresql.conf, but it was of no
> > help.
>
> Visibly your data is already lost in alpha_45... Hence if you can
> recover this database from an older backup you had better do it. Now
> perhaps other folks here have other recommendations and strategies
> though :)

If the data since the last backup is very valuable, there are certainly
other avenues to explore, but it's an ad-hoc job that takes time and
someone knowledgeable enough directly in charge of the job.  It's not
something I would attempt over a mailing list, for sure.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Recovering database from crashed HD (bad sectors)

From

Tom Lane

Date:

18 July 2015, 14:12:29

Amitabh Kant <amitabhkant@gmail.com> writes:
> A development box hard disk failed which was running a PG instance with
> multiple databases on it.  I got the data recovered with some bad sector
> errors. Ran another instance of PG (same version), and was to able to take
> dump (using pg_dump) of all but one database. For one database I am getting
> the following error:

> pg_dump -Fc alpha_45 > alpha_45.dump
> pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL:
>  could not open file "base/525035/11678": No such file or directory

> These are the only two files in the directory similar to the one above:
> /usr/local/pgsql/data/base/525035/11678_fsm
> /usr/local/pgsql/data/base/525035/11678_vm

> Is there any hope of recovering this DB, or should I start looking into
> restoring from old backups?

Well, as others have said, if the data is worth money to you then it
would be worthwhile hiring a pro.  If you just want to experiment,
the first thing to do is to find out which system catalog or index
that is, which you should be able to do by connecting to any of the
non-broken databases in the same installation and issuing
select * from pg_class where pg_relation_filenode(oid) = 11678;

If you're lucky and it's only an index, you could try connecting to the
busted database with PGOPTIONS=-P (ignore_system_indexes) and reindexing
the broken index.  If it's a catalog, whether recovery is possible would
depend on which one.

In any case, it would be prudent to make a tar-style copy of the whole
$PGDATA tree (with the postmaster shut down) before experimenting,
so that you can undo any catastrophic mistakes.

            regards, tom lane

Re: Recovering database from crashed HD (bad sectors)

From

Amitabh Kant

Date:

18 July 2015, 17:13:51

I am just experimenting with the crashed DB. Before I tried anything else, I did make a complete copy of the recovered directory tree.

As for running the sql command as suggested by Tom, here is the result:

template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;

----------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+------------+------------+-------------+----------------+----------------+--------------+------------+------------

pg_class | 11 | 83 | 0 | 10 | 0 | 0 | 0 | 8 | 281 | 0 | 0 | t | f | p | r | 26 | 0 | t | f | f | f | f | 662 | {=r/pgsql} |

(1 row)

From what I can understand, this is not a index as reltype is non-zero . Is there anything else that I can try?

With regards

Amitabh

On Sat, Jul 18, 2015 at 7:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amitabh Kant <amitabhkant@gmail.com> writes:
> A development box hard disk failed which was running a PG instance with
> multiple databases on it. I got the data recovered with some bad sector
> errors. Ran another instance of PG (same version), and was to able to take
> dump (using pg_dump) of all but one database. For one database I am getting
> the following error:

> pg_dump -Fc alpha_45 > alpha_45.dump
> pg_dump: [archiver (db)] connection to database "alpha_45" failed: FATAL:
> could not open file "base/525035/11678": No such file or directory

> These are the only two files in the directory similar to the one above:
> /usr/local/pgsql/data/base/525035/11678_fsm
> /usr/local/pgsql/data/base/525035/11678_vm

> Is there any hope of recovering this DB, or should I start looking into
> restoring from old backups?

Well, as others have said, if the data is worth money to you then it
would be worthwhile hiring a pro. If you just want to experiment,
the first thing to do is to find out which system catalog or index
that is, which you should be able to do by connecting to any of the
non-broken databases in the same installation and issuing
select * from pg_class where pg_relation_filenode(oid) = 11678;

If you're lucky and it's only an index, you could try connecting to the
busted database with PGOPTIONS=-P (ignore_system_indexes) and reindexing
the broken index. If it's a catalog, whether recovery is possible would
depend on which one.

In any case, it would be prudent to make a tar-style copy of the whole
$PGDATA tree (with the postmaster shut down) before experimenting,
so that you can undo any catastrophic mistakes.

regards, tom lane

Re: Recovering database from crashed HD (bad sectors)

From

Tom Lane

Date:

18 July 2015, 17:30:49

Amitabh Kant <amitabhkant@gmail.com> writes:
> As for running the sql command as suggested by Tom, here is the result:
> template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;

>  pg_class |           11 |      83 |         0 |       10 |     0 |
>   0 |             0 |        8 |       281 |             0 |             0
> | t           | f           | p              | r       |       26 |
> 0 | t          | f          | f           | f              | f
> |          662 | {=r/pgsql} |

That's about the worst possible answer :-(.  Without pg_class, you have
little hope of telling which is which among the other files; and there
would be no real commonality with the contents of pg_class from other
databases in the installation, so no way to jury-rig something.  Moreover,
because pg_class is consulted *very* early in backend startup, it seems
entirely likely that the failure you're seeing is only the tip of the
iceberg; there very possibly are other files that are also missing or
badly damaged.

It's possible that a professional data recovery team could extract
something from the wreckage, but it would take a lot of time and money.
Personally I'd say it's time to go to your backups.

            regards, tom lane

Re: Recovering database from crashed HD (bad sectors)

From

Alvaro Herrera

Date:

18 July 2015, 22:40:25

Tom Lane wrote:
> Amitabh Kant <amitabhkant@gmail.com> writes:
> > As for running the sql command as suggested by Tom, here is the result:
> > template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;
>
> >  pg_class |           11 |      83 |         0 |       10 |     0 |
> >   0 |             0 |        8 |       281 |             0 |             0
> > | t           | f           | p              | r       |       26 |
> > 0 | t          | f          | f           | f              | f
> > |          662 | {=r/pgsql} |
>
> That's about the worst possible answer :-(.  Without pg_class, you have
> little hope of telling which is which among the other files; and there
> would be no real commonality with the contents of pg_class from other
> databases in the installation, so no way to jury-rig something.  Moreover,
> because pg_class is consulted *very* early in backend startup, it seems
> entirely likely that the failure you're seeing is only the tip of the
> iceberg; there very possibly are other files that are also missing or
> badly damaged.

I would look for the file in the partition's lost+found directory; with
luck, the file is there.  pg_filedump can identify the number of
attributes each tuple has in each file, if there are many such files;
you need to look for a file whose tuples have 26 attributes (relnatts in
the above row).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Recovering database from crashed HD (bad sectors)

From

Amitabh Kant

Date:

19 July 2015, 03:38:45

On Sun, Jul 19, 2015 at 4:10 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Tom Lane wrote:
> Amitabh Kant <amitabhkant@gmail.com> writes:
> > As for running the sql command as suggested by Tom, here is the result:
> > template1=# select * from pg_class where pg_relation_filenode(oid) = 11678;
>
> > pg_class | 11 | 83 | 0 | 10 | 0 |
> > 0 | 0 | 8 | 281 | 0 | 0
> > | t | f | p | r | 26 |
> > 0 | t | f | f | f | f
> > | 662 | {=r/pgsql} |
>
> That's about the worst possible answer :-(. Without pg_class, you have
> little hope of telling which is which among the other files; and there
> would be no real commonality with the contents of pg_class from other
> databases in the installation, so no way to jury-rig something. Moreover,
> because pg_class is consulted *very* early in backend startup, it seems
> entirely likely that the failure you're seeing is only the tip of the
> iceberg; there very possibly are other files that are also missing or
> badly damaged.

I would look for the file in the partition's lost+found directory; with
luck, the file is there. pg_filedump can identify the number of
attributes each tuple has in each file, if there are many such files;
you need to look for a file whose tuples have 26 attributes (relnatts in
the above row).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

There is no lost+found directory in the recovered files. Since the DB was not critical, and can easily be replaced, we are not inclined towards sending it for professional recovery. As Tom suggested, this would just be the tip iceberg, as even during recovery, there were lots of read errors from the disk. I will recreate the DB from the backups.

Thanks for the help and insights though.

With regards

Amitabh