Thread: how to recover after harddisk error

how to recover after harddisk error

From

"Peter Alberer"

Date:

26 February 2003, 04:15:15

Hi,

Yesterday at about 8pm the hard disk subsystem of our web application
crashed because of a SCSI error. We were able to restart the system this
morning, but the database would not come up again. The log file showed
the following:

2003-02-26 09:03:06 [1291]   DEBUG:  database system was interrupted at 2003-02-25 20:19:22 CET
2003-02-26 09:03:06 [1291]   DEBUG:  open of /usr/local/pgsql/data/pg_xlog/0000001A000000C9 (log file 26, segment 201) failed: No such file or directory
2003-02-26 09:03:06 [1291]   DEBUG:  invalid primary checkpoint record
2003-02-26 09:03:06 [1291]   DEBUG:  open of /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment 200) failed: No such file or directory
2003-02-26 09:03:06 [1291]   DEBUG:  invalid secondary checkpoint record
2003-02-26 09:03:06 [1291]   FATAL 2:  unable to locate a valid checkpoint record
2003-02-26 09:03:06 [1277]   DEBUG:  startup process (pid 1291) exited with exit code 2
2003-02-26 09:03:06 [1277]   DEBUG:  aborting startup due to startup process failure
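
The file names in the log above appear to encode the same two numbers that are printed in parentheses: 0000001A is log file 26 and 000000C9 is segment 201, each written as eight hexadecimal digits. A minimal Python sketch of that mapping follows. The layout is inferred only from the log lines in this message, and the function names xlog_file_name and parse_xlog_file_name are made up for illustration.

```python
def xlog_file_name(log_id: int, segment: int) -> str:
    """Build a pg_xlog file name as two 8-digit uppercase hex fields.

    Assumption: this layout is inferred from the log messages in the
    thread, not taken from the PostgreSQL source.
    """
    return f"{log_id:08X}{segment:08X}"


def parse_xlog_file_name(name: str) -> tuple[int, int]:
    """Split a 16-character pg_xlog file name into (log_id, segment)."""
    if len(name) != 16:
        raise ValueError(f"expected 16 hex digits, got {name!r}")
    return int(name[:8], 16), int(name[8:], 16)


print(xlog_file_name(26, 201))                   # 0000001A000000C9
print(parse_xlog_file_name("0000001A000000C8"))  # (26, 200)
```

Both results match the pairs shown in the two "open of ... failed" messages.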

I did the following steps to get the system running again:

- a new initdb in another data-directory
- create the database again
- restore the data from the last available nightly dump
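
The three steps above can be sketched as a short Python script that builds the commands and, by default, only prints them. This is a sketch under stated assumptions, not the exact procedure used here: the data directory, database name and dump path are hypothetical, the dump is assumed to be a plain-text pg_dump file (so it is fed to psql), and the pg_ctl start step is implied rather than stated in the message.

```python
import subprocess


def rebuild_commands(new_data_dir, db_name, dump_file):
    """Return the rebuild steps as argument lists, in the order they run."""
    return [
        ["initdb", "-D", new_data_dir],                 # new cluster elsewhere
        ["pg_ctl", "-D", new_data_dir, "-w", "start"],  # assumed: start server
        ["createdb", db_name],                          # create the database
        ["psql", "-f", dump_file, db_name],             # load plain-text dump
    ]


def run_steps(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


# Hypothetical paths and names, for illustration only.
run_steps(rebuild_commands("/usr/local/pgsql/data.new", "webapp",
                           "/backup/nightly.sql"))
```

Keeping the old data directory untouched and initializing the new cluster somewhere else leaves the damaged files available for later inspection.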

Is there a better way to get the system running again? Would there have
been any way to access the old system again? The steps I took needed about
45 minutes, which is quite long (because the db dump is rather large), and
if there had been any important data, it would have been lost...

TIA, peter

Re: how to recover after harddisk error

From

Björn Metzdorf

Date:

26 February 2003, 04:27:58

> 2003-02-26 09:03:06 [1291]   DEBUG:  invalid primary checkpoint record
> 2003-02-26 09:03:06 [1291]   DEBUG:  open of
> /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment
> 200) failed
>
> I did the following steps to get the system running again:
>
> - a new initdb in another data-directory
> - create the database again
> - restore the data from the last available nightly dump
>
> Is there a better way to get the system running again? Had there been
> any way to access the old system again? The steps I did took about 45
> min which is quite long (cause the db-dump is rather large) and if there
> had been some important data it had been lost...

Try pg_resetxlog from contrib.

Regards,
Bjoern

Re: how to recover after harddisk error

From

"Peter Alberer"

Date:

26 February 2003, 05:20:44

Thanks a lot Bjoern.

Just wanted to mention that I found pg_resetxlog to be available by
default in PostgreSQL 7.3.2.

>-----Original Message-----
>From: Björn Metzdorf [mailto:bm@turtle-entertainment.de]
>Sent: Wednesday, 26 February 2003 10:25
>To: Peter Alberer; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] how to recover after harddisk error
>
>> 2003-02-26 09:03:06 [1291]   DEBUG:  invalid primary checkpoint record
>> 2003-02-26 09:03:06 [1291]   DEBUG:  open of
>> /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment
>> 200) failed
>>
>> I did the following steps to get the system running again:
>>
>> - a new initdb in another data-directory
>> - create the database again
>> - restore the data from the last available nightly dump
>>
>> Is there a better way to get the system running again? Had there been
>> any way to access the old system again? The steps I did took about 45
>> min which is quite long (cause the db-dump is rather large) and if there
>> had been some important data it had been lost...
>
>pg_resetxlog from contrib
>
>Regards,
>Bjoern

Re: how to recover after harddisk error

From

Tom Lane

Date:

26 February 2003, 10:44:41

"Peter Alberer" <h9351252@obelix.wu-wien.ac.at> writes:
> 2003-02-26 09:03:06 [1291]   DEBUG:  open of
> /usr/local/pgsql/data/pg_xlog/0000001A000000C9 (log file 26, segment
> 201) failed
> : No such file or directory
> 2003-02-26 09:03:06 [1291]   DEBUG:  invalid primary checkpoint record
> 2003-02-26 09:03:06 [1291]   DEBUG:  open of
> /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment
> 200) failed
> : No such file or directory
> 2003-02-26 09:03:06 [1291]   DEBUG:  invalid secondary checkpoint record
> 2003-02-26 09:03:06 [1291]   FATAL 2:  unable to locate a valid
> checkpoint record

Assuming you haven't wiped the old database directory yet...

What file name(s) are actually present in
/usr/local/pgsql/data/pg_xlog/?  What does pg_controldata show --- do the
other fields of pg_control look sane?

pg_resetxlog would have allowed you to restart, but at the price of
losing any consistency guarantees about the results of
recently-committed transactions.  So I consider it a very last resort.
What I'd like to understand first is why the system couldn't restart
normally.

            regards, tom lane
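
The two checks asked for above (which files exist in pg_xlog, and what pg_controldata reports) could be gathered with a small Python helper. This is only a sketch: the function names are hypothetical, it assumes pg_controldata is on the PATH, and it does nothing more than read the directory and capture the tool's output.

```python
import os
import subprocess


def list_xlog_segments(data_dir):
    """Return the sorted file names found in <data_dir>/pg_xlog."""
    return sorted(os.listdir(os.path.join(data_dir, "pg_xlog")))


def missing_segments(data_dir, wanted):
    """Return the names from 'wanted' that are not present in pg_xlog."""
    present = set(list_xlog_segments(data_dir))
    return [name for name in wanted if name not in present]


def show_control_data(data_dir):
    """Run pg_controldata on the data directory and return its text output.

    Assumption: the pg_controldata binary is installed and on the PATH.
    """
    result = subprocess.run(["pg_controldata", data_dir],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

With the paths from this thread, missing_segments("/usr/local/pgsql/data", ["0000001A000000C8", "0000001A000000C9"]) would report which of the two segments named in the log are absent, and show_control_data("/usr/local/pgsql/data") would return the pg_control fields for inspection.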

Re: how to recover after harddisk error

From

Peter Alberer

Date:

26 February 2003, 15:36:25

Too bad, I had intended to keep the old database instance around, but I had to
remove the files a few hours ago after running low on hard disk capacity...

ciao, peter

> "Peter Alberer" <h9351252@obelix.wu-wien.ac.at> writes:
> > 2003-02-26 09:03:06 [1291]   DEBUG:  open of
> > /usr/local/pgsql/data/pg_xlog/0000001A000000C9 (log file 26, segment
> > 201) failed
> > : No such file or directory
> > 2003-02-26 09:03:06 [1291]   DEBUG:  invalid primary checkpoint record
> > 2003-02-26 09:03:06 [1291]   DEBUG:  open of
> > /usr/local/pgsql/data/pg_xlog/0000001A000000C8 (log file 26, segment
> > 200) failed
> > : No such file or directory
> > 2003-02-26 09:03:06 [1291]   DEBUG:  invalid secondary checkpoint record
> > 2003-02-26 09:03:06 [1291]   FATAL 2:  unable to locate a valid
> > checkpoint record
>
> Assuming you haven't wiped the old database directory yet...
>
> What file name(s) are actually present in /usr/local/pgsql/data/pg_xlog/
> ?  What does pg_controldata show --- do the other fields of pg_control
> look sane?
>
> pg_resetxlog would have allowed you to restart, but at the price of
> losing any consistency guarantees about the results of
> recently-committed transactions.  So I consider it a very last resort.
> What I'd like to understand first is why the system couldn't restart
> normally.
>
>             regards, tom lane
>