Thread: Unable to restart postgres - database system was interrupted

Unable to restart postgres - database system was interrupted

From
andy rost
Date:
We're running Postgres version 8.1.3 on an Opteron box under FreeBSD
6.0-RELEASE #10.

We stopped postgres using kill -TERM. When we tried to restart the
engine, it would not recover. The logs stated the following:



2006-12-04 10:18:39 CST LOG:  archived transaction log file
"00000001000000690000002E"
2006-12-04 10:25:53 CST LOG:  archived transaction log file
"00000001000000690000002F"
2006-12-04 10:25:58 CST LOG:  archived transaction log file
"000000010000006900000030"
2006-12-04 10:26:09 CST LOG:  archived transaction log file
"000000010000006900000031"
2006-12-04 10:30:11 CST LOG:  received smart shutdown request
2006-12-04 10:30:11 CST LOG:  shutting down
2006-12-04 10:30:11 CST LOG:  database system is shut down
2006-12-04 10:30:11 CST LOG:  logger shutting down
2006-12-04 17:00:54 CST LOG:  database system was interrupted at
2006-12-02 16:45:06 CST
2006-12-04 17:00:54 CST LOG:  could not open file
"pg_xlog/0000000100000065000000F7" (log file 101, segment 247): No such
file or directory
2006-12-04 17:00:54 CST LOG:  invalid primary checkpoint record
2006-12-04 17:00:54 CST LOG:  could not open file
"pg_xlog/0000000100000065000000F0" (log file 101, segment 240): No such
file or directory
2006-12-04 17:00:54 CST LOG:  invalid secondary checkpoint record
2006-12-04 17:00:54 CST PANIC:  could not locate a valid checkpoint record
2006-12-04 17:00:54 CST LOG:  startup process (PID 786) was terminated
by signal 6
2006-12-04 17:00:54 CST LOG:  aborting startup due to startup process
failure
2006-12-04 17:00:54 CST LOG:  logger shutting down




Note the complaint "database system was interrupted at 2006-12-02
16:45:06 CST". The log for the time in question contains:




2006-12-02 16:30:40 CST LOG:  archived transaction log file
"0000000100000065000000F1"
2006-12-02 16:31:05 CST LOG:  archived transaction log file
"0000000100000065000000F2"
2006-12-02 16:32:08 CST LOG:  transaction ID wrap limit is 1183309811,
limited by database "development"
2006-12-02 16:32:08 CST LOG:  transaction ID wrap limit is 1183309811,
limited by database "development"
2006-12-02 16:32:08 CST LOG:  duration: 1927143.031 ms  statement: set
statement_timeout=0; VACUUM ANALYZE VERBOSE;
2006-12-02 16:32:11 CST LOG:  archived transaction log file
"0000000100000065000000F3"
2006-12-02 16:32:29 CST LOG:  archived transaction log file
"0000000100000065000000F4"
2006-12-02 16:32:44 CST LOG:  archived transaction log file
"0000000100000065000000F5"
2006-12-02 16:36:25 CST LOG:  archived transaction log file
"0000000100000065000000F6"
2006-12-02 17:03:33 CST LOG:  transaction ID wrap limit is 1183351842,
limited by database "work"
2006-12-02 17:03:33 CST LOG:  transaction ID wrap limit is 1183351842,
limited by database "work"
2006-12-02 17:03:33 CST LOG:  duration: 1885232.821 ms  statement: set
statement_timeout=0; VACUUM ANALYZE VERBOSE;
2006-12-02 17:03:34 CST LOG:  transaction ID wrap limit is 1183524362,
limited by database "postgres"
2006-12-02 17:03:34 CST LOG:  transaction ID wrap limit is 1183524362,
limited by database "postgres"
2006-12-02 17:11:13 CST LOG:  archived transaction log file
"0000000100000065000000F7"
2006-12-02 17:11:23 CST LOG:  archived transaction log file
"0000000100000065000000F8"
2006-12-02 17:11:33 CST LOG:  archived transaction log file
"0000000100000065000000F9"




I'm probably overlooking the obvious, but I don't see where the
problem is.

My concern is that now the pg_xlog directory doesn't have what it takes
for me to restart the server.

Two questions:

a) how did I get into this mess, and
b) how do I get out of it?

Can I run pg_resetxlog with any hope of success?
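
(From the pg_resetxlog reference page, I gather the invocation would be
something along the lines of

    pg_resetxlog -n /usr/local/pgsql/data     (dry run, just print control values)
    pg_resetxlog /usr/local/pgsql/data        (actual reset; add -f to force)

with the path here standing in for our real data directory.  What I can't
judge is how much committed data that would quietly throw away.)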

Many thanks ...

Andy
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------


Re: Unable to restart postgres - database system was interrupted

From
Tom Lane
Date:
andy rost <Andy.Rost@noaa.gov> writes:
> We stopped postgres using kill -TERM. When we tried to restart the
> engine, it would not recover.

Since you're apparently using archiving, you could pull the missing xlog
files back from the archive, no?  Either manually, or automatically by
installing a recovery.conf file.  I am kinda wondering what happened
here though.  Did you do anything to the database between the shutdown
and the attempted restart?  It looks like you have a pg_control file
that is quite a bit older than it should be.
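
For instance, a minimal recovery.conf dropped into the data directory could
contain nothing more than

    restore_command = 'cp /path/to/wal_archive/%f %p'

(adjust the path to wherever your WAL archive actually lives), and the
startup process will pull back whatever segments it needs from the archive.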

            regards, tom lane

Re: Unable to restart postgres - database system was interrupted

From
andy rost
Date:
We perform daily PITR backups of the database. Part of this process is
to delete old archived WALs between backups (there's no need to keep
archived transaction logs older than the most recent full backup, or is
there?). Since we had no indication of a problem, and since the server
continued to run, the backup process ran on schedule. The problem, in
our case, is that when we restarted the server it asked for transaction
logs that preceded our most recent backup. And since they've been
deleted from our WAL archive directory ....
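
In rough outline, the nightly job does something like the following; the
paths, label, and pruning rule here are illustrative rather than our actual
script:

    psql -c "SELECT pg_start_backup('nightly');"
    tar -czf /backups/base-`date +%Y%m%d`.tar.gz --exclude=data/pg_xlog \
        -C /usr/local/pgsql data
    psql -c "SELECT pg_stop_backup();"
    # prune archived segments older than the base backup just taken
    find /path/to/wal_archive -type f -mtime +1 -delete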

I'm curious about a couple of things. Why didn't the logs reflect the
problem that Postgres noticed when it tried to restart on 2006-12-04?
(What I mean is, Postgres thought the server had been interrupted at
2006-12-02 16:45, yet the logs for that date and time don't show that
anything unusual happened.) I suppose it's possible that the nature
of the problem prevented Postgres from logging it.

Secondly, how did Postgres know at restart that a) a problem had
occurred sometime in the past and b) a specific set of transaction logs
was required to get back up again? I'd like to incorporate that logic
into a) our server monitoring scripts, to look for problems that might
not make it to the Postgres logs, and b) our PITR backup process, to
stop it from running until the problem can be resolved (as near as
I can tell, our current backup is compromised).

But, back to your questions. Nothing was done to the database other than
a shutdown and restart. The age of the pg_control file coincided with
the time we stopped Postgres.

Tom Lane wrote:
> andy rost <Andy.Rost@noaa.gov> writes:
>> We stopped postgres using kill -TERM. When we tried to restart the
>> engine, it would not recover.
>
> Since you're apparently using archiving, you could pull the missing xlog
> files back from the archive, no?  Either manually, or automatically by
> installing a recovery.conf file.  I am kinda wondering what happened
> here though.  Did you do anything to the database between the shutdown
> and the attempted restart?  It looks like you have a pg_control file
> that is quite a bit older than it should be.
>
>             regards, tom lane


--
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------


Re: Unable to restart postgres - database system was interrupted

From
Tom Lane
Date:
andy rost <Andy.Rost@noaa.gov> writes:
> I'm curious about a couple of things. Why didn't the logs reflect the
> problem that Postgres noticed when it tried to restart on 2006-12-04?
> (What I mean is, Postgres thought the server had been interrupted at
> 2006-12-02 16:45, yet the logs for that date and time don't show that
> anything unusual happened.)

Probably nothing did.  That message is actually just reporting the
last-update timestamp found in $PGDATA/global/pg_control, which was
probably updated during a routine checkpoint or log segment switch.
IOW it's not the time of a problem, but the time the server was last
known to be functioning normally.
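
If you want to see exactly what the server is reading there, pg_controldata
will dump the file for you:

    pg_controldata /usr/local/pgsql/data

(substitute your real data directory).  Among other things it prints the
"pg_control last modified" time and the time and location of the latest
checkpoint, which should line up with what the startup message reported.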

The question is why do you have a two-day-stale copy of pg_control :-(
... it should certainly have been updated many times since then.
In particular, given your log entries that indicate normal shutdown at
2006-12-04 10:30:11, pg_control *should* have contained a timestamp
equal to that (plus or minus a second or so at most).

> Secondly, how did Postgres know at restart that a) a problem had
> occurred sometime in the past and b) a specific set of transaction logs
> was required to get back up again?

Again, this is based on the checkpoint pointer found in pg_control;
it wants xlog files starting at where the last checkpoint is alleged
to be by pg_control.  It'd seem that pg_control is a lot older than
what is in pg_xlog/.  I suspect if you checked the logs you'd find
that 0000000100000065000000F7 corresponds to about 2006-12-02 16:45.
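
Something as simple as

    grep 0000000100000065000000F7 /path/to/serverlog

(wherever your server log actually lives) should show when that segment was
in use.  In fact, the excerpt you posted already shows
0000000100000065000000F6 archived at 16:36 and 0000000100000065000000F7 at
17:11 on 12-02, so F7 was indeed the current segment right around 16:45.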

The only previous instances that I can recall of something like this
were in databases that are normally mounted on NFS volumes, and because
of some NFS problem or other the database volume had become dismounted,
leaving the postmaster seeing directories underneath the mount point
on the root volume --- and in particular a different copy of pg_control.
Usually this causes all hell to break loose immediately, though, so if
you hadn't had any signs of trouble or missing data before you stopped
the database, I doubt that could be the explanation.
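
(If you did want to rule that out, a quick way to see which filesystem the
data directory is actually sitting on is just "df /usr/local/pgsql/data",
with the path adjusted to your real data directory.)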

            regards, tom lane