Thread: Unable to restart postgres - database system was interrupted
Running Postgres version 8.1.3 on an Opteron box running FreeBSD 6.0-RELEASE #10. We stopped postgres using kill -TERM. When we tried to restart the engine, it would not recover. The logs stated the following:

2006-12-04 10:18:39 CST LOG: archived transaction log file "00000001000000690000002E"
2006-12-04 10:25:53 CST LOG: archived transaction log file "00000001000000690000002F"
2006-12-04 10:25:58 CST LOG: archived transaction log file "000000010000006900000030"
2006-12-04 10:26:09 CST LOG: archived transaction log file "000000010000006900000031"
2006-12-04 10:30:11 CST LOG: received smart shutdown request
2006-12-04 10:30:11 CST LOG: shutting down
2006-12-04 10:30:11 CST LOG: database system is shut down
2006-12-04 10:30:11 CST LOG: logger shutting down
2006-12-04 17:00:54 CST LOG: database system was interrupted at 2006-12-02 16:45:06 CST
2006-12-04 17:00:54 CST LOG: could not open file "pg_xlog/0000000100000065000000F7" (log file 101, segment 247): No such file or directory
2006-12-04 17:00:54 CST LOG: invalid primary checkpoint record
2006-12-04 17:00:54 CST LOG: could not open file "pg_xlog/0000000100000065000000F0" (log file 101, segment 240): No such file or directory
2006-12-04 17:00:54 CST LOG: invalid secondary checkpoint record
2006-12-04 17:00:54 CST PANIC: could not locate a valid checkpoint record
2006-12-04 17:00:54 CST LOG: startup process (PID 786) was terminated by signal 6
2006-12-04 17:00:54 CST LOG: aborting startup due to startup process failure
2006-12-04 17:00:54 CST LOG: logger shutting down

Note the complaint of "database system was interrupted at 2006-12-02 16:45:06 CST".
The log for the time in question contains:

2006-12-02 16:30:40 CST LOG: archived transaction log file "0000000100000065000000F1"
2006-12-02 16:31:05 CST LOG: archived transaction log file "0000000100000065000000F2"
2006-12-02 16:32:08 CST LOG: transaction ID wrap limit is 1183309811, limited by database "development"
2006-12-02 16:32:08 CST LOG: transaction ID wrap limit is 1183309811, limited by database "development"
2006-12-02 16:32:08 CST LOG: duration: 1927143.031 ms statement: set statement_timeout=0; VACUUM ANALYZE VERBOSE;
2006-12-02 16:32:11 CST LOG: archived transaction log file "0000000100000065000000F3"
2006-12-02 16:32:29 CST LOG: archived transaction log file "0000000100000065000000F4"
2006-12-02 16:32:44 CST LOG: archived transaction log file "0000000100000065000000F5"
2006-12-02 16:36:25 CST LOG: archived transaction log file "0000000100000065000000F6"
2006-12-02 17:03:33 CST LOG: transaction ID wrap limit is 1183351842, limited by database "work"
2006-12-02 17:03:33 CST LOG: transaction ID wrap limit is 1183351842, limited by database "work"
2006-12-02 17:03:33 CST LOG: duration: 1885232.821 ms statement: set statement_timeout=0; VACUUM ANALYZE VERBOSE;
2006-12-02 17:03:34 CST LOG: transaction ID wrap limit is 1183524362, limited by database "postgres"
2006-12-02 17:03:34 CST LOG: transaction ID wrap limit is 1183524362, limited by database "postgres"
2006-12-02 17:11:13 CST LOG: archived transaction log file "0000000100000065000000F7"
2006-12-02 17:11:23 CST LOG: archived transaction log file "0000000100000065000000F8"
2006-12-02 17:11:33 CST LOG: archived transaction log file "0000000100000065000000F9"

I'm sure I'm probably overlooking the obvious, but I don't see where the problem is. My concern is that the pg_xlog directory no longer has what it takes for me to restart the server. Two questions: a) how did I get into this mess?; and b) how do I get out of it? Can I run pg_resetxlog with any hope of success? Many thanks ...
Andy

--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234   Fax: (952)361-6634
andy.rost@noaa.gov   http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------
andy rost <Andy.Rost@noaa.gov> writes:
> We stopped postgres using kill -TERM. When we tried to restart the
> engine, it would not recover.

Since you're apparently using archiving, you could pull the missing xlog files back from the archive, no? Either manually, or automatically by installing a recovery.conf file.

I am kinda wondering what happened here though. Did you do anything to the database between the shutdown and the attempted restart? It looks like you have a pg_control file that is quite a bit older than it should be.

			regards, tom lane
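For anyone following along, a minimal recovery.conf for pulling segments back from the archive automatically, as Tom suggests, might look like the sketch below. The archive path is a placeholder, not Andy's actual setup; substitute the real WAL archive directory.

```
# recovery.conf -- sketch only; /wal_archive is a placeholder path.
# PostgreSQL runs restore_command once for each WAL segment it needs:
#   %f expands to the segment file name, %p to the destination path.
restore_command = 'cp /wal_archive/%f %p'
```

Dropped into $PGDATA before starting the postmaster, this puts the server into archive recovery; the file is renamed to recovery.done once recovery completes.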
We perform daily PITR backups of the database. Part of this process is to delete old archived WALs between backups (no need to keep archived transaction logs that are older than the most recent full backup, or is there?). Since we had no indication of a problem, and since the server continued to run, the backup process ran on schedule. The problem, in our case, is that when we restarted the server it asked for transaction logs that preceded our most recent backup. And since they've been deleted from our WAL archive directory ....

I'm curious about a couple of things. First, why didn't the logs reflect the problem that Postgres noticed when it tried to restart on 2006-12-04? (What I mean is, Postgres thought the server had been interrupted on 2006-12-02 16:45, yet the logs for that date and time don't show that anything unusual happened.) I suppose it's possible that the nature of the problem prevented Postgres from logging it.

Secondly, how did Postgres know at restart that a) a problem had occurred sometime in the past, and b) a specific set of transaction logs was required to get back up again? I'd like to incorporate that logic into a) our server monitoring scripts, to look for problems that might not make it to the Postgres logs, and b) our PITR backup process, to constrain it (i.e., stop it from running until the problem can be resolved - as near as I can tell, our current backup is compromised).

But, back to your questions. Nothing was done to the database other than a shutdown and restart. The age of the pg_control file coincided with the time we stopped Postgres.

Tom Lane wrote:
> andy rost <Andy.Rost@noaa.gov> writes:
>> We stopped postgres using kill -TERM. When we tried to restart the
>> engine, it would not recover.
>
> Since you're apparently using archiving, you could pull the missing xlog
> files back from the archive no? Either manually, or automatically by
> installing a recovery.conf file. I am kinda wondering what happened
> here though.
> Did you do anything to the database between the shutdown
> and the attempted restart? It looks like you have a pg_control file
> that is quite a bit older than it should be.
>
> 			regards, tom lane

-- 
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234   Fax: (952)361-6634
andy.rost@noaa.gov   http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------
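The WAL-pruning step Andy describes can be keyed off the backup history file that pg_start_backup leaves in the archive: only segments that sort before the starting segment of the newest base backup are safe to delete. A rough sketch follows; the archive path and function names are invented for illustration, not taken from Andy's setup.

```shell
#!/bin/bash
# Sketch: prune archived WAL older than the newest base backup.
# /wal_archive and the helper names below are assumptions.
ARCHIVE=/wal_archive

# The newest *.backup history file is named after the first WAL segment
# the backup needs, e.g. "000000010000006900000030.0020.backup".
oldest_needed() {
    ls "$ARCHIVE"/*.backup | sort | tail -1 | sed 's/.*\///; s/\..*//'
}

# WAL segment file names sort lexicographically in WAL order, so any
# segment that sorts before $1 predates the backup and can be removed.
prune() {
    keep_from=$1
    for f in "$ARCHIVE"/[0-9A-F]*; do
        seg=${f##*/}
        if [[ $seg < $keep_from ]]; then
            rm "$f"
        fi
    done
}

# Typical use:  prune "$(oldest_needed)"
```

The key point, given this thread, is that pruning relative to the backup history file never removes segments the newest base backup still needs - whereas pruning by file age can, if the cluster has quietly stopped archiving.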
andy rost <Andy.Rost@noaa.gov> writes:
> I'm curious about a couple of things. Why didn't the logs reflect the
> problem that it noticed when it tried to restart on 2006-12-04 (what I
> mean by that, is Postgres thought the server had been interrupted on
> 2006-12-02 16:45 yet the logs for that date and time don't show that
> anything unusual happened).

Probably nothing did. That message is actually just reporting the last-update timestamp found in $PGDATA/global/pg_control, which was probably updated during a routine checkpoint or log segment switch. IOW it's not the time of a problem, but the time the server was last known to be functioning normally. The question is why do you have a two-day-stale copy of pg_control :-( ... it should certainly have been updated many times since then. In particular, given your log entries that indicate normal shutdown at 2006-12-04 10:30:11, pg_control *should* have contained a timestamp equal to that (plus or minus a second or so at most).

> Secondly, how did Postgres know at the restart that a) a problem had
> occurred sometime in the past and b) a specific set of transaction logs
> is required to get back up again.

Again, this is based on the checkpoint pointer found in pg_control; it wants xlog files starting at where the last checkpoint is alleged to be by pg_control. It'd seem that pg_control is a lot older than what is in pg_xlog/. I suspect if you checked the logs you'd find that 0000000100000065000000F7 corresponds to about 2006-12-02 16:45.

The only previous instances that I can recall of something like this were in databases that are normally mounted on NFS volumes, and because of some NFS problem or other the database volume had become dismounted, leaving the postmaster seeing directories underneath the mount point on the root volume --- and in particular a different copy of pg_control.
Usually this causes all hell to break loose immediately, though, so if you hadn't had any signs of trouble or missing data before you stopped the database, I doubt that could be the explanation.

			regards, tom lane
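For the monitoring-script idea upthread: pg_controldata (shipped with PostgreSQL) prints the same checkpoint information the startup process reads, so a periodic check can flag a stale pg_control before a restart is ever attempted. A sketch, with invented paths and threshold; note the date invocation is GNU-style, so FreeBSD would need date -j -f instead.

```shell
#!/bin/bash
# Sketch: warn when pg_control's checkpoint timestamp looks stale.
# PGDATA and MAX_AGE are assumptions; adjust for the real cluster.
PGDATA=/usr/local/pgsql/data
MAX_AGE=3600   # seconds; checkpoints normally happen far more often

# Extract the timestamp from pg_controldata output (read from stdin so
# the parsing can be exercised without a live cluster).
checkpoint_time() {
    sed -n 's/^Time of latest checkpoint: *//p'
}

check_pg_control() {
    ts=$(pg_controldata "$PGDATA" | checkpoint_time)
    ckpt=$(date -d "$ts" +%s)          # GNU date; use date -j -f on FreeBSD
    age=$(( $(date +%s) - ckpt ))
    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "WARNING: pg_control checkpoint is ${age}s old" >&2
        return 1
    fi
}
```

Run from cron, a non-zero exit here could also be used to hold off the backup job, addressing the "constrain our PITR backup process" part of the question.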