Thread: Recovery of PGSQL after system crash failing!!!
Guess this is what I get for attempting to use a beta version of pgsql in a
production system. :( My database server crashed (kernel paging fault, it
looks like) and after reboot, postmaster refuses to start up. The error it
gives is:

DEBUG: starting up
DEBUG: database system was interrupted at 2001-02-11 04:08:12
DEBUG: Checkpoint record at (0, 805076492)
postmaster: reaping dead processes...
Startup failed - abort

And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get
the same thing each time I run it. I assume that WAL is for some reason
failing to restore/recover the database.

The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql 7.1beta4.
The system crash occurred during the nightly update of the databases (from
another, internal, non-pgsql, database system). Is there any way to recover
the database, or do I need to do a 'rm -rf data; initdb'? A quick response
would be greatly appreciated. Thanks.

---------------------------------------------------------------------------
|   "For to me to live is Christ, and to die is gain."                    |
|                                            --- Philippians 1:21 (KJV)   |
---------------------------------------------------------------------------
|  Ryan Kirkpatrick  |  Boulder, Colorado  |  http://www.rkirkpat.net/    |
---------------------------------------------------------------------------
Ryan Kirkpatrick <pgsql@rkirkpat.net> writes:
> DEBUG: Checkpoint record at (0, 805076492)
> postmaster: reaping dead processes...
> Startup failed - abort

Hm. All we can tell from this is that the startup subprocess exited with
nonzero status. Did it leave a corefile? If so, what's the stack trace?

regards, tom lane
> DEBUG: starting up
> DEBUG: database system was interrupted at 2001-02-11 04:08:12
> DEBUG: Checkpoint record at (0, 805076492)
> postmaster: reaping dead processes...
> Startup failed - abort
>
> And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get
> the same thing each time I run it. I assume that WAL is for some reason
> failing to restore/recover the database.
> The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql
> 7.1beta4. The system crash occurred during the nightly update of the
> databases (from another, internal, non-pgsql, database system). Is there
> any way to recover the database, or do I need to do a 'rm -rf

Please try to restart with the option wal_debug = 1 so that the postmaster
log will be more informative, and send this log to me.

> data; initdb'? A quick response would be greatly appreciated. Thanks.

Please archive PG's data dir - it will probably be useful for finding the bug.

Vadim
On Sun, 11 Feb 2001, Vadim Mikheev wrote:
> Please try to restart with option wal_debug = 1 so postmaster log
> will be more informative and send this log to me.

I enabled 'wal_debug=1' via both the -c command line option and
(separately) via ./data/postgresql.conf, and also tried setting
wal_debug=16 in ./data/postgresql.conf. I got no more postmaster log
information than in my last email. :(

I also set my coredump limit to unlimited (ulimit -c unlimited) and started
postmaster up. I got a core file, and here is what gdb has to say about it:

GNU gdb 19990928
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
(no debugging symbols found)...
Core was generated by `postmaster -d 5 -D /usr/local/pgsql/data/'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/libreadline.so.4...(no debugging symbols found)...done.
Reading symbols from /lib/libncurses.so.5...(no debugging symbols found)...done.
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
#0 0x20c931 in kill () from /lib/libc.so.6
(gdb) bt
#0 0x20c931 in kill () from /lib/libc.so.6
#1 0x20c618 in raise () from /lib/libc.so.6
#2 0x20dc71 in abort () from /lib/libc.so.6
#3 0x8080495 in XLogFileOpen ()
#4 0x8080b52 in ReadRecord ()
#5 0x8081f66 in StartupXLOG ()
#6 0x80853ea in BootstrapMain ()
#7 0x80ee1e7 in SSDataBase ()
#8 0x80ec766 in PostmasterMain ()
#9 0x80cd194 in main ()
#10 0x206a42 in __libc_start_main () from /lib/libc.so.6

Also, since it appears it died in XLogFileOpen(), here is what the
directory structure looks like for xlog related files:

drwx--S--- 5 postgres postgres     4096 Feb 12 20:51 data
drwx--S--- 2 postgres postgres     4096 Feb 11 04:12 data/pg_xlog
-rw------- 1 postgres postgres 16777216 Feb 11 04:12 data/pg_xlog/0000000000000030

The file listed in data/pg_xlog is the only file in this directory. It does
not look like a lot of help to me, but here it is anyway.

One other wrench to throw into the works... The kernel on this machine is
2.2.18 with the patches listed at www.linuxraid.org applied. I have a
feeling that the linux-security patches mentioned on that page may be
giving pgsql heartburn on recovery. I am going to recompile the kernel w/o
them enabled and see if anything different results, and will post my
results.

> > data; initdb'? A quick response would be greatly appreciated. Thanks.
>
> Please archive PG's data dir - it will probably be useful for finding the bug.

Archived. It is a bit over 11MB, and I can put it on my web server if
someone wants to look at it (10 minute download with a 192kbit or faster
link). Though I would like to limit its distribution, as it does have
relatively sensitive company data buried in it (custom lists and the like).
There is nothing I need to retrieve from it, though... This database is
from the web site that is updated every night from the internal databases.
For the time being I have fallen back to 7.0.3 for production use.

Thank you for all of your help. TTYL.
On Mon, 12 Feb 2001, Ryan Kirkpatrick wrote:
> One other wrench to throw into the works... The kernel on this
> machine is 2.2.18 with the patches listed at www.linuxraid.org applied. I
> have a feeling that the linux-security patches mentioned on that page may
> be giving pgsql heartburn on recovery. I am going to recompile the kernel
> w/o them enabled and see if anything different results, and will post my
> results.

Did as above: disabled all security options in the kernel, recompiled, and
rebooted. Postgres behaves exactly the same as before. :(
Ryan Kirkpatrick <pgsql@rkirkpat.net> writes:
> #2 0x20dc71 in abort () from /lib/libc.so.6
> #3 0x8080495 in XLogFileOpen ()

Hm. Evidently it's failing to open the xlog file, but the code is set up in
such a way that it dies before telling you why :-( Take a look at
XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code to
tell you the path and errno it's failing on before it abort()s.

regards, tom lane
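The kind of tweak Tom suggests (report the path and errno before giving up) can be sketched as follows. This is a Python analogue for illustration only, not the C code in xlog.c; the function name open_segment_or_report is hypothetical.

```python
import errno
import os
import sys


def open_segment_or_report(path):
    """Open a log segment read-only; on failure, report path and errno, then re-raise.

    Hypothetical helper: it only mirrors the idea of saying *why* the open
    failed before aborting, and is not PostgreSQL's XLogFileOpen.
    """
    try:
        return os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Name the file and the error before the caller gives up.
        symbolic = errno.errorcode.get(exc.errno, "?")
        sys.stderr.write(
            "open of %s failed: errno %d (%s): %s\n"
            % (path, exc.errno, symbolic, os.strerror(exc.errno))
        )
        raise
```

Called on a path that does not exist, this writes the path and an ENOENT message to stderr before the exception propagates, which is the information the bare abort() in the backtrace hides.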
> > #2 0x20dc71 in abort () from /lib/libc.so.6
> > #3 0x8080495 in XLogFileOpen ()
>
> Hm. Evidently it's failing to open the xlog file, but the code is set
> up in such a way that it dies before telling you why :-( Take a look
> at XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code
> to tell you the path and errno it's failing on before it abort()s.

I don't remember why there is an abort() in XLogFileOpen just before the
appropriate elog(STOP) there - I'll remove it in a few moments - but it's
already obvious why the open failed: there is no file with the checkpoint
record pointed to by pg_control - data/pg_xlog/000000000000002F.

So, the question is who removed this file - PG or Linux? Ryan, do you have
the postmaster's log from before the crash (where MoveOfflineLogs reports
WAL files to be deleted) and/or some logs from Linux's startup?

And meanwhile I'll take a look around MoveOfflineLogs...

Vadim
On Tue, 13 Feb 2001, Vadim Mikheev wrote:
> I don't remember why there is an abort() in XLogFileOpen just before the
> appropriate elog(STOP) there - I'll remove it in a few moments - but
> it's already obvious why the open failed: there is no file with the
> checkpoint record pointed to by pg_control - data/pg_xlog/000000000000002F.
> So, the question is who removed this file - PG or Linux?

When the system crashed, it was updating the database rather heavily (i.e.
drop everything, reload from external source). Therefore there was a lot of
activity going on to be logged. I still haven't determined what caused the
system to crash; the error message from the kernel was along the lines of
'can not handle kernel paging request'. Of course, the machine crashed
again ~12 hours later w/o any kernel error messages. :( There may be a
hardware problem with the machine that is causing these problems....

> Ryan, do you have the postmaster's log from before the crash (where
> MoveOfflineLogs reports WAL files to be deleted) and/or some logs from
> Linux's startup?

Sorry, I don't have the log file (it got overwritten during reboot). As for
logs from Linux startup, I have them, but there is nothing of any interest
to postgres in them... The fsck on the disks after the system came back up
was clean, and there are no files in lost+found for the partition the
database is on.

> And meanwhile I'll take a look around MoveOfflineLogs...

Good hunting... :)
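The missing file name can be checked against the checkpoint location in the startup log, (0, 805076492). The sketch below rests on two assumptions that the thread does not spell out: segments are 16 MB (which matches the 16777216-byte file in the directory listing), and a segment's file name is the log id followed by the segment number, each as eight uppercase hex digits. The function name wal_segment_name is hypothetical.

```python
# Assumption: 16 MB segments, matching the 16777216-byte file in the listing.
XLOG_SEG_SIZE = 16 * 1024 * 1024


def wal_segment_name(log_id, offset, seg_size=XLOG_SEG_SIZE):
    """Return the assumed file name of the segment holding (log_id, offset)."""
    segment = offset // seg_size
    # Assumed naming: log id and segment number, eight uppercase hex digits each.
    return "%08X%08X" % (log_id, segment)


if __name__ == "__main__":
    # Checkpoint record location from the startup log.
    print(wal_segment_name(0, 805076492))  # 000000000000002F
```

Under those assumptions, 805076492 // 16777216 = 47 = 0x2F, so the checkpoint record lives in 000000000000002F, one segment before the only file left in pg_xlog (0000000000000030).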
> Might it be that pg_control is older than it should be?
> I mean, that the write to pg_control did not make it to disk,
> but the checkpoint already completed (removed the logs)?

Well, WAL does *pg_fsync()* of pg_control before removing old logs, so it's
only possible if Ryan ran PG with -F (fsync = off). Ryan?

Vadim
On Tue, 13 Feb 2001, Mikheev, Vadim wrote:
> > Might it be that pg_control is older than it should be?
> > I mean, that the write to pg_control did not make it to disk,
> > but the checkpoint already completed (removed the logs)?
>
> Well, WAL does *pg_fsync()* of pg_control before removing old
> logs, so it's only possible if Ryan ran PG with -F (fsync = off).
> Ryan?

Guilty as charged, I am afraid... :( Here I thought that with WAL and all
(bad pun :), I would not need fsync anymore and decided to be reckless.
Guess I ought to reconsider that decision.... Though wasn't WAL supposed to
remove the need for fsync, or was it just to improve recovery ability?

Anyway, if that is the root of the problem (very bad timing on a system
crash), then I will consider this problem solved. Thanks for everyone's
help.
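The ordering Vadim describes (force pg_control to disk, and only then remove old log files) can be sketched in a few lines. This is an illustrative Python model, not PostgreSQL's checkpoint code; the function name checkpoint_then_trim and its parameters are hypothetical, and do_fsync=False stands in for running with -F.

```python
import os


def checkpoint_then_trim(control_path, control_bytes, old_segments, do_fsync=True):
    """Write the control file, optionally force it to disk, then delete old segments.

    Hypothetical model of the ordering only. With do_fsync=False the new
    control data may still sit in the OS cache when the old segments are
    unlinked, so a crash at that point could leave a stale control file that
    points at a segment which no longer exists.
    """
    fd = os.open(control_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, control_bytes)
        if do_fsync:
            # Make the new checkpoint pointer durable before anything is removed.
            os.fsync(fd)
    finally:
        os.close(fd)
    # Only now is it safe to drop segments the old checkpoint still needed.
    for segment in old_segments:
        os.remove(segment)
```

With fsync skipped, the sequence of calls is the same, but nothing guarantees that the control file reaches disk before the removals do, which is the scenario suspected in this thread.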
> Guilty as charged, I am afraid... :( Here I thought that with WAL and
> all (bad pun :), I would not need fsync anymore and decided to be
> reckless. Guess I ought to reconsider that decision.... Though wasn't WAL
> supposed to remove the need for fsync, or was it just to improve recovery
> ability?

It removes the need to disable fsync to get best performance! The converse
is not true; it does not eliminate the need to fsync to help guard data
integrity, and the WAL file management may be a bit less robust than that
for other tables. I can see how this might have been omitted from much of
the discussion, so it is important that we remind ourselves about this.
Thanks for the reminder :/

Since there is a fundamental recovery problem if the WAL file disappears,
perhaps we should have a workaround which can ignore the requirement for
that file on startup? Or maybe we do already? Vadim??

Also, could the "-F" option be disabled now that WAL is enabled? Or is
there still some reason to encourage/allow folks to use it?

- Thomas
> Since there is a fundamental recovery problem if the WAL file
> disappears, then perhaps we should have a workaround which can ignore
> the requirement for that file on startup? Or maybe we do already?
> Vadim??
>
> Also, could the "-F" option be disabled now that WAL is enabled? Or is
> there still some reason to encourage/allow folks to use it?

The system still fsyncs, so -F is still useful, I think.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Also, could the "-F" option be disabled now that WAL is enabled? Or is
> there still some reason to encourage/allow folks to use it?

I was the one who put it back in after Vadim turned it off ;-) ... and I'll
object to any attempt to remove the option.

I think that there's no longer any good reason for people to consider -F in
production use. On the other hand, for development or debugging work where
you don't really *care* about powerfail survivability, I see no reason to
incur extra wear on your disk drives by forcing fsyncs. My drives only have
so many seeks left in 'em, and I'd rather see those seeks expended on
writing source-code files than on fsyncs of test databases.

regards, tom lane
Tom Lane writes:

> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > there still some reason to encourage/allow folks to use it?
>
> I was the one who put it back in after Vadim turned it off ;-) ... and
> I'll object to any attempt to remove the option.

The description should be updated though:
http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL

I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
performance with fsync turned on" and extrapolated "Imagine what it can do
if I turn off fsync anyway."

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
> Tom Lane writes:
>
> > Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > > there still some reason to encourage/allow folks to use it?
> >
> > I was the one who put it back in after Vadim turned it off ;-) ... and
> > I'll object to any attempt to remove the option.
>
> The description should be updated though:
> http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL
>
> I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
> performance with fsync turned on" and extrapolated "Imagine what it can do
> if I turn off fsync anyway."

That is a very subtle point, and one I can imagine many people incorrectly
assuming.

--
  Bruce Momjian
On Wed, 14 Feb 2001, Peter Eisentraut wrote:
> Tom Lane writes:
>
> > Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > > there still some reason to encourage/allow folks to use it?
> >
> > I was the one who put it back in after Vadim turned it off ;-) ... and
> > I'll object to any attempt to remove the option.
>
> The description should be updated though:
> http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL
>
> I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
> performance with fsync turned on" and extrapolated "Imagine what it can do
> if I turn off fsync anyway."

That is exactly what I did... I figured that with WAL removing the need for
fsync, it wasn't needed and could be disabled for a nice performance
increase. Now I am quite a bit wiser, and will be leaving fsync enabled on
all 7.1 production servers. :)

Thank you for bringing that subtle point out, and yes, the documentation
could do with a bit of help on this point. TTYL.