Thread: Recovery of PGSQL after system crash failing!!!

Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
Guess this is what I get for attempting to use a beta version of
pgsql in a production system. :( My database server crashed (kernel
paging fault, it looks like) and after reboot, postmaster refuses to start
up. The error it gives is:

DEBUG: starting up
DEBUG: database system was interrupted at 2001-02-11 04:08:12
DEBUG: Checkpoint record at (0, 805076492)
postmaster: reaping dead processes...
Startup failed - abort

And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get
the same thing each time I run it. I assume that WAL is for some reason
failing to restore/recover the database.

The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql
7.1beta4. The system crash occurred during the nightly update of the
databases (from another, internal, non-pgsql, database system). Is there
any way to recover the database, or do I need to do a 'rm -rf
data; initdb'? A quick response would be greatly appreciated. Thanks.

---------------------------------------------------------------------------
|   "For to me to live is Christ, and to die is gain."                    |
|                                            --- Philippians 1:21 (KJV)   |
---------------------------------------------------------------------------
|   Ryan Kirkpatrick  |  Boulder, Colorado  |  http://www.rkirkpat.net/   |
---------------------------------------------------------------------------



Re: Recovery of PGSQL after system crash failing!!!

From
Tom Lane
Date:
Ryan Kirkpatrick <pgsql@rkirkpat.net> writes:
> DEBUG: Checkpoint record at (0, 805076492)
> postmaster: reaping dead processes...
> Startup failed - abort

Hm.  All we can tell from this is that the startup subprocess exited
with nonzero status.  Did it leave a corefile?  If so, what's the
stack trace?
        regards, tom lane


Re: Recovery of PGSQL after system crash failing!!!

From
"Vadim Mikheev"
Date:
> DEBUG: starting up
> DEBUG: database system was interrupted at 2001-02-11 04:08:12
> DEBUG: Checkpoint record at (0, 805076492)
> postmaster: reaping dead processes...
> Startup failed - abort
> 
> And that is it, from running 'postmaster -D /usr/local/pgsql/data/'. I get
> the same thing each time I run it. I assume that WAL is for some reason
> failing to restore/recover the database. 
> The system is a stock Debian 2.2 system, Dual PPro200, w/pgsql
> 7.1beta4. The system crash occurred during the nightly update of the
> databases (from another, internal, non-pgsql, database system). Is there
> any way to recover the database, or do I need to do a 'rm -rf

Please try to restart with the option wal_debug = 1 so the postmaster log
will be more informative, and send this log to me.

> data; initdb'? A quick response would be greatly appreciated. Thanks.

Please archive PG's data dir - it will probably be useful for finding the bug.

Vadim




Re: Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
On Sun, 11 Feb 2001, Vadim Mikheev wrote:

> Please try to restart with option wal_debug = 1 so postmaster log
> will be more informative and send this log me.
I enabled 'wal_debug=1' via both the -c command line option and
(separately) via ./data/postgresql.conf, as well as setting wal_debug=16
in ./data/postgresql.conf, and I got no additional postmaster log information
beyond what was in my last email. :(

I also set my coredump limit to unlimited (ulimit -c unlimited) and
started postmaster up. I got a core file, and here is what gdb has to say
about it:

GNU gdb 19990928
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
(no debugging symbols found)...
Core was generated by `postmaster -d 5 -D /usr/local/pgsql/data/'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/libreadline.so.4...(no debugging symbols found)...done.
Reading symbols from /lib/libncurses.so.5...(no debugging symbols found)...done.
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
#0  0x20c931 in kill () from /lib/libc.so.6
(gdb) bt 
#0  0x20c931 in kill () from /lib/libc.so.6
#1  0x20c618 in raise () from /lib/libc.so.6
#2  0x20dc71 in abort () from /lib/libc.so.6
#3  0x8080495 in XLogFileOpen ()
#4  0x8080b52 in ReadRecord ()
#5  0x8081f66 in StartupXLOG ()
#6  0x80853ea in BootstrapMain ()
#7  0x80ee1e7 in SSDataBase ()
#8  0x80ec766 in PostmasterMain ()
#9  0x80cd194 in main ()
#10 0x206a42 in __libc_start_main () from /lib/libc.so.6

Also, since it appears it died in XLogFileOpen(), here is what the
directory structure looks like for xlog related files:

drwx--S---    5 postgres postgres     4096 Feb 12 20:51 data
drwx--S---    2 postgres postgres     4096 Feb 11 04:12 data/pg_xlog
-rw-------    1 postgres postgres 16777216 Feb 11 04:12 data/pg_xlog/0000000000000030

The file listed in data/pg_xlog is the only file in this directory. Does
not look like a lot of help to me, but here it is anyway.

One other wrench to throw into the works... The kernel on this
machine is 2.2.18 with the patches listed at www.linuxraid.org applied. I
have a feeling that the linux-security patches mentioned on that page may
be giving pgsql heartburn on recovery. I am going to recompile the kernel
w/o them enabled and see if anything different results, and will post my
results.

> > data; initdb'? A quick response would be greatly appreciated. Thanks.
> 
> Please archive PG's data dir - it will probably be useful for finding the bug.
Archived. It is a bit over 11MB, and I can put it on my web server
if someone wants to look at it (10 minute download with a 192kbit or
faster link). Though I would like to limit its distribution as it does
have relatively sensitive company data buried in it (custom lists and the
like).

Though there is nothing I need to retrieve from it... This
database is from the web site that is updated every night from the
internal databases. For the time being I have fallen back to 7.0.3 for
production use.

Thank you for all of your help. TTYL.




Re: Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
On Mon, 12 Feb 2001, Ryan Kirkpatrick wrote:

>     One other wrench to throw into the works... The kernel on this
> machine is 2.2.18 with the patches listed at www.linuxraid.org applied. I
> have a feeling that the linux-security patches mentioned on that page may
> be giving pgsql heartburn on recovery. I am going to recompile the kernel
> w/o them enabled and see if anything different results, and will post my
> results.
Did as above, disabling all security options in the kernel,
recompiling, and rebooting. Postgres behaves exactly the same as
before. :(




Re: Recovery of PGSQL after system crash failing!!!

From
Tom Lane
Date:
Ryan Kirkpatrick <pgsql@rkirkpat.net> writes:
> #2  0x20dc71 in abort () from /lib/libc.so.6
> #3  0x8080495 in XLogFileOpen ()

Hm.  Evidently it's failing to open the xlog file, but the code is set
up in such a way that it dies before telling you why :-(  Take a look
at XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code
to tell you the path and errno it's failing on before it abort()s.
        regards, tom lane
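
For readers following along, the idea Tom describes (say which path failed
and with which errno before giving up) can be sketched as below. This is
illustrative Python, not the actual code in xlog.c; the function name
explain_open_failure is hypothetical, and the path is simply the segment
named later in the thread.

```python
import errno
import os


def explain_open_failure(path):
    """Try to open path read-only; return None on success, else a diagnostic string."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Report the path and the errno *before* any abort, so the reason is visible.
        code = errno.errorcode.get(exc.errno, "unknown")
        return "open(%s) failed: errno %d (%s): %s" % (path, exc.errno, code, exc.strerror)
    os.close(fd)
    return None


message = explain_open_failure("data/pg_xlog/000000000000002F")
if message is not None:
    print(message)
```

With a diagnostic like this in place, a missing segment would show up as an
ENOENT on a specific file name instead of a bare abort().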


Re: Recovery of PGSQL after system crash failing!!!

From
"Vadim Mikheev"
Date:
> > #2  0x20dc71 in abort () from /lib/libc.so.6
> > #3  0x8080495 in XLogFileOpen ()
> 
> Hm.  Evidently it's failing to open the xlog file, but the code is set
> up in such a way that it dies before telling you why :-(  Take a look
> at XLogFileOpen in src/backend/access/transam/xlog.c and tweak the code
> to tell you the path and errno it's failing on before it abort()s.

I don't remember why there is an abort() in XLogFileOpen just before the
appropriate elog(STOP) there - I'll remove it in a few moments - but
it's already obvious why the open failed: the file holding the checkpoint
record that pg_control points to, data/pg_xlog/000000000000002F, does not
exist. So, the question is who removed this file - PG or Linux?
Ryan, do you have the postmaster's log from before the crash (where
MoveOfflineLogs reports WAL files to be deleted) and/or some logs from
Linux's startup? And meanwhile I'll take a look around MoveOfflineLogs...

Vadim
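
As a rough check of how the missing file name follows from the startup log,
here is a small sketch. It assumes the checkpoint location (0, 805076492) is
a (log id, byte offset) pair, that segments are 16777216 bytes (the size of
the one file in Ryan's listing), and that file names are the log id and
segment number as eight hex digits each. Those assumptions are inferred from
the thread, and the function name wal_segment_name is hypothetical.

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024  # 16777216 bytes, matching the file size in the listing


def wal_segment_name(log_id, byte_offset, seg_size=XLOG_SEG_SIZE):
    """Return the pg_xlog file name that would hold the given (log id, offset)."""
    segment = byte_offset // seg_size
    # Assumed naming scheme: log id and segment number, 8 upper-case hex digits each.
    return "%08X%08X" % (log_id, segment)


# Checkpoint record location from the startup log: (0, 805076492)
print(wal_segment_name(0, 805076492))  # prints 000000000000002F
```

Under those assumptions the checkpoint record falls in segment 0x2F, while
the only file left in pg_xlog is the next one, 0000000000000030.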




Re: Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
On Tue, 13 Feb 2001, Vadim Mikheev wrote:

> I don't remember why there is abort() in XLogFileOpen just before
> appropriate elog(STOP) there - I'll remove it in few moments, - but
> it's already obvious why open failed: there is no file with checkpoint
> record pointed by pg_control - data/pg_xlog/000000000000002F.
> So, the question is who removed this file - PG or Linux?
When the system crashed, it was updating the database rather
heavily (i.e. drop everything, reload from external source). Therefore
there was a lot of activity going on to be logged. I still haven't
determined what caused the system to crash; the error message from the
kernel was along the lines of 'can not handle kernel paging request'. Of
course, the machine crashed again ~12 hours later w/o any kernel error
messages. :( There may be a hardware problem with the machine that is
causing these problems....

> Ryan, do you have postmaster' log before crash (where MoveOfflineLogs
> reports WAL files to be deleted) and/or some logs from Linux' startup?
Sorry, I don't have the log file (got overwritten during reboot).
As for logs from Linux startup, I have them, but there is nothing of any
interest to postgres in them... The fsck on the disks after the system
came back up was clean, and there are no files in lost+found for the
partition the database is on.

> And meanwhile I'll take a look around MoveOfflineLogs...
Good hunting... :)




RE: Recovery of PGSQL after system crash failing!!!

From
"Mikheev, Vadim"
Date:
> Might it be, that pg_control is older than it should be ?
> I mean, that the write to pg_control did not make it to disk,
> but the checkpoint already completed (removed the logs) ?

Well, WAL does *pg_fsync()* of pg_control before removing old
logs, so it's only possible if Ryan ran PG with -F (fsync = off).
Ryan?

Vadim
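
The ordering Vadim describes can be sketched as follows. This is a toy model,
not PostgreSQL's code: the control file here just holds a segment name as
text, and finish_checkpoint and its parameters are hypothetical names. The
point is only the order of operations: force the new control file to disk
first, and only then remove the segments it no longer points at.

```python
import os
import tempfile


def finish_checkpoint(data_dir, checkpoint_segment, fsync_enabled=True):
    """Record the new checkpoint segment, then remove older log segments."""
    control_path = os.path.join(data_dir, "pg_control")
    fd = os.open(control_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, checkpoint_segment.encode("ascii"))
        if fsync_enabled:
            # Step 1: make sure the new control data is really on disk.
            os.fsync(fd)
    finally:
        os.close(fd)

    # Step 2: only now is it safe to drop segments older than the checkpoint.
    # Fixed-width upper-case hex names sort in numeric order.
    xlog_dir = os.path.join(data_dir, "pg_xlog")
    removed = []
    for name in sorted(os.listdir(xlog_dir)):
        if name < checkpoint_segment:
            os.remove(os.path.join(xlog_dir, name))
            removed.append(name)
    return removed


with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "pg_xlog"))
    for seg in ("000000000000002F", "0000000000000030"):
        open(os.path.join(demo_dir, "pg_xlog", seg), "wb").close()
    print(finish_checkpoint(demo_dir, "0000000000000030"))
```

With fsync_enabled=False the removal in step 2 can reach the disk while the
control file update is still sitting in the OS cache, so a crash at that
moment can leave an old control file pointing at a segment that is already
gone, which is the situation in this thread.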


RE: Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
On Tue, 13 Feb 2001, Mikheev, Vadim wrote:

> > Might it be, that pg_control is older than it should be ?
> > I mean, that the write to pg_control did not make it to disk,
> > but the checkpoint already completed (removed the logs) ?
> 
> Well, WAL does *pg_fsync()* of pg_control before removing old
> logs, so it's only possible if Ryan run PG with -F (fsync = off).
> Ryan?
Guilty as charged, I am afraid... :( Here, I thought with WAL and
all (bad pun :), I would not need fsync anymore and decided to be
reckless. Guess I ought to reconsider that decision.... Though wasn't WAL
supposed to remove the need for fsync, or was it just to improve recovery
ability? Anyway, if that is the root of the problem (very bad timing on a
system crash), then I will consider this problem solved. Thanks for
everyone's help.




Re: Recovery of PGSQL after system crash failing!!!

From
Thomas Lockhart
Date:
>         Guilty as charged I am afraid... :( Here, I thought with WAL and
> all (bad pun :), I would not need fsync anymore and decided to be
> reckless. Guess I ought to reconsider that decision.... Though wasn't WAL
> supposed to remove the need for fsync, or was it just to improve recovery
> ability?

It removes the need to disable fsync to get best performance! The
converse is not true; it does not eliminate the need to fsync to help
guard data integrity, and the WAL file management may be a bit less
robust than that for other tables. I can see how this might have been
omitted from much of the discussion, so it is important that we remind
ourselves about this. Thanks for the reminder :/

Since there is a fundamental recovery problem if the WAL file
disappears, then perhaps we should have a workaround which can ignore
the requirement for that file on startup? Or maybe we do already?
Vadim??

Also, could the "-F" option be disabled now that WAL is enabled? Or is
there still some reason to encourage/allow folks to use it?
                       - Thomas


Re: Re: Recovery of PGSQL after system crash failing!!!

From
Bruce Momjian
Date:
> Since there is a fundamental recovery problem if the WAL file
> disappears, then perhaps we should have a workaround which can ignore
> the requirement for that file on startup? Or maybe we do already?
> Vadim??
> 
> Also, could the "-F" option be disabled now that WAL is enabled? Or is
> there still some reason to encourage/allow folks to use it?

The system still fsyncs, so -F is still useful, I think.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Re: Recovery of PGSQL after system crash failing!!!

From
Tom Lane
Date:
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Also, could the "-F" option be disabled now that WAL is enabled? Or is
> there still some reason to encourage/allow folks to use it?

I was the one who put it back in after Vadim turned it off ;-) ... and
I'll object to any attempt to remove the option.

I think that there's no longer any good reason for people to consider -F
in production use.  On the other hand, for development or debugging work
where you don't really *care* about powerfail survivability, I see no
reason to incur extra wear on your disk drives by forcing fsyncs.  My
drives only have so many seeks left in 'em, and I'd rather see those
seeks expended on writing source-code files than on fsyncs of test
databases.
        regards, tom lane


Re: Re: Recovery of PGSQL after system crash failing!!!

From
Peter Eisentraut
Date:
Tom Lane writes:

> Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > there still some reason to encourage/allow folks to use it?
>
> I was the one who put it back in after Vadim turned it off ;-) ... and
> I'll object to any attempt to remove the option.

The description should be updated though:
http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL

I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
performance with fsync turned on" and extrapolated "Imagine what it can do
if I turn off fsync anyway."

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: Re: Recovery of PGSQL after system crash failing!!!

From
Bruce Momjian
Date:
> Tom Lane writes:
> 
> > Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > > there still some reason to encourage/allow folks to use it?
> >
> > I was the one who put it back in after Vadim turned it off ;-) ... and
> > I'll object to any attempt to remove the option.
> 
> The description should be updated though:
> http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL
> 
> I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
> performance with fsync turned on" and extrapolated "Imagine what it can do
> if I turn off fsync anyway."

That is a very subtle point, and one I can imagine many people
incorrectly assuming.

 


Re: Re: Recovery of PGSQL after system crash failing!!!

From
Ryan Kirkpatrick
Date:
On Wed, 14 Feb 2001, Peter Eisentraut wrote:

> Tom Lane writes:
> 
> > Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> > > Also, could the "-F" option be disabled now that WAL is enabled? Or is
> > > there still some reason to encourage/allow folks to use it?
> >
> > I was the one who put it back in after Vadim turned it off ;-) ... and
> > I'll object to any attempt to remove the option.
> 
> The description should be updated though:
> http://www.postgresql.org/devel-corner/docs/postgres/runtime-config.htm#RUNTIME-CONFIG-GENERAL
> 
> I guess a lot of people have heard the rumour "PG 7.1 offers no-fsync
> performance with fsync turned on" and extrapolated "Imagine what it can do
> if I turn off fsync anyway."
That is exactly what I did... Figured that with WAL removing the
need for fsync, it wasn't needed and could be disabled for a nice
performance increase. Now I am quite a bit wiser, and will be leaving
fsyncing enabled on all 7.1 production servers. :)

Thank you for bringing that subtle point out and yes, the documentation
could do with a bit of help on this point. TTYL.
