Thread: Tips for investigating DBMS failure

Tips for investigating DBMS failure

From
Karl DeBisschop
Date:
We had a production DBMS die in a failrly dramitica way this morning. I
suspect the cause may have been interactions between lock and pg_dump,
but I'm a long way off from being able to say that with any certainty
at all.  In the end, I moved the data directroy off to the side and
restored all of if the data into a newly initiallized data directory.

The databases are backed up fairly frequently, but there is likely to
be some data since the previous backup that I'd like to recover. I
have tried starting a postmaster on a different port to get the data,
but I just can't seem to make any headway. Postmater runs, and I can
get copious debugging text, but psql always complains that the backend
failed unexpectedly whenever I try to make any sort of connection.

Can anyone give me general tips on how I might recover this data?
(PostgreSQL 7.0-2, redhat 6.2smp on a dual processor netfinity).

If it's of any use, I have attached the debug output - I don't see
where the problem is occurring, but I don't pretend to fully understand
the output, other than the exit is status 25. Someone with just a little
bit of insight would probably do alot better with this than I.

>>>>> debug
postmaster: ServerLoop:         handling reading 4
postmaster: ServerLoop:         handling reading 4
postmaster: ServerLoop:         handling writing 4
postmaster: BackendStartup: environ dump:
-----------------------------------------
        USERNAME=
        HISTSIZE=1000
        HISTFILESIZE=1000
        SSH_TTY=/dev/pts/2
        MAIL=/var/spool/mail/kdebisschop
        TERM=xterm
        HOSTTYPE=i386
        PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:
        INPUTRC=/etc/inputrc
        SHELL=/bin/bash
        PS1=[\u@\h \W]\$
        USER=postgres
        OSTYPE=Linux
        SHLVL=1
        _=/usr/bin/sudo
        SUDO_COMMAND=/usr/bin/postmaster -d 99 -D /disk/1/pgsql/data1 -p 5433
        POSTPORT=5433
        POSTID=2147483644
        PGDATA=/disk/1/pgsql/data1
        IPC_KEY=5433100
        PG_USER=postgres
-----------------------------------------

postmaster: BackendStartup: pid 4532 user postgres db template1 socket 4
postmaster child[4532]: starting with (/usr/bin/postgres -d99 -v131072 -p template1 )
001016.12:36:23.302  [4532] read_pg_options: verbose=2,query=2,syslog=0
FindExec: found "/usr/bin/postgres" using argv[0]
001016.12:36:23.303  [4532] started: host=localhost user=postgres database=template1
001016.12:36:23.303  [4532] InitPostgres
001016.12:36:23.333  [4532] reset_client_encoding()..
001016.12:36:23.333  [4532] reset_client_encoding() done.
001016.12:36:23.333  [4532] StartTransactionCommand
001016.12:36:23.333  [4532] query: select getdatabaseencoding()
001016.12:36:23.337  [4532] parser outputs:

{ QUERY :command 1  :utility <> :resultRelation 0 :into <> :isPortal false :isBinary false :isTemp false :unionall
false:distinctClause <> :sortClause <> :rtable <> :targetlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 19
:restypmod-1 :resname getdatabaseencoding :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { EXPR
:typeOid19  :opType func :oper { FUNC :funcid 1039 :functype 19 :funcisindex false :funcsize 0  :func_fcache @ 0x0
:func_tlist({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 19 :restypmod -1 :resname \<noname> :reskey 0 :reskeyop 0
:ressortgroupref0 :resjunk false } :expr { VAR :varno -1 :varattno 1 :vartype 19 :vartypmod -1  :varlevelsup 0
:varnoold-1 :varoattno 1}}) :func_planlist <>} :args <>}}) :qual <> :groupClause <> :havingQual <> :hasAggs false
:hasSubLinksfalse :unionClause <> :intersectClause <> :limitOffset <> :limitCount <> :rowMark <>} 

001016.12:36:23.337  [4532] after rewriting:
{ QUERY
   :command 1
   :utility <>
   :resultRelation 0
   :into <>
   :isPortal false
   :isBinary false
   :isTemp false
   :unionall false
   :distinctClause <>
   :sortClause <>
   :rtable <>
   :targetlist (

       { TARGETENTRY
      :resdom
         { RESDOM
         :resno 1
         :restype 19
         :restypmod -1
         :resname getdatabaseencoding
         :reskey 0
         :reskeyop 0
         :ressortgroupref 0
         :resjunk false
         }

      :expr
         { EXPR
         :typeOid 19
         :opType func
         :oper
            { FUNC
            :funcid 1039
            :functype 19
            :funcisindex false
            :funcsize 0
            :func_fcache @ 0x0
            :func_tlist (
               { TARGETENTRY
               :resdom
                  { RESDOM
                  :resno 1
                  :restype 19
                  :restypmod -1
                  :resname \<noname>
                  :reskey 0
                  :reskeyop 0
                  :ressortgroupref 0
                  :resjunk false
                  }

               :expr
                  { VAR
                  :varno -1
                  :varattno 1
                  :vartype 19
                  :vartypmod -1
                  :varlevelsup 0
                  :varnoold -1
                  :varoattno 1
                  }
               }
            )

            :func_planlist <>
            }

         :args <>
         }
      }
   )

   :qual <>
   :groupClause <>
   :havingQual <>
   :hasAggs false
   :hasSubLinks false
   :unionClause <>
   :intersectClause <>
   :limitOffset <>
   :limitCount <>
   :rowMark <>
   }


001016.12:36:23.339  [4532] plan:

{ RESULT :startup_cost 0.00 :total_cost 0.00 :rows 0 :width 0 :state <> :qptargetlist ({ TARGETENTRY :resdom { RESDOM
:resno1 :restype 19 :restypmod -1 :resname getdatabaseencoding :reskey 0 :reskeyop 0 :ressortgroupref 0 :resjunk false
}:expr { EXPR :typeOid 19  :opType func :oper { FUNC :funcid 1039 :functype 19 :funcisindex false :funcsize 0
:func_fcache@ 0x0 :func_tlist ({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 19 :restypmod -1 :resname \<noname>
:reskey0 :reskeyop 0 :ressortgroupref 0 :resjunk false } :expr { VAR :varno -1 :varattno 1 :vartype 19 :vartypmod -1
:varlevelsup0 :varnoold -1 :varoattno 1}}) :func_planlist <>} :args <>}}) :qpqual <> :lefttree <> :righttree <> :extprm
():locprm () :initplan <> :nprm 0  :resconstantqual <>} 

001016.12:36:23.339  [4532] ProcessQuery
001016.12:36:23.339  [4532] CommitTransactionCommand
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 4532 exited with status 25
Server process (pid 4532) exited with status 25 at Mon Oct 16 12:36:23 2000
Terminating any active server processes...
Server processes were terminated at Mon Oct 16 12:36:23 2000
Reinitializing shared memory and semaphores
001016.12:36:23.340  [4508] shmem_exit(0)
binding ShmemCreate(key=52e771, size=1104896)
001016.12:36:23.343  [4533] DEBUG:  Data Base System is starting up at Mon Oct 16 12:36:23 2000
001016.12:36:23.343  [4533] DEBUG:  Data Base System was interrupted being in production at Mon Oct 16 12:35:44 2000
001016.12:36:23.365  [4533] DEBUG:  Data Base System is in production state at Mon Oct 16 12:36:23 2000
001016.12:36:23.365  [4533] proc_exit(0)
001016.12:36:23.365  [4533] shmem_exit(0)
001016.12:36:23.365  [4533] exit(0)
postmaster: reaping dead processes...
<<<<<<<< debug

Many thanks for any insight.

--
Karl DeBisschop                        kdebisschop@alert.infoplease.com
Learning Network Reference             http://www.infoplease.com
Netsaint Plugin Developer              kdebisschop@users.sourceforge.net

Re: Tips for investigating DBMS failure

From
Tom Lane
Date:
Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> If it's of any use, I have attached the debug output - I don't see
> where the problem is occurring, but I don't pretend to fully understand
> the output, other than the exit is status 25.

Signal 25??  What is that on your machine?  (see /usr/include/signal.h)
On mine it's SIGTSTP, which doesn't make any sense --- for one thing,
that doesn't cause a process exit, only suspension ...

            regards, tom lane

Re: Tips for investigating DBMS failure

From
Vince Vielhaber
Date:
On Mon, 16 Oct 2000, Tom Lane wrote:

> Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> > If it's of any use, I have attached the debug output - I don't see
> > where the problem is occurring, but I don't pretend to fully understand
> > the output, other than the exit is status 25.
>
> Signal 25??  What is that on your machine?  (see /usr/include/signal.h)
> On mine it's SIGTSTP, which doesn't make any sense --- for one thing,
> that doesn't cause a process exit, only suspension ...

That's odd.  Here SIGTSTP is 18.  25 is SIGXFSZ (exceeded file size
limit).

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
 128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
        Online Campground Directory    http://www.camping-usa.com
       Online Giftshop Superstore    http://www.cloudninegifts.com
==========================================================================