Thread: random backend crashes - how to debug ( Is crash dump handler released ? )

From: BangarRaju Vadapalli

Hi Everybody,

We are using PostgreSQL 8.4 and are experiencing random backend crashes. We have enabled logging and can see some output in the pg_log directory, but it is not of much use. Here are the logs:

2011-06-14 18:06:04 IST WARNING:  terminating connection because of crash of another server process

2011-06-14 18:06:04 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2011-06-14 18:06:04 IST HINT:  In a moment you should be able to reconnect to the database and repeat your command.

2011-06-14 18:06:04 IST WARNING:  terminating connection because of crash of another server process

2011-06-14 18:06:04 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2011-06-14 18:06:04 IST HINT:  In a moment you should be able to reconnect to the database and repeat your command.

2011-06-14 18:06:04 IST WARNING:  terminating connection because of crash of another server process

2011-06-14 18:06:04 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2011-06-14 18:06:04 IST HINT:  In a moment you should be able to reconnect to the database and repeat your command.

2011-06-14 18:06:04 IST WARNING:  terminating connection because of crash of another server process

2011-06-14 18:06:04 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

I searched online and found that a crash dump handler has been proposed and, if I am not wrong, a patch for it has already been released. Could anyone please detail the steps to install the crash dump handler on Windows? Could you also help me with ways to debug crashes like the ones shown above?

Thanks,

Bangar Raju

Re: random backend crashes - how to debug ( Is crash dump handler released ? )

From: Merlin Moncure

On Tue, Jun 14, 2011 at 9:26 AM, BangarRaju Vadapalli
<BangarRaju.Vadapalli@infor.com> wrote:
>   I searched online and found crash dump handler idea has been proposed and
> patch for that has already been released if I am not wrong. Could anyone
> please detail the steps to install crash dump handler in windows? Also could
> you please help me with the ways to debug the crashes happening as shown
> above.

Right. Well, are you running any third-party code, such as C functions
or external modules? Is it practical to log queries from the client? On
the server?

Which exact version of postgres 8.4 are you running?  How often do you
see the crashes?

merlin

On 06/14/2011 10:26 PM, BangarRaju Vadapalli wrote:
> Hi Everybody,
>
> We are using PostGRE 8.4 version and experiencing random backend
> crashes. We have enabled logging and are able to see some logging
> happening in pg_log directory but not of much use. Here are the logs.

Thank you for collecting the logs and including your version. A little
more information would be helpful, such as the exact version, your OS
and architecture, etc. See this link for a list of suggested information:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

> I searched online and found crash dump handler idea has been proposed
> and patch for that has already been released if I am not wrong.

It is integrated into PostgreSQL 9.0 as a core part of the server.

There's no reason it can't be compiled for PostgreSQL 8.4, though I've
never tested that. It shouldn't take long, so I'll give it a go and get
back to you.

> Could
> anyone please detail the steps to install crash dump handler in windows?
> Also could you please help me with the ways to debug the crashes
> happening as shown above.

The first thing to do is to try to figure out if they're really random,
or if they're related to a particular query or event. Enable more
detailed logging in PostgreSQL - at least query logging, possibly also
additional debug levels - and examine the logs to see if you can find a
pattern.
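
As a starting point for that, a short script can print whatever was logged just before each PANIC in a pg_log file. This is a minimal Python sketch, not a PostgreSQL tool: the function name context_before is made up for illustration, and it assumes a log_line_prefix like the one in the logs later in this thread, where the timestamp, zone, user and database come before the severity keyword.

```python
import re
from collections import deque
from typing import Iterable, List, Tuple

# Severity keyword as it appears after the log_line_prefix (assumed layout), e.g.
# "2011-06-15 13:55:59 IST postgres epimart PANIC:  XX000: ..."
SEVERITY_RE = re.compile(r"\b(DEBUG|LOG|INFO|NOTICE|WARNING|ERROR|FATAL|PANIC):")


def context_before(lines: Iterable[str], trigger: str = "PANIC",
                   before: int = 10) -> List[Tuple[int, List[str], str]]:
    """For every line logged at `trigger` severity, return a tuple of
    (line number, up to `before` preceding lines, the matching line)."""
    recent: deque = deque(maxlen=before)  # sliding window of earlier lines
    hits = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # also drops the stray carriage returns
        m = SEVERITY_RE.search(line)
        if m is not None and m.group(1) == trigger:
            hits.append((lineno, list(recent), line))
        recent.append(line)
    return hits
```

For example, open a pg_log file with encoding="utf-8", errors="replace" and pass the file object as lines. Raising before, or setting trigger="ERROR", widens the net when looking for a statement that recurs before each crash.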

--
Craig Ringer

On 15/06/2011 7:50 AM, Craig Ringer wrote:
> There's no reason it can't be compiled for PostgreSQL 8.4, though I
> never tested that. It shouldn't take long so I'll give it a go and get
> back to you.

Okies. I've built a version for 8.4.

You can download it (32-bit only) from:

http://www.postnewspapers.com.au/~craig/webfiles/crashdump_pg_84_32bit/crashdump.dll

If you're on Windows XP or Vista you will also need:

http://www.postnewspapers.com.au/~craig/webfiles/crashdump_pg_84_32bit/dbghelp.dll

... and since this DLL was compiled with VC++ 2008, you'll need the
Visual C++ 2008 redistributable installed if you don't already have it:

http://www.microsoft.com/downloads/en/details.aspx?familyid=9b2da534-3e03-4391-8a4d-074b9f2bc1bf&displaylang=en

Put crashdump.dll and dbghelp.dll into
   C:\Program Files\PostgreSQL\8.4\lib

Edit postgresql.conf, uncomment shared_preload_libraries if it's
commented out, and add 'crashdump' to it, e.g.:

   shared_preload_libraries = 'crashdump'
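
If you need to make the same edit on several machines, it can be scripted. A minimal sketch, with add_preload_library as a hypothetical helper: it assumes the stock postgresql.conf layout, where shared_preload_libraries appears once, possibly commented out, with a single-quoted value. Keep a backup of the file before writing the result back.

```python
import re

# Matches the shared_preload_libraries line whether or not it is commented out,
# capturing the quoted value and any trailing comment, e.g.
#   #shared_preload_libraries = ''		# (change requires restart)
SPL_RE = re.compile(r"^\s*#?\s*shared_preload_libraries\s*=\s*'([^']*)'(.*)$")


def add_preload_library(conf_text: str, lib: str = "crashdump") -> str:
    """Return conf_text with `lib` added to an uncommented shared_preload_libraries."""
    out = []
    found = False
    for line in conf_text.splitlines(keepends=True):
        m = None if found else SPL_RE.match(line)  # only edit the first match
        if m is not None:
            found = True
            libs = [x.strip() for x in m.group(1).split(",") if x.strip()]
            if lib not in libs:
                libs.append(lib)
            eol = "\n" if line.endswith("\n") else ""
            line = "shared_preload_libraries = '%s'%s%s" % (
                ", ".join(libs), m.group(2), eol)
        out.append(line)
    if not found:
        # Setting absent entirely: append it on a line of its own.
        if out and not out[-1].endswith("\n"):
            out.append("\n")
        out.append("shared_preload_libraries = '%s'\n" % lib)
    return "".join(out)
```

Running it twice is harmless: if 'crashdump' is already listed, the line is rewritten unchanged.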

Create a folder called "crashdumps" inside the data directory, at the
same level as the "pg_log", "pg_xlog", "base", etc. directories. Open the
properties of the new "crashdumps" directory and, in the Security tab,
give the "postgres" user "Full Control". Save your changes.

Stop and start the postgresql-8.4 service from Start->Run->services.msc.
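
The folder and permission steps can also be scripted, which helps if you're setting up more than one server. A minimal Python sketch under stated assumptions: make_crashdump_dir and grant_full_control_cmd are hypothetical helpers, the data directory in the example comment is the installer default and may differ on your machine, and the icacls grant is one way to give the postgres account Full Control on the folder and anything created inside it, like the Security tab change.

```python
import os
from typing import List


def make_crashdump_dir(data_dir: str) -> str:
    """Create the "crashdumps" folder inside the data directory (next to
    pg_log, pg_xlog and base) if it does not exist yet; return its path."""
    path = os.path.join(data_dir, "crashdumps")
    os.makedirs(path, exist_ok=True)
    return path


def grant_full_control_cmd(path: str, user: str = "postgres") -> List[str]:
    """icacls command giving `user` Full Control (F) of `path`; (OI)(CI) makes
    the grant inherit to files and subfolders created later, i.e. the dumps."""
    return ["icacls", path, "/grant", user + ":(OI)(CI)F"]


# Example, run from an elevated prompt on the database server (the data
# directory below is an assumed installer default):
#   import subprocess
#   path = make_crashdump_dir(r"C:\Program Files\PostgreSQL\8.4\data")
#   subprocess.run(grant_full_control_cmd(path), check=True)
```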



You should now have a working crash dump handler. To test it, run:

CREATE FUNCTION crashdump_crashme() RETURNS void AS 'crashdump.dll'
LANGUAGE 'C';

then invoke it to crash your database system:

SELECT crashdump_crashme();

If all goes well (heh) you'll lose your connection and the server will
crash and - hopefully - restart. If it doesn't restart, relaunch it
manually using services.msc.

You should now see a file in the "crashdumps" folder. You can email it to
me directly and I'll extract a backtrace.
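
To pick out the dump from a particular run (for example the batch job that crashes), you can note the time before starting it and list only dumps written afterwards. A minimal sketch with a hypothetical helper name; the .mdmp extension is an assumption based on the handler that was merged into 9.1, so check what this 8.4 build actually writes.

```python
import glob
import os
from typing import List


def dumps_newer_than(crashdump_dir: str, since: float) -> List[str]:
    """Dump files in crashdump_dir modified after `since` (seconds since the
    epoch, e.g. a time.time() taken before the batch run), oldest first."""
    # Assumption: dumps are written with the .mdmp extension, as the handler
    # merged into 9.1 does; adjust the pattern if this build names them differently.
    paths = glob.glob(os.path.join(crashdump_dir, "*.mdmp"))
    return sorted((p for p in paths if os.path.getmtime(p) > since),
                  key=os.path.getmtime)
```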

Alternatively, if you want to get the backtrace yourself, you will need
to set up your _NT_SYMBOL_PATH environment variable to match your
install, as per the instructions here:


http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows#Configuring_the_symbol_path
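
For reference, _NT_SYMBOL_PATH is a semicolon-separated list, and an srv*<cache>*<server> element tells the debugger to fetch Microsoft's own symbols and cache them locally. A minimal sketch that builds such a value; the helper name, the cache directory and the PostgreSQL symbols directory used as examples are assumptions, so follow the wiki page above for the directories that match your install.

```python
from typing import Iterable

# Microsoft's public symbol server (symbols for Windows system DLLs).
MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def nt_symbol_path(cache_dir: str, local_dirs: Iterable[str] = ()) -> str:
    """Build a value for the _NT_SYMBOL_PATH environment variable.

    Local directories holding PostgreSQL's .pdb files are searched first;
    the srv*<cache>*<server> element downloads Microsoft symbols on demand
    and keeps a copy in cache_dir."""
    parts = list(local_dirs)
    parts.append("srv*%s*%s" % (cache_dir, MS_SYMBOL_SERVER))
    return ";".join(parts)
```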

You will then be able to open it for debugging using Microsoft Visual
Studio 2008 Express Edition (or any paid Visual Studio edition). Once it
is open, right-click on the dump file in the left bar and choose "debug
new instance". Alternatively, you can use windbg.exe from Debugging Tools
for Windows to analyse the dump as per the instructions here:

http://archives.postgresql.org/message-id/4CAB4294.2070104@postnewspapers.com.au
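
If you prefer a plain text backtrace over the windbg GUI, cdb.exe, the console debugger installed alongside windbg.exe, can be driven from a script. A minimal sketch that only builds the command line; the helper name and the cdb.exe location you pass in are assumptions, and .ecxr only works if the dump includes the exception record.

```python
from typing import List


def cdb_backtrace_cmd(cdb_exe: str, dump_file: str, symbol_path: str) -> List[str]:
    """Command line that opens a dump in cdb and prints a backtrace.

    -y sets the symbol path, -z opens a crash dump, and -c runs commands on
    start: .ecxr switches to the context of the recorded exception, kp prints
    that thread's stack with function parameters, and q quits."""
    return [cdb_exe, "-y", symbol_path, "-z", dump_file, "-c", ".ecxr;kp;q"]
```

On the Windows machine, run the result with subprocess.run(cmd, capture_output=True, text=True) and read stdout.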


You'll probably want to

   DROP FUNCTION crashdump_crashme();

after running your crashme test, though it's harmless if left in place
so long as it's not invoked. Dropping the crashme function doesn't
affect the crashdump handler; it's loaded by shared_preload_libraries
and will remain in place.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

On 15/06/2011 7:50 AM, Craig Ringer wrote:

>> I searched online and found crash dump handler idea has been proposed
>> and patch for that has already been released if I am not wrong.
>
> It is integrated into PostgreSQL 9.0 as a core part of the server.

Correction: it's in 9.1, not 9.0. Whoops, I should know that!

--
Craig Ringer

Re: random backend crashes - Needed Information included

From: Craig Ringer

On 06/15/2011 05:52 PM, BangarRaju Vadapalli wrote:

> 7. Attached the database side and application side logs.

Yeah, there's definitely a crash; you just chopped it off in the
abbreviated logs you sent earlier. With this volume of logging, that's
easy enough to do.

Anyway, the crash is:

2011-06-15 13:55:59 IST postgres epimart PANIC:  XX000: cannot abort
transaction 19773146, it was already committed

from:

2011-06-15 13:55:59 IST postgres epimart ERROR:  XX000: could not open
relation base/2850136/3344343_vm: A blocking operation was interrupted
by a call to WSACancelBlockingCall.
2011-06-15 13:55:59 IST postgres epimart LOCATION:  mdopen,
.\src\backend\storage\smgr\md.c:526
2011-06-15 13:55:59 IST postgres epimart STATEMENT:  COMMIT
2011-06-15 13:55:59 IST postgres epimeta DEBUG:  00000: CommitTransaction
2011-06-15 13:55:59 IST postgres epimeta LOCATION:
ShowTransactionState, .\src\backend\access\transam\xact.c:4074
2011-06-15 13:55:59 IST postgres epimart WARNING:  01000:
AbortTransaction while in COMMIT state
2011-06-15 13:55:59 IST postgres epimart LOCATION:  AbortTransaction,
.\src\backend\access\transam\xact.c:2011
2011-06-15 13:55:59 IST postgres epimart DEBUG:  00000: StartTransaction
2011-06-15 13:55:59 IST postgres epimart LOCATION:
ShowTransactionState, .\src\backend\access\transam\xact.c:4074
2011-06-15 13:55:59 IST postgres epimart PANIC:  XX000: cannot abort
transaction 19773146, it was already committed
2011-06-15 13:55:59 IST postgres epimart LOCATION:
RecordTransactionAbort, .\src\backend\access\transam\xact.c:1200


I remember some known issues that caused panics with "cannot abort
transaction" errors, but I believe they were related to autovacuum,
which this doesn't seem to involve.

I'd still recommend updating to 8.4.8, as 8.4.2 was patched six more
times for good reasons.
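
Checking exactly which update an installation is running only needs the text returned by SELECT version(). A minimal sketch with hypothetical helper names; the sample strings in the tests are illustrative of Windows builds rather than copied from the OP's server.

```python
import re
from typing import Tuple

Release = Tuple[int, int, int]


def parse_pg_version(version_output: str) -> Release:
    """Pull X.Y.Z out of the text returned by SELECT version()."""
    m = re.match(r"PostgreSQL (\d+)\.(\d+)\.(\d+)", version_output)
    if m is None:
        raise ValueError("unrecognised version string: %r" % version_output)
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)))


def updates_behind(installed: Release, latest: Release) -> int:
    """Number of bug-fix updates between two releases of the same branch.
    Before PostgreSQL 10 the branch is the first two numbers (here 8.4)."""
    if installed[:2] != latest[:2]:
        raise ValueError("releases are on different branches")
    return max(0, latest[2] - installed[2])
```

With 8.4.2 installed and 8.4.8 current, updates_behind returns 6, matching the six later releases mentioned above.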

--
Craig Ringer

On 06/14/2011 10:26 PM, BangarRaju Vadapalli wrote:
> Hi Everybody,
>
> We are using PostGRE 8.4 version and experiencing random backend
> crashes. We have enabled logging and are able to see some logging
> happening in pg_log directory but not of much use. Here are the logs.

- Examination of the full-length logs sent off-list shows these lines
leading up to the crash:

(crash1): 2011-06-15 13:55:59 IST postgres epimart ERROR:  XX000: could
not open relation base/2850136/3344343_vm: A blocking operation was
interrupted by a call to WSACancelBlockingCall.

(crash2): 2011-06-15 14:22:40 IST postgres epimart ERROR:  XX000: could
not open relation base/2850136/3352537_fsm: A blocking operation was
interrupted by a call to WSACancelBlockingCall.

... in both cases followed by:

XX000: cannot abort transaction 19859931, it was already committed

then a Windows runtime message reporting that the backend crashed.

Ideas?
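
Both files in these errors are relation forks rather than main table files: in 8.4 the _vm and _fsm suffixes name a table's visibility map and free space map. A small parser makes that explicit and gives the numbers needed to find the table. This is a hedged sketch; parse_relpath and RelFile are hypothetical names, and it assumes the base/<database OID>/<relfilenode>[_fork][.segment] file layout.

```python
import re
from typing import NamedTuple

FORKS = {None: "main", "fsm": "free space map", "vm": "visibility map"}
# Finds base/<db oid>/<relfilenode>[_fork][.segment] anywhere in a string,
# so a whole "could not open relation ..." log line can be passed in.
REL_RE = re.compile(r"base/(\d+)/(\d+)(?:_(fsm|vm))?(?:\.(\d+))?\b")


class RelFile(NamedTuple):
    db_oid: int
    relfilenode: int
    fork: str
    segment: int


def parse_relpath(text: str) -> RelFile:
    m = REL_RE.search(text.replace("\\", "/"))  # accept Windows separators
    if m is None:
        raise ValueError("no base/<db>/<relfilenode> path in %r" % text)
    db, node, fork, seg = m.groups()
    return RelFile(int(db), int(node), FORKS[fork], int(seg or 0))

# To name the table (if it still exists), connect to the database with that OID
#   SELECT datname FROM pg_database WHERE oid = 2850136;
# and run there:
#   SELECT relname FROM pg_class WHERE relfilenode = 3344343;
```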



After some off-list conversation, we've established that:

- The crash is part of a batch process where the OP loads data from
external sources. It's the 11th stage of a 12-stage process, and cannot
be easily separated into a small self-contained test case. If the OP
runs just the 11th stage standalone, without having just run the prior
stages, the crash does not occur. It only crashes if the whole process
is run in one go.

(OP: please confirm that my summary is accurate, as it's condensed from
several emails).

- The crash is reproducible on 9.0. It hasn't yet been reproduced on
9.1 because the OP is having some problems with views on 9.1 that he'll
be posting about separately.

- We can't seem to get a crash dump or attach a debugger to get a
backtrace. I built a copy of the early version of the crash dump
handler before it was integrated into 9.1 so he could load it as a DLL
into 8.4, and it works when the backend is intentionally crashed but
doesn't capture the crash that's causing the problem.

- I still haven't been able to confirm how the batch process in question
works. Does it all run in a single connection with a single transaction?
Or is it a multi-connection affair with multiple scripts / programs
involved? If Bangar Raju could describe this part in more detail that
would be very helpful.

--
Craig Ringer

Re: random backend crashes - how to debug ( Is crash dump handler released ? )

From: BangarRaju Vadapalli

Answers below.

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: Sunday, June 19, 2011 3:03 PM
To: BangarRaju Vadapalli
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] random backend crashes - how to debug ( Is crash dump handler released ? )

On 06/14/2011 10:26 PM, BangarRaju Vadapalli wrote:
> Hi Everybody,
>
> We are using PostGRE 8.4 version and experiencing random backend
> crashes. We have enabled logging and are able to see some logging
> happening in pg_log directory but not of much use. Here are the logs.

- Examination of the full length logs sent off-list shows these lines leading up to the crash:

(crash1): 2011-06-15 13:55:59 IST postgres epimart ERROR:  XX000: could not open relation base/2850136/3344343_vm: A blocking operation was interrupted by a call to WSACancelBlockingCall.

(crash2): 2011-06-15 14:22:40 IST postgres epimart ERROR:  XX000: could not open relation base/2850136/3352537_fsm: A blocking operation was interrupted by a call to WSACancelBlockingCall.

... in both cases followed by:

XX000: cannot abort transaction 19859931, it was already committed

then a Windows runtime message reporting that the backend crashed.

Ideas?



After some off-list conversation, we've established that:

- The crash is part of a batch process where the OP loads data from external sources. It's the 11th stage of a 12-stage process, and cannot be easily separated into a small self-contained test case. If the OP runs just the 11th stage standalone, without having just run the prior stages, the crash does not occur. It only crashes if the whole process is run in one go.

Bangar - correct...

(OP: please confirm that my summary is accurate, as it's condensed from several emails).

- The crash is reproducible on 9.0 . It hasn't yet been reproduced on
9.1 because the OP is having some problems with views on 9.1 that he'll be posting about separately.

Bangar - I have posted the question...

- We can't seem to get a crash dump or attach a debugger to get a backtrace. I built a copy of the early version of the crash dump handler before it was integrated into 9.1 so he could load it as a DLL into 8.4, and it works when the backend is intentionally crashed but doesn't capture the crash that's causing the problem.

- I still haven't been able to confirm how the batch process in question works. Does it all run in a single connection with a single transaction? Or is it a multi-connection affair with multiple scripts / programs involved? If Bangar Raju could describe this part in more detail that would be very helpful.

Bangar - The process involves multiple connections and multiple transactions, and our code has both Java and native layers, using the PostgreSQL JDBC driver and the libpq library...

--
Craig Ringer