Re: Core dump - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Core dump
Msg-id 27214.971381455@sss.pgh.pa.us
In response to Core dump  (Dan Moschuk <dan@freebsd.org>)
Responses Re: Core dump  (Dan Moschuk <dan@freebsd.org>)
List pgsql-hackers
Dan Moschuk <dan@freebsd.org> writes:
> Sparc solaris 2.7 with postgres 7.0.2
> It seems to be reproducible, the server crashes on us at a rate of about
> every few hours.

That's a very bizarre backtrace.  Why the multiple levels of recursive
entry to the quickdie() signal handler?  I wonder if you aren't looking
at some kind of Solaris bug --- perhaps it's not able to cope with a
signal handler turning around and issuing new kernel calls.

The core file you are looking at is probably *not* from the original
failure, whatever that is.  The sequence is probably

1. Some backend crashes for unknown reason, dumping core.

2. Postmaster observes messy death of a child, decides that mass suicide
   followed by restart is called for.  Postmaster sends SIGUSR1 to all
   remaining backends to make them commit hara-kiri.

3. One or more other backends crash trying to obey postmaster's command.
   The corefile left for you to examine comes from whichever crashed last.

So there are at least two problems here, but we only have evidence of
the second one.

Since the problem is fairly reproducible, I'd suggest you temporarily
dike out the elog(NOTICE) call in quickdie() (in
src/backend/tcop/postgres.c), which will probably allow the backends
to honor SIGUSR1 without dumping core.  Then you have a shot at seeing
the core from the original failure.

Assuming that this works (ie, you find a core that's not got anything
to do with quickdie()), I'd suggest an inquiry to Sun about whether
their signal handler logic hasn't got a problem with write() issued
from inside a signal handler.  Meanwhile let us know what the new
backtrace shows.
        regards, tom lane

