
Re: Core dump - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Core dump
Date	October 12, 2000 16:11:08
Msg-id	27214.971381455@sss.pgh.pa.us
In response to	Core dump (Dan Moschuk <dan@freebsd.org>)
Responses	Re: Core dump
List	pgsql-hackers

Dan Moschuk <dan@freebsd.org> writes:
> Sparc solaris 2.7 with postgres 7.0.2
> It seems to be reproducible, the server crashes on us at a rate of about
> every few hours.

That's a very bizarre backtrace. Why the multiple levels of recursive
entry to the quickdie() signal handler? I wonder if you aren't looking
at some kind of Solaris bug --- perhaps it's not able to cope with a
signal handler turning around and issuing new kernel calls.

The core file you are looking at is probably *not* from the original
failure, whatever that is. The sequence is probably

1. Some backend crashes for unknown reason, dumping core.

2. Postmaster observes messy death of a child, decides that mass
   suicide followed by restart is called for. Postmaster sends SIGUSR1
   to all remaining backends to make them commit hara-kiri.

3. One or more other backends crash trying to obey postmaster's
   command. The corefile left for you to examine comes from whichever
   crashed last.
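
For illustration, steps 1 and 2 can be mimicked with a small standalone C program. This is only a sketch under stated assumptions, not postmaster code: simulate_crash_cascade() and NUM_BACKENDS are hypothetical names, the children do no real work, and they keep the default SIGUSR1 action (terminate) instead of installing a quickdie()-style handler, so step 3 is not reproduced.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_BACKENDS 3          /* arbitrary count, for illustration only */

/*
 * Hypothetical demo, not PostgreSQL code.  Forks NUM_BACKENDS children,
 * lets the first one die messily, then has the parent tell the others
 * to quit with SIGUSR1.  Returns how many survivors were terminated by
 * SIGUSR1, or -1 on error.
 */
int simulate_crash_cascade(void)
{
    pid_t kids[NUM_BACKENDS];
    int   status;
    int   told_to_quit = 0;
    int   i;

    for (i = 0; i < NUM_BACKENDS; i++)
    {
        kids[i] = fork();
        if (kids[i] < 0)
        {
            /* fork failed: clean up the children already started */
            while (--i >= 0)
            {
                kill(kids[i], SIGKILL);
                waitpid(kids[i], NULL, 0);
            }
            return -1;
        }
        if (kids[i] == 0)
        {
            if (i == 0)
                abort();        /* step 1: one "backend" crashes */
            for (;;)
                pause();        /* the others idle until signalled */
        }
    }

    /* Step 2: the parent observes that a child was killed by a signal. */
    int messy = waitpid(kids[0], &status, 0) == kids[0] && WIFSIGNALED(status);

    for (i = 1; i < NUM_BACKENDS; i++)
    {
        /* messy death: order the survivors to quit; otherwise just clean up */
        kill(kids[i], messy ? SIGUSR1 : SIGKILL);
        if (waitpid(kids[i], &status, 0) == kids[i] &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1)
            told_to_quit++;
    }

    return messy ? told_to_quit : -1;
}
```

Calling simulate_crash_cascade() from a main() should report every child other than the crashed one as terminated by SIGUSR1. Note that the first child's abort() may itself leave a core file, depending on the core size limit in effect.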

So there are at least two problems here, but we only have evidence of
the second one.

Since the problem is fairly reproducible, I'd suggest you temporarily
dike out the elog(NOTICE) call in quickdie() (in
src/backend/tcop/postgres.c), which will probably allow the backends
to honor SIGUSR1 without dumping core. Then you have a shot at seeing
the core from the original failure.

Assuming that this works (ie, you find a core that's not got anything
to do with quickdie()), I'd suggest an inquiry to Sun about whether
their signal handler logic hasn't got a problem with write() issued
from inside a signal handler. Meanwhile let us know what the new
backtrace shows.

regards, tom lane
