Re: [ADMIN] Streaming Replication Server Crash - Mailing list pgsql-general

From Craig Ringer
Subject Re: [ADMIN] Streaming Replication Server Crash
Date
Msg-id 50862525.5060904@ringerc.id.au
In response to Re: [ADMIN] Streaming Replication Server Crash  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [ADMIN] Streaming Replication Server Crash  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On 10/22/2012 08:52 PM, Tom Lane wrote:
> Craig Ringer <ringerc@ringerc.id.au> writes:
>> On 10/19/2012 04:40 PM, raghu ram wrote:
>>> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG:  server process
>>> (PID 15565) was terminated by signal 10
>
>> That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.
>
>> Was the server intentionally sent SIGUSR1 by an admin? Do you know what
>> triggered the signal?
>
> SIGUSR1 is used for all sorts of internal cross-process signaling
> purposes.  There's no need to hypothesize any external force sending
> it; if somebody had broken a PG process's signal handling setup for
> SIGUSR1, a crash of this sort could be expected in short order.
>
> But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
> AFAIK, that signal number is not at all compatible across different
> flavors of Unix.  (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly thought that POSIX specified signal *numbers*, not
just names. That does not appear to be the case. Thanks.
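
For anyone wanting to check what a signal number means on a given
machine, here is a rough Python sketch. It only reports the mapping on
the host it runs on, so it would need to be run on the affected server.
`describe_signal` is a helper name invented for this illustration.

```python
import signal


def describe_signal(number):
    """Return this platform's name for a signal number, or None if unknown."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to Python on this platform.
        return None


# The numeric value behind each name is platform-specific.
for name in ("SIGUSR1", "SIGBUS", "SIGSEGV"):
    sig = getattr(signal, name, None)  # a name may be missing on some platforms
    if sig is not None:
        print(f"{name} = {int(sig)}")

print("signal 10 on this host is:", describe_signal(10))
```

The output differs between operating systems, which is the whole point:
the number in the log can only be interpreted against the platform that
produced it.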

A bit of searching suggests that on Solaris/SunOS, signal 10 is SIGBUS:

http://www.s-gms.ms.edus.si/cgi-bin/man-cgi?signal+3HEAD
http://docs.oracle.com/cd/E23824_01/html/821-1464/signal-3head.html

... which tends to suggest an entirely different interpretation than
"someone broke a signal handler":

https://blogs.oracle.com/peteh/entry/sigbus_versus_sigsegv_according_to

such as:

- bad mmap()ed read
- alignment error
- hardware fault

so it's not very different from a segfault, in that it can be caused
by errors in the hardware, the OS, or the application.

Raghu, did PostgreSQL dump a core file? If it didn't, you might want to
enable core dumps in future. If it did dump a core, attaching a debugger
to the core file might tell you where it crashed, possibly offering some
more information to diagnose the issue. I'm not familiar enough with
Solaris to offer detailed advice on that, especially as you haven't
mentioned your Solaris version, how you installed Pg, etc. This may be
of some use:


http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris
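
As a quick way to see whether core files are currently permitted, the
following Python sketch reads the core-size resource limit. Big caveat:
it reports the limit of the process running the script, not of the
PostgreSQL server, which inherits its limits from whatever started it.
Treat it as a check of the shell or account environment only.
`core_dumps_enabled` is a name made up for this example.

```python
import resource


def core_dumps_enabled():
    """True if this process's soft RLIMIT_CORE allows a non-empty core file."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft == resource.RLIM_INFINITY or soft > 0


def show(limit):
    """Render a limit value, spelling out the 'unlimited' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core soft limit:", show(soft))
print("core hard limit:", show(hard))
print("core dumps enabled for this process:", core_dumps_enabled())
```

A soft limit of 0 means no core file will be written even if the
process crashes, so the limit has to be raised in the environment that
launches the server before the next crash can be examined.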

--
Craig Ringer

