Thread: Streaming Replication Server Crash
2012-10-19 12:26:43 IST [11934]: [1-1] user=postgres,db=hmis LOG: duration: 20345.702 ms execute <unnamed>: SELECT DISTINCT y.ais_insm_name,y.ais_insm_type,ha_dism_name,ha_hudm_name,x2.ais_equip_receipt_details_value_lov, x3.ais_equip_receipt_details_value,x4.ais_equip_receipt_details_value, mast_frm_freq, t_report_code, t_report_date, t_report_month, t_report_year,t_report_createdate,t_report_date2, t_report_month2, t_report_year2,mast_frm_validity_days FROM TNHSPTOOL_FORM_MASTER, TRAN_712_MASTER full join ais_institution_master as y on( t_instiution_code = y.ais_insm_code) full join ha_district_master on( t_district_code=ha_dism_code ) full join ha_hud_master on( t_hud_code=ha_hudm_code )full join tran_712_664 x2 on (x2.t_rep_code=t_report_code and x2.t_rep_rowno=3) full join tran_712_664 x3 on (x3.t_rep_code=t_report_code and x3.t_rep_rowno=5) full join tran_712_664 x4 on (x4.t_rep_code=t_report_code and x4.t_rep_rowno=36) WHERE mast_frm_code = T_REPORT_FRMID AND y.ais_insm_code ='00231' and T_REPORT_FRMID =712 and T_REPORT_OLDFLAG='O' ORDER BY T_REPORT_CODE DESC
2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process (PID 15565) was terminated by signal 10
2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG: terminating any other active server processes
2012-10-19 12:26:46 IST [18450]: [3-1] user=postgres,db=DBHMS WARNING: terminating connection because of crash of another server process
Standby Server configuration Details as follows:
max_wal_senders = 5
wal_sender_delay = 200ms
wal_keep_segments = 128
vacuum_defer_cleanup_age = 0
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
Thanks & Regards,
Raghu Ram
On 10/19/2012 04:40 PM, raghu ram wrote: > Hi All, > > We have configured Streaming Replication b/w Primary and Standby server > and Pgpool-II load balancing module diverting > SELECT statements to Standby server. As per our observations, Standby > server crashed during peak hours on today and error message as follows > 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process > (PID 15565) was terminated by signal 10 > > 2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG: terminating any > other active server processes That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL. Was the server intentionally sent SIGUSR1 by an admin? Do you know what triggered the signal? Are you running any procedural languages other than PL/PgSQL, or any custom C extensions? Anything that might have unwittingly cleared the signal handler for SIGUSR1? -- Craig Ringer
Craig Ringer <ringerc@ringerc.id.au> writes: > On 10/19/2012 04:40 PM, raghu ram wrote: >> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process >> (PID 15565) was terminated by signal 10 > That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL. > Was the server intentionally sent SIGUSR1 by an admin? Do you know what > triggered the signal? SIGUSR1 is used for all sorts of internal cross-process signaling purposes. There's no need to hypothesize any external force sending it; if somebody had broken a PG process's signal handling setup for SIGUSR1, a crash of this sort could be expected in short order. But having said that, are we sure 10 is SIGUSR1 on the OP's platform? AFAIK, that signal number is not at all compatible across different flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.) > Are you running any procedural languages other than PL/PgSQL, or any > custom C extensions? Anything that might have unwittingly cleared the > signal handler for SIGUSR1? libperl has a bad habit of thinking it can mess with the process's signal setup ... regards, tom lane
On 10/22/2012 08:52 PM, Tom Lane wrote: > Craig Ringer <ringerc@ringerc.id.au> writes: >> On 10/19/2012 04:40 PM, raghu ram wrote: >>> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process >>> (PID 15565) was terminated by signal 10 > >> That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL. > >> Was the server intentionally sent SIGUSR1 by an admin? Do you know what >> triggered the signal? > > SIGUSR1 is used for all sorts of internal cross-process signaling > purposes. There's no need to hypothesize any external force sending > it; if somebody had broken a PG process's signal handling setup for > SIGUSR1, a crash of this sort could be expected in short order. > > But having said that, are we sure 10 is SIGUSR1 on the OP's platform? > AFAIK, that signal number is not at all compatible across different > flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.) Gah. I incorrectly though that POSIX specified signal *numbers*, not just names. That does not appear to actually be the case. Thanks. A bit of searching suggests that on Solaris/SunOS, signal 10 is SIGBUS: http://www.s-gms.ms.edus.si/cgi-bin/man-cgi?signal+3HEAD http://docs.oracle.com/cd/E23824_01/html/821-1464/signal-3head.html ... which tends to suggest an entirely different interpretation than "someone broke a signal hander": https://blogs.oracle.com/peteh/entry/sigbus_versus_sigsegv_according_to such as: - Bad mmap()ed read - alignment error - hardware fault so it's not immensely different to a segfault in that it can be caused by errors in hardware, OS, or applications. Raghu, did PostgreSQL dump a core file? If it didn't, you might want to enable core dumps in future. If it did dump a core, attaching a debugger to the core file might tell you where it crashed, possibly offering some more information to diagnose the issue. I'm not familiar enough with Solaris to offer detailed advice on that, especially as you haven't mentioned your Solaris version, how you installed Pg, etc. This may be of some use: http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris -- Craig Ringer
http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris
Actually, that link doesn't apply to this problem, it's for getting a stack trace programmatically:
Try:
http://publib.boulder.ibm.com/httpserv/ihsdiag/get_backtrace.html
http://www.princeton.edu/~unix/Solaris/troubleshoot/adb.html
Most of the good links I could find were on blogs.sun.com, which Oracle have helpfully redirected to www.oracle.com - where the pages don't actually exist.
--
Craig Ringer
Craig Ringer <ringerc@ringerc.id.au> writes: > On 10/22/2012 08:52 PM, Tom Lane wrote: >> But having said that, are we sure 10 is SIGUSR1 on the OP's platform? >> AFAIK, that signal number is not at all compatible across different >> flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.) > Gah. I incorrectly though that POSIX specified signal *numbers*, not > just names. That does not appear to actually be the case. Thanks. This isn't the first time I've wondered exactly which signal was meant in a postmaster child-crash report. Seems like it might be worth expending some code on a symbolic translation, instead of just printing the number. That'd be easy enough (for common signal names) on Unix, but has anyone got a suggestion how we might do something useful on Windows? regards, tom lane
This isn't the first time I've wondered exactly which signal was meant in a postmaster child-crash report. Seems like it might be worth expending some code on a symbolic translation, instead of just printing the number. That'd be easy enough (for common signal names) on Unix, but has anyone got a suggestion how we might do something useful on Windows?
Here's a typical Windows exception:
2012-10-04 14:29:08 CEST LOG: server process (PID 1416) was terminated by exception 0xC0000005
2012-10-04 14:29:08 CEST HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
These codes can be translated with FormatMessage:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms679351(v=vs.85).aspx
http://support.microsoft.com/kb/259693
FormatMessage may not be safe to perform in the context of a munged heap or some other failure conditions, so you probably don't want to do it from a crash handler. It is safe for the postmaster to do it based on the exception code it gets from the dying backend, though.
I'd say the best option is for the postmaster to print the FormatMessage( FORMAT_MESSAGE_ALLOCATE_BUFFER|FORMAT_MESSAGE_FROM_SYSTEM|FORMAT_MESSAGE_FROM_HMODULE, ...) output when it sees the exception code from the dying backend.
RtlNtStatusToDosError may also be of interest: http://msdn.microsoft.com/en-us/library/windows/desktop/ms680600(v=vs.85).aspx ... but it's in Winternl.h so it's not guaranteed to exist / be compatible between versions and can only be accessed via runtime dynamic linking. Not ideal.
--
Craig Ringer
It looks like there's no standard way to do that. Here's how I'd do it in Python: [CODE] import signal dict((k, v) for v, k in signal.__dict__.iteritems() if v.startswith('SIG')) [/CODE] In C, I guess I'd just do a switch statement on the common signal names between Windows and POSIX as exposed SIGNAL.H. Lookslike all you get in Windows is: http://msdn.microsoft.com/en-us/library/xdkz3x12(v=vs.110).aspx Brian -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Tom Lane Sent: Monday, October 22, 2012 10:21 PM To: Craig Ringer Cc: raghu ram; pgsql-admin@postgresql.org; pgsql-general Subject: Re: [ADMIN] [GENERAL] Streaming Replication Server Crash Craig Ringer <ringerc@ringerc.id.au> writes: > On 10/22/2012 08:52 PM, Tom Lane wrote: >> But having said that, are we sure 10 is SIGUSR1 on the OP's platform? >> AFAIK, that signal number is not at all compatible across different >> flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.) > Gah. I incorrectly though that POSIX specified signal *numbers*, not > just names. That does not appear to actually be the case. Thanks. This isn't the first time I've wondered exactly which signal was meant in a postmaster child-crash report. Seems like itmight be worth expending some code on a symbolic translation, instead of just printing the number. That'd be easy enough(for common signal names) on Unix, but has anyone got a suggestion how we might do something useful on Windows? regards, tom lane -- Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin