Thread: [HACKERS] Questions regarding signal handler of postmaster

[HACKERS] Questions regarding signal handler of postmaster

From: Tatsuo Ishii
In postmaster.c, the signal handler pmdie() calls ereport() and
errmsg_internal(), which can call palloc() and then malloc() if
necessary. Because pmdie() can be invoked while the postmaster is
itself inside malloc(), I think a deadlock could occur on the
internal lock that malloc() takes. I have not observed this exact
case in PostgreSQL, but I have seen a suspected case in Pgpool-II.
In the stack trace below, malloc() is called by Pgpool-II in frame
#14. It is interrupted by a signal, whose handler runs in frame #11;
the handler calls malloc() again and gets stuck at frame #0.

So my question is, is my concern about PostgreSQL valid?
If so, how can we fix it?

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f67fe20ccba in _L_lock_12808 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f67fe20a6b5 in __GI___libc_malloc (bytes=15) at malloc.c:2887
#3  0x00007f67fe21072a in __GI___strdup (s=0x7f67fe305dd8 "/etc/localtime") at strdup.c:42
#4  0x00007f67fe239f51 in tzset_internal (always=<optimized out>, explicit=explicit@entry=1) at tzset.c:444
#5  0x00007f67fe23a913 in __tz_convert (timer=timer@entry=0x7ffce1c1b7f8, use_localtime=use_localtime@entry=1, tp=tp@entry=0x7f67fe54bde0 <_tmbuf>) at tzset.c:632
#6  0x00007f67fe2387d1 in __GI_localtime (t=t@entry=0x7ffce1c1b7f8) at localtime.c:42
#7  0x000000000045627b in log_line_prefix (buf=buf@entry=0x7ffce1c1b8d0, line_prefix=<optimized out>, edata=<optimized out>) at ../../src/utils/error/elog.c:2059
#8  0x000000000045894d in send_message_to_server_log (edata=0x753320 <errordata>) at ../../src/utils/error/elog.c:2084
#9  EmitErrorReport () at ../../src/utils/error/elog.c:1129
#10 0x0000000000456d8e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#11 0x0000000000421f57 in die (sig=2) at protocol/child.c:925
#12 <signal handler called>
#13 _int_malloc (av=0x7f67fe546760 <main_arena>, bytes=4176) at malloc.c:3302
#14 0x00007f67fe20a6c0 in __GI___libc_malloc (bytes=4176) at malloc.c:2891

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
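
The hazard in the message above (a handler re-entering malloc() while the interrupted code already holds malloc()'s internal lock) is commonly avoided by keeping the handler async-signal-safe. The following is a minimal sketch of that pattern, not code from PostgreSQL or Pgpool-II: the handler only sets a flag, and the main loop does the logging later. The names shutdown_requested, handle_shutdown, install_shutdown_handler, and consume_shutdown_request are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag. volatile sig_atomic_t is the object type that can be
 * written safely from a signal handler and read from the main code. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler only records the event: no malloc(), no stdio,
 * no localtime(), so it cannot re-enter a lock held by the main code. */
static void handle_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handler for one signal; returns 0 on success, -1 on error. */
static int install_shutdown_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_shutdown;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Called from the main loop, where logging and allocation are safe.
 * Returns 1 once for each observed request, otherwise 0. */
static int consume_shutdown_request(void)
{
    if (!shutdown_requested)
        return 0;
    shutdown_requested = 0;
    return 1;
}
```

A main loop would call install_shutdown_handler(SIGINT) once at startup and then check consume_shutdown_request() on each iteration, emitting its log message from there rather than from the handler.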



Re: [HACKERS] Questions regarding signal handler of postmaster

From: Tom Lane
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> In postmaster.c signal handler pmdie() calls ereport() and
> errmsg_internal(), which could call palloc() then malloc() if
> necessary. Because it is possible that pmdie() gets called while
> malloc() gets called in postmaster, I think it is possible that a
> deadlock situation could occur through an internal locking inside
> malloc().

But we keep signals blocked almost all the time in the postmaster,
so in reality no signal handler can interrupt anything except the
select() wait call.  People complain about that coding technique
all the time, but no one has presented any reason to believe that
it's broken.
        regards, tom lane
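
The technique described above (signals blocked everywhere except around the wait call) can be sketched roughly as follows. This is an illustration under assumptions, not the postmaster's actual code: managed_signals, note_signal, init_signal_blocking, and wait_for_event are hypothetical names, and the handler here only sets a flag so that the effect is easy to observe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>

static volatile sig_atomic_t got_signal = 0;
static sigset_t managed_signals;   /* signals kept blocked during normal work */

static void note_signal(int signo)
{
    (void) signo;
    got_signal = 1;
}

/* Install the handler and block the signal; returns 0 on success. */
static int init_signal_blocking(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = note_signal;
    sigfillset(&sa.sa_mask);       /* no other handler runs inside this one */
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;

    sigemptyset(&managed_signals);
    sigaddset(&managed_signals, signo);
    return sigprocmask(SIG_BLOCK, &managed_signals, NULL);
}

/* The only place where handlers may run. Returns select()'s result:
 * 0 on timeout, -1 (errno == EINTR) if a signal interrupted the wait. */
static int wait_for_event(long timeout_usec)
{
    struct timeval tv;
    int rc;

    assert(timeout_usec >= 0);
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;

    sigprocmask(SIG_UNBLOCK, &managed_signals, NULL);  /* pending signals fire here */
    rc = select(0, NULL, NULL, NULL, &tv);             /* no descriptors: pure wait */
    sigprocmask(SIG_BLOCK, &managed_signals, NULL);    /* back to blocked */
    return rc;
}
```

Because the main code never runs malloc() while the signals are unblocked, a handler invoked inside wait_for_event() cannot interrupt an allocation in progress. One simplification in this sketch: a signal delivered between the unblock and the select() call runs its handler first, and the wait then lasts for the full timeout.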



Re: [HACKERS] Questions regarding signal handler of postmaster

From: "Tsunakawa, Takayuki"
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tatsuo Ishii
> In postmaster.c signal handler pmdie() calls ereport() and
> errmsg_internal(), which could call palloc() then malloc() if necessary.
> Because it is possible that pmdie() gets called while
> malloc() gets called in postmaster, I think it is possible that a deadlock
> situation could occur through an internal locking inside malloc(). I have
> not observed the exact case in PostgreSQL but I see a suspected case in
> Pgpool-II. In the stack trace #14, malloc() is called by Pgpool-II. It is
> interrupted by a signal in #11, and the signal handler calls malloc() again,
> and it is stuck at #0.

I encountered that problem with postmaster and fixed it in 9.4.0 (it's not back-patched to earlier releases because it's relatively complex).

https://www.postgresql.org/message-id/20DAEA8949EC4E2289C6E8E58560DEC0@maumau


[Excerpt from 9.4 release note]
During crash recovery or immediate shutdown, send uncatchable termination signals (SIGKILL) to child processes that do not shut down promptly (MauMau, Álvaro Herrera)

This reduces the likelihood of leaving orphaned child processes behind after postmaster shutdown, as well as ensuring that crash recovery can proceed if some child processes have become “stuck”.
 

Regards
Takayuki Tsunakawa
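
The escalation that the release note excerpt describes can be sketched as follows. This is a simplified illustration, not the postmaster's implementation: stop_child and spawn_child_ignoring are hypothetical helpers, the grace period is a parameter chosen by the caller, and polling with waitpid(WNOHANG) is an assumption made to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Ask a child to stop with polite_signal; if it is still alive after
 * grace_ms milliseconds, send SIGKILL, which cannot be caught or ignored.
 * Returns the signal that terminated the child, 0 if it exited normally,
 * or -1 on error. */
static int stop_child(pid_t pid, int polite_signal, int grace_ms)
{
    const struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms */
    int status = 0;
    int waited_ms = 0;
    pid_t rc;

    assert(pid > 0);   /* pid <= 0 would address a process group */
    if (kill(pid, polite_signal) != 0)
        return -1;

    while ((rc = waitpid(pid, &status, WNOHANG)) == 0)
    {
        if (waited_ms >= grace_ms)
        {
            kill(pid, SIGKILL);             /* child is stuck: escalate */
            rc = waitpid(pid, &status, 0);  /* reap it */
            break;
        }
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }
    if (rc != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Demonstration helper: fork a child that ignores signo and then idles.
 * The disposition is set before fork() so the child inherits it without
 * a race, and it is restored in the parent afterwards. */
static pid_t spawn_child_ignoring(int signo)
{
    struct sigaction ign, old;
    pid_t pid;

    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    if (sigaction(signo, &ign, &old) != 0)
        return -1;

    pid = fork();
    if (pid == 0)
    {
        for (;;)
            pause();                        /* stand-in for a stuck child */
    }
    sigaction(signo, &old, NULL);
    return pid;
}
```

For example, stop_child(spawn_child_ignoring(SIGTERM), SIGTERM, 50) first sends SIGTERM, which the child ignores, and then falls back to SIGKILL once the 50 ms grace period has elapsed.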




Re: [HACKERS] Questions regarding signal handler of postmaster

From: Tatsuo Ishii
> I encountered that problem with postmaster and fixed it in 9.4.0 (it's not back-patched to earlier releases because it's relatively complex).
>
> https://www.postgresql.org/message-id/20DAEA8949EC4E2289C6E8E58560DEC0@maumau
>
> [Excerpt from 9.4 release note]
> During crash recovery or immediate shutdown, send uncatchable termination signals (SIGKILL) to child processes that do not shut down promptly (MauMau, Álvaro Herrera)
>
> This reduces the likelihood of leaving orphaned child processes behind after postmaster shutdown, as well as ensuring that crash recovery can proceed if some child processes have become “stuck”.

Looks wild to me :-) I hope there is a better way to solve the problem...

Best regards,
--
Tatsuo Ishii

Re: [HACKERS] Questions regarding signal handler of postmaster

From: Tatsuo Ishii
> But we keep signals blocked almost all the time in the postmaster,
> so in reality no signal handler can interrupt anything except the
> select() wait call.  People complain about that coding technique
> all the time, but no one has presented any reason to believe that
> it's broken.

OK, there seems to be no better solution than always blocking signals.

Best regards,
--
Tatsuo Ishii