Thread: [HACKERS] Questions regarding signal handler of postmaster
In postmaster.c signal handler pmdie() calls ereport() and
errmsg_internal(), which could call palloc() then malloc() if
necessary. Because it is possible that pmdie() gets called while
malloc() gets called in postmaster, I think it is possible that a
deadlock situation could occur through an internal locking inside
malloc().

I have not observed the exact case in PostgreSQL but I see a suspected
case in Pgpool-II. In the stack trace #14, malloc() is called by
Pgpool-II. It is interrupted by a signal in #11, and the signal handler
calls malloc() again, and it is stuck at #0.

So my question is, is my concern about PostgreSQL valid?
If so, how can we fix it?

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f67fe20ccba in _L_lock_12808 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f67fe20a6b5 in __GI___libc_malloc (bytes=15) at malloc.c:2887
#3  0x00007f67fe21072a in __GI___strdup (s=0x7f67fe305dd8 "/etc/localtime") at strdup.c:42
#4  0x00007f67fe239f51 in tzset_internal (always=<optimized out>, explicit=explicit@entry=1) at tzset.c:444
#5  0x00007f67fe23a913 in __tz_convert (timer=timer@entry=0x7ffce1c1b7f8, use_localtime=use_localtime@entry=1, tp=tp@entry=0x7f67fe54bde0 <_tmbuf>) at tzset.c:632
#6  0x00007f67fe2387d1 in __GI_localtime (t=t@entry=0x7ffce1c1b7f8) at localtime.c:42
#7  0x000000000045627b in log_line_prefix (buf=buf@entry=0x7ffce1c1b8d0, line_prefix=<optimized out>, edata=<optimized out>) at ../../src/utils/error/elog.c:2059
#8  0x000000000045894d in send_message_to_server_log (edata=0x753320 <errordata>) at ../../src/utils/error/elog.c:2084
#9  EmitErrorReport () at ../../src/utils/error/elog.c:1129
#10 0x0000000000456d8e in errfinish (dummy=<optimized out>) at ../../src/utils/error/elog.c:434
#11 0x0000000000421f57 in die (sig=2) at protocol/child.c:925
#12 <signal handler called>
#13 _int_malloc (av=0x7f67fe546760 <main_arena>, bytes=4176) at malloc.c:3302
#14 0x00007f67fe20a6c0 in __GI___libc_malloc (bytes=4176) at malloc.c:2891

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

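The trace above shows a handler doing work that is not async-signal-safe: logging reaches localtime() and then malloc() while the interrupted code already holds malloc()'s internal lock. One common way to avoid that, given here only as a hedged sketch and not as what PostgreSQL or Pgpool-II actually do, is to let the handler set a flag and to log later from normal context. The names request_shutdown, install_shutdown_handler, and process_pending_shutdown are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Written by the handler, read by normal code. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical handler: no malloc(), no stdio, no localtime(). */
static void request_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handler for SIGINT; returns 0 on success, -1 on failure. */
int install_shutdown_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = request_shutdown;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

/*
 * Called from normal (non-handler) context, where logging is safe.
 * Returns 1 if a request was pending and has now been reported, else 0.
 */
int process_pending_shutdown(FILE *log)
{
    assert(log != NULL);
    if (!shutdown_requested)
        return 0;
    shutdown_requested = 0;
    fprintf(log, "received shutdown request\n");
    return 1;
}
```

A main loop would call install_shutdown_handler() once at startup and process_pending_shutdown(stderr) on each iteration, so the allocation-heavy logging never runs inside the handler.
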
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> In postmaster.c signal handler pmdie() calls ereport() and
> errmsg_internal(), which could call palloc() then malloc() if
> necessary. Because it is possible that pmdie() gets called while
> malloc() gets called in postmaster, I think it is possible that a
> deadlock situation could occur through an internal locking inside
> malloc().

But we keep signals blocked almost all the time in the postmaster,
so in reality no signal handler can interrupt anything except the
select() wait call. People complain about that coding technique
all the time, but no one has presented any reason to believe that
it's broken.

regards, tom lane

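The technique described in this reply can be sketched as follows: block every signal at startup, and unblock them only for the duration of the select() wait. This is a minimal illustration under stated assumptions, not the postmaster's actual code; block_all_and_install, wait_with_signals_unblocked, handler_run_count, and the choice of SIGUSR1 are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>

/* Counts handler invocations so the caller can observe when it ran. */
static volatile sig_atomic_t handler_runs = 0;

static void count_signal(int signo)
{
    (void) signo;
    handler_runs++;
}

/* Block all signals, then install the handler for SIGUSR1. */
int block_all_and_install(void)
{
    sigset_t all;
    struct sigaction sa;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = count_signal;
    sigfillset(&sa.sa_mask);    /* no nested handlers either */
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Wait up to timeout_ms with signals unblocked, then block them again.
 * Handlers can run only between the two sigprocmask() calls, where the
 * main code is not inside malloc() or any other non-reentrant routine.
 */
int wait_with_signals_unblocked(int timeout_ms)
{
    sigset_t all, none;
    struct timeval tv;
    int rc;

    assert(timeout_ms >= 0);
    sigfillset(&all);
    sigemptyset(&none);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    sigprocmask(SIG_SETMASK, &none, NULL);  /* pending signals are delivered */
    rc = select(0, NULL, NULL, NULL, &tv);  /* may return -1 with EINTR */
    sigprocmask(SIG_SETMASK, &all, NULL);
    return rc;
}

int handler_run_count(void)
{
    return (int) handler_runs;
}
```

A signal raised while everything is blocked stays pending and its handler runs only once wait_with_signals_unblocked() lifts the mask, which is why the handler cannot interrupt a malloc() call made elsewhere in the main code.
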
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tatsuo Ishii
> In postmaster.c signal handler pmdie() calls ereport() and
> errmsg_internal(), which could call palloc() then malloc() if necessary.
> Because it is possible that pmdie() gets called while
> malloc() gets called in postmaster, I think it is possible that a deadlock
> situation could occur through an internal locking inside malloc(). I have
> not observed the exact case in PostgreSQL but I see a suspected case in
> Pgpool-II. In the stack trace #14, malloc() is called by Pgpool-II. It is
> interrupted by a signal in #11, and the signal handler calls malloc() again,
> and it is stuck at #0.

I encountered that problem with postmaster and fixed it in 9.4.0 (it's
not back-patched to earlier releases because it's relatively complex).

https://www.postgresql.org/message-id/20DAEA8949EC4E2289C6E8E58560DEC0@maumau

[Excerpt from 9.4 release note]
During crash recovery or immediate shutdown, send uncatchable
termination signals (SIGKILL) to child processes that do not shut down
promptly (MauMau, Álvaro Herrera)

This reduces the likelihood of leaving orphaned child processes behind
after postmaster shutdown, as well as ensuring that crash recovery can
proceed if some child processes have become “stuck”.

Regards
Takayuki Tsunakawa

> I encountered that problem with postmaster and fixed it in 9.4.0 (it's
> not back-patched to earlier releases because it's relatively complex).
>
> https://www.postgresql.org/message-id/20DAEA8949EC4E2289C6E8E58560DEC0@maumau
>
> [Excerpt from 9.4 release note]
> During crash recovery or immediate shutdown, send uncatchable
> termination signals (SIGKILL) to child processes that do not shut down
> promptly (MauMau, Álvaro Herrera)
> This reduces the likelihood of leaving orphaned child processes behind
> after postmaster shutdown, as well as ensuring that crash recovery can
> proceed if some child processes have become “stuck”.

Looks wild to me :-) I hope there exists a better way to solve the
problem...

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> But we keep signals blocked almost all the time in the postmaster,
> so in reality no signal handler can interrupt anything except the
> select() wait call. People complain about that coding technique
> all the time, but no one has presented any reason to believe that
> it's broken.

OK, there seems to be no better solution than always blocking signals.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan