On Sat, Apr 27, 2019 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success. I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place. Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.
For Greenplum (based on 9.4 but current master code looks the same) we
did see deadlocks recently hit in CI many times for walreceiver which
I believe confirms above finding.
#0 __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>,
p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2 0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#3 0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4 0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5 0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
#6 0x00007f0637ea0ff8 in __run_exit_handlers (status=1,
listp=0x7f063822b5f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7 0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8 0x00000000008c72c7 in proc_exit ()
#9 0x0000000000a75867 in errfinish ()
#10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
#11 0x000000000089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>,
bytes=bytes@entry=16384) at malloc.c:3802
#14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x00000000007754c3 in makeEmptyPGconn ()
#16 0x0000000000779686 in PQconnectStart ()
#17 0x0000000000779b8b in PQconnectdb ()
#18 0x00000000008aae52 in libpqrcv_connect ()
#19 0x000000000089f735 in WalReceiverMain ()
#20 0x00000000005c5eab in AuxiliaryProcessMain ()
#21 0x00000000004cd5f1 in ServerLoop ()
#22 0x000000000086fb18 in PostmasterMain ()
#23 0x00000000004d2e28 in main ()
ImmediateInterruptOK was removed from regular backends but not for
walreceiver and walreceiver performing elog(FATAL) inside signal
handler is dangerous.