Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers

From Ashwin Agrawal
Subject Re: Race conditions with checkpointer and shutdown
Date
Msg-id CALfoeittsAtXddKz98wYxFYmkP46p59EJOCV+dKnKEAtxxhAVA@mail.gmail.com
In response to Re: Race conditions with checkpointer and shutdown  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race conditions with checkpointer and shutdown
List pgsql-hackers
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success.  I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler.  It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place.  Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.

For Greenplum (based on 9.4, though the current master code looks the
same), we have recently hit deadlocks in the walreceiver many times in
CI, which I believe confirms the above finding.

#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>, p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
#6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1, listp=0x7f063822b5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00000000008c72c7 in proc_exit ()
#9  0x0000000000a75867 in errfinish ()
#10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
#11 0x000000000089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>, bytes=bytes@entry=16384) at malloc.c:3802
#14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x00000000007754c3 in makeEmptyPGconn ()
#16 0x0000000000779686 in PQconnectStart ()
#17 0x0000000000779b8b in PQconnectdb ()
#18 0x00000000008aae52 in libpqrcv_connect ()
#19 0x000000000089f735 in WalReceiverMain ()
#20 0x00000000005c5eab in AuxiliaryProcessMain ()
#21 0x00000000004cd5f1 in ServerLoop ()
#22 0x000000000086fb18 in PostmasterMain ()
#23 0x00000000004d2e28 in main ()

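ImmediateInterruptOK was removed from regular backends but not from the
walreceiver, and having the walreceiver perform elog(FATAL) inside a
signal handler is dangerous. In the trace above, the handler interrupted
_int_malloc() on main_arena and then, via proc_exit() and the exit
handlers, reached free() on the same arena, where it is stuck waiting
for a lock.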
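
To illustrate the general alternative to doing real work in the handler,
here is a minimal standalone sketch of the "set a flag, act on it later"
pattern. It is not PostgreSQL code and not a proposed patch: the names
shutdown_requested, shutdown_handler, process_interrupts, run_work_loop
and install_shutdown_handler are all hypothetical, and raise() is used
only to simulate a signal arriving while a malloc'd buffer is in use.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical flag: the only state the signal handler touches. */
static volatile sig_atomic_t shutdown_requested = 0;

/*
 * The handler only records the request.  It does not call malloc(),
 * free(), exit() or anything else that is not async-signal-safe.
 */
void
shutdown_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM; returns 0 on success, -1 on failure. */
int
install_shutdown_handler(void)
{
    return (signal(SIGTERM, shutdown_handler) == SIG_ERR) ? -1 : 0;
}

/*
 * Called only from normal (non-handler) context, at points where it is
 * safe to exit.  Returns 1 if a shutdown is pending, 0 otherwise.
 */
int
process_interrupts(void)
{
    return shutdown_requested != 0;
}

/*
 * Hypothetical work loop.  Each iteration allocates and frees a buffer;
 * the pending-shutdown flag is checked only between iterations, never
 * while the allocator might hold its internal locks.  raise_at selects
 * the iteration in which a SIGTERM is simulated.  Returns the number of
 * iterations completed before the shutdown request was noticed.
 */
int
run_work_loop(int max_iterations, int raise_at)
{
    int done = 0;

    assert(max_iterations >= 0);

    for (int i = 0; i < max_iterations; i++)
    {
        if (process_interrupts())
            break;              /* safe point: no allocation in progress */

        void *buf = malloc(16384);
        if (buf == NULL)
            break;

        if (i == raise_at)
            raise(SIGTERM);     /* simulated signal while buf is live */

        free(buf);
        done++;
    }
    return done;
}
```

With the handler installed, run_work_loop(10, 3) finishes the iteration
in which the signal arrives and stops at the next check, so the exit
path would run from ordinary context rather than from inside the
handler.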


