Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe? - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?
Date
Msg-id CAHyXU0z04R5hmoW61stz3YQHA5O=w1VO4noEMV+V8Wwc30ru0Q@mail.gmail.com
In response to Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Mar 13, 2020 at 2:28 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On March 13, 2020 12:08:32 PM PDT, Merlin Moncure <mmoncure@gmail.com> wrote:
> >I have 5 servers in a testing environment that comprise a data
> >warehousing cluster.  They will each typically get exactly the same
> >query at approximately the same time.  Yesterday, around 1pm, three
> >of the five got stuck on the same query.  Each of them yields similar
> >stack traces.  This happens now and then.  The server is 9.6.12
> >(which is obviously old, but I did not see any changes in the
> >relevant code).
> >
> >(gdb) bt
> >#0  0x00007fe856c0b463 in __epoll_wait_nocancel () from
> >/lib64/libc.so.6
> >#1  0x00000000006b4416 in WaitEventSetWaitBlock (nevents=1,
> >occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at
> >latch.c:1053
> >#2  WaitEventSetWait (set=0x27cace8, timeout=timeout@entry=-1,
> >occurred_events=occurred_events@entry=0x7ffc9f2b0f60,
> >nevents=nevents@entry=1) at latch.c:1007
> >#3  0x00000000005f26dd in secure_write (port=0x27f16a0,
> >ptr=ptr@entry=0x27f5528, len=len@entry=192) at be-secure.c:255
> >#4  0x00000000005fb51b in internal_flush () at pqcomm.c:1410
> >#5  0x00000000005fb72a in internal_putbytes (s=0x2a4f245 "14M04",
> >s@entry=0x2a4f228 "", len=70) at pqcomm.c:1356
> >#6  0x00000000005fb7f0 in socket_putmessage (msgtype=68 'D',
> >s=0x2a4f228 "", len=<optimized out>) at pqcomm.c:1553
> >#7  0x00000000005fd5d9 in pq_endmessage (buf=buf@entry=0x7ffc9f2b1040)
> >at pqformat.c:347
> >#8  0x0000000000479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at
> >printtup.c:372
> >#9  0x00000000005c1cc9 in ExecutePlan (dest=0x2b6bca0,
> >direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
> >operation=CMD_SELECT,
> >    use_parallel_mode=<optimized out>, planstate=0x2958cf8,
> >estate=0x2958be8) at execMain.c:1606
> >#10 standard_ExecutorRun (queryDesc=0x2834998, direction=<optimized
> >out>, count=0) at execMain.c:339
> >#11 0x00000000006d69a7 in PortalRunSelect
> >(portal=portal@entry=0x2894e38, forward=forward@entry=1 '\001',
> >count=0, count@entry=9223372036854775807,
> >    dest=dest@entry=0x2b6bca0) at pquery.c:948
> >#12 0x00000000006d7dbb in PortalRun (portal=0x2894e38,
> >count=9223372036854775807, isTopLevel=<optimized out>, dest=0x2b6bca0,
> >altdest=0x2b6bca0,
> >    completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
> >#13 0x00000000006d5a06 in PostgresMain (argc=<optimized out>,
> >argv=<optimized out>, dbname=<optimized out>, username=<optimized
> >out>) at postgres.c:1109
> >#14 0x000000000046fc28 in BackendRun (port=0x27f16a0) at
> >postmaster.c:4342
> >#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
> >#16 ServerLoop () at postmaster.c:1721
> >#17 0x0000000000678119 in PostmasterMain (argc=argc@entry=3,
> >argv=argv@entry=0x27c8c90) at postmaster.c:1329
> >#18 0x000000000047088e in main (argc=3, argv=0x27c8c90) at main.c:228
> >(gdb) quit
> >
> >Now, the fact that this happened to multiple servers at the same time
> >strongly suggests an external (to the database) problem.  The system
> >initiating the query, a cross-database query over dblink, gave up (and
> >was restarted as a precaution) a long time ago, and the connection is
> >dead.  secure_write(), however, sets an infinite timeout on the latch
> >wait, and there are clearly scenarios where epoll waits forever for an
> >event that is never going to occur.  If/when this happens, the only
> >recourse is to restart the impacted database.  The question is,
> >shouldn't the latch have a looping timeout that checks for interrupts?
> >What would the risks be of jumping directly out of the latch loop?
>
> Unless there is a kernel problem, latches are interruptible by signals, as the signal handler should do a SetLatch().
>
> This backtrace just looks like the backend is trying to send data to the client? What makes you think it's stuck?
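
(Aside, for anyone following along: the SetLatch()-from-a-signal-handler
wake-up described above is essentially the self-pipe trick.  Below is a
tiny standalone illustration of the shape -- plain Linux epoll plus a
pipe, not PostgreSQL source -- which matches what shows up in the strace
further down.)

/*
 * Standalone illustration (not PostgreSQL source) of the self-pipe
 * wake-up: a signal handler writes one byte to a pipe that epoll is
 * watching, so a blocked epoll_wait() returns even with an infinite
 * (-1) timeout.
 */
#include <sys/epoll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int wake_fd = -1;            /* write end of the self-pipe */

static void
on_sigint(int signo)
{
    char    c = 0;

    (void) write(wake_fd, &c, 1);   /* the analogue of SetLatch() */
}

int
main(void)
{
    int     pipefd[2];
    int     epfd;
    struct epoll_event ev;

    if (pipe(pipefd) != 0 || (epfd = epoll_create1(0)) < 0)
        exit(1);
    wake_fd = pipefd[1];
    signal(SIGINT, on_sigint);

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    for (;;)
    {
        char    buf[16];
        int     n = epoll_wait(epfd, &ev, 1, -1);   /* sleep "forever" */

        if (n < 0)
            continue;               /* EINTR, as in the strace below */

        (void) read(pipefd[0], buf, sizeof(buf));   /* drain the pipe */
        printf("woken up; a backend would now recheck what to do\n");
    }
}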

Well, the client has been gone for > 24 hours.  But you're right: when
I send a cancel to the backend, here is what happens according to
strace:
epoll_wait(3, 0x2915e08, 1, -1)         = -1 EINTR (Interrupted system call)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=5024, si_uid=26} ---
write(13, "\0", 1)                      = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
sendto(11, "\0\0\0\00545780\0\0\0\003508D\0\0\0d\0\t\0\0\0\00615202"...,
5640, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=43081176, u64=43081176}}], 1, -1) = 1
read(12, "\0", 16)                      = 1
epoll_wait(3,


...pg_terminate_backend(), however, does properly kill the query.
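
So the signal does wake the epoll wait (the write(13, "\0", 1) is the
latch self-pipe being poked, and fd 12 is its read side being drained),
but the backend goes straight back to waiting because the send still
can't make progress; presumably a plain cancel isn't allowed to abandon
a half-sent message the way termination is.  What I was getting at
upthread with a looping timeout is roughly the shape below -- wake
periodically and decide whether to keep retrying.  This is only a
sketch against plain epoll; connection_still_wanted() and retry_send()
are made-up placeholders, not functions in the tree.

/*
 * Sketch only, not PostgreSQL code: a bounded wait that re-checks
 * between sleeps whether the write is still worth retrying.
 * connection_still_wanted() and retry_send() are made-up placeholders.
 */
#include <sys/epoll.h>
#include <stdbool.h>

extern bool connection_still_wanted(void);  /* e.g. "has the client been gone too long?" */
extern bool retry_send(void);               /* try the pending write again */

static void
wait_until_sent_or_give_up(int epfd)
{
    struct epoll_event ev;

    for (;;)
    {
        /* wake at least once per second instead of sleeping forever */
        int     n = epoll_wait(epfd, &ev, 1, 1000);

        if (n < 0)
            continue;               /* EINTR: loop around and re-check */

        if (!connection_still_wanted())
            break;                  /* give up instead of waiting forever */

        if (n > 0 && retry_send())
            break;                  /* the socket drained; we're done */
    }
}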

> If the connection is dead, epoll should return (both because we ask
> for the relevant events, and because it just always implicitly does do
> so).
>
> So it seems likely that either your connection isn't actually dead
> (e.g. waiting for tcp timeouts), or you have a kernel bug.

Maybe; I suspect a firewall issue.  Hard to say.
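
If it is a firewall quietly dropping the idle connection, more
aggressive TCP keepalives on the server might at least let the kernel
notice the dead peer sooner, so the blocked send fails instead of
waiting forever.  Illustrative values only, not a recommendation:

# postgresql.conf
tcp_keepalives_idle = 60         # seconds idle before the first probe
tcp_keepalives_interval = 10     # seconds between probes
tcp_keepalives_count = 6         # unanswered probes before the connection is declared dead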

merlin


