Re: Backends stunk in wait event IPC/MessageQueueInternal - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Backends stunk in wait event IPC/MessageQueueInternal
Msg-id: CA+hUKGKuV-TSSRVMjRhV4GuSktxj3-HuA6S+H1JQku7anFY5gw@mail.gmail.com
In response to: Re: Backends stunk in wait event IPC/MessageQueueInternal (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Backends stunk in wait event IPC/MessageQueueInternal
List: pgsql-hackers
On Sat, May 14, 2022 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 13, 2022 at 6:16 AM Japin Li <japinli@hotmail.com> wrote:
> > The process cannot be terminated by pg_terminate_backend(), although
> > it returns true.
>
> One thing I find a bit curious is that the top of the stack in your
> case is ioctl(). And there are no calls to ioctl() anywhere in
> latch.c, nor have there ever been. What operating system is this? We
> have 4 different versions of WaitEventSetWaitBlock() that call
> epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
> respectively. I wonder which of those we're using, and whether one of
> those calls is showing up as ioctl() in the stacktrace, or whether
> there's some other function being called in here that is somehow
> resulting in ioctl() getting called.

I guess this is really illumos (née OpenSolaris), not Solaris, using our epoll build mode, with illumos's emulation of epoll, which maps epoll onto Sun's /dev/poll driver:

https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/sys/epoll.c#L230

That'd explain:

    fffffb7fef216f4a ioctl (d, d001, fffffb7fffdfa0e0)

That matches the value of DP_POLL from:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/devpoll.h#L44

Or, if it's really Solaris, huh, are people moving illumos code back into closed Solaris these days?

As for why it's hanging, I don't know, but one thing that we changed in 14 is that we started using signalfd() to receive latch signals on systems that have it, and illumos also has an emulation of signalfd() that our configure script finds:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/signalfd.c

There were in fact a couple of unexplained hangs on the illumos build farm animals, which were then changed to use -DWAIT_USE_POLL so that they wouldn't automatically choose epoll()/signalfd().
That is not very satisfactory, but as far as I know there is a bug in either illumos's epoll() or its signalfd(), or at least some difference compared to the Linux implementations they are emulating. I spent quite a bit of time ping-ponging emails back and forth with the owner of a hanging BF animal, trying to get a minimal repro for a bug report, without success. I mean, it's possible that the bug is in PostgreSQL (though no complaint about this stuff has ever reached me on Linux), but while trying to investigate it a kernel panic happened[1], which I think counts as a point against that theory...

(For what it's worth, WSL1 also emulates these two Linux interfaces, and apparently also doesn't do so well enough for our purposes, likewise for reasons not understood by us.)

In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that we could do something to prevent the selection of WAIT_USE_EPOLL on that platform, or that we should have a halfway option that uses epoll() but not signalfd() (= go back to using the self-pipe trick). Patches welcome, but that feels kinda strange and would be a very niche combination that isn't fun to maintain... the real solution is to fix the bug.

[1] https://www.illumos.org/issues/13700