Thread: Backends stuck in wait event IPC/MessageQueueInternal
Hi, hackers. I had an incident on my Postgres 14 where queries hung in wait event IPC / MessageQueueInternal, MessageQueueReceive. It looks like [1]; however, that thread doesn't have any discussion. The process cannot be terminated by pg_terminate_backend(), although it returns true.

Here is the call stack, from pstack:

485073: /opt/local/pgsql/14/bin/postgres
 fffffb7fef216f4a ioctl (d, d001, fffffb7fffdfa0e0)
 00000000008b8ec2 WaitEventSetWait () + 112
 00000000008b920f WaitLatch () + 6f
 00000000008bf434 shm_mq_wait_internal () + 64
 00000000008bff74 shm_mq_receive () + 2b4
 000000000079fdc8 TupleQueueReaderNext () + 28
 000000000077d8ca gather_merge_readnext () + 13a
 000000000077db25 ExecGatherMerge () + 215
 0000000000790675 ExecNestLoop () + 175
 0000000000790675 ExecNestLoop () + 175
 000000000076267d standard_ExecutorRun () + fd
 fffffb7fe3965fbd pgss_ExecutorRun () + fd
 00000000008df99b PortalRunSelect () + 1cb
 00000000008e0dcf PortalRun () + 17f
 00000000008ddacd PostgresMain () + 100d
 0000000000857f62 ServerLoop () + cd2
 0000000000858cee main () + 453
 00000000005ab777 _start_crt () + 87
 00000000005ab6d8 _start () + 18

Any suggestions? Thanks in advance!

[1] https://www.postgresql.org/message-id/flat/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote:
> I had an incident on my Postgres 14 where queries hung in wait event
> IPC / MessageQueueInternal, MessageQueueReceive. It looks like [1];
> however, that thread doesn't have any discussion.

If the process is still running, or if the problem recurs, I suggest creating a corefile with gcore, aka gdb generate-core-file. Then we can look at the backtrace at our leisure, even if the cluster needed to be restarted right away.

What minor version of postgres is this, and what OS?

--
Justin
On Fri, 13 May 2022 at 19:41, Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote:
>> I had an incident on my Postgres 14 where queries hung in wait event
>> IPC / MessageQueueInternal, MessageQueueReceive. It looks like [1];
>> however, that thread doesn't have any discussion.
>
> If the process is still running, or if the problem recurs, I suggest creating
> a corefile with gcore, aka gdb generate-core-file. Then we can look at the
> backtrace at our leisure, even if the cluster needed to be restarted right
> away.

Thanks for your advice, I will try it later.

> What minor version of postgres is this, and what OS?

PostgreSQL 14.2 and Solaris.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Fri, May 13, 2022 at 6:16 AM Japin Li <japinli@hotmail.com> wrote:
> The process cannot be terminated by pg_terminate_backend(), although
> it returns true.

pg_terminate_backend() just sends SIGTERM. What I'm wondering is what happens when the stuck process receives SIGTERM. It would be useful, I think, to check the value of the global variable InterruptHoldoffCount in the stuck process by attaching to it with gdb. I would also try running "strace -p $PID" on the stuck process and then try terminating it again with pg_terminate_backend(). Either the system call in which it's currently stuck returns, and then it makes the same system call again and hangs again ... or the signal doesn't dislodge it from the system call in which it's stuck in the first place. It would be useful to know which of those two things is happening.

One thing I find a bit curious is that the top of the stack in your case is ioctl(). And there are no calls to ioctl() anywhere in latch.c, nor have there ever been. What operating system is this? We have 4 different versions of WaitEventSetWaitBlock() that call epoll_wait(), kevent(), poll(), and WaitForMultipleObjects() respectively. I wonder which of those we're using, and whether one of those calls is showing up as ioctl() in the stacktrace, or whether there's some other function being called in here that is somehow resulting in ioctl() getting called.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, May 14, 2022 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 13, 2022 at 6:16 AM Japin Li <japinli@hotmail.com> wrote:
> > The process cannot be terminated by pg_terminate_backend(), although
> > it returns true.
> One thing I find a bit curious is that the top of the stack in your
> case is ioctl(). And there are no calls to ioctl() anywhere in
> latch.c, nor have there ever been. What operating system is this? We
> have 4 different versions of WaitEventSetWaitBlock() that call
> epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
> respectively. I wonder which of those we're using, and whether one of
> those calls is showing up as ioctl() in the stacktrace, or whether
> there's some other function being called in here that is somehow
> resulting in ioctl() getting called.

I guess this is really illumos (née OpenSolaris), not Solaris, using our epoll build mode, with illumos's emulation of epoll, which maps epoll onto Sun's /dev/poll driver:

https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/sys/epoll.c#L230

That'd explain:

fffffb7fef216f4a ioctl (d, d001, fffffb7fffdfa0e0)

That matches the value DP_POLL from:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/devpoll.h#L44

Or if it's really Solaris, huh, are people moving illumos code back into closed Solaris these days?

As for why it's hanging, I don't know, but one thing that we changed in 14 was that we started using signalfd() to receive latch signals on systems that have it, and illumos also has an emulation of signalfd() that our configure script finds:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/signalfd.c

There were in fact a couple of unexplained hangs on the illumos build farm animals, and then they were changed to use -DWAIT_USE_POLL so that they wouldn't automatically choose epoll()/signalfd().
That is not very satisfactory, but as far as I know there is a bug in either epoll() or signalfd(), or at least some difference compared to the Linux implementations they are emulating. I spent quite a bit of time ping-ponging emails back and forth with the owner of a hanging BF animal trying to get a minimal repro for a bug report, without success. I mean, it's possible that the bug is in PostgreSQL (though no complaint has ever reached me about this stuff on Linux), but while trying to investigate it a kernel panic happened[1], which I think counts as a point against that theory...

(For what it's worth, WSL1 also emulates these two Linux interfaces and also apparently doesn't do so well enough for our purposes, also for reasons not understood by us.)

In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that we could do something to prevent the selection of WAIT_USE_EPOLL on that platform, or that we should have a halfway option: epoll() but not signalfd() (= go back to using the self-pipe trick). Patches welcome, but that feels kinda strange and would be a very niche combination that isn't fun to maintain... the real solution is to fix the bug.

[1] https://www.illumos.org/issues/13700
On Sat, May 14, 2022 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that
> we could do something to prevent the selection of WAIT_USE_EPOLL on
> that platform, or that we should have a halfway option epoll() but not
> signalfd() (= go back to using the self-pipe trick), patches welcome,
> but that feels kinda strange and would be very niche combination that
> isn't fun to maintain... the real solution is to fix the bug.

I felt a bit sad about writing that, so I took a crack at trying to write a patch that separates the signalfd/self-pipe choice from the epoll/poll choice. Maybe it's not too bad.

Japin, are you able to reproduce the problem reliably? Did I guess right, that you're on illumos? Does this help? I used defined(__sun__) to select the option, but I don't remember if that's the right way to detect that OS family; could you confirm that, or adjust as required?
Attachment
On Sat, May 14, 2022 at 10:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Japin, are you able to reproduce the problem reliably? Did I guess
> right, that you're on illumos? Does this help? I used
> defined(__sun__) to select the option, but I don't remember if that's
> the right way to detect that OS family, could you confirm that, or
> adjust as required?

Better version. Now you can independently set -DWAIT_USE_{POLL,EPOLL} and -DWAIT_USE_{SELF_PIPE,SIGNALFD} for testing, and it picks a sensible default.
Attachment
On Sat, 14 May 2022 at 11:01, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 14, 2022 at 10:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Japin, are you able to reproduce the problem reliably? Did I guess
>> right, that you're on illumos? Does this help? I used
>> defined(__sun__) to select the option, but I don't remember if that's
>> the right way to detect that OS family, could you confirm that, or
>> adjust as required?
>
> Better version. Now you can independently set -DWAIT_USE_{POLL,EPOLL}
> and -DWAIT_USE_{SELF_PIPE,SIGNALFD} for testing, and it picks a
> sensible default.

Sorry for the late reply. My bad! It's actually SmartOS, which is based on illumos.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Fri, 13 May 2022 at 22:08, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 13, 2022 at 6:16 AM Japin Li <japinli@hotmail.com> wrote:
>> The process cannot be terminated by pg_terminate_backend(), although
>> it returns true.
>
> pg_terminate_backend() just sends SIGTERM. What I'm wondering is what
> happens when the stuck process receives SIGTERM. It would be useful, I
> think, to check the value of the global variable InterruptHoldoffCount
> in the stuck process by attaching to it with gdb. I would also try
> running "strace -p $PID" on the stuck process and then try terminating
> it again with pg_terminate_backend(). Either the system call in which
> it's currently stuck returns and then it makes the same system call
> again and hangs again ... or the signal doesn't dislodge it from the
> system call in which it's stuck in the first place. It would be useful
> to know which of those two things is happening.
>
> One thing I find a bit curious is that the top of the stack in your
> case is ioctl(). And there are no calls to ioctl() anywhere in
> latch.c, nor have there ever been. What operating system is this? We
> have 4 different versions of WaitEventSetWaitBlock() that call
> epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
> respectively. I wonder which of those we're using, and whether one of
> those calls is showing up as ioctl() in the stacktrace, or whether
> there's some other function being called in here that is somehow
> resulting in ioctl() getting called.

Thanks for your advice. I will try this on Monday.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Sat, 14 May 2022 at 11:01, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 14, 2022 at 10:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Japin, are you able to reproduce the problem reliably? Did I guess
>> right, that you're on illumos? Does this help? I used
>> defined(__sun__) to select the option, but I don't remember if that's
>> the right way to detect that OS family, could you confirm that, or
>> adjust as required?
>
> Better version. Now you can independently set -DWAIT_USE_{POLL,EPOLL}
> and -DWAIT_USE_{SELF_PIPE,SIGNALFD} for testing, and it picks a
> sensible default.

Thanks for your patch! illumos already defines the following macros:

$ gcc -dM -E - </dev/null | grep -e 'illumos' -e 'sun'
#define __sun 1
#define __illumos__ 1
#define sun 1
#define __sun__ 1

Maybe use the __illumos__ macro for more accuracy:

+#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
+	!defined(__sun__)

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Mon, May 16, 2022 at 3:45 PM Japin Li <japinli@hotmail.com> wrote:
> Maybe use the __illumos__ macro for more accuracy:
>
> +#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
> +	!defined(__sun__)

Thanks, updated, and with a new commit message.

I don't know much about these OSes (though I used lots of Sun machines during the Jurassic period). I know that there are three distributions of illumos: OmniOS, SmartOS and OpenIndiana, and they share the same kernel and base system. The off-list reports I received about hangs and kernel panics were from OpenIndiana animals hake and haddock, which are not currently reporting (I'll ask why), and then their owner defined -DWAIT_USE_POLL to clear that up while we waited for progress on his kernel panic bug report. I see that OmniOS animal pollock is currently reporting and also uses -DWAIT_USE_POLL, but I couldn't find any discussion about that.

Of course, you might be hitting some completely different problem, given the lack of information. I'd be interested in the output of "p *MyLatch" (= to see if the latch has already been set), and whether "kill -URG PID" dislodges the stuck process. But given the open kernel bug report that I've now been reminded of, I'm thinking about pushing this anyway. Then we could ask the animal owners to remove -DWAIT_USE_POLL so that they'd effectively be running with -DWAIT_USE_EPOLL and -DWAIT_USE_SELF_PIPE, which would be more like PostgreSQL 13, while people who want to reproduce the problem on the illumos side could build with -DWAIT_USE_SIGNALFD.
Attachment
On Tue, May 17, 2022 at 3:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, May 16, 2022 at 3:45 PM Japin Li <japinli@hotmail.com> wrote:
> > Maybe use the __illumos__ macro for more accuracy:
> >
> > +#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
> > +	!defined(__sun__)
>
> Thanks, updated, and with a new commit message.

Pushed to master and REL_14_STABLE. I'll email the illumos build farm animal owners to say that they should be able to remove -DWAIT_USE_POLL.

Theoretically, it might be useful that we've separated the WAIT_USE_SELF_PIPE code from WAIT_USE_POLL if someone eventually wants to complete the set of possible WaitEventSet implementations by adding /dev/poll (Solaris, HP-UX) and pollset (AIX) support. I don't think those have a nicer way to receive race-free signal wakeups. Realistically, no one's likely to show up with a patch for those old proprietary Unixen at this point on the timeline; I just think it's interesting that every OS had something better than poll(), and we only need that fallback for lack of patches, not lack of kernel features. Ironically, the typical monster AIX systems I've run into in the wild are probably much more capable of suffering from poll() contention than all these puny x86 systems, with oodles of CPUs and NUMA nodes. If someone *is* still interested in scalability on AIX, I'd recommend looking at pollset for latch.c, and also the stalled huge pages thing[1].

[1] https://www.postgresql.org/message-id/CA%2BhUKGJE4dq%2BNZHrm%3DpNSNCYwDCH%2BT6HtaWm5Lm8vZzygknPpA%40mail.gmail.com
On Sun, Jun 26, 2022 at 11:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, May 17, 2022 at 3:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, May 16, 2022 at 3:45 PM Japin Li <japinli@hotmail.com> wrote:
> > > Maybe use the __illumos__ macro for more accuracy:
> > >
> > > +#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
> > > +	!defined(__sun__)
> >
> > Thanks, updated, and with a new commit message.
>
> Pushed to master and REL_14_STABLE.

FTR: I noticed that https://www.illumos.org/issues/13700 had been marked fixed, so I asked if we should remove our check[1]. Nope, another issue was opened at https://www.illumos.org/issues/14892, which I'll keep an eye on. It seems we're pretty good at hitting poll/event-related kernel bugs in various OSes.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ab4fc5dcf30ebc90a23ad878342dc528e2d25ce
On Sun, Aug 28, 2022 at 11:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Jun 26, 2022 at 11:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, May 17, 2022 at 3:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Mon, May 16, 2022 at 3:45 PM Japin Li <japinli@hotmail.com> wrote:
> > > > Maybe use the __illumos__ macro for more accuracy:
> > > >
> > > > +#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
> > > > +	!defined(__sun__)
> > >
> > > Thanks, updated, and with a new commit message.
> >
> > Pushed to master and REL_14_STABLE.
>
> FTR: I noticed that https://www.illumos.org/issues/13700 had been
> marked fixed, so I asked if we should remove our check[1]. Nope,
> another issue was opened at https://www.illumos.org/issues/14892,
> which I'll keep an eye on. It seems we're pretty good at hitting
> poll/event-related kernel bugs in various OSes.

I happened to notice, in the release notes for OmniOS that Stephen posted in the nearby GSSAPI thread, that this has now been fixed. I think there's no point in changing the back branches (hard to synchronise with kernel upgrades), but I also don't want to leave this weird wart in the code forever. Shall we remove it in 16? I don't personally care if it's 16 or 17, but I wanted to make a note about the cleanup opportunity either way, and will add this to the open commitfest.