Thread: make check crashes on POWER8 machine
Hi,

I've encountered a problem with Postgres on a PowerPC machine. Sometimes
make check on the REL_12_STABLE branch crashes with a segmentation fault.

The problem seems to be in errors.sql, when the statement

    select infinite_recurse();

is executed, so the stack trace produced by gdb is too long to post here.

The problem is rare and doesn't occur on every run of make check.
When I run make check repeatedly, it occurs once in several hundred runs.

The problem seems to be architecture-dependent, because I cannot
reproduce it on an x86_64 CPU in more than a thousand runs of make check.

The machine is a KVM virtual server on a POWER8 system with the following CPU:

$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    8
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Model:                 2.0 (pvr 004d 0200)
Model name:            POWER8 (architected), altivec supported
Hypervisor vendor:     KVM
Virtualization type:   para
L1d cache:             64K
L1i cache:             32K
NUMA node0 CPU(s):     0-31

It is running RedHat 7.6.

I've collected all the relevant information I can think of (including a
210 MB core file, the git commit id, configure and backend logs, and the
list of installed RPMs) and put it on Google Drive:

https://drive.google.com/file/d/1Xs7DixBhMPEmViGUt5wAMewB6_xbZirY/view

I hope that somebody more experienced with POWER CPUs can suggest
something about this problem.
--
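The repeated runs described in the report could be automated with something like the following. This is only a minimal sketch, not the script used by the reporter: the helper name run_until_failure and the run limit are hypothetical, and it assumes that make check exits with a non-zero status when a regression test fails or a backend crashes.

```python
import subprocess


def run_until_failure(cmd, max_runs, cwd=None):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that exits with a
    non-zero status, or None if every run succeeded.
    """
    for run in range(1, max_runs + 1):
        # Output is discarded here; the regression logs written by the
        # test run itself are what should be examined after a failure.
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return run
    return None
```

Called from the top of a configured source tree as run_until_failure(["make", "check"], 1000), this would stop at the first failing run, so that src/test/regress/regression.diffs and any core file are left in place for inspection.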
On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
> Hi,
>
> I've encountered a problem with Postgres on PowerPC machine. Sometimes

Is it related to this?

https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
https://bugzilla.kernel.org/show_bug.cgi?id=205183

(My initial report on that thread was unrelated user error on my part.)

> It seems that problem is in errors.sql when executed
>
> select infinite_recurse(); statement
>
> so stack trace, produced by gdb is too long to post here.
>
> Problem is rare and doesn't occur on all runs of make check.
> When I run make check repeatedly it occurs once in several hundred runs.
>
> It seems that problem is architecture-dependent, because I cannot
> reproduce it on x86_64 CPU with more than thousand runs of make check.

That's all consistent with the above problem.

> Running RedHat 7.6.

--
Justin
On Fri, 13 Mar 2020 07:43:59 -0500
Justin Pryzby <pryzby@telsasoft.com> wrote:

> On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
> > Hi,
> >
> > I've encountered a problem with Postgres on PowerPC machine.
> > Sometimes
>
> Is it related to
> https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
> https://bugzilla.kernel.org/show_bug.cgi?id=205183

I don't think so. At least, I cannot see any signal-handler-related stuff
in the trace; I see lots of calls to the stored procedure executor
instead.

However, several different stack traces show completely different parts
of the code at the moment SIGSEGV arrives, which may point to an
asynchronous nature of the problem. Unfortunately, I haven't kept all the
cores I've seen.

It rather looks as if, in some rare circumstances, Postgres is unable to
properly detect the end-of-stack condition.
--
Victor Wagner <vitus@wagner.pp.ru> writes:
> Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
>>> I've encountered a problem with Postgres on PowerPC machine.

>> Is it related to
>> https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
>> https://bugzilla.kernel.org/show_bug.cgi?id=205183

> I don't think so. At least I cannot see any signal handler-related stuff
> in the trace, but see lots of calls to stored procedure executor
> instead.

Read the whole thread. We fixed the issue with recursion in the
postmaster (9abb2bfc0); but the intermittent failure in infinite_recurse
is exactly the same as what we've been seeing for a long time in the
buildfarm, and there is zero doubt that it's that kernel bug.

In the other thread I'd suggested that we could quit running errors.sql
in parallel with other tests, but that would slow down parallel
regression testing for everybody. I'm disinclined to do that now, since
the buildfarm problem is intermittent and easily recognized.

			regards, tom lane
On Fri, 13 Mar 2020 10:56:15 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Victor Wagner <vitus@wagner.pp.ru> writes:
> > Justin Pryzby <pryzby@telsasoft.com> wrote:
> >> On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
> >>> I've encountered a problem with Postgres on PowerPC machine.
>
> >> Is it related to
> >> https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
> >> https://bugzilla.kernel.org/show_bug.cgi?id=205183
>
> > I don't think so. At least I cannot see any signal handler-related
> > stuff in the trace, but see lots of calls to stored procedure
> > executor instead.
>
> Read the whole thread. We fixed the issue with recursion in the
> postmaster (9abb2bfc0); but the intermittent failure in
> infinite_recurse is exactly the same as what we've been seeing for a
> long time in the buildfarm, and there is zero doubt that it's that
> kernel bug.

I've tried cherry-picking commit 9abb2bfc8 into REL_12_STABLE and
rerunning make check in a loop. Oops: on run 543 it segfaulted with the
same symptoms as before.

Here is a link to the new core and logs:

https://drive.google.com/file/d/1oF-0fKHKvFn6FaJ3u-v36p9W0EBAY9nb/view?usp=sharing

I'll try the same simple test (running make check repeatedly) with
master. Nobody else needs this machine until the end of the weekend, so
I have time for a couple of thousand runs.

--
Victor Wagner <vitus@wagner.pp.ru>
Victor Wagner <vitus@wagner.pp.ru> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Read the whole thread. We fixed the issue with recursion in the
>> postmaster (9abb2bfc0); but the intermittent failure in
>> infinite_recurse is exactly the same as what we've been seeing for a
>> long time in the buildfarm, and there is zero doubt that it's that
>> kernel bug.

> I've tried to cherry-pick commit 9abb2bfc8 into REL_12_STABLE and rerun
> make check in loop. Oops, on 543 run it segfaults with same symptoms
> as before.

Unsurprising, because it's a kernel bug. Maybe you could try
cherry-picking the patch proposed at kernel.org (see other thread).

			regards, tom lane