Thread: buildfarm animal shoveler failing with "Illegal instruction"
Hi Mark,

shoveler has been failing for a while with an odd error. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2020-09-18%2014%3A01%3A48

Illegal instruction
pg_dumpall: error: pg_dump failed on database "template1", exiting
waiting for server to shut down.... done

None of the changes in that time frame look like they're likely causing
illegal instructions to be emitted that weren't before. So I am
wondering if anything changed on that machine around 2020-09-18
14:01:48 ?

Greetings,

Andres Freund

On Thu, Oct 01, 2020 at 12:12:44PM -0700, Andres Freund wrote:
> Hi Mark,
>
> shoveler has been failing for a while with an odd error. E.g.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2020-09-18%2014%3A01%3A48
>
> Illegal instruction
> pg_dumpall: error: pg_dump failed on database "template1", exiting
> waiting for server to shut down.... done
>
> None of the changes in that time frame look like they're likely causing
> illegal instructions to be emitted that weren't before. So I am
> wondering if anything changed on that machine around 2020-09-18
> 14:01:48 ?

It looks like the last package update was 2020-06-10 06:59:26, according
to the apt logs.

I'm getting Tom set up with access too, in case he has time before me to
get a stack trace to see what's happening...

Regards,
Mark

Mark Wong <mark@2ndquadrant.com> writes:
> I'm getting Tom set up with access too, in case he has time before me to
> get a stack trace to see what's happening...

tl;dr: it's hard to conclude that this is anything but a compiler bug.

I was able to reproduce this on shoveler's host, but only when using
the compiler shoveler uses (clang-3.9), not the 6.3 gcc that's also
on there and is of similar vintage. Observations:

* You don't need any complicated test case; "pg_dump template1"
is enough.

* Reverting 1ed6b8956's addition of a "postfix operators are not supported
anymore" warning to dumpOpr() makes it go away. This despite the fact
that that code is never reached when dumping template1. (We do enter
dumpOpr, but the oprinfo->dobj.dump test always fails.)

* Reducing the optimization level to -O1 or -O0 makes it go away.

* Inserting a debugging fprintf in dumpOpr makes it go away.

Since clang 3.9 is several years old, maybe we could move shoveler
up to a newer version? Or dial it down to -O1 optimization?

regards, tom lane

On Thu, Oct 01, 2020 at 09:12:53PM -0400, Tom Lane wrote:
> Mark Wong <mark@2ndquadrant.com> writes:
> > I'm getting Tom set up with access too, in case he has time before me to
> > get a stack trace to see what's happening...
>
> tl;dr: it's hard to conclude that this is anything but a compiler bug.
>
> I was able to reproduce this on shoveler's host, but only when using
> the compiler shoveler uses (clang-3.9), not the 6.3 gcc that's also
> on there and is of similar vintage. Observations:
>
> * You don't need any complicated test case; "pg_dump template1"
> is enough.
>
> * Reverting 1ed6b8956's addition of a "postfix operators are not supported
> anymore" warning to dumpOpr() makes it go away. This despite the fact
> that that code is never reached when dumping template1. (We do enter
> dumpOpr, but the oprinfo->dobj.dump test always fails.)
>
> * Reducing the optimization level to -O1 or -O0 makes it go away.
>
> * Inserting a debugging fprintf in dumpOpr makes it go away.
>
> Since clang 3.9 is several years old, maybe we could move shoveler
> up to a newer version? Or dial it down to -O1 optimization?

There is ayu, same system with clang 4.0, so covered on that front.

I went ahead and stopped the jobs to run with clang 3.9. This is also
the same system that was running clang 3.8 too. I tried looking for EOL
dates, but had trouble finding anything... But I can change the
optimization flag if we want it back.

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/

On 2020-10-02 10:45:58 -0700, Mark Wong wrote:
> I went ahead and stopped the jobs to run with clang 3.9. This is also
> the same system that was running clang 3.8 too. I tried looking for EOL
> dates, but had trouble finding anything... But I can change the
> optimization flag if we want it back.

llvm officially only supports the last minor version, and only does one
or two point releases for them. 3.9 and 3.8 are long past EOL.

Andres Freund <andres@anarazel.de> writes:
> On 2020-10-02 10:45:58 -0700, Mark Wong wrote:
>> I went ahead and stopped the jobs to run with clang 3.9. This is also
>> the same system that was running clang 3.8 too. I tried looking for EOL
>> dates, but had trouble finding anything... But I can change the
>> optimization flag if we want it back.

> llvm officially only supports the last minor version, and only does one
> or two point releases for them. 3.9 and 3.8 are long past EOL.

I thought about asking Mark to re-enable it at -O1, but we have
recent experience reminding us that non-default optimization levels
are likely to be even buggier than the default [1]. So that's
probably not a productive answer. We might as well just retire
the animal.

regards, tom lane

[1] https://www.postgresql.org/message-id/1934344.1596305790@sss.pgh.pa.us