Thread: gharial segfaulting on REL_12_STABLE only
Hi,

This is apparently an EDB-owned machine but I have no access to it
currently (I could ask if necessary).  For some reason it's been failing
for a week, but only on REL_12_STABLE, with this in the log:

2019-08-20 04:31:48.886 MDT [13421:4] LOG: server process (PID 13871) was terminated by signal 11: unrecognized signal
2019-08-20 04:31:48.886 MDT [13421:5] DETAIL: Failed process was running: SET default_table_access_method = '';

Apparently HPUX's sys_siglist doesn't recognise that most popular of
signals, 11, but from googling I see that it has its traditional meaning
there.

That's clearly in the create_am test:

2019-08-20 04:31:22.404 MDT [13871:31] pg_regress/create_am HINT: Use DROP ... CASCADE to drop the dependent objects too.
2019-08-20 04:31:22.404 MDT [13871:32] pg_regress/create_am STATEMENT: DROP ACCESS METHOD gist2;
2019-08-20 04:31:22.405 MDT [13871:33] pg_regress/create_am LOG: statement: DROP ACCESS METHOD gist2 CASCADE;
2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG: statement: SET default_table_access_method = '';

Perhaps it was really running the next statement.

It's hard to see how cdc8d371e2, the only non-doc commit listed on the
first failure, could have anything to do with that.

--
Thomas Munro
https://enterprisedb.com
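For context on where that "unrecognized signal" text comes from: it is the
backend's own fallback string, used when no usable signal name can be
obtained from the platform.  A minimal sketch of that fallback, paraphrased
from memory of src/port/pgstrsignal.c rather than copied from the tree
(HAVE_STRSIGNAL stands in for the configure-detected symbol):

#include <string.h>		/* strsignal(), where available */

/*
 * Sketch of a pg_strsignal()-style helper; the exact structure in
 * pgstrsignal.c may differ.
 */
const char *
pg_strsignal_sketch(int signum)
{
#ifdef HAVE_STRSIGNAL
	const char *result = strsignal(signum);

	if (result)
		return result;
#endif

	/*
	 * No strsignal(), or it returned NULL: this is the string that shows
	 * up as "terminated by signal 11: unrecognized signal".
	 */
	return "unrecognized signal";
}

So the log line says nothing about the platform's signal table as such; it
just means no usable strsignal() result was obtained for signal 11.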
Thomas Munro <thomas.munro@gmail.com> writes:
> This is apparently an EDB-owned machine but I have no access to it
> currently (I could ask if necessary). For some reason it's been
> failing for a week, but only on REL_12_STABLE, with this in the log:

Yeah, I've been puzzling over that to little avail.

> It's hard to see how cdc8d371e2, the only non-doc commit listed on the
> first failure, could have anything to do with that.

Exactly :-(.  It seems completely reproducible since then, but how could
that have triggered a failure over here?  And why only in this branch?
The identical patch went into HEAD.

> 2019-08-20 04:31:48.886 MDT [13421:4] LOG: server process (PID 13871)
> was terminated by signal 11: unrecognized signal
> 2019-08-20 04:31:48.886 MDT [13421:5] DETAIL: Failed process was
> running: SET default_table_access_method = '';

> Apparently HPUX's sys_siglist doesn't recognise that most popular of
> signals, 11, but by googling I see that it has its traditional meaning
> there.

HPUX hasn't *got* sys_siglist, nor strsignal() which is what we're
actually relying on these days (cf. pgstrsignal.c).  I was puzzled by
that too to start with, though.  I wonder if we shouldn't rearrange
pg_strsignal so that the message in the !HAVE_STRSIGNAL case is
something like "signal names not available on this platform" rather
than something that looks like we should've recognized it and didn't.

> 2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG:
> statement: SET default_table_access_method = '';

> Perhaps it was really running the next statement.

Hard to see how, because this should have reported

ERROR: invalid value for parameter "default_table_access_method": ""
DETAIL: default_table_access_method cannot be empty.

but it didn't get that far.  It seems like it must have died either in
the (utterly trivial) check that leads to the above-quoted complaint,
or somewhere in the ereport mechanism.  Neither theory seems very
credible.

The seeming action-at-a-distance nature of the failure has me
speculating about compiler or linker bugs, but I dislike jumping to
that type of conclusion without hard evidence.  A stack trace would
likely be really useful right about now.

			regards, tom lane
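The "utterly trivial" check referred to here is a GUC check hook on
default_table_access_method.  A minimal sketch of the shape such a hook
takes, using PostgreSQL's string check-hook signature and the
GUC_check_errdetail macro; the body is paraphrased and hypothetical, and
the real function's details may differ:

#include "postgres.h"
#include "utils/guc.h"

/*
 * Sketch of a string-GUC check hook that rejects an empty setting.
 * Returning false makes SET fail cleanly with the ERROR/DETAIL pair
 * quoted above; nothing here should be able to dereference a bad
 * pointer, which is why a segfault at this point is so surprising.
 */
static bool
check_table_am_sketch(char **newval, void **extra, GucSource source)
{
	if (**newval == '\0')
	{
		GUC_check_errdetail("default_table_access_method cannot be empty.");
		return false;
	}

	/* further validation (catalog lookup of the AM name, etc.) elided */
	return true;
}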
On Tue, Aug 27, 2019 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A stack trace would likely be really useful right about now.

Yeah.  Looking into how to get that.

--
Thomas Munro
https://enterprisedb.com
On Tue, Aug 27, 2019 at 2:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 27, 2019 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A stack trace would likely be really useful right about now.
>
> Yeah.  Looking into how to get that.

Erm.  I heard the system was in a very unhappy state and couldn't be
logged into.  After it was rebooted, the problem appears to have gone
away.  That is quite unsatisfying.

"anole" runs on the same host, occasionally fails to launch any parallel
workers, and seems to be pretty unhappy too -- very long run times
(minutes where my smaller machines take seconds).  So the machine may be
massively overloaded and swapping, or something like that, which is
worth looking into, but that doesn't explain how we get to a segfault
without an underlying hard-to-reach bug in our code...

--
Thomas Munro
https://enterprisedb.com