Thread: gharial segfaulting on REL_12_STABLE only

gharial segfaulting on REL_12_STABLE only

From: Thomas Munro

Hi,

This is apparently an EDB-owned machine but I have no access to it
currently (I could ask if necessary).  For some reason it's been
failing for a week, but only on REL_12_STABLE, with this in the log:

2019-08-20 04:31:48.886 MDT [13421:4] LOG:  server process (PID 13871)
was terminated by signal 11: unrecognized signal
2019-08-20 04:31:48.886 MDT [13421:5] DETAIL:  Failed process was
running: SET default_table_access_method = '';

Apparently HPUX's sys_siglist doesn't recognise that most popular of
signals, 11, but by googling I see that it has its traditional meaning
there.  That's clearly in the create_am test:

2019-08-20 04:31:22.404 MDT [13871:31] pg_regress/create_am HINT:  Use
DROP ... CASCADE to drop the dependent objects too.
2019-08-20 04:31:22.404 MDT [13871:32] pg_regress/create_am STATEMENT:
 DROP ACCESS METHOD gist2;
2019-08-20 04:31:22.405 MDT [13871:33] pg_regress/create_am LOG:
statement: DROP ACCESS METHOD gist2 CASCADE;
2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG:
statement: SET default_table_access_method = '';

Perhaps it was really running the next statement.

It's hard to see how cdc8d371e2, the only non-doc commit listed on the
first failure, could have anything to do with that.

-- 
Thomas Munro
https://enterprisedb.com



Re: gharial segfaulting on REL_12_STABLE only

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> This is apparently an EDB-owned machine but I have no access to it
> currently (I could ask if necessary).  For some reason it's been
> failing for a week, but only on REL_12_STABLE, with this in the log:

Yeah, I've been puzzling over that to little avail.

> It's hard to see how cdc8d371e2, the only non-doc commit listed on the
> first failure, could have anything to do with that.

Exactly :-(.  It seems completely reproducible since then, but how
could that have triggered a failure over here?  And why only in this
branch?  The identical patch went into HEAD.

> 2019-08-20 04:31:48.886 MDT [13421:4] LOG:  server process (PID 13871)
> was terminated by signal 11: unrecognized signal
> 2019-08-20 04:31:48.886 MDT [13421:5] DETAIL:  Failed process was
> running: SET default_table_access_method = '';

> Apparently HPUX's sys_siglist doesn't recognise that most popular of
> signals, 11, but by googling I see that it has its traditional meaning
> there.

HPUX hasn't *got* sys_siglist, nor strsignal() which is what we're
actually relying on these days (cf. pgstrsignal.c).  I was puzzled
by that too to start with, though.  I wonder if we shouldn't rearrange
pg_strsignal so that the message in the !HAVE_STRSIGNAL case is
something like "signal names not available on this platform" rather
than something that looks like we should've recognized it and didn't.
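
A minimal sketch of the rearrangement suggested above, written as a
standalone program fragment rather than as a patch to pgstrsignal.c.  The
names sketch_strsignal, format_termination and SKETCH_HAVE_STRSIGNAL are
hypothetical stand-ins (the last one models the configure-time
HAVE_STRSIGNAL test), and the log wording only imitates the line quoted in
this thread:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the configure-time HAVE_STRSIGNAL symbol.
 * It is left undefined here to model a platform without strsignal(),
 * such as the HP-UX host discussed in this thread.  If you define it,
 * you may also need a feature-test macro (e.g. _POSIX_C_SOURCE 200809L)
 * so that <string.h> declares strsignal().
 */
/* #define SKETCH_HAVE_STRSIGNAL 1 */

/* Return a description of signum; never returns NULL. */
static const char *
sketch_strsignal(int signum)
{
#ifdef SKETCH_HAVE_STRSIGNAL
	const char *result = strsignal(signum);

	/* The platform has names, but not for this particular signal. */
	if (result == NULL)
		result = "unrecognized signal";
	return result;
#else
	/* No name table at all: say so, instead of implying a lookup failed. */
	(void) signum;
	return "signal names not available on this platform";
#endif
}

/* Build a log line shaped like the one quoted in this thread. */
static void
format_termination(char *buf, size_t buflen, int pid, int signum)
{
	assert(buf != NULL && buflen > 0);
	snprintf(buf, buflen,
			 "server process (PID %d) was terminated by signal %d: %s",
			 pid, signum, sketch_strsignal(signum));
}
```

With the macro left undefined, a report for PID 13871 and signal 11 would
end in "signal names not available on this platform", which no longer
reads as though signal 11 itself were unknown.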

> 2019-08-20 04:31:22.422 MDT [13871:34] pg_regress/create_am LOG:
> statement: SET default_table_access_method = '';

> Perhaps it was really running the next statement.

Hard to see how, because this should have reported

ERROR:  invalid value for parameter "default_table_access_method": ""
DETAIL:  default_table_access_method cannot be empty.

but it didn't get that far.  It seems like it must have died either
in the (utterly trivial) check that leads to the above-quoted
complaint, or somewhere in the ereport mechanism.  Neither theory
seems very credible.
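
For readers unfamiliar with that check, here is a rough standalone sketch
of the kind of test being described.  It is not the server's actual check
hook: sketch_check_table_am is a hypothetical name, and the real code
reports the detail through the GUC error machinery rather than a
caller-supplied buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the empty-string test on
 * default_table_access_method.  Returns false and fills in a detail
 * message when the proposed value is empty; returns true otherwise.
 */
static bool
sketch_check_table_am(const char *newval, char *detail, size_t detail_len)
{
	assert(detail != NULL && detail_len > 0);

	if (newval == NULL || newval[0] == '\0')
	{
		snprintf(detail, detail_len, "%s cannot be empty.",
				 "default_table_access_method");
		return false;
	}

	/* Nothing to complain about; leave the detail message empty. */
	detail[0] = '\0';
	return true;
}
```

The point of the sketch is only that the rejecting branch touches nothing
beyond the first byte of the proposed value, which is why a crash there is
so hard to believe.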

The seeming action-at-a-distance nature of the failure has me
speculating about compiler or linker bugs, but I dislike
jumping to that type of conclusion without hard evidence.

A stack trace would likely be really useful right about now.

            regards, tom lane



Re: gharial segfaulting on REL_12_STABLE only

From: Thomas Munro

On Tue, Aug 27, 2019 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A stack trace would likely be really useful right about now.

Yeah.  Looking into how to get that.


--
Thomas Munro
https://enterprisedb.com



Re: gharial segfaulting on REL_12_STABLE only

From: Thomas Munro

On Tue, Aug 27, 2019 at 2:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 27, 2019 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A stack trace would likely be really useful right about now.
>
> Yeah.  Looking into how to get that.

Erm.  I heard the system was in a very unhappy state and couldn't be
logged into.  After it was rebooted, the problem appears to have gone
away.  That is quite unsatisfying.

"anole" runs on the same host, and occasionally fails to launch any
parallel workers, and it seems to be pretty unhappy too -- very long
runtimes (minutes where my smaller machines take seconds).  So the
machine may be massively overloaded and swapping or something like
that, something to be looked into, but that doesn't explain how we get
to a segfault without an underlying hard-to-reach bug in our code...

-- 
Thomas Munro
https://enterprisedb.com