Re: stress test for parallel workers - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: stress test for parallel workers
Date
Msg-id CA+hUKGL6cDyb2maq2P60cEsjFK=3saBCAj7sDzE3jysL-PRwqg@mail.gmail.com
In response to Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |    branch     |      snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD          | 2019-06-16 20:28:25
>  lorikeet  | HEAD          | 2019-07-07 14:58:38
>  lorikeet  | HEAD          | 2019-07-02 10:38:08
>  lorikeet  | HEAD          | 2019-06-14 14:58:24
>  lorikeet  | HEAD          | 2019-07-04 20:28:44
>  lorikeet  | HEAD          | 2019-04-30 11:00:49
>  lorikeet  | HEAD          | 2019-06-19 20:29:27
>  lorikeet  | HEAD          | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes    | HEAD          | 2019-06-14 09:18:18
>  vulpes    | HEAD          | 2019-06-27 09:17:19
>  vulpes    | HEAD          | 2019-07-21 09:01:45
>  vulpes    | HEAD          | 2019-06-12 09:11:02
>  vulpes    | HEAD          | 2019-07-05 08:43:29
>  vulpes    | HEAD          | 2019-07-15 08:43:28
>  vulpes    | HEAD          | 2019-07-19 09:28:12
>  wobbegong | HEAD          | 2019-06-09 20:43:22
>  wobbegong | HEAD          | 2019-07-02 21:17:41
>  wobbegong | HEAD          | 2019-06-04 21:06:07
>  wobbegong | HEAD          | 2019-07-14 20:43:54
>  wobbegong | HEAD          | 2019-06-19 21:05:04
>  wobbegong | HEAD          | 2019-07-08 20:55:18
>  wobbegong | HEAD          | 2019-06-28 21:18:46
>  wobbegong | HEAD          | 2019-06-02 20:43:20
>  wobbegong | HEAD          | 2019-07-04 21:01:37
>  wobbegong | HEAD          | 2019-06-14 21:20:59
>  wobbegong | HEAD          | 2019-06-23 21:36:51
>  wobbegong | HEAD          | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.

On some systems (depending on sysctl kernel.dmesg_restrict on Linux,
security.bsd.unprivileged_read_msgbuf on FreeBSD, etc.) you can run
dmesg as a non-root user, and that is where the OOM killer's footprints
or signaled exit statuses for processes under init would normally be
found.  That seems a bit invasive for the host system, though (I guess
you'd filter it carefully), and unfortunately unprivileged dmesg isn't
enabled on many common systems anyway.
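
To make the "filter it carefully" part concrete, here is a rough
sketch of what a build farm client might do with dmesg output when it
is readable.  The regular expression assumes the kernel's "Killed
process PID (name)" wording, which varies between kernel versions, and
the function names and the default list of process names are
hypothetical, not anything the build farm has today.

```python
import re
import subprocess

# Assumption: the Linux OOM killer logs a line containing
# "Killed process <pid> (<name>)".  The exact wording differs between
# kernel versions, so treat this pattern as a starting point only.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def find_oom_kills(dmesg_text, names=("postgres", "postmaster")):
    """Return (pid, name) pairs for OOM kills of the named processes.

    Lines about any other process are dropped, so nothing unrelated
    from the host's kernel log ends up in a build farm report.
    """
    hits = []
    for line in dmesg_text.splitlines():
        m = OOM_KILL_RE.search(line)
        if m and m.group(2) in names:
            hits.append((int(m.group(1)), m.group(2)))
    return hits


def read_dmesg():
    """Return dmesg output, or None if it is missing or not permitted."""
    try:
        proc = subprocess.run(["dmesg"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return proc.stdout if proc.returncode == 0 else None
```

A caller would run read_dmesg() after a failed test run and, if it
returns text, report only what find_oom_kills() extracts from it.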

Maybe there is a systemd-specific way to get the info we need without
being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+
prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+
procctl(PROC_REAP_ACQUIRE)) that exists just to report on its
children's exit status, so the build farm could see a "pid XXX was
killed by signal 9" message if it is nuked by the OOM killer.  Perhaps
there is a common subreaper wrapper out there that would wait, print
messages like that, rinse and repeat until it has no children and then
exit, or perhaps pg_ctl or even a perl script could do something like
that if requested.  Another thought, not explored, is the brand new
Linux pidfd stuff that can be used to wait and get exit status for a
non-child process (or the older BSD equivalent), but the paint isn't
even dry on that stuff anyway.
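
For illustration, a subreaper wrapper along those lines might look
roughly like the sketch below.  It is Linux-only, calls prctl() through
ctypes, and takes the value 36 for PR_SET_CHILD_SUBREAPER from
<linux/prctl.h>.  The function names and the message format are made up
for the example; this is not an existing tool and not something pg_ctl
does.

```python
import ctypes
import os
import sys

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>, Linux 3.4+


def describe_status(pid, status):
    """Turn a raw wait status into a one-line report."""
    if os.WIFSIGNALED(status):
        return f"pid {pid} was killed by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        return f"pid {pid} exited with status {os.WEXITSTATUS(status)}"
    return f"pid {pid} changed state (raw status {status})"


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                    zero, zero, zero)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_and_report(argv, out=sys.stderr):
    """Run argv, then reap and report every descendant until none remain."""
    become_subreaper()
    if os.fork() == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    while True:
        try:
            pid, status = os.waitpid(-1, 0)
        except ChildProcessError:
            break  # no children left, so we are done
        print(describe_status(pid, status), file=out, flush=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    run_and_report(sys.argv[1:])
```

If the wrapper is used to run "pg_ctl start", pg_ctl exits and the
orphaned postmaster is reparented to the wrapper instead of init, so a
SIGKILL from the OOM killer would show up as "pid XXX was killed by
signal 9" in the wrapper's output.  Note that the wrapper then stays
alive for as long as the postmaster does.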

--
Thomas Munro
https://enterprisedb.com


