Re: stress test for parallel workers - Mailing list pgsql-hackers

From Tom Lane
Subject Re: stress test for parallel workers
Msg-id 17389.1563945314@sss.pgh.pa.us
In response to Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection.  We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the
history.  I forget whether Dunstan or Frost is the person to ask
for access, but there's no reason you shouldn't have it.

> Do you have an example to hand?  Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those.  To wit, I find that
string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot       
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got
some lurking generic problem.  This pattern looks way more
like a platform-specific issue.  Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.
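
A minimal sketch of the tally behind that reading, assuming the
(sysname, branch) pairs are transcribed by hand from the result above.
The names ROWS and tally are illustrative only and are not part of any
buildfarm tooling; snapshot timestamps are left out because only the
grouping matters here.

```python
from collections import Counter

# (sysname, branch) pairs transcribed from the 32-row result above.
# Assumption: hand-copied counts per group, not a live buildfarm query.
ROWS = (
    [("lorikeet", "HEAD")] * 8
    + [("lorikeet", "REL_11_STABLE")] * 2
    + [("lorikeet", "REL_12_STABLE")] * 2
    + [("lorikeet", "REL9_6_STABLE")] * 1
    + [("vulpes", "HEAD")] * 7
    + [("wobbegong", "HEAD")] * 12
)


def tally(rows):
    """Count failures per critter and per (critter, branch) pair."""
    by_critter = Counter(sysname for sysname, _ in rows)
    by_pair = Counter(rows)
    return by_critter, by_pair


by_critter, by_pair = tally(ROWS)

# Print one line per (critter, branch) group, sorted for readability.
for (sysname, branch), n in sorted(by_pair.items()):
    print(f"{sysname:<10} {branch:<14} {n}")

# Branches seen on the two ppc64le critters.
ppc_branches = {b for (s, b) in by_pair if s in ("vulpes", "wobbegong")}
print("vulpes/wobbegong branches:", sorted(ppc_branches))
```

Grouped this way, only three critters appear at all, and the two
ppc64le ones show up on HEAD alone, while lorikeet is spread over
several branches.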

            regards, tom lane


