Re: stress test for parallel workers - Mailing list pgsql-hackers

From Tom Lane
Subject Re: stress test for parallel workers
Date
Msg-id 27924.1571068231@sss.pgh.pa.us
In response to Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: stress test for parallel workers
List pgsql-hackers
I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...

Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform.  There were only three:

 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL:  Failed process was running: select infinite_recurse();

Looking closer at these, though, they were *not* SIGSEGV failures,
but SIGKILLs.  Seeing that they were all on the same machine on the
same day, I'm thinking we can write them off as a transiently
misconfigured OOM killer.
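The SIGSEGV/SIGKILL distinction comes from the dead process's wait status: SIGSEGV (signal 11 on Linux) means the process itself faulted, while SIGKILL (signal 9) means something outside the process, such as the OOM killer, killed it. Below is a minimal Python sketch of how a parent can tell the two apart. It is not PostgreSQL code. `describe_exit` and `run_child_killed_by` are hypothetical helpers, and the message only loosely resembles the postmaster's "terminated by signal N: ..." log line.

```python
import os
import signal


def describe_exit(status: int) -> str:
    """Render a raw wait() status roughly the way a crash report would."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        # strsignal() can return None for signals the platform doesn't name.
        name = signal.strsignal(sig) or "unknown signal"
        return f"terminated by signal {sig}: {name}"
    if os.WIFEXITED(status):
        return f"exited with exit code {os.WEXITSTATUS(status)}"
    return "stopped or otherwise changed state"


def run_child_killed_by(sig: int) -> int:
    """Fork a child that sends `sig` to itself; return its raw wait status."""
    pid = os.fork()
    if pid == 0:
        os.kill(os.getpid(), sig)
        os._exit(0)  # only reached if the signal was not fatal
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    # SIGSEGV is what the PPC64 failures reported; SIGKILL is what an OOM
    # killer sends. Both are just sent with kill() here, which yields the
    # same termination signal in the wait status as a genuine fault would.
    for sig in (signal.SIGSEGV, signal.SIGKILL):
        print(describe_exit(run_child_killed_by(sig)))
```

On Linux this prints one "terminated by signal 11" line and one "terminated by signal 9" line. Only the signal number is needed to separate a crash from an external kill.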

So, pending some other theory emerging from the kernel hackers, we're
down to it's-a-PPC64-kernel-bug.  That leaves me wondering what, if
anything, we want to do about it.  Even if it's fixed reasonably promptly
in Linux upstream, and then we successfully nag assorted vendors to
incorporate the fix quickly, that's still going to leave us with frequent
buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux
versions.

Should we move the infinite_recurse test to run alone in its own parallel
group, just to stop these failures?  That's annoying from a parallelism
standpoint, but I don't see any other way to avoid these failures.
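For reference, in a pg_regress schedule file each `test:` line is one group whose tests run concurrently. "Alone in a parallel group" therefore means a `test:` line that names only that test. Below is a hypothetical Python sketch, not part of pg_regress, that parses schedule text and checks whether a test runs by itself. The BEFORE and AFTER excerpts are invented for illustration and are not the real parallel_schedule. Treating infinite_recurse as its own schedule entry is also an assumption.

```python
def parse_schedule(text: str) -> list[list[str]]:
    """Split pg_regress-style schedule text into parallel groups.

    Each "test:" line is one group whose members run concurrently; "#"
    starts a comment. Other directives are ignored in this sketch.
    """
    groups: list[list[str]] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.startswith("test:"):
            names = line[len("test:"):].split()
            if names:
                groups.append(names)
    return groups


def runs_alone(groups: list[list[str]], test: str) -> bool:
    """True if `test` is the only member of the group that contains it."""
    for group in groups:
        if test in group:
            return len(group) == 1
    raise ValueError(f"{test!r} not found in schedule")


# Invented schedule excerpts, not the real parallel_schedule.
BEFORE = """
test: select_into select_distinct errors infinite_recurse
"""
AFTER = """
test: select_into select_distinct errors
# infinite_recurse gets a group of its own
test: infinite_recurse
"""

if __name__ == "__main__":
    print(runs_alone(parse_schedule(BEFORE), "infinite_recurse"))  # False
    print(runs_alone(parse_schedule(AFTER), "infinite_recurse"))   # True
```

In the AFTER excerpt the other three tests still run together. The schedule now contains a slot where only infinite_recurse runs, and that slot is the parallelism cost mentioned above.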

            regards, tom lane


