Re: stress test for parallel workers - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: stress test for parallel workers |
Date | |
Msg-id | CA+hUKGL6cDyb2maq2P60cEsjFK=3saBCAj7sDzE3jysL-PRwqg@mail.gmail.com |
In response to | Re: stress test for parallel workers (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: stress test for parallel workers
Re: stress test for parallel workers |
List | pgsql-hackers |
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do you have an example to hand?  Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exited during a parallel transaction" failures is accurate,
> there is a very strong correlation I'd not noticed: it's just a few
> buildfarm critters that are producing those.  To wit, I find that
> string in these recent failures (checked all runs in the past 3 months):
>
>   sysname  |    branch     |      snapshot
> -----------+---------------+---------------------
>  lorikeet  | HEAD          | 2019-06-16 20:28:25
>  lorikeet  | HEAD          | 2019-07-07 14:58:38
>  lorikeet  | HEAD          | 2019-07-02 10:38:08
>  lorikeet  | HEAD          | 2019-06-14 14:58:24
>  lorikeet  | HEAD          | 2019-07-04 20:28:44
>  lorikeet  | HEAD          | 2019-04-30 11:00:49
>  lorikeet  | HEAD          | 2019-06-19 20:29:27
>  lorikeet  | HEAD          | 2019-05-21 08:28:26
>  lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
>  lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
>  lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
>  lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
>  lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
>  vulpes    | HEAD          | 2019-06-14 09:18:18
>  vulpes    | HEAD          | 2019-06-27 09:17:19
>  vulpes    | HEAD          | 2019-07-21 09:01:45
>  vulpes    | HEAD          | 2019-06-12 09:11:02
>  vulpes    | HEAD          | 2019-07-05 08:43:29
>  vulpes    | HEAD          | 2019-07-15 08:43:28
>  vulpes    | HEAD          | 2019-07-19 09:28:12
>  wobbegong | HEAD          | 2019-06-09 20:43:22
>  wobbegong | HEAD          | 2019-07-02 21:17:41
>  wobbegong | HEAD          | 2019-06-04 21:06:07
>  wobbegong | HEAD          | 2019-07-14 20:43:54
>  wobbegong | HEAD          | 2019-06-19 21:05:04
>  wobbegong | HEAD          | 2019-07-08 20:55:18
>  wobbegong | HEAD          | 2019-06-28 21:18:46
>  wobbegong | HEAD          | 2019-06-02 20:43:20
>  wobbegong | HEAD          | 2019-07-04 21:01:37
>  wobbegong | HEAD          | 2019-06-14 21:20:59
>  wobbegong | HEAD          | 2019-06-23 21:36:51
>  wobbegong | HEAD          | 2019-07-18 21:31:36
> (32 rows)
>
> We already knew that lorikeet has its own peculiar stability
> problems, and these other two critters run different compilers
> on the same Fedora 27 ppc64le platform.
>
> So I think I've got to take back the assertion that we've got
> some lurking generic problem.  This pattern looks way more
> like a platform-specific issue.  Overaggressive OOM killer
> would fit the facts on vulpes/wobbegong, perhaps, though
> it's odd that it only happens on HEAD runs.

chipmunk also:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-08-06%2014:16:16

I wondered if the build farm should try to report OOM kill -9 or other
signal activity affecting the postmaster.

On some systems (depending on sysctl kernel.dmesg_restrict on Linux,
security.bsd.unprivileged_read_msgbuf on FreeBSD, etc.) you can run
dmesg as a non-root user, and there the OOM killer's footprints or
signaled exit statuses for processes under init might normally be
found, but that seems a bit invasive for the host system (I guess
you'd filter it carefully).  Unfortunately it isn't enabled on many
common systems anyway.

Maybe there is a systemd-specific way to get the info we need without
being root?

Another idea: start the postmaster under a subreaper (Linux 3.4+
prctl(PR_SET_CHILD_SUBREAPER), FreeBSD 10.2+
procctl(PROC_REAP_ACQUIRE)) that exists just to report on its
children's exit status, so the build farm could see a "pid XXX was
killed by signal 9" message if it is nuked by the OOM killer.  Perhaps
there is a common subreaper wrapper out there that would wait, print
messages like that, rinse and repeat until it has no children and then
exit, or perhaps pg_ctl or even a perl script could do something like
that if requested.

Another thought, not explored, is the brand new Linux pidfd stuff that
can be used to wait and get exit status for a non-child process (or
the older BSD equivalent), but the paint isn't even dry on that stuff
anyway.
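
To make the subreaper idea concrete, here is a rough, untested-on-the-
buildfarm sketch of such a wrapper.  It is Linux-only and written in
Python rather than perl purely for illustration; the function names
(describe_exit, become_subreaper, run_and_report) are made up for this
example, and the prctl() call goes through ctypes using the
PR_SET_CHILD_SUBREAPER value (36) from <linux/prctl.h>.  A FreeBSD
version would need procctl(PROC_REAP_ACQUIRE) instead, which is not
shown.

```python
import ctypes
import os
import sys

# Value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h> (Linux 3.4+).
PR_SET_CHILD_SUBREAPER = 36


def describe_exit(pid, status):
    """Turn a raw wait status into a one-line report (hypothetical helper)."""
    if os.WIFSIGNALED(status):
        return "pid %d was killed by signal %d" % (pid, os.WTERMSIG(status))
    if os.WIFEXITED(status):
        return "pid %d exited with status %d" % (pid, os.WEXITSTATUS(status))
    return "pid %d changed state (raw status %d)" % (pid, status)


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to us (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_and_report(argv, out=sys.stderr):
    """Run argv, then report each descendant's exit until none remain."""
    become_subreaper()
    pid = os.fork()
    if pid == 0:
        # Child: replace ourselves with the requested command.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    while True:
        try:
            pid, status = os.wait()  # blocks until any child exits
        except ChildProcessError:
            break  # no children left, so we are done
        print(describe_exit(pid, status), file=out)


if __name__ == "__main__" and len(sys.argv) > 1:
    run_and_report(sys.argv[1:])
```

If that were saved as, say, subreaper.py (a hypothetical name), one
could run "python3 subreaper.py pg_ctl -D DATADIR -w start".  Once
pg_ctl exits, the postmaster should be reparented to the wrapper
instead of init, so a SIGKILL from the OOM killer would show up as a
"was killed by signal 9" line in the wrapper's stderr.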

-- 
Thomas Munro
https://enterprisedb.com