Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Msg-id: 31674.1496780737@sss.pgh.pa.us
In response to: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm.  With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?

> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it.

Yeah, me either, but we're really theorizing in advance of the data here.
Andrew, could you apply the attached patch on lorikeet and run the
regression tests enough times to get a couple of failures?  Then grepping
the postmaster log for 'parallel worker' should give you results like

2017-06-06 16:20:12.393 EDT [31216] LOG:  starting PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.400 EDT [31216] LOG:  stopping PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.406 EDT [31217] LOG:  starting PID 31217, parallel worker for PID 31215, worker number 3
2017-06-06 16:20:12.406 EDT [31218] LOG:  starting PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.406 EDT [31219] LOG:  starting PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.406 EDT [31220] LOG:  starting PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31218] LOG:  stopping PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.412 EDT [31219] LOG:  stopping PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.412 EDT [31220] LOG:  stopping PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31217] LOG:  stopping PID 31217, parallel worker for PID 31215, worker number 3
... etc etc ...

If it looks different from that in a crash case, we'll have something
to go on.

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
or so.)

            regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index cb22174..d3cb26c 100644
*** a/src/backend/access/transam/parallel.c
--- b/src/backend/access/transam/parallel.c
*************** ParallelWorkerMain(Datum main_arg)
*** 950,955 ****
--- 950,961 ----
      Assert(ParallelWorkerNumber == -1);
      memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int));

+     /* Log parallel worker startup. */
+     ereport(LOG,
+             (errmsg("starting PID %d, %s, worker number %d",
+                     MyProcPid, MyBgworkerEntry->bgw_name,
+                     ParallelWorkerNumber)));
+
      /* Set up a memory context and resource owner. */
      Assert(CurrentResourceOwner == NULL);
      CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel");
*************** ParallelWorkerMain(Datum main_arg)
*** 1112,1117 ****
--- 1118,1129 ----

      /* Report success. */
      pq_putmessage('X', NULL, 0);
+
+     /* Log parallel worker shutdown. */
+     ereport(LOG,
+             (errmsg("stopping PID %d, %s, worker number %d",
+                     MyProcPid, MyBgworkerEntry->bgw_name,
+                     ParallelWorkerNumber)));
  }

  /*
