Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date
Msg-id 31838.1497537132@sss.pgh.pa.us
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But we know, from the subsequent failed assertion, that the leader was
>> still trying to launch parallel workers.  So that particular theory
>> doesn't hold water.

> Is there any chance that it's already trying to launch parallel
> workers for the *next* query?

Oh!  Yeah, you might be right, because the trace includes a statement
LOG entry from the leader in between:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted

It's fairly hard to read this other than as telling us that the worker was
launched for the EXPLAIN (although really? why aren't we skipping that if
EXEC_FLAG_EXPLAIN_ONLY?), and then we see a new LOG entry for the next
statement before the leader hits its assertion failure.
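
For reference, the condition named in the TRAP message can be modelled
roughly as below.  This is only a sketch, not PostgreSQL code: it assumes
both counters are unsigned 32-bit values that only ever increase, it takes
the 1024 bound from the assertion text, and the function names are invented
for illustration.

```python
# Sketch of the condition from the TRAP message; not PostgreSQL code.
# Assumption: both counters are unsigned 32-bit and only ever increase.
LIMIT = 1024        # bound shown in the failed assertion
MASK = 0xFFFFFFFF   # emulate unsigned 32-bit subtraction


def active_parallel_workers(register_count, terminate_count):
    """Registered minus terminated, computed modulo 2**32."""
    return (register_count - terminate_count) & MASK


def assertion_holds(register_count, terminate_count):
    """True when the modelled assertion condition is satisfied."""
    return active_parallel_workers(register_count, terminate_count) <= LIMIT
```

Under that assumption, one way the condition could fail (not necessarily
what happened here) is a termination being counted without a matching
registration: the difference then wraps around to a very large number
instead of going negative.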

> Could be -- but it could also be timing-related.  If we are in fact
> using cygwin's fork emulation, the documentation for it explains that
> it's slow: https://www.cygwin.com/faq.html#faq.api.fork
> Interestingly, it also mentions that making it work requires
> suspending the parent while the child is starting up, which probably
> does not happen on any other platform.  Of course it also makes my
> theory that the child doesn't reach dsm_attach() before the parent
> finishes the query pretty unlikely.

Well, if this was a worker launched during InitPlan() for an EXPLAIN,
the leader would have shut down the query almost immediately after
launching the worker.  So it does fit pretty well as long as you're
willing to believe that the leader got to run before the child.
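
The ordering in question can be written out as a toy model.  This is a
sketch under stated assumptions, not the real code path: SegmentTable and
run() are hypothetical names, the worker is simulated sequentially rather
than as a separate process, and the error text is simply copied from the
log above.

```python
# Toy model of the leader/worker ordering; all names are hypothetical.
class SegmentTable:
    """Stands in for the set of live dynamic shared memory segments."""

    def __init__(self):
        self._live = set()
        self._next_handle = 1

    def create(self):
        handle = self._next_handle
        self._next_handle += 1
        self._live.add(handle)
        return handle

    def destroy(self, handle):
        self._live.discard(handle)

    def attach(self, handle):
        if handle not in self._live:
            raise RuntimeError("could not map dynamic shared memory segment")
        return handle


def run(leader_finishes_first):
    table = SegmentTable()
    handle = table.create()      # leader sets up the segment
    # The worker is "launched" here and is given only the handle.
    if leader_finishes_first:
        table.destroy(handle)    # leader shuts the query down right away
    try:
        table.attach(handle)     # worker finally gets to run and attaches
        outcome = "attached"
    except RuntimeError as err:
        outcome = str(err)
    if not leader_finishes_first:
        table.destroy(handle)    # normal cleanup after the worker attached
    return outcome
```

In this model run(True) yields the error string and run(False) yields
"attached", which is the whole point: the failure needs nothing more than
the leader getting to run before the child.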

But what this theory doesn't explain is: why haven't we seen this before?
It now seems like it ought to come up often, since there are several
EXPLAINs for parallel queries in that test.
        regards, tom lane


