Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Msg-id 24435.1496773316@sss.pgh.pa.us
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Buildfarm member lorikeet is failing occasionally with a failed
>> assertion during the select_parallel regression tests like this:

> I don't *think* we've made any relevant code changes lately.  The only
> thing that I can see as looking at all relevant is
> b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
> like it can be to blame.

Yeah, I don't believe that either.  That could have introduced a hard
failure (if something were relying on initializing a field before where
I put the memsets) but it's hard to see how it could produce an
intermittent and platform-specific one.

> One thought is that the only places where shm_mq_set_sender() should
> be getting invoked during the main regression tests are
> ParallelWorkerMain() and ExecParallelGetReceiver(), and both of those
> places use ParallelWorkerNumber to figure out what address to pass.
> So if ParallelWorkerNumber were getting set to the same value in two
> different parallel workers - e.g. because the postmaster went nuts and
> launched two processes instead of only one - or if
> ParallelWorkerNumber were not getting initialized at all or were
> getting initialized to some completely bogus value, it could cause
> this symptom.
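
To make the scenario above concrete, here is a minimal toy sketch. It is not PostgreSQL code. It assumes the failing assertion is a check that a queue's sender has not already been set, and every name in it (ToyQueue, toy_set_sender, TOY_NWORKERS) is hypothetical. Each worker picks its queue by worker number, so a duplicated or bogus number is exactly what would trip such a check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_NWORKERS 4          /* hypothetical number of worker slots */
#define TOY_NO_SENDER (-1)

/* Toy stand-in for a per-worker message queue; not the real shm_mq. */
typedef struct ToyQueue
{
    int sender_pid;             /* TOY_NO_SENDER until a sender attaches */
} ToyQueue;

static ToyQueue toy_queues[TOY_NWORKERS];

void
toy_init_queues(void)
{
    for (int i = 0; i < TOY_NWORKERS; i++)
        toy_queues[i].sender_pid = TOY_NO_SENDER;
}

/*
 * Attach a sender to the queue selected by worker_number.  Returns false
 * in the cases where an assertion-style check would fail: the number is
 * out of range, or the queue already has a sender.
 */
bool
toy_set_sender(int worker_number, int pid)
{
    if (worker_number < 0 || worker_number >= TOY_NWORKERS)
        return false;           /* uninitialized or bogus worker number */
    if (toy_queues[worker_number].sender_pid != TOY_NO_SENDER)
        return false;           /* two workers share one worker number */
    toy_queues[worker_number].sender_pid = pid;
    return true;
}

void
toy_demo(void)
{
    toy_init_queues();

    bool first = toy_set_sender(0, 101);    /* normal launch */
    bool dup = toy_set_sender(0, 102);      /* second process, same number */
    bool bogus = toy_set_sender(-1, 103);   /* number never initialized */

    assert(first);
    assert(!dup);
    assert(!bogus);
    printf("first=%d dup=%d bogus=%d\n", first, dup, bogus);
}
```

In this toy model only the first attach for a given worker number succeeds. Both the duplicate and the out-of-range number are rejected, which corresponds to the two failure modes suggested in the quoted text.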

Hmm.  With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.
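
A rough way to put a number on that, as a sketch only: assume failures are independent from run to run with a fixed per-run probability $p$, and that an MTBF of $m$ runs corresponds to $p \approx 1/m$. Both assumptions, and the symbols $p$ and $m$, are illustrative and not taken from the buildfarm data.

```latex
\[
  \Pr[\text{20 consecutive clean runs}] = (1 - p)^{20}
  \approx \left(1 - \frac{1}{m}\right)^{20}
\]
\[
  m \le 20 \;\Longrightarrow\;
  \left(1 - \frac{1}{m}\right)^{20} \le \left(\frac{19}{20}\right)^{20} \approx 0.36,
  \qquad
  \text{e.g. } m = 10:\; 0.9^{20} \approx 0.12
\]
```

Under those assumptions, a 20-run clean streak with the bug already present would be somewhat unlikely but far from impossible. The value $m = 10$ is a hypothetical example, not a measured figure.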

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?
        regards, tom lane


