Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date
Msg-id CAA4eK1LmF+ra1iCf+7AjcV0YuRmt6hcR=+m9q39jGG-o7CQOvQ@mail.gmail.com
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> But surely the silent treatment should only apply to DSM_OP_CREATE?
>
> Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
>
> The lack of any other message before the 'could not map' failure must,
> then, mean that dsm_attach() couldn't find an entry in shared memory
> that it wanted to attach to.
>

Yes, I also think so.

>  But how could that happen?
>

I can think of a few reasons why this could happen: (a) the segment
handle passed by the master backend to the worker backend somehow got
messed up; (b) all other workers, along with the master backend,
exited before one of the workers tried to attach; (c) the master
backend did not actually create any dsm segment (for example, because
the maximum number of segments had been reached) but still launched
some workers; (d) some corner-case bug in the dsm code prevents it
from attaching to a valid segment handle.

Of these, I have checked that (c) can't happen, because we ensure
that if the segment is not created then we set the number of workers
to zero.  I think (b) shouldn't happen, because we wait for all
workers to exit before the query is finished.  I think (a) and (d)
are somewhat related.  I have looked around in the relevant code but
didn't find any obvious problem; however, it seems to me that
something might behave differently in the Cygwin environment.  For
example, the value of seg->handle could be different from what we
expect in dsm_create.  Basically, random() returns long and we store
it in a dsm_handle (uint32), so considering that long is 8 bytes on
Cygwin [1] and 4 bytes on Windows, the value could wrap.  But even if
that happens, it is not clear how it could cause what we are seeing
in this case.
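
A minimal sketch of the narrowing in question, not the actual dsm code:
sketch_handle stands in for dsm_handle (uint32), and narrow_to_handle
and survives_narrowing are hypothetical helpers introduced only for
illustration.  It shows that converting a long to a 32-bit unsigned
type is well defined (reduction modulo 2^32) and that any value in the
range 0 to 2^31-1, which is the range POSIX documents for random(),
comes through unchanged regardless of whether long is 4 or 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for dsm_handle, which the mail describes as a uint32. */
typedef uint32_t sketch_handle;

/*
 * Hypothetical helper: narrow a long the same way an assignment to a
 * 32-bit unsigned variable would.  The C standard defines conversion
 * to an unsigned type as reduction modulo 2^32, so this never traps;
 * on a platform with 8-byte long the high 32 bits are simply dropped.
 */
static sketch_handle
narrow_to_handle(long value)
{
    return (sketch_handle) value;
}

/*
 * Hypothetical helper: returns 1 if the value is non-negative and
 * reads back unchanged after narrowing, 0 otherwise.
 */
static int
survives_narrowing(long value)
{
    return value >= 0 &&
        (unsigned long) value == (unsigned long) narrow_to_handle(value);
}
```

If the already-narrowed handle is the value handed to the workers (an
assumption here, not something verified against the source), both
sides would compare the same 32-bit number, so a wrap alone would not
explain a failed lookup.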


[1] - https://stackoverflow.com/questions/384502/what-is-the-bit-size-of-long-on-64-bit-windows

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com