Re: Some 9.5beta2 backend processes not terminating properly? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Some 9.5beta2 backend processes not terminating properly?
Date
Msg-id 20160102144003.g7kqdvcakfhoftgg@alap3.anarazel.de
In response to Re: Some 9.5beta2 backend processes not terminating properly?  (Andres Freund <andres@anarazel.de>)
Responses Re: Some 9.5beta2 backend processes not terminating properly?  (Andres Freund <andres@anarazel.de>)
Re: Some 9.5beta2 backend processes not terminating properly?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 2016-01-02 14:26:47 +0100, Andres Freund wrote:
> On 2016-01-02 18:40:38 +0530, Amit Kapila wrote:
> > If we
> > remember the closed socket event and then take appropriate action,
> > then this problem won't happen.  Attached patch which by no-means
> > a complete fix shows what I wanted to say and after this the problem
> > mentioned by Shay doesn't happen, although I get LOG message
> > which is due to the reason that proper handling for socket closure
> > needs to be done in this path.  This patch is based on the code
> > after commit 387da18874afa17156ee3af63766f17efb53c4b9.  I
> > will do testing and refine the fix based on HEAD later as I am done
> > for the today.
> 
> It's weird that this fixes the problem. As we were previously, according
> to Shay, not busy looping, this seems to indicate that FD_CLOSE is only
> reported once or somesuch?
> 
> It'd be very interesting to add a debug elog() into the
>             if (resEvents.lNetworkEvents & FD_CLOSE)
>             {
>                 if (wakeEvents & WL_SOCKET_READABLE)
>                     result |= WL_SOCKET_READABLE;
>                 if (wakeEvents & WL_SOCKET_WRITEABLE)
>                     result |= WL_SOCKET_WRITEABLE;
>             }
> 
> path in WaitLatchOrSocket. If it actually returns with the current code,
> we have a better idea where to look for problems.


I wonder if the following is the problem: The docs for WSAEventSelect()
say:
"Having successfully recorded the occurrence of the network event (by
setting the corresponding bit in the internal network event record) and
signaled the associated event object, no further actions are taken for
that network event until the application makes the function call that
implicitly reenables the setting of that network event and signaling of
the associated event object."
and they also note specifically for FD_CLOSE that there is no re-enabling
function.

See
https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576%28v=vs.85%29.aspx
which goes on to talk about some level triggered events (FD_READ, ...)
and others being edge triggered. It's not clear to me from that whether
FD_CLOSE is supposed to be edge or level triggered.

If FD_CLOSE is indeed edge and not level triggered - which imo would be
supremely insane - we'd be in trouble. It'd explain why some failures
are noticed and others not.
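
To make the edge vs. level distinction concrete, here is a toy model in
C. It does not call Winsock or any PostgreSQL code: ToySocket,
TOY_FD_CLOSE and the toy_* functions are made-up names for illustration
only, and the "report once" behaviour is exactly the assumption under
discussion, not something the docs confirm. The remember flag sketches
the idea of remembering the closed socket event, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT the Winsock FD_CLOSE constant. */
#define TOY_FD_CLOSE 0x20

typedef struct ToySocket
{
    bool peer_closed;       /* real state: the peer closed the connection */
    bool close_reported;    /* edge model: event was already handed out once */
    bool close_remembered;  /* caller-side memory of an earlier close event */
} ToySocket;

/* Edge-triggered model: the close event is reported exactly once. */
int
toy_enum_events_edge(ToySocket *s)
{
    assert(s != NULL);
    if (s->peer_closed && !s->close_reported)
    {
        s->close_reported = true;
        return TOY_FD_CLOSE;
    }
    return 0;
}

/* Level-triggered model: the close event is reported for as long as it holds. */
int
toy_enum_events_level(const ToySocket *s)
{
    assert(s != NULL);
    return s->peer_closed ? TOY_FD_CLOSE : 0;
}

/*
 * One "wait" call on top of the edge-triggered model. With remember set,
 * a close seen once is recorded and reported on every later call too.
 */
bool
toy_wait_sees_close(ToySocket *s, bool remember)
{
    int events = toy_enum_events_edge(s);

    if (remember && (events & TOY_FD_CLOSE))
        s->close_remembered = true;

    return (events & TOY_FD_CLOSE) != 0 || (remember && s->close_remembered);
}
```

In this model a second toy_wait_sees_close(&s, false) on a closed socket
returns false, which is the "some failures are noticed and others not"
pattern; with remember set to true, every call after the first keeps
reporting the close.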

ISTM this should be relatively easy to debug by adding a few debug
elogs.

Andres


