Bugs in our Windows socket code - Mailing list pgsql-hackers

From Tom Lane
Subject Bugs in our Windows socket code
Date
Msg-id 4351.1336927207@sss.pgh.pa.us
Whole thread Raw
Responses Re: Bugs in our Windows socket code
Re: Bugs in our Windows socket code
List pgsql-hackers
I've been trying to figure out why my recent attempt to latch-ify the
stats collector didn't work well on the Windows buildfarm machines.
After a good deal of staring at our code and Microsoft's documentation
I have a theory, which I intend to try out shortly.  However, it appears
to me that this code is riddled with problems that are going to take a lot
of work to solve properly.  I have neither the interest nor the Windows
development environment to do more than getting the pgstats code to work,
so somebody else is going to have to step up here.

The first issue is that the patched stats collector code expected recv()
on a non-blocking socket to fail with EWOULDBLOCK or a related error code.
pgwin32_recv() gets this exactly backwards, and instead waits for the
socket to come ready using pgwin32_waitforsinglesocket().  I initially
thought this was broken by commit 215cbc90f8db3fc8a70af5572b156f49216c2f70
which introduced the "pgwin32_noblock" hack, but on checking the git
history I think it was wrong before then too.

This wouldn't actually be fatal in itself, as long as the recv call exits
with EINTR on receipt of a signal, which it does; it doesn't much matter
whether the stats collector waits at the recv() or the WaitLatchOrSocket()
call when it has nothing to do.

However, pgwin32_waitforsinglesocket creates an event object and attaches
it to the socket, and leaves things that way on exit.  Assuming that we
do sometimes get to the WaitLatchOrSocket() call, that code creates a
different event object and attaches it to the socket.  The WSAEventSelect
man page
http://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
warns against this, saying
 Issuing a WSAEventSelect for a socket cancels any previous WSAAsyncSelect or WSAEventSelect for the same socket and
clearsthe internal network event record. For example, to associate an event object with both reading and writing
networkevents, the application must call WSAEventSelect with both FD_READ and FD_WRITE, as follows:rc =
WSAEventSelect(s,hEventObject, FD_READ|FD_WRITE); It is not possible to specify different event objects for different
networkevents. The following code will not work; the second call will cancel the effects of the first, and only the
FD_WRITEnetwork event will be associated with hEventObject2:rc = WSAEventSelect(s, hEventObject1, FD_READ);rc =
WSAEventSelect(s,hEventObject2, FD_WRITE); //bad
 

So the socket's attachment is getting moved back and forth between two
different event objects.  What I suspect is happening is that an incoming
FD_READ event is getting "lost" when it happens concurrently with such a
switch; probably the wrong one of the two event objects gets marked
signaled, and then we just sit and wait.  The phraseology about "clearing
the internal network event record" is pretty scary too, even though
elsewhere on the page it's claimed that FD_READ is level-triggered and
so should be re-asserted against a newly selected event object.  Another
point is that googling suggests that it's unwise to ResetEvent an event
that's attached to a socket, which is what will happen during repeated
calls to pgwin32_waitforsinglesocket.

What I intend to try doing about this is making
pgwin32_waitforsinglesocket detach its event object from the socket before
returning, ie WSAEventSelect(socket, NULL, 0).  This will hopefully
decouple it a bit better from WaitLatchOrSocket.

Other interesting things I learned from the WSAEventSelect man page:
 The WSAEventSelect function automatically sets socket s to nonblocking mode, regardless of the value of
lNetworkEvents.To set socket s back to blocking mode, it is first necessary to clear the event record associated with
sockets via a call to WSAEventSelect with lNetworkEvents set to zero and the hEventObject parameter set to NULL. You
canthen call ioctlsocket or WSAIoctl to set the socket back to blocking mode.
 

I imagine this is the reason for the otherwise seemingly brain-dead
decision that pgwin32_recv() should act like non-blocking sockets are
blocking.  However, it's still extremely distressing that code that
*wants* to operate a socket in non-blocking mode is not able to do so.
This ought to be fixed, and not with a global flag like pgwin32_noblock
--- it needs to be per-socket.

Also, although it's stated that WSAEventSelect will generally set the
event to signaled state when any requested network event has already
happened, we read this:
 The FD_WRITE network event is handled slightly differently. An FD_WRITE network event is recorded when a socket is
firstconnected with a call to the connect, ConnectEx, WSAConnect, WSAConnectByList, or WSAConnectByName function or
whena socket is accepted with accept, AcceptEx, or WSAAccept function and then after a send fails with WSAEWOULDBLOCK
andbuffer space becomes available. Therefore, an application can assume that sends are possible starting from the first
FD_WRITEnetwork event setting and lasting until a send returns WSAEWOULDBLOCK. After such a failure the application
willfind out that sends are again possible when an FD_WRITE network event is recorded and the associated event object
isset.
 

That is, if you request FD_WRITE events for a pre-existing socket with
WSAEventSelect, you will not get one until the outbound network buffer has
been filled and then has partially emptied.  (This is incredibly broken,
but Microsoft evidently has no intention of fixing it.)  This behavior
probably explains the bogus-looking code in pgwin32_waitforsinglesocket
and pgwin32_select where we actually issue a zero-length send() to see if
the socket is write-ready.  However, it doesn't mean that that code is
right or properly documented.  AFAICS the appropriate behavior is to try
*one* send, without any preceding wait, and if that fails with
WSAEWOULDBLOCK then wait normally for FD_WRITE to be signaled.
The claimed connection to UDP-ness of the connection looks like it's
probably a figment of our imagination, as well.

More generally, it seems clear to me that Microsoft's code is designed
around the assumption that an event object remains attached to a socket
for the lifetime of the socket.  This business of transiently associating
event objects with sockets looks quite inefficient and is evidently
triggering a lot of unpleasant corner-case behaviors.  I wonder whether we
should not try to make "pgsocket" encapsulate a socket and an associated
event object as a single entity on Windows.  (Such a struct would be a
good place to keep a per-socket noblock flag, too.)  I'm not going to
tackle that myself though.

One other bug I noted is that if pgwin32_waitforsinglesocket gets a
timeout condition, it returns 0 without bothering to set errno.  This
seems only latent at the moment since most callers pass INFINITE timeout,
but it probably ought to get cleaned up.  Or maybe we should just remove
the timeout option, since once I get pgstat.c fixed there will be no
callers not passing INFINITE timeout.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Analyzing foreign tables & memory problems
Next
From: Jeff Janes
Date:
Subject: Foreign keys in pgbench