Thread: Possible explanation for Win32 stats regression test failures

Possible explanation for Win32 stats regression test failures

From

Tom Lane

Date:

16 July 2006, 13:24:35

The latest buildfarm report from trout,
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=trout&dt=2006-07-16%2014:36:19
shows a failure mode that we've seen recently on snake, but not for a
long time on any non-Windows machines: the stats test fails with
symptoms suggesting that the stats counters aren't getting incremented.

Dave Page spotted the reason for this during the recent code sprint.
The stats collector is dying with

FATAL:  could not read statistics message: A blocking operation was interrupted by a call to WSACancelBlockingCall.

If you look through the above-mentioned report's postmaster log, you'll
see several occurrences of this, indicating that the stats collector is
being restarted by the postmaster and then dying again.

After a bit of digging in our code, I realized that the above text is
probably the system's translation of WSAEINTR, which we equate EINTR
to, and thus that what's happening is just "recv() returned EINTR,
even though the socket had already tested read-ready".  I'm not sure
whether that's considered normal behavior on Unixen but it is clearly
possible with our Win32 implementation of recv() --- any pending signal
will make it happen.  So it seems an appropriate fix for the stats
collector is
           len = recv(pgStatSock, (char *) &msg,                      sizeof(PgStat_Msg), 0);           if (len < 0)
+           {
+               if (errno == EINTR)
+                   continue;               ereport(ERROR,                       (errcode_for_socket_access(),
             errmsg("could not read statistics message: %m")));
 
+           }

and we had better look around to make sure all other calls of send()
and recv() treat EINTR as expected too.

But ... AFAICS the only signal that could plausibly be arriving at the
stats collector is SIGALRM from its own use of setitimer() to schedule
stats file writes.  So it seems that this failure occurs when the alarm
fires between the select() and recv() calls; which is possible but it
seems a mighty narrow window.  So I'm not 100% convinced that this is
the correct explanation of the problem --- we've seen snake fail this
way repeatedly, and here we have trout doing it three times within one
regression run.  Can anyone think of a reason why the timing might fall
just so with a higher probability than one would expect?  Perhaps
pgwin32_select() has got a problem that makes it not dispatch signals
as it seems to be trying to do?
        regards, tom lane

Re: Possible explanation for Win32 stats regression test failures

From

Tom Lane

Date:

16 July 2006, 15:34:53

I wrote:
> But ... AFAICS the only signal that could plausibly be arriving at the
> stats collector is SIGALRM from its own use of setitimer() to schedule
> stats file writes.  So it seems that this failure occurs when the alarm
> fires between the select() and recv() calls; which is possible but it
> seems a mighty narrow window.  So I'm not 100% convinced that this is
> the correct explanation of the problem --- we've seen snake fail this
> way repeatedly, and here we have trout doing it three times within one
> regression run.  Can anyone think of a reason why the timing might fall
> just so with a higher probability than one would expect?  Perhaps
> pgwin32_select() has got a problem that makes it not dispatch signals
> as it seems to be trying to do?

Ah-hah, I see it.  pgwin32_select() uses WaitForMultipleObjectsEx() with
an event for the socket read-ready plus an event for signal arrival.
It returns EINTR if the return code from WaitForMultipleObjectsEx shows
the signal-arrival event as fired.  However, WaitForMultipleObjectsEx is
defined to return the number of the *first* event in the list that is
fired.  This means that if the socket comes read-ready at the same time
the SIGALRM arrives, pgwin32_select() will ignore the signal, and it'll
be processed by the subsequent pgwin32_recv().

Now I don't know anything about the Windows scheduler, but I suppose it
gives processes time quantums like everybody else does.  So "at the same
time" really means "within the same scheduler clock tick", which is not
so unlikely after all.  In short, before the just-committed patch, the
Windows stats collector would fail if a stats message arrived during the
same clock tick that its SIGALRM timeout expired.

I think this explains not only the intermittent stats regression
failures, but the reports we've heard from Merlin and others about the
stats collector being unstable under load on Windows.  The heavier the
load of stats messages, the more likely one is to arrive during the tick
when the timeout expires.
        regards, tom lane

Re: Possible explanation for Win32 stats regression test

From

korry

Date:

17 July 2006, 15:45:01

<br /><blockquote type="CITE"><pre>
<font color="#000000">Ah-hah, I see it.  pgwin32_select() uses WaitForMultipleObjectsEx() with</font>
<font color="#000000">an event for the socket read-ready plus an event for signal arrival.</font>
<font color="#000000">It returns EINTR if the return code from WaitForMultipleObjectsEx shows</font>
<font color="#000000">the signal-arrival event as fired.  However, WaitForMultipleObjectsEx is</font>
<font color="#000000">defined to return the number of the *first* event in the list that is</font>
<font color="#000000">fired.  This means that if the socket comes read-ready at the same time</font>
<font color="#000000">the SIGALRM arrives, pgwin32_select() will ignore the signal, and it'll</font>
<font color="#000000">be processed by the subsequent pgwin32_recv().</font>

<font color="#000000">Now I don't know anything about the Windows scheduler, but I suppose it</font>
<font color="#000000">gives processes time quantums like everybody else does.  So "at the same</font>
<font color="#000000">time" really means "within the same scheduler clock tick", which is not</font>
<font color="#000000">so unlikely after all.  In short, before the just-committed patch, the</font>
<font color="#000000">Windows stats collector would fail if a stats message arrived during the</font>
<font color="#000000">same clock tick that its SIGALRM timeout expired.</font>

<font color="#000000">I think this explains not only the intermittent stats regression</font>
<font color="#000000">failures, but the reports we've heard from Merlin and others about the</font>
<font color="#000000">stats collector being unstable under load on Windows.  The heavier the</font>
<font color="#000000">load of stats messages, the more likely one is to arrive during the tick</font>
<font color="#000000">when the timeout expires.</font>
</pre></blockquote><br /> There's a second problem in pgwin32_waitforsinglesocket() that may be getting in your way.<br
/><br/> Inside of pgwin32_waitforsingleselect(), we create a kernel synchronization object (an Event) and associate
thatEvent with the socket.  When the TCP/IP stack detects interesting traffic on the socket, it signals the Event
object(interesting in this case is READ, WRITE, CLOSE, or ACCEPT, depending on the caller) and that wakes up the call
toWaitForMultipleObjectsEx().  <br /><br /> That all works fine, unless you have two or more sockets in the backend
(theimportant part is that src/include/port/win32.h #define's select() and other socket-related function - if you
compilea piece of network code that happens to #include port/win32.h, you'll get the pgwin32_xxx() versions).<br /><br
/>The problem is that, each time you go through pgwin32_waitforsinglesocket(), you tie the *same* kernel object
(waiteventis static) to each socket.  If you have more than one socket, you'll tie each socket to the same kernel
event. The kernel will signal that Event whenever interesting traffic appears on *any* of the sockets. The net effect
isthat, if you are waiting for activity on socket A, any activity on socket B will also awaken
WaitForMultipleObjects(). If you then try to read from socket A, you'll get an "operation would block error" because
nothinghappened on socket A.<br /><br /> The fix is pretty simple - just call WSAEventSelect( s, waitevent, 0 ) after
WaitForMultipleObjectsEx()returns.  That disassociates the socket from the Event (it will get re-associated the next
timepgwin32_waitforsingleselect() is called.  <br /><br /> I ran into this problem working on the PL/pgSQL debugger and
Ihaven't gotten around to posting a patch yet, sorry.<br /><br />             -- Korry (<a
href="mailto:korryd@enterprisedb.com">korryd@enterprisedb.com</a>)<br/><br /><br />

Re: Possible explanation for Win32 stats regression test

From

Tom Lane

Date:

17 July 2006, 18:27:53

korry <korry@appx.com> writes:
> The problem is that, each time you go through
> pgwin32_waitforsinglesocket(), you tie the *same* kernel object
> (waitevent is static) to each socket.

> The fix is pretty simple - just call WSAEventSelect( s, waitevent, 0 )
> after WaitForMultipleObjectsEx() returns.  That disassociates the socket
> from the Event (it will get re-associated the next time
> pgwin32_waitforsingleselect() is called.  

Hmm.  Presumably we don't do this a whole lot (use multiple sockets) or
we'd have noticed before.  Perhaps better would be to keep an additional
static variable saying which socket the event is currently associated
to, and only issue the extra WSAEventSelect calls if we need to change
it.  Or is WSAEventSelect fast enough that it doesn't matter?

Anyway, someone with a Windows machine needs to code and test this ...
        regards, tom lane

Re: Possible explanation for Win32 stats regression test

From

Bruce Momjian

Date:

29 July 2006, 00:00:03

Is anyone working on this?

---------------------------------------------------------------------------

Tom Lane wrote:
> korry <korry@appx.com> writes:
> > The problem is that, each time you go through
> > pgwin32_waitforsinglesocket(), you tie the *same* kernel object
> > (waitevent is static) to each socket.
> 
> > The fix is pretty simple - just call WSAEventSelect( s, waitevent, 0 )
> > after WaitForMultipleObjectsEx() returns.  That disassociates the socket
> > from the Event (it will get re-associated the next time
> > pgwin32_waitforsingleselect() is called.  
> 
> Hmm.  Presumably we don't do this a whole lot (use multiple sockets) or
> we'd have noticed before.  Perhaps better would be to keep an additional
> static variable saying which socket the event is currently associated
> to, and only issue the extra WSAEventSelect calls if we need to change
> it.  Or is WSAEventSelect fast enough that it doesn't matter?
> 
> Anyway, someone with a Windows machine needs to code and test this ...
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Possible explanation for Win32 stats regression test

From

"korryd@enterprisedb.com"

Date:

29 July 2006, 09:19:33

Is anyone working on this?

Tom Lane wrote:
> korry <korry@appx.com> writes:
> > The problem is that, each time you go through
> > pgwin32_waitforsinglesocket(), you tie the *same* kernel object
> > (waitevent is static) to each socket.
> 
> > The fix is pretty simple - just call WSAEventSelect( s, waitevent, 0 )
> > after WaitForMultipleObjectsEx() returns.  That disassociates the socket
> > from the Event (it will get re-associated the next time
> > pgwin32_waitforsingleselect() is called.  
> 
> Hmm.  Presumably we don't do this a whole lot (use multiple sockets) or
> we'd have noticed before.  Perhaps better would be to keep an additional
> static variable saying which socket the event is currently associated
> to, and only issue the extra WSAEventSelect calls if we need to change
> it.  Or is WSAEventSelect fast enough that it doesn't matter?
>

Here's a simple patch that fixes the problem (I haven't explored the performance of this patch compared to Tom's suggestion).

-- Korry

Re: [PATCHES] Possible explanation for Win32 stats regression

From

Andrew Dunstan

Date:

29 July 2006, 11:22:04

heh. I was just doing it the way Tom suggested - see attached. With a
little more trouble we could also keep track if the listened for events
and sometimes save ourselves a second call to WSAEventSelect, but I'm
not sure it's worth it.

cheers

andrew


korryd@enterprisedb.com wrote:
>> Is anyone working on this?
>>
>> Tom Lane wrote:
>> > korry <korry@appx.com <mailto:korry@appx.com>> writes:
>> > > The problem is that, each time you go through
>> > > pgwin32_waitforsinglesocket(), you tie the *same* kernel object
>> > > (waitevent is static) to each socket.
>> >
>> > > The fix is pretty simple - just call WSAEventSelect( s, waitevent, 0 )
>> > > after WaitForMultipleObjectsEx() returns.  That disassociates the socket
>> > > from the Event (it will get re-associated the next time
>> > > pgwin32_waitforsingleselect() is called.
>> >
>> > Hmm.  Presumably we don't do this a whole lot (use multiple sockets) or
>> > we'd have noticed before.  Perhaps better would be to keep an additional
>> > static variable saying which socket the event is currently associated
>> > to, and only issue the extra WSAEventSelect calls if we need to change
>> > it.  Or is WSAEventSelect fast enough that it doesn't matter?
>> >
>>
>
> Here's a simple patch that fixes the problem (I haven't explored the
> performance of this patch compared to Tom's suggestion).
>
>

Index: backend/port/win32/socket.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/port/win32/socket.c,v
retrieving revision 1.11
diff -c -r1.11 socket.c
*** backend/port/win32/socket.c    5 Mar 2006 15:58:35 -0000    1.11
--- backend/port/win32/socket.c    29 Jul 2006 14:19:23 -0000
***************
*** 106,111 ****
--- 106,112 ----
  pgwin32_waitforsinglesocket(SOCKET s, int what)
  {
      static HANDLE waitevent = INVALID_HANDLE_VALUE;
+     static SOCKET current_socket = -1;
      HANDLE        events[2];
      int            r;

***************
*** 121,126 ****
--- 122,136 ----
          ereport(ERROR,
                  (errmsg_internal("Failed to reset socket waiting event: %i", (int) GetLastError())));

+     /*
+      * make sure we don't multiplex this with a different socket
+      * from a previous call
+      */
+
+     if (current_socket != s && current_socket != -1)
+         WSAEventSelect(current_socket, waitevent, 0);
+
+     current_socket = s;

      if (WSAEventSelect(s, waitevent, what) == SOCKET_ERROR)
      {

Re: [PATCHES] Possible explanation for Win32 stats

From

"korryd@enterprisedb.com"

Date:

29 July 2006, 16:36:25

heh. I was just doing it the way Tom suggested - see attached. With a 
little more trouble we could also keep track if the listened for events 
and sometimes save ourselves a second call to WSAEventSelect, but I'm 
not sure it's worth it.

It all depends on the overhead of WSAEventSelect().  I'm sure your version would run faster, but I just don't know if "slower" would be measurable.

BTW: I would suggest changing your comment to:

       /*
        * make sure we don't multiplex this kernel event object with a different socket
        * from a previous call
        */

Thanks for tackling this problem too.

    -- Korry

----
Korry Douglas korryd@enterprisedb.com
EnterpriseDB http://www.enterprisedb.com

Re: [PATCHES] Possible explanation for Win32 stats regression

From

Andrew Dunstan

Date:

29 July 2006, 17:16:36

korryd@enterprisedb.com wrote:
>> heh. I was just doing it the way Tom suggested - see attached. With a
>> little more trouble we could also keep track if the listened for events
>> and sometimes save ourselves a second call to WSAEventSelect, but I'm
>> not sure it's worth it.
>>
>
> It all depends on the overhead of WSAEventSelect().  I'm sure your version would run faster, but I just don't know if
"slower"would be measurable. 
>
>
> BTW:  I would suggest changing your comment to:
>
>        /*
>         * make sure we don't multiplex this kernel event object with a
> different socket
>         * from a previous call
>         */
>
> Thanks for tackling this problem too.
>

Ok. Applied to HEAD and 8.1 and 8.0 branches.

cheers

andrew