Possible explanation for Win32 stats regression test failures - Mailing list pgsql-hackers

From Tom Lane
Subject Possible explanation for Win32 stats regression test failures
Date
Msg-id 599.1153067067@sss.pgh.pa.us
Responses Re: Possible explanation for Win32 stats regression test failures  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
The latest buildfarm report from trout,
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=trout&dt=2006-07-16%2014:36:19
shows a failure mode that we've seen recently on snake, but not for a
long time on any non-Windows machines: the stats test fails with
symptoms suggesting that the stats counters aren't getting incremented.

Dave Page spotted the reason for this during the recent code sprint.
The stats collector is dying with

FATAL:  could not read statistics message: A blocking operation was interrupted by a call to WSACancelBlockingCall.

If you look through the above-mentioned report's postmaster log, you'll
see several occurrences of this, indicating that the stats collector is
being restarted by the postmaster and then dying again.

After a bit of digging in our code, I realized that the above text is
probably the system's translation of WSAEINTR, which we equate EINTR
to, and thus that what's happening is just "recv() returned EINTR,
even though the socket had already tested read-ready".  I'm not sure
whether that's considered normal behavior on Unixen but it is clearly
possible with our Win32 implementation of recv() --- any pending signal
will make it happen.  So it seems an appropriate fix for the stats
collector is

            len = recv(pgStatSock, (char *) &msg,
                       sizeof(PgStat_Msg), 0);
            if (len < 0)
+           {
+               if (errno == EINTR)
+                   continue;
                ereport(ERROR,
                        (errcode_for_socket_access(),
                         errmsg("could not read statistics message: %m")));
+           }

and we had better look around to make sure all other calls of send()
and recv() treat EINTR as expected too.
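
For the other call sites, a minimal sketch of what "treating EINTR as expected" could look like in plain POSIX C. The helper names recv_retry_eintr and send_retry_eintr are hypothetical and are not in the PostgreSQL source; this only illustrates the retry-on-EINTR idea, not what the tree should necessarily contain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of PostgreSQL): call recv() and retry
 * for as long as it fails with EINTR.  Any other failure is returned to
 * the caller with errno left as recv() set it.
 */
static ssize_t
recv_retry_eintr(int sock, void *buf, size_t len, int flags)
{
    ssize_t     n;

    assert(buf != NULL);
    do
    {
        n = recv(sock, buf, len, flags);
    } while (n < 0 && errno == EINTR);

    return n;
}

/* Hypothetical counterpart for send(), using the same retry rule. */
static ssize_t
send_retry_eintr(int sock, const void *buf, size_t len, int flags)
{
    ssize_t     n;

    assert(buf != NULL);
    do
    {
        n = send(sock, buf, len, flags);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

An unconditional retry like this is only appropriate where the interrupting signal needs no prompt reaction from the caller. A loop that has to act on a flag set by the signal handler is probably better served by going back to the top of its main loop, as the "continue" in the patch above does.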

But ... AFAICS the only signal that could plausibly be arriving at the
stats collector is SIGALRM from its own use of setitimer() to schedule
stats file writes.  So it seems that this failure occurs when the alarm
fires between the select() and recv() calls; which is possible but it
seems a mighty narrow window.  So I'm not 100% convinced that this is
the correct explanation of the problem --- we've seen snake fail this
way repeatedly, and here we have trout doing it three times within one
regression run.  Can anyone think of a reason why the timing might fall
just so with a higher probability than one would expect?  Perhaps
pgwin32_select() has got a problem that makes it not dispatch signals
as it seems to be trying to do?
        regards, tom lane