Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram - Mailing list pgsql-bugs

From Luke Koops
Subject Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
Date
Msg-id A3144629B5AC714A8BF27806EBFA70575146229F@sottexch7.corp.ad.entrust.com
In response to Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram  (Nikhil Sontakke <nikhil.sontakke@enterprisedb.com>)
Responses Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
List pgsql-bugs
There was no firewall in place, or more correctly the Windows Firewall is
configured to be off.  There is no other firewall installed on the system.

To get to this point in the code, the return value from WSARecv() was
WSAEWOULDBLOCK.  The socket is set for overlapped IO and is a datagram
socket.  MSDN documentation says that means there are too many outstanding
overlapped IO requests.  I don't know if "too many" applies to just this
socket or to the system as a whole.  The documentation isn't clear about
how to handle the return code in this situation.

We don't need to know if this is a Kernel issue, a bug in winsock, or an
undocumented behaviour.  Regardless, it can be treated as a fault.

Knowing that it is possible for WaitForMultipleObjectsEx to lock up means
that it is not safe to call with an INFINITE timeout.  The workaround that's
being discussed is beginning to look like the one at line 172 of socket.c.
It's bad enough that there is a WSASend in pgwin32_waitforsinglesocket().
I doubt you also want to add a WSARecv.  There should be a cleaner way to
handle both of these situations.
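
To make the shape of the bounded-wait idea concrete, here is a minimal,
portable sketch.  It is not the code in socket.c and it makes no Winsock
calls: the enum values, the callback types and the fake_socket helper are
all hypothetical stand-ins for the WaitForMultipleObjectsEx/WSARecv pair.
The only point is that a wait with a finite timeout followed by another
receive attempt recovers from a lost wakeup, which an INFINITE wait cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes standing in for the real wait/recv return codes. */
typedef enum { WAIT_SIGNALED, WAIT_TIMED_OUT } wait_result;
typedef enum { RECV_OK, RECV_WOULD_BLOCK, RECV_ERROR } recv_result;

/* Callbacks, so the loop can be exercised without a real socket. */
typedef wait_result (*wait_fn)(void *ctx, int timeout_ms);
typedef recv_result (*recv_fn)(void *ctx, int *nbytes);

/*
 * Loop until a datagram arrives or a hard error occurs.  The wait is
 * always bounded, so a wakeup that never fires costs one timeout period
 * instead of hanging the caller.  Returns bytes received, or -1 on error.
 */
int recv_with_bounded_wait(wait_fn do_wait, recv_fn do_recv,
                           void *ctx, int timeout_ms)
{
    assert(timeout_ms > 0);     /* an unbounded wait is what we are avoiding */

    for (;;)
    {
        int nbytes = 0;
        recv_result r = do_recv(ctx, &nbytes);

        if (r == RECV_OK)
            return nbytes;
        if (r == RECV_ERROR)
            return -1;

        /* Would block: wait, then retry whether signaled or timed out. */
        (void) do_wait(ctx, timeout_ms);
    }
}

/* Fake socket whose readiness event is never delivered (assumption). */
typedef struct
{
    int polls_until_data;   /* receive attempts that report "would block" */
    int datagram_size;      /* bytes returned once data is available */
    int waits;              /* number of bounded waits performed */
} fake_socket;

static wait_result fake_wait(void *ctx, int timeout_ms)
{
    fake_socket *s = (fake_socket *) ctx;

    (void) timeout_ms;
    s->waits++;
    return WAIT_TIMED_OUT;  /* simulate the wakeup that never comes */
}

static recv_result fake_recv(void *ctx, int *nbytes)
{
    fake_socket *s = (fake_socket *) ctx;

    if (s->polls_until_data > 0)
    {
        s->polls_until_data--;
        return RECV_WOULD_BLOCK;
    }
    *nbytes = s->datagram_size;
    return RECV_OK;
}

/* Run the loop against the fake socket; reports how many waits it took. */
int simulate_lost_wakeups(int lost, int datagram_size, int *waits_out)
{
    fake_socket s = { lost, datagram_size, 0 };
    int n = recv_with_bounded_wait(fake_wait, fake_recv, &s, 500);

    if (waits_out != NULL)
        *waits_out = s.waits;
    return n;
}
```

With this fake, simulate_lost_wakeups(3, 42, &waits) still returns the
datagram size after three timed-out waits.  The price of such a loop is a
periodic wakeup even when the socket is idle, so the timeout value is a
trade-off between recovery latency and wasted polling.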

I am planning to eventually kill the stats collector and see if that clears
up the hanging issue, but I want to keep the system state in place for a bit
longer in case there are other diagnostic steps I should try.  I've
exhausted everything I could think of.

-Luke


-----Original Message-----
From: Nikhil Sontakke [mailto:nikhil.sontakke@enterprisedb.com]
Sent: Monday, August 03, 2009 10:38 AM
To: Magnus Hagander
Cc: Alvaro Herrera; Luke Koops; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

Hi,

>>>>>
>>>>> Maybe. I'm unsure if it's enough to just try another
>>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>>> WSARecv() on it as well. Maybe it would be enough to just change
>>>>> the INFINITE on line 318 of socket.c to a fixed value. That will
>>>>> loop down to WSARecv() which should exit with WSAEWOULDBLOCK which
>>>>> will cause us to do a short sleep and come back. But we'd have to
>>>>> change the limit of 5 somehow then, since in theory we should wait
>>>>> forever. Maybe that outer loop should just be a for(;;), what do you
>>>>> think?
>>>>>
>>>>
>>>> Yes, line 318 seems to be a much better location to me. If Windows
>>>> and its socket logic behave properly most of the time :), most
>>>> of the calls should come out within the first few tries, so
>>>> changing 5 to an infinite loop shouldn't hurt those normal use cases
>>>> in theory.
>>>>
>>>> OTOH, I was wondering what if we kill the stats collector and on a
>>>> restart the socket communication resumes properly. Would that
>>>> conclusively mean that it is a flaw in our code?
>>>
>>> No, if we kill the stats collector that will destroy all sockets,
>>> and when the new one starts all the sockets it operates on are fresh
>>> and new. So it doesn't show that the flaw is in our code - but it
>>> also doesn't show that it's in the kernel or runtime libraries.
>>>
>>
>> AFAICS in the code, the inherited pgStatSock socket FD remains the
>> same across the restart of the stats collector process...
>
> Partially correct, I think.
>
> Each backend has its own handle on win32, since we use EXEC_BACKEND
> (this includes the "utility processes" like the stats collector). When
> we start the new one, we are going to use DuplicateHandle() in
> save_backend_variables(). This will therefore get it a new handle, but
> they are both pointing to the same kernel object. I don't know if
> WaitForMultipleObjectsEx() is going to see these as two different
> objects or not, but I think it does.
>

Hmm, got it. Nothing like adding more confusion into the mix :)

Regards,
Nikhils
--
http://www.enterprisedb.com
