Thread: "pgstat wait timeout" just got a lot more common on Windows
Last night I changed the stats collector process to use WaitLatchOrSocket instead of a periodic forced wakeup to see whether the postmaster has died. This morning I observe that several Windows buildfarm members are showing regression test failures caused by unexpected "pgstat wait timeout" warnings. Everybody else is fine. This suggests that there is something broken in the Windows implementation of WaitLatchOrSocket. I wonder whether it also tells us something we did not know about the underlying cause of those messages. Not sure what though. Ideas? Can anyone who knows Windows take another look at WaitLatchOrSocket? regards, tom lane
I wrote: > Last night I changed the stats collector process to use > WaitLatchOrSocket instead of a periodic forced wakeup to see whether > the postmaster has died. This morning I observe that several Windows > buildfarm members are showing regression test failures caused by > unexpected "pgstat wait timeout" warnings. Everybody else is fine. > This suggests that there is something broken in the Windows > implementation of WaitLatchOrSocket. I wonder whether it also > tells us something we did not know about the underlying cause of > those messages. Not sure what though. Ideas? Can anyone who > knows Windows take another look at WaitLatchOrSocket? Anybody have any clues about that? If not, I think I'll have to revert the pgstat changes for beta1, which isn't really forward progress. I spent some time staring at the Windows WaitLatchOrSocket code myself. The only thing I could find that seemed wrong is that in the event array, we list the latch's event before pgwin32_signal_event. The Microsoft documentation I looked at says that if more than one event is ready, WaitforMultipleObjects reports the first such array member. This means that if the latch is already set when control gets here, signal handlers will not be serviced. That doesn't match what would happen on a Unix machine, so it seems like at least a violation of the POLA. Hence I think we oughta swap the order of those two array elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of pgwin32_select.) I do not however see a way that that would explain the pgstat failures, because the stats collector's latch really shouldn't ever get set during normal regression test runs. regards, tom lane
<p><br /> On May 10, 2012 4:59 PM, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br/> ><br /> > I wrote:<br /> > > Last night I changed the stats collector process to use<br /> > >WaitLatchOrSocket instead of a periodic forced wakeup to see whether<br /> > > the postmaster has died. This morningI observe that several Windows<br /> > > buildfarm members are showing regression test failures caused by<br/> > > unexpected "pgstat wait timeout" warnings. Everybody else is fine.<br /> ><br /> > > This suggeststhat there is something broken in the Windows<br /> > > implementation of WaitLatchOrSocket. I wonder whetherit also<br /> > > tells us something we did not know about the underlying cause of<br /> > > those messages. Not sure what though. Ideas? Can anyone who<br /> > > knows Windows take another look at WaitLatchOrSocket?<br/> ><br /> > Anybody have any clues about that? If not, I think I'll have to revert<br /> >the pgstat changes for beta1, which isn't really forward progress.<p>Haven't had time to look at the code itself, andwon't before wrap time. Sorry. <br /><p>> I spent some time staring at the Windows WaitLatchOrSocket code myself.<br/> > The only thing I could find that seemed wrong is that in the event<br /> > array, we list the latch'sevent before pgwin32_signal_event. The<br /> > Microsoft documentation I looked at says that if more than oneevent<br /> > is ready, WaitforMultipleObjects reports the first such array member.<br /> > This means that if thelatch is already set when control gets here,<br /> > signal handlers will not be serviced.<p>Yeah, that does seem wrong.<p>> That doesn't match what would<br /> > happen on a Unix machine, so it seems like at least a violation ofthe<br /> > POLA. Hence I think we oughta swap the order of those two array<br /> > elements. (Same issue in PGSemaphoreLock,btw, and I'm suspicious of<br /> > pgwin32_select.) I do not however<p>Maybe we need a loop that checksfor all events? <p>> see a way that that would explain the<br /> > pgstat failures, because the stats collector'slatch really shouldn't<br /> > ever get set during normal regression test runs.<br /><p>So could there be somethingwrong in the other end, meaning the latch *does* get set? <p>/Magnus <br />
Magnus Hagander <magnus@hagander.net> writes: > On May 10, 2012 4:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: >> I spent some time staring at the Windows WaitLatchOrSocket code myself. >> The only thing I could find that seemed wrong is that in the event >> array, we list the latch's event before pgwin32_signal_event. The >> Microsoft documentation I looked at says that if more than one event >> is ready, WaitforMultipleObjects reports the first such array member. >> This means that if the latch is already set when control gets here, >> signal handlers will not be serviced. > Yeah, that does seem wrong. >> That doesn't match what would >> happen on a Unix machine, so it seems like at least a violation of the >> POLA. Hence I think we oughta swap the order of those two array >> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of >> pgwin32_select.) I do not however > Maybe we need a loop that checks for all events? I don't think so. It's already the case that WaitLatch doesn't guarantee that all possible flags are set in its result. In connection with Peter G's observation that we could simplify the API by rechecking PostmasterIsAlive for WL_POSTMASTER_DEATH, I was planning to clarify the API spec as "result bits that are set are guaranteed to reflect reality, but it's not guaranteed that we set every bit that could possibly be set". This should not break any caller since the same result could occur given a slight change in timing anyway; the caller has to be prepared to come back and check for more conditions after it services whatever WaitLatch does report. However, signal service is not a condition the caller is supposed to deal with, so I think we want a guarantee that that happens inside WaitLatch. >> see a way that that would explain the >> pgstat failures, because the stats collector's latch really shouldn't >> ever get set during normal regression test runs. > So could there be something wrong in the other end, meaning the latch > *does* get set? Even if it did, it'd get cleared at the top of the loop, so that the next call ought to handle things. Tis a puzzlement. AFAICS the only condition WaitforMultipleObjects is going to see in these tests is read-ready on the socket; surely it wouldn't fail to notice that? regards, tom lane
I wrote: > Hence I think we oughta swap the order of those two array > elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of > pgwin32_select.) Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously* nasty bug: it does not reset ImmediateInterruptOK before returning. How is it that Windows machines aren't falling over constantly? regards, tom lane
On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Hence I think we oughta swap the order of those two array >> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of >> pgwin32_select.) > > Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously* > nasty bug: it does not reset ImmediateInterruptOK before returning. > How is it that Windows machines aren't falling over constantly? Hmm. the commit you made to fix it says it changes how ImmediateInterruptOK is handled, but there was not a single line of code that actually changed that? Or am I misreading this completely? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously* >> nasty bug: it does not reset ImmediateInterruptOK before returning. >> How is it that Windows machines aren't falling over constantly? > Hmm. the commit you made to fix it says it changes how > ImmediateInterruptOK is handled, but there was not a single line of > code that actually changed that? Or am I misreading this completely? Exit is now out the bottom of the loop, not by a raw "return;". regards, tom lane
On Fri, May 11, 2012 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously* >>> nasty bug: it does not reset ImmediateInterruptOK before returning. >>> How is it that Windows machines aren't falling over constantly? > >> Hmm. the commit you made to fix it says it changes how >> ImmediateInterruptOK is handled, but there was not a single line of >> code that actually changed that? Or am I misreading this completely? > > Exit is now out the bottom of the loop, not by a raw "return;". oh, d'uh. Sorry, missed that one completely. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/