Re: "pgstat wait timeout" just got a lot more common on Windows - Mailing list pgsql-hackers

From Tom Lane
Subject Re: "pgstat wait timeout" just got a lot more common on Windows
Msg-id 426.1336661906@sss.pgh.pa.us
In response to "pgstat wait timeout" just got a lot more common on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: "pgstat wait timeout" just got a lot more common on Windows
Re: "pgstat wait timeout" just got a lot more common on Windows
List pgsql-hackers
I wrote:
> Last night I changed the stats collector process to use
> WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> the postmaster has died.  This morning I observe that several Windows
> buildfarm members are showing regression test failures caused by
> unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

> This suggests that there is something broken in the Windows
> implementation of WaitLatchOrSocket.  I wonder whether it also
> tells us something we did not know about the underlying cause of
> those messages.  Not sure what though.  Ideas?  Can anyone who
> knows Windows take another look at WaitLatchOrSocket?

Anybody have any clues about that?  If not, I think I'll have to revert
the pgstat changes for beta1, which isn't really forward progress.

I spent some time staring at the Windows WaitLatchOrSocket code myself.
The only thing I could find that seemed wrong is that in the event
array, we list the latch's event before pgwin32_signal_event.  The
Microsoft documentation I looked at says that if more than one event
is ready, WaitForMultipleObjects reports the first such array member.
This means that if the latch is already set when control gets here,
signal handlers will not be serviced.  That doesn't match what would
happen on a Unix machine, so it seems like at least a violation of the
POLA.  Hence I think we oughta swap the order of those two array
elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
pgwin32_select.)  I do not however see a way that that would explain the
pgstat failures, because the stats collector's latch really shouldn't
ever get set during normal regression test runs.
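
To make the ordering concern concrete, here is a small portable sketch. It is not the Windows code and does not call the Win32 API: first_signaled() is a hypothetical stand-in that only models the documented "lowest ready index wins" rule of a wait-any call, and wait_reports_signal() is an illustrative helper, not anything in the PostgreSQL tree.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of a wait-any call over an event array:
 * report the lowest index whose event is ready, or -1 if none is.
 * (Assumption: this mirrors only the "first such array member" rule.)
 */
static int
first_signaled(const bool *ready, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (ready[i])
            return (int) i;
    }
    return -1;
}

/*
 * With both the latch event and the signal event ready, does the wait
 * report the signal event?  The answer depends purely on array order.
 */
static bool
wait_reports_signal(bool signal_listed_first)
{
    bool ready[2];
    int  signal_idx = signal_listed_first ? 0 : 1;
    int  latch_idx = 1 - signal_idx;

    ready[latch_idx] = true;    /* latch was already set on entry */
    ready[signal_idx] = true;   /* a signal is pending as well */

    return first_signaled(ready, 2) == signal_idx;
}
```

In this model, listing the latch first means a pending signal is never the reported event while the latch stays set; listing the signal event first lets it be seen and serviced before the latch is reported.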
        regards, tom lane

